Friday, May 20, 2016

Web Scraping for Beginners

With the eCommerce boom, I have become a fan of price comparison apps in recent years. Each purchase I make online (or even offline) is the result of a thorough investigation across sites offering the product.

Some of the apps I use include RedLaser, ShopSavvy and BuyHatke, which have been doing great work in increasing transparency and saving the time of consumers.

Have you ever wondered how these apps get that important data? In most cases, the process employed by the apps is web scraping.

Web Scraping Defined

Web scraping is the process of extracting data from the web. With the right tools, anything that's visible to you can be extracted. In this post, we'll focus on writing programs that automate this process and help you gather huge amounts of data in a relatively short time. Apart from the example I've already given, scraping has a lot of uses like SEO tracking, job tracking, news analysis, and (my favorite) sentiment analysis on social media!

A Note of Caution

Before you go on a web scraping adventure, make sure you're aware of the legal issues involved. Many websites specifically prohibit scraping in their terms of service. For example, to quote Medium, "Crawling the Services is allowed if done in accordance with the provisions of our robots.txt file, but scraping the Services is prohibited." Scraping sites that do not allow it might actually get you blacklisted from them! Just like any other tool, web scraping can be misused, for example to copy the content of other sites. Scraping has led to many lawsuits too.
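Since robots.txt comes up in terms like Medium's, here is a minimal sketch of how a script might check such rules before requesting a page, using Python's standard urllib.robotparser module. The robots.txt lines, the "my-scraper" user agent and the URLs are made up for illustration, and passing this check does not replace reading a site's terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real script would download it from the
# site (for example with set_url() and read()) instead of hard-coding it.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

rules = RobotFileParser()
rules.parse(robots_lines)

# "my-scraper" is an assumed user agent name, and the URLs are placeholders.
can_fetch_home = rules.can_fetch("my-scraper", "http://my_website.com/")
can_fetch_private = rules.can_fetch("my-scraper", "http://my_website.com/private/page")

print(can_fetch_home, can_fetch_private)
```

A scraper could skip any URL for which can_fetch() returns False.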

Setting Up the Code

Now that you know that we must tread carefully, let's get into scraping. Scraping can be done in any programming language, and we covered it for Node some time back. In this post, we're going to use Python for the simplicity of the language and the availability of packages that make the process easy.

What's the underlying process?

When you're accessing a site on the Internet, you're essentially downloading HTML code, which is analyzed and displayed by your web browser. This HTML code contains all the information that's visible to you. Therefore, the required information (like the price) can be obtained by analyzing this HTML code. You can use regular expressions to search for your needle in the haystack, or use a library to parse the HTML and get the required data.
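To make the two approaches concrete, here is a rough sketch that pulls a price out of a small, made-up HTML snippet, first with a regular expression and then with the HTMLParser class from Python's standard library. The markup, the "price" class name and the PriceParser helper are assumptions for illustration only; real pages are usually messier.

```python
import re
from html.parser import HTMLParser

# Hypothetical HTML standing in for a downloaded product page.
html = '<div class="product"><h1>Widget</h1><span class="price">$19.99</span></div>'

# Approach 1: a regular expression that captures the text inside the price span.
match = re.search(r'<span class="price">([^<]+)</span>', html)
regex_price = match.group(1) if match else None


# Approach 2: a small parser that remembers when it is inside the price span.
class PriceParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        # Keep the first piece of text found inside the price span
        if self.in_price and self.price is None:
            self.price = data.strip()


parser = PriceParser()
parser.feed(html)

print(regex_price, parser.price)
```

The regular expression is shorter, but it breaks as soon as the attribute order or spacing changes, which is one reason a parsing library tends to be the sturdier choice.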

In Python, we're going to use a module called Beautiful Soup to analyze this HTML data. You can install the module through an installer like pip by running the following command:

pip install beautifulsoup4

Alternatively, you can build it from source. The installation steps are listed on the module's documentation page.

Once that's installed, we'll broadly follow these steps:

  • send a request to the URL
  • receive the response
  • analyze the response to find the required data.

For demonstration purposes, we'll use my blog http://ift.tt/1rUD8BF.

The first two steps are fairly simple, and can be accomplished as follows:

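from urllib.request import urlopen  # Python 3; in Python 2 this was "from urllib import urlopen"

# Send the HTTP request and read the response body (the raw HTML, as bytes)
webpage = urlopen('http://my_website.com/').read()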

Next, we need to provide the response to Beautiful Soup:

from bs4 import BeautifulSoup
#making the soup! yummy ;)
soup = BeautifulSoup(webpage, "html5lib")

Notice that we used html5lib as our parser. It is a separate package, so it has to be installed too (for example with pip install html5lib). You may use a different parser with BeautifulSoup, as mentioned in its documentation.
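As a minimal sketch of the third step (analyzing the response), the snippet below builds a soup from a small, made-up page and reads its title and links. The HTML is a stand-in for the downloaded webpage, and it uses Python's built-in "html.parser" rather than html5lib so that nothing beyond beautifulsoup4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the page downloaded with urlopen().
webpage = """
<html><head><title>My Blog</title></head>
<body>
<a href="/post-1">First post</a>
<a href="/post-2">Second post</a>
</body></html>
"""

# "html.parser" ships with Python; html5lib or lxml could be swapped in here.
soup = BeautifulSoup(webpage, "html.parser")

# Text of the <title> tag
title = soup.title.string

# The href attribute of every <a> tag on the page
links = [a["href"] for a in soup.find_all("a")]

print(title)
print(links)
```

find_all() returns every matching tag, so the same pattern works for other elements you want to collect from a page.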



by Shaumik Daityari via SitePoint
