Websites often contain data you need to analyze, but they usually present it as HTML, which can be difficult to work with. Manually copying and pasting into spreadsheets might work if the data set is small, but the same technique becomes frustrating and time-consuming for larger data sets.
The preferred method of extracting information from a website is to use an API. Most large websites provide access to their information through APIs, but many other websites do not. This is where scraping comes in.
Web scraping is an automated technique used to crawl websites and extract content from them. But before discussing the technical aspects, I need to mention that any scraping must comply with the website’s terms and conditions and with the laws governing use of its data.
I chose Python for this tutorial because of its ease of use and rich ecosystem. There are many libraries that can be used for scraping purposes, but I will use “BeautifulSoup” as a parser and “urllib” as a URL fetcher to walk you through the easiest way to implement a web scraper.
Inspecting the Page
Building a scraper is not a one-time task. It is an adaptive process, because the scraper has to keep up with changes to the website's layout and structure.
I chose a GPI blog as a source of information for my scraper. As you inspect the HTML code shown in the screenshot below, it turns out that the div that holds all the blog information is <div id="article"> and each blog item is under <dl class="blogItem">.
Let’s start by importing the libraries we are going to use for this task.
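As a minimal sketch of this step, the imports might look like the block below, together with two small helpers. The helper names fetch_html and make_soup, the User-Agent string, and the 10-second timeout are my own assumptions for illustration and are not taken from the original script.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup  # installed with: pip install beautifulsoup4


def fetch_html(url, timeout=10):
    """Download a page and return its body as text (hypothetical helper)."""
    # The User-Agent value is an illustrative assumption, not a required one.
    request = Request(url, headers={"User-Agent": "tutorial-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def make_soup(html):
    """Parse HTML text with Python's built-in parser."""
    return BeautifulSoup(html, "html.parser")
```

Calling make_soup(fetch_html(url)) with the address of a blog page would then give a parsed document to search. The "html.parser" backend ships with Python, so nothing beyond BeautifulSoup itself has to be installed.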
Since the GPI blog has many pages, it’s better to prompt for the number of pages to scrape and keep that number small. This is a good option if you don’t want to be too aggressive with any website, which could get you banned as a spammer.
Once we have the number of blog pages we need to scrape, content retrieval is simple. The scraper retrieves each post’s URL, title and publish date and prints them to your console. In a real scenario, you would want the data in a well-structured, tabular format, and a Pandas DataFrame is a likely choice.
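One way to handle the prompt is sketched below. The function names, the default cap of 5 pages, and the fallback to a single page on invalid input are assumptions I made for the example. They are not values from the original script.

```python
def parse_page_count(raw, max_pages=5):
    """Turn user input into a page count between 1 and max_pages."""
    try:
        pages = int(raw.strip())
    except ValueError:
        # Anything that is not a whole number falls back to a single page.
        return 1
    # Clamp the request so the scraper never asks for too many pages.
    return max(1, min(pages, max_pages))


def ask_page_count(max_pages=5):
    """Prompt on the console and return a safe page count."""
    answer = input(f"How many pages to scrape (1-{max_pages})? ")
    return parse_page_count(answer, max_pages)
```

Keeping the validation in parse_page_count, separate from the input() call, makes it easy to check the limits without typing anything. Adding a short pause between page requests, for example with time.sleep, is another simple way to stay polite.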
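The retrieval step can be sketched against a small inline HTML sample so that it runs without touching the network. Only the <div id="article"> container and the <dl class="blogItem"> items come from the page inspection above. The inner markup (a link inside <dt> and a <dd class="date"> element), the sample posts, and the function name extract_posts are assumptions made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the markup inside each blogItem is an assumption.
SAMPLE_HTML = """
<div id="article">
  <dl class="blogItem">
    <dt><a href="/blog/first-post">First post</a></dt>
    <dd class="date">2018-01-15</dd>
  </dl>
  <dl class="blogItem">
    <dt><a href="/blog/second-post">Second post</a></dt>
    <dd class="date">2018-02-01</dd>
  </dl>
</div>
"""


def extract_posts(html):
    """Return a list of dicts with the url, title and date of each blog item."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="article")
    if container is None:
        # The layout changed or the page is not a blog listing.
        return []
    posts = []
    for item in container.find_all("dl", class_="blogItem"):
        link = item.find("a")
        date = item.find("dd", class_="date")
        if link is None:
            continue  # skip items that do not contain a link
        posts.append({
            "url": link.get("href", ""),
            "title": link.get_text(strip=True),
            "date": date.get_text(strip=True) if date else "",
        })
    return posts


for post in extract_posts(SAMPLE_HTML):
    print(post["url"], post["title"], post["date"])
```

Returning an empty list when the container is missing keeps the scraper from crashing when the site's layout changes.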
A DataFrame is an object that stores data in a tabular format, which facilitates data analysis.
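As a rough illustration, scraped records can be loaded into a DataFrame and written out as CSV. The two records below are made-up sample data, and the column names and the choice of CSV as the output format are assumptions on my part.

```python
import pandas as pd

# Made-up records standing in for the scraped blog items.
records = [
    {"url": "/blog/first-post", "title": "First post", "date": "2018-01-15"},
    {"url": "/blog/second-post", "title": "Second post", "date": "2018-02-01"},
]

# Build the table with a fixed column order.
df = pd.DataFrame(records, columns=["url", "title", "date"])

# Convert the date strings to real timestamps so they sort correctly.
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date", ascending=False).reset_index(drop=True)

# With no path argument, to_csv returns the CSV as a string.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Passing a file name such as "blog_posts.csv" (a hypothetical name) to to_csv would write the same content to disk instead.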
Below is how the final code looks:
Here is a preview of the output file:
Building a web scraper in Python is relatively easy and can be accomplished in a few lines of code. Web scraping in general is a fragile approach, though: the scraper depends on the page’s markup and can break whenever that markup changes. It works reliably on well-structured web pages whose HTML tags carry static, descriptive attributes. APIs (if provided by a website) are the preferred approach since they are less likely to break.