Scrape Web Content with Two Popular Python Packages

Information is wealth, and today there is a tremendous amount of data online. Thousands of crawlers scrape the web daily to collect information and process it. In this article we look at two popular tools for easily extracting content from web pages.

Why scrape the web?

We have a lot of data online in the form of websites. Websites include HTML content and JavaScript. We need specific parsers to parse the HTML content and extract the information into meaningful buckets, for example, the reviews of a product. A review contains the review date, the user, the product, the sentiment (whether the review is negative or positive), the intent of the user, and the rating. On the website the review is presented with design styles (CSS). We need to remove all those styles and HTML tags and get the content, in this case, plain text.

How to scrape the web?

Let's look at a simple example. We will take a simple review from TripAdvisor. Below is an image of the review.

Review Image

In the above review we have the following information:

  1. Date of the review - 10 March 2020
  2. Rating - 5 stars (all are green)
  3. User - username bornaz2018
  4. User avatar - the image above the user name
  5. Review heading - Great!
  6. Review text - Best service ever. Food was …
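Before extracting anything, it helps to have a target structure in mind. A minimal sketch in Python, using the field values from the review above (the field names are my own choice, and the avatar would hold the scraped image data):

```python
# Target structure for one scraped review; values taken from the example above.
review = {
    "date": "10 March 2020",
    "rating": 5,                     # stars, out of 5
    "user": "bornaz2018",
    "avatar": None,                  # would hold the base64-encoded image data
    "heading": "Great!",
    "text": "Best service ever. Food was ...",
}
print(review["user"])  # bornaz2018
```

Once the scraper fills a dict like this, downstream steps such as sentiment analysis can work on clean fields instead of raw HTML.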

Now we will look at the HTML content for just the username and user avatar. I am using the Chrome browser to inspect the element: right-click on the review and click “Inspect” in the menu. You will see the screen below. Move the mouse over the HTML div content and you will see the corresponding content highlighted on the left, as shown below:

HTML content

You can see the “div” element highlighted on the right and the corresponding review on the left. Now we will navigate into the ‘div’ and get specific elements.

Member section content

The above screenshot shows the ‘div’ associated with the member section. If you copy the complete element, you get its full HTML code.

We are interested in getting the user avatar and the user name, so we only need the ‘div’ sections that contain them.

Now that we have the HTML, it's time to extract the information from it.

Using Beautifulsoup

We use the Beautifulsoup Python package to scrape the web page. It is the easiest and simplest way to navigate through HTML. To install the packages, use the following pip command (create a virtual environment to keep things clean):

$ pip3 install beautifulsoup4 lxml

The lxml package is required for parsing the HTML. The code below gets the avatar image and the name of the user: the image comes back as base64-encoded data and the name as a string.
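A minimal sketch of the extraction. The HTML string here is a simplified, hypothetical stand-in for the TripAdvisor member ‘div’ (the class names on the live site may differ), with a truncated placeholder for the base64 image data:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the member section;
# the base64 payload is a truncated placeholder, not real image data.
html = """
<div class="member_info">
  <div class="ui_avatar">
    <img src="data:image/png;base64,iVBORw0KGgo=" alt="avatar"/>
  </div>
  <div class="info_text"><div>bornaz2018</div></div>
</div>
"""

soup = BeautifulSoup(html, "lxml")
member = soup.find("div", class_="member_info")

# The avatar is embedded as a base64 data URI in the img tag's src attribute.
avatar = member.find("img")["src"]
# The username is plain text inside the info_text div.
username = member.find("div", class_="info_text").get_text(strip=True)

print(username)  # bornaz2018
print(avatar[:22])
```

In a real scraper you would feed Beautifulsoup the downloaded page source instead of a literal string; the `find` calls stay the same.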

To learn more about Beautifulsoup, refer to its documentation page.

Using Splinter

We will use the Splinter Python package to scrape the web content. Splinter is mostly used for web automation testing; read my article to know more about Splinter. In the example below we use Splinter to scrape the data from the web page, again getting the image as base64-encoded data and the name as a string.
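A minimal sketch, assuming Splinter is installed (`pip3 install splinter`) and a Chrome WebDriver is available. The CSS class names and the URL are hypothetical stand-ins, and the import is guarded only so the sketch loads without a browser driver:

```python
# Sketch only: requires splinter plus a WebDriver (e.g. chromedriver) to run.
try:
    from splinter import Browser
except ImportError:          # allow the sketch to load without splinter installed
    Browser = None

def scrape_member(browser, url):
    """Visit the page and return (avatar src, username) from the member div.

    The selectors below are hypothetical; inspect the live page to find
    the actual class names, as shown in the screenshots above.
    """
    browser.visit(url)
    member = browser.find_by_css("div.member_info").first
    avatar = member.find_by_css("img").first["src"]      # base64 data URI
    username = member.find_by_css("div.info_text").text  # plain text
    return avatar, username

# Usage (needs a live browser, so it is left commented out here):
# with Browser("chrome", headless=True) as browser:
#     avatar, username = scrape_member(browser, "https://www.tripadvisor.com/...")
```

Because Splinter drives a real browser, this approach also works on pages where the review content is rendered by JavaScript, which Beautifulsoup alone cannot execute.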

Conclusion

We use Beautifulsoup and Splinter in our projects. With Splinter you can perform actions on the webpage, like clicks and scrolling, because you have a live browser instance. Beautifulsoup is fast when you already have the raw HTML downloaded; it is used with the requests package to connect to the webpage.
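The requests-plus-Beautifulsoup combination mentioned above can be sketched as a small helper (the URL in the usage comment is hypothetical):

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(url):
    """Download a page with requests, then hand the raw HTML to Beautifulsoup."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()          # fail loudly on HTTP errors
    return BeautifulSoup(resp.text, "lxml")

# Usage (needs network access, hypothetical URL):
# soup = fetch_soup("https://www.tripadvisor.com/...")
# print(soup.title.get_text())
```

This is the lightweight path when the content is present in the static HTML; reach for Splinter when the page needs JavaScript or user interaction to render.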
