Information is wealth. Today there is a tremendous amount of data online, and thousands of crawlers scrape the web daily to collect and process it. In this article we look at two popular tools for easily extracting content from web pages.
Why scrape the web?
How to scrape the web?
Let's look at a simple example. We will take a single review from TripAdvisor. Below is an image of the review.
In the above review we have the following information:
- Date of the review - 10 March 2020
- Rating - 5 stars (all are green)
- User - username bornaz2018
- User avatar - image above the username
- Review heading - Great!
- Review text - Best service ever. Food was …
Now let's look at the HTML for just the username and user avatar. I am using the Chrome browser to inspect the element: right-click on the review and choose “Inspect” from the menu. You will see the screen below. Move the mouse over the HTML div and the corresponding content is highlighted on the left, as shown below:
The “div” element is highlighted on the right and the matching review on the left. Now we will navigate into the ‘div’ and pick out specific elements.
The above screenshot shows the ‘div’ associated with the member section. If we copy the complete element, we get the following HTML code:
We are interested in the user avatar and the username, so we need the following div sections:
Now that we have the HTML, it's time to extract the information from it.
We use the Beautiful Soup Python package to scrape the web page. It is one of the easiest ways to navigate through HTML. To install the packages, use the following pip command (create a virtual environment first so the dependencies stay isolated):
$ pip3 install beautifulsoup4 lxml
The lxml package provides the HTML parser used here (Beautiful Soup can also fall back to Python's built-in html.parser). Below is the code snippet to get the avatar image and the name of the user. With this example we get the image as base64-encoded data and the name as a string.
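A minimal sketch of this extraction, assuming a simplified member fragment. The class names `member-info`, `avatar` and `username`, the helper `extract_member`, and the tiny inline GIF are all invented for illustration and are not TripAdvisor's real markup:

```python
import base64

from bs4 import BeautifulSoup

# Hypothetical stand-in for the copied member section. The class names and
# the 1x1 GIF data URI are invented for illustration only.
MEMBER_HTML = """
<div class="member-info">
  <div class="avatar">
    <img src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7">
  </div>
  <div class="username">bornaz2018</div>
</div>
"""


def extract_member(html):
    """Return the avatar `src` and the username found in `html`."""
    soup = BeautifulSoup(html, "lxml")
    img = soup.select_one("div.avatar img")
    name = soup.select_one("div.username")
    return {
        "avatar": img["src"] if img else None,
        "username": name.get_text(strip=True) if name else None,
    }


member = extract_member(MEMBER_HTML)
print(member["username"])

# A data URI has the form "data:<mime>;base64,<payload>"; decoding the
# payload gives the raw image bytes, which could be written to a file.
_, payload = member["avatar"].split(",", 1)
image_bytes = base64.b64decode(payload)
print(len(image_bytes), "bytes")
```

On a real page the selectors would have to be replaced with the class names seen in the inspector, and the `src` attribute may hold an ordinary URL rather than inline base64 data.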
To learn more about Beautiful Soup, refer to its documentation page.
Next, we will use the Splinter Python package to scrape web content. Splinter is mostly used for automated web testing. Read my article to know more about Splinter. In the below example we use Splinter to scrape the data from the web page. With this example we get the image as base64-encoded data and the name as a string.
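A rough sketch of the same extraction with Splinter, under a few assumptions: the CSS selectors are hypothetical placeholders, the helper names `scrape_member` and `main` are made up for this example, and Chrome plus a matching chromedriver are installed on the machine:

```python
def scrape_member(browser, url,
                  avatar_css="div.avatar img", name_css="div.username"):
    """Visit `url` and return the avatar `src` and the username text.

    The default selectors are hypothetical; use the ones from the inspector.
    """
    browser.visit(url)
    # Wait up to 10 seconds for the username element to appear in the page.
    if not browser.is_element_present_by_css(name_css, wait_time=10):
        raise RuntimeError("member section not found")
    avatar = browser.find_by_css(avatar_css).first
    name = browser.find_by_css(name_css).first
    # Element attributes are read with [] and the visible text with .text.
    return {"avatar": avatar["src"], "username": name.text}


def main(url):
    # Imported here so scrape_member can be exercised without a browser.
    from splinter import Browser

    # The context manager closes the browser when the block exits.
    with Browser("chrome", headless=True) as browser:
        return scrape_member(browser, url)
```

Calling `main` with the URL of the review page would return a dictionary with the two values. Whether `avatar` holds base64 data or a plain image URL depends on how the page embeds the image.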
We use both Beautiful Soup and Splinter in our projects. With Splinter you can perform actions such as clicks and scrolling on the web page, because you have a live browser instance. Beautiful Soup is fast when you already have the raw HTML downloaded. It is typically used together with the requests package, which connects to the web page and downloads the HTML.
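As a small illustration of pairing requests with Beautiful Soup, the sketch below downloads a page and parses it. The helper name `fetch_soup` and the 10-second timeout are assumptions made for this example:

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download `url` and return the parsed document."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return BeautifulSoup(response.text, "lxml")
```

The returned object can then be queried with calls such as `select_one`. Note that requests only sees the HTML the server sends, so content that the page builds later with JavaScript still needs a live browser such as Splinter.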