How to scrape Google reviews using Python
This post explains how to scrape Google reviews using Python. We use two popular Python packages widely used for scraping, BeautifulSoup and Splinter. Though the process and the code snippets can be used to scrape reviews for any business listed on Google, I am mainly interested in scraping reviews for restaurants in the US. So, let's get started.
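Before we begin, the snippets below assume the following packages and imports are in place. The exact package list and the Chrome driver setup are assumptions on my part, so adjust them to your environment:

# Assumed setup; package names may differ in your environment.
# pip install beautifulsoup4 html5lib splinter selenium dateparser
# Splinter drives Chrome through Selenium, so a matching chromedriver
# must be available on your PATH.

import time

import dateparser
from bs4 import BeautifulSoup
from splinter import Browser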
Identify businesses to scrape
The first thing we need to do is identify the businesses we want to scrape. As I mentioned earlier, I am interested in restaurants in the US. Let's stick to restaurants in New York state. Below is the URL and the image of the business, Breslin Burger, that we want to scrape.
We would like to get the following information from the above page (marked in red):
- Image of the restaurant
- Restaurant name
- Overall rating and number of reviews
- Type of the restaurant
- Location of the restaurant
Clicking on the reviews link will take you to the reviews page. We want the latest reviews, so we sort the reviews by latest and then scrape them.
Scraping business information
Google uses dynamic ids and class names in its pages, which makes scraping a bit complex. We therefore rely on attributes other than ids and classes to locate the elements. We use Splinter to load the page since it contains JavaScript, get the HTML from the Splinter Browser component, and use BeautifulSoup to scrape the content. The following code snippet scrapes the business information:
def get_business_info(url):
    business = {}
    business['base_url'] = url

    # load the page in Chrome and give the JavaScript time to render
    browser = Browser("chrome", headless=False)
    browser.visit(url)
    time.sleep(5)

    soup = BeautifulSoup(browser.html, "html5lib")

    # restaurant image
    img_div = soup.select_one('div[class*="section-hero-header-image-hero-container"]')
    business['img'] = img_div.find("img").get("src")

    # address, name, overall rating and review count
    addr_button = soup.select_one('button[data-item-id="address"]')
    business['address'] = addr_button['aria-label'].replace("Address: ", "").strip()
    business['name'] = soup.select_one('h1[class*="header-title-title"]').text.strip()
    rating = soup.find("ol", {"class": "section-star-array"})['aria-label'].replace("stars", "").strip()
    business['rating'] = float(rating)
    total_reviews = soup.select_one('button:-soup-contains("reviews")').text
    business['review_count'] = int(total_reviews.replace("reviews", "").strip().replace(",", ""))

    # restaurant type, e.g. "Hamburger restaurant"
    res_type = soup.select_one('div[class="gm2-body-2"]').text
    business['type'] = res_type.replace("·", "").strip()
    business['source'] = "Google"

    # click through to the reviews page and remember its URL
    browser.find_by_text(total_reviews).first.click()
    time.sleep(3)
    business['reviews_url'] = browser.url

    browser.quit()
    return business
I have used the select_one method from BeautifulSoup, which finds only the first element that matches the selector. The beauty of this method is that we can match on partial attribute values as well, and it supports pseudo-classes. I have used :-soup-contains in the above code to get the button element that has reviews in its text. Refer to this link to know more about pseudo-classes.
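As a quick illustration, here is how a partial class match and :-soup-contains behave. The HTML below is made up for the example and is not Google's actual markup:

# Hypothetical markup, only to illustrate the selectors used above.
sample_html = """
<div class="section-hero-header-image-hero-container-xyz123">
  <img src="photo.jpg">
</div>
<button>1,125 reviews</button>
"""
sample = BeautifulSoup(sample_html, "html5lib")

# class*= matches any element whose class attribute contains the substring
img_div = sample.select_one('div[class*="hero-container"]')

# :-soup-contains matches elements whose text contains the given string
reviews_button = sample.select_one('button:-soup-contains("reviews")')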
The output of the above method is a Python dictionary:
{
    "base_url": "https://www.google.com/maps/place/Breslin+Burger/@40.745568,-73.990283,17z/data=!3m1!4b1!4m5!3m4!1s0x89c259a61b874f57:0xc5fd887f907ea722!8m2!3d40.745568!4d-73.988089",
    "img": "https://lh5.googleusercontent.com/p/AF1QipOQV1HUtfhIRxtict4owToQkQpQHbCJZ_KsW83e=w491-h240-k-no",
    "address": "16 W 29th St, New York, NY 10001, United States",
    "name": "Breslin Burger",
    "rating": 4.3,
    "review_count": 1125,
    "type": "Hamburger restaurant",
    "source": "Google",
    "reviews_url": "https://www.google.com/maps/place/Breslin+Burger/@40.745568,-73.9902777,17z/data=!4m7!3m6!1s0x89c259a61b874f57:0xc5fd887f907ea722!8m2!3d40.745568!4d-73.988089!9m1!1b1"
}
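To try it out, call the function with the Maps URL of the listing. This is a minimal sketch, reusing the URL from the screenshot above:

import json

url = ("https://www.google.com/maps/place/Breslin+Burger/"
       "@40.745568,-73.990283,17z/data=!3m1!4b1!4m5!3m4!1s0x89c259a61b874f57:"
       "0xc5fd887f907ea722!8m2!3d40.745568!4d-73.988089")

business = get_business_info(url)
print(json.dumps(business, indent=2))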
Scraping the reviews page
We have to sort the reviews by latest. We load the page in the Splinter Browser component and perform the click actions to sort the reviews. Once the page loads with the reviews sorted by latest, we scroll the page until we get the desired number of reviews. Google loads reviews dynamically, so we have to scroll down to load the next set of reviews. I just need the 300 latest reviews; of course, the count is a parameter to the scraping function.
The following code snippet will scroll the page until we get the desired number of reviews:
def get_html(url, count):
    browser = Browser("chrome", headless=False)
    browser.visit(url)
    time.sleep(2)

    # open the "Sort" menu and select the newest-first option (second menu item)
    browser.find_by_text("Sort").first.click()
    time.sleep(2)
    new_menu_item = browser.find_by_id("action-menu").find_by_tag("ul").find_by_tag("li")[1]
    new_menu_item.click()
    time.sleep(7)

    # keep scrolling the reviews pane until enough reviews are loaded;
    # this assumes the listing actually has at least `count` reviews
    rlen = get_review_count(browser.html)
    while rlen < count:
        browser.execute_script('document.querySelector("div.section-layout.section-scrollbox").scrollTop = document.querySelector("div.section-layout.section-scrollbox").scrollHeight')
        time.sleep(2)
        rlen = get_review_count(browser.html)

    html = browser.html
    browser.quit()
    return html

def get_review_count(html):
    soup = BeautifulSoup(html, "html5lib")
    reviews = soup.find_all('div', {'data-review-id': True, 'aria-label': True})
    return len(reviews)
The following code snippet scrapes the loaded reviews:
def get_reviews(html):
    soup = BeautifulSoup(html, "html5lib")
    reviews = soup.find_all('div', {'data-review-id': True, 'aria-label': True})
    for r in reviews:
        # the reviewer's name is in the aria-label of the review container
        user = r['aria-label'].encode('ascii', 'ignore').decode('UTF-8')
        review_id = r['data-review-id']
        content_div = r.find("div", {'data-review-id': review_id})

        # star rating, e.g. "4 stars"
        stars = content_div.find("span", {'role': 'img'})['aria-label'].strip()
        rating = int(stars.split(' ')[0])

        date = content_div.select_one('span[class*="-date"]').text.strip()
        text = content_div.select_one('span[class*="-text"]').text.strip()

        # skip reviews without any text
        if len(text) == 0:
            continue

        # keep only the translated part of translated reviews
        if "(Translated by Google)" in text:
            text = text.replace("(Translated by Google) ", "")
        if "(Original)" in text:
            idx = text.index("(Original)")
            text = text[0:idx]

        yield {
            "rating": rating,
            "id": review_id,
            "user": user,
            "date": dateparser.parse(date).strftime("%d-%m-%Y"),
            "review": text.replace("\n", '').encode('ascii', 'ignore').decode('UTF-8')
        }
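Putting the pieces together, a small driver script could look like the sketch below. The helper name scrape_business and the output file name are just examples, not part of the scraper above:

import json

def scrape_business(url, review_count=300):
    # business details plus the reviews page URL
    business = get_business_info(url)
    # load the reviews page, sorted by newest, and scroll until enough reviews are present
    html = get_html(business['reviews_url'], review_count)
    # parse the loaded reviews out of the final HTML
    business['reviews'] = list(get_reviews(html))
    return business

if __name__ == "__main__":
    maps_url = ("https://www.google.com/maps/place/Breslin+Burger/"
                "@40.745568,-73.990283,17z/data=!3m1!4b1!4m5!3m4!1s0x89c259a61b874f57:"
                "0xc5fd887f907ea722!8m2!3d40.745568!4d-73.988089")
    data = scrape_business(maps_url, review_count=300)
    with open("breslin_burger_reviews.json", "w") as f:
        json.dump(data, f, indent=2)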
Conclusion
We have learnt how to use the BeautifulSoup and Splinter Python packages to scrape Google reviews. The complete scraper code is available in my Github repo. In future articles, we will see how to analyze these reviews to extract some important insights about the business. Thanks for reading.