We’re going to use scrape review data from Tripadvisor, which would be helpful if you wanted to do a NLP analysis of reviews as a way to keep your finger on the pulse of customer satisfaction.
Selenium is a tool that lets you control your desktop browser programmatically. Tripadvisor requires a “read more” button to be clicked to reveal the entire review, and Selenium is a good tool to easily get this done.
We’ll assume you’re using Anaconda for environment management. First, activate an environment with
conda activate [environment]
Install Selenium with:
conda install -c conda-forge selenium
The first thing we’ll do is import the required packages. There aren’t many:
import csv #This package lets us save data to a csv file
from selenium import webdriver #The Selenium package we'll need
import time #This package lets us pause execution for a bit
We’ll need to set up some variables.
First, define a path and file to store data. Here I’m just putting it on my Mac’s desktop
path_to_file = "/Users/chris/Desktop/HotelReviews.csv"
Next, define the number of review pages to scrape. There might be hundreds of pages of reviews, depending on the property. For this demonstration we’ll keep it easy and just grab three pages.
Next, define the Tripadvisor URL. We’ll use the Hilton Waikiki resort page, because it’s a super nice hotel!
url = "https://www.tripadvisor.com/Hotel_Review-g60982-d209422-Reviews-Hilton_Waikiki_Beach-Honolulu_Oahu_Hawaii.html"
Get Selenium on the right page and set up the CSV
Let’s launch Selenium and have it point the browser to the URL we’re using. This example uses the Safari browser, but you can remote control Chrome (or any other browser) if you prefer.
# import the webdriver
driver = webdriver.Safari()
We’ll also get a CSV set up to contain the data we retrieve
# open the file to save the review
csvFile = open(path_to_file, 'a', encoding="utf-8")
csvWriter = csv.writer(csvFile)
Programatically grab the reviews, one after the other
Now it’s time to tell Selenium to grab the reviews. Because Tripadvisor reviews are truncated, we’ll be sure and click the “read more” button for all reviews first.
# change the value inside the range to save the number of reviews we're going to grab
for i in range(0, pages_to_scrape):
# give the DOM time to load
# Click the "expand review" link to reveal the entire review.
# Now we'll ask Selenium to look for elements in the page and save them to a variable. First lets define a container that will hold all the reviews on the page. In a moment we'll parse these and save them:
container = driver.find_elements_by_xpath("//div[@data-reviewid]")
# Next we'll grab the date of the review:
dates = driver.find_elements_by_xpath(".//div[@class='_2fxQ4TOx']")
# Now we'll look at the reviews in the container and parse them out
for j in range(len(container)): # A loop defined by the number of reviews
# Grab the rating
rating = container[j].find_element_by_xpath(".//span[contains(@class, 'ui_bubble_rating bubble_')]").get_attribute("class").split("_")
# Grab the title
title = container[j].find_element_by_xpath(".//div[contains(@data-test-target, 'review-title')]").text
#Grab the review
review = container[j].find_element_by_xpath(".//q[@class='IRsGHoPm']").text.replace("\n", " ")
#Grab the data
date = " ".join(dates[j].text.split(" ")[-2:])
#Save that data in the csv and then continue to process the next review
csvWriter.writerow([date, rating, title, review])
# When all the reviews in the container have been processed, change the page and repeat
driver.find_element_by_xpath('.//a[@class="ui_button nav next primary "]').click()
# When all pages have been processed, quit the driver
That’s it! A very simple program. You’ll now have a csv that contains all the reviews you retreived, looking something like this: