When I first started web scraping with BeautifulSoup4, I found that the most difficult hoop to jump through was pagination. Getting the elements from a static page seemed fairly straightforward — but what if the data I wanted was not on the initial page I loaded into my script? In this project we will try our hand at pagination, using Selenium to cycle through the pages of an Amazon results page and save all of the data in a .jsonl file.
What is Selenium?
Selenium is an open-source browser automation tool, mainly used for testing web applications. It’s able to mimic user input such as mouse movements, key presses, and page navigation. There are also many methods which allow for element selection on the page. The main workhorse behind the library is the WebDriver, which makes automating browser tasks a fairly straightforward affair.
Installing the necessary packages.
For this project, we are going to need to install Selenium along with a few other packages. Note: for this project I will be using a Mac.
To install Selenium, type the following in your terminal:
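The original command isn’t shown here, but the standard pip invocation is:

```shell
pip install selenium
```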
To manage our webdriver, we will use webdriver-manager. You can use Selenium to control most popular web browsers, including Firefox, Internet Explorer, Opera, Safari, and Chrome. I will be using Chrome.
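Again via pip:

```shell
pip install webdriver-manager
```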
Later, we will also need selectorlib for downloading and parsing the html pages we navigate to:
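Installed the same way:

```shell
pip install selectorlib
```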
Setting up our environment.
Next, create a new folder on the desktop, and add some files.
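For example (the folder name is just a suggestion; the two file names are used later in the article):

```shell
# create the project folder and the two files the script will use
mkdir -p ~/Desktop/amazon_scraper
cd ~/Desktop/amazon_scraper
touch amazon_results_scraper.py   # the script we will write
touch search_results_urls.txt     # will hold one results-page URL per line
```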
You will also need to place a file named “search_results.yml” into the project directory. This file will be used later to grab the information for each product on the page via their CSS selectors. You can find the file here.
Then open a code editor and import the following in the amazon_results_scraper.py file:
Let’s create a function called search_amazon that takes the string for the item we want to search for on Amazon as an input:
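A skeleton to start from (the body is filled in step by step over the following sections):

```python
def search_amazon(item):
    # `item` is the search phrase, e.g. "wireless headphones"
    pass
```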
Using webdriver-manager we’ll install the correct version of the ChromeDriver:
Loading a page and selecting elements.
Selenium provides many methods for selecting page elements. We can select elements by ID, class name, XPath, name, tag name, link text, and CSS selector. You can also use relative locators to target page elements relative to other elements. For our purposes, we will be using ID, class name, and XPath. Let’s load the Amazon homepage. Underneath the line where you created your driver, type the following:
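In the function this is simply driver.get('https://www.amazon.com'); it is wrapped in a small helper here so the snippet stands alone:

```python
AMAZON_URL = "https://www.amazon.com"

def load_homepage(driver):
    # navigate the automated browser to the Amazon landing page
    driver.get(AMAZON_URL)
```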
Open Chrome and navigate to the Amazon homepage; we need to find the locations of the page elements we want to interact with. For our purposes, we want to:
- Input the name of the item(s) we want to search for into the search bar.
- Click the search button.
- Navigate to the results page for the item(s).
- Iterate through the resulting pages.
Right-click on the search bar and, from the dropdown menu, click Inspect. This should take you to the browser’s developer tools. Then click this icon:
Hover over the search bar, then click the search bar to locate the element in the DOM:
The search bar is an ‘input’ element with an id of “twotabsearchtextbox”. We can select this element with Selenium’s find_element_by_id() method, then send text input to it by chaining .send_keys(‘the text we want in the search box’) like so:
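Shown here as a self-contained helper taking the driver as a parameter; inside search_amazon it is a single chained call on `driver`:

```python
def enter_search_term(driver, item):
    # locate the search bar by its id and type the query into it
    search_box = driver.find_element_by_id("twotabsearchtextbox")
    search_box.send_keys(item)
```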
Next, let’s repeat the same steps we took to get the location of the search box, on the magnifying glass search button:
In order to click on items with Selenium, we first need to select the item, then chain .click() to the end of the statement:
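For example — the id of the magnifying-glass button below is an assumption that held at the time of writing; Amazon’s markup changes periodically, so verify it in the developer tools:

```python
def click_search_button(driver):
    # "nav-search-submit-button" is the id of the magnifying-glass button
    driver.find_element_by_id("nav-search-submit-button").click()
```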
After clicking search, we want to wait for the website to actually load the first page of results or else we will get errors. You could use:
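A fixed pause from the standard library:

```python
import time

# block the whole script for five seconds, whether the page needs it or not
time.sleep(5)
```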
but Selenium has a built-in method that tells the driver to poll for elements for up to a specified number of seconds before giving up:
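That method is implicitly_wait; wrapped in a helper here so the snippet stands alone:

```python
def wait_for_page(driver, seconds=3):
    # implicitly_wait makes every element lookup poll for up to `seconds`
    # seconds before raising NoSuchElementException
    driver.implicitly_wait(seconds)
```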
Now for the hard part. We want to find out how many pages of results we get, and iterate through each page. There are many elegant ways to do this, but we will use the quick and dirty solution. We are going to locate the item on the page that displays the number of results, and select it using its XPath.
As we can see, the number of result pages is displayed in the 6th list element (&lt;li&gt; tag) of the list with the class “a-pagination”. For fun, we are going to place two selections in a try/except block: one using the “a-pagination” list, and, if for whatever reason that fails, one selecting the element underneath it with the class “a-last”.
When using Selenium, a common error is the NoSuchElementException, which is thrown when Selenium cannot find an element on a page. This may happen if the element has not loaded yet, or if the position of elements on the page changes. We can catch this error and try to select something else if our first option fails by using a try/except:
Now let’s have our driver wait a few seconds:
We selected the element on the page that displays the number of result pages, and now we want to iterate through every page, collecting the current URL to a list that we will later feed to another script. Let’s take num_page, get the text from the element, cast it as an integer, and put it into a for loop:
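One way to do the iteration — parse the page count from the element’s text, then click Amazon’s Next link (the “a-last” element) until every page’s URL has been recorded. The exact navigation in the original isn’t shown, so the next-link click is an assumption:

```python
def collect_result_urls(driver, num_page):
    pages = int(num_page.text)        # e.g. "7" -> 7
    url_list = [driver.current_url]   # page 1 is already loaded
    for _ in range(pages - 1):
        # click the "Next" link, then record the URL we land on
        driver.find_element_by_class_name("a-last").click()
        url_list.append(driver.current_url)
    return url_list
```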
Then, after we get all of the result page links, tell the driver to quit:
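The call is driver.quit(); wrapped here so the snippet stands alone:

```python
def finish(driver):
    # quit() closes every window and ends the chromedriver process
    # (close() would only close the current tab)
    driver.quit()
```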
Remember the ‘search_results_urls.txt’ file we created earlier? We are going to need to open it from the function in ‘write’ mode, then place every URL from url_list into it on a new line:
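One URL per line keeps the later read-back trivial:

```python
def save_urls(url_list, path="search_results_urls.txt"):
    # write one URL per line so the scraper can read them back with readlines()
    with open(path, "w") as f:
        for url in url_list:
            f.write(url + "\n")
```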
Here is what we should have so far:
Integrate an Amazon Search Results Page Scraper into the script.
Now that we’ve written our function to search for our items and iterate through the result pages, we want to grab and save that data. To do this, we will use an Amazon search results page scraper from scrapehero-code.
The scrape function will use the URLs in our text file to download the HTML and extract relevant information such as price, name, and product URL, using the CSS selectors defined in the ‘search_results.yml’ file. Underneath your search_amazon() function, place the following:
Then call your search_amazon() function with the name of an item you want to search:
Lastly, we will place the driver code for the scrape(url) function after we call our search_amazon() function:
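A sketch of that driver code, wrapped in a main() function you call at the bottom of amazon_results_scraper.py. It assumes the functions defined above, and that the YAML file’s top-level selector is named “products” (an assumption — check your search_results.yml):

```python
import json


def main():
    search_amazon("guitar pedals")  # any query string works here

    # read each results-page URL back, scrape it, and append every
    # product record to the .jsonl output file as one JSON object per line
    with open("search_results_urls.txt", "r") as urls, \
         open("search_results_output.jsonl", "w") as outfile:
        for url in urls.readlines():
            data = scrape(url.strip())
            if data:
                for product in data["products"]:
                    product["search_url"] = url.strip()
                    outfile.write(json.dumps(product) + "\n")
```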
Voilà! After running this code your search_results_output.jsonl file will hold the information on all of the items scraped from your search.
Here is our completed script:
This script works well on broad searches, but will fail on more specific searches that return fewer than 5 pages of results. I will work to improve it in the future.
Amazon does not like automated scraping of their website, and you should always consult the robots.txt file when doing any large-scale data collection. This project was done purely for educational purposes. So if you get blocked, you’ve been warned!
You can check out the Github repository for this project here.