Web Scraping 101 (Using Selenium for Java)

Gal Abramovitz
6 min read · Aug 13, 2018

Web scraping is one of the most useful skills in today’s digital world.
Basically, it takes web browsing to the next level by automating everyday actions: opening URLs, reading text and data, clicking links, and so on.

Even though web scraping is relatively easy to learn and execute, it’s a powerful tool: you can use it to collect data from websites and then manipulate, analyze and store it to your liking, or to automate entire workflows in web-based platforms.

In this tutorial I’ll show you, step by step:

  1. How to set up Selenium in IntelliJ IDEA.
  2. How to build a basic web scraper that can read data from a webpage.

First things first, we’ll set up Selenium in the IntelliJ environment:


  1. I found it easiest to use the Selenium Standalone Server. Download the JAR file from the given link.
  2. Open IntelliJ IDEA and create a new project.
  3. Right-click one of your project’s directories and click Open Module Settings.
  4. In the Modules section, click the Dependencies tab, then click the “+” button and choose JARs or directories…
  5. Choose the JAR file you downloaded in stage 1, then click OK.
  6. The External Libraries section should now contain the JAR file.
  7. Download Geckodriver (choose the file according to your operating system) and put the executable file in the project’s folder.
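Depending on your system configuration, Selenium may not find Geckodriver on its own. A quick way to verify your setup is a tiny smoke test like the one below (the class name and the executable path are my own placeholders — adjust the path to wherever you put the file in stage 7; the webdriver.gecko.driver system property is Selenium’s standard way of locating the executable):

import org.openqa.selenium.firefox.FirefoxDriver;

public class SetupCheck {
    public static void main(String[] args) {
        // Point Selenium at the Geckodriver executable from stage 7.
        // (The path is an assumption - change it to match your project layout.)
        System.setProperty("webdriver.gecko.driver", "./geckodriver");

        FirefoxDriver driver = new FirefoxDriver(); // a blank Firefox window should open...
        driver.quit();                              // ...and close again
    }
}

If a Firefox window briefly opens and closes, everything is wired up correctly.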

That’s it - you’re all set!
You can now move on to the second part of the tutorial.


How to build a basic web scraper

Now that we’ve set up our working environment, let’s use Selenium to build a basic web scraper.
We’ll use the HTML structure of a webpage to read specific data (we’ll actually take advantage of the CSS as well, but the principles are the same).

Preparing for web scraping

Before we actually start to scrape, we need to understand what we’re looking for. This part will be much easier for you if you can read HTML fluently.

In this tutorial we’ll scrape my website and extract the list of languages I use. It might seem like a tough task at first, but a glance at the paragraph’s HTML code reveals a surprisingly convenient structure:

<div class="line">
In my work and personal projects I use
<p>Java</p>,
<p>JavaScript</p>,
<p>jQuery</p>,
<p>HTML</p> and
<p>CSS</p>. My university projects also use
<p>C</p>,
<p>C++</p>,
<p>Assembly</p>,
<p>SQL</p> and
<p>Python</p>.<br>
As I'm highly aware to the elegance of my products, I keep learning and pushing myself towards cleaner code and more beautiful UI.
</div>

You can tell immediately that each language name is wrapped in a separate <p> tag. We’ll use this fact later in our scraper’s code.

I obviously already know my own website, but unravelling each page’s code is the first challenge you’ll need to solve. In most cases there is an obvious logic to the HTML structure that we can take advantage of.
Understand the HTML logic, and your life will get much easier.

The next step would be to plan the scraping workflow. I suggest you go through your workflow manually a couple of times before writing your web scraper.

In this basic example, this is the workflow we’re going to implement:

  1. Open a new browser window.
  2. Navigate to http://www.galabra.co.il.
  3. Click the menu option “ABOUT”.
  4. Copy the language names, according to the logic we’ve described.
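As a preview, here’s roughly how those four steps translate into Selenium calls (the XPath variables are placeholders we’ll fill in shortly; each call is explained in detail in the next section):

FirefoxDriver driver = new FirefoxDriver();                             // 1. open a browser window
driver.navigate().to("http://www.galabra.co.il");                       // 2. navigate to the site
driver.findElement(By.xpath(aboutButtonXpath)).click();                 // 3. click "ABOUT"
List<WebElement> names = driver.findElements(By.xpath(languagesXpath)); // 4. read the language names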

Selenium Building Blocks

I like to think of a (basic) web scraper as a synchronous program that mainly uses two object instances:

  • A WebDriver instance, which is our code’s connection to the browser window: it allows us to read the page’s code and simulate user actions.
    Each WebDriver instance corresponds to an individual browser window.
    (In this tutorial I’ll use a FirefoxDriver, simply because I usually browse with Chrome and I like to have a separate browser for my web scraping applications.)
  • A WebDriverWait instance, which allows us to synchronize our actions (e.g. click a button only after it has loaded and is clickable).

It might seem a bit abstract, but don’t worry - in a moment we’ll use both of them in context.
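One housekeeping note before the code: the snippets below rely on the following imports (they match the Selenium 3 API that was current when this tutorial was written):

import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;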

Finally - let’s write some code!

(The full code is provided in this repository)

  1. We’ll start by initializing a FirefoxDriver and a WebDriverWait:
    FirefoxDriver driver = new FirefoxDriver();
    WebDriverWait wait = new WebDriverWait(driver, 30);
    Note that WebDriverWait remains the same regardless of the browser you choose, and that it’s connected to the FirefoxDriver instance. The second argument in its constructor is the maximum time, in seconds, that our program should wait for an action (e.g. a page load) before giving up.
  2. After initialization, we have one new Firefox window that’s controlled by our program. Now let’s navigate to a specific webpage:
    driver.navigate().to("http://www.galabra.co.il");
  3. Now we want to click an element. We’ll need to specify a unique address that points to this exact element. There are a few locator strategies we can use, but I find XPath to be the most convenient one.
    To find the button’s XPath I’m going to use Chrome’s Developer Tools: right-click the element, then click Inspect.
    The page’s HTML code will open with the inspected element highlighted. Right-click it, click Copy and then Copy XPath.
    Now you have the element’s XPath, which can be used as its address! Let’s save it as a String:
    String aboutButtonXpath = "//*[@id=\"about\"]/div/a";
  4. The code executes regardless of the browser’s state, so we need to make sure the button element has loaded before we try to click it. As mentioned, we can use our wait instance for exactly that (one of the reasons I love Selenium is that the syntax is self-explanatory):
    wait.until(ExpectedConditions.elementToBeClickable(By.xpath(aboutButtonXpath)));
  5. Our WebDriverWait instance blocks our thread until the button element is clickable, so right after it returns we can perform the click:
    driver.findElement(By.xpath(aboutButtonXpath)).click();
    Note how click() is called via driver, not via wait.
  6. Using the same method, we make sure the relevant paragraph is loaded before trying to copy its data:
    String languagesParagraphXpath = "//*[@id=\"page1\"]/div[2]/div[5]";
    wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath(languagesParagraphXpath)));
  7. Now we can use the logic we defined in the workflow. Using the driver.findElements() method, we’ll retrieve a list of WebElements, each representing a <p> tag that contains a language name:
    List<WebElement> languageNamesList = driver.findElements(By.xpath("//*[@id=\"page1\"]/div[2]/div[5]/p"));
    Note that the last level in the XPath hierarchy doesn’t have an index. That’s because this XPath points to all of the elements that match this hierarchy of HTML tags.
    Another detail worth mentioning is that the method used here is driver.findElements (in plural form), which reflects that the expected return value is a list.
  8. We’re finished scraping, so we can close the browser window:
    driver.close();
  9. Lastly, I chose to simply print the language names. However, you can do anything you want with them, as they’re now accessible from within your code! A complete, assembled version of the program appears below.
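Putting it all together, here’s a minimal sketch of the complete scraper. The class name, the try/finally structure and the final quit() are my additions, not the author’s (the full original version lives in the repository linked above). Note that the text is read before the browser closes, since WebElements can’t be queried once the session ends:

import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class LanguageScraper {
    public static void main(String[] args) {
        // 1. Open a new browser window and a 30-second wait helper tied to it.
        FirefoxDriver driver = new FirefoxDriver();
        WebDriverWait wait = new WebDriverWait(driver, 30);

        try {
            // 2. Navigate to the homepage.
            driver.navigate().to("http://www.galabra.co.il");

            // 3. Click the "ABOUT" menu option once it's clickable.
            String aboutButtonXpath = "//*[@id=\"about\"]/div/a";
            wait.until(ExpectedConditions.elementToBeClickable(By.xpath(aboutButtonXpath)));
            driver.findElement(By.xpath(aboutButtonXpath)).click();

            // 4. Wait for the languages paragraph, then collect every <p> inside it.
            String languagesParagraphXpath = "//*[@id=\"page1\"]/div[2]/div[5]";
            wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath(languagesParagraphXpath)));
            List<WebElement> languageNamesList =
                    driver.findElements(By.xpath(languagesParagraphXpath + "/p"));

            // Print each language name while the session is still open.
            for (WebElement language : languageNamesList) {
                System.out.println(language.getText());
            }
        } finally {
            // Shut the browser down even if a step above failed.
            driver.quit();
        }
    }
}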

That’s it. I hope I was able to give you a basic understanding of Selenium for Java, its applications and its implementation.
Please don’t hesitate to contact me with any questions.

Thank you for your time!
