Web scraping using XPath and Visual Basic

More and more web services are popping up. Getting the weather in Paris or the price of a book on Amazon is only a REST, SOAP or XML-RPC call away. Unfortunately, most of the web content being delivered is still intended to be rendered in a browser and the providers of that content haven’t included a nice web service API for automated extraction.

Web scraping is a technique for extracting information from websites that don’t provide web services. One common way to find the required info on a webpage is to use regular expressions. A much better approach however, is to make use of XPath. XPath is a XML query language that can be used to locate a specific part of an XML document. One approach taken by some web scrapers is to parse an HTML document into an XML document and then use XPath to query the document. The problem with this approach is that most HTML out there won’t parse into XML too well. I will explain how to circumvent this problem below.

Let’s consider the following scenario. We want to automatically get a certain stock quote from the web, in this case the stock quote for AEGON, which happens to be my current employer. This data is published on the following website: http://www.marketwatch.com/quotes/aeg. We want to put the stock price in a cell in an Excel worksheet, and we’ll use VBA and the Microsoft Internet Control to navigate to the mentioned webpage and scrape the needed data.

First, we start by firing up Firefox and navigate to the webpage. Make sure you have the Firebug add-on installed and select the required data. Right-click and choose “Inspect element”.

Then hover over the required content in the Firebug pane, right click and select “Copy XPath”.

The XPath is now copied to the clipboard and looks like this: /html/body/form/div/div/div[7]/div[3]/div[3]/div/h2/span[2]/span.

Now we open a new Excel document, place a Button on the first worksheet and write the following code in it’s _Click() event.

Make sure you set references to the Microsoft Internet Controls and the Microsoft HTML Object Library. The XPath we got using Firefox/Firebug is used as an argument to the function getXPathElement. I wrote this getXPathElement function to be able to address locations in a DOM document as one would normally do in an XML document.

This function doesn’t support the full XPath syntax, but it serves our purposes. Clicking the button opens an Internet Explorer window and get’s the stock price value in cell A1. Download the full project here.