Web scraping using XPath and Visual Basic

More and more web services are popping up. Getting the weather in Paris or the price of a book on Amazon is only a REST, SOAP or XML-RPC call away. Unfortunately, most of the web content being delivered is still intended to be rendered in a browser and the providers of that content haven’t included a nice web service API for automated extraction.

Web scraping is a technique for extracting information from websites that don’t provide web services. One common way to find the required info on a webpage is to use regular expressions. A much better approach however, is to make use of XPath. XPath is a XML query language that can be used to locate a specific part of an XML document. One approach taken by some web scrapers is to parse an HTML document into an XML document and then use XPath to query the document. The problem with this approach is that most HTML out there won’t parse into XML too well. I will explain how to circumvent this problem below.

Let’s consider the following scenario. We want to automatically get a certain stock quote from the web, in this case the stock quote for AEGON, which happens to be my current employer. This data is published on the following website: http://www.marketwatch.com/quotes/aeg. We want to put the stock price in a cell in an Excel worksheet, and we’ll use VBA and the Microsoft Internet Control to navigate to the mentioned webpage and scrape the needed data.

First, we start by firing up Firefox and navigate to the webpage. Make sure you have the Firebug add-on installed and select the required data. Right-click and choose “Inspect element”.

Then hover over the required content in the Firebug pane, right click and select “Copy XPath”.

The XPath is now copied to the clipboard and looks like this: /html/body/form/div/div/div[7]/div[3]/div[3]/div/h2/span[2]/span.

Now we open a new Excel document, place a Button on the first worksheet and write the following code in it’s _Click() event.

Make sure you set references to the Microsoft Internet Controls and the Microsoft HTML Object Library. The XPath we got using Firefox/Firebug is used as an argument to the function getXPathElement. I wrote this getXPathElement function to be able to address locations in a DOM document as one would normally do in an XML document.

This function doesn’t support the full XPath syntax, but it serves our purposes. Clicking the button opens an Internet Explorer window and get’s the stock price value in cell A1. Download the full project here.

6 thoughts on “Web scraping using XPath and Visual Basic

  1. Aswin,

    Your code works for me with Excel-2010 VBA and IE8 on any website so far. Of course I need some individual adjustments by case.

    I have further tried to develop a shorter code, since the user defined Function getXPathElement runs a lot of loops that accummulate when I have an array of e.g. 5 XPathes to scrap 5 elements in batch progressing (ran 5×10000 items at 2sec/set of 5) from 1 web-page earch engine as following:

    strPath(1) = “/html/body/div[2]/div/div/div[2]/div/div/div[4]/div/p/strong”
    strPath(2) = “/html/body/div[2]/div/div/div[2]/div/div[2]/div[4]/div/p/strong”
    strPath(3) = “/html/body/div[2]/div/div/div[2]/div/div[3]/div[4]/div/p/strong”
    strPath(4) = “/html/body/div[2]/div/div/div[2]/div/div/div[2]/h2/a”
    strPath(5) = “/html/body/div[2]/div/div/div[2]/div/div[3]/div[2]/p[2]”

    For n = 1 To 5
    ‘ get last node:
    strPathArray = Split(strPath(n), “/”)
    lastNode = strPathArray(UBound(strPathArray))

    ‘ get element by TagName fpr strpath(n) -> item 1,2,3,4,5:
    Set elem = ie.document.getElementsByTagName(lastNode).item(n)
    ws.Cells(iRow, oCol).Value = elem.innerHTML

    Next n

    The only problem is that indexed tags with [ ] – like last node from strPath(5) p[2] – do not provide the innerHTML, while all others strPath(1) thru (4) work fine.

    If that’s proper coding – well who knows?

    Since you are most deeply involved in this matter, maybe we can discuss another solution?

  2. Hi there!
    I am trying to use this code directly in vb.net

    any ideas how to make it work?

    what I am trying to do, is:
    download the HTML and get some info.
    I know how to do it, but I like the idea of using xPath.

    thanks!!

  3. Ik heb geen comment maar een vraag. Je artikel over Xpath lijkt voor mij een eind in de goede richtingj, maar zoek daarbij nog iets specifieks.

    Ik wil vanuit VB.NET code een knop ‘CW-narrow’ klikken op de volgende website : http://websdr.ewi.utwente.nl:8901/.

    Dit een een web-based kortegolf ontvanger op de TU in Twente. Ik heb wel voor elkaar om de ‘MUTE’ checkbox vanuit VB aan te zetten. De CW-narrow knop zit in een tabel. Hoe kan krijg ik deze know geklikt?

    Ik gebruik in mijn applicatie de Gecko webbrowser.

    Kan jij mij helpen?

    M vr grt Onno de Meijer

Leave a Reply

Your email address will not be published. Required fields are marked *