Informatics, Uncategorized

Web scraping using XPath and Visual Basic

More and more web services are popping up. Getting the weather in Paris or the price of a book on Amazon is only a REST, SOAP or XML-RPC call away. Unfortunately, most of the web content being delivered is still intended to be rendered in a browser and the providers of that content haven’t included a nice web service API for automated extraction.

Web scraping is a technique for extracting information from websites that don’t provide web services. One common way to find the required info on a webpage is to use regular expressions. A much better approach however, is to make use of XPath. XPath is a XML query language that can be used to locate a specific part of an XML document. One approach taken by some web scrapers is to parse an HTML document into an XML document and then use XPath to query the document. The problem with this approach is that most HTML out there won’t parse into XML too well. I will explain how to circumvent this problem below.

Let’s consider the following scenario. We want to automatically get a certain stock quote from the web, in this case the stock quote for AEGON, which happens to be my current employer. This data is published on the following website: http://www.marketwatch.com/quotes/aeg. We want to put the stock price in a cell in an Excel worksheet, and we’ll use VBA and the Microsoft Internet Control to navigate to the mentioned webpage and scrape the needed data.

First, we start by firing up Firefox and navigate to the webpage. Make sure you have the Firebug add-on installed and select the required data. Right-click and choose “Inspect element”.

Then hover over the required content in the Firebug pane, right click and select “Copy XPath”.

The XPath is now copied to the clipboard and looks like this: /html/body/form/div/div/div[7]/div[3]/div[3]/div/h2/span[2]/span.

Now we open a new Excel document, place a Button on the first worksheet and write the following code in it’s _Click() event.

Private Sub cmdGetQuote_Click()
    Dim ie As InternetExplorer
    Dim elem As HTMLBaseElement
    Dim url As String

    url = "http://www.marketwatch.com/quotes/aeg"

    Set ie = New InternetExplorer
    ie.Visible = True
    ie.Navigate url
    WaitBrowserQuiet ie

    Set elem = getXPathElement("/html/body/form/div/div/div[6]/div[3]/div[3]/div/h2/
span[2]/span", ie.Document)
    Range("A1").Value = elem.innerText

    Set ie = Nothing
End Sub

Make sure you set references to the Microsoft Internet Controls and the Microsoft HTML Object Library. The XPath we got using Firefox/Firebug is used as an argument to the function getXPathElement. I wrote this getXPathElement function to be able to address locations in a DOM document as one would normally do in an XML document.

Public Function getXPathElement(sXPath As String, objElement As HTMLBaseElement) As HTMLBaseElement
    Dim sXPathArray() As String

    Dim sNodeName As String
    Dim sNodeNameIndex As String
    Dim sRestOfXPath As String
    Dim lNodeIndex As Long
    Dim lCount As Long

    ' Split the xpath statement
    sXPathArray = Split(sXPath, "/")
    sNodeNameIndex = sXPathArray(1)
    If Not InStr(sNodeNameIndex, "[") > 0 Then
        sNodeName = sNodeNameIndex
        lNodeIndex = 1
    Else
        sXPathArray = Split(sNodeNameIndex, "[")
        sNodeName = sXPathArray(0)
        lNodeIndex = CLng(Left(sXPathArray(1), Len(sXPathArray(1)) - 1))
    End If
    sRestOfXPath = Right(sXPath, Len(sXPath) - (Len(sNodeNameIndex) + 1))

    Set getXPathElement = Nothing
    For lCount = 0 To objElement.childNodes().Length - 1
        If UCase(objElement.childNodes().Item(lCount).nodeName) = UCase(sNodeName) Then
            If lNodeIndex = 1 Then
                If sRestOfXPath = "" Then
                    Set getXPathElement = objElement.childNodes().Item(lCount)
                Else
                    Set getXPathElement = getXPathElement(sRestOfXPath, objElement.childNodes().Item(lCount))
                End If
            End If
            lNodeIndex = lNodeIndex - 1
        End If
    Next lCount
End Function

This function doesn’t support the full XPath syntax, but it serves our purposes. Clicking the button opens an Internet Explorer window and get’s the stock price value in cell A1. Download the full project here.

3 Comments

speak up

Add your comment below, or trackback from your own site.

Subscribe to these comments.

Be nice. Keep it clean. Stay on topic. No spam.

You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

*Required Fields