More and more web services are popping up. Getting the weather in Paris or the price of a book on Amazon is only a REST, SOAP or XML-RPC call away. Unfortunately, most of the web content being delivered is still intended to be rendered in a browser and the providers of that content haven’t included a nice web service API for automated extraction.
Web scraping is a technique for extracting information from websites that don’t provide web services. One common way to find the required info on a webpage is to use regular expressions. A much better approach however, is to make use of XPath. XPath is a XML query language that can be used to locate a specific part of an XML document. One approach taken by some web scrapers is to parse an HTML document into an XML document and then use XPath to query the document. The problem with this approach is that most HTML out there won’t parse into XML too well. I will explain how to circumvent this problem below.
Let’s consider the following scenario. We want to automatically get a certain stock quote from the web, in this case the stock quote for AEGON, which happens to be my current employer. This data is published on the following website: http://www.marketwatch.com/quotes/aeg. We want to put the stock price in a cell in an Excel worksheet, and we’ll use VBA and the Microsoft Internet Control to navigate to the mentioned webpage and scrape the needed data.
First, we start by firing up Firefox and navigate to the webpage. Make sure you have the Firebug add-on installed and select the required data. Right-click and choose “Inspect element”.
Then hover over the required content in the Firebug pane, right click and select “Copy XPath”.
The XPath is now copied to the clipboard and looks like this: /html/body/form/div/div/div[7]/div[3]/div[3]/div/h2/span[2]/span.
Now we open a new Excel document, place a Button on the first worksheet and write the following code in it’s _Click() event.
Private Sub cmdGetQuote_Click()
Dim ie As InternetExplorer
Dim elem As HTMLBaseElement
Dim url As String
url = "http://www.marketwatch.com/quotes/aeg"
Set ie = New InternetExplorer
ie.Visible = True
ie.Navigate url
WaitBrowserQuiet ie
Set elem = getXPathElement("/html/body/form/div/div/div[6]/div[3]/div[3]/div/h2/span[2]/span", ie.Document)
Range("A1").Value = elem.innerText
Set ie = Nothing
End Sub
Make sure you set references to the Microsoft Internet Controls and the Microsoft HTML Object Library. The XPath we got using Firefox/Firebug is used as an argument to the function getXPathElement. I wrote this getXPathElement function to be able to address locations in a DOM document as one would normally do in an XML document.
Public Function getXPathElement(sXPath As String, objElement As HTMLBaseElement) As HTMLBaseElement
Dim sXPathArray() As String
Dim sNodeName As String
Dim sNodeNameIndex As String
Dim sRestOfXPath As String
Dim lNodeIndex As Long
Dim lCount As Long
' Split the xpath statement
sXPathArray = Split(sXPath, "/")
sNodeNameIndex = sXPathArray(1)
If Not InStr(sNodeNameIndex, "[") > 0 Then
sNodeName = sNodeNameIndex
lNodeIndex = 1
Else
sXPathArray = Split(sNodeNameIndex, "[")
sNodeName = sXPathArray(0)
lNodeIndex = CLng(Left(sXPathArray(1), Len(sXPathArray(1)) - 1))
End If
sRestOfXPath = Right(sXPath, Len(sXPath) - (Len(sNodeNameIndex) + 1))
Set getXPathElement = Nothing
For lCount = 0 To objElement.childNodes().Length - 1
If UCase(objElement.childNodes().Item(lCount).nodeName) = UCase(sNodeName) Then
If lNodeIndex = 1 Then
If sRestOfXPath = "" Then
Set getXPathElement = objElement.childNodes().Item(lCount)
Else
Set getXPathElement = getXPathElement(sRestOfXPath, objElement.childNodes().Item(lCount))
End If
End If
lNodeIndex = lNodeIndex - 1
End If
Next lCount
End Function
This function doesn’t support the full XPath syntax, but it serves our purposes. Clicking the button opens an Internet Explorer window and get’s the stock price value in cell A1. Download the full project here.


Ja ik zei al dat je dat het best zo kon doen..
Mathematician’s Birthday Calendar…
In my previous post I described a way of doing web scraping using XPath and VB. I showed how to get stock quotes from the web into Excel using this method. Recently I used the same method to get a list of birthdays of mathematicians. The data was scrap…
Aswin,
Your code works for me with Excel-2010 VBA and IE8 on any website so far. Of course I need some individual adjustments by case.
I have further tried to develop a shorter code, since the user defined Function getXPathElement runs a lot of loops that accummulate when I have an array of e.g. 5 XPathes to scrap 5 elements in batch progressing (ran 5×10000 items at 2sec/set of 5) from 1 web-page earch engine as following:
strPath(1) = “/html/body/div[2]/div/div/div[2]/div/div/div[4]/div/p/strong”
strPath(2) = “/html/body/div[2]/div/div/div[2]/div/div[2]/div[4]/div/p/strong”
strPath(3) = “/html/body/div[2]/div/div/div[2]/div/div[3]/div[4]/div/p/strong”
strPath(4) = “/html/body/div[2]/div/div/div[2]/div/div/div[2]/h2/a”
strPath(5) = “/html/body/div[2]/div/div/div[2]/div/div[3]/div[2]/p[2]”
For n = 1 To 5
‘ get last node:
strPathArray = Split(strPath(n), “/”)
lastNode = strPathArray(UBound(strPathArray))
‘ get element by TagName fpr strpath(n) -> item 1,2,3,4,5:
Set elem = ie.document.getElementsByTagName(lastNode).item(n)
ws.Cells(iRow, oCol).Value = elem.innerHTML
Next n
The only problem is that indexed tags with [ ] – like last node from strPath(5) p[2] – do not provide the innerHTML, while all others strPath(1) thru (4) work fine.
If that’s proper coding – well who knows?
Since you are most deeply involved in this matter, maybe we can discuss another solution?