Python DevCenter
Building Recursive Descent Parsers with Python

An HTML Scraper

As a final example, consider the development of a simple HTML "scraper." It is not a comprehensive HTML parser, as such a parser would require scores of parse expressions. Fortunately, it is not usually necessary to have a complete HTML grammar definition to be able to extract significant pieces of data from most web pages, especially those autogenerated by CGI or other application programs.



This example will extract data with a minimal parser, targeted to work with a specific web page--in this case, the page kept by NIST listing publicly available network time protocol (NTP) servers. This routine could be part of a larger NTP client application that, during its initialization, would look up what NTP servers are currently available.

To begin developing an HTML scraper, you must first see what sort of HTML text you will need to process. By visiting the web site and viewing the returned HTML source, you can see that the page lists the names and IP addresses of the NTP servers in an HTML table:

Name             IP Address    Location
time-a.nist.gov  129.6.15.28   NIST, Gaithersburg, Maryland
time-b.nist.gov  129.6.15.29   NIST, Gaithersburg, Maryland

The underlying HTML source for this table uses <table>, <tr>, and <td> tags to structure the NTP server data:

<table border="0" cellpadding="3" cellspacing="3" frame="" width="90%">
                <tr align="left" valign="top">
                        <td><b>Name</b></td>
                        <td><b>IP Address</b></td>
                        <td><b>Location</b></td>
                </tr>
                <tr align="left" valign="top" bgcolor="#c7efce">
                        <td>time-a.nist.gov</td>
                        <td>129.6.15.28</td>
                        <td>NIST, Gaithersburg, Maryland</td>
                </tr>
                <tr align="left" valign="top">
                        <td>time-b.nist.gov</td>
                        <td>129.6.15.29</td>
                        <td>NIST, Gaithersburg, Maryland</td>
                </tr>
       ...

This table is part of a much larger body of HTML, but pyparsing allows you to define a parse expression that matches only a subset of the total input text and to scan for text that matches the given parse expression. So you need only define the minimum amount of grammar required to match the desired HTML source.

The program should extract the IP addresses and locations of those servers, so you can focus your grammar on just those columns of the table. Informally, you want to extract the values that match the pattern:

<td> IP address </td> <td> location name </td>

You do want to be more specific than matching on something as generic as <td> any text </td> <td> more text </td>, because so general an expression would match the first two columns of the table instead of the last two (as well as the first two columns of any other table on the page!). Instead, use the specific format of the IP address to narrow your search pattern and eliminate false matches from other table data on the page.

To build up the elements of an IP address, start by defining an integer, then combine four integers with intervening periods:

integer   = Word("0123456789")
ipAddress = integer + "." + integer + "." + integer + "." + integer
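
As a quick check of these two definitions, the following minimal sketch parses one of the addresses from the table. The variable name tokens is only illustrative. Note that each integer and each period comes back as a separate token, and that pyparsing skips whitespace between elements by default.

```python
from pyparsing import Word

# Same definitions as in the text above
integer = Word("0123456789")
ipAddress = integer + "." + integer + "." + integer + "." + integer

# Parse a single address taken from the NIST table
tokens = ipAddress.parseString("129.6.15.28")
print(tokens.asList())
# prints ['129', '.', '6', '.', '15', '.', '28']
```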

You will also need to match the HTML tags <td> and </td>, so define parse elements for each:

tdStart = Literal("<td>")
tdEnd   = Literal("</td>")

In general, <td> tags can also contain attribute specifiers for alignment, color, and so on. However, this is not a general-purpose parser, only one written specifically for this web page, which fortunately does not use complicated <td> tags. (The latest version of pyparsing includes a helper function, makeHTMLTags, for constructing HTML tag expressions; it supports attribute specifiers in opening tags.)
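
For pages that do use attributes on their <td> tags, a sketch along the following lines may help. It uses pyparsing's makeHTMLTags helper, which returns a pair of expressions for the opening and closing tag; the sample tag text and the name result are made up for illustration and do not come from the NIST page.

```python
from pyparsing import makeHTMLTags

# makeHTMLTags builds expressions for both the opening and closing tag
tdStart, tdEnd = makeHTMLTags("td")

# The opening-tag expression accepts attributes (hypothetical sample input)
result = tdStart.parseString('<td align="left" valign="top">')
print(result.align, result.valign)

# A bare tag with no attributes still matches
tdStart.parseString("<td>")
tdEnd.parseString("</td>")
```

Attribute values are exposed as named results on the parsed tokens, so a grammar could filter on them if a page needed it. The simple Literal definitions are enough for the page discussed here.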

Finally, you need some sort of expression to match the server's location description. This is actually a rather freely formatted bit of text--there's no knowing whether it will include alphabetic data, commas, periods, or numbers--so the simplest choice is to just accept everything up to the terminating </td> tag. Pyparsing includes a class named SkipTo for this kind of grammar element.
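
To see how SkipTo behaves on its own, here is a small sketch that applies it to the contents of one location cell. The names location and tokens are illustrative only. SkipTo returns the skipped text as a single token and, by default, stops before the target expression without consuming it, which is why tdEnd still has to follow it in the grammar.

```python
from pyparsing import Literal, SkipTo

tdEnd = Literal("</td>")

# Accept everything up to, but not including, the closing tag
location = SkipTo(tdEnd)

tokens = (location + tdEnd).parseString("NIST, Gaithersburg, Maryland</td>")
print(tokens.asList())
# prints ['NIST, Gaithersburg, Maryland', '</td>']
```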

You now have all the pieces you need to define the time server text pattern:

timeServer = tdStart + ipAddress + tdEnd + \
                 tdStart + SkipTo(tdEnd) + tdEnd

To extract the data, invoke timeServer.scanString, which is a generator function that yields the matched tokens and the start and end string positions for each matching set of text. This application uses only the matched tokens.
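
The following sketch shows one way the scan might look. It is not the article's final program: sample_html is a hand-copied stand-in for the downloaded page, and the names servers, parts, address, and location are assumptions made for this illustration. The token positions used below follow from the grammar as defined above, where every tag, integer, and period is a separate token.

```python
from pyparsing import Literal, SkipTo, Word

integer = Word("0123456789")
ipAddress = integer + "." + integer + "." + integer + "." + integer
tdStart = Literal("<td>")
tdEnd = Literal("</td>")
timeServer = tdStart + ipAddress + tdEnd + tdStart + SkipTo(tdEnd) + tdEnd

# Stand-in for the real page source (two rows copied from the table above)
sample_html = """
<tr align="left" valign="top" bgcolor="#c7efce">
    <td>time-a.nist.gov</td>
    <td>129.6.15.28</td>
    <td>NIST, Gaithersburg, Maryland</td>
</tr>
<tr align="left" valign="top">
    <td>time-b.nist.gov</td>
    <td>129.6.15.29</td>
    <td>NIST, Gaithersburg, Maryland</td>
</tr>
"""

servers = []
# scanString yields (tokens, start, end) for each match; only tokens is used
for tokens, start, end in timeServer.scanString(sample_html):
    parts = tokens.asList()
    address = "".join(parts[1:8])   # four integers and three periods
    location = parts[10]            # text captured by SkipTo
    servers.append((address, location))

for address, location in servers:
    print(address, "-", location)
```

The name column is skipped automatically: the scan finds <td> there, but the text that follows is not an IP address, so the expression fails at that position and the scan moves on.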
