Python DevCenter
oreilly.comSafari Books Online.Conferences.

advertisement


Building Recursive Descent Parsers with Python
Pages: 1, 2, 3, 4, 5

Pyparsing Is a "Combinator"

With the pyparsing module, you first define the basic pieces of your grammar. Then you combine them into more complex parse expressions for the various branches of the overall grammar syntax. Combine them by defining relationships such as:



  • Which expressions should follow each other in the grammar, such as "the keyword if is followed by a Boolean expression enclosed in parentheses"
  • Which expressions are valid alternatives at a certain point in the grammar, such as "a SQL command may start with SELECT, INSERT, UPDATE, or DELETE"
  • Which expressions are optional, as in "a phone number may optionally be preceded by an area code, enclosed in parentheses"
  • Which expressions are repeatable, such as "an opening XML tag may contain zero or more attributes"

Although some complex grammars can involve dozens or even hundreds of grammar combinations, many parsing tasks are easily performed with only a handful of definitions. Capturing the grammar in BNF helps organize your thoughts and parser design. It also helps you keep track of your progress in implementing the grammar with pyparsing's functions and classes.

Defining a Simple Grammar

The smallest building blocks of most grammars are typically exact strings of characters. For example, here is a simple BNF for parsing a phone number:

number      :: '0'.. '9'*
phoneNumber :: [ '(' number ')' ] number '-' number

Because this looks for dashes and parentheses within a phone number string, you can also define simple literal tokens for these punctuation marks:

dash   = Literal( "-" )
lparen = Literal( "(" )
rparen = Literal( ")" )

To define the groups of numbers in the phone number, you need to handle groups of characters of varying lengths. For this, use the Word token:

digits = "0123456789"
number = Word( digits )

The number token will match contiguous sequences made up of characters listed in the string digits; that is, it is a "word" composed of digits (as opposed to a traditional word, which is composed of letters of the alphabet). Now you have enough individual pieces of the phone number, so you can string them together using the And class.

phoneNumber = 
    And( [ lparen, number, rparen, number, dash, number ] )

This is fairly ugly, and unnatural to read. Fortunately, the pyparsing module defines operator methods to combine separate parse elements more easily. A more legible definition uses + for And:

phoneNumber = lparen + number + rparen + number + dash + number

For an even cleaner version, the + operator will join strings to parse elements, implicitly converting the strings to Literals. This gives the very easy-to-read:

phoneNumber = "(" + number + ")" + number + "-" + number

Finally, to designate that the area code at the beginning of the phone number is optional, use pyparsing's Optional class:

phoneNumber = Optional( "(" + number + ")" ) + number + "-" + number

Using the Grammar

Once you've defined your grammar, the next step is to apply it to the source text. Pyparsing expressions support three methods for processing input text with a given grammar:

  • The parseString method uses a grammar that completely specifies the contents of an input string, parses the string, and returns a collection of strings and substrings for each grammar construct.
  • The scanString method uses a grammar that may match only parts of an input string, scans the string looking for matches, and returns a tuple that contains the matched tokens and their starting and ending locations within the input string.
  • The transformString method is a variation on scanString. It applies any changes to the matched tokens and returns a single string representing the original input text, as modified by the individual matches.

The initial Hello, World! parser calls parseString and returns straightforward token results:

Hello, World! -> ['Hello', ',', 'World', '!']

Although this looks like a simple list of token strings, pyparsing returns data using a ParseResults object. In the example above, the results variable behaves like a simple Python list. In fact, you can index into the results just like a list:

print results[0]
print results[-2]

will print:

Hello
World

ParseResults also lets you define names for individual syntax elements, making it easier to retrieve bits and pieces of the parsed text. This is especially helpful when a grammar includes optional elements, which can change the length and offsets of the returned token list. By modifying the definitions of salute and greetee:

salute  = Word( alphas+"'" ).setResultsName("salute")
greetee = Word( alphas ).setResultsName("greetee")

you can reference the corresponding tokens as if they were attributes of the returned results object:

print hello, "->", results    
print results.salute
print results.greetee

Now the program will print:

G'day, Mate! -> ["G'day", ',', 'Mate', '!']
G'day
Mate

Results names can greatly help to improve the readability and maintainability of your parsing programs.

In the case of the phone number grammar, you can parse an input string containing a list of phone numbers, one after the next, as:

phoneNumberList = OneOrMore( phoneNumber )
data            = phoneNumberList.parseString( inputString )

This will return data as a pyparsing ParseResults object, containing a list of all of the input phone numbers.

Pyparsing includes some helper expressions, such as delimitedList, so that if your input were a comma-separated list of phone numbers, you could simply change phoneNumberList to:

phoneNumberList = delimitedList( phoneNumber )

This will return the same list of phone numbers that you had before. (delimitedList supports any custom string or expression as a delimiter, but comma delimiters are the most common, and so they are the default.)

If, instead of having a string containing only phone numbers, you had a complete mailing list of names, addresses, zip codes, and phone numbers, you could extract the phone numbers using scanString. scanString is a Python generator function, so you must use it in a for loop, list comprehension, or generator expression.

for data,dataStart,dataEnd in 
    phoneNumber.scanString( mailingListText ):
    .
    .
    # do something with the phone number tokens, 
    # returned in the 'data' variable
    .
    .

Lastly, if you had the same mailing list but wished to hide the numbers from, say, a potential telemarketer, you could transform the string by attaching a parse action that just changes all phone numbers to the string (000)000-0000. Replacing the input tokens with a fixed string is a common parse action, so pyparsing provides a built-in function replaceWith to make this very simple:

phoneNumber.setParseAction( replaceWith("(000)000-0000") )
sanitizedList = 
    phoneNumber.transformString( originalMailingListText )

Pages: 1, 2, 3, 4, 5

Next Pagearrow





Sponsored by: