Python DevCenter

Building Recursive Descent Parsers with Python

When Good Input Goes Bad

Pyparsing processes input text until it runs out of text that matches its parser elements. If it finds an unexpected token or character that no parsing element matches, pyparsing raises a ParseException. A ParseException prints a diagnostic message by default; it also has attributes to help you locate the problem: the line number (lineno), the column (col), the text line (line), and an annotated line of text.



If you provide the input string "Hello, World?" to your parser, you will receive the exception:

pyparsing.ParseException: Expected "!" (at char 12), (line:1, col:13)

At this point, you can choose to fix the input text or make the grammar more tolerant of other syntax (in this case, supporting question marks as valid sentence terminators).
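As a minimal sketch of both options, the following assumes the greeting grammar is Word(alphas) + "," + Word(alphas) + "!" (an assumption here, since the grammar itself is not shown in this section). It catches the ParseException and reads its lineno, col, and line attributes, then tries a hypothetical tolerant variant that uses oneOf to accept either terminator:

```python
from pyparsing import Word, alphas, oneOf, ParseException

# Assumed grammar for the greeting: word, comma, word, exclamation point
greet = Word(alphas) + "," + Word(alphas) + "!"

try:
    greet.parseString("Hello, World?")
except ParseException as pe:
    # Location attributes carried by the exception
    error_info = (pe.lineno, pe.col, pe.line)
    print("line %d, column %d: %s" % error_info)

# A more tolerant grammar (hypothetical variant): accept "!" or "?"
tolerantGreet = Word(alphas) + "," + Word(alphas) + oneOf("! ?")
tokens = tolerantGreet.parseString("Hello, World?").asList()
print(tokens)
```

Whether to loosen the grammar or reject the input depends on the application; the tolerant version simply treats the terminator as one more token.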

A Complete Application

Consider an application where you need to process chemical formulas, such as NaCl, H2O, or C6H5OH. For this application, the chemical formula grammar will be one or more element symbols, each followed by an optional integer. In BNF-style notation, this is:

integer       :: '0'..'9'+
cap           :: 'A'..'Z'
lower         :: 'a'..'z'
elementSymbol :: cap lower*
elementRef    :: elementSymbol [ integer ]
formula       :: elementRef+
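As a quick cross-check of the notation above, the same rules can be restated as a regular expression. This is a different technique from pyparsing, shown only to make the grammar concrete; the matchesFormula helper and the pattern names are hypothetical:

```python
import re

# elementSymbol = cap lower*, elementRef = elementSymbol [ integer ]
ELEMENT_REF = r"[A-Z][a-z]*(?:[0-9]+)?"
# formula = elementRef+
FORMULA = re.compile(r"(?:%s)+" % ELEMENT_REF)

def matchesFormula(text):
    """Return True if the whole string fits the grammar (hypothetical helper)."""
    return FORMULA.fullmatch(text) is not None

for sample in ["NaCl", "H2O", "C6H5OH", "h2o"]:
    print(sample, matchesFormula(sample))
```

A regular expression can say whether a string fits, but it does not hand back the structured tokens that the rest of this application needs.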

The pyparsing module handles the optional integer and the one-or-more repetition with the classes Optional and OneOrMore. The definition of elementSymbol uses the two-argument form of the Word constructor: the first argument lists the set of valid leading characters, and the second argument gives the set of valid body characters. Using the pyparsing module, a simple version of the grammar is:

caps       = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
lowers     = caps.lower()
digits     = "0123456789"

element    = Word( caps, lowers )
elementRef = element + Optional( Word( digits ) )
formula    = OneOrMore( elementRef )

elements   = formula.parseString( testString )

So far, this program is an adequate tokenizer, processing the following formulas into their appropriate tokens. The default behavior for pyparsing is to return all of the parsed tokens within a single list of matching substrings:

H2O -> ['H', '2', 'O']
C6H5OH -> ['C', '6', 'H', '5', 'O', 'H']
NaCl -> ['Na', 'Cl']
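To see why adjacent symbols such as Na and Cl come out as separate tokens, here is a small sketch of the two-argument Word on its own. The variable names firstSymbol and greedy are illustrative only:

```python
from pyparsing import Word

caps   = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
lowers = caps.lower()

# First argument: allowed leading characters; second: allowed body characters
element = Word(caps, lowers)

# Matching stops at "C", because a capital cannot appear in the body
firstSymbol = element.parseString("NaCl").asList()
print(firstSymbol)

# A one-argument Word accepts every listed character anywhere,
# so it would swallow the whole string as a single token
greedy = Word(caps + lowers).parseString("NaCl").asList()
print(greedy)
```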

Of course, you want to do some processing with these returned results beyond simply printing them as a list. Assume that you want to compute the molecular weight for each chemical formula. Somewhere, the program defines a dictionary of chemical symbols and their corresponding atomic weights:

atomicWeight = {
    "O"  : 15.9994,
    "H"  : 1.00794,
    "Na" : 22.9897,
    "Cl" : 35.4527,
    "C"  : 12.0107,
    ...
    }

Next it would be good to establish a more logical grouping in the parsed chemical symbols and associated quantities, to return a structured set of results. Fortunately, the pyparsing module provides the Group class for just this purpose. By changing the elementRef declaration from:

elementRef = element + Optional( Word( digits ) )

to:

elementRef = Group( element + Optional( Word( digits ) ) )

you will now get the results grouped by chemical symbol:

H2O -> [['H', '2'], ['O']]
C6H5OH -> [['C', '6'], ['H', '5'], ['O'], ['H']]
NaCl -> [['Na'], ['Cl']]

The last simplification is to include a default value for the quantity part of elementRef, using the default argument for the constructor of the Optional class:

elementRef = Group( element + Optional( Word( digits ), 
                                default="1" ) )

Every elementRef now returns a pair of values: the element's chemical symbol and the number of atoms of that element, with "1" implied if no quantity is given. The test formulas return a very clean list of ordered pairs of element symbols and their respective quantities:

H2O -> [['H', '2'], ['O', '1']]
C6H5OH -> [['C', '6'], ['H', '5'], ['O', '1'], ['H', '1']]
NaCl -> [['Na', '1'], ['Cl', '1']]
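Because every result is a uniform pair, downstream code can consume it without special cases. As an illustration only, the hypothetical helper below (countAtoms is not part of pyparsing or of the program in this article) tallies atoms per element from a list of pairs shaped like the results above. Note that H occurs twice in C6H5OH, and that the quantities are still strings:

```python
def countAtoms(pairs):
    """Tally atoms per element symbol from (symbol, quantity-string) pairs."""
    totals = {}
    for symbol, qty in pairs:
        # Quantities arrive as strings from the parser, so convert before adding
        totals[symbol] = totals.get(symbol, 0) + int(qty)
    return totals

# Pairs copied from the parsed C6H5OH output
phenol = [['C', '6'], ['H', '5'], ['O', '1'], ['H', '1']]
print(countAtoms(phenol))
```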

The final step is to compute the molecular weight for each formula. Add a single line of Python code after the call to parseString:

wt = sum( [ atomicWeight[elem] * int(qty) 
                    for elem,qty in elements ] )

giving the results:

H2O -> [['H', '2'], ['O', '1']] (18.01528)
C6H5OH -> [['C', '6'], ['H', '5'], ['O', '1'], ['H', '1']]
        (94.11124)
NaCl -> [['Na', '1'], ['Cl', '1']] (58.4424)

Listing 2 contains the entire pyparsing program.

Listing 2

from pyparsing import Word, Optional, OneOrMore, Group, ParseException

atomicWeight = {
    "O"  : 15.9994,
    "H"  : 1.00794,
    "Na" : 22.9897,
    "Cl" : 35.4527,
    "C"  : 12.0107
    }
    
caps = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
lowers = caps.lower()
digits = "0123456789"

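# An element symbol is one capital letter followed by zero or more lowercase letters
element = Word( caps, lowers )
# Group each symbol with its quantity; the quantity defaults to "1" when omitted
elementRef = Group( element + Optional( Word( digits ), default="1" ) )
formula = OneOrMore( elementRef )

tests = [ "H2O", "C6H5OH", "NaCl" ]
for t in tests:
    try:
        results = formula.parseString( t )
        # end=" " keeps the weight on the same output line
        print( t, "->", results, end=" " )
    except ParseException as pe:
        print( pe )
    else:
        # Sum atomic weight times atom count over every (symbol, quantity) pair
        wt = sum( atomicWeight[elem]*int(qty) for elem,qty in results )
        print( "(%.3f)" % wt )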

========================
H2O -> [['H', '2'], ['O', '1']] (18.015)
C6H5OH -> [['C', '6'], ['H', '5'], ['O', '1'], ['H', '1']] (94.111)
NaCl -> [['Na', '1'], ['Cl', '1']] (58.442)

One of the nice by-products of using a parser is the inherent validation it performs on the input text. Note that in the calculation of the wt variable, there was no need to test that the qty string was all numeric, or to catch the ValueError that int() raises for an invalid argument. If qty were not all numeric, and therefore not a valid argument to int(), it would not have passed the parser.
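A short sketch of that validation, reusing the grammar from Listing 2. One caveat: by default parseString stops at the first text it cannot match rather than rejecting it, so this example passes parseAll=True to require that the whole string match. The isValidFormula helper is hypothetical, introduced only for this illustration:

```python
from pyparsing import Word, Optional, OneOrMore, Group, ParseException

caps = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
lowers = caps.lower()
digits = "0123456789"

element = Word(caps, lowers)
elementRef = Group(element + Optional(Word(digits), default="1"))
formula = OneOrMore(elementRef)

def isValidFormula(text):
    """Return True if the entire string matches the formula grammar."""
    try:
        # parseAll=True makes trailing unmatched text an error
        formula.parseString(text, parseAll=True)
        return True
    except ParseException:
        return False

for candidate in ["H2O", "2HO", "H2O?"]:
    print(candidate, isValidFormula(candidate))
```

The grammar checks only the shape of the input. A well-formed symbol that is missing from the atomicWeight dictionary would still pass the parser and fail later at the dictionary lookup.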
