ONJava.com -- The Independent Source for Enterprise Java

XSLT Processing with Java

The code in Example 5-7 shows the complete implementation of the CSV parser.


Example 5-7: CSVXMLReader.java

package com.oreilly.javaxslt.util;
 
import java.io.*;
import java.net.URL;
 
import org.xml.sax.*;
import org.xml.sax.helpers.*;
 
 
/**
* A utility class that parses a Comma
* Separated Values (CSV) file and outputs its
* contents using SAX2 events. The format of CSV
* that this class reads is identical to the export
* format for Microsoft Excel. For simple values, the
* CSV file may look like this:
* <pre>
* a,b,c
* d,e,f
* </pre>
* Quotes are used as delimiters when the values
* contain commas:
* <pre>
* a,"b,c",d
* e,"f,g","h,i"
* </pre>
* Quotes that appear inside a value are doubled
* (written as two consecutive quote characters).
* This parser is smart enough to trim spaces
* around commas, as well.
*
* @author Eric M. Burke
*/
public class CSVXMLReader extends AbstractXMLReader {
 
  // an empty attribute for use with SAX
  private static final Attributes EMPTY_ATTR = new AttributesImpl( );
 
  /**
   * Parse a CSV file. SAX events are
   * delivered to the ContentHandler
   * that was registered via
   * <code>setContentHandler</code>.
   *
   * @param input the comma separated
   * values file to parse.
   */
  public void parse(InputSource input) throws IOException,
      SAXException {
    // if no handler is registered to receive events, don't bother
    // to parse the CSV file
    ContentHandler ch = getContentHandler( );
    if (ch == null) {
      return;
    }
 
    // convert the InputSource into a BufferedReader
    BufferedReader br = null;
    if (input.getCharacterStream( ) != null) {
      br = new BufferedReader(input.getCharacterStream( ));
    } else if (input.getByteStream( ) != null) {
      br = new BufferedReader(new InputStreamReader(
          input.getByteStream( )));
    } else if (input.getSystemId( ) != null) {
      java.net.URL url = new URL(input.getSystemId( ));
      br = new BufferedReader(new InputStreamReader(url.openStream( )));
    } else {
      throw new SAXException("Invalid InputSource object");
    }
 
    ch.startDocument( );
 
    // emit <csvFile>
    ch.startElement("","","csvFile",EMPTY_ATTR);
 
    // read each line of the file until EOF is reached
    String curLine = null;
    while ((curLine = br.readLine( )) != null) {
      curLine = curLine.trim( );
      if (curLine.length( ) > 0) {
        // create the <line> element
        ch.startElement("","","line",EMPTY_ATTR);
        // output data from this line
        parseLine(curLine, ch);
        // close the </line> element
        ch.endElement("","","line");
      }
    }
 
    // emit </csvFile>
    ch.endElement("","","csvFile");
    ch.endDocument( );
  }
 
  // Break an individual line into tokens.
  // This is a recursive function
  // that extracts the first token, then
  // recursively parses the
  // remainder of the line.
  private void parseLine(String curLine, ContentHandler ch)
    throws IOException, SAXException {
 
    String firstToken = null;
    String remainderOfLine = null;
    int commaIndex = locateFirstDelimiter(curLine);
    if (commaIndex > -1) {
      firstToken = curLine.substring(0, commaIndex).trim( );
      remainderOfLine = curLine.substring(commaIndex+1).trim( );
    } else {
      // no commas, so the entire line is the token
      firstToken = curLine;
    }
 
    // remove redundant quotes
    firstToken = cleanupQuotes(firstToken);
 
    // emit the <value> element
    ch.startElement("","","value",EMPTY_ATTR);
    ch.characters(firstToken.toCharArray(), 0, firstToken.length( ));
    ch.endElement("","","value");
 
    // recursively process the remainder of the line
    if (remainderOfLine != null) {
      parseLine(remainderOfLine, ch);
    }
  }
 
  // locate the position of the comma,
  // taking into account that
  // a quoted token may contain ignorable commas.
  private int locateFirstDelimiter(String curLine) {
    if (curLine.startsWith("\"")) {
      boolean inQuote = true;
      int numChars = curLine.length( );
      for (int i=1; i<numChars; i++) {
        char curChar = curLine.charAt(i);
        if (curChar == '"') {
          inQuote = !inQuote;
        } else if (curChar == ',' && !inQuote) {
          return i;
        }
      }
      return -1;
    } else {
      return curLine.indexOf(',');
    }
  }
 
  // remove quotes around a token, as well as pairs of quotes
  // within a token.
  private String cleanupQuotes(String token) {
    StringBuilder buf = new StringBuilder( );
    int length = token.length( );
    int curIndex = 0;
 
    if (token.startsWith("\"") && token.endsWith("\"")) {
      curIndex = 1;
      length--;
    }
 
    boolean oneQuoteFound = false;
    boolean twoQuotesFound = false;
 
    while (curIndex < length) {
      char curChar = token.charAt(curIndex);
      if (curChar == '"') {
        twoQuotesFound = oneQuoteFound;
        oneQuoteFound = true;
      } else {
        oneQuoteFound = false;
        twoQuotesFound = false;
      }
 
      if (twoQuotesFound) {
        twoQuotesFound = false;
        oneQuoteFound = false;
        curIndex++;
        continue;
      }
 
      buf.append(curChar);
      curIndex++;
    }
 
    return buf.toString( );
  }
}


CSVXMLReader is a subclass of AbstractXMLReader, so it must provide an implementation of the abstract parse method:

public void parse(InputSource input) throws IOException,
      SAXException {
    // if no handler is registered to receive 
    // events, don't bother
    // to parse the CSV file
    ContentHandler ch = getContentHandler( );
    if (ch == null) {
      return;
    }

The first thing this method does is check for the existence of a SAX ContentHandler. The base class, AbstractXMLReader, provides access to this object, which is responsible for listening to the SAX events. In our example, an instance of JAXP's TransformerHandler is used as the SAX ContentHandler implementation. If no handler is registered, our parse method simply returns, because nothing is listening for the events. In a real SAX parser, the XML would be parsed anyway, which provides an opportunity to check for errors in the XML data. Returning immediately is merely a performance optimization chosen for this class.

The SAX InputSource parameter allows our custom parser to locate the CSV file. Since an InputSource can supply its data in several ways, a parser must check each potential source in the order shown here: the character stream first, then the byte stream, and finally the system identifier (a URI):

// convert the InputSource into a BufferedReader
BufferedReader br = null;
if (input.getCharacterStream( ) != null) {
  br = new BufferedReader(input.getCharacterStream( ));
} else if (input.getByteStream( ) != null) {
  br = new BufferedReader(new InputStreamReader(
    input.getByteStream( )));
} else if (input.getSystemId( ) != null) {
  java.net.URL url = new URL(input.getSystemId( ));
  br = new BufferedReader(new InputStreamReader(url.openStream( )));
} else {
  throw new SAXException("Invalid InputSource object");
}
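The lookup order above can be pulled into a small helper. The following is a minimal sketch, not part of CSVXMLReader: the class name InputSourceReaders and its open and decode methods are hypothetical. It also assumes you want to honor InputSource.getEncoding( ) when a caller has set it, which the original listing does not do (it always uses the platform default encoding for byte streams).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.net.URL;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class; the names here are illustrative only.
class InputSourceReaders {

  // Resolve an InputSource to a BufferedReader, checking the character
  // stream, then the byte stream, then the system ID.
  static BufferedReader open(InputSource input)
      throws IOException, SAXException {
    if (input.getCharacterStream() != null) {
      return new BufferedReader(input.getCharacterStream());
    }
    if (input.getByteStream() != null) {
      return new BufferedReader(
          decode(input.getByteStream(), input.getEncoding()));
    }
    if (input.getSystemId() != null) {
      URL url = new URL(input.getSystemId());
      return new BufferedReader(
          decode(url.openStream(), input.getEncoding()));
    }
    throw new SAXException("Invalid InputSource object");
  }

  // Assumption: use the declared encoding when one is set; otherwise
  // fall back to the platform default, as the original code does.
  private static Reader decode(InputStream in, String encoding)
      throws IOException {
    return (encoding != null)
        ? new InputStreamReader(in, encoding)
        : new InputStreamReader(in);
  }

  public static void main(String[] args) throws Exception {
    // Both streams are set; the character stream takes precedence.
    InputSource src = new InputSource(new StringReader("a,b,c\n"));
    src.setByteStream(new ByteArrayInputStream("x,y,z\n".getBytes("UTF-8")));
    System.out.println(open(src).readLine()); // prints "a,b,c"
  }
}
```

Checking the character stream first matters because a caller who supplies a Reader has already decided how the bytes should be decoded.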

Assuming that our InputSource was valid, we can now begin parsing the CSV file and emitting SAX events. The first step is to notify the ContentHandler that a new document has begun:

ch.startDocument(  );
 
// emit <csvFile>
ch.startElement("","","csvFile",EMPTY_ATTR);

The XSLT processor interprets this to mean the following:

<?xml version="1.0" encoding="UTF-8"?>
<csvFile>

Our parser simply ignores many SAX2 features, particularly XML namespaces. This is why the namespace URI and local name parameters passed to the ContentHandler methods are empty strings; only the qualified name, such as csvFile, is supplied. The EMPTY_ATTR constant indicates that the element has no attributes.

The CSV file itself is very straightforward, so we merely loop over every line in the file, emitting SAX events as we read each line. The parseLine method is a private helper method that does the actual CSV parsing:

// read each line of the file until EOF is reached
String curLine = null;
while ((curLine = br.readLine(  )) != null) {
    curLine = curLine.trim(  );
    if (curLine.length(  ) > 0) {
        // create the <line> element
        ch.startElement("","","line",EMPTY_ATTR);
        parseLine(curLine, ch);
        ch.endElement("","","line");
    }
}

And finally, we must indicate that the parsing is complete:

// emit </csvFile>
ch.endElement("","","csvFile");
ch.endDocument(  );
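To see the complete event sequence in one place, the sketch below emits the same startDocument, startElement, characters, endElement, and endDocument calls for rows that are already tokenized, and records them with a simple handler. The EventSequenceDemo and MarkupRecorder names are hypothetical, and the recorder is only an illustration: it rebuilds markup from the qualified name and does not escape special characters the way a real serializer would.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the CSVXMLReader example.
class EventSequenceDemo {
  private static final Attributes EMPTY_ATTR = new AttributesImpl();

  // Rebuilds markup from the qualified name, because the namespace URI
  // and local name arrive as empty strings. No escaping is performed.
  static class MarkupRecorder extends DefaultHandler {
    final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String uri, String localName,
        String qName, Attributes atts) {
      out.append('<').append(qName).append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
      out.append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
      out.append(ch, start, length);
    }
  }

  // Emits the csvFile/line/value structure for pre-tokenized rows.
  static void emit(String[][] rows, ContentHandler ch) throws SAXException {
    ch.startDocument();
    ch.startElement("", "", "csvFile", EMPTY_ATTR);
    for (String[] row : rows) {
      ch.startElement("", "", "line", EMPTY_ATTR);
      for (String value : row) {
        ch.startElement("", "", "value", EMPTY_ATTR);
        ch.characters(value.toCharArray(), 0, value.length());
        ch.endElement("", "", "value");
      }
      ch.endElement("", "", "line");
    }
    ch.endElement("", "", "csvFile");
    ch.endDocument();
  }

  static String toMarkup(String[][] rows) throws SAXException {
    MarkupRecorder recorder = new MarkupRecorder();
    emit(rows, recorder);
    return recorder.out.toString();
  }

  public static void main(String[] args) throws SAXException {
    System.out.println(toMarkup(new String[][] {{"a", "b"}, {"c", "d"}}));
  }
}
```

Every startElement call is matched by an endElement call with the same qualified name, and the whole sequence is bracketed by startDocument and endDocument.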

The remaining methods in CSVXMLReader are not discussed in detail here because they only break each line of the CSV file into tokens, checking for commas and quotes along the way. One thing worth noting is the code that emits text, such as the following:

<value>Some Text Here</value>

SAX parsers use the characters method on ContentHandler to represent text, which has this signature:

public void characters(char[] ch, int start, int length)

Although this method could have been designed to take a String, using an array allows SAX parsers to preallocate a large character array and then reuse that buffer repeatedly. This is why an implementation of ContentHandler cannot simply assume that the entire ch array contains meaningful data. Instead, it must read only the specified number of characters beginning at the start position.
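On the receiving side, this means a handler should copy only the indicated slice of the array. The following is a minimal sketch of such a handler; the TextCollector class is hypothetical and simply accumulates text in a StringBuilder. The main method simulates a parser that reuses one buffer for two consecutive calls.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler used only to illustrate the characters contract.
class TextCollector extends DefaultHandler {
  private final StringBuilder text = new StringBuilder();

  @Override
  public void characters(char[] ch, int start, int length) {
    // Copy only the slice [start, start + length); the rest of the
    // array may hold stale data left over from earlier events.
    text.append(ch, start, length);
  }

  String getText() {
    return text.toString();
  }

  public static void main(String[] args) {
    TextCollector collector = new TextCollector();
    char[] buffer = new char[8];

    // First event: the buffer holds "abc".
    "abc".getChars(0, 3, buffer, 0);
    collector.characters(buffer, 0, 3);

    // Second event: the same buffer now holds "dec", but only the
    // first two characters are meaningful.
    "de".getChars(0, 2, buffer, 0);
    collector.characters(buffer, 0, 2);

    System.out.println(collector.getText()); // prints "abcde"
  }
}
```

A handler that instead called new String(ch) on the second event would pick up the leftover c and the unused tail of the array.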

Our parser uses a relatively straightforward approach, simply converting a String to a character array and passing that as a parameter to the characters method:

// emit the <value>text</value> element
ch.startElement("","","value",EMPTY_ATTR);
ch.characters(firstToken.toCharArray(), 0, firstToken.length(  ));
ch.endElement("","","value");
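For readers who want to experiment with the quoting rules outside of SAX, the tokenizing done by parseLine, locateFirstDelimiter, and cleanupQuotes can be approximated with a single loop. This is a simplified sketch, not the code from Example 5-7: the CsvLineSplitter class is hypothetical, it is iterative rather than recursive, and it trims each value after removing quotes, so whitespace just inside a quoted value is not preserved the way it is in the original.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified alternative to the recursive parseLine.
class CsvLineSplitter {

  static List<String> split(String line) {
    List<String> values = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    boolean inQuote = false;

    for (int i = 0; i < line.length(); i++) {
      char c = line.charAt(i);
      if (c == '"') {
        boolean doubled = inQuote
            && i + 1 < line.length()
            && line.charAt(i + 1) == '"';
        if (doubled) {
          // Two quotes inside a quoted value stand for one literal quote.
          current.append('"');
          i++; // skip the second quote of the pair
        } else {
          // An opening or closing quote; it is not part of the value.
          inQuote = !inQuote;
        }
      } else if (c == ',' && !inQuote) {
        // A comma outside quotes ends the current value.
        values.add(current.toString().trim());
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    values.add(current.toString().trim());
    return values;
  }

  public static void main(String[] args) {
    System.out.println(split("a,\"b,c\",d"));
    System.out.println(split("\"say \"\"hi\"\"\",x"));
  }
}
```

Each returned value corresponds to one value element that the parser would emit for the line.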
