LinuxDevCenter.com

Building Unix Tools with Ruby

Get the Plumbing Right

With option parsing code in place, you are now ready to add code for processing CSV files and for making your script behave like a proper command line tool.



It is an old Unix tradition that commands can be piped together to create more complex tools. Your script should obey that convention; doing so will make it more flexible and allow other users to do things the authors of the software never dreamed of.

Writing a Ruby script that fits into that scheme is actually very simple. The simplest piece of code that copies everything from STDIN to STDOUT is just three lines long:

while gets
    print 
end

Add it at the end of your script and see how it works. You do not need to worry about the way data is sent to your script. Both examples shown below give the same results, all without writing additional code.

$ cat file1 file2 | csvt -e 2,0
$ csvt -e 2,0 file1 file2
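
Both invocations work because gets, when called without a receiver, reads from ARGF: the files still named in ARGV or, if ARGV is empty, standard input. This relies on the option-parsing code having already removed the options (such as -e 2,0) from ARGV, which is an assumption about the earlier part of the script. A minimal sketch, with the loop wrapped in a hypothetical helper called copy_lines so that it can be tried on in-memory data:

```ruby
require "stringio"

# Hypothetical helper (not part of csvt): copy every line from input to output.
# Any object with a gets method works: ARGF, $stdin, a File, or a StringIO.
def copy_lines(input, output)
  while (line = input.gets)
    output.print line
  end
end

# In a real tool the call would be: copy_lines(ARGF, $stdout)
# Here an in-memory stream stands in for the files or the pipe.
sample = StringIO.new("a,b,c\n1,2,3\n")
copy_lines(sample, $stdout)
```

Passing the streams as arguments is only a convenience for experimenting; the three-line loop above does the same job with the implicit $_ variable.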

Processing Input

The simple loop shown above is not very useful, because it does not process its input, but it does illustrate the general concept. The csvt script will use two such loops, one for --extract and one for --remove. Both start with a test of the appropriate flag, extract_f for --extract and remove_f for --remove.

if extract_f == true
    first_f = true

The first_f flag is used to avoid the "off by one" error inside the while loop:

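    while gets
        # chomp removes only a trailing line terminator; chop would also
        # eat a data character if the last line had no newline
        data   = $_.chomp
        data   = data.split(",")
        data_n = data.length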

Every loop cycle starts with a call to gets, which reads a new line from STDIN and stores it in $_. Next the script removes the end of line character and splits the line into an array of separate columns.
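
Two details of this step are easy to check in isolation. String#chop always removes the last character, while String#chomp removes only a trailing line terminator, which matters when the final line of a file has no newline. String#split with a single argument also discards trailing empty fields, so a line such as a,, would be counted as one column; passing -1 as the limit keeps them. The sketch below uses a hypothetical helper, split_fields, that is not part of csvt:

```ruby
# Hypothetical helper: strip the line terminator and split on commas,
# keeping trailing empty fields (limit -1).
def split_fields(raw_line)
  raw_line.chomp.split(",", -1)
end

p "a,b".chop                 # chop eats a data character when there is no newline
p "a,b".chomp                # chomp leaves the text untouched
p "a,,\n".chomp.split(",")   # trailing empty fields are dropped
p split_fields("a,,\n")      # trailing empty fields are kept
```

Whether trailing empty fields should count as columns depends on the data you expect; the point is only that the choice affects the field count used by the integrity check.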

        if first_f
            old_data_n = data_n
            first_f    = false
        end

The number of columns is stored in data_n. The script then checks whether the line just read is the first one. If so, it sets old_data_n, the column count of the non-existent previous line, to the column count of the first line, so that the first line passes the data integrity check (which compares the number of columns in the previous and the current line).

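        if data_n != old_data_n
            # Each + must end its line; a line that starts with + would be
            # parsed as a separate statement and the message would be cut short.
            $stderr.print "csvt: the number of fields on the " +
                          "following line does not match the number " +
                          "of fields on the previous line\n"
            $stderr.print $_
            exit(1)
        end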

Should the data integrity test fail, the error message, followed by the offending line, is printed to standard error (STDERR) and csvt exits with status 1. It is tempting to relax the rules a little and introduce an option for skipping such errors, but that is a job for a separate tool: a specialized data integrity checker, which is usually written with a particular data set in mind and is therefore outside the scope of csvt's specification.
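
Because the script stops at the first mismatch, comparing each line with the previous one amounts to comparing every line with the first. The sketch below expresses the rule as a hypothetical method, first_inconsistent_line, which returns the zero-based position of the first offending line instead of printing and exiting, so the check can be tried without ending the program. The method name and the demo message are illustrative and not part of csvt.

```ruby
# Hypothetical helper: return the index of the first line whose field count
# differs from that of the first line, or nil when all lines agree.
def first_inconsistent_line(lines)
  expected = nil
  lines.each_with_index do |raw_line, index|
    count = raw_line.chomp.split(",").length
    expected ||= count          # the first line sets the expected width
    return index if count != expected
  end
  nil
end

rows = ["a,b,c\n", "1,2,3\n", "4,5\n", "6,7,8\n"]
bad = first_inconsistent_line(rows)
# warn writes to standard error, so the message stays out of any pipe
# that consumes the filtered data on standard output.
warn "demo: field count changes on line #{bad + 1}" if bad
```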

When everything goes well, we can begin constructing a line of output. This starts with initializing the line variable:

        line = ""

Next we iterate over the array of arguments for the --extract option. Each iteration tests whether the column index is less than the number of fields in the line we just read. If it is not, csvt complains, suggests the allowed range of indexes, and exits with code 1.

        extract_args.each do |column|

            if !(column < data_n)
                # The + ends the first line so that both strings are joined
                # into one argument of print.
                $stderr.print "csvt: column index out of range, " +
                              "use numbers between 0 and ",
                              data_n - 1, "\n"
                exit(1)
            end

If all goes well, we use the value of column as the index into the data array and add the result to the string stored in line, followed by a comma.

            line += data[column] + ","
        end

Once all columns listed as arguments of --extract have been processed, we can print the contents of the line variable, less the last character, which we replace with the end of line character.

        print line[0, line.length-1], "\n"

The last thing is setting the old_data_n variable to the number of columns in the currently processed line, so the data integrity check can spot any errors.

        old_data_n = data_n
    end
end

So it goes until the end of the file or data stream. When all data is processed, our script ends with a call to exit(0).
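
The string-building steps above can also be written with Array#values_at and Array#join, which removes the need to trim a trailing comma. This is a sketch of an alternative, not the code csvt uses: extract_columns is a hypothetical name, and raising IndexError stands in for the script's message-and-exit handling. The sketch also rejects negative indexes, which Ruby would otherwise count from the end of the array.

```ruby
# Hypothetical helper: return the requested fields, in the requested order,
# joined with commas. Indexes are assumed to be integers.
def extract_columns(fields, indexes)
  indexes.each do |index|
    unless index >= 0 && index < fields.length
      raise IndexError,
            "column index out of range, use numbers between 0 and #{fields.length - 1}"
    end
  end
  fields.values_at(*indexes).join(",")
end

puts extract_columns(%w[name city zip], [2, 0])
```

Validating every index before building the output keeps a partially built line from ever being printed.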

The code used to process STDIN when the user chooses the --remove option is similar to the --extract handler, with a small twist after the line variable initialization.

if remove_f == true
    first_f = true

    while gets
        data   = $_.chomp
        data   = data.split(",")
        data_n = data.length

        if first_f
            old_data_n = data_n
            first_f    = false
        end

        if data_n != old_data_n
            $stderr.print "csvt: the number of fields on the following " +
                          "line does not match the number of fields on " +
                          "the previous line\n"
            $stderr.print $_
            exit(1)
        end

        line = ""

There is an additional loop that sets the columns whose indexes are listed as arguments of --remove to "".

        remove_args.each do |column|

            if !(column < data_n)
                $stderr.print "csvt: field index out of range, " +
                              "use numbers between 0 and ",
                              data_n - 1, "\n"
                exit(1)
            end

            data[column] = ""
        end

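The rest of the code closely follows the --extract handler, except that it walks the whole data array and skips the blanked columns. Note that fields that were already empty in the input are skipped as well.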

        data.each do |column|
            if column == ""
                next
            else
                line += column + ","
            end
        end

        print line[0, line.length-1], "\n"

        old_data_n = data_n
    end
end
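
One side effect of blanking the removed columns and then skipping every empty string is that fields that were empty in the input disappear too. A hedged alternative is to filter by position instead, so that genuinely empty fields survive. In the sketch below, remove_columns is a hypothetical name and not part of csvt:

```ruby
# Hypothetical helper: drop the fields at the given positions and join
# the remaining ones with commas, leaving empty fields in place.
def remove_columns(fields, indexes)
  kept = fields.reject.with_index { |_field, position| indexes.include?(position) }
  kept.join(",")
end

puts remove_columns(["a", "", "c", "d"], [2])
```

Which behavior is right depends on whether empty fields carry meaning in your data; the version in the article is simpler, while this one keeps the column layout intact.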

We now have a complete script to help us filter CSV files. It may grow in the future, but for now it does its job. It plays well with other command-line Unix tools and is a well-behaved Unix citizen. The complete script is here.

Make csvt Executable

Your script is working now and you could call it quits, but for greater convenience in the future, try to make an extra effort and make csvt executable, so you can type just this:

$ csvt

instead of this:

$ ruby csvt.rb

If you are using Unix, simply add this code on the first line of your script:

#!/usr/local/bin/ruby

The actual path to the ruby interpreter binary might be different on your system. The easiest way to find out is to use the locate or which command:

$ locate ruby
$ which ruby

If both fail, use find:

$ find / -name "ruby"

This might take a while, because find searches the whole directory tree. Once you know the path to the ruby binary, paste it after #! and save the script to disk. Remember that this line must be the very first line of your script, or the system will not recognize it as a request to use the Ruby interpreter. If you need to pass options to the interpreter, you can list them after the path, but there is no need to list the name of the script itself.
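
If ruby already runs from your shell, you can also ask the interpreter for its own location. The sketch below uses RbConfig.ruby from the standard library to print a candidate first line for the script. It reports the interpreter that runs the snippet, so it assumes that this is the interpreter you want csvt to use.

```ruby
require "rbconfig"

# Absolute path of the interpreter executing this snippet.
interpreter = RbConfig.ruby

# Print it in the form expected on the first line of the script.
puts "#!#{interpreter}"
```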

Now save csvt to disk, and make it executable with $ chmod u+x csvt.

The u+x argument tells chmod to mark csvt as executable only by the owner of the script (that would be you ...). Other possibilities include g+x, which marks the script as executable by all members of the group that the script is assigned to (ls -l reveals the script's group); o+x, which would make the script executable by all other users (not a good idea); finally, a+x would make it executable by all users (this should be avoided as well).
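
The effect of u+x on the permission bits can be observed from Ruby on a scratch file. This sketch assumes a Unix-like system and uses FileUtils.chmod, which accepts symbolic modes like those of the chmod command. The temporary file merely stands in for csvt, and permission_bits is a hypothetical helper.

```ruby
require "fileutils"
require "tempfile"

# Hypothetical helper: return the permission bits of a file as an octal string.
def permission_bits(path)
  format("%o", File.stat(path).mode & 0o777)
end

Tempfile.create("csvt-demo") do |file|
  File.chmod(0o600, file.path)         # start from rw for the owner only
  before = permission_bits(file.path)
  FileUtils.chmod("u+x", file.path)    # add execute permission for the owner
  after = permission_bits(file.path)
  puts "#{before} -> #{after}"
end
```

Only the owner's execute bit changes; the group and other bits stay as they were, which is why u+x is the most conservative of the options listed above.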

Note that neither the #! notation nor the chmod command can be used in the Microsoft Windows environment unless you install the Cygwin package, which gives Windows a pretty good Unix-like environment. When installing Cygwin is not an option, you can still use csvt, but it must be preceded by the ruby command, as in ruby csvt -e file instead of csvt -e file.

Resources

The following places should be on the list of favorite destinations for everyone learning and using Ruby:

Books

If you want to enhance your knowledge of Ruby, you should take a look at Ruby in a Nutshell from O'Reilly or Programming Ruby from Addison-Wesley. Safari has at least half a dozen Ruby titles, from O'Reilly as well as other publishers.

Jacek Artymiak started his adventure with computers in 1986 with a Sinclair ZX Spectrum. He has been using various commercial and Open Source Unix systems since 1991. Today, Jacek runs devGuide.net, writes and teaches about Open Source software and security, and tries to make things happen.

