More Spidering Hacks

Hack #78: Super Word Lookup

Working on a paper, book, or thesis and need a nerdy definition of one word, and alternatives to another?



You're writing a paper and getting sick of constantly looking up words in your dictionary and thesaurus. As most of the hacks in this book have done, you can scratch your itch with a little bit of Perl. This script uses the dict protocol (http://www.dict.org/) and Thesaurus.com (http://www.thesaurus.com/) to find all you need to know about a word.

DICT.org and several other dictionary sites speak the dict protocol, which makes our task easier: we do not need to filter through HTML code to get what we are looking for. A quick look through CPAN (http://www.cpan.org/) reveals that the dict protocol has already been implemented as a Perl module, Net::Dict (http://search.cpan.org/author/NEILB/Net-Dict/lib/Net/Dict.pod). Its documentation is well written and the module is easy to use; with just a few lines, you have more definitions than you can shake a stick at. Next problem.

Unfortunately, the thesaurus part of our program will not be as simple. However, there is a great online thesaurus (http://www.thesaurus.com/) that we will use to get the information we need. The main page of the site offers a form to look up a word, and the results page has exactly what we want. A quick look at the results URL shows that the word is simply passed in the q query parameter, so this will be an easy hurdle to overcome: using LWP, we can grab the page we want and need to worry only about parsing through it.

Since some words have multiple forms (noun, verb, etc.), there might be more than one entry for a word; this needs to be kept in mind. Looking at the HTML source, you can see that each row of the data is on its own line, starting with some table tags, then the header for the line (Concept, Function, etc.), followed by the content. The easiest way to handle this is to go through each section individually, grabbing everything from Entry to Source, and then parse out what's in between.

Since we want only synonyms for the exact word we searched for, we will grab only sections where the entry line contains nothing but the word we are looking for, wrapped in the highlighting tag used by the site. Once we have this, we can strip out those highlighting tags and proceed to finding the synonym and antonym lines, either of which might be missing from a given section.

The easiest thing to do here is to throw all the words into an array; this makes it easier to sort them, remove duplicates, and display them. When you are parsing through long HTML, you might find it easier to put the common HTML strings in variables and use them in the regular expressions; it makes the code easier to read. Once we have the long list of words, we use the Sort::Array module to get an alphabetical, duplicate-free listing of results.
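The section-by-section approach described above can be tried offline before pointing it at the live site. The following is only a sketch: the HTML fragment is a hypothetical sample modeled on the markup strings used in dict.pl (the real page may well differ), the $exact_entry pattern and unique_sorted helper are illustrative names, and a plain hash stands in for Sort::Array so that nothing outside core Perl is required.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $word = 'hack';

# Hypothetical results markup, modeled on the HTML strings in dict.pl.
# The live page may be laid out differently.
my $data = <<'HTML';
<tr><td><b>Entry:</b>&nbsp;&nbsp;</td><td><b style="background: #ffffaa">hack</b></td></tr>
<tr><td><b>Function:</b>&nbsp;&nbsp;</td><td>verb</td></tr>
<tr><td><b>Synonyms:</b>&nbsp;&nbsp;</td><td>chop, cut, hew</td></tr>
<tr><td><b>Antonyms:</b>&nbsp;&nbsp;</td><td>join, unite</td></tr>
<tr><td><b>Source:</b>&nbsp;&nbsp;</td><td>example</td></tr>
<tr><td><b>Entry:</b>&nbsp;&nbsp;</td><td><b style="background: #ffffaa">hack</b></td></tr>
<tr><td><b>Function:</b>&nbsp;&nbsp;</td><td>noun</td></tr>
<tr><td><b>Synonyms:</b>&nbsp;&nbsp;</td><td>drudge, cut, plodder</td></tr>
<tr><td><b>Source:</b>&nbsp;&nbsp;</td><td>example</td></tr>
<tr><td><b>Entry:</b>&nbsp;&nbsp;</td><td><b style="background: #ffffaa">hack</b> writer</td></tr>
<tr><td><b>Synonyms:</b>&nbsp;&nbsp;</td><td>scribbler</td></tr>
<tr><td><b>Source:</b>&nbsp;&nbsp;</td><td>example</td></tr>
HTML

# common HTML strings kept in variables, as in dict.pl.
my $middle_html    = ':</b>&nbsp;&nbsp;</td><td>';
my $end_html       = '</td></tr>';
my $highlight_html = '<b style="background: #ffffaa">';

# an entry line qualifies only if it holds the highlighted word
# and nothing else. \Q...\E keeps the HTML from being read as regex syntax.
my $exact_entry = qr/^\Q$middle_html$highlight_html$word\E<\/b>\Q$end_html\E/;

my (@synonyms, @antonyms);

# walk the page one Entry...Source section at a time.
# /s lets "." match the newlines inside a section.
while ($data =~ /Entry(.*?)<b>Source:<\/b>/gs) {
    my $section = $1;
    next unless $section =~ $exact_entry;   # skips "hack writer"

    # either line may be missing, so test for each independently.
    if ($section =~ /Synonyms\Q$middle_html\E([^<]*)\Q$end_html\E/) {
        push @synonyms, split /, /, $1;
    }
    if ($section =~ /Antonyms\Q$middle_html\E([^<]*)\Q$end_html\E/) {
        push @antonyms, split /, /, $1;
    }
}

# drop empty strings and duplicates, then sort alphabetically.
sub unique_sorted {
    my %seen;
    my @unique = grep { length($_) && !$seen{$_}++ } @_;
    return sort @unique;
}

@synonyms = unique_sorted(@synonyms);
@antonyms = unique_sorted(@antonyms);

print "Synonyms for $word: ", join(', ', @synonyms), "\n";
print "Antonyms for $word: ", join(', ', @antonyms), "\n";
```

With this sample, the duplicate "cut" is reported once and the "hack writer" section is ignored, because its entry line holds more than the highlighted word.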

The Code

Save the following code as dict.pl:

#!/usr/bin/perl -w
#
# Dict - looks up definitions, synonyms and antonyms of words.
# Comments, suggestions, contempt? Email adam@bregenzer.net.
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#

use strict; $|++;
use LWP;
use Net::Dict;
use Sort::Array "Discard_Duplicates";
use URI::Escape;

my $word = $ARGV[0]; # the word to look-up
die "You didn't pass a word!\n" unless $word;
print "Definitions for word '$word':\n";

# get the dict.org results.
my $dict = Net::Dict->new('dict.org')
    or die "Couldn't connect to dict.org\n";
my $defs = $dict->define($word);
foreach my $def (@{$defs}) {
    my ($db, $definition) = @{$def};
    print $definition . "\n";
}

# base URL for thesaurus.com requests
# as well as the surrounding HTML of
# the data we want. cleaner regexps.
my $base_url       = "http://thesaurus.reference.com/search?q=";
my $middle_html    = ":</b>&nbsp;&nbsp;</td><td>";
my $end_html       = "</td></tr>";
my $highlight_html = "<b style=\"background: #ffffaa\">";

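# grab the thesaurus results, bailing out if the request fails.
my $ua = LWP::UserAgent->new(agent => 'Mozilla/4.76 [en] (Win98; U)');
my $response = $ua->get($base_url . uri_escape($word));
die "Couldn't fetch thesaurus results: " . $response->status_line . "\n"
    unless $response->is_success;
my $data = $response->content;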

# holders for matches.
my (@synonyms, @antonyms);

# and now loop through them all. the /s modifier lets "."
# match newlines, since each section spans several lines.
while ($data =~ /Entry(.*?)<b>Source:<\/b>(.*)/s) {
    my $match = $1; $data = $2;

    # strip out the bold marks around the matched word.
    $match =~ s/${highlight_html}([^<]+)<\/b>/$1/;

    # push our results into our various arrays. a section can
    # hold both lines, so test for each one independently.
    if ($match =~ /Synonyms${middle_html}([^<]*)${end_html}/) {
        push @synonyms, (split /, /, $1);
    }
    if ($match =~ /Antonyms${middle_html}([^<]*)${end_html}/) {
        push @antonyms, (split /, /, $1);
    }
}

# sort them with sort::array,
# and return unique matches.
# (test the array itself: $#synonyms > 0 would skip a lone match.)
if (@synonyms) {
    @synonyms = Discard_Duplicates(
        sorting      => 'ascending',
        empty_fields => 'delete',
        data         => \@synonyms,
    );

    print "Synonyms for $word:\n";
    print join(', ', @synonyms), "\n\n";
}

# same thing as above.
if (@antonyms) {
    @antonyms = Discard_Duplicates(
        sorting      => 'ascending',
        empty_fields => 'delete',
        data         => \@antonyms,
    );

    print "Antonyms for $word:\n";
    print join(', ', @antonyms), "\n";
}

Running the Hack

Invoke the script on the command line, passing it one word at a time. As far as I know, these sites know how to work with English words only. This script has a tendency to generate a lot of output, so you might want to pipe it to less or redirect it to a file.

Here is an example where I look up the word "hack":

% perl dict.pl "hack"
Definitions for word 'hack':
<snip>
hack
 
   <jargon> 1. Originally, a quick job that produces what is
   needed, but not well.
 
   2.  An incredibly good, and perhaps very time-consuming, piece
   of work that produces exactly what is needed.

<snip>
 
   See also {neat hack}, {real hack}.
 
   [{Jargon File}]
 
   (1996-08-26)
 
Synonyms for hack:
be at, block out, bother, bug, bum, carve, chip, chisel, chop, cleave, 
crack, cut, dissect, dissever, disunite, divide, divorce, dog, drudge, 
engrave, etch, exasperate, fashion, form, gall, get, get to, grate, grave, 
greasy grind, grind, grub, grubber, grubstreet, hack, hew, hireling, incise, 
indent, insculp, irk, irritate, lackey, machine, mercenary, model, mold, 
mould, nag, needle, nettle, old pro, open, part, pattern, peeve, pester, 
pick on, pierce, pique, plodder, potboiler, pro, provoke, rend, rip, rive, 
rough-hew, sculpt, sculpture, separate, servant, sever, shape, slash, slave, 
slice, stab, stipple, sunder, tear asunder, tease, tool, trim, vex, whittle, 
wig, workhorse
 
Antonyms for hack:
appease, aristocratic, attach, calm, cultured, gladden, high-class, humor, 
join, make happy, meld, mollify, pacify, refined, sophisticated, superior, 
unite

Hacking the Hack

There are a few ways you can improve upon this hack.

Using specific dictionaries
You can either use a different dict server or you can use only certain dictionaries within the dict server. The DICT.org server uses 13 dictionaries; you can limit it to use only the 1913 edition of Webster's Revised Unabridged Dictionary by changing the $dict->define line to:

my $defs = $dict->define($word, 'web1913');

The $dict->dbs method will get you a list of dictionaries available.
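As a rough sketch of how that list might be printed, the script below formats the name-to-title hash that Net::Dict's dbs method returns. The database_listing helper and the idea of passing the server name on the command line are assumptions made for this illustration, not part of the original hack; the script contacts a server only when a host name is supplied.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# format the short name and title of each database a dict server offers.
# $dict is any object whose dbs() method returns a name => title hash,
# as Net::Dict's does. (database_listing is a hypothetical helper.)
sub database_listing {
    my ($dict) = @_;
    my %dbs = $dict->dbs;
    return map { sprintf('%-12s %s', $_, $dbs{$_}) } sort keys %dbs;
}

# query a live server only when a host name is given on the command line.
if (my $host = shift @ARGV) {
    require Net::Dict;
    my $dict = Net::Dict->new($host)
        or die "Couldn't connect to $host\n";
    print "$_\n" for database_listing($dict);
}
```

Saved under a name of your choosing (say, dbs.pl), it could be run as perl dbs.pl dict.org; the short names in the first column are what you pass as the second argument to $dict->define.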

Clarifying the thesaurus
For brevity, the thesaurus section prints all the synonyms and antonyms for a particular word. It would be more useful if it separated them according to the function of the word and possibly the definition.
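One possible way to separate the results by function is to key a hash on the content of each section's Function line. This is a sketch under assumptions: the HTML fragment is a hypothetical sample modeled on the markup strings in dict.pl, and the field helper and %by_function hash are names invented for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $word = 'hack';

# Hypothetical results markup, modeled on the HTML strings in dict.pl.
my $data = <<'HTML';
<tr><td><b>Entry:</b>&nbsp;&nbsp;</td><td>hack</td></tr>
<tr><td><b>Function:</b>&nbsp;&nbsp;</td><td>verb</td></tr>
<tr><td><b>Synonyms:</b>&nbsp;&nbsp;</td><td>chop, cut, hew</td></tr>
<tr><td><b>Antonyms:</b>&nbsp;&nbsp;</td><td>join, unite</td></tr>
<tr><td><b>Source:</b>&nbsp;&nbsp;</td><td>example</td></tr>
<tr><td><b>Entry:</b>&nbsp;&nbsp;</td><td>hack</td></tr>
<tr><td><b>Function:</b>&nbsp;&nbsp;</td><td>noun</td></tr>
<tr><td><b>Synonyms:</b>&nbsp;&nbsp;</td><td>drudge, cut, plodder</td></tr>
<tr><td><b>Source:</b>&nbsp;&nbsp;</td><td>example</td></tr>
HTML

my $middle_html = ':</b>&nbsp;&nbsp;</td><td>';
my $end_html    = '</td></tr>';

# return the content of one labeled row (e.g. "Function") in a
# section, or undef if the section has no such row.
sub field {
    my ($section, $label) = @_;
    return $section =~ /\Q$label$middle_html\E([^<]*)\Q$end_html\E/ ? $1 : undef;
}

# $by_function{verb}{Synonyms}{chop} = 1, and so on;
# using hash keys for the words removes duplicates for free.
my %by_function;
while ($data =~ /Entry(.*?)<b>Source:<\/b>/gs) {
    my $section  = $1;
    my $function = field($section, 'Function') // 'unknown';
    for my $kind (qw(Synonyms Antonyms)) {
        my $words = field($section, $kind);
        next unless defined $words;
        $by_function{$function}{$kind}{$_} = 1 for split /, /, $words;
    }
}

# build one heading per function, with its word lists beneath.
my @report;
for my $function (sort keys %by_function) {
    push @report, "$word ($function)";
    for my $kind (qw(Synonyms Antonyms)) {
        my $set = $by_function{$function}{$kind} or next;
        push @report, "  $kind: " . join(', ', sort keys %$set);
    }
}
print "$_\n" for @report;
```

Grouping by definition as well would work the same way, with one more level of hash keyed on whichever row of the page carries the definition.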

Adam Bregenzer

Kevin Hemenway is the coauthor of Mac OS X Hacks, author of Spidering Hacks, and the alter ego of the pervasively strange Morbus Iff, creator of disobey.com, which bills itself as "content for the discontented."

Tara Calishain is the creator of the site ResearchBuzz. She is an expert on Internet search engines and how they can be used effectively in business situations.

