 Published on ONLamp.com (http://www.onlamp.com/)


OpenGuides: City Wikis in Perl

by Kake Pugh
07/05/2007

Three and a half years ago, Perl.com published an article of mine describing the very beginnings of the OpenGuides project—an open source web application written in Perl, initially aimed at overcoming the limitations of the UseMod software.

From this modest beginning, we've developed a complete wiki toolkit called, unsurprisingly, Wiki::Toolkit, and a custom-built web application, OpenGuides, which provides structure and a UI layer on top of that toolkit. Development is ongoing, and the small core team of programmers has grown to include people of all levels of expertise. As new technologies such as RDF, proper CSS support, Ajax, and the Google Maps API have become available, we've taken the useful bits and incorporated them into OpenGuides.

The improvements have not been only technical; the OpenGuides mailing lists and IRC channel have developed into a close-knit community including programmers, testers, guide admins, and even the more prolific contributors to the various guides. We've held meetups, hackfests, and even pub crawls. It wouldn't be an exaggeration to say that the existence of OpenGuides has made my life better, and I doubt I'm the only one who can say that.

The Growth of the Project

All we wanted, back then in 2002, was to have something we could use to let us write up everything we knew about London—our favorite pubs and restaurants, our insider knowledge of the quirks of its public transport system, our top tips for places to buy knitting yarn. It was only after we started working on our custom software that we realized other people might be able to use it as well and that it might give people living in other cities a custom-built and well-tailored way to write about their own neighborhoods.

As other people learned about our project, new OpenGuides sites sprang up here and there. One of the first non-London cities covered was Oxford, which still boasts two OpenGuides-based sites, one catering specifically to vegans and the other a more general guide to the city. Later additions included Boston and Saint Paul/Minneapolis in the U.S.; Vienna, Oslo, and Bologna in Europe; and Milton Keynes, the Cotswolds, Birmingham, Norwich, and many more in the U.K. While some of these have fallen by the wayside, others are still going strong.

OpenGuides isn't really in the same situation as most open source projects: because of the nature of the project, we don't have thousands or even hundreds of direct users, and much of its development has been driven by the needs of the individual guides. Although a large number of people may use the sites built on OpenGuides, any feature requests or niggles that end users have about an OpenGuides site go first to their local guide's admin team, who can often fix the problem themselves, perhaps by adding to the local documentation, by tweaking the config file or stylesheet, or simply by upgrading the software. Hence, the feature requests and bug reports that make it through to the core team tend to be well thought out and carefully described. This means we're very likely to take them seriously!

Design Issues

One major issue has always been that of design. While there's a convincing argument that keeping the design consistent across a family of web sites is good because it means people who're used to one of these sites find it easy to contribute to all the others (the approach taken by most sites running on the MediaWiki software), this is perhaps less of an issue for very local sites like most of those running on OpenGuides. Also, guide admins kept asking for more flexibility in the design.

We're using the Template Toolkit to generate all our HTML, so in theory people could just edit the templates themselves; unfortunately, we ran into various problems with this, not least of them the fact that presentation logic is still logic, and hence can have bugs. The dual solution we came up with (and are still working toward) was first to split our monolithic templates into smaller snippets and make it explicit which ones are safely editable; and second, to base our HTML on the philosophy of the CSS Zen Garden—to make the HTML plain, clear, and semantic, and then tweak the colors, widths, and placement by means of CSS.

The other advantage of cleaning up our template files like this is that it makes it easier to distribute templates in multiple languages. Again, this is something we're still working on, mainly because the demand for it hasn't previously been as great as the demand for other features.

Dealing with Spam

Another issue that has appeared in the years since we began working on OpenGuides is the problem of wikispam. Wikis are hugely popular with spammers who want to increase their PageRank, since wikis in general tend to have high PageRank. The freely editable nature of a wiki means that, unless you have some defense, you can find your lovely web site covered in porn spam in a matter of minutes.

While OpenGuides already has a few anti-spam defenses—retroactive moderation by means of page and page version deletion, and proactive moderation for specific pages—we're currently working on an additional and completely customizable feature whereby a guide admin can choose to plug in her own spam-detection module, which is called before any page is written to the database. If this module says "yes, that's spam," the edit is refused and the user is notified. We'll be writing and distributing various modules for plugging in here, but if an admin wants to write her own, she can do whatever she likes, from a simple regex match on the content (or the categories, locales, username, IP address, etc.), to using Net::Akismet or similar and logging every refused edit for later perusal by admins.

The guide I contribute to, the Randomness Guide to London, is already running on the development code that includes the new pluggable anti-spam measures. In the month since we've been using these measures, we've caught around 1,500 spam edits (with no false positives). One reason for this success is that we've been keeping an eye on the spam that does get through, and tweaking our spam detection module as appropriate.

The code on the OpenGuides side is pretty simple; prior to accepting any edit, OpenGuides checks its config file to see if a spam detection module has been specified. If so, and if the module is loadable, then its looks_like_spam method is called to return a true or false value indicating whether this edit should be considered spam:

# If a spam detector module is configured, ask it whether this edit
# looks like spam.  Any failure (module not loadable, detector dies)
# leaves $is_spam undefined, so the edit is accepted.
my $spam_detector = $config->spam_detector_module;
my $is_spam;
if ( $spam_detector ) {
    eval {
        eval "require $spam_detector; 1" or die $@;
        $is_spam = $spam_detector->looks_like_spam(
            node     => $node,
            content  => $content,
            metadata => \%new_metadata,
        );
    };
}
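
To make the loading step concrete, here is a minimal, self-contained sketch of the same pattern outside OpenGuides. The check_for_spam helper and the Local::SpamDetector package are hypothetical names invented for illustration; only the looks_like_spam method name and its arguments come from the code above.

```perl
use strict;
use warnings;

# Hypothetical detector, defined inline so the sketch is self-contained.
# In a real install this would live in its own .pm file.
{
    package Local::SpamDetector;

    sub looks_like_spam {
        my ( $class, %args ) = @_;
        return $args{content} =~ /\bviagra\b/i ? 1 : 0;
    }
}
# Tell Perl the inline package is already loaded so "require" succeeds.
$INC{"Local/SpamDetector.pm"} = __FILE__;

# Hypothetical helper: load the named module and ask it about the edit.
# Returns undef (false) if the module cannot be loaded or the check
# dies, so a broken detector never blocks legitimate edits.
sub check_for_spam {
    my ( $module, %args ) = @_;

    # Only accept something shaped like a module name, since the name
    # is interpolated into a string eval below.
    return unless $module && $module =~ /\A\w+(?:::\w+)*\z/;

    my $is_spam;
    eval {
        eval "require $module; 1" or die $@;
        $is_spam = $module->looks_like_spam( %args );
    };
    return $is_spam;
}
```

Called as check_for_spam( "Local::SpamDetector", content => $text ), this returns a true value for spam, a false value for a clean edit, and undef when the named module does not exist.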

If an edit does look like spam, the editor is informed of this fact, and the edit is not saved:

if ( $is_spam ) {
    my $output = OpenGuides::Template->output(
        wiki     => $self->wiki,
        config   => $config,
        template => "spam_detected.tt",
        vars     => {
                      not_editable => 1,
                    },
    );
    return $output if $return_output;
    print $output;
    return;
}

The name of the page, the main (freeform) content, and the structured data associated with the page (the metadata) are all passed to the looks_like_spam method, allowing fine-grained spam detection. One of the most prevalent types of wikispam is an edit with the changelog comment of "Some grammatical corrections." This is easy to match:

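sub looks_like_spam {
    my ( $class, %args ) = @_;
    # The changelog comment may be missing, so default to an empty string.
    my $comment = $args{metadata}{comment};
    $comment = "" unless defined $comment;
    if ( $comment =~ /some grammatical corrections/i ) {
        return 1;
    }
    return 0;    # explicit false for everything else
}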
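
The same idea extends to several fields at once. The sketch below is one possible shape for such a module, not the detector used on any real guide: the package name, the rule list, and the "username" metadata key are assumptions made for illustration.

```perl
use strict;
use warnings;

{
    package Local::RuleBasedDetector;

    # Each rule names a field and a pattern; all of these are examples only.
    my @rules = (
        { field => "comment",  pattern => qr/some grammatical corrections/i },
        { field => "username", pattern => qr/\Ahttps?:/i },
        { field => "content",  pattern => qr/\b(?:viagra|cialis)\b/i },
    );

    # Return the text for a field: "content" is a top-level argument,
    # anything else is looked up in the metadata hash.  Array references
    # are joined so that multi-valued fields can be matched too.
    sub _field_text {
        my ( $args, $field ) = @_;
        my $value = $field eq "content" ? $args->{content}
                                        : $args->{metadata}{$field};
        return "" unless defined $value;
        return ref( $value ) eq "ARRAY" ? join( "\n", @$value ) : $value;
    }

    # True as soon as any rule matches its field, false otherwise.
    sub looks_like_spam {
        my ( $class, %args ) = @_;
        foreach my $rule ( @rules ) {
            my $text = _field_text( \%args, $rule->{field} );
            return 1 if $text =~ $rule->{pattern};
        }
        return 0;
    }
}
```

Keeping the rules in a list means that tightening the detector after a new kind of spam gets through is a one-line change.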

OpenGuides itself simply discards the attempted edit, leaving it up to the author of the spam detection module to decide on the most appropriate method of logging the attempt and notifying the guide administrators. On the Randomness Guide to London, we use Email::Send to email us all the details:

use Data::Dumper;
use Email::Send;

sub looks_like_spam {
    my ( $class, %args ) = @_;

    my $content = $args{content};
    # $1 holds whichever word matched, for the admins' benefit.
    if ( $content =~
                 /\b(viagra|cialis|tramadol|vicodin)\b/i ) {
        $class->notify_admins( %args,
                               reason => "Matches $1" );
        return 1;
    }
    return 0;
}

sub notify_admins {
    my ( $class, %args ) = @_;
    my $datestamp = localtime( time() );
    my $message = <<EOM;
From: kake\@earth.li
To: kake\@earth.li, bob\@randomness.org.uk
Date: $datestamp
Subject: Attempted spam edit on RGL

Someone just tried to edit RGL, and I said no because it
looked like spam.  Here follows a dump of the details:

EOM
    $message .= Dumper( \%args );

    my $sender = Email::Send->new( { mailer => "SMTP" } );
    $sender->mailer_args( [ Host => "localhost" ] );
    $sender->send( $message );
}
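
Email is only one option. For a guide that attracts a lot of spam, appending each refused edit to a log file may be easier to review in bulk. The following is a rough sketch of that alternative; the log_refused_edit name and the idea of passing in a log path are hypothetical, and it is not part of OpenGuides.

```perl
use strict;
use warnings;
use Data::Dumper;
use Fcntl qw( :flock );

# Hypothetical helper: append one refused edit to a log file.
# The caller supplies the path, followed by the same arguments that
# looks_like_spam received plus a "reason" string.
sub log_refused_edit {
    my ( $logfile, %args ) = @_;

    local $Data::Dumper::Sortkeys = 1;    # stable key order in the dump
    local $Data::Dumper::Indent   = 1;

    open my $fh, ">>", $logfile
        or die "Can't append to $logfile: $!";
    # Lock the file so that simultaneous refusals don't interleave.
    flock( $fh, LOCK_EX ) or die "Can't lock $logfile: $!";
    print {$fh} "=== ", scalar( localtime() ), " ===\n",
                Dumper( \%args ), "\n";
    close $fh or die "Can't close $logfile: $!";
    return 1;
}
```

Inside a detector, a call such as log_refused_edit( $path, %args, reason => "Matches $1" ) could stand in for the notify_admins call.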

Getting the Data Back out Again

So, what else makes OpenGuides different from other wiki software? Why would someone choose OpenGuides to run her city guide rather than, say, MediaWiki, or Kwiki, or MoinMoin?

OpenGuides' geographical awareness is certainly an advantage; having distance searching and Google Maps support built in is very handy. Perl programmers will also appreciate the fact that it's written in Perl, of course! But the big win, as far as I'm concerned, is its use of structured data. Not only does this allow us to build complex queries like "find me all the real ale pubs within 500m of King's Cross station that serve food at lunchtime" or "find me all the restaurants that Kake wants to be taken to that are within 500m of any Jubilee Line station," it also makes it easier for people to contribute content to the guides. A number of people have told me that they really appreciate seeing some structure on the edit page rather than just a big white blank box. I suppose it's the wiki equivalent of writer's block!

In theory, the possibilities for structured data in OpenGuides are endless, since the underlying Wiki::Toolkit software, which handles all data storage and output for OpenGuides, puts no restrictions on what kind of data can be stored. The most useful structured data fields at the moment are categories, locales, latitude, and longitude; although address, postal code, etc., are also stored in structured data fields, this is mainly a presentation and usability issue.

As well as the search and update tools incorporated within OpenGuides itself, individual guide admins are also free to write custom search tools. The Randomness Guide has a number of custom search scripts, including various tools aimed at making life easier for the guide's contributors, such as a way of finding all stub pages (pages with very little content), without those pages needing to be specifically marked as such.

OpenGuides data is accessible to programmers in a number of ways. Firstly, the OpenGuides modules have a number of methods which can be called externally; for example, text search results can be returned as a Perl data structure (or indeed a hash of Template Toolkit variables) instead of as a nicely formatted HTML page. If you'd like to get closer to the wires, the underlying Wiki::Toolkit modules offer additional methods, allowing you, for example, to grab a list of all categories that a given page has been placed in, or a list of all pages within a certain distance of a certain latitude/longitude. If that's not good enough (or is too slow), the fact that all OpenGuides data is structured, and is stored in an SQL database (your choice of Postgres, MySQL, or SQLite) means that even the most baroque questions can be answered simply by writing some SQL.

For example, when writing the stub page finder mentioned above, I went back to the SQL:

my $sql = "
SELECT node.name,
       locale.metadata_value,
       category.metadata_value
FROM node
LEFT JOIN metadata as locale
  ON ( node.id = locale.node_id
       AND node.version = locale.version
       AND lower( locale.metadata_type ) = 'locale'
     )
LEFT JOIN metadata as category
  ON ( node.id = category.node_id
       AND node.version = category.version
       AND lower( category.metadata_type ) = 'category'
     )
WHERE char_length( node.text ) < ?
AND node.text NOT LIKE '%#REDIRECT%'
";

if ( $q->param( "exclude_locales" ) ) {
    $sql .= " AND node.name NOT LIKE 'Locale %'";
}
if ( $q->param( "exclude_categories" ) ) {
    $sql .= " AND node.name NOT LIKE 'Category %'";
}

my $sth = $dbh->prepare( $sql );
$sth->execute( $length ) or die $dbh->errstr;
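
Because of the two LEFT JOINs, a page with several locales and categories comes back as several rows, one per combination. A sketch of how the rows might be collapsed into one record per page follows. The collapse_rows helper is hypothetical, and it assumes the rows have been fetched with DBI's fetchall_arrayref, so that each row is an array reference holding name, locale, and category.

```perl
use strict;
use warnings;

# Collapse (name, locale, category) rows into one record per page,
# keeping pages in the order they first appear in the result set.
sub collapse_rows {
    my ( $rows ) = @_;
    my ( %pages, @order );

    foreach my $row ( @$rows ) {
        my ( $name, $locale, $category ) = @$row;
        if ( !$pages{$name} ) {
            $pages{$name} = { locales => {}, categories => {} };
            push @order, $name;
        }
        # A LEFT JOIN with no match yields NULL, which DBI returns as undef.
        $pages{$name}{locales}{$locale}      = 1 if defined $locale;
        $pages{$name}{categories}{$category} = 1 if defined $category;
    }

    my @result;
    foreach my $name ( @order ) {
        push @result, {
            name       => $name,
            locales    => [ sort keys %{ $pages{$name}{locales} } ],
            categories => [ sort keys %{ $pages{$name}{categories} } ],
        };
    }
    return @result;
}
```

With the statement handle above, this would be called as collapse_rows( $sth->fetchall_arrayref ).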

Conversely, when writing a little widget to find the nearest Tube (subway) station to a given place, I made use of a couple of Wiki::Toolkit methods, first to find everything within a kilometre of the place, and second to get a list of all Tube stations; the simple intersection of these arrays gives me my answer:

use OpenGuides;
use OpenGuides::Config;
use Wiki::Toolkit::Plugin::Locator::Grid;

my $config_file = $ENV{OPENGUIDES_CONFIG_FILE}
                  || "../wiki.conf";
my $config = OpenGuides::Config->new( file => $config_file );

my $guide = OpenGuides->new( config => $config );
my $wiki = $guide->wiki;

# Register a locator that works on the OS grid coordinates.
my $locator = Wiki::Toolkit::Plugin::Locator::Grid->new(
    x => "os_x", y => "os_y" );
$wiki->register_plugin( plugin => $locator );

[...]

my @nearby = $locator->find_within_distance(
    node   => $origin,
    metres => 1000,
);
my @stations = $wiki->list_nodes_by_metadata(
    metadata_type  => "category",
    metadata_value => "tube",
    ignore_case    => 1,
);
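
The intersection step itself is not shown in the listing. One straightforward way to write it, assuming both methods return plain lists of page names, is sketched here; intersect_names and the sample page names are made up for illustration.

```perl
use strict;
use warnings;

# Return the elements of @$first that also appear in @$second,
# preserving the order of @$first.
sub intersect_names {
    my ( $first, $second ) = @_;
    my %wanted = map { $_ => 1 } @$second;
    return grep { $wanted{$_} } @$first;
}

# Sample data standing in for the results of the two wiki calls.
my @nearby   = ( "Blue Anchor", "Hammersmith Station", "Crabtree Tavern" );
my @stations = ( "Barons Court Station", "Hammersmith Station" );

my @nearby_stations = intersect_names( \@nearby, \@stations );
print "$_\n" for @nearby_stations;    # prints "Hammersmith Station"
```

Building a hash from one list keeps the lookup cheap even when the category contains every station on the network. If more than one station falls within the radius, choosing the single nearest would additionally require comparing the distance to each.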

The Data Determines the Structure

While OpenGuides does impose some structure, we deliberately left the choice of categories, locales, and house style to contributors to each guide. Although this makes it slightly harder to automatically transfer data between guides, it's more important to make it easier for local people to create the kind of guide that they find most useful. For example, the Randomness Guide uses postal districts (W1, WC2, SE1, etc.) in addition to named locales such as Hammersmith or Marylebone, because Londoners are used to navigating by postcode. The two Oxford guides have also adapted to their specific locality, and created a number of locales which are restricted to a single street; Oxford is so small that this makes a lot of sense.

Freedom of category choice also makes it easier to write custom search scripts; one of the more popular searches on the Randomness Guide is the pub search, which takes advantage of a number of categories that the guide's users have come up with—Real Ale, Real Cider, Food Served Lunchtimes, Food Served Evenings, Free Wireless, and so on.
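
A search of this kind boils down to finding the pages that belong to every one of the chosen categories. The sketch below shows one way that might be done; it is not the Randomness Guide's actual script. The pages_in_all_categories name is hypothetical, and the lookup is passed in as a code reference so that the example runs without a wiki behind it.

```perl
use strict;
use warnings;

# Return the names of pages that are in every one of the given categories.
# $lookup is a code reference that maps a category name to a list of page
# names; it is a stand-in for a real wiki query.
sub pages_in_all_categories {
    my ( $lookup, @categories ) = @_;
    return () unless @categories;

    my %count;
    foreach my $category ( @categories ) {
        my %seen;
        # Count each page at most once per category.
        $count{$_}++ for grep { !$seen{$_}++ } $lookup->( $category );
    }

    # Keep only the pages that turned up in all of the categories.
    my @matches = grep { $count{$_} == @categories } keys %count;
    return sort @matches;
}
```

In a real script the code reference would presumably wrap a call such as list_nodes_by_metadata with metadata_type set to "category".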

One serendipitous outcome of the category system stemmed from our use of the Google Maps API, which allows us to plot search results on a map. We created a category for each of the Transport for London Travelcard zones (I used the Wikimedia Commons Tube station data and a WWW::Mechanize script for the Tube stations, though the rail stations had to be done by hand), which gave us, with no additional work, some strangely fascinating maps of the extent of each zone: 1, 2, 3, 4, 5, and 6 (best viewed in tabs).

How to Get Involved

There are many ways to get involved in OpenGuides. First, take a look at openguides.org to see if there's a guide local to you. If there is, take a look and see if you can improve or add to any of the information on it! If there's no guide covering your area, you might even be interested in setting one up yourself—if this is the case, the best way to start is by joining the openguides-dev mailing list, and posting there about your interest. People on the list have years of experience of setting up and maintaining an Open Guide, and will be happy to guide you through the technical and social issues involved.

Finally, if you're interested in getting involved in the programming side of things, you can download the OpenGuides and Wiki::Toolkit releases from CPAN, or take a look at our Trac install, and browse or check out our Subversion repository for the latest code. Then, why not come along and meet us at one of our hackfests? The next one will be held in London on Saturday and Sunday July 21 and 22, and we'll also be travelling to Vienna in August to hold a short three-hour hackathon as part of YAPC::Europe.

You don't need to be an expert to work on OpenGuides, whether as a content contributor or a programmer; we have people of all levels of expertise involved in the project, and there's certainly no shortage of things to do! Our friendly community is full of helpful people, and it's growing all the time—most appropriate for a project that started off with a conversation between a couple of friends in a pub.

Kake Pugh is a freelance academic copy editor who writes Perl in her spare time. She likes test-first development, databases, and documentation.



Copyright © 2009 O'Reilly Media, Inc.