ONLamp.com
oreilly.comSafari Books Online.Conferences.

advertisement


Introduction to Haskell, Part 2: Pure Functions
Pages: 1, 2

Ordering words by frequency involves a little manipulation of the list of word groups. Each word group is a list of one or more instances of a single word. An easy way to order by frequency is to convert these lists into a pair of values, where the first element is the word count (frequency) and the second element is the word itself. A simple function can perform this action on a single list of words:



makePair l = (length l, head l)

Converting a list of word groups is as simple as using map to apply this function over the entire list:

-- convert [["a", "a"], ["b"], ...]
-- into [(2, "a"), (1, "b"), ...]
map makePair wordGroups

However, makePair is deeply tied to the problem we are trying to solve and isn't very generic. Instead of creating a name for this throwaway function, we can use a lambda expression to define this function in-place:

-- same as above
map (\l -> (length l, head l)) wordGroups

The \ character starts a lambda, or an anonymous function. Named parameters are found to the left of the arrow and the body of the function is found to the right of the arrow.

Generating the list of words and their frequencies is only part of the solution. Next, the words need to be sorted in descending order:

reverse (sort (map (\l -> (length l, head l)) wordGroups))

Unfortunately, the nesting parentheses complicate the structure of this expression. We can use the $ operator to clean up the expression like so:

reverse $ sort $ map (\l -> (length l, head l)) wordGroups

The $ operator is a little bit of syntactic sugar. It takes everything to the right and treats it as a single expression, then uses the right hand side as the last parameter to the expression on the left hand side. Therefore, these two forms are actually seen by the compiler as the very same expression.

The updated top10 function now looks like this:

top10 doc = ....
  where
    listOfWords = words (lowercase doc)
    wordGroups  = group (sort listOfWords)
    frequencies = reverse
                   $ sort
                   $ map (\l -> (length l, head l))
                         wordGroups
    ...

Next, the list of (frequency, word) pairs need to be reduced to a list of words. Pairs are such a useful data structure in Haskell programs that the Prelude includes two functions to get at their contents: fst returns the first element in a pair, and snd returns the second element in a pair. To get the words out of these pairs, we need to use snd. To convert the list of pairs into a list of words, we just need to use map:

top10 doc = ....
  where
    listOfWords = words (lowercase doc)
    wordGroups  = group (sort listOfWords)
    wordPairs   = reverse
                   $ sort
                   $ map (\l -> (length l, head l))
                         wordGroups
    wordsByFreq = map snd wordPairs

Finally, the goal of the top10 function is to find the ten most frequent words in a document. Since wordsByFreq is ordered from most frequent to least frequent, all we need to do is find the first ten elements in the list. This behavior is captured in the take function in the Prelude. take has two parameters: a desired size and a list of values; it returns a list no larger than the desired size.

With this one last piece of the puzzle, here is the complete definition of the top10 function:

import Data.Char
import Data.List

lowercase = map toLower

top10 doc = take 10 wordsByFreq
  where
    listOfWords = words (lowercase doc)
    wordGroups  = group (sort listOfWords)
    wordPairs   = reverse
                   $ sort
                   $ map (\l -> (length l, head l))
                         wordGroups
    wordsByFreq = map snd wordPairs

Loading the full text of RFC 822 into a value named rfc822, and running that string through top10 produces this result:

$ ghci top10.hs
   ___         ___ _
  / _ \ /\  /\/ __(_)
 / /_\// /_/ / /  | |      GHC Interactive, version 6.6, for Haskell 98.
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/
\____/\/ /_/\____/|_|      Type :? for help.

Loading package base ... linking ... done.
[1 of 1] Compiling Main             ( top10.hs, interpreted )
Ok, modules loaded: Main.

*Main> rfc822 <- readFile "rfc0822.txt"
*Main> top10 rfc822
["the","of","to","a","is","and","be",";","for","in"]
*Main>

Using this bottom-up approach to defining a function like top10, we could write a series of similar functions. For example, a version that strips leading and trailing punctuation characters on each word. Or a version that builds a unique list of words, for use inside a spell checker. Or a version that returns both words and frequencies. Or a version that ignores stopwords like the, of and to in the source text. Writing a series of these functions will likely identify common components, like a function that converts a list of words to a list of (frequency, word) pairs, or a function that converts a list of pairs into a list of words.

The list goes on, and I encourage you to play with this simple little function to explore Haskell further.

Conclusion

Many experienced programmers feel a kind of culture shock when first approaching Haskell. How can you program without mutable variables? Where are the objects? The class hierarchies? The design patterns? How do you debug a program without peppering print statements about? If pure functions can model simple mathematical concepts, can they also model real programming problems?

In this article, I've demonstrated how to develop real code using only pure functions. This approach is both powerful and different from what most programmers use today. Haskell encourages you to think in terms of small, powerful functions that do one thing and do it well. To help you in your efforts, the standard libraries provide a wealth of functions to study, reuse, and combine into interesting and novel functions that let you focus on the problems you are trying to solve, not on how to make your problem conform to an API or design pattern.

Functional programming in Haskell is a very rich, very deep topic that can't be fully explained in a single article. Visit the Haskell wiki to discover some of the more interesting topics, like partial application, higher order functions, function composition, and currying.

Articles in This Series

Adam Turoff is a software developer with more than a dozen years of experience in building web backend systems with open source tools. His current project uses a mixture of Perl, Ruby, Rails, Haskell, and other tools.


Return to ONLamp.com.



Sponsored by: