Sunday, September 18, 2011

Anagram Finder: A Do-It-Yourself Little Big Data Project

It's amazing what you can do with the tools available these days! A couple of weeks ago I stumbled upon a list of funny anagrams and thought it would be fun to generate my own. (Did you know there is actually a word for someone who creates anagrams - an "anagrammatist" - who knew?) What I needed was a tool that would let me explore possible anagrams by specifying a word I want in the phrase and a word I want in its anagram. For example, I wanted to be able to ask for "America" and "rich" and get back phrases containing "America" whose anagrams contain "rich".

I did find some anagram generators on the web, but all of the ones I found required specifying the entire phrase up front and then simply rearranged its letters; none allowed specifying just a subset of the phrase. Such a tool does take some of the work out of anagramming (who knew that is actually a word too?), but the main challenge still remains: coming up with the original phrase in the first place.

But then I thought: hey, I have all the tools I need to do something like this myself! So this became my little weekend big data project. It resulted in http://www.anagramfinder.net/. It could still use more work, of course, but after just a few minutes of playing with it I was able to come up with some non-trivial anagrams, like:

rich ornamentation = not in north America
restaurant food = store fat around
vampire hunter = never a triumph

Want to make your own anagram finder? It's real easy! Here's what you will need:

Google Books Ngram Dataset
Amazon Web Services
Hadoop
Lucene

Step 1 - Input Dataset: Google Books Ngrams

First of all, we need to ask the question: what constitutes a valid English phrase? Of course, there are many ways to answer this question. I'm going to go with the opportunistic approach and say that a valid English phrase is any sequence of words up to five words long which appeared in an English book somewhere... and that book got scanned by Google at some point. That's opportunistic, because the dataset of all such phrases - "ngrams" - just happens to be made publicly available by Google.

The only problem is that this dataset is some 800GB in size if you were to try to download it from the Google website. Not a good idea to try this at home. (I was actually careless enough to start attempting it, but luckily I have a husband with a good head on his shoulders... and one who has by now been trained to react to my mumbling things like "cool dataset" or "now if I can only get the shell script to download it in parallel..." - and he stopped me before I got us banned for life from our ISP.)

But that's where Amazon Web Services (AWS) comes to the rescue. Not only does it host the dataset in its cloud, it also provides a Hadoop platform - Elastic MapReduce (EMR) - which you can use to crunch it!

Step 2 - MapReduce

Next we need to process the ngrams so as to group together the ones which are anagrams of each other. We also need to collapse multiple occurrences of the same ngram, since the dataset has separate entries for each ngram-year pair. Finally, we would like to cleanse the dataset a bit and keep only the ngrams composed entirely of letters and spaces, since some ngrams contain symbols and we don't want to worry about those. We also don't need to retain groups which contain only a single ngram, since those naturally can't lead to any anagram pairs.

This is a textbook example of a job for MapReduce (implementation left as an exercise to the reader). My implementation was pretty straightforward and took about 8 hours to run on 20 small instances on Elastic MapReduce, so it cost me $16. (I could probably have saved some money by bidding on spot instances, but I didn't bother, since it was a fairly small job.)
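
If you'd like a head start on that exercise, here is a minimal sketch of the shape such a job might take. This isn't my production code: I'm assuming the dataset's documented tab-separated layout (ngram, year, match count, page count, volume count), and all class and variable names are made up.

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class AnagramGroups {

        // Mapper: key each ngram by its sorted letters (its "anagram signature"),
        // so that ngrams which are anagrams of each other meet in the same reduce group.
        public static class AnagramMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                String[] cols = value.toString().split("\t");
                String ngram = cols[0].toLowerCase();
                if (!ngram.matches("[a-z ]+")) return;   // keep letters and spaces only
                char[] letters = ngram.replace(" ", "").toCharArray();
                Arrays.sort(letters);                    // the anagram signature
                ctx.write(new Text(new String(letters)),
                          new Text(ngram + "\t" + cols[2]));  // ngram + match count
            }
        }

        // Reducer: collapse the per-year entries of each ngram into one total,
        // and drop groups that contain fewer than two distinct ngrams.
        public static class AnagramReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text sig, Iterable<Text> values, Context ctx)
                    throws IOException, InterruptedException {
                Map<String, Long> counts = new HashMap<String, Long>();
                for (Text v : values) {
                    String[] parts = v.toString().split("\t");
                    Long seen = counts.get(parts[0]);
                    counts.put(parts[0], (seen == null ? 0L : seen) + Long.parseLong(parts[1]));
                }
                if (counts.size() < 2) return;           // no anagram pairs possible
                ctx.write(sig, new Text(counts.toString()));
            }
        }
    }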

The nice thing about using AWS was that I could test my job locally before submitting it to the cluster, since EMR runs standard Hadoop 0.20. So I wrote my MapReduce job in Java and tested it on my local machine, using as input just one of the files downloaded from the Google dataset. Then I uploaded my jar file to Amazon Simple Storage Service (S3) and used the S3 paths of the public Ngrams dataset as input and my private S3 bucket as output. The only thing to note is that the public dataset is available in the form of sequence files, whereas the files you download are text files, so you have to switch your job from TextInputFormat to SequenceFileInputFormat when running it on AWS - but that was the only difference.
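
In the driver, that switch is literally one line. A sketch, reusing the made-up class names from above (input and output paths come from the command line):

    // Additional imports for the driver:
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // A main method to add to the AnagramGroups class above:
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "anagram-groups");
        job.setJarByClass(AnagramGroups.class);
        job.setMapperClass(AnagramGroups.AnagramMapper.class);
        job.setReducerClass(AnagramGroups.AnagramReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Locally, the downloaded files are plain text:
        job.setInputFormatClass(TextInputFormat.class);
        // On EMR, the public dataset is in sequence files instead:
        // job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }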

The output of the job turned out to be only 8GB, because the original dataset contains a lot more information than is needed - probably mainly the breakdown of ngram frequencies by year. That's something you can download to your local machine for further processing, to make things simpler.

Step 3 - Lucene Index

The next step is indexing. The MapReduce job produced groups of ngrams which are pairwise anagrams of each other. Each such group becomes one Lucene document with two fields: the Words field and the Ngrams field.

The Words field contains all of the words from all of the ngrams in the group. This field is indexed but not stored, so it lets us find all of the documents (ngram groups) which contain a given word. I additionally associated with each word a payload proportional to the frequency of the most frequent ngram in its group, and used that to boost document relevance during search. This causes frequent ngrams to bubble up to the top of the results. I'm not sure that's the best relevance function to use, but I mainly wanted to play with payloads; figuring out a better relevance score is definitely something to think about.

The Ngrams field is stored but not indexed, and contains all of the ngrams in the group. This means we can't search by it, but we can use it to retrieve all of the ngrams in a given document once we have found that document by searching the Words field.
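
Putting the two fields together, the indexing code looks roughly like this with the Lucene 3.x API that was current at the time (a sketch: the payload boosting is omitted, and the content strings are illustrative):

    import java.io.File;

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class AnagramIndexer {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("anagram-index")),
                new IndexWriterConfig(Version.LUCENE_33,
                    new WhitespaceAnalyzer(Version.LUCENE_33)));

            // Illustrative content for one group; in reality this comes
            // from looping over the MapReduce output.
            String words = "rich ornamentation not in north america";
            String ngrams = "rich ornamentation\nnot in north america";

            Document doc = new Document();
            // Words: indexed (searchable) but not stored.
            doc.add(new Field("words", words, Field.Store.NO, Field.Index.ANALYZED));
            // Ngrams: stored (retrievable) but not indexed, one ngram per line.
            doc.add(new Field("ngrams", ngrams, Field.Store.YES, Field.Index.NO));
            writer.addDocument(doc);

            writer.close();
        }
    }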

While playing with the dataset, I found that there are a lot of trivial subgroups of each ngram group which contain the same words, but in different order, for example:
this is it
is this it
is it this
Those lead to a lot of cognitive overhead when browsing through results, so I decided to keep only one ordering of words (the most frequent one) from each such subgroup. This also reduces the index size and makes search faster.
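
One simple way to do that pruning, as a sketch (here ngramCounts maps each ngram in a group to its total frequency, as produced by the MapReduce job; the names are made up):

    import java.util.Arrays;
    import java.util.Collection;
    import java.util.HashMap;
    import java.util.Map;

    // Keep only the most frequent ordering of each multiset of words.
    static Collection<String> pruneOrderings(Map<String, Long> ngramCounts) {
        Map<String, String> bestNgram = new HashMap<String, String>();
        Map<String, Long> bestCount = new HashMap<String, Long>();
        for (Map.Entry<String, Long> e : ngramCounts.entrySet()) {
            String[] words = e.getKey().split(" ");
            Arrays.sort(words);                          // order-insensitive key
            String key = Arrays.toString(words);
            Long best = bestCount.get(key);
            if (best == null || e.getValue() > best) {
                bestNgram.put(key, e.getKey());          // best ordering so far
                bestCount.put(key, e.getValue());
            }
        }
        return bestNgram.values();
    }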

The indexing process took 42 minutes on my machine and resulted in an index of 5.5GB (Lucene is super efficient in the way it stores things.) The index stores a total of 127,627,663 ngrams in 58,581,588 groups.

Step 4 - Lucene Search

Once the index is built, search is simple. First we find all the documents containing all of the words the user requested, by running a Boolean query against the Words field. Then, for each such document, we retrieve its ngrams from the stored Ngrams field. We scan those ngrams to identify the ones containing all of the words the user requested for the left phrase, and the ones containing all of the words requested for the right phrase, and return the Cartesian product of the two sets.
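
In code, the lookup is roughly this - again a 3.x-style sketch, with the "America"/"rich" example hard-coded (a real version would take the words as parameters and reorder the pairs by relevance):

    import java.io.File;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class AnagramSearch {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher(
                IndexReader.open(FSDirectory.open(new File("anagram-index"))));

            // Every requested word must appear somewhere in the group
            // (everything was lowercased back in the MapReduce step).
            BooleanQuery query = new BooleanQuery();
            query.add(new TermQuery(new Term("words", "america")), BooleanClause.Occur.MUST);
            query.add(new TermQuery(new Term("words", "rich")), BooleanClause.Occur.MUST);

            TopDocs hits = searcher.search(query, 100);
            for (ScoreDoc sd : hits.scoreDocs) {
                // The stored Ngrams field holds the group's ngrams, one per line.
                String[] ngrams = searcher.doc(sd.doc).get("ngrams").split("\n");
                List<String> left = new ArrayList<String>();
                List<String> right = new ArrayList<String>();
                for (String ngram : ngrams) {
                    List<String> words = Arrays.asList(ngram.split(" "));
                    if (words.contains("america")) left.add(ngram);
                    if (words.contains("rich")) right.add(ngram);
                }
                for (String l : left)                    // the Cartesian product
                    for (String r : right)
                        System.out.println(l + " = " + r);
            }
            searcher.close();
        }
    }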

Now, to be completely honest, we should evaluate the relevance (for whatever relevance measure we decide on) of each such pair separately and reorder the results accordingly, instead of just relying on the relevance of each document. But this, again, I left as "future work".

Final Result

As mentioned earlier, this is the anagram finder I ended up with: http://www.anagramfinder.net. (No guarantees that it's up, though, since I'm only paying for one AWS micro instance to host it.) Can you make a better one? Give it a try - it's fun :)

3 comments:

  1. It's pretty good. Nice weekend project.

  2. Nice work. I will meet you before my internship ends

  3. Just wondering if the method would work for the following problem.
    Problem! - What meaningful sentences (in English, using a dictionary) can be created using all letters from a given string such as
    ACBFHYJYTVCVFRERYTRUYTIHBNVVSDEQWRE.....DDSFDGUPUNGSASERFDFT (150 letters long)
    Any ideas?
