It's amazing what you can do with the available tools these days! A couple of weeks ago I stumbled upon a
list of funny anagrams and I thought that it would be fun to generate my own. (There is actually a name for "someone who creates anagrams" - who knew?)
a name for "someone who creates anagrams" - who knew?). What I needed was a tool which would allow me to explore possible anagrams by specifying a word I want in the phrase and a word I want in the anagram of the phrase. For example, I wanted to be able to say that I want an anagram for "America" and "rich" and find phrases containing "America" and their corresponding anagrams containing "rich".
I did find some anagram generators on the web, but all of the ones I found required that you specify the entire phrase up front and then just rearranged its letters; none of them let you specify just a subset of the phrase. Such a tool does take some of the work out of anagramming (who knew this is actually a word too), but the main challenge still remains: coming up with the original phrase in the first place.
But then I thought, hey I have all the tools I need to do something like this myself! So this became my little weekend big data project. It resulted in
http://www.anagramfinder.net/. Could still use more work, of course, but after just a few minutes of playing with it I was able to come up with some non-trivial anagrams, like:
rich ornamentation | not in north America
restaurant food | store fat around
vampire hunter | never a triumph
Want to make your own anagram finder? It's real easy! Here's what you will need:
- Google Books Ngram Dataset
- Amazon Web Services
- Hadoop
- Lucene
Step 1 - Input Dataset: Google Books Ngrams
First of all, we need to ask the question: what constitutes a valid English phrase? Of course, there are many ways to answer this question. I'm going to go with the opportunistic approach and say that a valid English phrase is any sequence of words up to five words long which appeared in an English book somewhere... and that book got scanned by Google at some point. That's opportunistic, because the dataset of all such phrases - "ngrams" - just happens to be made publicly available by Google.
The only problem is that this dataset is some 800GB in size if you were to try to download it from the Google website. Not a good idea to try this at home. (I was actually careless enough to start attempting it, but luckily I have a husband with a good head on his shoulders... and one who has by now been trained to react to my mumbling things like "cool dataset" or "now if I can only get the shell script to download it in parallel..." - and he stopped me before I got us banned for life from our ISP.)
But that's where Amazon Web Services (AWS) comes to the rescue. Not only does it host the dataset in its cloud, but it also provides a Hadoop platform - Elastic MapReduce (EMR) - which you can use to crunch it!
Step 2 - MapReduce
Next we need to process the ngrams so as to group together ngrams which are anagrams of each other. We also need to collapse multiple occurrences of the same ngram, since the dataset has separate entries for each ngram-year pair. Finally, we want to cleanse the dataset a bit and keep only the ngrams composed of letters and spaces, since some ngrams contain symbols and we don't want to worry about those. We also don't need to retain groups which contain only one ngram, since those naturally can't lead to any anagram pairs.
This is a textbook example of a job for MapReduce (implementation left as an exercise to the reader). My implementation was pretty straightforward and took about 8 hours to run on 20 small instances on Elastic MapReduce, so it cost me $16. (I could probably have saved some money by bidding on spot instances, but I didn't bother, since it was a fairly small job.)
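If you'd rather not do the whole exercise from scratch, here is a minimal sketch of what such a mapper and reducer might look like. The class names are mine, I'm assuming the record layout ngram TAB year TAB match_count TAB ..., and a real implementation needs more careful parsing:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AnagramGrouping {

    public static class AnagramMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed record layout: ngram <TAB> year <TAB> match_count <TAB> ...
            String[] fields = value.toString().split("\t");
            if (fields.length < 3) return;
            String ngram = fields[0].toLowerCase();
            // Keep only ngrams made up of letters and spaces.
            if (!ngram.matches("[a-z ]+")) return;
            // Key = the ngram's letters, sorted, with spaces dropped, so that
            // ngrams which are anagrams of each other land in the same group.
            char[] letters = ngram.replace(" ", "").toCharArray();
            Arrays.sort(letters);
            context.write(new Text(new String(letters)),
                          new Text(ngram + "\t" + fields[2]));
        }
    }

    public static class AnagramReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Collapse the per-year entries into one total count per ngram.
            Map<String, Long> counts = new HashMap<>();
            for (Text v : values) {
                String[] parts = v.toString().split("\t");
                counts.merge(parts[0], Long.parseLong(parts[1]), Long::sum);
            }
            // A group with a single ngram can never produce an anagram pair.
            if (counts.size() < 2) return;
            StringBuilder group = new StringBuilder();
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                if (group.length() > 0) group.append('|');
                group.append(e.getKey()).append(':').append(e.getValue());
            }
            context.write(key, new Text(group.toString()));
        }
    }
}
```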
The nice thing about using AWS was that I was able to test my job locally before submitting it to the cluster, since EMR runs standard Hadoop 0.20. So I wrote my MapReduce job in Java and tested it on my local machine, using just one of the files downloaded from the Google dataset as input. Then I uploaded my jar to Amazon Simple Storage Service (S3) and used the S3 paths of the public Ngrams dataset as input and my private S3 bucket as output. The only difference is that the public dataset is available in the form of sequence files, whereas the files you download are text files, so you have to switch your job from TextInputFormat to SequenceFileInputFormat when running it on AWS.
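A driver for the sketch above might look like this (my sketch, written against the newer mapreduce API rather than the 0.20-era one, which spells some of this differently):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AnagramJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "anagram-groups");
        job.setJarByClass(AnagramJob.class);
        job.setMapperClass(AnagramGrouping.AnagramMapper.class);
        job.setReducerClass(AnagramGrouping.AnagramReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Locally, the default TextInputFormat handles a downloaded text file;
        // against the public dataset on AWS, switch to sequence files:
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // dataset path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // your own bucket/dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```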
The output of the job turned out to be only 8GB, because the original dataset contains a lot more information than is needed - probably mainly the breakdown of ngram frequencies by year. That's something you can download to your local machine for further processing, to make things simpler.
Step 3 - Lucene Index
The next step is indexing. The MapReduce job produced groups of ngrams which are all pairwise anagrams of each other. Each such group becomes a Lucene document with two fields: the Words field and the Ngrams field.
The Words field contains all of the words in all of the ngrams in the ngram group. This field is indexed, but not stored, so it allows us to find all of the documents (ngram groups) which contain a given word. I additionally associated a payload with each word, proportional to the frequency of the most frequent ngram in that word's group, and used that to boost the relevance of documents in the search. This causes frequent ngrams to bubble up to the top of the results. I'm not sure that's the best relevance function to use - I mainly just wanted to play with payloads - so figuring out a better relevance score is definitely something to think about.
The Ngrams field is stored but not indexed, and contains all of the ngrams in the group. This means we can't search by it, but we can use it to retrieve all of the ngrams in a given document once we have found that document by searching for words.
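To make the two fields concrete, here is a minimal sketch of how one such document could be built (a recent Lucene API, illustrative field names, and with the payload-based boosting left out; the exact field classes vary a bit between Lucene versions):

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class NgramGroupIndexer {

    /** Adds one document for a group of ngrams that are all anagrams of each other. */
    public static void addGroup(IndexWriter writer, Collection<String> ngrams) throws IOException {
        Document doc = new Document();
        // Words field: indexed, not stored - lets us find groups containing a given word.
        Set<String> words = new LinkedHashSet<>();
        for (String ngram : ngrams) {
            words.addAll(Arrays.asList(ngram.split(" ")));
        }
        for (String word : words) {
            doc.add(new TextField("words", word, Field.Store.NO));
        }
        // Ngrams field: stored, not indexed - lets us read the whole group back at search time.
        doc.add(new StoredField("ngrams", String.join("\n", ngrams)));
        writer.addDocument(doc);
    }
}
```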
While playing with the dataset, I found that there are a lot of trivial subgroups of each ngram group which contain the same words, but in different order, for example:
this is it
is this it
is it this
Those lead to a lot of cognitive overhead when browsing through results, so I decided to keep only one (the most frequent) ordering of words from each such group. This also reduces the index size and makes search faster.
The indexing process took 42 minutes on my machine and resulted in an index of 5.5GB (Lucene is super efficient in the way it stores things). The index holds a total of 127,627,663 ngrams in 58,581,588 groups.
Step 4 - Lucene Search
Once the index is built, search is simple. First we identify all the documents which contain all of the words the user requested, by running a Boolean query on the Words field. Then, for each such document, we retrieve its ngrams from the stored Ngrams field. We scan those ngrams to identify the ones containing all of the words requested for the left phrase and the ones containing all of the words requested for the right phrase, and return the Cartesian product of the two sets.
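A bare-bones version of that lookup might look like the sketch below - again a recent Lucene API, reusing the illustrative field names from the indexing sketch, with a tiny containsAll() helper of my own and all relevance and error handling left out:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;

public class AnagramSearcher {

    public static List<String[]> search(IndexSearcher searcher,
                                        List<String> leftWords,
                                        List<String> rightWords,
                                        int maxGroups) throws IOException {
        // Every requested word must appear somewhere in the ngram group.
        BooleanQuery.Builder query = new BooleanQuery.Builder();
        for (String w : leftWords) {
            query.add(new TermQuery(new Term("words", w)), BooleanClause.Occur.MUST);
        }
        for (String w : rightWords) {
            query.add(new TermQuery(new Term("words", w)), BooleanClause.Occur.MUST);
        }
        List<String[]> pairs = new ArrayList<>();
        for (ScoreDoc hit : searcher.search(query.build(), maxGroups).scoreDocs) {
            // The stored Ngrams field holds the whole group, one ngram per line.
            String[] ngrams = searcher.doc(hit.doc).get("ngrams").split("\n");
            for (String left : ngrams) {
                if (!containsAll(left, leftWords)) continue;
                for (String right : ngrams) {
                    if (left.equals(right) || !containsAll(right, rightWords)) continue;
                    pairs.add(new String[] { left, right });
                }
            }
        }
        return pairs;
    }

    private static boolean containsAll(String ngram, List<String> words) {
        return Arrays.asList(ngram.split(" ")).containsAll(words);
    }
}
```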
Now, to be completely honest, we should evaluate the relevance (for whatever relevance measure we decide on) of each such pair separately and reorder the results accordingly, instead of just relying on the relevance of each document. But this, again, I left as "future work".
Final Result
As mentioned earlier, this is the anagram finder I ended up with: http://www.anagramfinder.net. (No guarantees that it's up, though, since I'm only paying for one AWS micro instance to host it.) Can you make a better one? Give it a try - it's fun :)
We've all heard this story. All was fine until one day your boss heard somewhere that Hadoop and NoSQL are the new black and mandated that the whole company switch whatever it was doing over to the Hadoop et al. technology stack, because that's the only way to get your solution to scale to web proportions while maintaining reliability and efficiency.
So you threw away your old relational database back end and maybe all or part of your middle-tier code, bought a couple of books, and after a few days of swearing got your first MapReduce jobs running. But by the time you finished re-implementing your entire solution, you found that not only is the system way less efficient than the old one, it's not even scalable or reliable, and your meetings are starting to resemble the Hadoop Downfall parody more and more.
So what went wrong?
The problem with Hadoop is that it is relatively easy to get started with it without an in-depth understanding of what gives it its power - and without that understanding, you are more likely than not to design your solution in a way which takes that power away. So let's take a look at a few key features of Hadoop and what not to do if you want to keep them.
Feature: Efficiency
There are a couple of things Hadoop does to ensure efficiency:
- It brings the computation to the data. So instead of sending large amounts of data over the network to the machines which execute the computations, it tries to run the computations directly on the nodes which contain the required inputs.
- It processes files sequentially, reducing the number of costly disk seeks. Unless the Hadoop cluster is running on SSDs, a seek to a new place on the disk takes on the order of 10 ms, while in the same amount of time the disk can read on the order of 10 megabits of data sequentially. So you can't process large amounts of data if the process involves frequent seeks.
- It uses compression, memory buffers and other optimizations to make the data flow in the system very efficient.
Do not: Physically separate your data cluster from your compute cluster. Whether your data is coming from HDFS, HBase, Cassandra or somewhere else, if it is not stored on the same machines the MapReduce jobs run on, it is impossible for Hadoop to bring the computation to the data. There are cases when such a setup is acceptable - in particular when the jobs are more compute-intensive than data-intensive (they do a lot of computing and not a whole lot of reading and writing of data) - and Hadoop can still be a good option for such jobs, but its full potential is with jobs which process large datasets with relatively simple computations.
Do not: Create location-unaware input formats. Often the input formats which come with the data storage you are using do not exactly fit your system and you have to write your own. Make sure that as you do so, you preserve location-awareness by implementing a non-empty InputSplit.getLocations() or by inheriting this functionality from a superclass if you are extending an existing input format.
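For illustration, a custom split that reports its locations could look roughly like this (a minimal sketch with made-up class and field names; in practice most file-based formats simply reuse FileSplit, which already does this):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

public class HostAwareSplit extends InputSplit implements Writable {
    private String[] hosts = new String[0];
    private long length;

    public HostAwareSplit() {}                        // required by the framework
    public HostAwareSplit(long length, String[] hosts) {
        this.length = length;
        this.hosts = hosts;
    }

    @Override
    public long getLength() { return length; }

    // Returning the hosts that hold the data lets the scheduler run the task
    // on (or near) those nodes instead of shipping the data across the network.
    @Override
    public String[] getLocations() { return hosts; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(length);
        // Locations are only used on the scheduler side, so they are not
        // serialized here (FileSplit does the same).
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        length = in.readLong();
    }
}
```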
Do not: Read or write external data in the map or reduce tasks (apart from reading the input and writing output as orchestrated by the framework). For example, you may be tempted to make database read or write calls directly in the map or reduce code and there's nothing to prevent you from doing it. But especially if you are doing a lot of small random lookups, you are losing all of the optimization the framework provides for the efficient streaming of data through the system. Also, if you are accessing a limited availability resource, such as a SQL database, you may be introducing a bottleneck which prevents the solution from scaling (see Scalability).
Do not: Write code to copy files around manually. Once again, the Hadoop framework does a lot to optimize the flow of data during a compute job, but copying files around manually will always incur heavy network traffic (as data is replicated) and cannot be optimized since the system does not know what it is you are trying to do. Apart from the input/output mechanism, Hadoop also offers a distributed cache which can be used to bring data to your tasks. Those provided mechanisms are usually the most efficient way of bringing data and computation together.
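For example, shipping a small lookup file to every task via the distributed cache takes a single call in the driver. A sketch, with a made-up path and file name:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheFileExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cache-example");
        // The "#cities" fragment makes the cached file show up as ./cities
        // in each task's working directory.
        job.addCacheFile(new URI("hdfs:///data/lookup/cities.txt#cities"));
        // ... mapper, reducer, input and output set up as usual ...
        // Inside a task the file can then be read locally, e.g.
        // new BufferedReader(new FileReader("cities")).
    }
}
```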
Feature: Scalability
Hadoop is highly scalable in the sense that you can grow your cluster to many thousands of nodes and the computing throughput increases linearly with it. That's because Hadoop doesn't have bottlenecks. But that doesn't mean that a system implemented on top of Hadoop is automatically free of bottlenecks.
Do not: Synchronize tasks. Since tasks are executing in parallel, it may be tempting to add synchronization between them and with tools such as ZooKeeper it is fairly easy to add all sorts of distributed synchronization mechanisms such as locks or queues. But every synchronization is a bottleneck, so workflows using those almost never scale.
Do not: Use a constant number of mapper or reducer tasks. The number of map tasks a job is split into is proportional to the size of the input, so the number of mappers scales naturally as the input grows (as long as you keep this in mind when implementing a custom input format). The number of reducers, however, is defined by the programmer. Since the number of reducers also defines the partitioning of the output, it may be tempting to keep it constant, or even to always use a single reducer so that the output is just one file. But if you do this, you are preventing the system from scaling out as the amount of data and the cluster grow.
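One simple way around it is to derive the reducer count from the input size, something along these lines (a sketch; the one-gigabyte-per-reducer target is an arbitrary illustration):

```java
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class ReducerSizing {
    // Pick the reducer count from the input size rather than hard-coding it.
    public static void setReducers(Job job, Path inputDir) throws Exception {
        FileSystem fs = inputDir.getFileSystem(job.getConfiguration());
        long inputBytes = 0;
        for (FileStatus f : fs.listStatus(inputDir)) {
            inputBytes += f.getLen();
        }
        // Aim for roughly one reducer per gigabyte of input (illustrative target).
        job.setNumReduceTasks(Math.max(1, (int) (inputBytes / (1L << 30))));
    }
}
```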
Do not: Talk to the job tracker. It may be tempting to have tasks talk directly to the job tracker which scheduled the job to find out extra information about the job or the other tasks. For example, you may be tempted to have the reducer ask the job tracker for the total number of input records, in order to turn the classic word count example into an IDF calculation. But the job tracker is a single resource, so such practices can prevent your job from scaling and can also give misleading results, since the job tracker may not be serving up the latest statistics while the job is still running.
Feature: Reliability
The Hadoop MapReduce framework handles failures gracefully. If a node in the cluster fails mid-task, it will simply re-run the task on a different node. It also hedges against having to wait for slow nodes by running duplicate copies of the same task on other machines (what is referred to as "speculative execution"). But if you are not aware of these mechanisms, your system may end up not functioning as expected.
Do not: Write an output format without considering output committer functionality. With FileOutputFormat and its derivatives, only the output of successful tasks makes it to the output directory of the job and the rest is discarded; this logic is handled by the FileOutputCommitter. Some custom formats may not need committers - for example if they write to a key-value store which automatically keeps only one value per key - but in other cases, failed and subsequently restarted jobs, or tasks scheduled speculatively, can result in output data inconsistencies such as duplicate records.
Do not: Write tasks with side effects. If a task has side effects - for example, if it writes directly to an external data store - then retried tasks, or duplicate tasks scheduled through speculative execution, may end up producing results you did not intend.
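If a job really must have side effects, one partial mitigation is to turn off speculative execution for it, so at least the same task is never run twice at the same time (this does nothing for retried failures, though). A sketch, using the Hadoop 2.x property names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SideEffectJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Disable speculative execution so no duplicate map or reduce attempts
        // run concurrently. Older releases spell these properties
        // mapred.map.tasks.speculative.execution / mapred.reduce.tasks.speculative.execution.
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);
        Job job = Job.getInstance(conf, "side-effect-job");
        // ... the rest of the job set-up ...
    }
}
```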
We live in a unique time when a number of technologies and trends have finally come together. I'm talking about the big data overload, cloud computing, machine learning and corporation-backed open source initiatives. I predict that all this will soon result in a revolution... a shift in the way we live, perhaps even as profound as the one brought on by the birth of the Internet. I'm thrilled to watch that unfold, and I have a chance of doing so by working at a company which is part of this trend (the second one in a row by now). I will use this blog to share some of the things I have learned so far. I suspect most of it will be technical and have to do with emerging technologies such as Hadoop, Lucene, or NoSQL. Maybe some of it will be more philosophical... I don't know. Anyway, thanks for visiting and I hope you enjoy what you find here!