It's amazing what you can do with the available tools these days! A couple of weeks ago I stumbled upon a
list of funny anagrams and I thought that it would be fun to generate my own. (There is actually a name for "someone who creates anagrams" - who knew?)
a name for "someone who creates anagrams" - who knew?). What I needed was a tool which would allow me to explore possible anagrams by specifying a word I want in the phrase and a word I want in the anagram of the phrase. For example, I wanted to be able to say that I want an anagram for "America" and "rich" and find phrases containing "America" and their corresponding anagrams containing "rich".
I did find some anagram generators on the web, but all of the ones I found required that you specify the entire phrase up front and then just rearranged its letters; none of them let you specify just a subset of the phrase. Such a tool does take some of the work out of anagramming (who knew this is actually a word too), but the main challenge still remains: coming up with the original phrase in the first place.
But then I thought, hey I have all the tools I need to do something like this myself! So this became my little weekend big data project. It resulted in
http://www.anagramfinder.net/. Could still use more work, of course, but after just a few minutes of playing with it I was able to come up with some non-trivial anagrams, like:
rich ornamentation | not in north America
restaurant food | store fat around
vampire hunter | never a triumph
Want to make your own anagram finder? It's real easy! Here's what you will need:
- Google Books Ngram Dataset
- Amazon Web Services
- Hadoop
- Lucene
Step 1 - Input Dataset: Google Books Ngrams
First of all, we need to ask the question: what constitutes a valid English phrase? Of course, there are many ways to answer this question. I'm going to go with the opportunistic approach and say that a valid English phrase is any sequence of words up to five words long which appeared in an English book somewhere... and that book got scanned by Google at some point. That's opportunistic, because the dataset of all such phrases - "ngrams" - just happens to be made publicly available by Google.
The only problem is that this dataset is some 800GB in size if you were to try to download it from the Google website. Not a good idea to try this at home. (I was actually careless enough to start attempting it, but luckily I have a husband with a good head on his shoulders... and one who has by now been trained to react to my mumbling things like "cool dataset" or "now if I can only get the shell script to download it in parallel..." - and he stopped me before I got us banned for life from our ISP.)
But that's where Amazon Web Services (AWS) comes to the rescue. Not only does it host the dataset in its cloud, but it also provides a Hadoop platform - Elastic MapReduce (EMR) - which you can use to crunch it!
Step 2 - MapReduce
Next we need to process the ngrams so as to group together ngrams which are anagrams of each other. We also need to collapse multiple occurrences of the same ngram, since the dataset has separate entries for each ngram-year pair. Finally, we want to cleanse the dataset a bit and keep only the ngrams composed of letters and spaces, since some ngrams contain symbols and we don't want to worry about those. We also don't need to retain groups which contain only one ngram, since those naturally can't lead to any anagram pairs.
This is a textbook example of a job for MapReduce (implementation left as an exercise to the reader). My implementation was pretty straightforward and took about 8 hours to run on 20 small instances on Elastic MapReduce, so it cost me $16. (I could probably have saved some money by bidding on spot instances, but I didn't bother, since it was a fairly small job.)
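If you'd rather not do the whole exercise from scratch, here is a minimal sketch of what such a mapper and reducer might look like. The class names are mine, I'm assuming the record layout ngram TAB year TAB match_count TAB ..., and a real implementation needs more careful parsing:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AnagramGrouping {

    public static class AnagramMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed record layout: ngram <TAB> year <TAB> match_count <TAB> ...
            String[] fields = value.toString().split("\t");
            if (fields.length < 3) return;
            String ngram = fields[0].toLowerCase();
            // Keep only ngrams made up of letters and spaces.
            if (!ngram.matches("[a-z ]+")) return;
            // Key = the ngram's letters, sorted, with spaces dropped, so that
            // ngrams which are anagrams of each other land in the same group.
            char[] letters = ngram.replace(" ", "").toCharArray();
            Arrays.sort(letters);
            context.write(new Text(new String(letters)),
                          new Text(ngram + "\t" + fields[2]));
        }
    }

    public static class AnagramReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Collapse the per-year entries into one total count per ngram.
            Map<String, Long> counts = new HashMap<>();
            for (Text v : values) {
                String[] parts = v.toString().split("\t");
                counts.merge(parts[0], Long.parseLong(parts[1]), Long::sum);
            }
            // A group with a single ngram can never produce an anagram pair.
            if (counts.size() < 2) return;
            StringBuilder group = new StringBuilder();
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                if (group.length() > 0) group.append('|');
                group.append(e.getKey()).append(':').append(e.getValue());
            }
            context.write(key, new Text(group.toString()));
        }
    }
}
```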
The nice thing about using AWS was that I was able to test my job locally before submitting it to the cluster, since EMR runs standard Hadoop 0.20. So I wrote my MapReduce job in Java and tested it on my local machine, using just one of the files downloaded from the Google dataset as input. Then I uploaded my jar to Amazon Simple Storage Service (S3) and used the S3 paths of the public Ngrams dataset as input and my private S3 bucket as output. The only difference is that the public dataset is available in the form of sequence files, whereas the files you download are text files, so you have to switch your job from TextInputFormat to SequenceFileInputFormat when running it on AWS.
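A driver for the sketch above might look like this (my sketch, written against the newer mapreduce API rather than the 0.20-era one, which spells some of this differently):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AnagramJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "anagram-groups");
        job.setJarByClass(AnagramJob.class);
        job.setMapperClass(AnagramGrouping.AnagramMapper.class);
        job.setReducerClass(AnagramGrouping.AnagramReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Locally, the default TextInputFormat handles a downloaded text file;
        // against the public dataset on AWS, switch to sequence files:
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // dataset path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // your own bucket/dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```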
The output of the job turned out to be only 8GB, because the original dataset contains a lot more information than is needed - probably mainly the breakdown of ngram frequencies by year. That's something you can download to your local machine for further processing, to make things simpler.
Step 3 - Lucene Index
The next step is indexing. The MapReduce job produced groups of ngrams which are all pairwise anagrams of each other. Each such group becomes a Lucene document with two fields: the Words field and the Ngrams field.
The Words field contains all of the words in all of the ngrams in the ngram group. This field is indexed, but not stored, so it allows us to find all of the documents (ngram groups) which contain a given word. I additionally associated a payload with each word, proportional to the frequency of the most frequent ngram in that word's group, and used that to boost the relevance of documents in the search. This causes frequent ngrams to bubble up to the top of the results. I'm not sure that's the best relevance function to use - I mainly just wanted to play with payloads - so figuring out a better relevance score is definitely something to think about.
The Ngrams field is stored but not indexed, and contains all of the ngrams in the group. This means we can't search by it, but we can use it to retrieve all of the ngrams in a given document once we have found that document by searching for words.
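To make the two fields concrete, here is a minimal sketch of how one such document could be built (a recent Lucene API, illustrative field names, and with the payload-based boosting left out; the exact field classes vary a bit between Lucene versions):

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class NgramGroupIndexer {

    /** Adds one document for a group of ngrams that are all anagrams of each other. */
    public static void addGroup(IndexWriter writer, Collection<String> ngrams) throws IOException {
        Document doc = new Document();
        // Words field: indexed, not stored - lets us find groups containing a given word.
        Set<String> words = new LinkedHashSet<>();
        for (String ngram : ngrams) {
            words.addAll(Arrays.asList(ngram.split(" ")));
        }
        for (String word : words) {
            doc.add(new TextField("words", word, Field.Store.NO));
        }
        // Ngrams field: stored, not indexed - lets us read the whole group back at search time.
        doc.add(new StoredField("ngrams", String.join("\n", ngrams)));
        writer.addDocument(doc);
    }
}
```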
While playing with the dataset, I found that there are a lot of trivial subgroups of each ngram group which contain the same words, but in different order, for example:
this is it
is this it
is it this
Those lead to a lot of cognitive overhead when browsing through results, so I decided to keep only one (the most frequent) ordering of words from each such group. This also reduces the index size and makes search faster.
The indexing process took 42 minutes on my machine and resulted in an index of 5.5GB (Lucene is super efficient in the way it stores things). The index holds a total of 127,627,663 ngrams in 58,581,588 groups.
Step 4 - Lucene Search
Once the index is built, search is simple. First we identify all the documents which contain all of the words the user requested, by running a Boolean query on the Words field. Then, for each such document, we retrieve its ngrams from the stored Ngrams field. We scan those ngrams to identify the ones containing all of the words requested for the left phrase and the ones containing all of the words requested for the right phrase, and return the Cartesian product of the two sets.
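A bare-bones version of that lookup might look like the sketch below - again a recent Lucene API, reusing the illustrative field names from the indexing sketch, with a tiny containsAll() helper of my own and all relevance and error handling left out:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;

public class AnagramSearcher {

    public static List<String[]> search(IndexSearcher searcher,
                                        List<String> leftWords,
                                        List<String> rightWords,
                                        int maxGroups) throws IOException {
        // Every requested word must appear somewhere in the ngram group.
        BooleanQuery.Builder query = new BooleanQuery.Builder();
        for (String w : leftWords) {
            query.add(new TermQuery(new Term("words", w)), BooleanClause.Occur.MUST);
        }
        for (String w : rightWords) {
            query.add(new TermQuery(new Term("words", w)), BooleanClause.Occur.MUST);
        }
        List<String[]> pairs = new ArrayList<>();
        for (ScoreDoc hit : searcher.search(query.build(), maxGroups).scoreDocs) {
            // The stored Ngrams field holds the whole group, one ngram per line.
            String[] ngrams = searcher.doc(hit.doc).get("ngrams").split("\n");
            for (String left : ngrams) {
                if (!containsAll(left, leftWords)) continue;
                for (String right : ngrams) {
                    if (left.equals(right) || !containsAll(right, rightWords)) continue;
                    pairs.add(new String[] { left, right });
                }
            }
        }
        return pairs;
    }

    private static boolean containsAll(String ngram, List<String> words) {
        return Arrays.asList(ngram.split(" ")).containsAll(words);
    }
}
```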
Now, to be completely honest, we should evaluate the relevance (for whatever relevance measure we decide on) of each such pair separately and reorder the results accordingly, instead of just relying on the relevance of each document. But this, again, I left as "future work".
Final Result
As mentioned earlier, this is the anagram finder I ended up with: http://www.anagramfinder.net. (No guarantees that it's up, though, since I'm only paying for one AWS micro instance to host it.) Can you make a better one? Give it a try - it's fun :)
We've all heard this story. All was fine until one day your boss heard somewhere that Hadoop and NoSQL are the new black and mandated that the whole company switch whatever it was doing over to the Hadoop et al. technology stack, because that's the only way to get your solution to scale to web proportions while maintaining reliability and efficiency.
So you threw away your old relational database back end and maybe all or part of your middle-tier code, bought a couple of books, and after a few days of swearing got your first MapReduce jobs running. But by the time you finished re-implementing your entire solution, you found that not only is the system way less efficient than the old one, it's not even scalable or reliable, and your meetings are starting to resemble the Hadoop Downfall parody more and more.
So what went wrong?
The problem with Hadoop is that it is relatively easy to get started with it without an in-depth understanding of what gives it its power - and without that understanding, you are more likely than not to design your solution in a way which takes that power away. So let's take a look at a few key features of Hadoop and what not to do if you want to keep them.
Feature: Efficiency
There are a couple of things Hadoop does to ensure efficiency:
- It brings the computation to the data. So instead of sending large amounts of data over the network to the machines which execute the computations, it tries to run the computations directly on the nodes which contain the required inputs.
- It processes files sequentially, reducing the number of costly disk seeks. Unless the Hadoop cluster is running on SSDs, a seek to a new place on the disk takes on the order of 10 ms, while in the same amount of time the disk can read on the order of 10 megabits of data sequentially. So you can't process large amounts of data if the process involves frequent seeks.
- It uses compression, memory buffers and other optimizations to make the data flow in the system very efficient.
Do not: Physically separate your data cluster from your compute cluster. Whether your data is coming from HDFS, HBase, Cassandra or somewhere else, if it is not stored on the same machines the MapReduce jobs run on, it is impossible for Hadoop to bring the computation to the data. There are cases when such a setup is acceptable - in particular when the jobs are more compute-intensive than data-intensive (they do a lot of computing and not a whole lot of reading and writing of data) - and Hadoop can still be a good option for such jobs, but its full potential is with jobs which process large datasets with relatively simple computations.
Do not: Create location-unaware input formats. Often the input formats which come with the data storage you are using do not exactly fit your system and you have to write your own. Make sure that as you do so, you preserve location-awareness by implementing a non-empty InputSplit.getLocations() or by inheriting this functionality from a superclass if you are extending an existing input format.
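For illustration, a custom split that reports its locations could look roughly like this (a minimal sketch with made-up class and field names; in practice most file-based formats simply reuse FileSplit, which already does this):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

public class HostAwareSplit extends InputSplit implements Writable {
    private String[] hosts = new String[0];
    private long length;

    public HostAwareSplit() {}                        // required by the framework
    public HostAwareSplit(long length, String[] hosts) {
        this.length = length;
        this.hosts = hosts;
    }

    @Override
    public long getLength() { return length; }

    // Returning the hosts that hold the data lets the scheduler run the task
    // on (or near) those nodes instead of shipping the data across the network.
    @Override
    public String[] getLocations() { return hosts; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(length);
        // Locations are only used on the scheduler side, so they are not
        // serialized here (FileSplit does the same).
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        length = in.readLong();
    }
}
```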
Do not: Read or write external data in the map or reduce tasks (apart from reading the input and writing output as orchestrated by the framework). For example, you may be tempted to make database read or write calls directly in the map or reduce code and there's nothing to prevent you from doing it. But especially if you are doing a lot of small random lookups, you are losing all of the optimization the framework provides for the efficient streaming of data through the system. Also, if you are accessing a limited availability resource, such as a SQL database, you may be introducing a bottleneck which prevents the solution from scaling (see Scalability).
Do not: Write code to copy files around manually. Once again, the Hadoop framework does a lot to optimize the flow of data during a compute job, but copying files around manually will always incur heavy network traffic (as data is replicated) and cannot be optimized since the system does not know what it is you are trying to do. Apart from the input/output mechanism, Hadoop also offers a distributed cache which can be used to bring data to your tasks. Those provided mechanisms are usually the most efficient way of bringing data and computation together.
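For example, shipping a small lookup file to every task via the distributed cache takes a single call in the driver. A sketch, with a made-up path and file name:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheFileExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cache-example");
        // The "#cities" fragment makes the cached file show up as ./cities
        // in each task's working directory.
        job.addCacheFile(new URI("hdfs:///data/lookup/cities.txt#cities"));
        // ... mapper, reducer, input and output set up as usual ...
        // Inside a task the file can then be read locally, e.g.
        // new BufferedReader(new FileReader("cities")).
    }
}
```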
Feature: Scalability
Hadoop is highly scalable in the sense that you can grow your cluster to many thousands of nodes and the computing throughput increases linearly with it. That's because Hadoop doesn't have bottlenecks. But that doesn't mean that a system implemented on top of Hadoop is automatically free of bottlenecks.
Do not: Synchronize tasks. Since tasks are executing in parallel, it may be tempting to add synchronization between them and with tools such as ZooKeeper it is fairly easy to add all sorts of distributed synchronization mechanisms such as locks or queues. But every synchronization is a bottleneck, so workflows using those almost never scale.
Do not: Use a constant number of mapper or reducer tasks. The number of map tasks a job is split into is proportional to the size of the input, so the number of mappers scales naturally as the input grows (as long as you keep this in mind when implementing a custom input format). The number of reducers, however, is defined by the programmer. Since the number of reducers also defines the partitioning of the output, it may be tempting to keep it constant, or even to always use a single reducer so that the output is just one file. But if you do this, you are preventing the system from scaling out as the amount of data and the cluster grow.
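One simple way around it is to derive the reducer count from the input size, something along these lines (a sketch; the one-gigabyte-per-reducer target is an arbitrary illustration):

```java
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class ReducerSizing {
    // Pick the reducer count from the input size rather than hard-coding it.
    public static void setReducers(Job job, Path inputDir) throws Exception {
        FileSystem fs = inputDir.getFileSystem(job.getConfiguration());
        long inputBytes = 0;
        for (FileStatus f : fs.listStatus(inputDir)) {
            inputBytes += f.getLen();
        }
        // Aim for roughly one reducer per gigabyte of input (illustrative target).
        job.setNumReduceTasks(Math.max(1, (int) (inputBytes / (1L << 30))));
    }
}
```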
Do not: Talk to the job tracker. It may be tempting to have tasks talk directly to the job tracker which scheduled the job to find out extra information about the job or the other tasks. For example, you may be tempted to have the reducer ask the job tracker for the total number of input records, in order to turn the classic word count example into an IDF calculation. But the job tracker is a single resource, so such practices can prevent your job from scaling and can also give misleading results, since the job tracker may not be serving up the latest statistics while the job is still running.
Feature: Reliability
The Hadoop MapReduce framework handles failures gracefully. If a node in the cluster fails mid-task, it will simply re-run the task on a different node. It also hedges against having to wait for slow nodes by running duplicate copies of the same task on other machines (what is referred to as "speculative execution"). But if you are not aware of these mechanisms, your system may end up not functioning as expected.
Do not: Write an output format without considering output committer functionality. With FileOutputFormat and its derivatives, only the output of successful tasks makes it to the output directory of the job and the rest is discarded; this logic is handled by the FileOutputCommitter. Some custom formats may not need committers - for example if they write to a key-value store which automatically keeps only one value per key - but in other cases, failed and subsequently restarted jobs, or tasks scheduled speculatively, can result in output data inconsistencies such as duplicate records.
Do not: Write tasks with side effects. If a task has side effects - for example, if it writes directly to an external data store - then retried tasks, or duplicate tasks scheduled through speculative execution, may end up producing results you did not intend.
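If a job really must have side effects, one partial mitigation is to turn off speculative execution for it, so at least the same task is never run twice at the same time (this does nothing for retried failures, though). A sketch, using the Hadoop 2.x property names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SideEffectJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Disable speculative execution so no duplicate map or reduce attempts
        // run concurrently. Older releases spell these properties
        // mapred.map.tasks.speculative.execution / mapred.reduce.tasks.speculative.execution.
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);
        Job job = Job.getInstance(conf, "side-effect-job");
        // ... the rest of the job set-up ...
    }
}
```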
We live in a unique time when a number of technologies and trends have finally come together. I'm talking about the big data overload, cloud computing, machine learning and corporation-backed open source initiatives. I predict that all this will soon result in a revolution... a shift in the way we live, perhaps even as profound as the one brought on by the birth of the Internet. I'm thrilled to watch that unfold, and I have a chance of doing so by working at a company which is part of this trend (the second one in a row by now). I will use this blog to share some of the things I have learned so far. I suspect most of it will be technical and have to do with emerging technologies such as Hadoop, Lucene, or NoSQL. Maybe some of it will be more philosophical... I don't know. Anyway, thanks for visiting and I hope you enjoy what you find here!