Thursday, July 31, 2014

Eight steps to true AI: we are 75% there


Disclaimer: I am currently between jobs, the opinions expressed here are entirely my own and do not necessarily reflect the opinions of any past or future employers.

With the advent of deep learning, AI winter is officially over. Neural networks are back with a vengeance. Once again imaginations are running wild with post-apocalyptic scenarios in which an evil artificially intelligent being takes over the world.

I am fairly confident that true human-like AI, once we do eventually develop it, will be much more like the Bicentennial Man than Skynet. I am eagerly awaiting the moment. This makes me wonder how much time we actually have left before it happens and what pieces are still missing.

The exciting thing is that a lot of the pieces which just a few years ago I thought were still missing are largely in place now. The way I see it, only two pieces out of eight still remain before a big breakthrough happens, so in a way we are 75% there. Below is a list of these pieces. All of them have been researched to some extent, but I consider the first six largely a done deal. It is the last two which mostly separate us from the end goal.
  1.  Scale of data and computation. This is a pretty trivial one: computer systems today rarely match the scale of a single human brain, but the sizes are growing rapidly, so it is only a matter of time until we reach a critical mass of raw computational power. What is perhaps less trivial is the amount of data which needs to be fed to such large systems in order to make them useful, but thanks to our growing reliance on computers and the steady transition of our lives into the virtual world, the amount of available data is also growing exponentially.
     
  2.  Unsupervised learning. This is related to the above. Supervised learning simply doesn't scale, so the only way to learn from the vast amount of data available today is to let an algorithm sort it out on its own. Starting from Google's seminal work, recent results have shown that unsupervised algorithms can tackle problems which up until recently were only achievable using supervised learning.
     
  3.  Neural networks. I was always fascinated by neural networks and never gave up on them (even though I was sometimes embarrassed to admit it once they fell from grace). Deep learning algorithms have breathed new life into the approach and neural networks are once again at the forefront of machine learning research. More than the algorithms themselves, what I think is important is the amount of research going into computing systems designed to mimic the human brain, since there is still so much we can learn from it.
     
  4.  Hierarchical modularity. All complex systems follow a hierarchically modular design in which simpler components are combined to produce ever more complex ones. Up until recently, most machine learning models did not have this property. This changed with the advent of deep learning networks in which each subsequent layer builds on the previous one to learn ever more complex features. Even though this strictly layered structure is not very natural (more on this in point 8), the general idea of modularity is a crucial one.
     
  5.  Information-theoretic approaches to modeling understanding. This recent article shows that there is still a lot of controversy around what it means to understand something and how to model understanding computationally. But modeling understanding may turn out simpler than it seems. It may boil down to what are essentially compression algorithms.

    Two approaches which emerged recently seem to point in this direction. One is sparse coding. Sparse coding is a technique used in deep neural networks. The essence of the idea behind it is to represent the observed data with the smallest generative model possible. As part of a deep neural network, these models learn to represent ("understand"?) complex visual features, such as human (or cat) faces.

    A related approach involves building models which can take incomplete or noisy data and reconstruct the missing parts of it. My favorite example is word2vec, which trains a model to guess words in a sentence given the surrounding words. The fascinating thing about it is that the resulting model (vector representations of words) reflects many properties of the words' actual meanings. For example, variations of a word and its synonyms end up close in the vector space. Linear arithmetic on the vectors reflects analogies, such as "Rome is to Italy as Paris is to France", or "man is to king as woman is to queen". In a way, the model learns to "understand" the words.

    Whether these simple information-theoretic approaches are really the key to what we perceive as understanding may forever remain a philosophical question. But the results of applying these approaches to computer vision and natural language processing tasks speak for themselves, and so far they are enough to convince me that they are the key ingredient and that there is not much more to what we perceive as understanding.
     
  6.  Transfer learning. Transfer learning is applying the knowledge gained from learning one domain to a different domain. It is clear that intelligent life forms do this, but most machine learning algorithms used today are trained and evaluated on one very specific problem in one domain only.

    I am going to give this one a check mark anyway, a little bit ahead of time. First, because it is an active area of research and discussion. Second, because the approaches described in the previous point are a precursor to full-blown transfer learning. For example, word representations learned by training a model to guess surrounding words in a sentence can then be used for a wide variety of natural language processing tasks, such as part-of-speech tagging, named entity recognition, sentiment analysis, etc. While still not cross-domain (all these are NLP problems), we are seeing models trained to solve one problem contribute to solving other very different problems within the same domain.
     
  7.  Evolutionary algorithms. The majority of algorithms used to train machine learning models today follow the same basic principle. This is true both for simple models, such as logistic regression or backpropagation neural networks, and for the newer deep learning approaches. The principle is to use gradient descent, or a related optimization algorithm, to optimize a well-defined function. This function can be the error in the case of supervised learning, or a measure of how well the model represents observed data in the case of the newer unsupervised deep learning approaches.

    But nature does not do gradient descent. Even though the theory of evolution often cites a fitness function, no such function can be formally defined. Organisms exist and interact with each other in an ever-changing environment and what it means to be "fit" is different in every instant of time and every point in space. Rather than optimizing a specific function, the optimization algorithm used by nature is the very simple, yet profoundly powerful duet of variation and selection, which does not require the definition, or existence, of an explicit fitness function.

    Evolutionary algorithms which do not optimize a well-defined function, but are based purely on variation and selection, have existed for a long time. They are not, however, widely used in training machine learning models today, because they are much less efficient than the algorithms which optimize a well-defined function. (For a toy illustration of what a purely variation-and-selection loop might look like, see the sketch after this list.)

    This is one thing which I believe will change in the next years and will open the path to true AI. The advantage of gradient-descent-like algorithms, the ability to quickly converge in a defined direction, will also be their undoing, since they are not able to optimize for the unknown, or for what may not even exist. Real intelligence is a constantly moving target and cannot be modeled by a simple function.

    My prediction is that in the coming years evolutionary algorithms will experience a revival similar to the one neural networks experienced recently. We will start using purely evolutionary algorithms for training models. This will require more computational power, but will ultimately produce much more complex and powerful systems.
     
  8.  Learning network topologies. The neural network models being trained today have a fixed topology. Training consists of updating the connection weights. Whereas it has been postulated that training the structure itself could be beneficial, there is no good way of doing it using the current learning algorithms. This will change after the previous piece falls into place and we move to evolutionary training algorithms. When that happens, the network topologies will evolve as well. Perhaps at this point the topology will even fully define the model and there will be no need for additional weights.

    I am pretty sure that those evolved topologies will not have the neat man-made layered architecture that neural network models have today. Rather, my bet is that they will resemble other graphs which occur in nature, such as the actual connections in the brain, social networks and airline routes: small-world networks. These topologies evolved naturally and are an effective way of creating highly connected, scalable systems in which information flows efficiently.
     
  9. (Bonus) Redundancy and robustness. Human beings can lose large chunks of their brain, sometimes with surprisingly little impairment. Deep learning networks, on the other hand, get thrown off by a couple of distorted pixels. Clearly, real-life intelligent systems have a lot of redundancy and robustness built in, and true AI will necessarily have it as well. I am not including it in the count, though, because I think that it will be an emergent property of the system, one which will arise naturally once all of the other pieces fall into place.
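
To make the variation-and-selection idea from point 7 a little more concrete, here is a toy sketch (entirely my own illustration, not an established algorithm): organisms mutate and compete head-to-head in an environment that keeps drifting, so what it takes to "win" changes from one generation to the next and no fixed fitness function is ever written down.

import java.util.Random;

// A toy "variation and selection" loop (illustrative sketch only): organisms are bit
// strings, the environment is a bit string that keeps drifting, and survival is decided
// by head-to-head contests in the current environment rather than by optimizing a fixed,
// explicitly defined fitness function.
public class VariationAndSelection {

    static final int POPULATION = 100, GENES = 32, GENERATIONS = 10000;
    static final Random random = new Random();

    public static void main(String[] args) {
        boolean[][] population = new boolean[POPULATION][GENES];
        boolean[] environment = new boolean[GENES];

        for (int generation = 0; generation < GENERATIONS; generation++) {
            // The environment drifts: what it takes to survive changes over time.
            environment[random.nextInt(GENES)] ^= true;

            // Variation: every organism picks up a random mutation.
            for (boolean[] organism : population) {
                organism[random.nextInt(GENES)] ^= random.nextBoolean();
            }

            // Selection: two random organisms compete; the one that copes better with the
            // *current* environment survives, and the loser is replaced by a mutated copy
            // of the winner. No global objective is ever written down or optimized.
            int a = random.nextInt(POPULATION), b = random.nextInt(POPULATION);
            int winner = agreement(population[a], environment) >= agreement(population[b], environment) ? a : b;
            int loser = (winner == a) ? b : a;
            population[loser] = population[winner].clone();
            population[loser][random.nextInt(GENES)] ^= true;
        }
        System.out.println("Ran " + GENERATIONS + " generations of pure variation and selection.");
    }

    // How well an organism happens to match the environment at this instant.
    static int agreement(boolean[] organism, boolean[] environment) {
        int count = 0;
        for (int i = 0; i < organism.length; i++) {
            if (organism[i] == environment[i]) count++;
        }
        return count;
    }
}

This is only meant to show the shape of such a loop; a serious system would use a far richer environment and representation and, as argued in point 8, would let the network topology itself evolve.
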
The past few years have witnessed continued explosive growth in data sizes and the availability of computational resources. This was coupled with advances in algorithms, such as deep learning, which make use of them and exhibit increasingly intelligent behavior. These behaviors, such as the ability to learn meaningful visual features or word representations, increasingly resemble elements of human intelligence.

Yet, we have not been able to create anything which fully resembles true AI, so clearly there is still something missing. I postulate that this missing piece is moving away from optimizing well-defined functions and towards purely evolutionary trial-and-error algorithms. Furthermore, I believe that these algorithms should be applied to training model topologies, and that these topologies will end up resembling small-world networks. Such approaches will need a lot of computational power and data to train, but those will be available in the near future.

Of course, there is no way of really predicting when and how the singularity will happen... but I do have the feeling that we are getting awfully close. What do you think?

Monday, June 23, 2014

Data dredging and statistical significance explained, or how to win a meritless discrimination lawsuit


I just came up with the following interview question for data scientists:

You apply for a position at a company and interview with five individuals, each of whom then casts a hire / no-hire vote. You don't get the job and decide to file a discrimination lawsuit against the company. You don't specify which of the five interviewers was being discriminatory or whether you were being discriminated against based on gender, race, religion, age, or sexual orientation. The judge forces the company to disclose historical hire/no-hire decisions for each of the five interviewers and declares that if you can show that one of the interviewers is significantly prejudiced (at p = 0.05) with respect to one of the five characteristics, then they will rule in your favor. Assuming that none of the interviewers are actually biased, what is your chance of winning the case?

Can you guess the answer?

The answer is: approximately 72%. Yep, you are almost three times as likely to win the case as to lose it, even when no actual discrimination is taking place.

This apparent paradox is a typical example of data dredging and it is what people mean when they warn against the perils of big data, or even the downfall of science. The problem is not intrinsic to big data or science, but stems from what most problems tend to stem from: people doing things without fully understanding what they are doing.

So I thought I would take this chance to explain some of the key concepts behind statistical significance and data dredging using the discrimination lawsuit paradox as an illustration.

Statistical significance

Let's say you have historical interview decision data for interviewer X and it looks like this:

Candidate   Gender   X's decision
1           female   hire
2           female   no hire
3           male     hire
4           female   hire
5           male     no hire
6           male     no hire
7           female   hire
8           female   no hire
9           female   no hire
10          female   hire
11          male     no hire
12          male     no hire
13          female   no hire
14          male     no hire
15          female   hire
16          male     no hire
17          male     hire
18          male     hire
19          male     no hire
20          male     no hire

The question is: is X discriminating based on gender, or in other words, does X have a bias towards men or women?

We can see that he (or she?) voted hire for 3 out of 11 men (27%) and 5 out of 9 women (56%) — more than double the rate. So it looks like he may have some bias against men. But does this really mean that he is discriminating, or did he just happen to come across some really qualified female candidates and some really unqualified male ones?

This is the question which statistical significance attempts to answer, but it is important to realize that no mathematical system can actually answer this question conclusively, so we go after "the next best thing" and instead try to answer a related question:

"Let's give X the benefit of the doubt and assume that he is not discriminating (the 'null hypothesis'). What are the odds that he would be so unlucky with his male candidates to end up with this amount (or more) of skew towards women?"

To answer this question, let's assume that X was completely blind to the gender when he gave away his eight hire votes and that gender was randomly assigned to the candidates after this fact. If that were the case, the chance of men getting x of the hire votes and women getting 8-x would be:
(11 choose x) * (9 choose 8-x) / (20 choose 8), so:

Hires for men   Hires for women   Probability
0               8                 0.007%
1               7                 0.314%
2               6                 3.668%
3               5                 16.504%
4               4                 33.008%
5               3                 30.807%
6               2                 13.203%
7               1                 2.358%
8               0                 0.131%

The p-value is simply the sum of the probabilities in the row in question (in this case 16.504%) and all the rows which are even more "extreme" (in this case, even more biased against men). So in this case the p-value is around 20%, or 0.2, which is usually not considered "statistically significant". The lower the p-value, the more statistically significant the result. Typical p-value thresholds for achieving statistical significance include 0.1, 0.05, or 0.01. So if there had been only two men among the eight that X voted hire for, the bias would have been considered significant, having a p-value of around 0.04. But since he (or she) voted hire for three men, the bias should not be considered statistically significant.
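
If you would like to verify these numbers yourself, here is a small stand-alone sketch (my own, not part of the original analysis) which computes the table above and the resulting p-value directly from binomial coefficients:

// Computes the probability table and p-value for the example above:
// 11 male and 9 female candidates, 8 "hire" votes handed out gender-blindly.
public class BiasPValue {

    // Binomial coefficient C(n, k); the values here are small enough to fit in a long.
    static long choose(int n, int k) {
        long result = 1;
        for (int i = 1; i <= k; i++) {
            result = result * (n - k + i) / i;
        }
        return result;
    }

    public static void main(String[] args) {
        int men = 11, women = 9, hires = 8;
        long total = choose(men + women, hires);   // all ways to place the 8 hire votes

        double pValue = 0;
        for (int x = 0; x <= hires; x++) {
            double p = (double) (choose(men, x) * choose(women, hires - x)) / total;
            System.out.printf("men=%d women=%d p=%.3f%%%n", x, hires - x, 100 * p);
            if (x <= 3) {
                pValue += p;  // the observed outcome (3 men) or one even more skewed against men
            }
        }
        System.out.printf("p-value = %.3f%n", pValue);  // prints roughly 0.205
    }
}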

A common misconception at this point is to assume that this means that the probability that X is biased is 1-p = 80%. But that's simply not how probability works, which we can easily see by applying Bayes rule:

P(X is biased | this or more extreme result)
     = P(NOT null hypothesis | this or more extreme result)
     = 1 - P(null hypothesis | this or more extreme result)
     = 1 - P(this or more extreme result | null hypothesis) * P(null hypothesis) / P(this or more extreme result)
     = 1 - p * P(null hypothesis) / P(this or more extreme result)
     ≠ 1 - p

This misconception keeps getting repeated, leading to a lot of confusion, but the truth is that you just can't compute the actual probability that X is biased, unless you have access to all the different parallel universes, in some of which X is biased and in others he isn't.

Note: In real life we would also have to account for the overall rate of hire vs. no-hire decisions for the two genders, which may not actually be equal for reasons unrelated to discrimination. For example, girls may tend to choose different colleges than boys and some colleges may be better than others at preparing candidates for whatever the job in question is. But this would make the math more complicated, so for the purpose of this example, let's assume that the overall rates are the same for men and women in the general population.

Data dredging

Coming back to the original interview question: we have five different interviewers and each of the five interviewers can be prejudiced with respect to five different possible factors (gender, race, religion, age and sexual orientation). This makes 25 possible combinations of interviewer and prejudice. For each of these 25 combinations we can compute the p-value just like we did in the previous section. Assuming that none of the interviewers is actually biased, the probability of any one of these results showing statistically significant bias with a p-value threshold of 0.05 is 0.05 — that is just the definition of the p-value. But this means that the probability of at least one of the 25 showing statistically significant bias is 1 - (1 - 0.05)^25 ≈ 0.72.

In general, the more hypotheses you are testing in hope of uncovering a "statistically significant" one, the more likely that at least one of them will show significance due to pure chance even if no underlying pattern exists (a false positive). This chance gets lower if you lower the p-value threshold, but no matter how low the p-value threshold is, given enough possible hypotheses, the probability of at least one false positive quickly approaches 1 (see graph below). 
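
To get a feel for how fast this effect kicks in, here is a tiny sketch (my own illustration) which prints the probability of at least one false positive for a growing number of hypotheses tested at a fixed 0.05 threshold:

// Probability of at least one false positive when testing many independent hypotheses,
// each at a significance threshold of alpha = 0.05.
public class FalsePositives {
    public static void main(String[] args) {
        double alpha = 0.05;
        for (int hypotheses : new int[] {1, 5, 25, 100, 500}) {
            double pAtLeastOne = 1 - Math.pow(1 - alpha, hypotheses);
            System.out.printf("%4d hypotheses -> %.1f%% chance of at least one false positive%n",
                    hypotheses, 100 * pAtLeastOne);
        }
    }
}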



Conclusion

This shows that you should always take every statistically significant result with a grain of salt and ask yourself: how many hypotheses were tested before this seemingly significant result was found? You can see many examples of this in medicine (testing numerous drugs to see what works, or testing multiple combinations of genes and conditions to find a correlation), as well as software development (A/B testing multiple features and adopting only the ones which show "significant" improvement). There is nothing wrong with computing p-values, but taking action without knowing what they mean exactly can quite literally have perilous results.

Having gone through this whole chain of thought, I decided that it probably wasn't a good idea to give this interview question to actual candidates, just in case it gives them any ideas... ;)

Friday, January 17, 2014

Why the sum of all natural numbers is not negative one twelfth


This video posted by Numberphile has been making the rounds lately, supposedly proving that the sum of all natural numbers all the way to infinity is equal to negative one twelfth. The guy in the video, Tony, uses pretty basic math to derive this formula and at first glance it may not be immediately clear what is wrong with his derivation. To add to the mystery, he flashes a page from a String Theory book which quotes the formula, albeit in a very different context.

Despite all the controversy this video has been generating lately, I don't see a good explanation of what exactly is wrong with this proof among the top Google results, so I'm going to try to make it clear here.

Main points

The derivation uses "..." to denote one thing in the case of computing S1 and something else in the case of S2, hence the entire derivation is invalid.

There is no interpretation of "..." which, if used consistently alongside sums of natural numbers, would lead to S = 1 + 2 + 3 + ... = -1/12.

Now, mathematical models may exist which use notation similar to the one used for integers and sums, but model phenomena other than natural numbers. Such models may abide by different rules and you may get results which don't carry over to the natural number world. This is probably what the String Theory book is doing (you can check it out at Google Books here). But these mathematical models would still have to be internally consistent in order to be useful, unlike the Numberphile framework, which can be used to derive contradictory results and hence has no application other than entertainment.

Details

The first part of the proof concludes that:

S1 = 1 - 1 + 1 - 1 + 1 - 1 ... = 1/2

The argument behind this is that if the sum were cut short at any given point: if this point were odd, the sum would be 1 and if it were even, the sum would be 0. Tony goes on to say: "Do we stop at an odd or even point? We don't know, so we take the average of the two."

Note that this whole argument relies on the assumption that we do stop the series at some point. We just don't know if this point is odd or even, but not stopping at all is not an option. This is an essential distinction.

The traditional mathematical definition of infinite sums assumes that you never stop. Since you never stop, you never stop at a point which is odd or even, hence the sum is undefined (we'd call this a divergent series).

So what Tony is really computing here, is not the infinite sum, but the expected value of a finite sum stopped at a random place. In more verbose notation we would write what Tony said as:

S1 = 1 - 1 + 1 - 1 + 1 - 1 ... (n times)
S1 = {0 when n is even; 1 when n is odd}
P(n is even) = 1/2
P(n is odd) = 1/2
E(S1) = 1/2 * 0 + 1/2 * 1 = 1/2

There would be nothing wrong with using the infinite sum notation to denote the expected value of a finite sum. In fact, the above definition is called the Cesàro sum and is a well-known mathematical concept. The problem is that having chosen the Cesàro sum as the interpretation of "...", Tony would have to stick to this definition for the remainder of the proof. But he does not do that. In fact, for the rest of the proof, he uses the same notation to mean the actual infinite sum, which never stops.
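
To see this interpretation in action, here is a small sketch (mine, not from the video) which computes the running average of the partial sums of 1 - 1 + 1 - 1 + ... and watches it settle down around 1/2, which is exactly the Cesàro sum:

// The partial sums of 1 - 1 + 1 - 1 + ... alternate between 1 and 0;
// their running average (the Cesàro sum) converges to 1/2.
public class GrandiSeries {
    public static void main(String[] args) {
        int partialSum = 0;
        long sumOfPartialSums = 0;
        for (int n = 1; n <= 1001; n++) {
            partialSum += (n % 2 == 1) ? 1 : -1;   // partial sums: 1, 0, 1, 0, ...
            sumOfPartialSums += partialSum;
            if (n % 200 == 1) {
                System.out.printf("n = %4d   average of partial sums = %.4f%n",
                        n, (double) sumOfPartialSums / n);
            }
        }
    }
}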

This is clear, for example, in his derivation of S2. The derivation of S2 is based on adding the series to a version of itself shifted by one, like so:

S2 = 1 - 2 + 3 - 4 + 5 - 6 ...
2 * S2 =   1 - 2 + 3 - 4 + 5 - 6 ...
         +     1 - 2 + 3 - 4 + 5 ...

This is a perfectly valid operation when you are dealing with actual infinite series, because those go on forever. So even if you shift them, they can still align perfectly all the way to infinity. But it does not work under the expected value definition, since the expected value definition assumes that the series stops at some point, in which case the shifted version of S2 would have an extra term at the end which would not align with the un-shifted version.

With all this in mind, here is the video again:



Sunday, August 5, 2012

The evils of code coverage and other bad metrics

Metrics
Today everything is about being data-driven. And the first thing you need in order to be data-driven is metrics (or so-called Key Performance Indicators – KPIs, if you like acronyms). So we all scramble to establish our metrics quickly, so that we can start being data-driven as soon as possible. And thus we often fall into a trap, because bad metrics can cause much more harm than being data-driven can do good in the first place. There are plenty of excellent examples of bad metrics and their destructive power in the literature, but I would like to call out some examples from the area which is closest to my heart: software engineering.

Example 1: Number of bugs fixed


The first time I witnessed the effect of bad metrics was at my first full-time software engineering job back in 2007. Somebody had the idea to send out daily e-mails with the number of bugs fixed by each developer, something like this:

Today’s bug fix results:

Bob: 10
Iwona: 5
Steve: 3
Anna: 2

The e-mails had an immediate effect: everything just stopped working. Why? Well, we all felt pressured – probably more in front of each other than in front of anybody else. Everybody wanted to be high on that list for the sheer pleasure of being on top of a list. So we all automatically switched our behavior to optimize for the metrics we were presented with.
But everybody has limited resources and everything is a tradeoff, so optimizing one thing means sacrificing another. So we focused on fixing bugs as fast as we could. And since there were many bugs to choose from, we fixed the ones which allowed us to fix more in the finite time we had, such as “First letter in address label should be capitalized.” We skipped over the bugs which were really hard to fix, but of course those were the ones which were the most important.

Example 2: Code coverage


At another company we measured code coverage to get developers to write unit tests. Perhaps this did motivate some developers who weren't writing unit tests at all before to start writing some unit tests. But what was much worse was that developers who were writing useful unit tests before, switched to writing completely useless ones. They were useless, because they focused on code coverage and code coverage is not an indication of unit test effectiveness.
Here is a made-up example, but it is in the spirit of what I observed. Let’s say we are testing the following function:
// Returns true if the bit at position 'bit' (0-based, counting from the least significant bit) is set in 'flags'.
public static boolean extractFlag(Integer flags, Integer bit) {
    if (flags == null) {
        throw new IllegalArgumentException();
    }
    if (bit == null) {
        throw new IllegalArgumentException();
    }
    return (flags.intValue() >>> bit.intValue()) % 2 > 0; 
}

Here is what a reasonable set of unit tests may have looked like before code coverage:
@Test
public void testExtractFlag1() {
    assertFalse(FlagUtils.extractFlag(2, 0));
}

@Test
public void testExtractFlag2() {
    assertTrue(FlagUtils.extractFlag(2, 1));
}

@Test
public void testExtractFlag3() {
    assertFalse(FlagUtils.extractFlag(Integer.MAX_VALUE, 31));
}

@Test
public void testExtractFlag4() {
    assertTrue(FlagUtils.extractFlag(-1, 31));
}

The above would have been pretty useful tests. They would have uncovered all of the following common mistakes on the last line of the function:

return (flags.intValue() / (1 << bit.intValue())) % 2 > 0; 
return (flags.intValue() / bit.intValue()) % 2 > 0; 
return (flags.intValue() >> bit.intValue()) % 2 > 0; 

But the coverage of these tests is pretty bad. They don’t cover all of the exception cases. So once the code coverage mandate was put into place, tests started morphing into something like this:

@Test(expected=IllegalArgumentException.class)
public void testExtractFlag1() {
    FlagUtils.extractFlag(null, 0);
}

@Test(expected=IllegalArgumentException.class)
public void testExtractFlag2() {
    FlagUtils.extractFlag(0, null);
}

@Test
public void testExtractFlag3() {
    FlagUtils.extractFlag(2, 1);
}

@Test
public void testExtractFlag4() {
    FlagUtils.extractFlag(3, 1);
}

Even though this particular example is made up, the general idea is exactly what happened: I literally saw hundreds of unit tests without a single assertion statement in them. But guess what, 100% coverage. Needless to say, these unit tests would not catch any of the common mistakes listed above.

Example 3: Lines of code


I was discussing this with @turboCodr one day when he mentioned an even more outrageous example: apparently one company rewarded their employees based on the number of lines of code they produced. I won’t even dignify this idea by elaborating on how wrong it is and on how many levels. Suffice it to say that if lines of code are any measure of software development progress, it should be how many lines are deleted, not how many are added.

Examples of good metrics


So what are some good metrics that you can use to build a successful data-driven software development shop? I would go with anything that actually reflects a meaningful goal: something which makes sense to the business itself. For example:
  • Number of customer complaints, bugs found by users, or bugs found in production 
  • Application efficiency (speed, resource consumption, etc.) 
  • System downtime

Conclusions


Metrics are powerful.


So don’t mess with them unless you know what you are doing. It can cause more harm than good.

Correlation is not causation.


Trying to influence something which seems correlated with the goal, but is not the goal itself, rarely works. It is true that code coverage is correlated with having good unit tests, but that’s because good unit tests create code coverage and not the other way around. Thinking that you can improve your unit tests by increasing code coverage is like thinking that you can reduce crime by reducing the number of policemen, since cities with few policemen have fewer crimes.

Good developers, when faced with a lack of clear guidelines, tend to gravitate towards doing what makes sense.


So leaving them be may be better than imposing bad metrics or incentives. Also, gathering their feedback and accounting for extra time for quality improvements can be a good idea.

The only good metrics are those that actually matter to the business. 


So if your customers don’t pay you for the number of bugs you fix in your code, the lines of code that are covered by unit tests, or the total lines of code your team produces, then that’s not what you want to measure.

Thursday, December 8, 2011

How to lose the tech talent war


The great article The Rise of Developeronomics argues that the biggest success factor for any company nowadays, software or other, is its ability to attract and retain high-end software development talent, and that it's only going to get more so from here on out. Companies seem to get this - at least judging from the amount of unsolicited recruiter mail I keep getting - but they often fail miserably at retaining their engineering talent. Why is that?

The reason, it would appear, is that companies fail to see what engineers really want. They assume that their only bargaining power is cash (or other financial instruments, such as stocks or options) and perhaps benefits (but those can be translated into cash in one way or another as well). So they compete with each other on price, which has the sole effect of driving engineering salaries up industry-wide (not that I'm complaining :)). But as a chunk of the commodity you are so fiercely competing for, let me share a piece of insight: salary is not what makes the big difference. Sure, money is nice, and if you don't pay developers a decent amount of money, they will find somebody who does.

But salary concerns are not what usually makes developers leave their job.

And yet developers leave all the time. The turnover rate in the profession is through the roof to the point that even my dentist commented on it the other day! That's costing the industry who knows how many billions in turnover costs and lost productivity. And since that's deadweight loss (yes, I just learned a new term from my economics book), it is making all of us so much poorer!

So what does drive a good developer away? Here are a few things:

Closing projects and canceling products before they see the light of day

The thing about software developers, at least all the good ones, is that we work because we like to. Most of us would do it even if we didn't get paid for it. The reason we like it so much is the thrill of creating something. It's the empowering feeling of making something of value, practically out of nothing. It's really exhilarating. So imagine how we feel if we spend all our effort creating something in expectation of this thrill, only to find our work thrown out the window. It's as if a sculptor worked for years on a statue only to have the sponsor of his work demolish it before it was finished. Strange as it may sound, this happens all the time in software companies, big and small. Projects get thrown out, priorities change. I have even heard of larger companies deliberately assigning the same project to multiple teams with the intent of keeping one and closing the others down. These decisions are made by managers who push around project risk and budget estimates, but are those managers factoring in the cost to their human capital that every closed-down project incurs?

Lack of established goals or acknowledging success

Companies often see acknowledging success as a way of directing employees to do certain things and not others. So when they see an engineering team produce good results, they don't feel like they need to acknowledge any individual team member's good work. But acknowledging success is important not just as a directing tool, but as confirmation that what we are doing is being appreciated. Yes, engineers are often needy. And vain. And insecure. We really like to be praised and told that we did well, regardless of whether the praise comes in the form of a cash bonus or just a plain old pat on the back.

But beware not to hand out praise arbitrarily, without any connection to merit, or else the developers will quickly catch on and it will lose its value. We all know who on the team did well and when. We just want to know that you know it too.

Not giving developers a chance to choose what to work on

It's a no-brainer that any employee is happier when working on what they like, and having the chance to work on what you like makes you want to stay at a job and not leave. But some managers seem to think that if they don't make decisions on how to hand tasks out to people, then they are not justifying their salaries. Of course, the job of a manager or a lead is to ensure that what needs to get done, gets done. But this still leaves a whole lot of room for optimizing task assignments according to individual preferences. Unionized airlines have built very complex systems around ensuring that crew members get to fly to their individually preferred destinations, so that they can please the union which holds all the cards. If The Rise of Developeronomics is any indication, developers are the ones holding the cards in a modern company, so optimizing for their work preferences is a powerful tool for winning them over. And it can even save on some managerial overhead.

No chance of advancing skills or gaining recognition in the community

Like every motivated individual, developers strive to get better at what they do. Part of this is learning new skills and improving existing ones, and part of it is establishing oneself in the broader community, since in many ways the community as a whole can achieve much more than the sum of its parts. So allowing and even encouraging such activities can help keep employees happy in addition to growing their value. Companies should take this into account when deciding not to send employees to a conference because it would take time away from project work, or when preventing an employee from blogging for fear that it will expose sensitive information. Not to say that decisions like these are never justified, but too much repression can lead to employees feeling like they can't spread their wings, and they will leave (especially the ones with wings, which are the ones you want to hold on to the most).

Lack of ownership and accountability

Very good developers are unable to write bad code, but even the best of us try harder when we know that our name is going to be on what we are working on... even more so if that means that we will have to maintain it later :) At the same time, we hate having to pay for other people's laziness by having to maintain their bad code. That's not to say that only one person should be looking at any given area of a project. On the contrary, code and design need to be constantly reviewed by multiple people to ensure quality as well as skills and knowledge transfer. But maintaining clear and constant areas of ownership throughout the process leads to better work quality as well as much less frustration and happier engineers.

Exposure to bad code and bad developers

The Rise of Developeronomics talks about the cost of bad developers, incurred when their bad code blows up. But there is another cost of bad developers, and that is their bad effect on the good developers. I, for one, suffer when I see bad code. When I see some really bad code, I feel nauseous and need five minutes of browsing Facebook before I can bring myself to look at the IDE again. That's lost productivity right there. And then there's the fact that knowing that other pieces of the product are going to be written badly anyway doesn't make me want to try as hard to make my pieces perfect. Finally, all the stress caused by having to debug and integrate with badly written components is a motivator to look for a different company with a better team.

So it is important for companies to acknowledge code quality in their process, as well as the individual developers' commitment to it. Note that this is not about junior employees and senior employees, because I don't see all that much correlation between code quality and seniority. It's the right mindset that matters and junior developers with the right mindset will intuitively write high quality code from the start, while sloppy developers will not only poison your code base, but drive all of the good developers away.

Sunday, September 18, 2011

Anagram Finder: A Do-It-Yourself Little Big Data Project

It's amazing what you can do with the available tools these days! A couple of weeks ago I stumbled upon a list of funny anagrams and I thought that it would be fun to generate my own. (Did you know that there is actually a name for "someone who creates anagrams" - who knew?). What I needed was a tool which would allow me to explore possible anagrams by specifying a word I want in the phrase and a word I want in the anagram of the phrase. For example, I wanted to be able to say that I want an anagram for "America" and "rich" and find phrases containing "America" and their corresponding anagrams containing "rich".

I did find some anagram generators on the web, but all of the ones I found required that you specify the entire phrase up front and then just rearranged the letters, but did not allow specifying just a subset of the phrase. Such a tool does take some of the work out of anagramming (who knew this is actually a word too), but the main challenge still remains to come up with the original phrase to begin with.

But then I thought, hey, I have all the tools I need to do something like this myself! So this became my little weekend big data project. It resulted in http://www.anagramfinder.net/. It could still use more work, of course, but after just a few minutes of playing with it I was able to come up with some non-trivial anagrams, like:

rich ornamentation = not in north America
restaurant food = store fat around
vampire hunter = never a triumph

Want to make your own anagram finder? It's real easy! Here's what you will need:

Google Books Ngram Dataset


Amazon Web Services


Hadoop


Lucene


Step 1 - Input Dataset: Google Books Ngrams

First of all, we need to ask the question: what constitutes a valid English phrase? Of course, there are many ways to answer this question. I'm going to go with the opportunistic approach and say that a valid English phrase is any sequence of words up to five words long which appeared in an English book somewhere... and that book got scanned by Google at some point. That's opportunistic, because the dataset of all such phrases - "ngrams" - just happens to be made publicly available by Google.

The only problem is that this dataset is some 800GB in size if you were to try to download it from the Google website. Not a good idea to try this at home. (I was actually careless enough to start attempting it, but luckily I have a husband with a good head on his shoulders... and one who has by now been trained to react to my mumbling things like "cool dataset" or "now if I can only get the shell script to download it in parallel..." - and he stopped me before I got us banned for life from our ISP.)

But that's where Amazon Web Services (AWS) comes to the rescue. Not only does it host the dataset in its cloud, but it provides a Hadoop platform - Elastic MapReduce (EMR) - which you can use to crunch it!

Step 2 - MapReduce

Next we need to process the ngrams, so as to group ngrams which are anagrams of each other together. We also need to collapse multiple occurrences of the same ngram, since the dataset has separate entries for different ngram-year pairs. Finally, we would like to cleanse the dataset a bit, so as to keep only the ngrams which are composed of letters and spaces, since there are some ngrams in there which contain symbols and we don't want to worry about those. We also don't need to retain groups of ngrams which contain only one ngram, since naturally those won't lead to any anagram pairs.

This is a textbook example of a job for MapReduce (implementation left as an exercise to the reader). My implementation was pretty straightforward and took about 8 hours to run on 20 small instances on Elastic MapReduce, so it cost me $16. (I could probably have saved some money by bidding on spot instances, but I didn't bother, since it was a fairly small job.)
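
For the curious, here is roughly what such a job could look like (my own sketch, not the code I actually ran; details like the input parsing depend on the exact layout of the ngram records): the mapper keys each ngram by its sorted letters and the reducer collapses duplicates, keeping only groups with at least two distinct ngrams.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AnagramGroups {

    // Keys each ngram by its "letter signature": all letters, lowercased and sorted.
    public static class SignatureMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumes tab-separated records with the ngram itself in the first column.
            String ngram = value.toString().split("\t")[0].toLowerCase();
            if (!ngram.matches("[a-z ]+")) {
                return;  // keep only ngrams made of letters and spaces
            }
            char[] letters = ngram.replace(" ", "").toCharArray();
            Arrays.sort(letters);
            context.write(new Text(new String(letters)), new Text(ngram));
        }
    }

    // Collapses duplicate ngrams and keeps only groups with at least two distinct members.
    public static class GroupReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text signature, Iterable<Text> ngrams, Context context)
                throws IOException, InterruptedException {
            Set<String> distinct = new HashSet<String>();
            for (Text ngram : ngrams) {
                distinct.add(ngram.toString());
            }
            if (distinct.size() > 1) {
                StringBuilder group = new StringBuilder();
                for (String n : distinct) {
                    if (group.length() > 0) {
                        group.append('\t');
                    }
                    group.append(n);
                }
                context.write(signature, new Text(group.toString()));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "anagram groups");
        job.setJarByClass(AnagramGroups.class);
        job.setMapperClass(SignatureMapper.class);
        job.setReducerClass(GroupReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Locally this reads plain text files; on EMR the public dataset is in sequence
        // files, so the input format would be switched to SequenceFileInputFormat instead.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}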

The nice thing about using AWS was that I was able to test my job locally before submitting it to the cluster, since EMR runs standard Hadoop 0.20. So I was able to write my MapReduce job in Java and test it on my local machine, using as input just one of the files downloaded from the Google dataset. Then I just uploaded my Java jar file to Amazon Simple Storage Service (S3) and used the S3 paths of the public Ngrams dataset as input and my private S3 bucket as output. The only thing to note is that the public dataset is available in the form of sequence files, whereas the files you download are text files, so you have to switch your job from TextInputFormat to SequenceFileInputFormat when running it on AWS, but that was the only difference.

The output of the job turned out to be only 8GB, because the original dataset contains a lot more information than is needed - probably mainly the breakdown of ngram frequencies by year. That's something you can download to your local machine for further processing, to make things simpler.

Step 3 - Lucene Index

The next step is indexing. The MapReduce job produced ngram groups which are all pairwise anagrams of each other. Each such group will be a Lucene document. Each document has two fields: the Words field and the Ngrams field.

The Words field contains all of the words in all of the ngrams in the ngram group. This field is indexed, but not stored, so it allows us to find all of the documents (ngram groups) which contain a given word. I additionally associated a payload with each word proportional to the frequency of the most frequent ngram in the group which the word belonged to and used that to boost the relevance of documents in the search. This caused frequent ngrams to bubble up to the top in the results. I'm not sure if that's the best relevance function to use, but I just wanted to play with payloads. Figuring out a better relevance score is definitely something to think about.

The Ngrams field is stored and not indexed and contains all of the ngrams in the group. This means that we can't search by it, but we can use it to retrieve all of the ngrams in a given document once we have found the document by searching for words.

While playing with the dataset, I found that there are a lot of trivial subgroups of each ngram group which contain the same words, but in different order, for example:
this is it
is this it
is it this
Those lead to a lot of cognitive overhead when browsing through results, so I decided to keep only one (the most frequent) ordering of words from each such group. This also reduces the index size and makes search faster.

The indexing process took 42 minutes on my machine and resulted in an index of 5.5GB (Lucene is super efficient in the way it stores things). The index stores a total of 127,627,663 ngrams in 58,581,588 groups.
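
To make the document layout concrete, here is a simplified sketch against the Lucene 3.x API of the time (my own reconstruction; the frequency-based payload boosting described above is left out):

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class AnagramIndexer {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("anagram-index"));
        IndexWriterConfig config =
                new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(dir, config);

        // One document per ngram group; every ngram in the group is an anagram of the others.
        String[] group = {"rich ornamentation", "not in north america"};

        Document doc = new Document();
        // Words: indexed so we can search by any word in the group, but not stored.
        doc.add(new Field("words", join(group, " "), Field.Store.NO, Field.Index.ANALYZED));
        // Ngrams: stored so we can retrieve the whole group, but not indexed.
        doc.add(new Field("ngrams", join(group, "\n"), Field.Store.YES, Field.Index.NO));
        writer.addDocument(doc);

        writer.close();
    }

    private static String join(String[] parts, String separator) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            if (sb.length() > 0) {
                sb.append(separator);
            }
            sb.append(part);
        }
        return sb.toString();
    }
}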

Step 4 - Lucene Search

Once the index is built, search is simple. First we identify all the documents which contain all of the words the user requested by performing a Boolean query on the Words field. Then, for each such document, we retrieve its ngrams by retrieving the contents of the Ngrams field. We scan the ngrams to identify those which contain all of the words the user requested to be in the left phrase and all the ngrams which the user requested to be in the right phrase and return the Cartesian product of those two sets.
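
Again as a rough sketch against the same Lucene 3.x API (my own reconstruction, with the scanning and pairing of left and right phrases omitted), the search side could look something like this:

import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

public class AnagramSearcher {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("anagram-index")));
        IndexSearcher searcher = new IndexSearcher(reader);

        // Require every requested word to appear somewhere in the ngram group.
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("words", "america")), BooleanClause.Occur.MUST);
        query.add(new TermQuery(new Term("words", "rich")), BooleanClause.Occur.MUST);

        for (ScoreDoc hit : searcher.search(query, 20).scoreDocs) {
            Document doc = searcher.doc(hit.doc);
            // The stored Ngrams field holds the whole group; the real application scans it
            // for phrases containing the requested left and right words and pairs them up.
            System.out.println(doc.get("ngrams"));
            System.out.println("----");
        }

        searcher.close();
        reader.close();
    }
}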

Now, to be completely honest, we should evaluate the relevance (for whatever relevance measure we decide on) of each such pair separately and reorder the results accordingly, instead of just relying on the relevance of each document. But this, again, I left as "future work".

Final Result

As mentioned earlier, this is the anagram finder I ended up with: http://www.anagramfinder.net. (No guarantees that it's up, though, since I'm only paying for one AWS micro instance to host it.) Can you make a better one? Give it a try - it's fun :)

Friday, May 20, 2011

Hadoop Don'ts: What not to do to harvest Hadoop's full potential

We've all heard this story. All was fine until one day your boss heard somewhere that Hadoop and No-SQL are the new black and mandated that the whole company switch over whatever it was doing to the Hadoop et al. technology stack, because that's the only way to get your solution to scale to web proportions while maintaining reliability and efficiency.

So you threw away your old relational database back end and maybe all or part of your middle tier code, bought a couple of books, and after a few days of swearing got your first MapReduce jobs running. But once you finished re-implementing your entire solution, you found that not only is the system way less efficient than the old one, but it's not even scalable or reliable, and your meetings are starting to resemble the Hadoop Downfall parody more and more.

So what went wrong?

The problem with Hadoop is that it is relatively easy to get started using it without an in-depth knowledge of what gives it its powers, and without this knowledge you are more likely than not to design your solution in a way which takes all of those powers away. So let's take a look at a few key features of Hadoop and what not to do to keep them.

Feature: Efficiency

There are a couple of things Hadoop does to ensure efficiency:
  1. It brings the computation to the data. So instead of sending large amounts of data over the network to the machines which execute the computations, it tries to run the computations directly on the nodes which contain the required inputs.
  2. It processes files sequentially, reducing the number of costly disk seeks. Unless the Hadoop cluster is running on SSD hard drives, it will take on the order of 10ms to seek to a new place on the disk. On the other hand, during the same amount of time it can read on the order of 10 megabits of data from the disk sequentially. So you can't process large amounts of data if the process involves frequent seeks.
  3. It uses compression, memory buffers and other optimizations to make the data flow in the system very efficient.
Do not: Physically separate your data cluster from your compute cluster. Whether your data is coming from HDFS, HBase, Cassandra or somewhere else, if it is not stored on the same machines as the ones the MapReduce jobs are running on, it is impossible for Hadoop to bring the computation to the data. Cases exist when such a solution is acceptable, in particular when the jobs are more compute-intensive than data-intensive (they do a lot of computing and not a whole lot of reading and writing data) and Hadoop can still be a good option for such jobs, but its full potential is with jobs which process large datasets with relatively simple computations.

Do not: Create location-unaware input formats. Often the input formats which come with the data storage you are using do not exactly fit your system and you have to write your own. Make sure that as you do that, you implement location awareness by implementing a non-empty InputSplit.getLocations(), or by inheriting this functionality from a superclass if you are extending an existing input format.
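
As a rough sketch of what that means in practice (my own example against the newer org.apache.hadoop.mapreduce API, not code from any particular input format), a custom split that reports its locations could look like this; the important part is that getLocations() returns the hosts holding the data instead of an empty array:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

// A skeleton split that reports where its data lives, so map tasks can be scheduled near it.
public class LocationAwareSplit extends InputSplit implements Writable {

    private long length;
    private String[] hosts;

    public LocationAwareSplit() {}  // required for deserialization by the framework

    public LocationAwareSplit(long length, String[] hosts) {
        this.length = length;
        this.hosts = hosts;
    }

    @Override
    public long getLength() {
        return length;
    }

    @Override
    public String[] getLocations() {
        // The hostnames of the nodes that hold this split's data; returning an
        // empty array here is what makes an input format "location unaware".
        return hosts;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(length);
        out.writeInt(hosts.length);
        for (String host : hosts) {
            Text.writeString(out, host);
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        length = in.readLong();
        hosts = new String[in.readInt()];
        for (int i = 0; i < hosts.length; i++) {
            hosts[i] = Text.readString(in);
        }
    }
}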

Do not: Read or write external data in the map or reduce tasks (apart from reading the input and writing output as orchestrated by the framework). For example, you may be tempted to make database read or write calls directly in the map or reduce code and there's nothing to prevent you from doing it. But especially if you are doing a lot of small random lookups, you are losing all of the optimization the framework provides for the efficient streaming of data through the system. Also, if you are accessing a limited availability resource, such as a SQL database, you may be introducing a bottleneck which prevents the solution from scaling (see Scalability).

Do not: Write code to copy files around manually. Once again, the Hadoop framework does a lot to optimize the flow of data during a compute job, but copying files around manually will always incur heavy network traffic (as data is replicated) and cannot be optimized since the system does not know what it is you are trying to do. Apart from the input/output mechanism, Hadoop also offers a distributed cache which can be used to bring data to your tasks. Those provided mechanisms are usually the most efficient way of bringing data and computation together.

Feature: Scalability

Hadoop is highly scalable in the sense that you can grow your cluster to many thousands of nodes and the computing throughput increases linearly with it. That's because Hadoop doesn't have bottlenecks. But that doesn't mean that a system implemented on top of Hadoop automatically doesn't have bottlenecks either.

Do not: Synchronize tasks. Since tasks are executing in parallel, it may be tempting to add synchronization between them and with tools such as ZooKeeper it is fairly easy to add all sorts of distributed synchronization mechanisms such as locks or queues. But every synchronization is a bottleneck, so workflows using those almost never scale.

Do not: Use a constant number of mapper or reducer tasks. The number of map tasks a job is split into is proportional to the size of the input, so the scaling of the number of mappers comes naturally as the size of the input grows (as long as you keep this in mind when implementing a custom input format). But the number of reducers is defined by the programmer. Since the number of reducers also defines the partitioning of the output, it may be tempting to keep it constant, or maybe even always keep just one reducer so that the output is just a single file. But if you do this, you are preventing the system from scaling out as the amount of data and the cluster grow.
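
One simple way around this (a sketch based on my own assumption of a reasonable bytes-per-reducer target, not an official recipe) is to derive the number of reducers from the size of the input at job setup time:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class ReducerSizing {

    // Roughly how much data one reducer should handle; tune this for your cluster.
    private static final long BYTES_PER_REDUCER = 1024L * 1024L * 1024L;  // ~1GB

    public static void sizeReducers(Job job, Path input) throws Exception {
        Configuration conf = job.getConfiguration();
        FileSystem fs = input.getFileSystem(conf);
        long inputBytes = fs.getContentSummary(input).getLength();
        int reducers = (int) Math.max(1, inputBytes / BYTES_PER_REDUCER);
        job.setNumReduceTasks(reducers);  // grows with the input instead of staying constant
    }
}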

Do not: Talk to the job tracker. It may be tempting to have tasks talk directly to the job tracker which scheduled the job, to find out extra information about the job or the other tasks. For example, you may be tempted to have the reducer ask the job tracker for the total number of input records, in order to turn the classic word count example into an IDF calculation. But the job tracker is a single resource, so such practices could prevent your job from scaling, and they could also give misleading results, since the job tracker may not always be serving up the latest statistics while the job is still running.

Feature: Reliability

The Hadoop MapReduce framework handles failures gracefully. If a node in the cluster fails mid-task, it will just re-run the task on a different node. It also hedges against having to wait for slower nodes by starting tasks on multiple machines to begin with (what is referred to as "speculative execution"). But if you are not aware of these mechanisms, your system may end up not functioning as expected.

Do not: Write an output format without considering output committer functionality. If you are using FileOutputFormat or its derivatives, only the output of successful tasks makes it to the output directory of the job and the rest is discarded. This logic is handled by the FileOutputCommitter. Some custom formats may not need committers, for example if they are writing to a key-value store which automatically keeps only one value per key, but in other cases, failed and subsequently restarted tasks, or tasks scheduled speculatively, could result in output data inconsistencies, such as duplicate records.

Do not: Write tasks with side effects. If a task has side effects, for example if it writes to an external data store directly, then retried tasks or tasks scheduled concurrently through speculative execution, may end up not functioning as desired.