Thursday, July 31, 2014

Eight steps to true AI: we are 75% there

Disclaimer: I am currently between jobs, the opinions expressed here are entirely my own and do not necessarily reflect the opinions of any past or future employers.

With the advent of deep learning, AI winter is officially over. Neural networks are back with a vengeance. Once again imaginations are running wild with post-apocalyptic scenarios in which an evil artificially intelligent being takes over the world.

I am fairly confident that true human-like AI, once we do eventually develop it, will be much more like the Bicentennial Man than Skynet. I am eagerly awaiting the moment. This makes me wonder how much time we actually have left before it happens and what pieces are still missing.

The exciting thing is that a lot of the pieces which just a few years ago I thought were still missing, are largely in place now. The way I see it, only two pieces out of eight still remain before a big breakthrough happens, so in a way we are 75% there. Below is a list of these pieces. All of them have been researched to some extent, but I consider the first six largely a done deal. It is the last two which mostly separate us from the end goal.
  1.  Scale of data and computation. This is a pretty trivial one: computer systems today rarely match the scale of a single human brain, but the sizes are growing rapidly, so it is only a matter of time until we reach a critical mass of raw computational power. What is perhaps less trivial, is the amount of data which needs to be fed to such large systems in order to make them useful, but thanks to our growing reliance on computers and the steady transition of our lives into the virtual world, the amount of available data is also growing exponentially.
  2.  Unsupervised learning. This is related to the above. Supervised learning simply doesn't scale, so the only way to learn from the vast amount of data available today is to let an algorithm sort it out on its own. Starting from Google's seminal work, recent results have shown that unsupervised algorithms can tackle problems which up until recently were only achievable using supervised learning.
  3.  Neural networks. I was always fascinated by neural networks and never gave up on them (even though I was sometimes embarrassed to admit this once they fell out of grace). Deep learning algorithms have breathed new life into the approach and neural networks are again in the forefront of machine learning research. More than the algorithms themselves, what I think is important is the amount of research going into computing systems designed to mimic the human brain, since there is still so much we can learn from it.
  4.  Hierarchical modularity. All complex systems follow a hierarchically modular design in which simpler components are combined to produce ever more complex ones. Up until recently, most machine learning models did not have this property. This changed with the advent of deep learning networks in which each subsequent layer builds on the previous one to learn ever more complex features. Even though this strictly layered structure is not very natural (more on this in point 8), the general idea of modularity is a crucial one.
  5.  Information-theoretic approaches to modeling understanding. This recent article shows that there is still a lot of controversy around what it means to understand something and how to model understanding computationally. But modeling understanding may turn out simpler than it seems. It may boil down to what are essentially compression algorithms.

    Two approaches which emerged recently seem to point in this direction. One is sparse coding. Sparse coding is a technique used in deep neural networks. The essence of the idea behind it is to represent the observed data with the smallest generative model possible. As part of a deep neural network, these models learn to represent ("understand"?) complex visual features, such as human (or cat) faces.

    A related approach involves building models which can take incomplete or noisy data and reconstruct the missing parts of it. My favorite example is word2vec which trains a model to guess words in a sentence given the surrounding words. The fascinating thing about it is that the resulting model  vector representations of words  reflects many properties of the words' actual meanings. For example, variations of a word and its synonyms end up close in the vector space. Linear arithmetic on the vectors reflects analogies, such as "Rome is to Italy as Paris is to France", or "man is to king as woman is to queen". In a way, the model learns to "understand" the words.

    Whether these simple information-theoretic approaches are really the key to what we perceive as understanding may forever remain a philosophical question. But the results speak for themselves and so far the results of applying these approaches to computer vision and natural language processing tasks are enough to convince me that they are the key ingredient and that there is not much more to what we perceive as understanding.
  6.  Transfer learning. Transfer learning is applying the knowledge gained from learning one domain to a different domain. It is clear that intelligent life forms do this, but most machine learning algorithms used today are trained and evaluated on one very specific problem in one domain only.

    I am going to give this one a check mark anyway, a little bit ahead of time. First, because it is an active area of research and discussion. Second, because the approaches described in the previous point are a precursor to full-blown transfer learning. For example, word representations learned by training a model to guess surrounding words in a sentence can then be used for a wide variety of natural language processing tasks, such as part-of-speech tagging, named entity recognition, sentiment analysis, etc. While still not cross-domain (all these are NLP problems), we are seeing models trained to solve one problem contribute to solving other very different problems within the same domain.
  7.  Evolutionary algorithms. The majority of algorithms used to train machine learning models today all follow the same basic principle. This is true for both simple models, such as logistic regression, or backpropagation neural networks, and the newer deep learning approaches. The principle is to use gradient descent or related optimization algorithm to optimize a well-defined function. This function can be the error in the case of supervised learning, or a measure of how well the model represents observed data in the case of the newer unsupervised deep learning approaches.

    But nature does not do gradient descent. Even though the theory of evolution often cites a fitness function, no such function can be formally defined. Organisms exist and interact with each other in an ever-changing environment and what it means to be "fit" is different in every instant of time and every point in space. Rather than optimizing a specific function, the optimization algorithm used by nature is the very simple, yet profoundly powerful duet of variation and selection, which does not require the definition, or existence, of an explicit fitness function.

    Evolutionary algorithms which do not optimize for a well-defined function, but are based purely on variation and selection, have existed for a long time. They are not, however, widely used in training machine learning models today, because they are much less efficient than the algorithms which optimize a well-defined function.

    This is one thing which I believe will change in the next years and will open the path to true AI. The advantage of gradient-descent-like algorithms, the ability to quickly converge in a defined direction, will also be their undoing, since they are not able to optimize for the unknown, or for what may not even exist. Real intelligence is a constantly moving target and cannot be modeled by a simple function.

    My prediction is that in the next years evolutionary algorithms will experience a revival similar to the one neural networks experienced recently. We will start using purely evolutionary algorithms for training models. This will require more computational power, but will ultimately produce much more complex and powerful systems.
  8.  Learning network topologies. The neural network models being trained today have a fixed topology. Training consists of updating the connection weights. Whereas it has been postulated that training the structure itself could be beneficial, there is no good way of doing it using the current learning algorithms. This will change after the previous piece falls into place and we move to evolutionary training algorithms. When that happens, the network topologies will evolve as well. Perhaps at this point the topology will even fully define the model and there will be no need for additional weights.

    I am pretty sure that those evolved topologies will not have that neat man-made layered architecture modern-day neural network models have today. Rather, my bet is that they will resemble other graphs which occur naturally in nature, such as the actual connections in the brain, social networks and airline routes: small-world networks. These topologies evolved naturally and are an effective way of creating highly connected scalable systems in which information flows efficiently.
  9. (Bonus) Redundancy and robustness. Human beings can lose large chunks of their brain sometimes with surprisingly little impairment. Conversely, deep learning networks get thrown off by a couple of distorted pixels. Clearly, real life intelligent systems have a lot of redundancy and robustness built in and true AI will necessarily have it as well. I am not including it in the count, though, because I think that it will be an emergent property of the system, which will just arise naturally once all of the other pieces fall into place.
The past few years have witnessed a continued explosive growth in data sizes and the availability of computational resources. This was coupled with advances in algorithms, such as deep learning, which make use of it and exhibit increasingly intelligent behavior. These behaviors, such as the ability to learn meaningful visual features, or word representations, are increasingly resembling elements of human intelligence.

Yet, we have not been able to create anything which fully resembles true AI, so clearly there is still something missing. I postulate that this missing piece is moving away from optimizing well-defined functions and towards purely evolutionary trial-and-error algorithms. Furthermore, these algorithms should be applied to train model topologies and that these topologies will end up resembling small-world networks. Such approaches will need a lot of computational power and data to train, but those will be available in the near future.

Of course, there is no way of really predicting when and how the singularity will happen... but I do have the feeling that we are getting awfully close. What do you think?

Monday, June 23, 2014

Data dredging and statistical significance explained, or how to win a meritless discrimination lawsuit

I just came up with the following interview question for data scientists:

You apply for a position at a company and interview with five individuals, each of which then casts a hire / no-hire vote. You don't get the job and decide to file a discrimination lawsuit against the company. You don't specify which of the five interviewers was being discriminatory or if you were being discriminated against based on gender, race, religion, age, or sexual orientation. The judge forces the company to disclose historical hire/no-hire decisions for each of the five interviewers and declares that if you can show that one of the interviewers is significantly prejudiced (at p = 0.05) with respect to one of the five characteristics, then they will rule in your favor. Assuming that none of the interviewers are actually biased, what is your chance of winning the case?

Can you guess the answer?

The answer is: approximately 72%. Yep, you are almost three times as likely to win the case than to lose it, even when no actual discrimination is taking place.

This apparent paradox is a typical example of data dredging and it is what people mean when they warn against the perils of big data, or even downfall of science. The problem is not intrinsic to big data or science, but stems from what most problems tend to stem from: people doing things without fully understanding what they are doing.

So I thought I would take this chance to explain some of the key concepts behind statistical significance and data dredging using the discrimination lawsuit paradox as an illustration.

Statistical significance

Let's say you have historical interview decision data for interviewer X and it looks like this:

CandidateGenderX's decision
2femaleno hire
5maleno hire
6maleno hire
8femaleno hire
9femaleno hire
11maleno hire
12maleno hire
13femaleno hire
14maleno hire
16maleno hire
19maleno hire
20maleno hire

The question is: is X discriminating based on gender, or in other words, does X have a bias towards men or women?

We can see that he (or she?) voted hire for 3 out of 11 men (27%) and 5 out of 9 women (56%) — more than double the rate. So it looks like he may have some bias against men. But does this really mean that he is discriminating, or did he just happen to come across some really qualified female candidates and some really unqualified male ones?

This is the question which statistical significance attempts to answer, but it is important to realize that no mathematical system can actually answer this question conclusively, so we go after "the next best thing" and instead try to answer a related question:

"Let's give X the benefit of the doubt and assume that he is not discriminating (the 'null hypothesis'). What are the odds that he would be so unlucky with his male candidates to end up with this amount (or more) of skew towards women?"

To answer this question, let's assume that X was completely blind to the gender when he gave away his eight hire votes and that gender was randomly assigned to the candidates after this fact. If that were the case, the chance of men getting x of the hire votes and women getting 8-x would be:
(11) * (9) / (20), so:
Hire's for menHire's for womenProbability

The p-value is simply the sum of the probabilities in the row in question (in this case 16.504%) and all the rows which are even more "extreme" (in this case, even more biased against men). So in this case the p-value is around 20%, or 0.2, which is usually not considered "statistically significant". The lower the p-value, the more statistically significant the result. Typical p-value thresholds for achieving statistical significance include 0.1, 0.05, or 0.01. So if there had been only two men among the eight that X voted hire for, the bias would have been considered significant having a p-value of around 0.04. But since he (or she) voted hire for three men, the bias should not be considered statically significant.

A common misconception at this point is to assume that this means that the probability that X is biased is 1-p = 80%. But that's simply not how probability works, which we can easily see by applying Bayes rule:

P(X is biased | this or more extreme result)
     = P(NOT null hypothesis | this or more extreme result)
     = 1 - P(null hypothesis | this or more extreme result)
     = 1 - P(this or more extreme result | null hypothesis) * P(null hypothesis) / P(this or more extreme result)
     = 1 - p * P(null hypothesis) / P(this or more extreme result)
     ≠ 1 - p

This misconception keeps getting repeated leading to a lot of confusion, but the truth is that you just can't compute the actual probability that X is biased, unless you have access to all the different parallel universes in some of which X is biased and others isn't.

Note: In real life we would also have to account for the overall rate of hire vs. no-hire decisions for the two genders, which may not actually be equal for reasons unrelated to discrimination. For example, girls may tend to chose different colleges than boys and some colleges may be better than others at preparing candidates for whatever is the job in question. But this would make the math more complicated, so for the purpose of this example, let's assume that the overall rates are the same for men and women in the general population.

Data dredging

Coming back to the original interview question: we have five different interviewers and each of the five interviewers can be prejudiced with respect to five different possible factors (gender, race, religion, age and sexual orientation). This makes 25 possible combinations of interviewer and prejudice. For each of these 25 combinations we can compute the p-value just like we did in the previous section. Assuming that none of the interviewers is actually biased, the probability of any one of these results showing statistically significant bias with a p-value threshold of 0.05 is 0.05 — that's just the sheer definition of p-value. But this means that the probability of at least one of the 25 showing statistically significant bias is 1 - (1-0.05)25 ≈ 0.72. 

In general, the more hypotheses you are testing in hope of uncovering a "statistically significant" one, the more likely that at least one of them will show significance due to pure chance even if no underlying pattern exists (a false positive). This chance gets lower if you lower the p-value threshold, but no matter how low the p-value threshold is, given enough possible hypotheses, the probability of at least one false positive quickly approaches 1 (see graph below). 


This shows that you should always take every statistically significant result with a grain of salt and ask yourself: how many hypotheses were tested before this seemingly significant result was found? You can see many examples of this in medicine (testing numerous drugs to see what works, or testing multiple combinations of genes and conditions to find a correlation), as well as software development (A/B testing multiple features and adopting only the ones which show "significant" improvement). There is nothing wrong with computing p-values, but taking action without knowing what they mean exactly can quite literally have perilous results.

Having gone through this whole chain of thought, I decided that it probably wasn't a good idea to give this interview question to actual candidates, just in case it gives them any ideas... ;)

Friday, January 17, 2014

Why the sum of all natural numbers is not negative one twelfth

This video posted by Numberphile has been making the rounds lately, supposedly proving that the sum of all natural numbers all the way to infinity is equal to negative one twelfth. The guy in the video, Tony, uses pretty basic math to derive this formula and at first glance it may not be immediately clear what is wrong with his derivation. To add to the mystery, he flashes a page from a String Theory book which quotes the formula, albeit in a very different context.

Despite a lot of controversy this video seems to be generating lately, I don't see a good explanation of what exactly is wrong with this proof among the top Google results, so I'm going to try to make it clear here.

Main points

The derivation uses "..." to denote something different in the case of computing S1 and something else in the case of S2, hence the entire derivation is invalid.

There is no interpretation of "..." which, if used consistently alongside sums of natural numbers, would lead to S = 1 + 2 + 3 + ... = -1/12.

Now, mathematical models may exist which use similar notation to the one used for integers and sums, but model phenomena other than natural numbers. Such models may abide by different rules and you may get results which don't carry over to the natural number world. This is probably what the String Theory book is doing (you can check it out at Google Books here). But these mathematical models would still have to be internally consistent in order to be useful, unlike the Numberphile framework which could be used to derive contradictory results, hence has no application other than entertainment.


The first part or the proof concludes that:

S1 = 1 - 1 + 1 - 1 + 1 - 1 ... = 1/2

The argument behind this is that if the sum were cut short at any given point: if this point were odd, the sum would be 1 and if it were even, the sum would be 0. Tony goes on to say: "Do we stop at an odd or even point? We don't know, so we take the average of the two."

Note that this whole argument relies on the assumption that we do stop the series at some point. We just don't know if this point is odd or even, but not stopping at all is not an option. This is an essential distinction.

The traditional mathematical definition of infinite sums, assumes that you never stop. Since you never stop, you never stop at a point which is odd or even, hence the sum is undefined (we'd call this a divergent series).

So what Tony is really computing here, is not the infinite sum, but the expected value of a finite sum stopped at a random place. In more verbose notation we would write what Tony said as:

S1 = 1 - 1 + 1 - 1 + 1 - 1 ... (n times)
S1 = {0 when n is even; 1 when n is odd}
P(n is even) = 1/2
P(n is odd) = 1/2
E(S1) = 1/2 * 0 + 1/2 * 1 = 1/2

There would be nothing wrong with using the infinite sum notation to denote the expected value of a finite sum. In fact, the above definition is called the Cesàro sum and is a well-known mathematical concept. The problem is that having chosen the Cesàro sum as the interpretation of "...", Tony would have to stick to this definition for the remainder of the proof. But he does not do that. In fact, for the rest of the proof, he uses the same notation to mean the actual infinite sum, which never stops.

This is clear, for example, in his derivation of S2. The derivation of S2 is based on adding the series to a version of itself shifted by one, like so:

S2 = 1 - 2 + 3 - 4 + 5 - 6 ...
2 * S2 = 1 - 2 + 3 - 4 + 5 - 6 ...
+ 1 - 2 + 3 - 4 + 5 ...

This is a perfectly valid operation when you are dealing with actual infinite series, because those go on forever. So even if you shift them, they can still align perfectly all the way to infinity. But it does not work under the expected value definition, since the expected value definition assumes that the series stops at some point, in which case the shifted version of S2 would have an extra term at the end which would not align with the un-shifted version.

With all this in mind, here is the video again: