Thursday, July 31, 2014

Eight steps to true AI: we are 75% there


Disclaimer: I am currently between jobs. The opinions expressed here are entirely my own and do not necessarily reflect the opinions of any past or future employers.

With the advent of deep learning, AI winter is officially over. Neural networks are back with a vengeance. Once again imaginations are running wild with post-apocalyptic scenarios in which an evil artificially intelligent being takes over the world.

I am fairly confident that true human-like AI, once we do eventually develop it, will be much more like the Bicentennial Man than Skynet. I am eagerly awaiting the moment. This makes me wonder how much time we actually have left before it happens and what pieces are still missing.

The exciting thing is that a lot of the pieces which I thought were still missing just a few years ago are largely in place now. The way I see it, only two pieces out of eight still remain before a big breakthrough happens, so in a way we are 75% there. Below is a list of these pieces. All of them have been researched to some extent, but I consider the first six largely a done deal. It is the last two which mostly separate us from the end goal.
  1.  Scale of data and computation. This is a pretty trivial one: computer systems today rarely match the scale of a single human brain, but the sizes are growing rapidly, so it is only a matter of time until we reach a critical mass of raw computational power. What is perhaps less trivial is the amount of data which needs to be fed to such large systems in order to make them useful, but thanks to our growing reliance on computers and the steady transition of our lives into the virtual world, the amount of available data is also growing exponentially.
     
  2.  Unsupervised learning. This is related to the above. Supervised learning simply doesn't scale, so the only way to learn from the vast amount of data available today is to let an algorithm sort it out on its own. Starting from Google's seminal work, recent results have shown that unsupervised algorithms can tackle problems which up until recently were only achievable using supervised learning.
     
  3.  Neural networks. I was always fascinated by neural networks and never gave up on them (even though I was sometimes embarrassed to admit this once they fell out of favor). Deep learning algorithms have breathed new life into the approach and neural networks are again at the forefront of machine learning research. More than the algorithms themselves, what I think is important is the amount of research going into computing systems designed to mimic the human brain, since there is still so much we can learn from it.
     
  4.  Hierarchical modularity. All complex systems follow a hierarchically modular design in which simpler components are combined to produce ever more complex ones. Up until recently, most machine learning models did not have this property. This changed with the advent of deep learning networks in which each subsequent layer builds on the previous one to learn ever more complex features. Even though this strictly layered structure is not very natural (more on this in point 8), the general idea of modularity is a crucial one.
     
  5.  Information-theoretic approaches to modeling understanding. This recent article shows that there is still a lot of controversy around what it means to understand something and how to model understanding computationally. But modeling understanding may turn out simpler than it seems. It may boil down to what are essentially compression algorithms.

    Two approaches which emerged recently seem to point in this direction. One is sparse coding. Sparse coding is a technique used in deep neural networks. The essence of the idea behind it is to represent the observed data with the smallest generative model possible. As part of a deep neural network, these models learn to represent ("understand"?) complex visual features, such as human (or cat) faces.

    A related approach involves building models which can take incomplete or noisy data and reconstruct the missing parts of it. My favorite example is word2vec, which trains a model to guess words in a sentence given the surrounding words. The fascinating thing about it is that the resulting model (vector representations of words) reflects many properties of the words' actual meanings. For example, variations of a word and its synonyms end up close in the vector space. Linear arithmetic on the vectors reflects analogies, such as "Rome is to Italy as Paris is to France", or "man is to king as woman is to queen". In a way, the model learns to "understand" the words.

    Whether these simple information-theoretic approaches are really the key to what we perceive as understanding may forever remain a philosophical question. But the results speak for themselves and so far the results of applying these approaches to computer vision and natural language processing tasks are enough to convince me that they are the key ingredient and that there is not much more to what we perceive as understanding.
     
  6.  Transfer learning. Transfer learning is applying the knowledge gained from learning one domain to a different domain. It is clear that intelligent life forms do this, but most machine learning algorithms used today are trained and evaluated on one very specific problem in one domain only.

    I am going to give this one a check mark anyway, a little bit ahead of time. First, because it is an active area of research and discussion. Second, because the approaches described in the previous point are a precursor to full-blown transfer learning. For example, word representations learned by training a model to guess surrounding words in a sentence can then be used for a wide variety of natural language processing tasks, such as part-of-speech tagging, named entity recognition, sentiment analysis, etc. While still not cross-domain (all these are NLP problems), we are seeing models trained to solve one problem contribute to solving other very different problems within the same domain.
     
  7.  Evolutionary algorithms. The majority of algorithms used to train machine learning models today follow the same basic principle. This is true both for simple models, such as logistic regression or backpropagation neural networks, and for the newer deep learning approaches. The principle is to use gradient descent or a related optimization algorithm to optimize a well-defined function. This function can be the error in the case of supervised learning, or a measure of how well the model represents observed data in the case of the newer unsupervised deep learning approaches.

    But nature does not do gradient descent. Even though the theory of evolution often cites a fitness function, no such function can be formally defined. Organisms exist and interact with each other in an ever-changing environment and what it means to be "fit" is different in every instant of time and every point in space. Rather than optimizing a specific function, the optimization algorithm used by nature is the very simple, yet profoundly powerful duet of variation and selection, which does not require the definition, or existence, of an explicit fitness function.

    Evolutionary algorithms which do not optimize for a well-defined function, but are based purely on variation and selection, have existed for a long time. They are not, however, widely used in training machine learning models today, because they are much less efficient than the algorithms which optimize a well-defined function.

    This is one thing which I believe will change in the next few years and will open the path to true AI. The advantage of gradient-descent-like algorithms, the ability to quickly converge in a defined direction, will also be their undoing, since they are not able to optimize for the unknown, or for what may not even exist. Real intelligence is a constantly moving target and cannot be modeled by a simple function.

    My prediction is that in the next few years evolutionary algorithms will experience a revival similar to the one neural networks experienced recently. We will start using purely evolutionary algorithms for training models. This will require more computational power, but will ultimately produce much more complex and powerful systems.
     
  8.  Learning network topologies. The neural network models being trained today have a fixed topology. Training consists of updating the connection weights. Whereas it has been postulated that training the structure itself could be beneficial, there is no good way of doing it using the current learning algorithms. This will change after the previous piece falls into place and we move to evolutionary training algorithms. When that happens, the network topologies will evolve as well. Perhaps at this point the topology will even fully define the model and there will be no need for additional weights.

    I am pretty sure that those evolved topologies will not have the neat man-made layered architecture that neural network models have today. Rather, my bet is that they will resemble other graphs which arise naturally, such as the actual connections in the brain, social networks and airline routes: small-world networks. These topologies evolved naturally and are an effective way of creating highly connected scalable systems in which information flows efficiently.
     
  9. (Bonus) Redundancy and robustness. Human beings can lose large chunks of their brain, sometimes with surprisingly little impairment. By contrast, deep learning networks get thrown off by a couple of distorted pixels. Clearly, real-life intelligent systems have a lot of redundancy and robustness built in and true AI will necessarily have it as well. I am not including it in the count, though, because I think that it will be an emergent property of the system, which will just arise naturally once all of the other pieces fall into place.
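
As a rough illustration of the small-world idea from point 8, here is a minimal sketch in plain Python. It builds a regular ring lattice, rewires a small fraction of its edges at random (in the style of the Watts-Strogatz model, which is my choice for the example and not something the points above prescribe), and compares average path lengths. The function names and the parameters `n`, `k` and `p` are hypothetical.

```python
import random
from collections import deque


def ring_lattice(n, k):
    """Ring of n nodes, each linked to its k nearest neighbours (k/2 per side)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj


def rewire(adj, p, rng):
    """Move one end of each edge to a random node with probability p."""
    new = {i: set(nbrs) for i, nbrs in adj.items()}
    edges = sorted({(min(i, j), max(i, j)) for i in adj for j in adj[i]})
    nodes = sorted(adj)
    for i, j in edges:
        if rng.random() < p:
            # Only pick targets that do not create self-loops or duplicates.
            candidates = [t for t in nodes if t != i and t not in new[i]]
            if candidates:
                t = rng.choice(candidates)
                new[i].discard(j)
                new[j].discard(i)
                new[i].add(t)
                new[t].add(i)
    return new


def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered pairs (BFS)."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


lattice = ring_lattice(100, 4)
small_world = rewire(lattice, 0.1, random.Random(0))
print("lattice:    ", average_path_length(lattice))
print("small world:", average_path_length(small_world))
```

A handful of random shortcuts is typically enough to shorten the average path considerably, while most of the local neighbourhood structure stays intact.
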

The past few years have witnessed continued explosive growth in data sizes and in the availability of computational resources. This was coupled with advances in algorithms, such as deep learning, which make use of both and exhibit increasingly intelligent behavior. These behaviors, such as the ability to learn meaningful visual features or word representations, increasingly resemble elements of human intelligence.
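
To make the word-vector arithmetic from point 5 concrete, here is a toy sketch. The vectors are made up by hand for illustration and are not the output of a trained word2vec model; the `analogy` helper and the dictionary name are likewise hypothetical. It only shows the mechanics: subtract, add, then look for the nearest remaining word by cosine similarity.

```python
import math

# Hand-made toy vectors (NOT trained word2vec output). The three axes are
# loosely "royalty", "gender" and "fruit", chosen purely for illustration.
TOY_VECTORS = {
    "king":  (0.9, 0.8, 0.1),
    "queen": (0.9, -0.8, 0.1),
    "man":   (0.1, 0.8, 0.0),
    "woman": (0.1, -0.8, 0.0),
    "apple": (-0.5, 0.0, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors=TOY_VECTORS):
    """Answer "a is to b as c is to ?" with the word nearest to b - a + c."""
    query = tuple(
        vb - va + vc
        for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    )
    # Exclude the three input words, as is usual for analogy queries.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


print(analogy("man", "king", "woman"))
```

With real trained embeddings the arithmetic is the same, only the vectors have hundreds of dimensions and the result is the closest match, not an exact one.
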

Yet, we have not been able to create anything which fully resembles true AI, so clearly there is still something missing. I postulate that this missing piece is moving away from optimizing well-defined functions and towards purely evolutionary trial-and-error algorithms. Furthermore, I postulate that these algorithms should be applied to train model topologies, and that these topologies will end up resembling small-world networks. Such approaches will need a lot of computational power and data to train, but those will be available in the near future.
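
A variation-and-selection loop of the kind described in point 7 can be sketched in a few lines. This is only an illustration under heavy assumptions: the "environment" is a single number that drifts every generation, and the survival rule (keep the half of the offspring closest to the current environment) is a stand-in I chose so that the code can run at all. The names `evolve`, `drift` and `sigma` are hypothetical. The point is that nothing here computes a gradient, and the target keeps moving under the population.

```python
import random


def evolve(generations=200, pop_size=40, drift=0.05, sigma=0.2, seed=0):
    """Track a drifting environment using only mutation and survival."""
    rng = random.Random(seed)
    population = [rng.uniform(-1.0, 1.0) for _ in range(pop_size)]
    environment = 0.0
    history = []
    for _ in range(generations):
        # The environment changes a little every generation.
        environment += drift
        # Variation: every individual leaves two randomly mutated offspring.
        offspring = [
            parent + rng.gauss(0.0, sigma)
            for parent in population
            for _ in range(2)
        ]
        # Selection: only the half best suited to the current environment
        # survives (an assumed stand-in rule, see the note above).
        offspring.sort(key=lambda x: abs(x - environment))
        population = offspring[:pop_size]
        history.append((environment, sum(population) / pop_size))
    return history


final_environment, final_mean = evolve()[-1]
print("environment:", final_environment, "population mean:", final_mean)
```

The population follows the environment without ever being told which direction it is moving in, at the cost of evaluating many more candidates than a gradient step would need.
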

Of course, there is no way of really predicting when and how the singularity will happen... but I do have the feeling that we are getting awfully close. What do you think?