The Deep History of Deep Learning: A Perspective

BY DREW MELLOR

Many of the big recent news stories in AI have involved Deep Learning: it is an enabling technology for self-driving cars like Waymo’s; it powered the system that beat a world champion at Go, a two-player strategy board game that had long resisted AI; and it is the speech recognition technology that detects “Hey Siri” for Apple’s personal assistant.

Deep Learning refers to artificial neural networks that are organised into a series of layers, and to the novel techniques that have been engineered to train them. But before the field matured into its current state, methodological and engineering innovations stretching back to the 1940s (when AI itself was originating) had to fall into place first.

Looking back, the many threads of Deep Learning come together in a coherent way, but at some stages prevailing wisdom suggested that neural networks were little more than a curiosity.

In this blog post, I would like to share with you some of the significant advances and developments from the history of Deep Learning and neural networks.

It is one perspective among many: I shall focus on the aspects of Deep Learning history that I know best from working in natural language processing at D2D, as there is far too much to do justice to in an already lengthy post.

Birth of Neural Networks and AI, Early Enthusiasm to Waning Interest

In 1943, artificial neural networks first appeared in academic literature when Warren McCulloch and Walter Pitts drew together some basic physiology of neuron functioning, aspects of logic and theories of computation and proposed a model consisting of a network of connected neurons.

Many now recognise this landmark model as the earliest work in AI. In this model, the neurons in the network exist in either an ‘on’ or ‘off’ state, where an ‘off’ neuron switches ‘on’ with sufficient stimulation from its neighbours. Initially these networks were fairly static and their capability for practical learning was low.
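As a rough illustration of the ‘on’/‘off’ behaviour described above, here is a minimal Python sketch of a single threshold neuron. It is a simplification rather than the original 1943 formulation: it ignores inhibitory inputs, and the names mcp_neuron, logical_and and logical_or are invented for this example.

```python
def mcp_neuron(inputs, threshold):
    """Switch 'on' (1) when enough input neurons are 'on', else stay 'off' (0)."""
    return 1 if sum(inputs) >= threshold else 0


def logical_and(a, b):
    # Both neighbours must be 'on' to reach the threshold of 2.
    return mcp_neuron([a, b], threshold=2)


def logical_or(a, b):
    # A single 'on' neighbour is enough to reach the threshold of 1.
    return mcp_neuron([a, b], threshold=1)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "AND:", logical_and(a, b), "OR:", logical_or(a, b))
```

Note that the thresholds are fixed by hand: nothing here is learned, which is the sense in which these early networks were static.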

So the primary interest in neural networks at that time was as a biologically plausible model of computation.

Nevertheless, Marvin Minsky (then a PhD student, soon to be a prominent AI researcher) and his colleague Dean Edmonds used vacuum tubes to simulate 40 neurons, building the SNARC, the first artificial neural network to be physically instantiated in hardware.

By the 1960s improved learning rules had been developed for single layer networks, such as the Perceptron. But in 1969 Minsky, together with Seymour Papert, showed that a single layer Perceptron cannot learn to identify whether the two inputs to a binary function differ from each other (the ‘exclusive or’, or XOR, problem), and the Perceptron’s popularity declined.

The significance of single layer networks had been oversold. Interest in neural networks receded in the ‘AI winter’ of the 1970s, while around the same time expert systems and inferential techniques began to flourish and dominate AI research.
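The XOR limitation can be seen in a few lines of Python. The sketch below uses the classic perceptron learning rule on a single neuron with two inputs; the function name, the learning rate and the epoch limit are arbitrary choices for illustration. Training settles on a solution for AND, which is linearly separable, but never completes an error-free pass for XOR.

```python
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def train_perceptron(samples, epochs=25, lr=1.0):
    """Train a single-layer perceptron; return (weights, bias, converged)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            prediction = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            delta = target - prediction
            if delta != 0:
                # Perceptron rule: nudge the weights towards the target.
                w[0] += lr * delta * x1
                w[1] += lr * delta * x2
                b += lr * delta
                errors += 1
        if errors == 0:
            # A full pass with no mistakes: the data has been separated.
            return w, b, True
    return w, b, False


if __name__ == "__main__":
    print("AND converged:", train_perceptron(AND_DATA)[2])
    print("XOR converged:", train_perceptron(XOR_DATA)[2])
```

Raising the epoch limit does not help with XOR: no single straight line separates its ‘on’ cases from its ‘off’ cases, so every pass contains at least one mistake.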

Neural Networks Return

The mid-1980s saw things pick up again for neural networks.

Several researchers independently extended the earlier generation of learning rules using the chain rule from calculus. The result was backpropagation, a technique suitable for training multilayer networks. (It had in fact first been developed in the 1960s, but went largely unrecognised until this later reinvention.)
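To make the role of the chain rule concrete, here is a minimal Python sketch for a toy network with one input, one tanh hidden neuron and one linear output. The network, its squared-error loss and all the names are assumptions chosen for illustration. The gradients computed by backpropagation are compared with a numerical estimate as a sanity check.

```python
import math


def forward(w1, w2, x):
    h = math.tanh(w1 * x)   # hidden activation
    y = w2 * h              # linear output
    return h, y


def loss(w1, w2, x, t):
    _, y = forward(w1, w2, x)
    return 0.5 * (y - t) ** 2


def backprop(w1, w2, x, t):
    """Gradients of the loss w.r.t. w1 and w2, via the chain rule."""
    h, y = forward(w1, w2, x)
    dL_dy = y - t                   # derivative of 0.5 * (y - t)^2
    dL_dw2 = dL_dy * h              # because y = w2 * h
    dL_dh = dL_dy * w2              # pass the error back to the hidden neuron
    dL_dz = dL_dh * (1.0 - h * h)   # derivative of tanh(z) is 1 - tanh(z)^2
    dL_dw1 = dL_dz * x              # because z = w1 * x
    return dL_dw1, dL_dw2


def numeric_grad(w1, w2, x, t, eps=1e-6):
    """Central-difference estimate of the same two gradients."""
    g1 = (loss(w1 + eps, w2, x, t) - loss(w1 - eps, w2, x, t)) / (2 * eps)
    g2 = (loss(w1, w2 + eps, x, t) - loss(w1, w2 - eps, x, t)) / (2 * eps)
    return g1, g2


if __name__ == "__main__":
    print("backprop:", backprop(0.7, -1.3, 0.5, 0.2))
    print("numeric: ", numeric_grad(0.7, -1.3, 0.5, 0.2))
```

The key point is that the error at the output is passed backwards one layer at a time, so the same recipe extends to networks with many layers.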

This methodological innovation was neatly complemented by a theoretical advance, the Universal Approximation Theorem, which states that a network with just a single intermediate (hidden) layer can approximate any continuous function on a closed, bounded domain as closely as desired, provided the layer has enough neurons. This addressed the doubts raised by the inability of single layer networks to solve the XOR problem.
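The following Python sketch shows that one intermediate layer is enough for XOR. It is not a demonstration of the theorem itself, only of this one case, and the weights are set by hand rather than learned; the names step and xor_network are invented for the example.

```python
def step(z):
    """Threshold activation: 'on' (1) when the input is positive."""
    return 1 if z > 0 else 0


def xor_network(x1, x2):
    # Hidden layer: one neuron behaves like OR, the other like AND.
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    # Output: 'on' when OR fires but AND does not, i.e. the inputs differ.
    return step(h_or - h_and - 0.5)


if __name__ == "__main__":
    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, "->", xor_network(x1, x2))
```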

Recurrent networks (networks that feed their outputs back into their inputs) first appeared in the early 1980s. Their tantalising promise was a way to learn from sequential data, such as text and speech, but in practice their performance had mostly been disappointing.

In the mid-to-late 1990s, careful attention to the recurrence mechanism produced a type of artificial neuron called the Long Short-Term Memory (LSTM) cell, which can selectively retain or forget information and so mitigates the main stumbling block, the “exploding/vanishing gradient” problem. A later refinement, the Gated Recurrent Unit (GRU), uses a simpler cell with fewer parameters, which can make training more efficient.
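To give a feel for how gating lets a cell retain or forget, here is a minimal Python sketch of one step of the commonly used LSTM formulation with a forget gate. It is reduced to scalars for readability (real cells use vectors and weight matrices), and the parameter names and the make_params helper are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell; p is a dict of weights and biases."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # keep part of the old memory, add part of the new
    h = o * math.tanh(c)     # expose part of the memory as the output
    return h, c


def make_params(forget_bias, input_bias):
    # Hypothetical settings: the forget and input gates depend only on their biases.
    return {"wf": 0.0, "uf": 0.0, "bf": forget_bias,
            "wi": 0.0, "ui": 0.0, "bi": input_bias,
            "wo": 1.0, "uo": 1.0, "bo": 0.0,
            "wg": 1.0, "ug": 1.0, "bg": 0.0}


def run(p, inputs, c0=1.0):
    """Feed a sequence through the cell, starting from memory value c0."""
    h, c = 0.0, c0
    for x in inputs:
        h, c = lstm_step(x, h, c, p)
    return h, c


if __name__ == "__main__":
    print("gate held open:", run(make_params(10.0, -10.0), [0.3] * 50)[1])
    print("gate closed:   ", run(make_params(-10.0, -10.0), [0.3] * 50)[1])
```

With the forget gate held close to 1, the memory value survives 50 steps almost unchanged; with it close to 0, the memory is wiped after the first step. In a trained network the gates are driven by the data rather than fixed biases.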

Nowadays, use of LSTM and GRU is commonplace in natural language processing.

Deep Learning Emerges

Further innovations in the training of neural networks led to more significant advances. For example, it was found that if two tasks resemble each other, a network trained on one can be transferred to the other with some supplementary training and achieve state-of-the-art results.

This technique is called pre-training: the network is first trained on one task (the proxy task) and then given supplementary training on a second (the target task). Generally, much less training is required to learn the target task if the network has already been pre-trained on the proxy task. Pre-training is an attractive option when there is an abundant and cheap source of training data available for the proxy task.

A good example is DeepMoji, which used the occurrence of emoji in tweets, a plentiful data source, for pre-training before learning its target tasks of detecting emotion and sarcasm in text.

After pre-training, training on the target task can be accelerated by freezing layers of neurons in succession over a number of training passes, so that each pass adjusts only the weights of the layers that remain unfrozen. Iterative, hierarchical training techniques like this one are very characteristic of Deep Learning.
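One possible reading of such a schedule is sketched below in Python. It only works out which layers are frozen and which are trainable on each pass; it is a hypothetical illustration, not DeepMoji’s published procedure, and the function name is invented. In a real framework, each pass would mark the frozen layers as non-trainable before fitting the rest.

```python
def progressive_freeze_schedule(num_layers):
    """Yield (frozen, trainable) lists of layer indices, one pair per pass.

    Assumed scheme: each pass freezes one more layer, starting from the
    layer nearest the input, and trains the layers that remain.
    """
    for n_frozen in range(num_layers):
        frozen = list(range(n_frozen))
        trainable = list(range(n_frozen, num_layers))
        yield frozen, trainable


if __name__ == "__main__":
    for pass_no, (frozen, trainable) in enumerate(progressive_freeze_schedule(3), 1):
        print("pass", pass_no, "frozen:", frozen, "trainable:", trainable)
```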

Recognition, Expansion and Significance

In 2012, individual pieces of the Deep Learning puzzle were instantiated in an eight-layered network, AlexNet, which achieved prominence by decisively winning the ImageNet Large Scale Visual Recognition Challenge (beating its nearest competitor’s error rate by an unprecedented margin of more than ten percentage points).

The challenge involved recognising objects belonging to a wide variety of categories (including, yes, cats and dogs) from images tagged for those categories via crowdsourcing. 

After AlexNet, activity in Deep Learning expanded greatly. Machine learning practitioners found they could apply Deep Learning methods with improved performance, particularly on computer vision and natural language processing tasks, two areas that had previously resisted progress despite sustained attention within the field of AI.

But the significance of Deep Learning goes beyond winning competitions. Rather, it lies in the capacity to do feature engineering internally, so the practitioner can invest more in network design and less in architecting input features.

Feature engineering is the process of designing how to convert raw data into the inputs of a machine learning algorithm. It can be one of the most challenging and time-consuming aspects of the overall task, for example in image processing and natural language processing. The ability of Deep Learning to internally transform raw data into useful features is a significant advantage for these problems.

Looking Ahead

Feature engineering is not the only costly aspect of machine learning. Often there is ample raw data, but labelling sufficient volumes of it can be expensive. Crowdsourcing platforms like Amazon Mechanical Turk can reduce labelling costs, but labelling millions of training examples can still be prohibitively expensive.

Training techniques that reduce the dependence on human labelling (semi-supervised learning) or that do not require any labelling (unsupervised learning and reinforcement learning) may be the way forward.

Semi-supervised Learning

Semi-supervised learning, in the sense used here, is where a label for the data is naturally available from its context. For example, hashtags and emojis can be considered labels for tweets. So you can take a large collection of tweets, strip out the hashtags or emojis, and train a Deep Learning system to predict the hashtags or emojis that have been stripped out.

As hashtags and emojis, especially those at the end of a tweet, often signal its topic and emotional content, predicting them can be used as a pre-training step for topic and emotion detection (e.g. DeepMoji).
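The label-stripping step can be sketched in a few lines of Python. This version handles hashtags only (emojis could be treated in the same way with a suitable pattern); the function name and the sample tweet are made up for illustration.

```python
import re

# A hashtag is taken to be '#' followed by letters, digits or underscores.
HASHTAG = re.compile(r"#(\w+)")


def make_example(tweet):
    """Split a tweet into (text without hashtags, list of hashtag labels)."""
    labels = [tag.lower() for tag in HASHTAG.findall(tweet)]
    text = HASHTAG.sub("", tweet)
    text = " ".join(text.split())  # tidy up the whitespace left behind
    return text, labels


if __name__ == "__main__":
    print(make_example("Loving this sunny day #happy #Weekend"))
```

Each (text, labels) pair can then serve as a training example, with no human labelling required.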

Further Ahead

Throughout its history, which stretches back to 1943, the field of neural networks has been characterised by continual methodological advancement and innovation. This is unlikely to stop soon. Currently, there is considerable interest in attention layers, with some calling for them to replace recurrent networks like LSTM and GRU altogether.

Looking further ahead, it seems that machine learning approaches that harness the signal in data without the need for human labelling (like reinforcement learning) will play an increasingly important role.

Deep Learning has already been combined with reinforcement learning with notable success: a network developed by DeepMind learnt to play several classic Atari 2600 games to an expert level from just raw pixel inputs. Reinforcement learning is a pathway that ultimately leads to autonomous agents, but that is another story.

I speculate that the success of the DeepMind system is a real indicator of the future capability of AI, where genuinely adaptive constructs solve novel, challenging tasks via synthetic learning processes like Deep Learning.

And in winding up this blog post, I’d like to leave you with another speculation: companion AIs, and the transformative effect they could have on areas of society such as care-giving for the elderly and the ill.

Companion AIs may sound fanciful or at least a long way off, but if they do eventuate then it is likely that Deep Learning will play an important role in how they are programmed to understand, communicate and interact with us humans.