Deep learning explained
Deep neural networks can solve the most challenging problems, but require abundant computing power and massive amounts of data
24 May 2019
Deep learning is a form of machine learning that models patterns in data as complex, multi-layered networks. Because deep learning is the most general way to model a problem, it has the potential to solve difficult problems—such as computer vision and natural language processing—that outstrip both conventional programming and other machine learning techniques.
Deep learning not only can produce useful results where other methods fail, but also can build more accurate models than other methods, and can reduce the time needed to build a useful model. However, training deep learning models requires a great deal of computing power. Another drawback to deep learning is the difficulty of interpreting deep learning models.
The defining characteristic of deep learning is that the model being trained has more than one hidden layer between the input and the output. In most discussions, deep learning means using deep neural networks. There are, however, a few algorithms that implement deep learning using other kinds of hidden layers besides neural networks.
Deep learning vs. machine learning
I mentioned that deep learning is a form of machine learning. I’ll refer to non-deep machine learning as classical machine learning, to conform to common usage.
In general, classical machine learning algorithms run much faster than deep learning algorithms; one or more CPUs will often be sufficient to train a classical model. Deep learning models often need hardware accelerators such as GPUs, TPUs, or FPGAs for training, and also for deployment at scale. Without them, the models would take months to train.
For many problems, some classical machine learning algorithm will produce a “good-enough” model. For other problems, classical machine learning algorithms have not worked terribly well in the past.
Deep learning applications
There are many examples of problems that currently require deep learning to produce the best models. Natural language processing (NLP) is a good one.
In Fall 2016, the quality of Google Translate output for English-French, English-Chinese, and English-Japanese language pairs suddenly improved dramatically, from creating word salad to producing sentences close to the quality of a professional translation by a human. What happened behind the scenes is that the Google Brain and Google Translate teams revamped Google Translate from using its old phrase-based statistical machine translation algorithms (one kind of classical machine learning) to using a deep neural network trained with word embeddings using Google’s TensorFlow framework.
That wasn’t an easy project. It took many doctoral-level researchers months of work on models, and thousands of GPU-weeks for training the models. It also spurred Google to create a new kind of chip, a Tensor Processing Unit (TPU), to run neural networks at scale for Google Translate.
In addition to the language translation problem addressed by Google Translate, major NLP tasks include automatic summarisation, co-reference resolution, discourse analysis, morphological segmentation, named entity recognition, natural language generation, natural language understanding, part-of-speech tagging, sentiment analysis, and speech recognition.
Another good example of the application of deep learning is image classification. Because living organisms process images with their visual cortex, many researchers have taken the architecture of the mammalian visual cortex as a model for neural networks designed to perform image recognition. The biological research goes back to the 1950s.
The breakthrough in the neural network field for vision was Yann LeCun’s 1998 LeNet-5, a seven-level convolutional neural network (CNN) for recognition of handwritten digits digitised in 32×32 pixel images. To analyse higher-resolution images, the LeNet-5 network would need to be expanded to more neurons and more layers.
Today’s best deep image classification models can identify diverse catalogues of objects at HD resolution in colour. In addition to pure deep neural networks (DNNs), sometimes people use hybrid vision models, which combine deep learning with classical machine learning algorithms that perform specific sub-tasks.
Other vision problems besides basic image classification that have been solved with deep learning include image classification with localisation, object detection, object segmentation, image style transfer, image colourisation, image reconstruction, image super-resolution, and image synthesis.
Image classification can be expanded to video classification by extracting the individual frames from a video and classifying each frame. Detected objects in video clips can be tracked from frame to frame.
According to Goodfellow, Bengio, and Courville, writing in 2016, deep learning has been used successfully to predict how molecules will interact in order to help pharmaceutical companies design new drugs, to search for subatomic particles, and to automatically parse microscope images used to construct a 3-D map of the human brain.
Deep learning neural networks
The ideas for “artificial” neural networks go back to the 1940s. The essential concept is that a network of artificial neurons built out of interconnected threshold switches can learn to recognise patterns in the same way that an animal brain and nervous system (including the retina) does.
The learning in deep neural networks occurs by changing the strength of the connections between neurons during training. The original idea, known as Hebbian learning, was to strengthen the connection between two neurons when both are active at the same time. In modern neural network software, learning is most commonly a matter of adjusting the weight values for the connections between neurons, up or down, using a rule called backpropagation of error, backprop, or BP.
How are the neurons modelled? Each has a propagation function that transforms the outputs of the connected neurons, often with a weighted sum. The output of the propagation function passes to an activation function, which fires when its input exceeds a threshold value.
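The neuron model just described can be sketched in a few lines of Python. This is only an illustration: the function names, the threshold of zero, and the AND-gate weights are my own assumptions, not part of any particular framework.

```python
def propagate(inputs, weights, bias):
    """Propagation function: weighted sum of the upstream outputs plus a bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


def step_activation(net_input, threshold=0.0):
    """Fire (1) only when the net input exceeds the threshold, otherwise stay silent (0)."""
    return 1 if net_input > threshold else 0


def neuron_output(inputs, weights, bias, threshold=0.0):
    """Chain the propagation function and the activation function."""
    return step_activation(propagate(inputs, weights, bias), threshold)


# Hypothetical weights that make a single neuron behave like a logical AND gate.
and_weights, and_bias = [1.0, 1.0], -1.5
outputs = [neuron_output([a, b], and_weights, and_bias) for a in (0, 1) for b in (0, 1)]
print(outputs)
```

With these weights the neuron fires only when both inputs are 1, which is the simplest example of a pattern encoded in connection weights.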
In the 1940s and 1950s artificial neurons used a step activation function and were called perceptrons. Modern neural networks may say they are using perceptrons, but they actually use continuous activation functions, such as the logistic (sigmoid) function, the hyperbolic tangent, and the Rectified Linear Unit (ReLU). ReLU is usually the best choice for fast convergence, although it has an issue of neurons “dying” during training if the learning rate is set too high.
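For comparison, here is a minimal sketch of the activation functions named above, written with the Python standard library only. The function names are mine, and the ReLU gradient at exactly zero is set to 0 by convention.

```python
import math


def step(x):
    """Perceptron-style threshold: jumps from 0 to 1 as x crosses zero."""
    return 1.0 if x > 0 else 0.0


def sigmoid(x):
    """Logistic function: squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent: squashes any input into the range (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Rectified Linear Unit: passes positive inputs, blocks negative ones."""
    return max(0.0, x)


def relu_gradient(x):
    """Slope of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


for x in (-2.0, 0.0, 2.0):
    print(x, step(x), round(sigmoid(x), 3), round(tanh(x), 3), relu(x), relu_gradient(x))
```

The zero slope for negative inputs is what lies behind “dying” neurons: if a neuron's weighted sum ends up negative for every training example, no correction flows back through it and its weights stop changing.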
The output of the activation function can pass to an output function for additional shaping. Often, however, the output function is the identity function, meaning that the output of the activation function is passed to the downstream connected neurons.
Neural network topologies
Now that we know about the neurons, we need to learn about the common neural network topologies. In a feed-forward network, the neurons are organised into distinct layers: one input layer, any number of hidden processing layers, and one output layer, and the outputs from each layer go only to the next layer.
In a feed-forward network with shortcut connections, some connections can jump over one or more intermediate layers. In recurrent neural networks, neurons can influence themselves, either directly, or indirectly through the next layer.
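The difference between a plain feed-forward pass and one with a shortcut connection can be sketched as follows. The 2-2-1 layer sizes, the tanh activation, and the weights are arbitrary values chosen for illustration, and adding the skipped input to the hidden output is only one of several ways a shortcut can be wired.

```python
import math


def dense_layer(inputs, weights, biases):
    """One fully connected layer: each row of `weights` feeds one neuron."""
    return [math.tanh(sum(x * w for x, w in zip(inputs, row)) + b)
            for row, b in zip(weights, biases)]


def feed_forward(inputs, layers):
    """Pass activations strictly from each layer to the next."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations


def feed_forward_with_shortcut(inputs, layers):
    """Same network, but the input also jumps over the hidden layers and is
    added to their output before the output layer (the sizes must match)."""
    hidden = inputs
    for weights, biases in layers[:-1]:
        hidden = dense_layer(hidden, weights, biases)
    merged = [h + x for h, x in zip(hidden, inputs)]
    out_weights, out_biases = layers[-1]
    return dense_layer(merged, out_weights, out_biases)


# Hypothetical 2-2-1 network with arbitrary weights.
layers = [
    ([[0.5, -0.5], [0.25, 0.75]], [0.0, 0.1]),  # hidden layer: 2 neurons
    ([[1.0, -1.0]], [0.0]),                      # output layer: 1 neuron
]
plain = feed_forward([1.0, 2.0], layers)
skipped = feed_forward_with_shortcut([1.0, 2.0], layers)
print(plain, skipped)
```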
Supervised learning of a neural network is done just like any other machine learning. You present the network with groups of training data, compare the network output with the desired output, generate an error vector, and apply corrections to the network based on the error vector. Groups of training data that are run together before applying corrections are called batches (or mini-batches); one complete pass through the entire training set is called an epoch.
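The present, compare, and correct cycle can be sketched with the simplest possible network, a single linear neuron. Everything here (the function name, the group size, the learning rate, and the toy data drawn from y = 2x + 1) is an assumption made for illustration.

```python
import random


def train_linear_neuron(samples, learning_rate=0.05, group_size=2, passes=200, seed=0):
    """Fit y = w*x + b by presenting small groups of samples and correcting on their error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(samples)
    for _ in range(passes):                        # one full sweep over the training data
        rng.shuffle(data)
        for start in range(0, len(data), group_size):
            group = data[start:start + group_size]
            # Error vector: network output minus desired output, one entry per sample.
            errors = [(w * x + b) - y for x, y in group]
            # Corrections: gradient of half the mean squared error for this group.
            grad_w = sum(e * x for e, (x, _) in zip(errors, group)) / len(group)
            grad_b = sum(errors) / len(group)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Hypothetical training set drawn from y = 2x + 1.
samples = [(x, 2.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0, 3.0)]
w, b = train_linear_neuron(samples)
print(round(w, 3), round(b, 3))
```

Because the toy data contains no noise, the weight and bias settle very close to 2 and 1; with real data the corrections would keep fluctuating around the best fit.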
For those interested in the details, backpropagation uses the gradient of the error (or cost) function with respect to the weights and biases of the model to discover the correct direction to minimise the error. Two things control the application of corrections: the optimisation algorithm, and the learning rate variable, which usually needs to be small to guarantee convergence and avoid causing dead ReLU neurons.
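To make the gradient concrete, here is a sketch of backpropagation through a tiny network with one tanh hidden neuron and one linear output neuron, checked against a numerical estimate of the same gradient. The network shape, the starting parameters, and the learning rate of 0.1 are all hypothetical.

```python
import math


def forward(params, x):
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)       # hidden neuron
    y = w2 * h + b2                  # linear output neuron
    return h, y


def loss(params, x, target):
    _, y = forward(params, x)
    return 0.5 * (y - target) ** 2


def backprop(params, x, target):
    """Gradient of the loss with respect to (w1, b1, w2, b2) via the chain rule."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    d_y = y - target                 # dL/dy
    d_w2, d_b2 = d_y * h, d_y        # output layer
    d_h = d_y * w2                   # error propagated back to the hidden neuron
    d_pre = d_h * (1.0 - h * h)      # through tanh: d tanh(z)/dz = 1 - tanh(z)^2
    d_w1, d_b1 = d_pre * x, d_pre    # hidden layer
    return [d_w1, d_b1, d_w2, d_b2]


def numeric_gradient(params, x, target, eps=1e-6):
    """Central-difference estimate, used only to check the analytic gradient."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, x, target) - loss(down, x, target)) / (2 * eps))
    return grads


params = [0.3, -0.1, 0.8, 0.2]       # hypothetical starting weights and biases
analytic = backprop(params, x=1.5, target=1.0)
numeric = numeric_gradient(params, x=1.5, target=1.0)

learning_rate = 0.1                  # small step against the gradient
updated = [p - learning_rate * g for p, g in zip(params, analytic)]
print(loss(params, 1.5, 1.0), loss(updated, 1.5, 1.0))
```

Stepping against the gradient lowers the error here; a much larger learning rate could overshoot the minimum instead.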
Optimisers for neural networks typically use some form of gradient descent algorithm to drive the backpropagation, often with a mechanism to help avoid becoming stuck in local minima, such as optimising randomly selected mini-batches (Stochastic Gradient Descent) and applying momentum corrections to the gradient. Some optimisation algorithms also adapt the learning rates of the model parameters by looking at the gradient history (AdaGrad, RMSProp, and Adam).
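Two of the update rules mentioned above, momentum and Adam, can be sketched on a toy problem. The loss function, starting point, step counts, and learning rates below are assumptions chosen so the example runs quickly; in particular, frameworks typically default to a much smaller Adam learning rate than the 0.1 used here.

```python
import math


def sgd_momentum_step(params, grads, velocity, learning_rate=0.02, momentum=0.9):
    """Gradient descent with momentum: keep a decaying running sum of past steps."""
    velocity = [momentum * v - learning_rate * g for v, g in zip(velocity, grads)]
    params = [p + v for p, v in zip(params, velocity)]
    return params, velocity


def adam_step(params, grads, m, v, t, learning_rate=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: scale each parameter's step by its own gradient history."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grads)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grads)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]     # bias correction, t starts at 1
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    params = [p - learning_rate * mh / (math.sqrt(vh) + eps)
              for p, mh, vh in zip(params, m_hat, v_hat)]
    return params, m, v


def loss(params):
    """Toy loss f(x, y) = x^2 + 10*y^2, with its minimum at the origin."""
    x, y = params
    return x * x + 10.0 * y * y


def gradient(params):
    x, y = params
    return [2.0 * x, 20.0 * y]


start = [3.0, 2.0]

p, velocity = list(start), [0.0, 0.0]
for _ in range(200):
    p, velocity = sgd_momentum_step(p, gradient(p), velocity)
momentum_result = p

p, m, v = list(start), [0.0, 0.0], [0.0, 0.0]
for t in range(1, 201):
    p, m, v = adam_step(p, gradient(p), m, v, t)
adam_result = p

print(loss(start), loss(momentum_result), loss(adam_result))
```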
As with all machine learning, you need to check the predictions of the neural network against a separate validation data set. Without doing that you risk creating neural networks that only memorise their inputs instead of learning to be generalised predictors.
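A small sketch shows why the separate validation set matters. The data, the 25% split, and both “models” are hypothetical: one simply memorises its training inputs, the other fits a straight line through the origin.

```python
import random


def split(samples, validation_fraction=0.25, seed=0):
    """Shuffle, then hold back a fraction of the samples for validation."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def mean_squared_error(predict, samples):
    return sum((predict(x) - y) ** 2 for x, y in samples) / len(samples)


# Hypothetical data: y = 3x plus a little noise.
noise = random.Random(1)
data = [(x, 3.0 * x + noise.uniform(-0.5, 0.5)) for x in range(40)]
train, validation = split(data)

# A "model" that memorises its training inputs and guesses 0 for anything else.
lookup = dict(train)


def memoriser(x):
    return lookup.get(x, 0.0)


# A model that generalises: least-squares slope of a line through the origin.
slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)


def generaliser(x):
    return slope * x


for name, model in (("memoriser", memoriser), ("generaliser", generaliser)):
    print(name, mean_squared_error(model, train), mean_squared_error(model, validation))
```

The memoriser scores a perfect zero error on the training data and fails badly on the validation data, which is exactly the failure the validation check is meant to expose.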
A deep neural network for a real problem might have upwards of 10 hidden layers. Its topology might be simple or quite complex.
The more layers in the network, the more characteristics it can recognise. Unfortunately, the more layers in the network, the longer it will take to calculate, and the harder it will be to train.
Deep learning algorithms
As I mentioned earlier, most deep learning is done with deep neural networks. Convolutional neural networks (CNN) are often used for machine vision. Recurrent neural networks (RNN) are often used for natural language and other sequence processing, as are Long Short-Term Memory (LSTM) networks and attention-based neural networks. Random Forests, also known as Random Decision Forests, which are not neural networks, are useful for a range of classification and regression problems.
CNN neural networks
Convolutional neural networks typically use convolutional, pooling, ReLU, fully connected, and loss layers to simulate a visual cortex. The convolutional layer basically computes weighted sums (discrete convolutions) over many small overlapping regions of its input. The pooling layer performs a form of non-linear downsampling. ReLU layers apply the non-saturating activation function f(x) = max(0,x). In a fully connected layer, the neurons have connections to all activations in the previous layer. A loss layer computes how the network training penalises the deviation between the predicted and true labels, using a Softmax or cross-entropy loss function for classification, or a Euclidean loss function for regression.
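Here is a minimal pure-Python sketch of those building blocks, leaving out the fully connected layer. The 5×5 image, the 2×2 edge-detecting kernel, and the function names are illustrative assumptions; real frameworks implement the same operations far more efficiently on tensors.

```python
import math


def conv2d(image, kernel):
    """Slide the kernel over the image and take a weighted sum at each position
    (technically cross-correlation, which is what most frameworks compute)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(feature_map):
    """Apply f(x) = max(0, x) to every element."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-linear downsampling: keep the largest value in each size-by-size block."""
    return [[max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)                       # subtract the peak for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, true_index):
    """Penalty that grows as the probability given to the true class shrinks."""
    return -math.log(probabilities[true_index])


# Hypothetical 5x5 image containing a vertical edge, and a vertical-edge kernel.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
kernel = [[-1, 1], [-1, 1]]
features = max_pool(relu(conv2d(image, kernel)))
probabilities = softmax([2.0, 0.0])
print(features, probabilities, cross_entropy(probabilities, 0))
```

The kernel responds only where the image changes from dark to light, so the pooled feature map is non-zero only in the column that contains the edge.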
RNN, LSTM, and attention-based neural networks
In feed-forward neural networks, information flows from the input, through the hidden layers, to the output. This limits the network to dealing with a single state at a time.
In recurrent neural networks, the information cycles through a loop, which allows the network to remember recent previous outputs. This allows for the analysis of sequences and time series. RNNs have two common issues: exploding gradients (easily fixed by clipping the gradients) and vanishing gradients (not so easy to fix).
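Both the loop and the usual fix for exploding gradients can be sketched briefly. The single-unit cell, its weights, and the norm limit of 5 are assumptions for illustration; the fix shown is clamping the gradient by its overall norm, usually called gradient clipping.

```python
import math


def rnn_step(x, h_prev, w_in, w_rec, bias):
    """One step of a single-unit recurrent cell: the previous output loops back in."""
    return math.tanh(w_in * x + w_rec * h_prev + bias)


def run_rnn(sequence, w_in=0.5, w_rec=0.9, bias=0.0):
    """Feed a sequence through the cell, carrying the state from step to step."""
    h, states = 0.0, []
    for x in sequence:
        h = rnn_step(x, h, w_in, w_rec, bias)
        states.append(h)
    return states


def clip_by_norm(gradients, max_norm=1.0):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)
    return [g * max_norm / norm for g in gradients]


# A single input at the first step keeps echoing through later states.
states = run_rnn([1.0, 0.0, 0.0, 0.0])
clipped = clip_by_norm([30.0, -40.0], max_norm=5.0)
print(states, clipped)
```

The echo fades a little at every step, which hints at the vanishing gradient problem: the influence of early inputs shrinks as the sequence gets longer.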
In LSTMs, the network is capable of forgetting (gating) previous information or remembering it, in both cases by means of gates whose behaviour is controlled by learned weights. This effectively gives an LSTM both long-term and short-term memory and largely mitigates the vanishing gradient problem. LSTMs can deal with sequences of hundreds of past inputs.
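A single-unit version of the commonly used LSTM cell shows how gating works. The parameter layout and the gate settings below are hypothetical, picked by hand to show one cell that holds on to its stored value and one that discards it; in practice these values are learned.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM cell; `p` maps gate name -> (w_x, w_h, bias)."""
    def gate(name, squash):
        w_x, w_h, b = p[name]
        return squash(w_x * x + w_h * h_prev + b)

    forget = gate("forget", sigmoid)        # how much of the old cell state to keep
    write = gate("input", sigmoid)          # how much new information to store
    candidate = gate("candidate", math.tanh)
    expose = gate("output", sigmoid)        # how much of the cell state to reveal
    c = forget * c_prev + write * candidate
    h = expose * math.tanh(c)
    return h, c


def run(params, steps=100):
    h, c = 0.0, 1.0                          # start with a value stored in the cell
    for _ in range(steps):
        h, c = lstm_step(0.0, h, c, params)
    return c


keep = {"forget": (0.0, 0.0, 6.0), "input": (0.0, 0.0, -6.0),
        "candidate": (1.0, 0.0, 0.0), "output": (0.0, 0.0, 6.0)}
drop = dict(keep, forget=(0.0, 0.0, -6.0))
remembered, forgotten = run(keep), run(drop)
print(remembered, forgotten)
```

With the forget gate held open, most of the stored value survives 100 steps; with it closed, the value is gone almost immediately.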
Attention modules are generalised gates that apply weights to a vector of inputs. A hierarchical neural attention encoder uses multiple layers of attention modules to deal with tens of thousands of past inputs.
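One common form of attention, scaled dot-product attention, can be sketched as a weighted average. I am assuming this particular form for illustration; the query, keys, and values are made-up vectors.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Weight each value by how well its key matches the query, then sum."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    width = len(values[0])
    output = [sum(w * value[d] for w, value in zip(weights, values)) for d in range(width)]
    return output, weights


# Hypothetical inputs: three past items, each with a key and a value vector.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
output, weights = attend([4.0, 0.0], keys, values)
print(weights, output)
```

The weights always sum to 1, so attention acts as a soft selector over the inputs rather than a hard on/off gate.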
Another kind of algorithm often discussed alongside deep learning, though it is not a deep neural network, is the Random Forest, or Random Decision Forest. A Random Forest is constructed from many decision trees rather than from layers of neurons, and outputs a statistical average (mode for classification or mean for regression) of the predictions of the individual trees. The randomised aspects of Random Forests are the use of bootstrap aggregation (a.k.a. bagging) for individual trees and taking random subsets of the features.
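The bagging, random feature selection, and voting described above can be sketched with very shallow trees. This is a simplified illustration, not a full Random Forest: each “tree” is a one-split stump on a single randomly chosen feature, and the data set and function names are hypothetical.

```python
import random
from statistics import mean, mode


def bootstrap(samples, rng):
    """Bagging: draw len(samples) rows with replacement."""
    return [rng.choice(samples) for _ in samples]


def fit_stump(samples, feature):
    """A one-split decision tree on one feature, split at that feature's mean."""
    threshold = mean([row[feature] for row, _ in samples])
    left = [label for row, label in samples if row[feature] <= threshold]
    right = [label for row, label in samples if row[feature] > threshold]
    fallback = mode([label for _, label in samples])
    left_label = mode(left) if left else fallback
    right_label = mode(right) if right else fallback

    def predict(row):
        return left_label if row[feature] <= threshold else right_label

    return predict


def fit_forest(samples, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_features = len(samples[0][0])
    trees = []
    for _ in range(n_trees):
        feature = rng.randrange(n_features)   # random feature subset (size 1 here)
        trees.append(fit_stump(bootstrap(samples, rng), feature))
    return trees


def predict_class(trees, row):
    """Classification: the mode of the individual trees' votes.
    For regression, the mean of the trees' outputs would be used instead."""
    return mode([tree(row) for tree in trees])


# Hypothetical two-feature data set: class 0 near the origin, class 1 far from it.
samples = [((0, 1), 0), ((1, 0), 0), ((2, 2), 0), ((1, 2), 0), ((2, 1), 0),
           ((8, 9), 1), ((9, 8), 1), ((10, 10), 1), ((9, 10), 1), ((10, 9), 1)]
forest = fit_forest(samples)
print(predict_class(forest, (1, 1)), predict_class(forest, (9, 9)))
```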
Deep learning frameworks
While you could write deep learning programs from first principles, it’s far more efficient to use deep learning frameworks, especially given that they have been optimised for use with GPUs and other accelerators. The pre-eminent framework is TensorFlow, which originated at Google. The favoured high-level API for TensorFlow is Keras, which can also be used with other back-end frameworks.
PyTorch, from Facebook and others, is a strong alternative to TensorFlow, and has the distinction of supporting dynamic neural networks, in which the topology of the network can change from iteration to iteration. Fastai is a high-level third-party API that uses PyTorch as a back-end.
MXNet, from Amazon and others, is another strong alternative to TensorFlow, with a claim to better scalability. Gluon is the preferred high-level imperative API for MXNet.
Chainer, from Preferred Networks with backing from IBM, Intel, and others, was in some ways the inspiration for PyTorch, given that it takes a define-by-run approach to building the neural network and supports dynamic neural networks.
While all of the frameworks mentioned above are primarily Python, Deeplearning4j (DL4J), originally from Skymind and now an Eclipse project, is primarily Java and Scala. DL4J is compatible with Apache Spark and Hadoop.
ONNX was originally proposed as an open ecosystem for interchangeable AI models. ONNX now has a runtime in addition to the interchange file format.
TensorRT, from Nvidia, is another run-time for AI models, specifically to take advantage of Nvidia GPUs. The ONNX runtime can use TensorRT as a plug-in.
Deep transfer learning
Transfer learning is the process of adapting a model trained on one set of data to another set of data. Transfer learning is much faster than training models from scratch, and it requires much less data for the training.
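The idea can be sketched without any framework: keep the earlier layers frozen and train only a new output layer on the new data. Here a hand-written function stands in for the frozen, previously trained layers, and the six-example data set, learning rate, and function names are all hypothetical.

```python
import math


def pretrained_features(x):
    """Stand-in for frozen layers learned on the original data set (hypothetical)."""
    return [math.tanh(x), math.tanh(x - 2.0), math.tanh(x + 2.0)]


def train_head(samples, learning_rate=0.5, passes=500):
    """Train only a new logistic output layer; the feature layers never change."""
    weights, bias = [0.0, 0.0, 0.0], 0.0
    for _ in range(passes):
        for x, label in samples:
            feats = pretrained_features(x)
            z = sum(w * f for w, f in zip(weights, feats)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - label                  # gradient of the cross-entropy loss w.r.t. z
            weights = [w - learning_rate * err * f for w, f in zip(weights, feats)]
            bias -= learning_rate * err
    return weights, bias


def predict(x, weights, bias):
    feats = pretrained_features(x)
    z = sum(w * f for w, f in zip(weights, feats)) + bias
    return 1 if z > 0 else 0


# Hypothetical new task with only six labelled examples: label is 1 when x > 1.
new_task = [(-2.0, 0), (0.0, 0), (0.5, 0), (1.5, 1), (2.5, 1), (4.0, 1)]
weights, bias = train_head(new_task)
print([predict(x, weights, bias) for x, _ in new_task])
```

Only four numbers are trained here, which is why transfer learning can get by with so little data and time compared with training every layer from scratch.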
Google Cloud AutoML implements deep transfer learning for vision, translation, and natural language. Azure Machine Learning Service offers similar deep transfer learning services as custom vision, customisable speech and translation, and custom search.
Distributed deep learning training
While TensorFlow has its own way of coordinating distributed training with parameter servers, a more general approach uses MPI (Message Passing Interface), for example the Open MPI implementation. Horovod, a distributed training framework for TensorFlow, Keras, and PyTorch that was created at Uber, uses Open MPI as well as Nvidia NCCL. Horovod achieves between 68% and 90% scaling efficiency, depending on the model being trained.
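Scaling efficiency is simple arithmetic: the measured throughput divided by what perfectly linear scaling would deliver. The sketch below also simulates the gradient averaging step that each worker takes part in. The throughput figures and function names are hypothetical, and no Horovod or MPI calls are used.

```python
def scaling_efficiency(single_worker_rate, n_workers, measured_rate):
    """Measured throughput as a fraction of perfectly linear scaling."""
    return measured_rate / (single_worker_rate * n_workers)


def allreduce_average(worker_gradients):
    """Simulate averaging the gradients computed by each worker on its own data."""
    n_workers = len(worker_gradients)
    return [sum(column) / n_workers for column in zip(*worker_gradients)]


# Hypothetical numbers: one GPU processes 100 images/s, 128 GPUs process 11,520 images/s.
efficiency = scaling_efficiency(100.0, 128, 11520.0)
averaged = allreduce_average([[1.0, 2.0], [3.0, 4.0]])
print(f"{efficiency:.0%}", averaged)
```

At 90% efficiency, 128 accelerators do the work of roughly 115 ideal ones; the rest is lost mostly to communication between workers.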
IDG News Service