My notes on “Deep learning” - Yann LeCun, Yoshua Bengio, and Geoffrey Hinton.
What is deep learning?
It is a kind of representation learning in which the machine automatically discovers abstract representations of the input data. These abstract representations, which are often not human-readable, help the computer make valuable predictions. It is called “deep” because deep learning methods use multiple layers of representation, with each higher level slightly more abstract than the previous one. Given enough such layers, very complex functions can be learned. The best part about deep learning is that these features/layers are not designed by humans; they are learned from data using a general-purpose procedure.
What is supervised learning?
Taking classification as an example: during training we show the computer many examples of each category, and we want it to produce the highest score for the class to which each example belongs. This doesn't happen at first, so we measure the error (the deviation from the ideal output) using an objective function. The machine then adjusts its internal parameters, AKA weights, to minimize this objective function/error. These weights are like ‘knobs’ that define the input-output function.
In order to adjust these knobs, the computer computes a gradient vector, which indicates by what amount the error will increase or decrease when we make tiny changes to each weight. The weights are then revised in the direction that reduces the error.
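The knob-turning above can be sketched with a toy, one-weight example. The quadratic loss, learning rate, and weight values here are purely illustrative, not from the paper:

```python
# A minimal sketch of gradient descent on a toy quadratic loss,
# with a single "knob" w and loss E(w) = (w - 3)^2 (illustrative values).
def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    # dE/dw: how the error changes for a tiny change in w
    return 2.0 * (w - 3.0)

w = 0.0    # initial weight ("knob" setting)
lr = 0.1   # learning rate: the size of each adjustment
for _ in range(100):
    w -= lr * gradient(w)  # move opposite the gradient to reduce the error

print(round(w, 3))  # converges toward 3.0, the minimum of the loss
```

With more weights, the same update is applied to every knob using its own component of the gradient vector.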
This objective function, averaged over all the training examples, forms a multi-dimensional hilly landscape. The gradient gives the direction of steepest descent, in the hope that moving in that direction will get us closer to a minimum, or lowest point, in this hilly terrain.
Most people use Stochastic Gradient Descent (SGD), which averages the gradients over small batches of training examples. This is repeated for many such batches until the objective function ceases to decrease any further. It is called “stochastic” because each batch gives a noisy estimate of the gradient over the entire example set. This technique finds a good set of weights surprisingly quickly when compared to more elaborate optimization techniques.
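As a minimal sketch of SGD, here is a toy linear model y = w·x fit on synthetic data by averaging gradients over small random batches. The data, batch size, and learning rate are all made up for illustration:

```python
import random

# Toy SGD: fit y = w * x on synthetic, noiseless data (true weight 2.0)
# by averaging gradients over small random batches of examples.
random.seed(0)
data = [(x, 2.0 * x) for x in range(1, 101)]

w, lr, batch_size = 0.0, 1e-4, 10
for step in range(2000):
    batch = random.sample(data, batch_size)  # noisy estimate of the full set
    # gradient of the mean squared error (w*x - y)^2 w.r.t. w, averaged over the batch
    grad = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
    w -= lr * grad

print(round(w, 2))  # ≈ 2.0
```

Each batch gives only an approximate gradient, but the updates are cheap, so many noisy steps beat a few exact ones.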
A multilayer neural network distorts the input space using non-linear functions in order to make the data linearly separable for classification. The figure in the paper illustrates this: notice how the grid lines are morphed too. When we compute the gradients, we are asking how changes in the weights (which make up this non-linear function) give a better transform/distortion, one that makes it easier for a linear classifier (operating on the distorted input space) to separate the data.
The conventional option is to hand-design these distortions/feature extractors, but this is very difficult and time-consuming; with deep learning, these feature extractors are learned during training.
Backpropagation to train multilayer architectures
Using the chain rule of derivatives, we can find the derivative/gradient of the objective function with respect to the inputs of each layer, and ultimately the original inputs. From these derivatives with respect to the layer inputs, it is straightforward to compute the gradients with respect to the weights of each layer (i.e., how the objective function behaves as each weight changes).
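As a minimal sketch, here is the chain rule applied by hand to a tiny two-layer net with scalar weights. All the numbers are illustrative; a real implementation would use vectors and matrices:

```python
# Backpropagation by hand on a tiny two-layer net with scalar weights.
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

x, target = 1.5, 2.0
w1, w2 = 0.4, 0.6

# --- forward pass: compute each layer's output ---
z1 = w1 * x                   # pre-activation of the hidden unit
h = relu(z1)                  # hidden layer output
y = w2 * h                    # network output
E = 0.5 * (y - target) ** 2   # objective (squared error)

# --- backward pass: chain rule, from the output back toward the inputs ---
dE_dy = y - target              # dE/dy
dE_dw2 = dE_dy * h              # dE/dw2 = dE/dy * dy/dw2
dE_dh = dE_dy * w2              # propagate to the hidden layer's output
dE_dz1 = dE_dh * relu_grad(z1)  # through the non-linearity
dE_dw1 = dE_dz1 * x             # dE/dw1

print(dE_dw1, dE_dw2)
```

Notice how each gradient with respect to a layer's input is reused to get the gradient with respect to that layer's weights; this reuse is what makes backpropagation efficient.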
The most popular activation function is the ReLU, f(z) = max(0, z), as it trains faster than smoother sigmoid-like activation functions. Layers between the output layer and the input layer are commonly called hidden layers. These are primarily responsible for distorting the input space in such a way that, after the final hidden layer, the data is linearly separable.
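One common explanation for why ReLU trains faster, sketched numerically: the sigmoid's gradient shrinks toward zero for large inputs, while ReLU's gradient stays at 1 for any positive input:

```python
import math

# ReLU vs. sigmoid gradients: the sigmoid saturates, ReLU does not.
def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [0.5, 2.0, 5.0]:
    # ReLU's gradient is 1 for any positive z; the sigmoid's shrinks toward 0.
    print(z, 1.0 if z > 0 else 0.0, round(sigmoid_grad(z), 4))
```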
For quite some time researchers thought backpropagation wouldn't work because it might get stuck in a local minimum, but analysis showed that this rarely happens. The landscape has many saddle points at which the algorithm can slow down, but it hardly matters which one it ends up near, as almost all of them have very similar objective values.
At CIFAR, unsupervised learning (with unlabeled data) was used to pretrain the nets when little labeled data was available. The pretraining developed sensible feature extractors, from very basic ones to more complicated ones as more layers were added. This “pretraining” enabled the weights to be initialized to sensible values. On top of these pretrained layers, a new output layer was added and the network was then trained using backpropagation, this time with the labeled data. This fine-tuned the weights of the earlier layers to get the desired classification. The same concept is used in transfer learning, where weights from nets already trained on popular benchmark datasets are reused for customized or specific applications.
NOTE: Unsupervised pretraining helps avoid overfitting when there is less data.
Convolutional neural networks
ConvNets have proved to perform far better on computer vision tasks: they generalize much better and are easier to train. They have been instrumental in the development of many state-of-the-art computer vision models.
A typical ConvNet begins with a first few stages of convolutional layers and pooling layers. The convolutional layers are made up of feature maps, and each feature map is a function of patches of the previous layer's feature maps. This function is computed through a set of weights called a filter bank; each feature map has its own filter bank.
The reasons for this architecture are: 1. Local groups of values are highly correlated; this is exploited using filter banks, aka filters. 2. The local statistics of images and other such array signals are invariant to location, hence we use the same filter bank throughout the image, i.e., the same set of weights at every position.
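The weight-sharing idea can be sketched with a toy 1D convolution: one small filter slides across the whole signal, so the same local pattern is detected wherever it occurs. The signal and kernel values below are invented for illustration:

```python
# Toy 1D convolution: one filter (shared weights) slides over the signal.
def conv1d(signal, kernel):
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

# The motif (1, 2, 1) appears twice, at positions 2 and 7.
signal = [0, 0, 1, 2, 1, 0, 0, 1, 2, 1, 0]
kernel = [1, 2, 1]   # one filter bank, reused at every position

out = conv1d(signal, kernel)
print(out)  # the response peaks (value 6) at both occurrences of the motif
```

Because the same weights are used everywhere, the filter responds identically to the motif regardless of its location, which is exactly the location-invariance argument above.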
Role of the convolutional layer: detect local conjunctions of features from the previous layer. Role of the pooling layer: merge semantically similar features into one.
Since the relative positions of the features that make up a motif can vary, we coarse-grain the locations of the features so that detection becomes more robust. This is typically done using max-pooling, where the maximum of a given patch of values is passed to the next layer. Neighboring pooling units take the max from patches that are shifted by more than one unit, and this creates invariance to small shifts and distortions.
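A minimal sketch of 2x2 max-pooling with stride 2 on a small feature map (pure Python, no framework; the values are illustrative):

```python
# 2x2 max-pooling with stride 2: take the max of each non-overlapping patch.
def max_pool_2x2(fmap):
    rows, cols = len(fmap), len(fmap[0])
    pooled = []
    for r in range(0, rows, 2):
        pooled.append([
            max(fmap[r][c], fmap[r][c + 1],
                fmap[r + 1][c], fmap[r + 1][c + 1])
            for c in range(0, cols, 2)
        ])
    return pooled

fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 3],
]
print(max_pool_2x2(fmap))  # [[4, 2], [2, 7]]
```

Shifting a strong activation by one position inside its 2x2 patch leaves the pooled output unchanged, which is where the small-shift invariance comes from.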
These convolutional and pooling stages are followed by more convolutional and fully connected layers. ConvNets are trained using the same backpropagation principle.
These deep neural nets exploit the fact that images are made of motifs, which make up parts, which in turn make up objects. The same hierarchy exists in audio and text. This fundamental quality of these signals is exploited by deep nets.
Convolutional and pooling layers are inspired by the simple and complex cells of visual neuroscience.
Distributed representations and language processing
Neural networks have given us a new way of representing text. Before their advent in NLP, the common practice was to use n-grams, taking all possible short sequences over the given vocabulary of words. With neural networks, the machine learns the relationships between words by learning a vector of features for each word. This becomes apparent when we plot the words according to the values of their learned features: similar words end up close to each other.
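The "similar words are closer" idea can be sketched with cosine similarity between word vectors. The vectors below are hand-made for the example, not actually learned:

```python
import math

# Words as feature vectors; these toy vectors are invented for illustration.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.2, 0.1],
    "apple": [0.1, 0.1, 0.9],
}

def cosine(u, v):
    # cosine similarity: 1.0 means same direction, 0.0 means unrelated
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Semantically related words get a higher similarity score.
print(cosine(embeddings["king"], embeddings["queen"]))
print(cosine(embeddings["king"], embeddings["apple"]))
```

With real learned embeddings, these distances capture relationships that n-gram counting cannot, because n-grams treat every word as an unrelated symbol.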
The earlier algorithms could not relate semantically related sequences of words. Neural networks are able to do this quite well.
Recurrent Neural Networks
RNNs process an input sequence one element at a time; an “element” is typically one time step. As they go, they maintain hidden units whose state implicitly contains the entire history of the sequence so far, so the hidden units are an abstract representation of the sequence. Unrolled in time, this makes them very deep nets in which each time step has its own layer.
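A minimal sketch of this loop, with a single hidden unit and fixed, illustrative weights (no training here):

```python
import math

# One RNN cell processing a sequence element by element; the hidden
# state h summarizes everything seen so far.
w_in, w_rec = 0.5, 0.9   # illustrative input and recurrent weights

def step(h, x):
    # the new state depends on the current input AND the previous state
    return math.tanh(w_in * x + w_rec * h)

sequence = [1.0, 0.0, -1.0, 0.5]
h = 0.0                   # initial hidden state
for x in sequence:
    h = step(h, x)        # the same weights are reused at every time step

print(round(h, 4))        # abstract summary of the whole sequence
```

Because the same weights are applied at every time step, unrolling the loop over T steps gives a T-layer deep net with shared weights.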
This technique is used in machine translation, with encoder and decoder networks. It is also used to describe what is present in a particular image: the image is represented by the activity of a hidden layer of a ConvNet, and this abstraction is converted into a sequence of words by a decoder network.
One problem with RNNs is that they can only remember sequences up to a certain length; they start forgetting as the sequence progresses. The solution is LSTMs, which use special hidden units designed to remember inputs for a long time.
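Why a plain RNN forgets can be sketched with a deliberately simplified linear recurrence: the first input's influence is multiplied by the recurrent weight at every step, so it shrinks geometrically. This is only a toy illustration, not the full gradient analysis:

```python
# With a linear recurrence h = w_rec * h + x, the first input's
# contribution after T steps scales as w_rec ** T, so for |w_rec| < 1
# it vanishes as the sequence gets longer.
w_rec = 0.9
contribution = 1.0
for t in range(50):
    contribution *= w_rec   # what survives of the first input after each step

print(contribution)  # ~0.005: the early input is nearly forgotten
```

An LSTM's memory cell instead carries information along a gated, mostly additive path, so it can preserve an input's influence over many more steps.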
And, as promised, thanks Arnav.