Natural language processing has come a long way: from Bag of Words, to Recurrent Neural Networks, to Long Short-Term Memories, each step improving on the problems of the one before it.
Bag of Words was essentially a sparse representation: with a vocabulary of, say, 10 million words, each word is represented by a vector that is mostly zeroes, with a one at the index of that word.
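As a minimal sketch (using a toy six-word vocabulary rather than millions of words, purely for illustration), such a sparse one-hot vector can be built like this:

```python
import numpy as np

# Toy vocabulary, assumed purely for illustration; real vocabularies hold millions of words.
vocab = ["i", "am", "fine", "how", "are", "you"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return a mostly-zero vector with a one at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("fine"))  # [0. 0. 1. 0. 0. 0.]
```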
RNNs were good at handling sequences of words, but they suffered from vanishing and exploding gradients: they were very good at keeping short-range sequence information, but not long-range dependencies. Vanilla models also had no context from words coming later in the sequence.
To resolve that, bidirectional recurrent neural networks came along, taking as input the hidden states of both the previous and the following words.
LSTMs came to the rescue for the RNN's short-term memory problem: a more complex cell state gives weight to important information, so long sequences can still make sense. However, this model too was not without disadvantages: it is slow to train, has very long gradient paths (a 100-word sequence behaves like a 100-layer network), needs labelled data for each task, and transfer learning doesn't work well on it.
Transformers are taking the Natural Language Processing space by storm.
They are pushing boundaries and are used in nearly every NLP task: question-answering chatbots, translation models, text generation, and even search engines.
They are all the rage: pre-trained transformers can be fine-tuned for a new task without training from scratch, making them a great example of transfer learning.
Now how do they work?
Transformers outperform RNNs, GRUs and LSTMs, and they don't need a chain structure to keep the sequence intact.
BERT and GPT are both examples of transformer-based models.
To understand transformers, we must first know about the “Attention” mechanism.
In a text generation model, the transformer is fed an input and also has knowledge of the previous words, based on which it predicts the words ahead.
Through the attention mechanism, each word knows which of the previously seen words it should pay attention to.
The model learns by backpropagation; in a recurrent neural network the window over which gradients can propagate backwards is much smaller than in a transformer.
LSTMs also retain past information, but transformers can, given enough resources, attend over a practically unlimited window of memory.
Transformers are attention-based encoder-decoder models. The encoder processes the inputs and turns them into a continuous representation that holds all the learned information about that input.
The decoder then takes that information from the encoder and decodes it step by step to produce the output, while also being fed its own previous outputs.
Encoder
Step 1: Word Embedding
In the encoder, the input first passes through a word embedding layer; the embeddings are then combined with positional encoding, which keeps the information about word positions.
Since we are not using RNNs here, the sequential information is kept through positional encoding, which is a very smart way of retaining the positions of words.
Step 2: Positional Encoding
Positional encoding is done using sine and cosine functions, which are given by
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
These positional encodings are added to the corresponding word embedding vectors to get the position-aware input encodings.
Sine and cosine were chosen because the encoding of a shifted position is a linear function of the original one, a property the model can easily learn and compute.
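A minimal NumPy sketch of these sinusoidal encodings (the sequence length and embedding size below are just illustrative) could look like this:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding: even dimensions use sine, odd use cosine."""
    positions = np.arange(seq_len)[:, np.newaxis]              # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                   # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                           # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                      # even indices -> sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                      # odd indices  -> cosine
    return pe

# The encoding is simply added to the word embeddings:
word_embeddings = np.random.randn(10, 512)                     # toy embeddings for 10 tokens
positional_input = word_embeddings + positional_encoding(10, 512)
```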
Step 3: Multi-headed Attention
The input is fed into three separate fully connected layers to form the query, key and value vectors.
a) Self-Attention: making the query, key and value vectors
An analogy for a query is the search text you type on YouTube or Google; the keys are the video or article titles that the query is matched against.
The queries and keys are combined by dot products to produce scores; the highest scores belong to the words that should be given the most attention.
In short, multi-headed attention produces a vector representation of the input sequence that captures how important each word is, i.e. how much attention each word should pay to every other word in the sequence.
The scores are scaled down by dividing them by the square root of the dimension of the queries and keys.
This gives more stable gradients, since multiplying large values together can cause exploding gradients. The results are called the scaled scores.
Softmax is then applied to the scaled scores to get a probability between 0 and 1 for each word; words with higher probability get more attention, and the lower values are largely ignored.
softmax(x_i) = exp(x_i) / Σ_j exp(x_j)
These attention weights are then multiplied by the value vectors to get the output.
The higher softmax scores keep the corresponding values prominent, so they receive more attention.
Values with lower scores are treated as irrelevant.
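Putting these steps together, here is a minimal NumPy sketch of scaled dot-product attention for a single head (the sequence length and dimensions are just illustrative, and the inputs are random stand-ins for learned projections):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """scores = Q @ K.T / sqrt(d_k); softmax turns the scaled scores into attention
    weights, which then form a weighted sum of the value vectors."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # scaled scores
    weights = softmax(scores, axis=-1)   # probabilities between 0 and 1 per word
    return weights @ V                   # attention-weighted output

seq_len, d_k = 5, 64
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)
output = scaled_dot_product_attention(Q, K, V)   # shape (5, 64)
```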
The output vectors of the heads are passed to a linear layer for further processing.
With two self-attention heads you get two output vectors; both are concatenated and given to the linear layer for processing.
In theory each head learns something different, so using two (or more) heads for the final output of the linear layer gives the model more representational power.
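A sketch of the multi-headed version (two heads, with random weights standing in for the learned projection matrices) might look like this:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

num_heads, d_model = 2, 64
d_head = d_model // num_heads
x = np.random.randn(5, d_model)                      # toy input: 5 tokens

head_outputs = []
for _ in range(num_heads):
    # Each head has its own (here randomly initialised) query/key/value projections.
    W_q, W_k, W_v = (np.random.randn(d_model, d_head) for _ in range(3))
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    weights = softmax(Q @ K.T / np.sqrt(d_head))     # per-head attention weights
    head_outputs.append(weights @ V)

# Concatenate the heads and mix them with a final linear layer.
W_o = np.random.randn(num_heads * d_head, d_model)
multi_head_output = np.concatenate(head_outputs, axis=-1) @ W_o   # (5, 64)
```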
Step 4: Residual Connections, Layer Normalization and Feed Forward
The multi-headed attention output is added to the original input; this is called a residual connection.
The output of the residual connection then goes through layer normalization.
The layer normalization output is fed into a feed forward network for further processing. The feed forward network consists of a couple of linear layers with a ReLU activation in between.
Its output is again added to the input of the feed forward network (another residual connection) and passed through another layer normalization.
The residual connections help the network train by letting gradients flow directly through the network.
Layer normalization is used to stabilize the network, substantially reducing the training time necessary.
The point-wise feed forward layer is used to further process the attention output, giving it a richer representation.
All of this encodes the input into a continuous representation that carries the attention information. This helps the decoder focus on the important words of the input during decoding.
Encoder layers can be stacked N times to encode the information further, with each layer able to learn a different attention representation. This can powerfully boost the predictive power of the transformer.
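Tying the encoder steps together, here is a minimal sketch of one encoder layer's residual-and-normalize pattern, with random values standing in for trained weights and for the multi-headed attention output:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token's features to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Point-wise feed forward: two linear layers with a ReLU in between."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff, seq_len = 64, 256, 5
x = np.random.randn(seq_len, d_model)                 # encoder input (embeddings + positions)
attention_out = np.random.randn(seq_len, d_model)     # stand-in for the multi-headed attention output

# Residual connection around attention, followed by layer normalization.
x = layer_norm(x + attention_out)

# Feed forward sub-layer, again wrapped in a residual connection and layer norm.
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
x = layer_norm(x + feed_forward(x, W1, b1, W2, b2))   # output of one encoder layer
# Stacking N encoder layers simply repeats this block N times.
```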
Decoder layer
The decoder's job is to generate text sequences, and it has sub-layers similar to the encoder's: two multi-headed attention layers and one point-wise feed forward layer, each followed by residual connections and layer normalization.
Decoder Multi-Headed Attention
These layers behave similarly to the encoder's, but have a different job. The decoder is topped with a linear layer and a softmax to calculate the output probabilities.
The decoder takes its previous outputs as input, as well as the encoder outputs, which contain the attention information of the input.
The decoder stops decoding when it generates an end token as output.
Step 1:
The input goes through an embedding layer and a positional encoding layer to get the positional embeddings.
These embeddings are fed into the first multi-headed attention layer, which computes the attention scores for the decoder's input.
Look Ahead Mask
Since the decoder generates the sequence word by word, it has to be prevented from conditioning on future tokens.
Example: if the output to the question “Hi, how are you doing?” is “I am fine”,
the word ‘am’ should not have access to ‘fine’;
it should only have access to itself and the words before it.
The same holds for all the other words.
The method that prevents words from seeing future ones is called masking, which masks out the words ahead of them.
A look-ahead mask is added to the scaled scores to get the masked scores. In the look-ahead mask, the top-right triangle of scores is set to negative infinity, so the resulting masked scores contain negative infinities for the future positions.
When softmax is applied to the masked scores, the negative infinities (which correspond to future words) are zeroed out, leaving non-zero values only for the unmasked entries, i.e. the words in the past.
This masking is the only difference in how the attention scores are calculated in the decoder's first multi-headed attention layer. There are multiple heads, with the mask applied to each of them, before the outputs are concatenated and passed to a linear layer for further processing.
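A minimal sketch of building such a look-ahead mask and applying it to some toy scaled scores:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
scaled_scores = np.random.randn(seq_len, seq_len)       # toy scaled attention scores

# Future positions (the top-right triangle) get negative infinity.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
mask = np.where(future, -np.inf, 0.0)
masked_scores = scaled_scores + mask

# Softmax turns the -inf entries into zeros, so each word only attends
# to itself and to the words before it.
attention_weights = softmax(masked_scores, axis=-1)
print(np.round(attention_weights, 2))   # lower-triangular weight matrix
```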
The output of the first multi-headed attention layer is a masked output vector, with information on how the model should attend to the decoder's input.
In the second multi-headed attention layer, the encoder's output provides the keys and values, while the output of the first (masked) multi-headed attention layer provides the queries.
This matches the encoder's input against the decoder's input, allowing the decoder to decide which parts of the encoder's input are the relevant ones to focus on.
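A sketch of this second, encoder-decoder attention layer (random matrices stand in for the learned projections, and the lengths are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d_model, src_len, tgt_len = 64, 7, 4
encoder_output = np.random.randn(src_len, d_model)    # keys and values come from the encoder
decoder_hidden = np.random.randn(tgt_len, d_model)    # queries come from the first (masked) attention layer

W_q, W_k, W_v = (np.random.randn(d_model, d_model) for _ in range(3))
Q = decoder_hidden @ W_q
K = encoder_output @ W_k
V = encoder_output @ W_v

# (tgt_len, src_len): how much each output word attends to each input word.
weights = softmax(Q @ K.T / np.sqrt(d_model))
cross_attention_output = weights @ V                  # (tgt_len, d_model)
```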
Linear Classifier
The output of this layer is fed into a point-wise feed forward layer for further processing.
The output of the final feed forward layer goes to a final linear layer that acts as the classifier.
The classifier output is fed into a softmax, which produces probability scores between 0 and 1 for each class, i.e. each word in the vocabulary. The index with the highest probability is taken, and that is the predicted word.
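A minimal sketch of this last step, with a toy vocabulary and random weights in place of the trained classifier:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

vocab = ["i", "am", "fine", "how", "are", "you"]        # toy vocabulary, assumed for illustration
d_model = 64
decoder_output = np.random.randn(d_model)               # final decoder vector for the current position

# The linear classifier projects to one score (logit) per vocabulary word.
W_out = np.random.randn(d_model, len(vocab))
logits = decoder_output @ W_out

probs = softmax(logits)                                  # probability between 0 and 1 for each word
predicted_word = vocab[int(np.argmax(probs))]            # index of the highest probability
print(predicted_word)
```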
The decoder then takes the next input and keeps predicting until the last word (the end token) is generated.
N decoder layers can be stacked, each layer taking input from the encoder and from the layers before it. Each layer can focus on a different attention pattern, boosting the model's predictive power before the probabilities are used to choose the best output.
Conclusion
Hence we can say that transformers have a strong capability to predict the next word in a conversation, as well as a much larger attention window, in which we can see how much attention each word gives to every other word.
This encoder-decoder mechanism can be used to handle a whole series of problems, such as question answering, next-word prediction in text generation, translation, and more.
With further advancements, newer and more capable transformers are being developed to handle these problems even better.
BERT, GPT and others are all built on this same mechanism.
Thanks for reading!