Namrata Kapoor

NLP-Word Embedding made easy

Word Embedding


To tell things briefly and in a meaningful way is the best strategy to communicate. In natural language processing, however, achieving this is not an easy task.

We have already learnt about Word2Vec, bag of words, lemmatization and stemming in my last blog on NLP.


Here we will try to understand what word embedding is, and we will also implement it in Python using Keras.

When we have a large vocabulary, say 10000 words, vectorizing it with one-hot encoding results in a very high-dimensional dataset which is sparse, i.e. lots of zeros and only a few indices having the value one. It is also hard-coded: each word simply gets an arbitrary index, so the vectors carry no information about meaning or about how words relate to each other.

So the shortcomings of one-hot encoding for a large dataset are:

· Sparse dataset

· High dimensions

· Hard-coded


For example:
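To make this concrete, here is a small sketch of my own using a toy 6-word vocabulary; with 10000 words each of these vectors would have 10000 entries, almost all of them zero.

# toy illustration: a made-up 6-word vocabulary, not the full 10000-word case
vocab = ['plate', 'pasta', 'pizza', 'cup', 'coffee', 'milk']

def one_hot_vector(word, vocab):
    vec = [0] * len(vocab)        # mostly zeros, i.e. sparse
    vec[vocab.index(word)] = 1    # a single 1 at the word's index
    return vec

print(one_hot_vector('coffee', vocab))   # [0, 0, 0, 0, 1, 0]
print(one_hot_vector('milk', vocab))     # [0, 0, 0, 0, 0, 1]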



To convert this sparse representation into something more meaningful, something that also carries contextual information, we can use word embedding.

A few of its advantages are:

· These vectors are dense, so they carry more information.

· They have lower dimensions, mostly 512, 1024 or 2048 dimensions even for very large datasets.

· They are learnt from data, i.e. the values along each dimension and the cosine similarity between vectors capture a word and its context.

It can be pictured as a table of words against such dimensions:


Please note that a word related to a dimension will have a higher value along that dimension, and an unrelated word will have a lower value.
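As a hedged illustration (the dimension names and numbers below are made up by me, not produced by any model), related words score high along the same dimensions and therefore have a high cosine similarity:

import numpy as np

# invented dimensions: food, drink, royalty
coffee = np.array([0.1, 0.9, 0.0])
milk   = np.array([0.2, 0.8, 0.1])
king   = np.array([0.0, 0.1, 0.9])

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(coffee, milk))   # high: the two drinks are related
print(cosine(coffee, king))   # low: the words are unrelated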


Now let's see how to do it in Keras.


# import one_hot from tensorflow.keras
from tensorflow.keras.preprocessing.text import one_hot


### Define sentences
sentence=['A plate of pasta',
          'A plate of Pizza',
          'the cup of coffee',
          'the cup of milk',
          'I am a good girl',
          'I am a good leader',
          'leaders make progressive teams',
          'I like your blogs',
          'write more often']

We can define the vocabulary size as any suitably large value, like below:



### Vocabulary size
voc_size=10000

Convert the sentences into one-hot encoded form with the help of the one_hot function.


onehot_repr=[one_hot(words,voc_size) for words in sentence]
print(onehot_repr)

The output is a list of integer indices between 0 and 9999, because the vocabulary size is given as 10000. (Note that one_hot assigns indices by hashing, so the exact numbers can differ from run to run.)

[[5080, 8262, 6587, 5680], [5080, 8262, 6587, 7331], [9108, 9471, 6587, 4276], [9108, 9471, 6587, 124], [7246, 6209, 5080, 1874, 7331], [7246, 6209, 5080, 1874, 9225], [2461, 3699, 4829, 3309], [7246, 7057, 7590, 5393], [8770, 9206, 5144]]

Import the Keras Embedding layer, the Sequential model and pad_sequences for making all sentences the same length.

from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential

Declare the sentence length as 9 and pre-pad shorter sentences with 0s.


import numpy as np
sent_length=9
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)

The output is as below; each number is the vocabulary index of a word and the 0s are the padding that makes all sentences the same length.

[[   0    0    0    0    0 5080 8262 6587 5680]
 [   0    0    0    0    0 5080 8262 6587 7331]
 [   0    0    0    0    0 9108 9471 6587 4276]
 [   0    0    0    0    0 9108 9471 6587  124]
 [   0    0    0    0 7246 6209 5080 1874 7331]
 [   0    0    0    0 7246 6209 5080 1874 9225]
 [   0    0    0    0    0 2461 3699 4829 3309]
 [   0    0    0    0    0 7246 7057 7590 5393]
 [   0    0    0    0    0    0 8770 9206 5144]]

I am taking the number of dimensions as 10; these are the categories along which words are described and their relationships established. This is enough for a small dataset like this one; for larger ones take 512, 1024, etc.

The model is defined as Sequential, the optimizer used is Adam, and the loss function is mean squared error.

The Embedding layer takes the vocabulary size, the number of dimensions and the sentence length as parameters.

dim=10
model=Sequential()
model.add(Embedding(voc_size,dim,input_length=sent_length))
model.compile('adam','mse')

Prediction is done as follows:


model.predict(embedded_docs)

It is important to note that each word in a sentence will have 10 values in the prediction, as 10 dimensions were given.
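A quick way to confirm this, assuming the cells above have been run, is to check the shape of the prediction: 9 sentences, 9 word positions per sentence and 10 values per word.

pred = model.predict(embedded_docs)
print(pred.shape)   # (9, 9, 10): sentences x word positions x dimensions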

If we look at the sentences closely, we can see a repeated word, 'good', in sentences 5 and 6, and similar words, 'coffee' and 'milk', in sentences 3 and 4.

Let's see what values these words are given.

For the word 'good' in sentence 5 it is:


print(model.predict(embedded_docs)[4][7])

Output

[-0.04040471 -0.03818582 -0.04255371  0.02705869  0.02397548  0.02389303
  0.01329582  0.03060367  0.00522707 -0.00455904]

For the word 'good' in sentence 6 it is the same, since the same word maps to the same index and therefore gets exactly the same vector:


print(model.predict(embedded_docs)[5][7])

Output

[-0.04040471 -0.03818582 -0.04255371  0.02705869  0.02397548  0.02389303
  0.01329582  0.03060367  0.00522707 -0.00455904]

To see the context vector of the word 'coffee' in sentence 3:



print(model.predict(embedded_docs)[2][8])

Output

[ 0.0118471  -0.00512182  0.04127577 -0.00845211 -0.04556705 -0.02067187
  0.02335585  0.02106038 -0.03808425  0.03151764]

To see the context vector of the word 'milk' in sentence 4:



print(model.predict(embedded_docs)[3][8])


Output

[ 0.01100521 -0.03935759 -0.01848283 -0.02745509  0.04155341 -0.0029874
 -0.01868203 -0.01668019 -0.03602111 -0.04758583]

Now if we compare these values, we can see that for a few dimensions the values are quite close, for example dimensions 1 and 9.

This is because both are similar kinds of drinks.
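Rather than eyeballing individual dimensions, one option (my own addition to the walkthrough) is to summarise the comparison with a single cosine-similarity score between the two predicted vectors:

import numpy as np

pred = model.predict(embedded_docs)
coffee_vec = pred[2][8]   # 'coffee' in sentence 3
milk_vec   = pred[3][8]   # 'milk' in sentence 4

# cosine similarity: 1 means same direction, 0 means unrelated
cos_sim = np.dot(coffee_vec, milk_vec) / (
    np.linalg.norm(coffee_vec) * np.linalg.norm(milk_vec))
print(cos_sim)

Keep in mind that this embedding layer is only randomly initialised, not yet trained on a task, so the score is illustrative; it is embeddings trained on large corpora where related words reliably end up close together.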

These are fairly high-dimensional vectors, but if we project the words onto a 2-dimensional plane, many similar words turn out to be near each other, like (boy, girl), (king, queen), (milk, coffee).
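One way to sketch such a 2-D picture (a rough sketch assuming scikit-learn is available; the word indices below are read off the one-hot output printed earlier and will change if the encoding is re-run) is to project a few embedding vectors with PCA:

from sklearn.decomposition import PCA

# the embedding matrix of the layer has shape (voc_size, dim)
weights = model.layers[0].get_weights()[0]

# indices of a few words, taken from the one-hot output above
word_indices = {'coffee': 4276, 'milk': 124, 'pasta': 5680, 'pizza': 7331}

vectors = [weights[i] for i in word_indices.values()]
coords = PCA(n_components=2).fit_transform(vectors)

for word, (x, y) in zip(word_indices, coords):
    print(word, round(x, 4), round(y, 4))

With this tiny untrained model the points will land more or less at random; clusters like (king, queen) or (milk, coffee) show up when the embeddings are trained on a large corpus or loaded from a pre-trained model such as Word2Vec or GloVe.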


Conclusion

Word embedding is the most popular technique for capturing semantic, contextual and syntactic similarity between words. These are simple vector representations, but they convey much more than that.

I hope that after reading this blog you have an idea of how to get vector representations that carry some meaning.

For more such blogs and simple explanations, follow me on Medium or my LinkedIn.

Thanks for reading!
