Namrata Kapoor

NLP: An easy explanation of common terms, with Python


Natural language processing (NLP) is a field where data science meets linguistics and artificial intelligence. It is concerned with the interactions between computer systems and human language, so that systems can interpret and analyze natural language. It is an expanding field of data science in which various techniques are applied to analyze large amounts of natural language data.


One of the most popular libraries for this work is nltk (the Natural Language Toolkit), which can be imported with the following line of code.


import nltk


1) Tokenization: Breaking text down into smaller parts is called tokenization. It can mean splitting paragraphs into sentences, or sentences into words.


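# paragraph is a string holding the text to process
# (the tokenizers need nltk's tokenizer data, fetched once with nltk.download())
# Tokenizing sentences
sentences = nltk.sent_tokenize(paragraph)

# Tokenizing words
words = nltk.word_tokenize(paragraph)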
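
To make the idea concrete, here is a minimal sketch that uses only Python's built-in re module instead of nltk. The sample paragraph and the two regular expressions are assumptions chosen for illustration; nltk's tokenizers handle abbreviations, quotes, and other edge cases that this naive version does not.

```python
import re

# Sample text (an assumption for illustration)
paragraph = "NLP is fun. It helps computers read text! Does it work?"

# Naive sentence tokenizer: a run of characters ending in ., ! or ?
sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]", paragraph)]

# Naive word tokenizer: runs of word characters, or single punctuation marks
words = re.findall(r"\w+|[^\w\s]", paragraph)

print(sentences)
print(words)
```

The paragraph is split into three sentences, and the word list keeps punctuation marks as separate tokens, which is also how nltk.word_tokenize treats them.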




2) Stemming: Reducing related word forms to a common stem, or base form, is called stemming.

The most popular stemmer for English is the Porter Stemmer, which is available in nltk.

For example, the Porter Stemmer reduces "running" and "runs" to "run", and "studies" to "studi".

Advantage: It is a fast algorithm and can be used in models where speed matters.

Disadvantage: It doesn't take the meaning of the word into consideration; it just reduces it to a stem, which may not be a real word.


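# import
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
# review is a list of word tokens; build the stop-word set once
stop_words = set(stopwords.words('english'))
review = [ps.stem(word) for word in review if word not in stop_words]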
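
The rule-based nature of stemming can be sketched with a toy suffix stripper. This is not the Porter algorithm, which applies a much larger, staged rule set; the suffix list and the naive_stem function are hypothetical and exist only to show why a stem is not always a real word.

```python
# Suffixes to strip, checked in order (a toy list, not Porter's rule set)
SUFFIXES = ("ing", "ies", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["playing", "played", "plays", "studies"]:
    print(w, "->", naive_stem(w))
```

The first three words all collapse to "play", while "studies" becomes "stud": fast to compute, but blind to meaning.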



3) Lemmatization: Reducing related word forms to a meaningful base word is called lemmatization. It uses a vocabulary and morphological analysis of words, aims to remove inflectional endings only, and returns the base or dictionary form of a word, called the lemma.

For example, "studies" is reduced to the lemma "study" and "feet" to "foot".

Advantage: Because the output is a real word, it is often used in chatbots, where giving meaningful responses is the main goal.

Disadvantage: It is slower than stemming, which matters where processing time is the main consideration.



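# import
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
wordnet = WordNetLemmatizer()
# review is a list of word tokens; build the stop-word set once
stop_words = set(stopwords.words('english'))
review = [wordnet.lemmatize(word) for word in review if word not in stop_words]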
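
The vocabulary-lookup idea behind lemmatization can be sketched with a plain dictionary. The LEMMAS table and the naive_lemmatize function below are hypothetical stand-ins; WordNet holds a far larger vocabulary and also applies morphological rules.

```python
# Tiny hand-written vocabulary mapping inflected forms to lemmas
# (an assumption for illustration; WordNet holds far more entries)
LEMMAS = {
    "studies": "study",
    "feet": "foot",
    "mice": "mouse",
    "women": "woman",
}

def naive_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    word = word.lower()
    return LEMMAS.get(word, word)

print([naive_lemmatize(w) for w in ["Studies", "feet", "artist"]])
```

Unlike a stem, every value returned here is a real dictionary word, at the cost of needing a vocabulary to look things up in.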


4) Stop Words: Words that carry little meaning for language processing can be removed before applying a model or analyzing sentiment. Words like "is", "an", "you", and "the" are called stop words, and a list of them can be imported with 'from nltk.corpus import stopwords'.


In Python:


# import
from nltk.corpus import stopwords
# Keep only the words that are not stop words
stop_words = set(stopwords.words('english'))
review = [word for word in review if word not in stop_words]
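
As a self-contained illustration of the filtering step, the sketch below uses a small hand-picked stop-word set instead of nltk's list. Both the set and the sample sentence are assumptions for the example; nltk's English list is considerably longer.

```python
# Small hand-picked stop-word set (an assumption; nltk's English list is longer)
stop_words = {"she", "is", "a", "and", "also", "he", "but", "are", "in"}

sentence = "She is a very good and decent woman, she is also a good artist."

# Lowercase each token and strip surrounding punctuation, then drop stop words
tokens = [t.strip(".,").lower() for t in sentence.split()]
filtered = [t for t in tokens if t and t not in stop_words]

print(filtered)
```

Storing the stop words in a set makes each membership check fast, which is why it is worth building the set once rather than inside the loop.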

5) Bag of Words: This is a way of representing text for machine learning algorithms. It records which words occur and how often. In this model, a text is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.



Advantage:

It is fast, easy to implement, and takes word frequency into account.

Disadvantage:

It doesn't capture the information in the text, i.e. the meaning of the words is lost. It assumes all words are independent of each other.

It is suitable only for small data.

Example:

Sentence 1: She is a very good and decent woman, she is also a good artist.

Sentence 2: He is a bad man but a good driver.

Sentence 3: Man and woman are equal in a decent society.


# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500)
# fit_transform expects a list of strings (one per sentence), not a single string
X = cv.fit_transform(sentences).toarray()
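
To see what the resulting matrix contains, here is a from-scratch sketch built on collections.Counter and applied to the three example sentences. It is an illustration rather than a re-implementation of CountVectorizer: the tokenizer is a simple assumption, and unlike CountVectorizer's defaults it keeps one-letter words such as "a".

```python
import re
from collections import Counter

sentences = [
    "She is a very good and decent woman, she is also a good artist.",
    "He is a bad man but a good driver.",
    "Man and woman are equal in a decent society.",
]

# Tokenize: lowercase alphabetic words only (a simplification)
tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]

# Vocabulary: every distinct word, sorted for a stable column order
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per sentence, one column per vocabulary word, cells hold counts
counts = [Counter(tokens) for tokens in tokenized]
X = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in X:
    print(row)
```

In the column for "good", the three rows hold 2, 1 and 0: the word order is gone, but the multiplicity is kept.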

6) TF-IDF: Term Frequency-Inverse Document Frequency is a statistical measure of how relevant a word is to one text within a collection of texts. It is the product of two metrics, term frequency and inverse document frequency.


Term Frequency = (Number of times a word occurs in a sentence) / (Number of words in the sentence)


Inverse Document Frequency = log(Number of sentences / Number of sentences containing the word)


Its calculation is illustrated below:

Example:

Sentence 1: She is a very good and decent woman, she is also a good artist.

Sentence 2: He is a bad man but a good driver.

Sentence 3: Man and woman are equal in a decent society.

After removing stop words, the sentences become:

Sentence 1: very good decent woman good artist.

Sentence 2: bad man good driver.

Sentence 3: man woman equal decent society.
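
The two formulas can be applied to these cleaned sentences with a few lines of plain Python. This is a sketch under stated assumptions: the natural logarithm is used (the formula does not fix a base), the function names are made up for the example, and scikit-learn's TfidfVectorizer will give different numbers because by default it smooths the IDF and normalizes each row.

```python
import math

# The three example sentences after stop-word removal, lowercased
docs = [
    "very good decent woman good artist".split(),
    "bad man good driver".split(),
    "man woman equal decent society".split(),
]

def tf(word, doc):
    # Occurrences of the word / total words in the sentence
    return doc.count(word) / len(doc)

def idf(word, docs):
    # log(number of sentences / number of sentences containing the word)
    # Assumes the word occurs in at least one sentence
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# Scores for a few words of Sentence 1
for word in ["good", "artist", "decent"]:
    print(word, round(tfidf(word, docs[0], docs), 3))
```

In Sentence 1, "artist" scores higher than "good" even though "good" occurs twice, because "good" also appears in Sentence 2 and is therefore less distinctive.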




Advantages: It is easy to compute and implement. It gives a basic metric for extracting the most descriptive terms in a text. The similarity between two texts can be computed easily with it, and search engines can use it.

Disadvantages: TF-IDF doesn’t capture semantics or position of occurrence of words in a text.

In Python:


# Creating the TF-IDF model
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()
# fit_transform expects a list of strings (one per sentence), not a single string
X = cv.fit_transform(sentences).toarray()
 

7) Word2Vec: Word2Vec is a technique for natural language processing (NLP). The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect similar words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector.

The vectors are chosen carefully such that a simple mathematical function (the cosine similarity between the vectors) indicates the level of semantic similarity between the words represented by those vectors.
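
Cosine similarity itself is simple to compute. The sketch below uses made-up three-dimensional vectors purely for illustration; real Word2Vec vectors are learned from a corpus and have many more dimensions, and the variable names are assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up "word vectors" (assumptions, not output of a trained model)
woman = [0.9, 0.8, 0.1]
queen = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

print(cosine_similarity(woman, queen))  # close to 1: similar direction
print(cosine_similarity(woman, car))    # much smaller: different direction
```

A value near 1 means the two vectors point in almost the same direction, which a trained model would produce for words used in similar contexts.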


Advantages:

It transforms an unlabeled raw corpus into labelled data by mapping each target word to the words it is contextually related to, and so learns word representations through a classification model.

The mapping from a target word to its context words embeds sub-linear relationships into the vector space of words, so that relationships like "king is to man as queen is to woman" can be inferred from the word vectors.

Easy to understand and implement.

Disadvantages:

The sequence of words is lost and hence sub-linear relationships are not very well defined.

The data has to be fed to the model online and may need pre-processing, which requires memory space.

The model can be very difficult to train if the number of categories is too large, i.e. the corpus is too big and the vocabulary is too large.


In Python it can be implemented as:



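from gensim.models import Word2Vec
# sentences must be a list of tokenized sentences (a list of lists of words)
# Training the Word2Vec model
model = Word2Vec(sentences, min_count=1)
# Vocabulary (gensim 4.x; in gensim 3.x this was model.wv.vocab)
words = model.wv.key_to_index
# Finding Word Vectors
vector = model.wv['woman']
# Most similar words
similar = model.wv.most_similar('woman')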

I hope this post has made these terms clearer and NLP a little easier to approach.

Thanks for reading!

