Namrata Kapoor

N-gram and its use in text generation

In natural language processing, it is important to make sense not only of individual words but of their context too.

N-grams are one way to capture that context and better understand the meaning of words, whether written or spoken.

For example, “I need to book a ticket to Australia.” vs. “I want to read a book of Shakespeare.”



Here the word “book” has different meanings altogether.

In the first sentence it is used as a verb, i.e. an action, while in the second it is used as a noun, i.e. an object.

We understand this because we have learned from childhood to read a word in the context of its sentence: the words used before or after “book” tell us which meaning applies.


Now the question arises: how does NLP understand the context of a word?

Machines learn this by looking at the words that appear before and after a given word.


The answer is n-grams.

A bigram splits a sentence into pairs of consecutive words to capture context.

A machine can infer that if an article is used before “book”, it is a noun, and that if “read” appears in the sentence, it is of course a book for reading.




A trigram splits a sentence into sets of three consecutive words to capture context. The bigger the window, the harder it is to find enough matching word sequences in the data, so the counts become sparser.
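
To make this concrete, here is a minimal sketch in plain Python (NLTK's bigrams and ngrams helpers, used later in this post, do the same) showing how the n-grams of the two “book” sentences expose the context:

# A minimal sketch of n-gram extraction, using only plain Python
def ngrams_of(tokens, n):
    # all runs of n consecutive tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sent1 = "I need to book a ticket to Australia".split()
sent2 = "I want to read a book of Shakespeare".split()

print(ngrams_of(sent1, 2))  # bigrams of sentence 1
print(ngrams_of(sent2, 3))  # trigrams of sentence 2

# The word right before 'book' carries the signal:
#   'to book' -> verb ("to book a ticket")
#   'a book'  -> noun (article before 'book')
for sent in (sent1, sent2):
    print("word before 'book':", sent[sent.index('book') - 1])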


The n in n-gram defines how many consecutive words one needs to look at to see the context.

Another example is the word “bull”: in one sentence it can be an animal (“the bull charged”), while in another it refers to the share market (“a bull market”).


One can also use this to catch the negative context of words, as in:

“The movie was not nice, awful really.”

The words around “nice” (“not” before it, “awful” after it) cancel the meaning of the positive word “nice”.
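
As a hedged sketch of this idea (the word lists below are tiny toy lexicons for illustration, not real sentiment resources), bigrams are enough to spot a negated positive word:

# Toy lexicons, purely illustrative
positive_words = {'nice', 'good', 'great'}
negations = {'not', 'never', "n't"}

tokens = "The movie was not nice , awful really .".split()

# zip the token list against itself shifted by one to get bigrams
for prev, word in zip(tokens, tokens[1:]):
    if word in positive_words and prev in negations:
        print(f"'{prev} {word}' -> the positive word '{word}' is negated")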

We can also capture sarcasm this way. Sarcasm is an ironic or satirical remark tempered by humor.

An example of the sentence is:

“You are intelligent…not.” It actually means you are not intelligent.

Many other forms of sarcasm, such as those carried by tone or phrased as questions, are still active areas of research.

We use n-grams, i.e. groups of consecutive words, to capture the broader context of a text, which is then fed to machine learning models to learn its real meaning.

N-grams are a simple yet effective approach in natural language processing for understanding the context of words.

Now let’s see how to implement it practically in Python.


Install the following packages

!pip install -U pip
!pip install -U dill
!pip install -U nltk==3.4

Import libraries

from nltk.util import pad_sequence
from nltk.util import bigrams
from nltk.util import ngrams
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends
from nltk.lm.preprocessing import flatten

Example sentences

text = [['I', 'need', 'to', 'book', 'a', 'ticket', 'to', 'Australia'],
        ['I', 'want', 'to', 'read', 'a', 'book', 'of', 'Shakespeare']]

Bigrams can be seen as:


list(bigrams(text[0]))
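
With the sentence above, this prints the seven adjacent word pairs:

[('I', 'need'), ('need', 'to'), ('to', 'book'), ('book', 'a'), ('a', 'ticket'), ('ticket', 'to'), ('to', 'Australia')]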

Trigrams (n-grams with n=3) can be seen as:

list(ngrams(text[1], n=3))
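
This yields the six word triples:

[('I', 'want', 'to'), ('want', 'to', 'read'), ('to', 'read', 'a'), ('read', 'a', 'book'), ('a', 'book', 'of'), ('book', 'of', 'Shakespeare')]

Since we also imported pad_both_ends and everygrams, a quick sketch of what they do:

list(pad_both_ends(text[1], n=2))
# ['<s>', 'I', 'want', 'to', 'read', 'a', 'book', 'of', 'Shakespeare', '</s>']

list(everygrams(text[1], max_len=2))
# all unigrams and bigrams of the sentence in a single list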

Now let's try using n-grams for text generation. For this, let's load the Trump tweets dataset from Kaggle and put it in a data frame.

import pandas as pd
df = pd.read_csv('../input/trump-tweets/realdonaldtrump.csv')
df.head()

Import the tokenizer functions

from nltk import word_tokenize, sent_tokenize 

Now tokenize the tweet column

trump_corpus = list(df['content'].apply(word_tokenize))
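
Note that word_tokenize relies on NLTK's punkt tokenizer models; if they are not already on the machine, download them once first:

import nltk
nltk.download('punkt')  # one-time download of the tokenizer models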

Import the padded everygram pipeline

from nltk.lm.preprocessing import padded_everygram_pipeline

Apply the n-gram pipeline to the corpus

# Preprocess the tokenized text for 3-grams language modelling
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, trump_corpus)
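
To see what the pipeline actually produces, here is a quick look using the toy text from earlier (a 2-gram order is used for brevity; both return values are generators, consumed once):

toy_train, toy_vocab = padded_everygram_pipeline(2, text)
for sent_grams in toy_train:
    print(list(sent_grams))  # all 1-grams and 2-grams of each padded sentence
print(list(toy_vocab))       # flat stream of padded tokens, used to build the vocabulary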


Define a maximum likelihood estimation (MLE) model:

from nltk.lm import MLE
trump_model = MLE(n) # Let's train a 3-gram model; we set n = 3 above
trump_model.fit(train_data, padded_sents)
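
Once fitted, the model can be inspected. A few hedged probes (the tokens 'America' and 'great' are illustrative guesses, not guaranteed to occur in the tweets):

len(trump_model.vocab)             # vocabulary size, padding symbols included
trump_model.counts['America']      # unigram count (0 if the word never occurs)
trump_model.score('great', ['a'])  # P('great' | 'a') under maximum likelihood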

Generate sentences from the model after detokenizing the content.


from nltk.tokenize.treebank import TreebankWordDetokenizer

detokenize = TreebankWordDetokenizer().detokenize

def generate_sent(model, num_words, random_seed=42):
    """
    :param model: An ngram language model from `nltk.lm.model`.
    :param num_words: Max no. of words to generate.
    :param random_seed: Seed value for random.
    """
    content = []
    for token in model.generate(num_words, random_seed=random_seed):
        if token == '<s>':   # skip sentence-start padding
            continue
        if token == '</s>':  # stop at sentence-end padding
            break
        content.append(token)
    return detokenize(content)

Generate sentences based on the text in the data frame (a fixed random_seed makes the output reproducible):


generate_sent(trump_model, num_words=20, random_seed=42)

generate_sent(trump_model, num_words=10, random_seed=0)

Conclusion

Text generation through n-grams is a basic method; RNNs and LSTMs are used for more refined generation. The context captured by n-grams can come with a lot of noise, so stop words can be removed to make the text cleaner.

Advantages of n-grams

It gives insight at different levels (bigram, trigram, n-gram).

Simple and conceptually easy to understand.

Disadvantages of n-grams

We may need to remove stop words to avoid noise in the results.

A raw count does not necessarily indicate a word's importance to the text or an entity.


Thanks for reading!
