
How to do NLTK on raw text

Kabita Soren

1. How to use NLTK to process raw text?

The web is undoubtedly the most convenient way to explore a huge collection of texts, but sometimes we want to work with our own texts. This post looks at how to use NLTK to process raw text. Along the way, a few more questions come up:

a) How can we access text from local files, and so get practically unlimited material to work on with NLTK?
b) How can we split a document into individual words so that we can analyse them?
c) How can we produce formatted output and save it to a file?

To answer these questions we will use NLP concepts such as tokenization and stemming, together with our Python knowledge of strings, files and regular expressions.

1.1 How to access text from the web?

For our example we use the website http://www.gutenberg.org/catalog/ . We import nltk and word_tokenize because we need them for the language processing. From this website we can select any book we want, choose the .txt format and get the URL of the text file. We will use the text file of the book “Every-day heroism”, which is text number 62307 in the catalogue and is in English. We can access it as shown below.


from __future__ import division # Python 2 users only
import nltk, re, pprint
from nltk import word_tokenize

We need to import request from urllib. request.urlopen() opens the URL of the text file, read() returns its contents as bytes, and decode('utf8') turns those bytes into a string, which we store in the variable rawtext. Using the type() function we can see that rawtext is a string.


from urllib import request
url = "http://www.gutenberg.org/files/62307-0.txt"
response = request.urlopen(url)
rawtext = response.read().decode('utf8')
type(rawtext)

>>output

str
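The download step can be wrapped in a small function so that the connection is always closed and a slow server does not block forever. This is only a sketch: fetch_text, its encoding default and its timeout value are hypothetical names and choices made for illustration, not part of NLTK or urllib.

```python
from urllib import request


def fetch_text(url, encoding="utf-8", timeout=30):
    """Download url and return its body as a str.

    fetch_text is a hypothetical helper written for this example.
    """
    # The with-block closes the connection even if read() raises.
    with request.urlopen(url, timeout=timeout) as response:
        raw_bytes = response.read()
    # errors="replace" substitutes a marker for bytes that are not valid
    # in the chosen encoding instead of raising UnicodeDecodeError.
    return raw_bytes.decode(encoding, errors="replace")


# Usage with the URL from this post (needs network access):
# rawtext = fetch_text("http://www.gutenberg.org/files/62307-0.txt")
# print(type(rawtext), len(rawtext))
```

The usage lines are commented out so the snippet can be loaded without a network connection.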

We can print the length of rawtext, which is the number of characters it contains, and slice rawtext to look at its first line.


len(rawtext)

>>output

25446

Printing the start of rawtext shows \ufeff, \r and \n in the first line. These are escape sequences as Python represents them: \ufeff is the Unicode code point U+FEFF (a byte order mark at the start of the file), \r is a carriage return and \n is a line feed.


rawtext[:68]

>>output

'\ufeffThe Project Gutenberg EBook of Every-day heroism, by Anonymous\r\n\r\nT'
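If the byte order mark and the Windows line endings get in the way (further down, the first token comes out as '\ufeffThe'), they can be removed while decoding. The sketch below is one possible approach and is not what the code in this post does: clean_gutenberg_bytes is a hypothetical helper, and the sample bytes are made up for the example.

```python
import codecs


def clean_gutenberg_bytes(raw_bytes):
    """Decode UTF-8 bytes, dropping a leading BOM and normalising newlines."""
    # "utf-8-sig" decodes UTF-8 and skips a byte order mark if one is present.
    text = raw_bytes.decode("utf-8-sig")
    # Turn Windows line endings (\r\n) into plain \n.
    return text.replace("\r\n", "\n")


# Made-up sample that mimics the start of a Project Gutenberg file.
sample = codecs.BOM_UTF8 + b"The Project Gutenberg EBook\r\n\r\nTitle"
print(repr(clean_gutenberg_bytes(sample)))
# prints 'The Project Gutenberg EBook\n\nTitle'
```

Note that cleaning the text this way changes its length, so index positions found in the uncleaned string would no longer match.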

Next comes the language processing: we break the string into a list of words and punctuation marks. This step is known as tokenization.

tokens = word_tokenize(rawtext)
type(tokens)

>>output

list

The type() function confirms that tokens is a list.


len(tokens)

>>output

4582

len() gives the number of tokens in the list. We can also slice the list tokens to look at its first ten elements.

tokens[:10]

>>output

['\ufeffThe',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Every-day',
 'heroism',
 ',',
 'by',
 'Anonymous']
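To get a feel for what a tokenizer does, here is a rough approximation written with a regular expression. It is not the algorithm behind word_tokenize, which handles many more cases; simple_tokenize and its pattern are hypothetical and only meant to illustrate the idea of splitting words from punctuation.

```python
import re

# A word (optionally with internal hyphens), or any single character
# that is neither a word character nor whitespace (i.e. punctuation).
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens (rough approximation)."""
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("Every-day heroism, by Anonymous"))
# prints ['Every-day', 'heroism', ',', 'by', 'Anonymous']
```

The comma becomes a token of its own, just as in the NLTK output above.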

So far we have used NLTK for tokenization. Next we pass tokens to nltk.Text(), which returns a Text object that we can use for further linguistic processing.

text = nltk.Text(tokens)
type(text)

>>output

nltk.text.Text

We can use the type() function for text and check that it is of type Text.

text[950:965]

>>output

['Napoleon',
 '_you_',
 'can',
 'never',
 'be',
 'a',
 'hero',
 ',',
 'unless',
 'you',
 'have',
 'some',
 'obstacles',
 'to',
 'overcome.']

A Text object can be sliced like a list: text[950:965] returns the tokens from index 950 up to, but not including, index 965.


text.collocations()

>>output

Project Gutenberg-tm; Project Gutenberg; Literary Archive; United States; Gutenberg Literary; Archive Foundation; electronic work; EVERY-DAY HEROISM; Gutenberg-tm License; Mrs. Morris; copyright holder; Every-day heroism; PROJECT GUTENBERG; said Mrs.; Charlene Taylor; Chuck Grief

We can also call text.collocations(), which prints pairs of words that occur together unusually often. Many of the pairs here, such as Project Gutenberg, Literary Archive and Archive Foundation, come from the header and licence text that every file downloaded from the Project Gutenberg website carries, alongside the title of the book and the names of the people who scanned the text. The result therefore says more about the surrounding boilerplate than about the story itself.
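The idea of a collocation can be illustrated by simply counting adjacent word pairs. This is a deliberately simplified sketch: it ranks pairs by raw frequency only, whereas NLTK scores pairs statistically, so the two will not give the same results. top_adjacent_pairs and the sample token list are hypothetical.

```python
from collections import Counter


def top_adjacent_pairs(tokens, n=3):
    """Return the n most frequent pairs of neighbouring tokens."""
    # zip(tokens, tokens[1:]) pairs every token with the one after it.
    pairs = zip(tokens, tokens[1:])
    return Counter(pairs).most_common(n)


# Made-up token list for the example.
sample_tokens = ["Project", "Gutenberg", "is", "free", ";",
                 "Project", "Gutenberg", "License"]
print(top_adjacent_pairs(sample_tokens, n=1))
# prints [(('Project', 'Gutenberg'), 2)]
```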


rawtext.find("[Illustration]")

>>output

1093

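Using find() and rfind() we can get the exact index positions needed to slice the string rawtext. find() returns the index of the first occurrence of a substring, and rfind() returns the index of the last occurrence.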

rawtext.rfind("End of the Project Gutenberg EBook of Every-day heroism, by Anonymous")

>>output

6129

We used find() and rfind() to detect where the actual content begins and ends. We can now overwrite rawtext with the slice that starts at index 1093 and stops just before index 6129, which leaves only the content of the book, as shown below.


rawtext = rawtext[1093:6129]
rawtext

>>output

'[Illustration]\r\n\r\n“I will not try any more. Everything goes wrong to-day,” exclaimed\r\nCharley Morris, throwing down his slate in a pet.\r\n\r\n“Nothing succeeds that I try to do. Everything turns out just the wrong\r\nway.”\r\n\r\n“I want you to run and get me the book,” said his mother, “which I left\r\non the seat at the farther end of the garden; then afterwards we will\r\nsee if anything can be done to coax events into a better humour.”\r\n\r\nCharley returned with his face a little brighter from a moment’s\r\nexercise in the fresh air, and seated himself at his mother’s feet.\r\n\r\n“Do you believe in unlucky days, mother?” said he.\r\n\r\n“I do not believe they come very often,” said Mrs. Morris.\r\n\r\n“But how can you help their coming, mother?”\r\n\r\n“Treat them in such a way when they occur that they will not return very\r\nsoon. But now I want you to tell me what has made this day ‘unlucky,’\r\nand then perhaps I can tell you what to do about it.”\r\n\r\n“Well, you see, mother, I overslept myself this morning, and was late\r\nat breakfast. That put me out. Then Agnes laughed at me for being so\r\nlate, and that made me cross.”\r\n\r\n[Illustration]\r\n\r\n“Stop a moment, my dear, and notice where your ‘unlucky day’ began. The\r\ntrifling error in being late in rising cannot excuse the greater fault\r\nof ill-temper. A single act of self control might have altered the\r\ncourse of the whole day.”\r\n\r\n“Then, mother, I went to school feeling just as cross; I thought I had\r\nall my lessons perfectly; but when I got to school, I found I had\r\nlearned the wrong spelling-lesson, and that provoked me a little more,\r\nbut I set to work to learn the right one. While in the midst of that,\r\nthe arithmetic class was called. 
I had studied the lesson thoroughly\r\nlast night, but somehow the spelling, or being provoked, or something\r\nelse, had put it all out of my head, so that I missed ever so many\r\nquestions: and, to end it all, I have got twelve extra examples to work\r\nout at home. I cannot do them; it is no use trying to do anything on\r\nsuch days.”\r\n\r\nThere was a pause of a few moments, and then his mother said:\r\n\r\n“Charley, you like to read the histories of great soldiers and heroes of\r\nold times, such as Alexander, and Cæsar, and Napoleon?”\r\n\r\n“Yes, mother, very much.”\r\n\r\n“Well, tell me, when do you like Alexander best--feasting at Babylon--or\r\nin action, commanding his army, attacking the enemy, and gaining\r\nvictories?”\r\n\r\n“I like him best in action, mother, of course.”\r\n\r\n“True, we like bravery better than cowardice. When do you like best to\r\nread of Napoleon--imprisoned at St. Helena, or at the beginning of his\r\ncourse with difficulties around him, but rising above them all by his\r\nstrength of will?”\r\n\r\n“Oh, I like him best in the beginning, mother,” said Charley, with\r\nkindled enthusiasm.\r\n\r\n“But,” said Mrs. Morris, “suppose he could have marched by a smooth\r\nroad, straight from France to Italy.”\r\n\r\n“Why, he would not have been a hero at all, if he had not something to\r\nconquer.”\r\n\r\n“And the will to conquer it,” added Mrs. Morris with a smile. “ That is\r\njust what I want you to notice. We cannot imitate, if we would, the\r\nprecise actions of these great conquerors; but we _can_ copy their\r\nenergy and strength of purpose, and our daily life furnishes\r\nopportunities to cultivate these qualities.”\r\n\r\n“I do not see how, mother.”\r\n\r\n“The life of a little school-boy presents some difficulties--does it\r\nnot, Charley?”\r\n\r\n“Yes, mother,” he replied, glancing ruefully at his Arithmetic.\r\n\r\n“Then _there_ is something to conquer, and in the conquest you can grow\r\nstrong and brave. 
Like Napoleon _you_ can never be a hero, unless you\r\nhave some obstacles to overcome.”\r\n\r\n“I wish the difficulties would not always come when I feel so cross.”\r\n\r\n“The crossness is the very first thing you need to conquer. There is a\r\nproverb on that subject: “He that is slow to anger, is better than the\r\nmighty; and he that ruleth his spirit, than he that taketh a city.”\r\n\r\n“That is an important thing to remember,” said Mrs. Morris. “If we are\r\never to attain anything great or good in life, our career of conquest\r\nmust begin in our own hearts. Until all unruly feelings and passions\r\nare under control, our efforts toward knowledge, or anything else that\r\nis worth the winning, will be of little avail. What people call adverse\r\nfate, is the result of their own faults and failings.”\r\n\r\n“Do you think one can always help feeling unpleasantly, mother?”\r\n\r\n“I think one can learn either to put down all disagreeable feelings, or\r\nto work bravely on and never mind them. But what lessons do you most\r\nfrequently have trouble with, Charley?”\r\n\r\n“Oh! this arithmetic, mother, it is the only thing that troubles me.”\r\n\r\n“I will write on your book, two mottos which I wish you to look at,\r\nwhenever you are fretted, or discouraged by difficulties. The first\r\nis:--‘Every boy may be a hero.’ And that you may remember what sort of\r\nheroism is to be sought, I will add this verse: ‘_He that ruleth his\r\nspirit, is greater than he that taketh a city_.’”\r\n\r\n\r\n\r\n\r\n\r\n'
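One thing to watch out for: find() and rfind() return -1 when the marker is not in the string, and slicing with -1 silently gives the wrong text. The sketch below wraps the same find/rfind/slice steps in a function that checks for this. slice_between is a hypothetical helper and the sample string is made up for the example.

```python
def slice_between(text, start_marker, end_marker):
    """Return text from the first start_marker up to the last end_marker."""
    start = text.find(start_marker)   # first occurrence, or -1
    end = text.rfind(end_marker)      # last occurrence, or -1
    if start == -1 or end == -1 or end < start:
        raise ValueError("marker not found, or markers in the wrong order")
    # The slice includes start_marker and excludes end_marker.
    return text[start:end]


# Made-up sample with a header, some content and a footer.
sample_text = ("HEADER\n[Illustration]\nStory text.\n"
               "End of the Project Gutenberg EBook\nLICENSE")
print(repr(slice_between(sample_text, "[Illustration]",
                         "End of the Project Gutenberg EBook")))
# prints '[Illustration]\nStory text.\n'
```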

This was my first blog post, written for beginners like me, to give an idea of how to access text available on a website and process it with NLTK. In my next post I will write about how to access text files that we create ourselves and process them with NLTK, so keep visiting my blog for the link to it.

References: http://nltk.org/, http://docs.python.org/
