How to perform NLTK on a text file?
Hi all, welcome to my second blog. I am sure you have read my previous one. In this blog we will learn how to use NLTK on a text file, which we will also refer to as an NLTK corpus file. Before we start, we might have some questions in mind:
a) What is the source of the file we will process with NLTK?
b) Can we create a file and then modify it according to our requirements?
c) How do we deal with different file formats?
Yes, we will get answers to all of these questions by the time we finish this blog.
In my previous blog we took the example article from a URL. For this blog we will use a document created on our local system. We can create a blank document in Notepad, add some text, save it as a text file, and then use Python and third-party libraries to access it and process it with NLTK.
To access the text file we will use Python’s built-in open() function. For our example, let us create a file “note.txt”. The file needs to be in the current working directory, which is the directory IDLE is running from. We can use the code below to access the file.
textfile = open('note.txt')
Often we get a “file not found” error at this point. So first we need to confirm that our file was created in the correct location. We can use the Open command in IDLE’s File menu to display the files in the directory where IDLE is running, or we can list the files in that directory using the code below.
import os
os.listdir('.')
After running the above code we get a list of the files in the current directory, and we should see our text file in that list. This means open() will be able to locate our file. As the code below shows, open() also takes a second parameter: 'r' opens the file in reading mode, which is also the default.
textfile = open('note.txt','r')
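The location check described above can also be done in code before calling open(). The following is a minimal sketch, not the only way to do it: the helper name locate_file is hypothetical, and the example works in a temporary directory so that it does not touch any real files.

```python
import os
import tempfile

def locate_file(filename, directory="."):
    """Return the absolute path of filename inside directory, or None if it is missing."""
    candidate = os.path.join(directory, filename)
    if os.path.isfile(candidate):
        return os.path.abspath(candidate)
    return None

# Demonstration in a throwaway directory standing in for IDLE's working directory.
with tempfile.TemporaryDirectory() as workdir:
    note_path = os.path.join(workdir, "note.txt")
    with open(note_path, "w", encoding="utf-8") as handle:
        handle.write("This is a practice note text\n")

    print(os.listdir(workdir))                     # files present in that directory
    found = locate_file("note.txt", workdir)       # path of the file we just created
    missing = locate_file("absent.txt", workdir)   # None, because no such file exists
    print(found is not None, missing)
```

In our own session we would call locate_file('note.txt') with the default directory '.', and os.getcwd() tells us which directory that actually is.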
To get the contents of the file we call the read() method on the object returned by open(), as shown in the code.
textfile.read()
Output :
'This is a practice note text\nWelcome to the modern
generation.\nAs technology has advanced in the past few years so
have humans as we have progressed into a faster lifestyle.\nOne of
the most utilised product of technology is the computer.\nToday, a
computer can be found in your personal space at home or carried
around as a laptop or mobile to your office, school, library and
etc. so enjoy digitization and make use of it.\n'
We can see \n characters in the output; these mark the new lines. Instead, we can loop over the file to read it one line at a time, use the strip() method to remove the trailing newline (and any other surrounding whitespace) from each line, and print() each line to the console, as shown below.
f = open('note.txt', 'r')
for line in f:
    print(line.strip())
Output :
This is a practice note text
Welcome to the modern generation.
As technology has advanced in the past few years so have humans as we have progressed into a faster lifestyle.
One of the most utilised product of technology is the computer.
Today, a computer can be found in your personal space at home or carried around as a laptop or mobile to your office, school, library and etc. so enjoy digitization and make use of it.
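The open, loop and strip steps above can be wrapped in a small function. This is only a sketch under a few assumptions: the name read_lines is hypothetical, the file is assumed to be UTF-8 encoded, and a temporary file stands in for note.txt. It uses a with block, which closes the file for us, and rstrip('\n') instead of strip() so that only the newline is removed and any leading spaces in a line are kept.

```python
import os
import tempfile

def read_lines(path):
    """Read a text file and return its lines without the trailing newline characters."""
    try:
        # The with block closes the file automatically, even if an error occurs.
        with open(path, "r", encoding="utf-8") as handle:
            return [line.rstrip("\n") for line in handle]
    except FileNotFoundError:
        print(f"File not found: {path}")
        return []

# Demonstration with a temporary file standing in for note.txt.
with tempfile.TemporaryDirectory() as workdir:
    sample_path = os.path.join(workdir, "note.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("This is a practice note text\nWelcome to the modern generation.\n")

    lines = read_lines(sample_path)
    missing_lines = read_lines(os.path.join(workdir, "absent.txt"))

for line in lines:
    print(line)
```

Returning an empty list for a missing file is just one possible design choice; letting the FileNotFoundError propagate would be equally reasonable.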
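We can also locate a corpus file with NLTK's nltk.data.find() function. It looks for the given resource path inside the directories listed in nltk.data.path, so the folder that contains the path below must be on that list.
import nltk
filepath = nltk.data.find('Kabitakumar/Downloads/TextDocuments/Corpus.txt')
textfile = open(filepath, 'r').read()
textfile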
Output:
This is a sample text file.\nThis is a corpus file created for performing NLTK.\nWe can use it as a corpus in our code.
Now we will see how to take a web page in HTML format from any URL and save its text on our local system as a text file that we can use as a corpus.
from urllib import request
url = "https://www.bbc.co.uk/news/uk-53469839"
htmlcontent = request.urlopen(url).read().decode('utf8')
htmlcontent[:60]
Output:
'\n\n<!DOCTYPE html>\n<html lang="en-GB" id="responsive-news">\n<'
After getting the HTML content we can use BeautifulSoup, a Python library, to extract the text from the HTML.
from bs4 import BeautifulSoup
from nltk import word_tokenize
rawtext = BeautifulSoup(htmlcontent, 'html.parser').get_text()
As we can see, BeautifulSoup() takes two arguments. The first is the HTML content we got from our URL, and the second is the name of the parser to use; 'html.parser' is Python's built-in HTML parser. We then call the get_text() method to get the content as plain text, and pass that text to word_tokenize() to split it into tokens.
tokenlist = word_tokenize(rawtext)
print(len(tokenlist))
Output:
1907
After this, we can print the whole tokenlist with all 1907 elements, or we can slice it with a start and end index and print only that range in the console.
tokenlist = tokenlist[210:247]
tokenlist
Output:
['University', 'A', 'coronavirus', 'vaccine', 'developed', 'by', 'the', 'University', 'of', 'Oxford', 'appears', 'safe', 'and', 'triggers', 'an', 'immune', 'response', '.', 'Trials', 'involving', '1,077', 'people', 'showed', 'the', 'injection', 'led', 'to', 'them', 'making', 'antibodies', 'and', 'T-cells', 'that', 'can', 'fight', 'coronavirus', '.']
After getting the desired tokens we use NLTK's Text class to wrap the list of tokens in a Text object.
import nltk
textcontent = nltk.Text(tokenlist)
textcontent
Output:
<Text: University A coronavirus vaccine developed by the University...>
We can also find every occurrence of a word, together with its surrounding context, using the concordance() method. Here we look for matches of 'the'.
textcontent.concordance('the')
Output:
Displaying 2 of 2 matches:
the University of Oxford appears safe and
Trials involving 1,077 people showed the injection led to them making antibodi
We can also save modified contents to our corpus file. Opening it with the mode 'w+' lets us write to it and read it back; note that this mode empties the file if it already exists.
newfile = open('note.txt','w+')
Only after opening the file in a write mode can we write to the corpus with the write() method, as we see in the for loop below.
for i in range(10):
    # '\n' is enough here: in text mode Python converts it to the platform's line ending
    newfile.write("Inserting line No. %d\n" % (i+1))
After we are done with the changes we call the close() method, and our corpus file is updated with the new content.
newfile.close()
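Because a write mode empties an existing file, it helps to see it next to append mode 'a', which keeps the existing content and adds to the end. The sketch below is only an illustration: it works on a temporary file rather than our real note.txt, and it uses with blocks so that close() is called automatically.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "note.txt")

    # 'w' creates the file, or empties it first if it already exists.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("original line\n")

    # 'a' keeps the existing content and adds new text at the end.
    with open(path, "a", encoding="utf-8") as handle:
        for i in range(3):
            handle.write("Inserting line No. %d\n" % (i + 1))

    with open(path, "r", encoding="utf-8") as handle:
        appended = handle.read()

    # Reopening with 'w' discards everything written so far.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("replacement line\n")

    with open(path, "r", encoding="utf-8") as handle:
        overwritten = handle.read()

print(appended)
print(overwritten)
```

So if we want to add lines to a corpus without losing what is already in it, 'a' is the safer choice.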
Similarly, after using BeautifulSoup on the HTML content we get plain text, on which we can use the write() and close() methods to save our corpus as a text file on our local system. Files such as ASCII or HTML text are readable by humans, but there are other kinds of file formats such as PDF, MS Word and other binary formats. To access and read those files we can use third-party libraries like pypdf and pywin32, save the content as a text file, and then use it as a corpus file with NLTK.
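The saving step for text extracted from a web page can be sketched as follows. The function name save_corpus and the file name corpus.txt are hypothetical, and rawtext here is a short stand-in string, not real page content. The one detail worth noting is the explicit encoding='utf-8': web pages often contain non-ASCII characters, and the platform's default encoding may not be able to write them.

```python
import os
import tempfile

def save_corpus(text, path):
    """Write text to path as UTF-8 and return the number of characters written."""
    with open(path, "w", encoding="utf-8") as handle:
        return handle.write(text)

# Stand-in for the string returned by get_text(); it includes a non-ASCII character.
rawtext = "A sample sentence from a café.\nAnother sample sentence.\n"

with tempfile.TemporaryDirectory() as workdir:
    corpus_path = os.path.join(workdir, "corpus.txt")
    written = save_corpus(rawtext, corpus_path)

    # Read the file back with the same encoding to confirm nothing was lost.
    with open(corpus_path, "r", encoding="utf-8") as handle:
        reloaded = handle.read()

print(written, reloaded == rawtext)
```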
So now we know how to get HTML or text content into text format as a corpus, create a file for it on our local system, make changes if required, and save it. As we have reached the end of this blog, I am sure the questions we had at the beginning are all answered by now. In my next blog we will get to know about text processing at the lowest level, for strings. Thank you for going through my blog, and stay tuned: I will put up the link to my next blog soon. Until then, do keep learning. Cheers!!!
Referred links: http://nltk.org/, http://docs.python.org/