If you have a stack of websites or a pile of PDF files and you want answers to your questions from them, it can look like a daunting task.
What if we could do it in a few lines of code using BERT?
BERT is a pre-trained transformer-based model. Here we will be using bert-squad_1.1, which is BERT fine-tuned on SQuAD (the Stanford Question Answering Dataset), a standard benchmark for question-answering models. SQuAD consists of more than 100,000 questions based on Wikipedia snippets, and each question is annotated with the corresponding text span, i.e. the start and end of the answer are marked.
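To make the annotation concrete, here is a rough sketch of what a single SQuAD-style example looks like (the snippet and answer below are made up for illustration, not taken from the actual dataset):
# Illustrative SQuAD-style example (made up, not from the real dataset):
# each question comes with a context paragraph and the answer span inside it.
example = {
    "context": "The Amazon rainforest covers much of the Amazon basin of South America.",
    "question": "What does the Amazon rainforest cover?",
    "answers": [{"text": "much of the Amazon basin of South America", "answer_start": 29}],
}
# The model learns to predict the start and end of the answer span in the context.
assert example["context"][29:].startswith(example["answers"][0]["text"])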
There is also a lighter version of BERT called DistilBERT, which has about 40% fewer parameters.
It is faster, but can be slightly less accurate.
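If you want to try the lighter model, cdqa also ships a DistilBERT reader; here is a minimal sketch, assuming the model name 'distilbert-squad_1.1' and the file distilbert_qa.joblib (check the cdqa docs for the exact names):
# Hedged sketch: swap in the DistilBERT reader instead of BERT.
# Assumes cdqa provides a 'distilbert-squad_1.1' model that downloads as distilbert_qa.joblib;
# verify the exact names against the cdqa documentation.
from cdqa.utils.download import download_model
from cdqa.pipeline import QAPipeline

download_model(model='distilbert-squad_1.1', dir='./models')
cdqa_pipeline = QAPipeline(reader='./models/distilbert_qa.joblib', max_df=1.0)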
Now let's get started. First of all, decide on the kind of input (PDFs, websites, or anything else).
Here I am going to show you two ways.
1) The first step is to install cdqa.
The cdqa package includes converters that turn PDF files into searchable text inside pandas data frames.
cdqa (Closed Domain Question Answering) is an end-to-end, open-source software suite for question answering that combines classical IR methods with transfer learning using the pre-trained BERT model.
pip install cdqa
2) cdqa also provides QAPipeline, into which the documents will be fitted.
import os
import pandas as pd
from ast import literal_eval
from cdqa.utils.converters import pdf_converter #converts to pandas dataframe
from cdqa.pipeline import QAPipeline #Question Answer Pipeline
from cdqa.utils.download import download_model #to download the pre-trained model
3) download_model downloads the pre-trained model that we will use as the reader.
download_model(model='bert-squad_1.1', dir='./models')
4) Now make a directory into which all the PDFs will be downloaded.
!mkdir docs
5) Download all the files into this path:
!wget -P ./docs/ https://s2.q4cdn.com/299287126/files/doc_financials/2020/q3/AMZN-Q3-2020-Earnings-Release.pdf
!wget -P ./docs/ https://s2.q4cdn.com/299287126/files/doc_financials/2020/Q1/AMZN-Q1-2020-Earnings-Release.pdf
!wget -P ./docs/ https://s2.q4cdn.com/299287126/files/doc_financials/2020/q2/Q2-2020-Amazon-Earnings-Release.pdf
6) Use pdf_converter to convert the PDF text into a pandas data frame and take a look at the dataset.
df=pdf_converter(directory_path='./docs/')
df.head()
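To sanity-check the conversion (as far as I know, pdf_converter gives one row per PDF, with a title column and a paragraphs column holding the list of extracted paragraphs), you can peek at the data frame like this:
# Quick sanity check of the converted data frame.
# Assumes the 'title' and 'paragraphs' columns that pdf_converter is expected to produce.
print(df.shape)                      # number of PDFs and columns
print(df.columns.tolist())           # expected: ['title', 'paragraphs']
print(df['paragraphs'].apply(len))   # paragraphs extracted per PDF
print(df['paragraphs'].iloc[0][:2])  # first two paragraphs of the first PDF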
7) Create the pipeline and point the reader at the downloaded model.
cdqa_pipeline = QAPipeline(reader='./models/bert_qa.joblib', max_df=1.0)
8) Fit the retriever on the data frame we just created.
cdqa_pipeline.fit_retriever(df=df)
9) Now you can start asking questions, and the answers will be predicted.
query = 'How many full time employees are on Amazon roll?'
prediction = cdqa_pipeline.predict(query)
Output:
answer: 650,000
title: Q2-2020-Amazon-Earnings-Release
paragraph: Supporting Employees • Amazon’s top priority is providing for the health and safety of our employees and partners, and the company ----
10) To print this answer with nicer formatting, we can write:
print('query:{}'.format(query))
print('answer:{}'.format(prediction[0]))
print('title:{}'.format(prediction[1]))
print('paragraph:{}'.format(prediction[2]))
It will answer the question and even show which document the answer came from and the paragraph it was found in.
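You can also run several questions in a loop against the same fitted pipeline; the extra question below is just an illustrative example:
# Ask several questions against the same fitted pipeline.
# The second question is only an illustrative example.
queries = [
    'How many full time employees are on Amazon roll?',
    'What was the operating cash flow?',
]
for q in queries:
    prediction = cdqa_pipeline.predict(q)
    print('query:{}'.format(q))
    print('answer:{}'.format(prediction[0]))
    print('title:{}'.format(prediction[1]))
    print('-' * 40)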
For other kinds of data sources, you can also scrape data with BeautifulSoup and use that, or just download the data, convert it to CSV, and work with it (see the scraping sketch after the steps below). An example is as follows:
In step 2, add this line:
from cdqa.utils.filters import filter_paragraphs
In steps 4 and 5, do this:
from cdqa.utils.download import download_model, download_bnpp_data
download_bnpp_data(dir='./data/bnpp_newsroom_v1.1/')
In place of step 6, you can do the following:
df = pd.read_csv('./data/bnpp_newsroom_v1.1/bnpp_newsroom-v1.1.csv', converters={'paragraphs': literal_eval})
df = filter_paragraphs(df)
df.head()
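If your source is a set of web pages rather than a CSV, here is the scraping sketch mentioned above. It builds a data frame with the same title and paragraphs columns used earlier (my assumption about what fit_retriever expects), and the URLs are placeholders:
# Hedged sketch: scrape paragraphs from web pages into a cdqa-style data frame.
# The URLs are placeholders, and the 'title'/'paragraphs' column names are my assumption
# about what fit_retriever expects, based on the data frames used earlier.
import requests
import pandas as pd
from bs4 import BeautifulSoup

urls = ['https://example.com/page1', 'https://example.com/page2']  # placeholder URLs

rows = []
for url in urls:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.title.string if soup.title else url
    paragraphs = [p.get_text(strip=True) for p in soup.find_all('p') if p.get_text(strip=True)]
    rows.append({'title': title, 'paragraphs': paragraphs})

df = pd.DataFrame(rows)

# The rest of the pipeline is the same as before.
cdqa_pipeline.fit_retriever(df=df)
print(cdqa_pipeline.predict('Your question here?')[0])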
Conclusion
Remember, if you have a lot of data, do not forget to turn on your GPU. Also, BERT may take a bit of time to respond, although the quality of its answers is very good thanks to its long attention range and positional encodings, which you may have already read about in my blog post on transformers. We will talk more about efficient models for chatbots, and even talking chatbots, soon.
Thanks for reading!