
Page 1: Factoid based natural language question generation system

A ROBUST FACTOID BASED QUESTION GENERATION

SYSTEM

PRESENTED BY

ANIMESH SHAW
ARITRA DAS

SHREEPARNA SARKAR

Page 2: Factoid based natural language question generation system

CONTENTS
• Motivation
• Our Objective
• About Factoid Questions
• Basic Terminology
• Working Procedure
• Rule Base Generation
• Question Generation
• Evaluation
• Future Scope

Page 3: Factoid based natural language question generation system

MOTIVATION

Google Speech Recognition

Chatbots talking to each other, taken from the Cornell Creative Machines Lab

Page 4: Factoid based natural language question generation system

Google Translate, here translating English to Bengali.

Cleverbot, a chatbot with a good sense of humor. Taken from http://www.cleverbot.com/

CONTD.

Page 5: Factoid based natural language question generation system

OUR OBJECTIVE
• Build an efficient question generation system.
• Generate factoid questions from a text document or corpus.
• Generate questions from every sentence that carries some information; sentences that carry none are discarded.
• For some sentences more than one type of factoid question is possible, so attempt to generate all such possible types.
• Take the user's opinion or feedback and improve the results for further use.

Page 6: Factoid based natural language question generation system

FACTOID QUESTIONS?

Factoid questions: questions that demand a precise piece of information about an entity or an event, such as person names, locations or organizations, as opposed to definition questions, opinion questions, or complex questions such as why or how questions.

Page 7: Factoid based natural language question generation system

BASIC TERMINOLOGY
1. TOKENIZING: Breaking the string into words and punctuation marks.

e.g. - I went home last night. → [‘I’, ‘went’, ‘home’, ‘last’, ‘night’, ‘.’ ]

2. TAGGING: Assigning parts-of-speech tags to words. e.g. cat → noun → NN, eat → verb → VB

3. LEMMATIZING: Finding word lemmata (e.g. - was → be).

4. CHUNKING: Grouping words that convey a single unit of meaning and tagging those groups. The tags can be, for example, Verb Phrase, Prepositional Phrase or Noun Phrase.

e.g. → Bangladesh defeated India in 2007 World Cup

Page 8: Factoid based natural language question generation system

CONTD.

5. CHUNKS: 'Bangladesh', 'defeated', 'India', 'in', '2007 World Cup'

6. RELATION FINDING: Finding the relations between the chunks (sentence subject, object and predicate), as shown below. A short NLTK sketch of steps 1-5 follows the relation example. RELATIONS:

Bangladesh → NP-SBJ-1
defeated → VP-1
India → NP-OBJ-1
in → PP-TMP-1
2007 World Cup → NP-TMP-1
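A minimal sketch of terminology steps 1-5 in Python with NLTK (the toolkit cited in reference [12]); it assumes the punkt, averaged_perceptron_tagger and wordnet data packages are installed, and the regular-expression chunk grammar is purely illustrative, not the grammar the system actually uses. Relation finding (step 6) is not part of stock NLTK and is not shown.

    import nltk
    from nltk.stem import WordNetLemmatizer

    sentence = "Bangladesh defeated India in 2007 World Cup"

    # 1. TOKENIZING: break the string into words and punctuation marks.
    tokens = nltk.word_tokenize(sentence)

    # 2. TAGGING: assign a part-of-speech tag to every token, e.g. ('defeated', 'VBD').
    tagged = nltk.pos_tag(tokens)

    # 3. LEMMATIZING: reduce inflected forms to their lemma, e.g. defeated -> defeat.
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(w, pos='v') if t.startswith('VB') else lemmatizer.lemmatize(w)
              for w, t in tagged]

    # 4./5. CHUNKING: group tagged words into phrases with a toy grammar
    # (noun and prepositional chunks only, for illustration).
    grammar = r"""
      NP: {<DT|CD>?<JJ>*<NNP|NN|NNS>+}
      PP: {<IN>}
    """
    chunker = nltk.RegexpParser(grammar)
    print(chunker.parse(tagged))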

Page 9: Factoid based natural language question generation system

RELATION EXAMPLE

CONTD.

Page 10: Factoid based natural language question generation system

WORKING PROCEDURE
1. Took a large training set of wh-questions.
2. Broke each sentence into chunks and parsed it.
3. Found the relations.

The Sentence: “Who became the 16th president of the United States of America in 1861”

CHUNKING:
['Who', 'NP-SBJ-1']
['became', 'VP-1']
['the 16th president', 'NP-PRD-1']
['of', 'PP']
['United States', 'NP']
['of', 'PP']
['America', 'NP']
['in', 'PP']
['1861', 'NP']

Page 11: Factoid based natural language question generation system

Storing the tags in a List

['NP-SBJ-1', 'VP-1', 'NP-PRD-1', 'PP', 'NP', 'PP-1', 'NP-1']

Wh-type (head word): “who”

Storing the tags with the corresponding Wh-Type in a list

['Who', ['VP-1', 'NP-PRD-1', 'PP', 'NP', 'PP-1', 'NP-1']]

4. Determined the wh-type by observing the head word of the question.
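A minimal sketch of steps 1-4 of the working procedure, assuming the chunker output is already available as (chunk, relation) pairs; the names below are illustrative, not the system's actual code.

    # Illustrative chunker output for the training question above.
    chunked = [
        ['Who', 'NP-SBJ-1'], ['became', 'VP-1'], ['the 16th president', 'NP-PRD-1'],
        ['of', 'PP'], ['United States', 'NP'], ['of', 'PP'], ['America', 'NP'],
        ['in', 'PP'], ['1861', 'NP'],
    ]

    WH_WORDS = {'who', 'what', 'when', 'where', 'which', 'whom', 'whose'}

    def to_rule(chunks):
        """Map a chunked question to a (wh-type, relation sequence) pair."""
        head = chunks[0][0].lower()          # the wh-type is the head word of the question
        if head not in WH_WORDS:
            return None                      # not a wh-question: discard it
        return head.capitalize(), [rel for _, rel in chunks[1:]]

    print(to_rule(chunked))                  # e.g. ('Who', ['VP-1', 'NP-PRD-1', 'PP', ...])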

CONTD.

Page 12: Factoid based natural language question generation system

RULE BASE GENERATION
The Parent Tree:

This tree is fed to the system before the training is done. When the system reads a question it determines the Wh-type and traverses to that specific node and starts populating the tree.

Page 13: Factoid based natural language question generation system

POPULATING THE RULE-TREE
Travelled to the specific wh-node and stored these relations by populating the subsequent nodes of the tree with these chunk relations.
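A minimal sketch of one way the rule tree could be populated, assuming every training question has already been reduced to a (wh-type, relation sequence) pair as above; the nested-dictionary trie and the '#count' key are illustrative choices, not the system's actual data structure.

    from collections import defaultdict

    def make_node():
        # Every node maps a chunk relation to a child node; tail nodes also hold a count.
        return defaultdict(make_node)

    rule_tree = make_node()                  # the parent tree: one subtree per wh-type

    def add_question(wh_type, relations):
        """Traverse to the wh-node and populate the path of chunk relations."""
        node = rule_tree[wh_type]
        for rel in relations:
            node = node[rel]
        node['#count'] = node.get('#count', 0) + 1   # occurrence count kept on the tail node

    add_question('Who', ['VP-1', 'NP-PRD-1', 'PP', 'NP'])
    add_question('Who', ['VP-1', 'NP-OBJ-1'])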

Page 14: Factoid based natural language question generation system

NORMALIZED COUNT
This is used to let the parser know whether or not to print a question while backtracking to other child nodes. It is defined as:

Normalized count = (occurrences of the tail node of a question in the training set) / (total number of questions with that particular wh-tag)

The Count is attached to the tail node only.

Example: ‘Who doesn’t want to rule the world?’
Nodes: NP-SBJ-1 → VP-1 (VBZ-VB-TO-VB) → NP-OBJ-1, count = 14

Here, this question structure appears 14 times in the training set. The tail holds the count as an integer, but when the recursive descent parser parses the question base it normalizes the value. This also lets the system offer the user the more probable question among several candidates.
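A small sketch of the normalization step under the same illustrative trie layout as above: every raw tail count is divided by the total number of training questions seen for that wh-tag.

    def collect_tails(node):
        """Return every node in the subtree that carries a '#count' value."""
        tails = [node] if '#count' in node else []
        for key, child in node.items():
            if key != '#count':
                tails.extend(collect_tails(child))
        return tails

    def normalize(tree):
        """Divide every tail count by the number of training questions of that wh-tag."""
        for wh_type, subtree in tree.items():
            tails = collect_tails(subtree)
            total = sum(t['#count'] for t in tails)
            for t in tails:
                t['#count'] = t['#count'] / total

    normalize(rule_tree)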

Page 15: Factoid based natural language question generation system

RULE-TREE WITH NORMALIZED COUNT

While populating, the count of visiting each tail node (the node that holds the last chunk relation) is saved in the corresponding node.

A snapshot of the rule base with count value:

Page 16: Factoid based natural language question generation system

ANSWER PREPROCESSING AND QUESTION TYPE DECISION SYSTEM

• While populating the tree with manually generated questions, the NER tag of the answer for a given question is stored with the corresponding wh-tag.
• Only some word(s) are stored.

Example :

“Who is the Father of the Nation?” Ans: Mahatma Gandhi.

‘Mahatma Gandhi’ on NER tagging:

Mahatma [PERSON]

Gandhi [PERSON]
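A minimal sketch of the answer-side NER step using NLTK's built-in named-entity chunker; it assumes the maxent_ne_chunker and words data packages are installed, and the system's actual NER tagger may differ.

    import nltk

    answer = "Mahatma Gandhi"
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(answer)))

    # Collect one NER label per answer word, e.g. Mahatma -> PERSON, Gandhi -> PERSON.
    labels = []
    for node in tree:
        if isinstance(node, nltk.Tree):          # a recognised named-entity subtree
            labels.extend((word, node.label()) for word, _ in node.leaves())
        else:                                    # a plain (word, pos) pair, no entity found
            labels.append((node[0], 'O'))
    print(labels)                                # e.g. [('Mahatma', 'PERSON'), ('Gandhi', 'PERSON')]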

Page 17: Factoid based natural language question generation system

ANSWER BASE
When the same tag is found in the answers again and again, the count value is increased accordingly.

The Answer Base

Vocabulary = 4

Page 18: Factoid based natural language question generation system

Generate Questions from Sentences

Page 19: Factoid based natural language question generation system

PRIORITIZING THE QUESTIONS
It is possible that there is more than one question on a single path from the root to a leaf. The system prioritizes the questions by their count-depth product:

Priority = Normalized count * depth of tail node

Example:

Question: Who is Mahatma Gandhi?          Priority: (14/747) * 3 = 0.056
Question: Who is the father of nation?    Priority: (21/747) * 5 = 0.14

So the second question is the one more likely to be asked for the given sentence, although this depends on the questions in the training set.
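A tiny sketch of the count-depth product, with the numbers taken straight from the example above; the function name is illustrative.

    def priority(tail_count, questions_for_wh, depth):
        """Priority = normalized count * depth of the tail node."""
        return (tail_count / questions_for_wh) * depth

    print(priority(14, 747, 3))   # Who is Mahatma Gandhi?       -> about 0.056
    print(priority(21, 747, 5))   # Who is the father of nation? -> about 0.14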

Page 20: Factoid based natural language question generation system

SELECTION OF QUESTION

The probability of each probable question type is calculated using the following function:

F(sentence) = Max(Probability(Words/Wh-tag))

The wh-tag with the maximum probability is taken into consideration and that type of question is generated.

Page 21: Factoid based natural language question generation system

EXAMPLE: TRIGGERING WH-TYPE QUESTIONS

'When' examples (phrase → tag): 2011 → DATE, 9:30 PM → TIME, In 2012 → IN, 10th Oct → DATE, In summer → IN
Tag counts for 'when': DATE = 1, TIME = 1, IN = 2

'Who' examples (phrase → tag): Grace Badell → PERSON, John Whiks → PERSON, General Mccllen → PERSON
Tag counts for 'who': PERSON = 3

Sentence: “Sourav was captain of India in the 2003 world cup.”

Chunks: ‘Sourav’, ‘India’, ‘in the 2003’
Tags: ‘PERSON’, ‘LOCATION’, ‘IN’

'Where' examples (phrase → tag): Asia → LOCATION, Plymouth → LOCATION, In the sea → IN, Pacific Ocean → LOCATION, In her eyes → IN
Tag counts for 'where': LOCATION = 3, IN = 2

Page 22: Factoid based natural language question generation system

Probability(Sourav, India, in the 2003/when)
= Prob(when) * Prob(PERSON/when) * Prob(LOCATION/when) * Prob(IN/when)
= (4/13) * (1/(4+3)) * (1/(4+3)) * (2/(4+3))
= 0.30 * 0.14 * 0.14 * 0.28 = 0.0016

Probability(Sourav, India, in the 2003/where)
= Prob(where) * Prob(PERSON/where) * Prob(LOCATION/where) * Prob(IN/where)
= (6/13) * (1/(5+2)) * (3/(5+2)) * (2/(5+2))
= 0.46 * 0.1 * 0.3 * 0.2 = 0.0027

Probability(Sourav, India, in the 2003/who)
= Prob(who) * Prob(PERSON/who) * Prob(LOCATION/who) * Prob(IN/who)
= (3/13) * (3/(3+3)) * (1/(3+3)) * (1/(3+3))
= 0.23 * 0.5 * 0.16 * 0.16 = 0.0029

The 'who' probability is the highest, so the system will generate the ‘Who’ type question.
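A minimal sketch of the selection step, with the tag counts taken from the tables above. The handling of unseen tags (a floor of one in the numerator and the table's vocabulary added to the denominator) is an assumption chosen to reproduce the slide's when and where fractions; the slide's own denominator for the who case differs slightly, but the same wh-type still wins.

    # Tag counts gathered during training, taken from the tables above.
    wh_question_counts = {'when': 4, 'where': 6, 'who': 3}      # training questions per wh-tag
    tag_counts = {
        'when':  {'DATE': 1, 'TIME': 1, 'IN': 2},
        'where': {'LOCATION': 3, 'IN': 2},
        'who':   {'PERSON': 3},
    }

    def score(sentence_tags, wh):
        """P(wh) times the product of P(tag | wh) over the sentence's chunk tags."""
        counts = tag_counts[wh]
        denom = sum(counts.values()) + len(counts)              # total count + vocabulary
        p = wh_question_counts[wh] / sum(wh_question_counts.values())
        for tag in sentence_tags:
            p *= max(counts.get(tag, 0), 1) / denom             # unseen tags get a floor of 1
        return p

    tags = ['PERSON', 'LOCATION', 'IN']                         # tags of the Sourav sentence
    best = max(tag_counts, key=lambda wh: score(tags, wh))
    print(best)                                                 # 'who' wins for this example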

CONTD.

Page 23: Factoid based natural language question generation system

After the training is done, the system generates questions from sentences by traversing the question base with the values of the nodes.

Example: “Mahatma Gandhi is the Father of Nation.”

Suppose it tries to generate ‘Who’ question from this, then the steps would be :

Sentence parsing:

Mahatma Gandhi is the Father of Nation.
Chunks: NP-SBJ-1  VP-1  NP-PRD-1  PP  NP
Tags: NNP-NNP  VBZ  DT-NN  IN  NN

The chunks and their corresponding relations are put into a table where the keys are the relations and the values are the chunk phrases.

CONTD.

Page 24: Factoid based natural language question generation system

Question Generation:

These relation and tag pairs are searched for in the question base by a recursive descent parser. If a path is found with these nodes, the corresponding chunks are appended one after another and the question is generated.

“Who is the father of nation?”

The Chunk Table:
NP-SBJ-1 → 'Mahatma Gandhi'
VP-1 → 'is'
NP-PRD-1 → 'the Father'
PP → 'of'
NP → 'Nation'
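A minimal sketch of the generation step, reusing the illustrative rule tree built earlier and the chunk table above; the depth-first walk below stands in for the recursive descent parser. A real sentence can contain several chunks with the same relation, which this simplified lookup ignores.

    def generate(node, chunk_table, wh_word, prefix=()):
        """Walk a wh-subtree depth-first; whenever a tail node is reached and every
        relation on the path has a chunk in the sentence, emit a question."""
        questions = []
        if '#count' in node:
            questions.append((wh_word + ' ' + ' '.join(prefix) + '?', node['#count']))
        for rel, child in node.items():
            if rel != '#count' and rel in chunk_table:
                questions += generate(child, chunk_table, wh_word, prefix + (chunk_table[rel],))
        return questions

    # Chunk table for "Mahatma Gandhi is the Father of Nation."
    chunk_table = {'NP-SBJ-1': 'Mahatma Gandhi', 'VP-1': 'is',
                   'NP-PRD-1': 'the father', 'PP': 'of', 'NP': 'nation'}

    for question, count in generate(rule_tree['Who'], chunk_table, 'Who'):
        print(question, count)          # e.g. "Who is the father of nation?" with its count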

CONTD.

Page 25: Factoid based natural language question generation system

THE FEEDBACK SYSTEM
• Takes the user feedback on the generated questions.
• Updates the count values.
• Updates the question base accordingly.
• Reduces the generation of false positives.
• Enhances the probability of generating quality questions.
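A small sketch of how the feedback could adjust the rule base, assuming each generated question remembers the tail node that produced it and that feedback is applied to the raw counts before re-normalization; the slide only states that count values are updated, so the exact scheme below is an assumption.

    def apply_feedback(tail_node, accepted, step=1):
        """Raise the tail count for questions the user accepts and lower it for
        rejected ones, so false positives gradually fade out of the question base."""
        if accepted:
            tail_node['#count'] += step
        else:
            tail_node['#count'] = max(tail_node['#count'] - step, 0)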


Page 26: Factoid based natural language question generation system

EVALUATION

We tested the system on a given test dataset and obtained the following results:

              Manual Generation (%)    System Generation (%)
Precision             100                     58.82
Recall                100                     91

Precision: % of selected items that are correct.
Recall: % of correct items that are selected.

Page 27: Factoid based natural language question generation system

SCOPE IN FUTURE
Question generation is an important function of advanced learning technologies and related research areas such as:

• Intelligent tutoring systems
• Inquiry-based environments
• Game-based learning environments
• Psycholinguistics
• Discourse and dialogue
• Natural language generation
• Natural language understanding
• Academic purposes: creating practice and assessment materials

Page 28: Factoid based natural language question generation system

REFERENCES
[1] Liu, Ming, Rafael A. Calvo, and Vasile Rus. "G-Asks: An intelligent automatic question generation system for academic writing support." Dialogue & Discourse 3.2 (2012): 101-124.

[2] Chen, Wei, and Jack Mostow. "Using Automatic Question Generation to Evaluate Questions Generated by Children." The 2011 AAAI Fall Symposium on Question Generation. 2011.

[3] Radev, Dragomir, et al. "Probabilistic question answering on the web." Journal of the American Society for Information Science and Technology 56.6 (2005): 571-583.

[4] Roussinov, Dmitri, and Jose Robles. "Web question answering through automatically learned patterns." Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries. ACM, 2004.

[5] Agarwal, Manish, and Prashanth Mannem. "Automatic gap-fill question generation from text books." Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, 2011.

[6] Skalban, Yvonne, et al. "Automatic Question Generation in Multimedia-Based Learning." COLING (Posters). 2012.

[7] Becker, Lee, Rodney D. Nielsen, and W. Ward. "What a pilot study says about running a question generation challenge." Proceedings of the Second Workshop on Question Generation, Brighton, England, July. 2009.

Page 29: Factoid based natural language question generation system

[8] Xu, Yushi, Anna Goldie, and Stephanie Seneff. "Automatic question generation and answer judging: a q&a game for language learning." SLaTE. 2009.

[9] Rus, Vasile, and C. Graesser Arthur. "The question generation shared task and evaluation challenge." The University of Memphis. National Science Foundation. 2009.

[10] Lin, Chin-Yew. "Automatic question generation from queries." Workshop on the Question Generation Shared Task. 2008.

[11] Ali, Husam, Yllias Chali, and Sadid A. Hasan. "Automation of question generation from sentences." Proceedings of QG2010: The Third Workshop on Question Generation. 2010.

[12] Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media, Inc., 2009.

CONTD.
