
A Machine Learning Approach to Question Answering on Unstructured Sinhala Documents

Aloka Fernando

(Mphil/PT/2017/062)

Supervisor:

Dr. A. R. Weerasinghe

Outline

● Background
● Work Overview
● Research Gap
● Research Questions
● Research Boundary
● Methodology
● Evaluation
● Research Schedule
● Research Contributions
● References

Background

conventional search results vs answer extracted result


● Information retrieval (IR) systems list the documents that might contain information related to an information-seeking question, but leave it to the user to locate the needed information in the results.

(Dwivedi and Singh 2013)

● Question Answering (QA) systems have emerged to automatically produce the exact answer or answer phrase to a question asked by humans in natural language, using either a knowledge representation or a collection of natural language documents.

(Kodra and Kajo 2017)

Work Overview

Question answering systems, classified by approach:

1) Rule-based approach

2) Statistical approach

3) Machine learning approach

4) Deep learning approach

Work Overview

1. Rule-based approach

● Uses linguistic rules that rely on lexical and semantic clues to answer questions.

● Syntactic analysis can be improved with POS tagging, tokenization, parsing and lemmatization.

Research: BASEBALL (Green et al. 1961)
Approach: Pattern matching for query generation. Interfaces with a structured database.
Limitations: Answers limited to dates and locations.

Research: LUNAR (Woods 1973)
Approach: Syntactic processing and pattern matching towards query representation. The knowledge source is a structured database.
Limitations: Hand-crafted rules are inadequate to cover the variation in question expressions.

Research: QUARC (Riloff and Thelen 2000)
Approach: Rules relying on lexical and semantic clues. The data source is unstructured text documents, and answer extraction is based on keyword matching.

Work Overview

2. Statistical approach

Data-driven, and can handle the diversity that the data possesses.

Statistical Approaches to Answer Finding (Berger et al. 2000)

● Uses a collection of answered questions and characterizes the relation between question and answer with a statistical model.

● Linguistic processing can improve the correct identification of relations between questions and answers.

IBM's statistical QA (Murdock and Tesauro 2016)

● Basic NLP analysis of the question (parsing, etc.) is done first; statistical analysis is then performed at different processing stages towards providing the answer.

● The correct answer might be ranked second or third, so the scoring algorithm is critical to returning the correct answer.

Work Overview

3. Machine learning approach

● Rule-based linguistic features can be learnt by the learning models.

● Machine learning has been used at different stages in the QA process.

Question Classification using SVM (Zhang et al. 2003)

● An SVM classifier is used for question type classification. The answer type is derived from the question type.

● Semantic knowledge of the question can be improved with a WordNet.

NE Recognition for QA (Molla, van Zaanen, and Smith 2006)

● An entropy-based classifier is used to extract named entities (NEs) from the text documents.

● Relies on regular-expression-based feature extraction and gazetteers to identify common NEs.

Work Overview

4. Deep learning approach

Does not require human-dependent feature extraction, parsing, or an external knowledge resource such as a WordNet. This is the approach of state-of-the-art QAS.

Research: LSTM Model for Non-factoid Answer Selection (Tan et al. 2015)
Approach: Builds embeddings of questions and answers with bidirectional long short-term memory (biLSTM) models and measures their closeness by cosine similarity.

Research: Dynamic Memory Networks for Question Answering (Raghuvanshi and Chase 2016)
Approach: DMNs process input sequences, form episodic memories and produce appropriate answers.

Research: Gated Self-Matching Networks for Reading Comprehension and QA (W. Wang et al. 2017)
Approach: A stacked bidirectional long short-term memory (BLSTM) network that sequentially reads words from question and answer sentences for answer selection.
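The cosine-similarity step used for answer selection can be sketched as follows. This is a minimal illustration, assuming the question and each candidate answer have already been encoded into fixed-length vectors; the toy vectors and the function names `cosine_similarity` and `rank_answers` are hypothetical and are not taken from Tan et al. (2015).

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def rank_answers(
    question_vec: Sequence[float],
    answer_vecs: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Sort candidate answers by similarity to the question, best first."""
    scored = [
        (name, cosine_similarity(question_vec, vec))
        for name, vec in answer_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for learned embeddings (assumed values).
question = [1.0, 0.0, 1.0]
candidates = {"answer_a": [1.0, 0.0, 0.9], "answer_b": [0.0, 1.0, 0.0]}
ranking = rank_answers(question, candidates)
```

In a trained model the vectors would come from the encoder, and only the ranking step would look like this.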

Work Overview

Another classification of QAS

– Knowledge representation based question answering

Attempts at question answering have focused on storing knowledge in the form of production rules, logic, frames, templates (represented with triple relations), ontologies, semantic networks and knowledge graphs. (Dwivedi and Singh 2013)

– Question answering without knowledge representation

Work Overview

QAS for non-Latin languages

Urdu (Thaker and Goel 2015)

● Presents an ontological approach to finding the answer to the user's question.

● Information is retrieved using the ontology and some additional knowledge in the form of adjacent keywords.

Hindi (Nanda, Dua, and Singla 2016)

● Machine learning: the question type is determined based on features.

● An IR technique is used to retrieve documents.

Arabic (Lahbari, Alaoui, and Zidani 2018)

● Machine learning: question classification is based on ML.

● Language tools are used for tokenization and POS tagging, and WordNet for query expansion with synonyms.

● IR is done by integrating Google as the search engine.

Work Overview

QAS for Sinhala

Mahoshadha (Jayakody et al. 2016)

1) A classifier categorizes the summarized documents into predefined categories, with the objective of narrowing down the IR to documents in the same category.

2) Documents are organized by categorizing them into previously known categories based on their content, using k-Nearest Neighbor (k-NN) classification.

3) A question processing module determines the answer type using rule-based and pattern matching techniques.

4) An answer processing module first performs information retrieval based on an n-gram similarity matching technique, then extracts candidate answers from the passage using the keywords and question type, and finally selects the correct answer based on a distance-measure algorithm.
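The n-gram similarity matching used for retrieval in step 4 could look roughly like the sketch below. The exact measure used by Mahoshadha is not stated here, so Jaccard overlap of character trigrams is assumed as a stand-in; the function names and the toy English passages are hypothetical. For Sinhala text, note that Python strings are indexed by code point, so n-grams may split a consonant from its vowel sign.

```python
from typing import List, Set, Tuple


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Set of character n-grams of a whitespace-normalized string."""
    normalized = " ".join(text.split())
    if len(normalized) < n:
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of the two n-gram sets, a value in [0, 1]."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def retrieve(question: str, passages: List[str], top_k: int = 1) -> List[Tuple[float, str]]:
    """Return the top_k passages most similar to the question, best first."""
    scored = [(ngram_similarity(question, p), p) for p in passages]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Toy passages (assumed data, English for readability).
passages = [
    "Colombo is the commercial capital of Sri Lanka",
    "Cricket is a popular sport",
]
best = retrieve("what is the commercial capital of Sri Lanka", passages)
```

Candidate answer extraction and the distance-based final selection would then operate on the retrieved passages only.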

Work Overview

Related work based on question type

Question type: Factoid [what, when, which, who]
Summary: These are simple, fact-based questions that require an answer in a single short phrase or sentence. E.g.: who is the producer of movie XYZ? (Kolomiyets and Moens 2011)

Question type: List type questions
Summary: List questions require a list of entities or facts as the answer. E.g.: what are the states in the USA? (Indurkhya and Damerau 2010)

Question type: Hypothetical type questions
Summary: Hypothetical questions ask for information related to a hypothetical event. They generally begin with ‘what would happen if’. The answer is subjective, as there is no right or wrong answer. (Kolomiyets and Moens 2011)

Work Overview

Related work based on question type (ctd)

Question category: Causal questions [how or why]
Summary: Causal questions require explanations about an entity. (Higashinaka and Isozaki 2008)

Question category: Confirmation questions
Summary: Confirmation questions need an answer in the form of yes or no. (Mishra and Jain 2016)
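The question types listed above can be told apart, to a first approximation, by the leading cue words. The sketch below is only an illustration on English cues: the hypothetical, causal and factoid cues are the ones given in the slides, while the list and confirmation cues and the name `classify_question` are assumptions. A Sinhala system would need its own cue words and a more careful treatment (for example, "how many" is factoid but is matched as causal here).

```python
from typing import Tuple

# Cue words per question type, checked in order. The hypothetical, causal and
# factoid cues come from the slides; the list and confirmation cues are assumed.
CUES: Tuple[Tuple[str, Tuple[str, ...]], ...] = (
    ("hypothetical", ("what would happen if",)),
    ("causal", ("how", "why")),
    ("list", ("what are", "which are", "list")),
    ("confirmation", ("is", "are", "was", "were", "do", "does", "did", "can")),
    ("factoid", ("what", "when", "which", "who")),
)


def classify_question(question: str) -> str:
    """Return the first question type whose cue starts the question."""
    text = " ".join(question.lower().strip(" ?").split())
    for label, cues in CUES:
        for cue in cues:
            if text == cue or text.startswith(cue + " "):
                return label
    return "unknown"


examples = [
    "Who is the producer of movie XYZ?",
    "What are the states in USA?",
    "What would happen if it never rained?",
    "Why is the sky blue?",
    "Is Colombo in Sri Lanka?",
]
labels = [classify_question(q) for q in examples]
```

The order of the entries matters: more specific cues ("what would happen if", "what are") are checked before the generic factoid cue "what".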

Work Overview

Related work based on question type (ctd)

● Of 129 question answering systems reviewed, 86% focused on factoid question answering and 14% on non-factoid question answering. (Kodra and Kajo 2017)

● Question answering for the Sinhala language is at an early stage, hence the author will focus on question answering for factoid type questions.

Research Gap

● QA for Sinhala is limited to rule-based and machine learning approaches. This is not in line with the state-of-the-art techniques.

● Further, the existing research does not make use of linguistic features and is weak in terms of natural language understanding.

● Open-domain question answering has been given less emphasis with respect to QAS in Sinhala.

Therefore, the research gap is to produce an open-domain question answering system using state-of-the-art techniques that address the linguistic features of the Sinhala language.

Research Questions

● How can open-domain question answering be performed on unstructured Sinhala documents?

● How can a data set of factoid type questions be built?

● What kind of machine learning model can most effectively answer factoid questions?

● To what extent can the resulting model answer factoid questions that are not in the collected data set?

● How can features of the Sinhala language be used to achieve better performance?

Objectives of the Research

The research will be carried out to fulfil the following objectives.

● To explore and identify the best machine learning model for answering factoid questions on unstructured Sinhala documents.

● To design and construct a representative, general-purpose factoid question and answer data set.

● To evaluate the effectiveness of the model on other data and other domains of text.

Research Boundary

In scope

● Only unstructured text documents in the Sinhala language will be considered as the knowledge source.

● The Sinhala Wikipedia will be used as the sampling frame for extracting text and constructing the question and answer pairs.

● A crowd-sourced data set will be created for evaluating the question answering model.

Research Boundary

Out of scope

● The research is limited to Sinhala language text documents. Code-mixed content is out of scope.

● Information other than textual content, such as audio, video or image content, will not be used to provide the answer.

● The questions will be one-off questions, and will not depend on the question history or the context set by a previous question.

Methodology

1) An empirical study will be carried out in the research.

2) A quantitative approach will be followed in evaluating the results.

3) A data set will be designed and created for the question answering task.

Data sets such as the Facebook bAbI data set (Weston et al. 2015), SQuAD 2.0 (Rajpurkar, Jia, and Liang 2018) and CoQA (Reddy, Chen, and Manning 2018) have led to progress in question answering research for English, enabling the adoption of state-of-the-art techniques.

No such data set is available for Sinhala. Hence a data set will be created using Sinhala Wikipedia pages.
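One possible shape for a record in such a data set is sketched below. It assumes a SQuAD-style layout, where each answer is a span of the source passage identified by a character offset; the class `QAExample`, its field names and the helper functions are hypothetical and are not part of the proposal. The example passage is English only for readability.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QAExample:
    """One question-answer pair tied to a source passage (hypothetical schema)."""
    context: str        # passage text, e.g. a paragraph from a Wikipedia page
    question: str
    answer_text: str
    answer_start: int   # character (code point) offset of the answer in context

    def is_consistent(self) -> bool:
        """True when the stored offset really points at the answer text."""
        end = self.answer_start + len(self.answer_text)
        return (
            self.answer_start >= 0
            and len(self.answer_text) > 0
            and self.context[self.answer_start:end] == self.answer_text
        )


def make_example(context: str, question: str, answer_text: str) -> QAExample:
    """Build an example, locating the first occurrence of the answer in the context."""
    start = context.find(answer_text)
    if start < 0:
        raise ValueError("answer text does not occur in the context")
    return QAExample(context, question, answer_text, start)


def keep_consistent(examples: List[QAExample]) -> List[QAExample]:
    """Drop crowd-sourced examples whose answer span does not match the passage."""
    return [ex for ex in examples if ex.is_consistent()]


example = make_example(
    context="Sigiriya is an ancient rock fortress in Sri Lanka.",
    question="Where is Sigiriya?",
    answer_text="Sri Lanka",
)
```

A consistency check of this kind is a cheap way to filter crowd-sourced pairs before they are used for training or evaluation.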

Methodology

4) Question answering system architecture

Three-module architecture (Allam and Haggag 2012)

A conventional question answering system consists mainly of question processing, document processing and answer processing modules.

Methodology

● Question processing module: identifies the focus of the question, classifies the question into its question class and derives the expected answer type.

● Document processing module: retrieves the documents related to the expanded search query. These are further processed at passage level, by adopting a ranking mechanism based on the keywords of the question.

● Answer processing module: produces the correct answer or answer phrase to the natural language question.
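The three modules above can be wired together roughly as follows. This is a toy sketch under stated assumptions: the stop-word list, the question-word to answer-type mapping, the keyword-overlap ranking and all function names are hypothetical, and the answer processing step is only a stub that returns the best passage rather than extracting an answer phrase.

```python
from typing import List, Set, Tuple

# Assumed toy stop-word list; a real system would use a Sinhala one.
STOP_WORDS = {"the", "is", "of", "a", "an", "in", "who", "what", "when", "where", "which"}

# Assumed mapping from question word to expected answer type.
ANSWER_TYPES = {"who": "PERSON", "when": "DATE", "where": "LOCATION"}


def process_question(question: str) -> Tuple[Set[str], str]:
    """Question processing: extract keywords and the expected answer type."""
    tokens = question.lower().strip("?").split()
    answer_type = ANSWER_TYPES.get(tokens[0], "OTHER") if tokens else "OTHER"
    keywords = {t for t in tokens if t not in STOP_WORDS}
    return keywords, answer_type


def rank_passages(keywords: Set[str], passages: List[str]) -> List[Tuple[int, str]]:
    """Document processing: rank passages by how many question keywords they contain."""
    scored = []
    for passage in passages:
        words = set(passage.lower().replace(".", " ").split())
        scored.append((len(keywords & words), passage))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def answer(question: str, passages: List[str]) -> Tuple[str, str]:
    """Answer processing (stub): return the best passage and the expected answer type."""
    keywords, answer_type = process_question(question)
    ranked = rank_passages(keywords, passages)
    best_passage = ranked[0][1] if ranked and ranked[0][0] > 0 else ""
    return best_passage, answer_type


passages = ["The film XYZ was produced by Jane Doe.", "The river flows north."]
result = answer("Who is the producer of film XYZ?", passages)
```

A full answer processing module would go one step further and pick, within the best passage, a phrase whose type matches the expected answer type.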

Methodology

QAS architecture for DL-based models (Weissenborn, Wiese, and Seiffe 2017)

● The three modules of a QAS are replaced by an end-to-end deep learning model, where question processing and answer extraction are done at different layers.

● Embedder: maps question and text document tokens to their pre-trained embeddings.

● Encoder: the embedded tokens are further encoded, for example with a (bi-directional) recurrent neural network (RNN).

Methodology

QAS architecture for DL-based models (Weissenborn, Wiese, and Seiffe 2017) (ctd)

● Interaction layer: focuses on the interaction between question and context. Attention mechanisms are used in this layer.

● Answer layer: predicts the start and the end of the answer span based on a score.
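How an answer layer might turn per-token scores into a span can be sketched as below. It assumes the model has already produced one start score and one end score per context token and that a span is scored by their sum; this additive scoring, the `max_len` limit, the toy scores and the name `best_span` are assumptions for illustration, not the exact procedure of Weissenborn, Wiese, and Seiffe (2017).

```python
from typing import List, Tuple


def best_span(
    start_scores: List[float],
    end_scores: List[float],
    max_len: int = 10,
) -> Tuple[int, int, float]:
    """Return (start, end, score) maximizing start_scores[s] + end_scores[e]
    subject to s <= e < s + max_len. Indices are token positions, end inclusive."""
    if len(start_scores) != len(end_scores) or not start_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Toy scores for a seven-token context (assumed values).
tokens = ["the", "film", "was", "produced", "by", "jane", "doe"]
start_scores = [0.1, 0.0, 0.2, 0.1, 0.3, 2.5, 0.4]
end_scores = [0.0, 0.1, 0.1, 0.2, 0.1, 0.9, 2.8]
s, e, score = best_span(start_scores, end_scores)
answer_tokens = tokens[s:e + 1]
```

Searching only over pairs with start before end prevents the model from returning an inverted span even when the highest start and end scores point that way.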

Evaluation

● Precision and recall have been the common measurements of a question answering system. (Calijorne Soares and Parreiras 2018)

● The F-measure is the harmonic mean of precision and recall. (Allam and Haggag 2012)
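These measures can be computed as in the sketch below. It assumes a token-overlap formulation, in which the predicted answer and the gold answer are compared as bags of tokens; the choice of token-level comparison and the name `precision_recall_f1` are assumptions, since the slides do not fix the unit of comparison.

```python
from collections import Counter
from typing import List, Tuple


def precision_recall_f1(predicted: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Token-overlap precision, recall and F-measure (harmonic mean of the two)."""
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Predicted answer has one extra token compared with the gold answer.
p, r, f = precision_recall_f1(["jane", "doe", "films"], ["jane", "doe"])
```

Here two of the three predicted tokens are correct and both gold tokens are found, so precision is 2/3, recall is 1 and the F-measure is 0.8.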

Research Schedule

Research Contributions

1. Research on question answering for the Sinhala language has been attempted only once, so the field is still in its infancy. This research will adopt state-of-the-art techniques for question answering.

2. Since conversational question answering is a well-researched domain, this work would also be a basis for extending the conventional question answering domain for the Sinhala language.

3. There is no publicly available data set for question answering in Sinhala. The data set created will therefore also contribute to the progress of question answering for the Sinhala language.

4. Provides a basis for designing QA systems for other Indic languages, including Tamil.

References

Allam, Ali Mohamed Nabil, and Mohamed Hassan Haggag. 2012. “The Question Answering Systems: A Survey.” 2 (3): 12.

Auer, Sören, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. “DBpedia: A Nucleus for a Web of Open Data.” In The Semantic Web, edited by Karl Aberer, Key-Sun Choi, Natasha Noy, Dean Allemang, Kyung-Il Lee, Lyndon Nixon, Jennifer Golbeck, et al., 4825:722–35. Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-76298-0_52.

Bian, Weijie, Si Li, Zhao Yang, Guang Chen, and Zhiqing Lin. 2017. “A Compare-Aggregate Model with Dynamic-Clip Attention for Answer Selection.” In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management - CIKM ’17, 1987–90. Singapore, Singapore: ACM Press. https://doi.org/10.1145/3132847.3133089.

Bollacker, Kurt, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. “Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge.” In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data - SIGMOD ’08, 1247. Vancouver, Canada: ACM Press. https://doi.org/10.1145/1376616.1376746.

Calijorne Soares, Marco Antonio, and Fernando Silva Parreiras. 2018. “A Literature Review on Question Answering Techniques, Paradigms and Systems.” Journal of King Saud University - Computer and Information Sciences, August. https://doi.org/10.1016/j.jksuci.2018.08.005.

Cheri, Joe, and Pushpak Bhattacharyya. 2017. “Towards Harnessing Memory Networks for Coreference Resolution.” In Proceedings of the 2nd Workshop on Representation Learning for NLP, 37–42. Vancouver, Canada: Association for Computational Linguistics. https://doi.org/10.18653/v1/W17-2605.

Dwivedi, Sanjay K., and Vaishali Singh. 2013. “Research and Reviews in Question Answering System.” Procedia Technology 10: 417–24. https://doi.org/10.1016/j.protcy.2013.12.378.

Forner, Pamela, Danilo Giampiccolo, Bernardo Magnini, Anselmo Peñas, Álvaro Rodrigo, and Richard Sutcliffe. 2010. “Evaluating Multilingual Question Answering Systems at CLEF,” 9.

Green, Bert F., Alice K. Wolf, Carol Chomsky, and Kenneth Laughery. 1961. “Baseball: An Automatic Question-Answerer.” In Papers Presented at the May 9-11, 1961, Western Joint IRE-AIEE-ACM Computer Conference on - IRE-AIEE-ACM ’61 (Western), 219. Los Angeles, California: ACM

Greff, Klaus, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. 2017. “LSTM: A Search Space Odyssey.” IEEE Transactions on Neural Networks and Learning Systems 28 (10): 2222–32. https://doi.org/10.1109/TNNLS.2016.2582924.

Hao, Xiaoyan, Xiaoming Chang, and Kaiying Liu. 2007. “A Rule-Based Chinese Question Answering System for Reading Comprehension Tests.” In Third International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP 2007), 325–29. Kaohsiung, Taiwan:

Higashinaka, Ryuichiro, and Hideki Isozaki. 2008. “Corpus-Based Question Answering for Why-Questions,” 8.

Hirschman, L., and R. Gaizauskas. 2001. “Natural Language Question Answering: The View from Here.” Natural Language Engineering 7 (4): 275–300.

Indurkhya, Nitin, and Fred J Damerau. 2010. “Natural Language Processing.” Machine Learning, 676.

Ishwari, K S D, A.K.R.R Aneeze, Y Mallawarachchi, and H.J. D.A Karunarathne. 2019. “Advances in Natural Language Question Answering: A Review,” 7.

Jayakody, J. A. T. K., T. S. K. Gamlath, W. A. N. Lasantha, K. M. K. P. Premachandra, A. Nugaliyadde, and Y. Mallawarachchi. 2016. “‘Mahoshadha’, the Sinhala Tagged Corpus Based Question Answering System.” In Proceedings of First International Conference on Information and Communication Technology for Intelligent Systems: Volume 1, edited by Suresh Chandra Satapathy and Swagatam Das, 50:313–22. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-30933-0_32.

Katz, Boris. 1997. “Annotating the World Wide Web Using Natural Language,” 7.

Kodra, Lorena, and Elinda Kajo. 2017. “Question Answering Systems: A Review on Present Developments, Challenges and Trends.” International Journal of Advanced Computer Science and Applications 8 (9). https://doi.org/10.14569/IJACSA.2017.080931.

Kolomiyets, Oleksandr, and Marie-Francine Moens. 2011. “A Survey on Question Answering Technology from an Information Retrieval Perspective.” Information Sciences 181 (24): 5412–34. https://doi.org/10.1016/j.ins.2011.07.047.

Kumar, Ankit, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. “Ask Me Anything: Dynamic Memory Networks for Natural Language Processing,” 10.

Kwok, Cody, Oren Etzioni, and Daniel S Weld. 2000. “Scaling Question Answering to the Web,” 22.

Lahbari, Imane, Said El Alaoui, and Khalid Zidani. 2018. “Toward a New Arabic Question Answering System” 15 (3): 10.

LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521 (7553): 436–44. https://doi.org/10.1038/nature14539.

Li, Juzheng, Hang Su, Jun Zhu, Siyu Wang, and Bo Zhang. 2018. “Textbook Question Answering Under Instructor Guidance with Memory Networks.” In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3655–63. Salt Lake City, UT, USA: IEEE. https://doi.org/10.1109/CVPR.2018.00385.

Mikolov, Tomas, and Geoffrey Zweig. 2012. “Context Dependent Recurrent Neural Network Language Model.” In 2012 IEEE Spoken Language Technology Workshop (SLT), 234–39. Miami, FL, USA: IEEE. https://doi.org/10.1109/SLT.2012.6424228.


Mishra, Amit, and Sanjay Kumar Jain. 2016. “A Survey on Question Answering Systems with Classification.” Journal of King Saud University - Computer and Information Sciences 28 (3): 345–61.

Murdock, J William, and Gerald Tesauro. 2016. “Statistical Approaches to Question Answering in Watson,” 8.

Nanda, Garima, Mohit Dua, and Krishma Singla. 2016. “A Hindi Question Answering System Using Machine Learning Approach.” In 2016 International Conference on Computational Techniques in Information and Communication Technologies (ICCTICT), 311–14. New Delhi, India: IEEE.

Ojokoh, Bolanle, Department of Information Systems, Federal University of Technology, Akure, Nigeria, Emmanuel Adebisi, and Department of Computer Science, Federal University of Technology, Akure, Nigeria. 2019. “A Review of Question Answering Systems.” Journal of Web Engineering 17 (8): 717–58. https://doi.org/10.13052/jwe1540-9589.1785.

Otter, Daniel W., Julian R. Medina, and Jugal K. Kalita. 2018. “A Survey of the Usages of Deep Learning in Natural Language Processing.” ArXiv:1807.10854

Rajpurkar, Pranav, Robin Jia, and Percy Liang. 2018. “Know What You Don’t Know: Unanswerable Questions for SQuAD 2.0.” ArXiv:1806.03822 [Cs],

Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. “SQuAD: 100,000+ Questions for Machine Comprehension of Text.” ArXiv:1606.05250 [Cs],

Rebele, Thomas, Fabian Suchanek, Johannes Hoffart, Joanna Biega, Erdal Kuzey, and Gerhard Weikum. 2016. “YAGO: A Multilingual Knowledge Base from Wikipedia, Wordnet, and Geonames.” In The Semantic Web – ISWC 2016, edited by Paul Groth, Elena Simperl, Alasdair Gray, Marta Sabou, Markus Krötzsch, Freddy Lecue, Fabian Flöck, and Yolanda Gil, 9982:177–85. Cham: Springer International Publishing.

Reddy, Siva, Danqi Chen, and Christopher D. Manning. 2018. “CoQA: A Conversational Question Answering Challenge.” ArXiv:1808.07042 [Cs], August. http://arxiv.org/abs/1808.07042.

Riloff, Ellen, and Michael Thelen. 2000. “A Rule-Based Question Answering System for Reading Comprehension Tests.” ANLP-NAACL 2000 Workshop: Reading Comprehension Tests as Evaluation for Computer-Based Language Understanding Systems.

Senevirathne, K.U., N.S. Attanayake, A.W.M.H. Dhananjanie, W.A.S.U. Weragoda, A. Nugaliyadde, and S. Thelijjagoda. 2015. “Conditional Random Fields Based Named Entity Recognition for Sinhala.” In 2015 IEEE 10th International Conference on Industrial and Information Systems (ICIIS), 302–7. Peradeniya, Sri Lanka: IEEE. https://doi.org/10.1109/ICIINFS.2015.7399028.

Tan, Ming, Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2015. “LSTM-Based Deep Learning Models for Non-Factoid Answer Selection.” ArXiv:1511.04108 [Cs], November. http://arxiv.org/abs/1511.04108.

Thaker, Rukhsana, and Ajay Goel. 2015. “Domain Specific Ontology Based Query Processing System for Urdu Language.” International Journal of Computer Applications 121 (13): 20–23. https://doi.org/10.5120/21601-4712.

Trischler, Adam, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2016. “NewsQA: A Machine Comprehension Dataset.” ArXiv:1611.09830 [Cs], November. http://arxiv.org/abs/1611.09830.

Wang, Di, and Eric Nyberg. 2015. “A Long Short-Term Memory Model for Answer Sentence Selection in Question Answering.” In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 707–12. Beijing, China: Association for Computational Linguistics. https://doi.org/10.3115/v1/P15-2116.


Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. “Gated Self-Matching Networks for Reading Comprehension and Question Answering.” In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 189–98. Vancouver, Canada: Association for Computational Linguistics. https://doi.org/10.18653/v1/P17-1018.

Weinzierl, Hadley. 2013. “The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East - United States,”

Weissenborn, Dirk, Georg Wiese, and Laura Seiffe. 2017. “Making Neural QA as Simple as Possible but Not Simpler.” ArXiv:1703.04816 [Cs], March. http://arxiv.org/abs/1703.04816.

Weston, Jason, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. 2015. “Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks.” ArXiv:1502.05698 [Cs, Stat], February. http://arxiv.org/abs/1502.05698.

Woods, W A. 1973. “Progress in Natural Language Understanding—An Application to Lunar Geology,” 10.

Yao, Xuchen, and Benjamin Van Durme. 2014. “Information Extraction over Structured Data: Question Answering with Freebase.” In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 956–66. Baltimore, Maryland: Association for Computational Linguistics. https://doi.org/10.3115/v1/P14-1090.

Zhang, Xin, An Yang, Sujian Li, and Yizhong Wang. 2019. “Machine Reading Comprehension: A Literature Review.” ArXiv:1907.01686 [Cs], June. http://arxiv.org/abs/1907.01686.

Questions

Thank You!!