
SINMIN: CORPUS FOR SINHALA LANGUAGE

Literature Review

Upeksha W. D.

Wijayarathna D. G. C. D.

Siriwardena M. P.

Lasandun K. H. L.

Supervisors :

Dr. Chinthana Wimalasuriya

Prof. Gihan Dias

Mr. N. H. N. D. De Silva

Sinmin is a corpus for the Sinhala language that:

➢ Is continuously updated

➢ Is dynamic (scalable)

➢ Covers a wide range of language (structured and unstructured)

OUTLINE

● Literature Review

● Introduction to corpus linguistics and What is a Corpus

● Usages of a corpus

● Existing Corpus Implementations

● Identifying Sinhala Sources and Crawling

● Data Storage and Information Retrieval from Corpus

● Information Visualization

● Extracting Linguistic Features

● Current Progress

INTRODUCTION TO CORPUS LINGUISTICS

AND WHAT IS A CORPUS

Handford, M. and McCarthy, M. J. (2004) “Invisible to us” - A preliminary corpus-based study of spoken Business English, Discourse in the Professions: Perspectives from Corpus Linguistics, 167-201.

WHAT IS A CORPUS?

“A corpus is a principled collection of authentic texts

stored electronically that can be used to discover

information about language that may not have been

noticed through intuition alone.” - Bennett (2010)

Bennett, G. R. (2010) Using Corpora in the Language Learning Classroom, Michigan ELT.

● There are eight main kinds of corpora.

● They are generalized corpora, specialized corpora, learner corpora, pedagogic corpora, historical corpora, parallel corpora, comparable corpora, and monitor corpora.

● The broadest type of corpus is the generalized corpus.

“Sinmin” will:

● be a generalized corpus.

● cover all types of Sinhala language.

USAGES OF A CORPUS

● Implementing translators, spell checkers and grammar

checkers.

● Identifying lexical and grammatical features of a language.

● Identifying how the language varies with context of usage and over time.

● Retrieving statistical details of a language.

● Providing backend support for tools like OCR, POS Tagger,

etc.
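As a small illustration of the "statistical details" use listed above, the following sketch counts word frequencies in a toy text. The sample sentence, the naive whitespace tokenizer and the function name `word_frequencies` are assumptions made for illustration only; they are not part of Sinmin.

```python
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count how often each whitespace-separated token occurs."""
    # Naive tokenization: split on whitespace, then strip common punctuation.
    tokens = [tok.strip(".,!?\"'") for tok in text.split()]
    return Counter(tok for tok in tokens if tok)


# Made-up sample sentences, used only to exercise the function.
sample = "මම පොත කියවමි. ඔහු පොත ලියයි."
freqs = word_frequencies(sample)
print(freqs.most_common(1))
```

A real corpus would need a proper Sinhala tokenizer; splitting on whitespace is only the simplest possible stand-in.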

EXISTING CORPUS IMPLEMENTATIONS

● There is an existing corpus for the Sinhala language, known as the UCSC Text Corpus of Contemporary Sinhala.

● It consists of about 10 million words, but it covers only a small part of the language and it is not being updated.

CORPUS FOR SINHALA LANGUAGE?

COMPOSITION OF THE CORPUS

● The language in the corpus cannot be random; it must be chosen according to specific characteristics.

● It must use authentic texts. The language it contains is not made up for the sole purpose of creating the corpus.

EXAMPLE - COMPOSITION OF COCA

● The COCA contains more than 385 million words

from 1990–2008 (20 million words each year).

● Texts are evenly divided among 5 genres: spoken (20%), fiction (20%), popular magazines (20%), newspapers (20%) and academic journals (20%).

COMPOSITION OF UCSC TEXT CORPUS OF

CONTEMPORARY SINHALA

DATA STORAGE AND INFORMATION

RETRIEVAL FROM CORPUS

Existing corpora use two main technologies for data storage:

● Relational databases

● Indexed file systems

INDEXED FILE SYSTEMS AS STORAGE

● BNC uses this mechanism.

● Data is stored as XML-like files that follow a scheme known as the Corpus Data Interchange Format.

● This makes it possible to store a great deal of detail about the structure of each text, such as its division into sections or chapters, paragraphs, verse lines, etc.
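To make the idea of structural markup concrete, here is a minimal sketch that parses a small XML document divided into chapters and paragraphs. The element and attribute names (`text`, `div`, `p`, `type`, `n`) are illustrative assumptions and should not be read as the actual BNC scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element and attribute names are illustrative only.
SAMPLE = """
<text id="demo">
  <div type="chapter" n="1">
    <p>First paragraph of chapter one.</p>
    <p>Second paragraph of chapter one.</p>
  </div>
  <div type="chapter" n="2">
    <p>Only paragraph of chapter two.</p>
  </div>
</text>
"""


def paragraphs_per_chapter(xml_source: str) -> dict:
    """Map each chapter number to the number of paragraphs it contains."""
    root = ET.fromstring(xml_source.strip())
    return {
        div.get("n"): len(div.findall("p"))
        for div in root.findall("div[@type='chapter']")
    }


counts = paragraphs_per_chapter(SAMPLE)
print(counts)
```

Because the structure is explicit in the markup, questions such as "how many paragraphs does each chapter have" can be answered without guessing from layout.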

RELATIONAL DATABASE AS STORAGE

● COCA and Corpus del Español use relational databases.

DATA MODEL IN COCA

CORPUS DEL ESPAÑOL USES SEPARATE

TABLES FOR BIGRAMS AND TRIGRAMS
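The idea of keeping one table per n-gram order can be sketched as follows. This is not the schema of Corpus del Español: the table names, column names and the use of an in-memory SQLite database are all assumptions chosen to keep the example self-contained.

```python
import sqlite3
from collections import Counter


def ngrams(tokens, n):
    """Return the list of consecutive n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_ngram_tables(conn, tokens):
    """Create and fill one table per n-gram order (hypothetical schema)."""
    conn.execute("CREATE TABLE bigrams (w1 TEXT, w2 TEXT, freq INTEGER)")
    conn.execute("CREATE TABLE trigrams (w1 TEXT, w2 TEXT, w3 TEXT, freq INTEGER)")
    # Store each distinct n-gram once, together with its frequency.
    for gram, freq in Counter(ngrams(tokens, 2)).items():
        conn.execute("INSERT INTO bigrams VALUES (?, ?, ?)", (*gram, freq))
    for gram, freq in Counter(ngrams(tokens, 3)).items():
        conn.execute("INSERT INTO trigrams VALUES (?, ?, ?, ?)", (*gram, freq))
    conn.commit()


conn = sqlite3.connect(":memory:")
build_ngram_tables(conn, "the cat sat on the cat mat".split())
top_bigram = conn.execute(
    "SELECT w1, w2, freq FROM bigrams ORDER BY freq DESC LIMIT 1"
).fetchone()
print(top_bigram)
```

Precomputing frequencies in separate tables trades storage space for faster lookups of common word sequences.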

RELATIONAL DB VS INDEXED FILE

SYSTEMS

● Indexed file systems make extensive use of indexes.

● Relational database models are relatively fast.

● In indexed file systems, it is difficult to add additional layers of annotation.

No study has been done on how NoSQL databases perform when used to implement corpora.

INFORMATION VISUALIZATION

Most popular corpora, such as BNC, COCA, Corpus del Español and the Google Books corpus, use a similar kind of web interface.

USER INTERFACE

OF COCA

GOOGLE BOOKS NGRAM VIEWER UI

EXTRACTING LINGUISTIC FEATURES

● A main use of a language corpus is extracting linguistic features of a language.

● Linguistic features of many languages have been identified using corpora.

● Example - A corpus-based linguistics analysis on

written corpus: colligation of “TO” and “FOR.”

CURRENT PROGRESS

IDENTIFIED SINHALA RESOURCES

● Online Newspapers

● News Websites

● School Textbooks

● Sinhala Wikipedia

● Online Mahawansaya

● Subtitles

● Sinhala Fiction

● Sinhala Blogs

● Sinhala Magazines

● Gazette

DIVIDED INTO 5 MAIN GENRES

● News: Newspapers, News Items

● Academic: Textbooks, Religious, Wikipedia, Mahawansa

● Creative Writing: Fiction, Blogs, Magazines

● Spoken: Subtitles

● Gazette: Gazette

Implemented crawlers for different sources, all adhering to the same format.

https://github.com/madurangasiriwardena/corpus.sinhala.crawler


FINISHED CRAWLERS

CRAWLED DATA SAVED TO XML FILES WITH

FOLLOWING META DATA

● Post Name

● Author

● Link

● Published Date
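A minimal sketch of saving one crawled post with the metadata fields listed above might look like this. The tag names (`post`, `name`, `author`, `link`, `date`, `content`), the function `post_to_xml` and the sample values are hypothetical; the crawlers' actual XML layout is defined in their repository and may differ.

```python
import xml.etree.ElementTree as ET


def post_to_xml(name, author, link, date, content):
    """Serialize one crawled post, with its metadata, to an XML string."""
    post = ET.Element("post")
    # One child element per metadata field, followed by the text itself.
    ET.SubElement(post, "name").text = name
    ET.SubElement(post, "author").text = author
    ET.SubElement(post, "link").text = link
    ET.SubElement(post, "date").text = date
    ET.SubElement(post, "content").text = content
    return ET.tostring(post, encoding="unicode")


# Placeholder values, used only to exercise the function.
xml_text = post_to_xml(
    name="Sample post",
    author="Unknown",
    link="http://example.com/post/1",
    date="2014-01-01",
    content="සිංහල පෙළ",
)
print(xml_text)
```

Keeping every crawler on one such layout means later stages can load data from any source with the same parsing code.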

CRAWLER CONTROLLER

Crawler controller monitors and handles the status of

the web crawlers.

Crawler controller address -

http://Sinhala-corpus.projects.uom.lk:8080/CrawlerControllerWeb


We tested the performance of several database systems to determine which one we should use to store the data.

WE CONSIDERED FOLLOWING DATA

STORAGE SYSTEMS

We measured performance for inserting data and for retrieving data for 12 different information needs.
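The general shape of such a measurement can be sketched as below. This is not the benchmark used in the study: the in-memory SQLite database, the `words` table, the helper names and the single example query are assumptions used only to show how insertion and retrieval can be timed separately.

```python
import sqlite3
import time


def time_call(fn, *args):
    """Run fn(*args) once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def insert_words(conn, words):
    """Insert one row per word occurrence into a hypothetical table."""
    conn.executemany("INSERT INTO words (word) VALUES (?)", [(w,) for w in words])
    conn.commit()


def frequency_of(conn, word):
    """One example information need: how often does a word occur?"""
    row = conn.execute(
        "SELECT COUNT(*) FROM words WHERE word = ?", (word,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
words = ["a", "b", "a", "c"] * 1000  # synthetic data, 4000 rows

_, insert_seconds = time_call(insert_words, conn, words)
count, query_seconds = time_call(frequency_of, conn, "a")
print(f"insert: {insert_seconds:.4f}s, query: {query_seconds:.4f}s, count={count}")
```

A fair comparison would repeat each measurement several times, on realistic data sizes, for every storage system under test.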

Data set and source code -

https://github.com/madurangasiriwardena/performance-test


DATA INSERTION TIME COMPARISON

INFORMATION RETRIEVAL PERFORMANCE

COMPARISON - PART 1

INFORMATION RETRIEVAL PERFORMANCE

COMPARISON - PART 2

Cassandra performed better than the other systems in most of the scenarios, and its insertion time increased linearly.

We therefore chose it for implementing the corpus.

USER INTERFACE DESIGN AND

IMPLEMENTATION

● The web interface of Sinmin is designed for users who prefer a visualized and summarized view of Sinmin's statistical data.

● The visual design of the interface lets a user with no prior experience of it meet their information needs with little effort.

http://sinhala-corpus.projects.uom.lk/sinmin-web/


CORPUS API DESIGN AND IMPLEMENTATION

• REST API to expose Corpus services

• More complex and customizable data retrieval and filtering

• Interface for third party applications to consume

PUBLICATIONS

● Comparison between performance of various

database systems for implementing a language

corpus - 11th Beyond Databases, Architectures and

Structures conference (Pending)

● Implementing a Corpus for Sinhala Language -

Symposium on Language Technology for South Asia

(Pending)

REMAINING WORK FOR THE NEXT PHASE

• Finish writing crawlers

• Feed data to the Cassandra database

• Connect the front end to the API

Questions?

Thank you!