CIS 895 – MSE PROJECT
KDD - Service based Numerical Entity Searcher (KSNES)
Presentation 3 on April 14th, 2009
Naga Sowjanya Karumuri
sowji@ksu.edu


OUTLINE

- Introduction
  - Terms
  - Motivation
  - Goal
- Project Overview
- Project Data Flow Diagram
- Component Design
- Project Evaluation
- Future Work
- Prototype Demonstration
- Questions / Comments

TERMS[1]

- Knowledge Discovery in Databases (KDD)
  - A group headed by Dr. Hsu
  - Primary focus is machine learning, data mining, and human-computer intelligent interaction
- Natural Language Processing (NLP)
  - To allow computers to process and understand human languages
  - Some areas:
    - Text segmentation (identify word boundaries)
    - Part-of-speech tagging
    - Word sense disambiguation (words with more than one meaning)

TERMS[2]

- Named Entity Recognition (NER)
  - Locating and classifying atomic elements (single part of speech) in text into predefined categories such as:
    - Names of Persons
    - Names of Locations
    - Names of Organizations
    - Names of Miscellaneous Entities
- Example
  - Dr. William H. Hsu is a Professor at Kansas State University located in Manhattan, Kansas.
  - Dr. [PER William H. Hsu ] is a Professor at [ORG Kansas State University ] located in [LOC Manhattan ] , [LOC Kansas ] .

TERMS[3]

- Shallow Parsing/Chunking
  - NLP technique that looks for key phrases without fully parsing the sentence into a parse tree.
  - Output: a series of word groups, mostly noun, verb, and prepositional phrases.
- Example
  - Chunker: [NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only L1.8 billion ]
  - Full Parser: (PRP)He (VBZ)reckons (DT)the (JJ)current (NN)account (NN)deficit (MD)will (VB)narrow (TO)to (RB)only (L)L (CD)1.8 (CD)billion

PROJECT OVERVIEW[1]

- Motivation
  - Occurrence of events is naturally anchored in time within the narrative text
    - Is Bush currently the President of America?
    - When was India attacked by Pakistan in the last century?
  - To know the quantities of entities
    - How many Oscar awards has Steven Spielberg won?
    - What was the highest temperature recorded in the year 2008?

PROJECT OVERVIEW[2]

- Goal
  - To develop a system that
    - extracts Numerical Phrases from raw text
    - displays value – unit – unit-type
  - System is set up as a service on the web server
  - User interacts through a webpage
- Numerical Phrase: Types
  - Number Phrase
    - 33 dollars, 100 Watts, 13 years, two miles
  - Date Phrase
    - Aug 1998, Nov 10th 1984, between 1989 and 2006
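A minimal sketch of what a value – unit – unit-type display could look like for a Number Phrase. The class name `NumberPhraseSketch`, the `UNIT_TYPES` lookup table, and its entries are assumptions made for illustration only; the slides do not say how KSNES derives the unit-type.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class NumberPhraseSketch {
    // Hypothetical unit -> unit-type table (illustrative entries, not from KSNES).
    private static final Map<String, String> UNIT_TYPES = new LinkedHashMap<>();
    static {
        UNIT_TYPES.put("dollars", "currency");
        UNIT_TYPES.put("watts", "power");
        UNIT_TYPES.put("years", "time");
        UNIT_TYPES.put("miles", "length");
    }

    /** Splits a two-token phrase such as "33 dollars" into {value, unit, unit-type}. */
    static String[] describe(String phrase) {
        String[] tokens = phrase.trim().split("\\s+");
        if (tokens.length != 2) {
            throw new IllegalArgumentException("expected '<value> <unit>': " + phrase);
        }
        // Units missing from the table are reported as "unknown".
        String unitType = UNIT_TYPES.getOrDefault(tokens[1].toLowerCase(), "unknown");
        return new String[] { tokens[0], tokens[1], unitType };
    }

    public static void main(String[] args) {
        for (String p : new String[] { "33 dollars", "100 Watts", "13 years", "two miles" }) {
            System.out.println(String.join(" - ", describe(p)));
        }
    }
}
```

The sketch only handles the simplest "value unit" shape; Date Phrases and multi-word units would need their own handling.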

PROJECT OVERVIEW[3]

- Purpose
  - To understand the timestamp of an event
  - To understand the order of occurrence of events
  - To understand the persistence of an event, i.e., the time period over which the event occurred and continued
- For KDD Group
  - To gather certain statistical information from the data they gather by crawling different web pages
    - How many cattle have been affected by the virus?
    - When did the disease break out?
- Sample NABC (National Agricultural Bio-Security Centre) data is given to the system for testing

APPLICATION AREAS

- Textual Entailment (TE) Recognition
  - Given two text fragments, decide whether the meaning of one text can be inferred from the other.
- Question Answering (QA) System
  - Identifies text that entails the expected answer.

Ex: During 1997, 10,000 cattle were killed because of the RVF.

- Possible inferences (TE)
  - 10,000 cattle were killed because of RVF.
  - RVF occurred during 1997.
- Possible Questions (QA)
  - How many cattle were killed during the 1997 RVF outbreak?
  - When did RVF occur?

SYSTEM OVERVIEW


PROJECT DATA FLOW DIAGRAM: NUMERICAL ENTITY SEARCHER

MODULES IN THE PROJECT

- Webpage (JSP): For requesting and receiving information from the service.
- POS Tagger (Java): Stanford POS Tagger
- Numerical Phrase Extractor (Java): Implemented using the Shallow Parsing technique
- Number-Unit/Date Pattern Recognizer (Java): Implemented based on the Numerical Quantifier developed by Benjamin Sapp, UIUC.

POS TAGGER TAGSET

http://www.cs.ualberta.ca/~lindek/650/Slides/POSTagging.ppt

IMPLEMENTING NUMERICAL PHRASE EXTRACTOR

- Input: Tagged Text
  - I/PRP lost/VBD thirty-three/JJ dollars/NNS in/IN 1998/CD
- Regular expressions (regex) are used to determine the numerical patterns in the input.
  - thirty-three/JJ dollars/NNS
  - in/IN 1998/CD
- Output: Numerical Phrases
  - thirty-three dollars
  - in 1998
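The input-to-output step above can be sketched as follows. This is an illustration under stated assumptions, not the KSNES implementation: the class `TaggedPhraseExtractorSketch` and both regular expressions are hypothetical and cover only the sentence shown on the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TaggedPhraseExtractorSketch {
    // Hypothetical pattern: hyphenated word + noun, e.g. "thirty-three/JJ dollars/NNS".
    private static final String HYPHEN_WORD_NOUN =
            "\\b[a-zA-Z]+-[a-zA-Z]+/JJ [a-zA-Z]+/NNS?";
    // Hypothetical pattern: "in" + four-digit number, e.g. "in/IN 1998/CD".
    private static final String IN_YEAR = "\\b(?:in|In)/IN \\d{4}/CD";

    private static final Pattern NUMERICAL =
            Pattern.compile(HYPHEN_WORD_NOUN + "|" + IN_YEAR);

    /** Returns the numerical phrases found in POS-tagged text, with tags removed. */
    static List<String> extract(String taggedText) {
        List<String> phrases = new ArrayList<>();
        Matcher m = NUMERICAL.matcher(taggedText);
        while (m.find()) {
            // Drop the "/TAG" suffix from every token of the match.
            phrases.add(m.group().replaceAll("/[A-Z]+", ""));
        }
        return phrases;
    }

    public static void main(String[] args) {
        String tagged = "I/PRP lost/VBD thirty-three/JJ dollars/NNS in/IN 1998/CD";
        System.out.println(extract(tagged)); // prints [thirty-three dollars, in 1998]
    }
}
```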

SOME PATTERNS

"\\d+-\\d+(/JJ|/CD) [a-zA-Z]+/NN" parses

  3-2/JJ lead/NN
  20-20/JJ match/NN

"(between|Between|from|From|In|in|since|Since|during|During)/IN ..../CD (([a-zA-Z]+/CC|[a-z]+/TO) ..../CD)?" parses

  'between 1987 and 1997', 'in 2007 and 2008'
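The two patterns on this slide can be tried directly with `java.util.regex`. In the sketch below the pattern strings are transcribed from the slide; the class name, the helper `firstMatch`, and the tagged form of the date examples (for instance `and/CC`) are assumptions, since the slide shows those examples untagged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SlidePatternsSketch {
    // Pattern 1 from the slide: hyphenated number pair followed by a noun.
    static final Pattern HYPHEN_NUM_NUM =
            Pattern.compile("\\d+-\\d+(/JJ|/CD) [a-zA-Z]+/NN");

    // Pattern 2 from the slide: preposition, a four-character number, and an
    // optional "and/to <number>" tail. As transcribed, it expects a space
    // after the first "/CD" even when the optional tail is absent.
    static final Pattern BETWEEN_FROM = Pattern.compile(
            "(between|Between|from|From|In|in|since|Since|during|During)/IN ..../CD "
            + "(([a-zA-Z]+/CC|[a-z]+/TO) ..../CD)?");

    /** Returns the first substring of taggedText matched by p, or null if none. */
    static String firstMatch(Pattern p, String taggedText) {
        Matcher m = p.matcher(taggedText);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(HYPHEN_NUM_NUM, "a/DT 3-2/JJ lead/NN"));
        System.out.println(firstMatch(BETWEEN_FROM, "between/IN 1987/CD and/CC 1997/CD"));
    }
}
```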

COMPONENT DESIGN

- Contains class variables and functions
- Added separate table to describe the roles of functions

COMPONENT DESIGN (MYPATTERNS)[1]

Patterns | Matching Numerical Phrases
p_words | about, around, approximately, more than, nearly, almost, no more than, at least, less than, no fewer than
p_tnl | this, next, last, since, in
p_inl | between, from, in, since, during
p_words + p_abtfrac | about two-thirds of the vote, millions of books
p_words + p_age | 27 year-old bachelor, 27-year-old bachelor
p_words + p_ampm | About 3:00 a.m., 4:15 p.m. CST
p_and | 3,792 children and adolescents
p_tnl + p_anydate | Oct 1st 1987, Nov 5, December 21, 1998

COMPONENT DESIGN (MYPATTERNS)[2]

Patterns | Matching Numerical Phrases
p_inl + p_btwfrm | between 1987 and 1997, in 2007 and 2008
p_inl + p_btwfrmd | from 200 to 300 miles, from 7.5 percent to 6.85 percent
p_date | 18 April 2008
p_tnl + p_days | this Monday, next Saturday, last Friday, Tuesday, Wednesday
p_centuary | 17th century, 17th-century
p_words + p_hyphenww | million-dollar home, six-bedroom home, thirty-three dollars
p_hyphennumnum | the 20-20 match, a 3-2 lead
p_in | 9 in 10 people, 1 in every 8 women
p_mids | mid-1990s, the early 1990s, 1970s
p_months | January, February, December, Jan, Feb, Sept, Dec

COMPONENT DESIGN (MYPATTERNS)[3]

Patterns | Matching Numerical Phrases
p_words + p_numunit | 33 USD, about 34 miles, 33,333 tons, 3.3 million dollars, one thing, 3.4 billion
p_words + p_per | $33 per day, about 100 miles per hour
p_words + p_percentinches | 39%, 0.5-1%, about 90 %, 20"
p_ratio | one of the five people, 89 percent of people, 3 out of 5 people
p_tty | today, tomorrow, yesterday, noon
p_twmy | this year, this month, next year, next month, last week, last year, last month
p_xbits | 1024KB, 8MB, 320GB, 1TB
p_words + p_yrange | In 1998-99, during 2000-09
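The "p_words + ..." rows in the pattern tables suggest that an approximation prefix is concatenated with a core pattern. A minimal sketch of that composition, assuming the prefix is optional and working on untagged text for brevity: the word list comes from the p_words row, while the body of `P_NUMUNIT` and the class name are hypothetical stand-ins, not the project's actual definitions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ComposedPatternSketch {
    // Word list taken from the p_words row of the pattern table.
    static final String P_WORDS =
            "(?:about|around|approximately|more than|nearly|almost|"
            + "no more than|at least|less than|no fewer than)";

    // Hypothetical stand-in for p_numunit: number, optional scale word, unit word.
    static final String P_NUMUNIT =
            "\\d[\\d,.]*(?: (?:million|billion))? [a-zA-Z]+";

    // "p_words + p_numunit": optional approximation prefix, then the number-unit core.
    static final Pattern WORDS_NUMUNIT =
            Pattern.compile("(?:" + P_WORDS + " )?" + P_NUMUNIT);

    /** Returns the first number-unit phrase in text, or null if there is none. */
    static String firstMatch(String text) {
        Matcher m = WORDS_NUMUNIT.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("It is about 34 miles away"));   // about 34 miles
        System.out.println(firstMatch("He paid 3.3 million dollars")); // 3.3 million dollars
    }
}
```

Keeping the prefix and the core as separate strings lets the same prefix be reused with other cores such as p_per or p_age.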

SAMPLE SENTENCES[1]

Sentence | Patterns
I have lost 33,000 dinars in 1998 | p_numunit, p_btwfrm
At just 12-years-old, he enrolled as a freshman at F.I.U. in Miami. | p_age
The 20" iMac is cheaper at $1200 and it has a 320GB hard drive. | p_percentinches, p_numunit, p_xbits
Volunteers bring in a heavy crane for work on a bridge last month. | p_twmy
As for those who do not invest, around 40% say capitalism is better. | p_percentinches
As of 7 January 2007, about 75 people have died and another 183 infected. | p_date, p_numunit

SAMPLE SENTENCES[2]

Sentence | Patterns
Approximately 1% of human sufferers die of the disease. | p_percentinches
Current listings of 2,000 children and adults who are reported missing, including in-depth coverage of high-profile cases. | p_and
38 of the 62 patients who provided blood samples tested positive. | p_ratio
She became an exotic dancer at Scores in New York City in the mid-1990s. | p_mids
Peterson's three capped the surge, giving New Orleans a 64-51 lead. | p_numunit, p_hyphennumnum

PROBLEMS ENCOUNTERED

- Determining the Patterns
  - Lots of Numerical Phrases found
  - Designed Patterns to filter more than one kind of Numerical Pattern
- Prioritizing the Patterns
  - More than one pattern may match the same Numerical Phrase
  - To avoid clashes between the Patterns
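One way to picture the prioritization problem: apply the patterns in a fixed order and let a later pattern claim only text that no earlier pattern has already matched. The sketch below is an assumption about how such a scheme could work, not the approach documented for KSNES. The names p_yrange and p_hyphennumnum come from the pattern tables, but their regex bodies here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PatternPrioritySketch {
    // Insertion order of the map is the priority order (highest first).
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        // Hypothetical regex bodies; both would match "1998-99" on their own.
        PATTERNS.put("p_yrange", Pattern.compile("\\b\\d{4}-\\d{2}\\b"));
        PATTERNS.put("p_hyphennumnum", Pattern.compile("\\b\\d+-\\d+\\b"));
    }

    /** Labels each matched phrase with the highest-priority pattern that matches it. */
    static List<String> label(String text) {
        boolean[] claimed = new boolean[text.length()];
        List<String> labels = new ArrayList<>();
        for (Map.Entry<String, Pattern> e : PATTERNS.entrySet()) {
            Matcher m = e.getValue().matcher(text);
            while (m.find()) {
                if (overlaps(claimed, m.start(), m.end())) {
                    continue; // a higher-priority pattern already owns this span
                }
                Arrays.fill(claimed, m.start(), m.end(), true);
                labels.add(e.getKey() + ": " + m.group());
            }
        }
        return labels;
    }

    private static boolean overlaps(boolean[] claimed, int start, int end) {
        for (int i = start; i < end; i++) {
            if (claimed[i]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(label("In 1998-99 the team took a 3-2 lead"));
    }
}
```

Results come back grouped by pattern priority rather than by position in the text; sorting by match offset would be a small extension.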

PROJECT EVALUATION[1]

Test Case | Main Functionality Tested | Pass/Fail
Test Case 1 | Application Functionality | Pass
Test Case 2 | POS Tagger Functionality | Pass
Test Case 3 | Numerical Phrase Extractor Functionality | Pass
Test Case 4 | Number-Unit/Date Pattern Recognizer Functionality | Pass

PROJECT EVALUATION[2]

Phase | Expected Completion | Actual Completion
1 | February 26, 2009 | February 24, 2009
2 | March 26, 2009 | March 31, 2009
3 | April 17, 2009 | April 14, 2009

PROJECT EVALUATION[3]

Phase 2 took more time because implementation and testing were done simultaneously.

PROJECT EVALUATION[4]

More time was spent on coding and documentation.

PROJECT EVALUATION[5]

More time was spent on discussion, since this was the initial phase.

PROJECT EVALUATION[6]

More time was spent on coding after gathering the requirements in the first phase.

PROJECT EVALUATION[7]

A lot of time was spent on documentation to meet the ETDR standards.

FUTURE WORK

- Adding more Patterns
  - To filter more kinds of numerical phrases
- Improving the Output Display
  - By displaying the number and date phrases in different colors
  - To make it more readable for the user

LESSONS LEARNED

- Java
- Tool Usage
  - Java Eclipse IDE
- Design Development
  - MS Visio
- SDLC
- Documentation

PROTOTYPE DEMONSTRATION

- KSNES Project
  - Set up as a Service on the CIS Server
  - A webpage is set up: http://viper.cis.ksu.edu:11603/numerical/

FINAL STEPS

- Final Examination Ballot
- Make necessary changes to the MSE Portfolio
- Deliver the Portfolio

Questions??

Suggestions!!

THANK YOU