CIS 895 – MSE PROJECT
KDD- Service based Numerical Entity Searcher (KSNES)
Presentation 3 on April 14th, 2009
Naga Sowjanya
OUTLINE
Introduction
  Terms
  Motivation
  Goal
Project Overview
Project Data Flow Diagram
Component Design
Project Evaluation
Future Work
Prototype Demonstration
Questions / Comments
TERMS[1]
Knowledge Discovery in Databases (KDD)
  A research group headed by Dr. Hsu
  Primary focus: machine learning, data mining, and human-computer intelligent interaction
Natural Language Processing (NLP)
  Allows computers to process and understand human languages
  Some areas:
    Text segmentation (identifying word boundaries)
    Part-of-speech tagging
    Word sense disambiguation (for words with more than one meaning)
TERMS[2]
Named Entity Recognition (NER)
  Locating and classifying atomic elements (single part of speech) in text into predefined categories such as
    Names of Persons
    Names of Locations
    Names of Organizations
    Names of Miscellaneous Entities
Example
  Input: Dr. William H. Hsu is a Professor at Kansas State University located in Manhattan, Kansas.
  Output: Dr. [PER William H. Hsu ] is a Professor at [ORG Kansas State University ] located in [LOC Manhattan ] , [LOC Kansas ] .
TERMS[3]
Shallow Parsing/Chunking
  An NLP technique that looks for key phrases without fully parsing the sentence into a parse tree.
  Output: a series of chunks, mostly noun, verb, and prepositional phrases.
Example
  Chunker: [NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only L1.8 billion ]
  Full Parser: (PRP)He (VBZ)reckons (DT)the (JJ)current (NN)account (NN)deficit (MD)will (VB)narrow (TO)to (RB)only (L)L (CD)1.8 (CD)billion
PROJECT OVERVIEW[1]
Motivation
  Events are naturally anchored in time within narrative text
    Is Bush currently the President of America?
    When was India attacked by Pakistan in the last century?
  To know the quantities of entities
    How many Oscar awards has Steven Spielberg won?
    What was the highest temperature recorded in the year 2008?
PROJECT OVERVIEW[2]
Goal
  To develop a system that
    extracts Numerical Phrases from raw text
    displays value – unit – unit-type
  The system is set up as a service on the web server
  The user interacts through a webpage
Numerical Phrase: Types
  Number Phrase
    33 dollars, 100 Watts, 13 years, two miles
  Date Phrase
    Aug 1998, Nov 10th 1984, between 1989 and 2006
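As a rough illustration of the value – unit – unit-type display, the sketch below splits a simple number phrase into its three parts. The class name `NumberPhraseSketch`, the `UNIT_TYPES` table, and its entries are assumptions made for this example, not the KSNES implementation. It only handles values written in digits, so a phrase such as "two miles" is not covered.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberPhraseSketch {
    // Hypothetical unit-to-type lookup; the real system's table is not shown in the slides.
    private static final Map<String, String> UNIT_TYPES = new LinkedHashMap<>();
    static {
        UNIT_TYPES.put("dollars", "currency");
        UNIT_TYPES.put("watts", "power");
        UNIT_TYPES.put("years", "time");
        UNIT_TYPES.put("miles", "length");
    }

    // A digit value (optional commas or decimals) followed by a single unit word.
    private static final Pattern NUMBER_UNIT =
            Pattern.compile("(\\d[\\d,]*(?:\\.\\d+)?)\\s+([A-Za-z]+)");

    /** Returns {value, unit, unitType}, or null if the phrase does not fit the pattern. */
    static String[] split(String phrase) {
        Matcher m = NUMBER_UNIT.matcher(phrase.trim());
        if (!m.matches()) {
            return null;
        }
        String value = m.group(1).replace(",", "");
        String unit = m.group(2);
        String unitType = UNIT_TYPES.getOrDefault(unit.toLowerCase(Locale.ROOT), "unknown");
        return new String[] {value, unit, unitType};
    }

    public static void main(String[] args) {
        for (String phrase : new String[] {"33 dollars", "100 Watts", "13 years"}) {
            System.out.println(String.join(" | ", split(phrase)));
        }
    }
}
```

Units missing from the lookup table fall back to the type "unknown" rather than failing, which keeps the display usable for phrases the table does not yet cover.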
PROJECT OVERVIEW[3]
Purpose
  To understand the timestamp of an event
  To understand the order of occurrence of events
  To understand the persistence of an event, i.e., the time period over which the event occurred and continued
For KDD Group
  To gather certain statistical information from the data they collect by crawling different web pages
    How many cattle have been affected by the virus?
    When did the disease break out?
  Sample NABC (National Agricultural Bio-Security Centre) data is given to the system for testing
APPLICATION AREAS
Textual Entailment (TE) Recognition
  Given two text fragments, decides whether the meaning of one can be inferred from the other.
Question Answering (QA) System
  Identifies text that entails the expected answer.
Ex: During 1997, 10,000 cattle were killed because of the RVF.
  Possible inferences (TE)
    10,000 cattle were killed because of RVF.
    RVF occurred during 1997.
  Possible questions (QA)
    How many cattle were killed during the 1997 RVF outbreak?
    When did RVF occur?
SYSTEM OVERVIEW
PROJECT DATA FLOW DIAGRAM:
NUMERICAL ENTITY SEARCHER
MODULES IN THE PROJECT
Webpage (JSP): For requesting and receiving information from the service.
POS Tagger (Java): Stanford POS Tagger
Numerical Phrase Extractor (Java): Implemented using Shallow Parsing Technique
Number-Unit/Date Pattern Recognizer (Java): Implemented based on the Numerical Quantifier developed by Benjamin Sapp, UIUC.
POS TAGGER TAGSET
Source: http://www.cs.ualberta.ca/~lindek/650/Slides/POSTagging.ppt
IMPLEMENTING NUMERICAL PHRASE EXTRACTOR
Input: Tagged Text
  I/PRP lost/VBD thirty-three/JJ dollars/NNS in/IN 1998/CD
Regular expressions (regex) are used to find the numerical patterns in the input.
  thirty-three/JJ dollars/NNS in/IN 1998/CD
Output: Numerical Phrases
  thirty-three dollars in 1998
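The input-to-output step on this slide can be sketched in Java roughly as follows. The regex here is a simplified stand-in written for this example, and the class name `TaggedPhraseSketch` is hypothetical. The project's actual patterns are more numerous and are not reproduced in full in the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TaggedPhraseSketch {
    // Simplified, hypothetical pattern with two alternatives:
    //  1. a digit value or hyphenated number word tagged /JJ or /CD, followed by a noun
    //  2. a preposition followed by a four-character /CD token such as a year
    private static final Pattern NUMERICAL = Pattern.compile(
            "(?:\\d[\\d,.]*|[a-z]+-[a-z]+)/(?:JJ|CD) [A-Za-z]+/NNS?"
            + "|[A-Za-z]+/IN ..../CD");

    /** Finds numerical phrases in POS-tagged text and strips the /TAG suffixes. */
    static List<String> extract(String taggedText) {
        List<String> phrases = new ArrayList<>();
        Matcher m = NUMERICAL.matcher(taggedText);
        while (m.find()) {
            // Drop the tag from every token in the matched span.
            phrases.add(m.group().replaceAll("/[A-Z$]+", ""));
        }
        return phrases;
    }

    public static void main(String[] args) {
        String tagged = "I/PRP lost/VBD thirty-three/JJ dollars/NNS in/IN 1998/CD";
        System.out.println(String.join(" ", extract(tagged)));
    }
}
```

Matching on the tagged text and stripping the tags only afterwards lets the regex use the part-of-speech information while the output stays plain text.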
SOME PATTERNS
"\\d+-\\d+(/JJ|/CD) [a-zA-Z]+/NN"
  parses '3-2/JJ lead/NN', '20-20/JJ match/NN'
"(between|Between|from|From|In|in|since|Since|during|During)/IN ..../CD (([a-zA-Z]+/CC|[a-z]+/TO) ..../CD)?"
  parses 'between 1987 and 1997', 'in 2007 and 2008'
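To show how these two patterns behave on tagged text, here is a minimal Java sketch. The regex strings are copied from the slide. The class name, constant names, and the `firstMatch` helper are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SlidePatternsSketch {
    // Both regexes come from the slide; the constant names are hypothetical.
    static final Pattern HYPHEN_NUM_NUM =
            Pattern.compile("\\d+-\\d+(/JJ|/CD) [a-zA-Z]+/NN");
    static final Pattern BETWEEN_FROM = Pattern.compile(
            "(between|Between|from|From|In|in|since|Since|during|During)/IN ..../CD"
            + " (([a-zA-Z]+/CC|[a-z]+/TO) ..../CD)?");

    /** Returns the first tagged span matched by the pattern, or null if there is none. */
    static String firstMatch(Pattern pattern, String taggedText) {
        Matcher m = pattern.matcher(taggedText);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(HYPHEN_NUM_NUM, "a/DT 3-2/JJ lead/NN"));
        System.out.println(firstMatch(BETWEEN_FROM, "between/IN 1987/CD and/CC 1997/CD"));
    }
}
```

One detail worth noting: as written, the space before the optional second group is mandatory, so a bare `in/IN 1998/CD` at the very end of the input does not match unless another token (or at least a space) follows it.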
COMPONENT DESIGN
Contains class variables and functions
A separate table was added to describe the roles of the functions
COMPONENT DESIGN (MYPATTERNS)[1]
Patterns                   Matching Numerical Phrases
p_words                    about, around, approximately, more than, nearly, almost, no more than, at least, less than, no fewer than
p_tnl                      this, next, last, since, in
p_inl                      between, from, in, since, during
p_words + p_abtfrac        about two-thirds of the vote, millions of books
p_words + p_age            27 year-old bachelor, 27-year-old bachelor
p_words + p_ampm           About 3:00 a.m., 4:15 p.m. CST
p_and                      3,792 children and adolescents
p_tnl + p_anydate          Oct 1st 1987, Nov 5, December 21, 1998
COMPONENT DESIGN (MYPATTERNS)[2]
Patterns                   Matching Numerical Phrases
p_inl + p_btwfrm           between 1987 and 1997, in 2007 and 2008
p_inl + p_btwfrmd          from 200 to 300 miles, from 7.5 percent to 6.85 percent
p_date                     18 April 2008
p_tnl + p_days             this Monday, next Saturday, last Friday, Tuesday, Wednesday
p_centuary                 17th century, 17th-century
p_words + p_hyphenww       million-dollar home, six-bedroom home, thirty-three dollars
p_hyphennumnum             the 20-20 match, a 3-2 lead
p_in                       9 in 10 people, 1 in every 8 women
p_mids                     mid-1990s, the early 1990s, 1970s
p_months                   January, February, December, Jan, Feb, Sept, Dec
COMPONENT DESIGN (MYPATTERNS)[3]
Patterns                   Matching Numerical Phrases
p_words + p_numunit        33 USD, about 34 miles, 33,333 tons, 3.3 million dollars, one thing, 3.4 billion
p_words + p_per            $33 per day, about 100 miles per hour
p_words + p_percentinches  39%, 0.5-1%, about 90 %, 20"
p_ratio                    one of the five people, 89 percent of people, 3 out of 5 people
p_tty                      today, tomorrow, yesterday, noon
p_twmy                     this year, this month, next year, next month, last week, last year, last month
p_xbits                    1024KB, 8MB, 320GB, 1TB
p_words + p_yrange         In 1998-99, during 2000-09
SAMPLE SENTENCES[1]
Sentence
  Patterns
I have lost 33,000 dinars in 1998
  p_numunit, p_btwfrm
At just 12-years-old, he enrolled as a freshman at F.I.U. in Miami.
  p_age
The 20" iMac is cheaper at $1200 and it has a 320GB hard drive.
  p_percentinches, p_numunit, p_xbits
Volunteers bring in a heavy crane for work on a bridge last month.
  p_twmy
As for those who do not invest, around 40% say capitalism is better.
  p_percentinches
As of 7 January 2007, about 75 people have died and another 183 infected.
  p_date, p_numunit
SAMPLE SENTENCES[2]
Sentence
  Patterns
Approximately 1% of human sufferers die of the disease.
  p_percentinches
Current listings of 2,000 children and adults who are reported missing, including in-depth coverage of high-profile cases.
  p_and
38 of the 62 patients who provided blood samples tested positive.
  p_ratio
She became an exotic dancer at Scores in New York City in the mid-1990s.
  p_mids
Peterson's three capped the surge, giving New Orleans a 64-51 lead.
  p_numunit, p_hyphennumnum
PROBLEMS ENCOUNTERED
Determining the Patterns
  Many kinds of Numerical Phrases were found
  Patterns were designed so that each filters more than one kind of Numerical Phrase
Prioritizing the Patterns
  More than one pattern may match the same Numerical Phrase
  Patterns are prioritized to avoid clashes between them
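One possible way to prioritize patterns so that they do not clash is sketched below in Java: patterns are tried in priority order, and a span already claimed by a higher-priority pattern is skipped by later ones. This is an illustrative assumption about how such prioritization could work, not the project's documented mechanism. The pattern names echo the slides, but the regexes are simplified stand-ins that work on untagged text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternPrioritySketch {
    // Insertion order is the priority order: earlier entries win.
    // The regexes are simplified stand-ins, not the project's patterns.
    private static final Map<String, Pattern> PRIORITIZED = new LinkedHashMap<>();
    static {
        PRIORITIZED.put("p_hyphennumnum", Pattern.compile("\\d+-\\d+ [a-z]+"));
        PRIORITIZED.put("p_numunit", Pattern.compile("\\d+ [a-z]+"));
    }

    /** Labels each phrase with the highest-priority pattern that matches it. */
    static List<String> label(String text) {
        boolean[] claimed = new boolean[text.length()];
        List<String> results = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : PRIORITIZED.entrySet()) {
            Matcher m = entry.getValue().matcher(text);
            while (m.find()) {
                if (overlaps(claimed, m.start(), m.end())) {
                    continue; // a higher-priority pattern already owns part of this span
                }
                for (int i = m.start(); i < m.end(); i++) {
                    claimed[i] = true;
                }
                results.add(entry.getKey() + ": " + m.group());
            }
        }
        return results;
    }

    // True if any character in [start, end) was claimed by an earlier match.
    private static boolean overlaps(boolean[] claimed, int start, int end) {
        for (int i = start; i < end; i++) {
            if (claimed[i]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(label("a 64-51 lead with 12 seconds left"));
    }
}
```

In this sketch, "51 lead" also fits the lower-priority number-unit regex, but it is skipped because "64-51 lead" was claimed first. Results are listed in priority order rather than in order of appearance in the text.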
PROJECT EVALUATION[1]
Test Case      Main Functionality Tested                            Pass/Fail
Test Case 1    Application Functionality                            Pass
Test Case 2    POS Tagger Functionality                             Pass
Test Case 3    Numerical Phrase Extractor Functionality             Pass
Test Case 4    Number-Unit/Date Pattern Recognizer Functionality    Pass
PROJECT EVALUATION[2]
Phase    Expected Completion    Actual Completion
1        February 26, 2009      February 24, 2009
2        March 26, 2009         March 31, 2009
3        April 17, 2009         April 14, 2009
PROJECT EVALUATION[3]
Phase 2 took more time since implementation and testing were done simultaneously.
PROJECT EVALUATION[4]
More time was spent on coding and documentation.
PROJECT EVALUATION[5]
More time was spent on discussion since it was the initial phase.
PROJECT EVALUATION[6]
More time was spent on coding after gathering the requirements in the first phase.
PROJECT EVALUATION[7]
A lot of time was spent on documentation as per the ETDR standards.
FUTURE WORK
Adding more Patterns
  To filter more kinds of numerical phrases
Improving the Output Display
  By displaying the number and date phrases in different colors
  To make it more readable for the user
LESSONS LEARNED
Java
Tool Usage
  Java Eclipse IDE
Design Development
  MS Visio
SDLC
Documentation
PROTOTYPE DEMONSTRATION
KSNES Project
  Set up as a service on the CIS server
  A webpage is set up: http://viper.cis.ksu.edu:11603/numerical/
FINAL STEPS
Final Examination Ballot
Make necessary changes to the MSE Portfolio
Deliver the Portfolio
Questions??
Suggestions!!
THANK YOU