Identifying Relevant Temporal Expressions for Real-world Events



Nattiya Kanhabua1, Sara Romano2, Avaré Stewart1

1 L3S Research Center, Leibniz Universität Hannover, Germany

2 Dipartimento di Informatica e Sistemistica, University Federico II, Naples, Italy

Motivation

• Numerous works have shown the potential of using Twitter to infer the existence and magnitude of real-world events in real-time
– Earthquake [Sakaki et al., 2010]
– Influenza epidemics [Culotta, 2010; Lampos et al., 2011; Paul et al., 2011]

• In the medical domain, there has been a surge in detecting health related tweets for early warning
– Allow a rapid response from authorities

Health related tweets

• User status updates or news related to public health are common on Twitter
– I have the mumps...am I alone?

– my baby girl has a Gastroenteritis so great!! Please do not give it to meee

– #Cholera breaks out in #Dadaab refugee camp in #Kenya http://t.co/....

– As many as 16 people have been found infected with Anthrax in Shahjadpur upazila of the Sirajganj district in Bangladesh.

Extracting outbreak events

• Support a comparative, temporal analysis between Twitter and official sources
– World Health Organization1
– ProMED-mail2

[Kanhabua et al., 2012]

1 http://www.who.int
2 http://www.promedmail.org/

Problem statement

• How to extract real-world events from unstructured text documents?
– Previous work finds interesting times for an event, but does not determine the relevance of those times

• How to determine the relevance of temporal expressions for extracted events?
– Not all temporal expressions associated with an event are equally relevant

Related work

• Extract temporal expressions from unstructured text using time and event recognition algorithms
– [Verhagen et al., 2005; Strötgen et al., 2012]

• Harvest temporal knowledge from semi-structured contents like Wikipedia infoboxes
– [Hoffart et al., 2012]

Contributions

• An approach to extracting real-world events automatically from unstructured texts

• A machine learning approach to identifying relevant temporal expressions
– Three classes of features for learning relevance

• Experiments on real-world data and 3,500 manually judged relevance pairs

System architecture

• Extract events in a pipeline fashion

• Annotated documents
– named entities (diseases, victims and locations)
– temporal expressions
– a set of sentences

• Event e: (v, m, l, te)
– who (victim v) was infected
– what (disease m) causes
– where (location l)
– when (time te)

[Figure: pipeline diagram with the components Unstructured text collection, Text Annotation (Sentence Extraction, Tokenization, Part-of-speech Tagging, Temporal Expression Extraction, Named Entity Recognition), Annotated Documents, Event Extraction, Identifying Relevant Time, Event Aggregation, Event Profiles, and User browsing/retrieving]
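The event tuple e = (v, m, l, te) can be pictured as a small record type. The following is only a minimal sketch: the class name, the field names, and the use of a date string for te are assumptions made for illustration and are not taken from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One outbreak event e = (v, m, l, te); field names are hypothetical."""
    victim: Optional[str]      # v: who was infected (may be unknown)
    disease: str               # m: the medical condition
    location: str              # l: where the event happened
    event_time: Optional[str]  # te: when, here an assumed "YYYY-MM" string


# Illustrative values only, loosely based on the Ebola example in the deck.
e = Event(victim=None, disease="Ebola", location="Uganda", event_time="2012-07")
```

Making the victim and the time optional reflects the slides' statement that only the medical condition and the location are required for an event.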

Two time aspects

1. Publication time
2. Content or event time

[Figure: example with the labels "content time" and "publication time"]

Event extraction

• An event is a sentence containing two entities
– (1) medical condition and (2) geographic expression
– A minimum requirement by domain experts

• A victim and the time of an event can be identified from the sentence itself, or its surrounding context

• Output: a set of event candidates

Reported by World Health Organization (WHO) on 29 July 2012 about an ongoing Ebola outbreak in Uganda since the beginning of July 2012
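The minimum requirement for an event candidate (a sentence mentioning both a medical condition and a geographic expression) can be sketched as a simple filter. This is an illustration under assumptions: the dictionary layout of an annotated sentence and the function name `extract_candidates` are hypothetical, and the entity lists would in practice come from the named entity recognizer.

```python
from typing import Dict, List

# Hypothetical annotated-sentence layout: the text plus entity mentions by type.
Sentence = Dict[str, object]


def extract_candidates(sentences: List[Sentence]) -> List[Sentence]:
    """Keep sentences with at least one disease and one location mention."""
    candidates = []
    for s in sentences:
        if s["diseases"] and s["locations"]:
            candidates.append(s)
    return candidates


# Toy input: only the first sentence satisfies the two-entity requirement.
doc = [
    {"text": "WHO reported an ongoing Ebola outbreak in Uganda.",
     "diseases": ["Ebola"], "locations": ["Uganda"]},
    {"text": "Further updates will follow.",
     "diseases": [], "locations": []},
]
candidates = extract_candidates(doc)
```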

Identifying relevant time

• The task of identifying relevant time is regarded as a classification problem
– Two classes: (1) relevant and (2) irrelevant

• Definition: a temporal expression is relevant if it refers to the starting, ending or ongoing time of the event

• Learn relevance using three classes of features
– Sentence-based features
– Document-based features
– Corpus-specific features

Features

• Sentence-based
– senLen, senPos, isContext, cntEntityInS, cntTExpInS, cntTPointInS, cntTPeriodInS, entityPos, entityPosDist, TExpPos, TExpPosDist, timeDist, entityTExpPosDist

• Document-based
– cntEntityInD, cntEntitySen, cntTExpInD, cntTPointInD, cntTPeriodInD

• Domain-specific
– isNeg, isHistory
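The slides list feature names without defining them. As a rough sketch only, a few of the sentence-based features might be computed as below; every definition here (token count for senLen, 0-based sentence index for senPos, mention counts for cntEntityInS and cntTExpInS) is a guess from the names, and the function `sentence_features` is hypothetical.

```python
from typing import Dict, List


def sentence_features(tokens: List[str], sen_index: int,
                      entities: List[str], texps: List[str]) -> Dict[str, int]:
    """Compute a few sentence-based features under assumed definitions."""
    return {
        "senLen": len(tokens),          # assumed: sentence length in tokens
        "senPos": sen_index,            # assumed: 0-based position in the document
        "cntEntityInS": len(entities),  # assumed: entity mentions in the sentence
        "cntTExpInS": len(texps),       # assumed: temporal expressions in the sentence
    }


# Toy sentence with whitespace tokenization (a real pipeline would use a tokenizer).
tokens = "Ebola outbreak in Uganda since the beginning of July 2012".split()
feats = sentence_features(tokens, 0, ["Ebola", "Uganda"], ["beginning of July 2012"])
```

Each (event, temporal expression) pair would yield one such feature dictionary, which a classifier then labels as relevant or irrelevant.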

Experiments

• Settings
– Official outbreak reports posted during the year 2011

– The number of documents and sentences
• ProMED-mail: 2,977 documents and 95,465 sentences
• WHO: 59 documents and 761 sentences

– Series of NLP tools including
• OpenNLP (tokenization, sentence splitting, POS tagging)
• OpenCalais (named entity recognition)
• HeidelTime (temporal expression extraction)

– Our dataset: 25 infectious diseases (medical conditions) manually selected by medical professionals

Results

• Baseline: majority class with accuracy of 0.58

• Decision tree (J48) is the best among the classification algorithms

• Sentence-based features improved the accuracy of the baseline significantly

• senLen and entityPosDist perform best (accuracy = 0.65)

• The combination of different features gained high accuracy
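A majority-class baseline always predicts the most frequent label, so its accuracy equals that label's share of the data. The sketch below shows the computation on made-up labels chosen to reproduce the reported 0.58; the slides do not say which class is the majority, so treating "relevant" as the majority here is purely an assumption.

```python
from collections import Counter
from typing import List, Tuple


def majority_baseline(labels: List[str]) -> Tuple[str, float]:
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Made-up label distribution (assumption): 58% of pairs in the majority class.
labels = ["relevant"] * 58 + ["irrelevant"] * 42
label, acc = majority_baseline(labels)
```

Any learned classifier, such as the decision tree reported above, has to exceed this value to show that the features carry signal.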

Summary

• An approach to extracting real-world events automatically from unstructured texts

• A machine learning approach to identifying the relevant temporal expressions
– Three classes of features for learning relevance

• Experiments on real-world data and 3,500 manually judged relevance pairs

References

• [Culotta, 2010] A. Culotta. Towards detecting influenza epidemics by analyzing twitter messages. In Proceedings of the First Workshop on Social Media Analytics (SOMA’2010), 2010.

• [Hoffart et al., 2012] J. Hoffart, F. Suchanek, K. Berberich, and G. Weikum. Yago2: A spatially and temporally enhanced knowledge base from wikipedia. Artificial Intelligence Journal, Special Issue on Wikipedia and Semi-Structured Resources, 2012.

• [Kanhabua et al., 2012] N. Kanhabua, S. Romano, A. Stewart, and W. Nejdl. Supporting Temporal Analytics for Health Related Events in Microblogs. In Proceedings of CIKM'2012, 2012.

• [Lampos et al., 2011] V. Lampos and N. Cristianini. Nowcasting events from the social web with statistical learning. ACM TIST, 3, 2011.

• [Paul et al., 2011] M. J. Paul and M. Dredze. You are what you tweet: Analyzing twitter for public health. In Proceedings of ICWSM’2011, 2011.

• [Sakaki et al., 2010] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes twitter users: real-time event detection by social sensors. In Proceedings of WWW’2010, 2010.

• [Strötgen et al., 2012] J. Strötgen, O. Alonso, and M. Gertz. Identification of top relevant temporal expressions in documents. In Proceedings of the 2nd Temporal Web Analytics Workshop (TempWeb02), 2012.

• [Strötgen et al., 2010] J. Strötgen and M. Gertz. Heideltime: High quality rule-based extraction and normalization of temporal expressions. In Proceedings of the 5th International Workshop on Semantic Evaluation, 2010.

• [Verhagen et al., 2005] M. Verhagen, I. Mani, R. Sauri, J. Littman, R. Knippen, S. B. Jang, A. Rumshisky, J. Phillips, and J. Pustejovsky. Automating temporal annotation with TARSQI. In Proceedings of ACL’2005, 2005.
