Transcript
Page 1: Text Classification and Named Entities for New Event Detection

Text Classification and Named Entities for New Event Detection

Giridhar Kumaran and James Allan

University of Massachusetts Amherst

SIGIR 2004

Page 2: Text Classification and Named Entities for New Event Detection

Introduction

New Event Detection (NED) is one of the tasks in the Topic Detection and Tracking (TDT) program (http://www.nist.gov/speech/tests/tdt/index.htm).
The vector space model has achieved the best results to date.
Better similarity metrics and document representations.

Page 3: Text Classification and Named Entities for New Event Detection

Previous Research

Increasing the number of features.
Weight event-level features more heavily than more general topic-level features.
Lexical chains (using WordNet).
NED and tracking system.
Named entities re-weighted and stop list created for each topic.
Incremental TF-IDF.

Page 4: Text Classification and Named Entities for New Event Detection

NED Evaluation

The NED algorithm assigns each story a confidence score between 0 and 1, either immediately or after a look-ahead.
0 means new, 1 means old.
Define the threshold that results in the least cost.
A Detection Error Tradeoff (DET) curve is used to represent misses and false alarms.
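The threshold selection described above can be sketched as a sweep over candidate thresholds. This is a minimal sketch, not the evaluation software used in TDT: the cost constants `C_MISS`, `C_FA`, and `P_TARGET` are assumed values that the slide does not state, and the function names are hypothetical.

```python
from typing import List, Tuple

# ASSUMPTION: cost parameters are illustrative; the slide does not give them.
C_MISS = 1.0
C_FA = 0.1
P_TARGET = 0.02


def detection_cost(p_miss: float, p_fa: float) -> float:
    """Weighted combination of miss rate and false-alarm rate."""
    return C_MISS * p_miss * P_TARGET + C_FA * p_fa * (1.0 - P_TARGET)


def error_rates(scores: List[float], is_new: List[bool], threshold: float) -> Tuple[float, float]:
    """A story is declared new when its score is below the threshold (0 = new, 1 = old)."""
    n_new = sum(is_new)
    n_old = len(is_new) - n_new
    misses = sum(1 for s, new in zip(scores, is_new) if new and s >= threshold)
    false_alarms = sum(1 for s, new in zip(scores, is_new) if not new and s < threshold)
    p_miss = misses / n_new if n_new else 0.0
    p_fa = false_alarms / n_old if n_old else 0.0
    return p_miss, p_fa


def best_threshold(scores: List[float], is_new: List[bool]) -> Tuple[float, float]:
    """Try every observed score as a threshold and keep the cheapest one."""
    candidates = sorted(set(scores)) + [1.0 + 1e-9]
    best = min(candidates, key=lambda t: detection_cost(*error_rates(scores, is_new, t)))
    return best, detection_cost(*error_rates(scores, is_new, best))
```

The `(p_miss, p_fa)` pairs produced during the sweep are the points a DET curve would plot.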

Page 5: Text Classification and Named Entities for New Event Detection

Basic Model

Cosine similarity
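A minimal sketch of a cosine-based baseline, assuming each story is compared with every earlier story and scored by its closest match. Raw term counts stand in for the TF-IDF weights a real system would use, and `ned_scores` is a hypothetical name.

```python
import math
from collections import Counter
from typing import Dict, List


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ned_scores(stories: List[List[str]]) -> List[float]:
    """Score each story by its highest similarity to any earlier story.

    A low score suggests a new event; a high score suggests an old one.
    """
    seen: List[Dict[str, float]] = []
    scores: List[float] = []
    for tokens in stories:
        vec = dict(Counter(tokens))  # ASSUMPTION: raw counts instead of TF-IDF
        scores.append(max((cosine(vec, old) for old in seen), default=0.0))
        seen.append(vec)
    return scores
```

The first story in a stream has nothing to be compared with, so it receives a score of 0 and is treated as new.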

Page 6: Text Classification and Named Entities for New Event Detection

Modified Model

Cosine similarity is good, but it makes mistakes.
The level of interest within a hierarchy of events matters.
Look into other parameters such as the category, the overlap of named entities, and the overlap of non-named entities.
Develop simple rules that reflect the questions a human being would ask before deciding whether a story is new or old.

Page 7: Text Classification and Named Entities for New Event Detection

Modification to document model

Terms: health care (drugs, cost, coverage, plan, prescription, ...) vs. locations and individuals.
Solution: first place stories into broad categories, then compute term weights.
Use the topic types specified by the LDC.
Classify stories according to LDC topics.
Train on TDT2, test on TDT3.
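The idea of computing term weights after categorization can be sketched as one IDF table per category. This is a minimal sketch under assumptions: the category labels are taken as given (standing in for the output of a classifier over LDC topic types), and the plain `log(N / df)` formula is illustrative rather than the weighting used in the paper.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def category_idf(docs: Iterable[Tuple[str, List[str]]]) -> Dict[str, Dict[str, float]]:
    """Compute a separate IDF table for each category.

    docs yields (category, tokens) pairs.
    """
    doc_count: Dict[str, int] = defaultdict(int)
    doc_freq: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for category, tokens in docs:
        doc_count[category] += 1
        for term in set(tokens):  # count each term once per story
            doc_freq[category][term] += 1
    # ASSUMPTION: unsmoothed log(N / df); a term in every story of a category gets weight 0.
    return {
        cat: {term: math.log(doc_count[cat] / df) for term, df in freqs.items()}
        for cat, freqs in doc_freq.items()
    }
```

With this weighting, a term such as "drugs" that appears in every health care story carries no weight inside that category, while it can still be distinctive in a category where it is rare.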

Page 8: Text Classification and Named Entities for New Event Detection

Modification to Similarity Metric

Isolate the named entities and treat them preferentially (nothing new).
Named entities are a double-edged sword: deciding when to use them can be tricky.

Page 9: Text Classification and Named Entities for New Event Detection

Multiple document representations

Alpha: all terms
Beta: only named entities
Gamma: non-named entity terms
Named entity types: Event, GPE (Geographical and Political Entities), Language, Location, Nationality, Organization, Person, Cardinal, Ordinal, Date, and Time.
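The three representations can be sketched as three term-count vectors per story, each compared with cosine similarity. This is a minimal sketch: the set of named entities is assumed to come from an external tagger (passed in here as `entities`), and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def split_representations(tokens: List[str], entities: Set[str]) -> Dict[str, Counter]:
    """Build the alpha, beta, and gamma vectors for one story."""
    return {
        "alpha": Counter(tokens),                                  # all terms
        "beta": Counter(t for t in tokens if t in entities),       # named entities only
        "gamma": Counter(t for t in tokens if t not in entities),  # everything else
    }


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def three_scores(reps_a: Dict[str, Counter], reps_b: Dict[str, Counter]) -> Dict[str, float]:
    """One similarity per representation for a pair of stories."""
    return {name: cosine(reps_a[name], reps_b[name]) for name in ("alpha", "beta", "gamma")}
```

Two stories about different visits by the same person would share part of their beta vector and most of their gamma vector, which is the kind of case where the three scores disagree.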

Page 10: Text Classification and Named Entities for New Event Detection

Election News

Gamma is less than 0.2, while beta is spread out (2 graphs): use alpha + gamma.

Page 11: Text Classification and Named Entities for New Event Detection

Legal/Criminal Cases

Gamma is below 0.4 and beta is above 0.4: use beta + alpha.

Page 12: Text Classification and Named Entities for New Event Detection

Financial News

Neither beta nor gamma is decisive: use alpha only.
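The per-category choices on the last three slides can be sketched as a lookup from category to the representations consulted. The slides say which scores are used but not how they are combined, so the plain average below is a placeholder assumption, as is the fallback to alpha for categories not listed; it is not the rule set from the paper.

```python
from typing import Dict, Tuple

# Which representations each category consults, as listed on the slides.
CATEGORY_REPRESENTATIONS: Dict[str, Tuple[str, ...]] = {
    "Election News": ("alpha", "gamma"),
    "Legal/Criminal Cases": ("alpha", "beta"),
    "Financial News": ("alpha",),
}


def combined_score(category: str, scores: Dict[str, float]) -> float:
    """Combine the similarity scores selected for a category.

    ASSUMPTION: a plain average of the selected scores, and alpha alone
    for unknown categories. The slides do not state the actual rule.
    """
    names = CATEGORY_REPRESENTATIONS.get(category, ("alpha",))
    return sum(scores[name] for name in names) / len(names)
```

For example, `combined_score("Financial News", {"alpha": 0.6, "beta": 0.8, "gamma": 0.2})` ignores beta and gamma and uses only the alpha score.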

Page 13: Text Classification and Named Entities for New Event Detection

Term scores and categories

(Table 4)

Page 14: Text Classification and Named Entities for New Event Detection

Experimental Results

The results seem to be worse on TDT4.
TDT4 may contain topics not conducive to named entity-based modification.

Page 15: Text Classification and Named Entities for New Event Detection

DET Curve of TDT3

Focus on the high accuracy area.

Page 16: Text Classification and Named Entities for New Event Detection

DET Curve of TDT4

Page 17: Text Classification and Named Entities for New Event Detection

Conclusion and Future Work

Present a new multi-stage system for NED.
Show a way to harness the named entities in documents, and illustrate their utility in different situations.
Improve named entity rules.
Different ways to develop stop lists for different categories.
Temporal information.

