Text Classification and Named Entities for New Event Detection

Giridhar Kumaran and James Allan

University of Massachusetts Amherst

SIGIR 2004

Introduction
New Event Detection (NED) is one of the tasks in the TDT program (http://www.nist.gov/speech/tests/tdt/index.htm).
The vector space model has achieved the best results to date.
This work looks for improvements through better similarity metrics and document representations.

Previous Research
Increasing the number of features.
Weighting event-level features more heavily than more general topic-level features.
Lexical chains (using WordNet).
NED and tracking system.
Named entities re-weighted and a stop list created for each topic.
Incremental TF-IDF.

NED Evaluation
The NED algorithm assigns each story a confidence score between 0 and 1, either immediately or after a look-ahead.
0 means new, 1 means old.
The threshold is set where it results in the least cost.
A Detection Error Tradeoff (DET) curve is used to show misses against false alarms.
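The threshold choice above can be sketched as a small search over candidate thresholds. This is a minimal illustration, not the evaluation software itself: the cost is assumed to be the usual weighted sum of miss and false alarm rates, and the constants `c_miss`, `c_fa`, and `p_target` are the values commonly quoted for TDT evaluations rather than anything stated on these slides. The function names are hypothetical.

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Weighted detection cost; the default constants are assumptions."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


def best_threshold(scores, is_new, thresholds):
    """Return (threshold, cost) with the least cost.

    scores: confidence per story (0 = new, 1 = old).
    is_new: True where the story really is the first on its event.
    A story is flagged as new when its score is below the threshold.
    """
    n_new = sum(is_new)
    n_old = len(is_new) - n_new
    best = None
    for th in thresholds:
        # Miss: a truly new story that was not flagged as new.
        misses = sum(1 for s, new in zip(scores, is_new) if new and s >= th)
        # False alarm: an old story that was flagged as new.
        false_alarms = sum(1 for s, new in zip(scores, is_new) if not new and s < th)
        p_miss = misses / n_new if n_new else 0.0
        p_fa = false_alarms / n_old if n_old else 0.0
        cost = detection_cost(p_miss, p_fa)
        if best is None or cost < best[1]:
            best = (th, cost)
    return best
```

Sweeping the same thresholds and recording each (miss rate, false alarm rate) pair gives the points that a DET curve plots.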

Basic Model
Cosine similarity
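As a rough sketch of the basic model, a story can be compared with every earlier story and the highest cosine similarity used as the confidence score (near 0 suggests new, near 1 suggests old). Raw term counts stand in for the term weights here, which is a simplification; the function names are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def ned_confidence(story_tokens, earlier_stories):
    """Highest similarity to any earlier story; 0.0 when there is no history."""
    vec = Counter(story_tokens)
    return max((cosine(vec, Counter(old)) for old in earlier_stories), default=0.0)
```

A story whose best match is still weak gets a low score and is a candidate first story for a new event.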

Modified Model
Cosine similarity is good, but it makes mistakes.
Which level of a hierarchy of events is of interest matters.
Other parameters are examined, such as the category, the overlap of named entities, and the overlap of non-named entities.
Simple rules are developed that reflect the questions a human being would ask before deciding whether a story is new or old.

Modification to document model

Terms: for health care, drugs, cost, coverage, plan, prescription, ... vs. locations and individuals.
Solution: first place stories into broad categories, then compute term weights.
Use the topic types specified by the LDC.
Classify stories according to LDC topics.
Train on TDT2, test on TDT3.

Modification to Similarity Metric

Isolate the named entities and treat them preferentially (nothing new).
Named entities are a double-edged sword; deciding when to use them can be tricky.

Multiple document representations

Alpha: all terms
Beta: only named entities
Gamma: non-named-entity terms
Named entity types: Event, GPE (Geographical and Political Entities), Language, Location, Nationality, Organization, Person, Cardinal, Ordinal, Date, and Time.
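The three representations can be sketched as three term-count vectors built from one tagged story. The input format, a list of `(token, entity_type)` pairs with `None` for non-entities, is an assumption made for illustration; the slides do not say how the named entity tagger reports its output, and `split_representations` is a hypothetical name.

```python
from collections import Counter


def split_representations(tagged_tokens):
    """Build the alpha, beta, and gamma vectors for one story.

    tagged_tokens: list of (token, entity_type) pairs, where entity_type is
    a label such as "Person" or "GPE", or None for a non-entity term.
    """
    alpha = Counter(tok for tok, _ in tagged_tokens)                     # all terms
    beta = Counter(tok for tok, ent in tagged_tokens if ent is not None)  # named entities only
    gamma = Counter(tok for tok, ent in tagged_tokens if ent is None)     # non-named-entity terms
    return alpha, beta, gamma
```

Each vector can then be compared with the matching vector of an earlier story, giving separate alpha, beta, and gamma similarity scores for the same story pair.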

Election News
Gamma is less than 0.2, while beta spreads out (2 graphs): use alpha + gamma.

Legal/Criminal Cases
Gamma is below 0.4 and beta is above 0.4: use beta + alpha.

Financial News
Neither beta nor gamma is decisive: use alpha only.
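The per-category choices on the three slides above amount to a lookup from category to the scores worth consulting. The sketch below only encodes that selection; it does not reproduce the decision rules themselves, which these slides do not spell out. Falling back to alpha for any other category is an assumption, and `scores_to_consult` is a hypothetical name.

```python
# Which similarity scores to consult, per category, as stated on the slides.
SCORES_BY_CATEGORY = {
    "Election News": ("alpha", "gamma"),
    "Legal/Criminal Cases": ("alpha", "beta"),
    "Financial News": ("alpha",),
}


def scores_to_consult(category, scores):
    """Pick the scores relevant to a category.

    scores: dict with keys "alpha", "beta", "gamma".
    Unlisted categories fall back to alpha only (an assumption).
    """
    names = SCORES_BY_CATEGORY.get(category, ("alpha",))
    return {name: scores[name] for name in names}
```

For example, a financial story pair would be judged on its alpha score alone, while an election story pair would also bring in gamma.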

Term scores and categories

(Table 4)

Experimental Results
The results appear to be worse on TDT4.
TDT4 may contain topics that are not conducive to named entity-based modification.

DET Curve of TDT3
Focus on the high accuracy area.

DET Curve of TDT4

Conclusion and Future Work

Presented a new multi-stage system for NED.
Showed a way to harness the named entities in documents, and illustrated their utility in different situations.
Future work:
Improve the named entity rules.
Explore different ways to develop stop lists for different categories.
Use temporal information.