Data Association for Topic Intensity Tracking


Andreas Krause, Jure Leskovec, Carlos Guestrin

School of Computer Science, Carnegie Mellon University

Document classification

Two topics: Conference and Hiking

“Will you go to ICML too?”  →  P(C | words) = .9  →  Conference
“Let’s go hiking on Friday!”  →  P(C | words) = .1  →  Hiking

A more difficult example

Two topics: Conference and Hiking. What if we had temporal information? How about modeling emails as an HMM?

“Let’s have dinner after the talk.” (2:00 pm)  →  P(C | words) = .7
“Should we go on Friday?” (2:03 pm)  →  P(C | words) = .5  →  could refer to both topics!

[HMM diagram: topic chain C1 → C2 → … → Ct → Ct+1, with each topic Ct emitting a document Dt]

Assumes equal time steps and “smooth” topic changes. Are these valid assumptions?

Typical email traffic

Email traffic is very bursty and cannot be modeled with uniform time steps! Bursts tell us how intensely a topic is pursued.

Bursts are potentially very interesting!

[Plot: number of emails vs. time (days), showing highly bursty email arrivals]

Identifying both topics and bursts

Given:

A stream of documents (emails):

d1, d2, d3, …

and corresponding document inter-arrival times (time between consecutive documents):

Δ1, Δ2, Δ3, ...

Simultaneously:
- Classify (or cluster) documents into K topics
- Predict the topic intensities, i.e., the time between consecutive documents from the same topic

Data association problem

If we know the email topics, we can identify bursts. If we don’t know the topics, we can’t identify bursts!

Naïve solution: first classify documents, then identify bursts. Can fail badly!

This paper: simultaneously identify topics and bursts!

[Timeline sketch: one period with high intensity for “Conference” and low intensity for “Hiking”, another with low intensity for “Conference” and high intensity for “Hiking”; in between, both intensities are unknown]

The Task

Have to solve a data association problem. We observe:
- Message deltas: time between the arrivals of consecutive documents
We want to estimate:
- Topic deltas: time between messages of the same topic
We can then compute the topic intensity L = 1 / E[topic delta].

Chicken-and-egg problem: we need the topics to identify the intensity, and we need the intensity to classify (better).

How to reason about topic deltas?

Associate with each email a timestamp vector [τ(C), τ(H)] of the arrival times of the next email from each topic:

Email 1 (Conference, at 2:00 pm):  τ1 = [C: 2:00 pm, H: 2:30 pm]
Email 2 (Hiking, at 2:30 pm):      τ2 = [C: 4:15 pm, H: 2:30 pm]
Email 3 (Conference, at 4:15 pm):  τ3 = [C: 4:15 pm, H: 7:30 pm]

Δ(Message) = 30 min (between consecutive messages)
Δ(Topic) = 2 h 15 min (between consecutive messages of the same topic)
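To make this bookkeeping concrete, here is a minimal Python sketch (my illustration, not the authors’ code) that recovers the message deltas, topic deltas, and the implied intensity estimate from the three-email example above:

```python
from collections import defaultdict

# Arrival times in minutes after 2:00 pm, with topic labels (from the slide)
stream = [(0, "Conference"), (30, "Hiking"), (135, "Conference")]

# Message deltas: time between consecutive messages, regardless of topic
message_deltas = [t2 - t1 for (t1, _), (t2, _) in zip(stream, stream[1:])]
print(message_deltas)  # [30, 105] -> first gap is the 30 min from the slide

# Topic deltas: time between consecutive messages of the same topic
topic_deltas = defaultdict(list)
last_arrival = {}
for t, topic in stream:
    if topic in last_arrival:
        topic_deltas[topic].append(t - last_arrival[topic])
    last_arrival[topic] = t
print(dict(topic_deltas))  # {'Conference': [135]} -> 2 h 15 min, as above

# Intensity estimate per topic: L = 1 / mean(topic deltas)
intensity = {k: 1.0 / (sum(v) / len(v)) for k, v in topic_deltas.items()}
print(intensity)  # {'Conference': ~0.0074} emails per minute
```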

Generative Model (conceptual)

[Graphical model: Markov chains over the intensities L(C)t-1 → L(C)t → L(C)t+1 and L(H)t-1 → L(H)t → L(H)t+1, a chain over the timestamp vectors τt-1 → τt → τt+1, and per step the topic indicator Ct, document Dt, and inter-arrival time Δt]

- τt = [τt(C), τt(H)]: time of the next email from each topic (exponentially distributed)
- Δt: time between subsequent emails
- Ct: topic indicator (e.g., Ct = “Conference”)
- Dt: document (e.g., bag of words)
- L(C)t: intensity for “Conference” (parameter of the exponential distribution)
- L(H)t: intensity for “Hiking” (parameter of the exponential distribution)
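As a rough illustration of this generative process, here is a sketch of a sampler (my own construction under the model as described above; the topic names come from the slides, the rates are made up):

```python
import random

# Hypothetical intensities: rate = 1 / (mean inter-arrival time in minutes)
L = {"Conference": 1 / 20.0, "Hiking": 1 / 120.0}

def sample_stream(n_msgs, seed=0):
    rng = random.Random(seed)
    now = 0.0
    # tau: next-arrival time for each topic, exponentially distributed
    tau = {k: now + rng.expovariate(rate) for k, rate in L.items()}
    stream = []
    for _ in range(n_msgs):
        topic = min(tau, key=tau.get)       # C_t: topic of earliest arrival
        delta = tau[topic] - now            # Delta_t: time between emails
        now = tau[topic]
        stream.append((now, topic, delta))  # D_t would be drawn given C_t
        tau[topic] = now + rng.expovariate(L[topic])  # schedule next arrival
    return stream

for t, topic, delta in sample_stream(5):
    print(f"{t:7.1f} min  {topic:10s}  Delta = {delta:5.1f} min")
```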

Problem:

Need to reason about the entire history of timestamp vectors τ! Makes inference intractable, even for few topics!

Key observation: if the topic deltas follow an exponential distribution, then

P(τt+1(C) > 4 pm | τt(C) = 2 pm, it’s now 3 pm) = P(τt+1(C) > 4 pm | τt(C) = 3 pm, it’s now 3 pm)

Exploit memorylessness to discard the timestamps τ.

The exponential distribution is appropriate:
- Used in previous work on document streams (e.g., Kleinberg ’03)
- Frequently used to model transition times
- When adding hidden variables, can model arbitrary transition distributions (cf. Nodelman et al.)
- The last arrival time is irrelevant!
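For reference, the memorylessness property being exploited here is a standard fact about the exponential distribution (written out by me, not taken from the slides): for X ~ Exp(λ),

```latex
\[
P(X > s + t \mid X > s)
  = \frac{P(X > s + t)}{P(X > s)}
  = \frac{e^{-\lambda (s + t)}}{e^{-\lambda s}}
  = e^{-\lambda t}
  = P(X > t).
\]
% So, given that no "Conference" email has arrived since 2 pm and it is now
% 3 pm, the remaining wait has the same distribution as a fresh wait started
% at 3 pm -- which is exactly why the timestamp vectors tau can be dropped.
```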


Implicit Data Association (IDA) Model

Key modeling trick: implicit data association (IDA) via exponential order statistics:

Δt | Lt = min { Exp(Lt(C)), Exp(Lt(H)) }
Ct | Lt = argmin { Exp(Lt(C)), Exp(Lt(H)) }

Simple closed form for these order statistics! A quite general modeling idea: it (essentially) turns the model into a Factorial HMM, so many efficient inference techniques are available!


[Graphical model of the IDA model: intensity chains L(C)t and L(H)t with topic indicator Ct, document Dt, and inter-arrival time Δt; no timestamp vectors τ]
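The closed form here is the standard one for exponential order statistics: for independent Xk ~ Exp(Lk), min_k Xk ~ Exp(Σk Lk) and P(argmin = k) = Lk / Σj Lj. A quick Monte Carlo sanity check (my illustration, with made-up intensities):

```python
import random

rng = random.Random(0)
L = {"C": 0.05, "H": 0.008}          # hypothetical topic intensities
total_rate = sum(L.values())

n, wins, min_sum = 100_000, {"C": 0, "H": 0}, 0.0
for _ in range(n):
    x = {k: rng.expovariate(rate) for k, rate in L.items()}
    wins[min(x, key=x.get)] += 1     # C_t: the topic achieving the minimum
    min_sum += min(x.values())       # Delta_t: the minimum itself

print("P(C_t = C):", wins["C"] / n, "  closed form:", L["C"] / total_rate)
print("E[Delta_t]:", min_sum / n, "  closed form:", 1 / total_rate)
```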

Inference Procedures

We consider:
- Full (conceptual) model: particle filter
- Simplified model: particle filter, fully factorized mean field, exact inference

Comparison to a Weighted Automaton Model (WAM) for single topics, proposed by Kleinberg (first classify, then identify bursts).
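To give a flavor of the particle-filter option, here is a deliberately simplified sketch (my own, not the paper’s implementation: it tracks a single topic’s intensity from its observed deltas, assuming a log-space random walk on the intensity; the paper’s filters also track the topic assignments):

```python
import math, random

rng = random.Random(1)

def particle_filter(deltas, n_particles=500, drift=0.1):
    """Bootstrap filter over log-intensity; returns posterior-mean estimates."""
    log_l = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    estimates = []
    for d in deltas:
        # Propagate: random-walk transition on log L (assumed dynamics)
        log_l = [x + rng.gauss(0.0, drift) for x in log_l]
        # Weight by the exponential likelihood p(d | L) = L * exp(-L * d)
        w = [math.exp(x - math.exp(x) * d) for x in log_l]
        total = sum(w)
        w = [wi / total for wi in w]
        estimates.append(sum(wi * math.exp(x) for wi, x in zip(w, log_l)))
        # Multinomial resampling
        log_l = rng.choices(log_l, weights=w, k=n_particles)
    return estimates

# Synthetic deltas: a burst (mean 2) followed by a lull (mean 20)
deltas = [rng.expovariate(0.5) for _ in range(50)] + \
         [rng.expovariate(0.05) for _ in range(50)]
est = particle_filter(deltas)
print(est[45], est[95])  # roughly 0.5 during the burst, 0.05 during the lull
```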

Results (synthetic data)

Periodic message arrivals (uninformative Δ) with noisy class assignments: ABBBABABABBB…

[Plot: topic delta vs. message number under misclassification noise, comparing the true topic deltas against the particle filter (full model), exact inference, and the weighted automaton (first classify, then bursts)]

Implicit Data Association gets both topics and frequencies right, despite severe (30%) label noise.

The memorylessness trick doesn’t hurt.

Separate topic and burst identification fails badly.

Inference comparison (synthetic data)

Two topics with different frequency patterns.

[Plot: topic delta vs. message number, showing the true topic deltas, the observed message deltas, and the estimates from exact inference, the particle filter, and mean field]

Implicit Data Association identifies the true frequency parameters (it does not get distracted by the observed message deltas Δ).

In addition to exact inference (for few topics), several approximate inference techniques perform well.

Experiments on real document streams

ENRON email corpus:
- 517,431 emails from 151 employees
- Selected 554 messages from the tech-memos and universities folders of Kaminski
- Stream between December 1999 and May 2001

Reuters news archive:
- Contains 810,000 news articles
- Selected 2,303 documents from four topics: wholesale prices, environment issues, fashion, and obituaries

Intensity identification for Enron data

[Plot: topic delta vs. message number for the Enron stream, comparing the true topic deltas against the WAM and IDA-IT estimates]

Implicit Data Association identifies bursts that are missed by the Weighted Automaton Model (the separate approach).

Reuters news archive

Again, simultaneous topic and burst identification outperforms the separate approach.

[Two plots: topic delta vs. message number for Reuters topics, comparing the true topic deltas against the WAM and IDA-IT estimates]

What about classification?

Temporal modeling effectively changes the class prior over time. Impact on classification accuracy?

Classification performance

Modeling intensity leads to improved classification accuracy.

[Bar chart: classification accuracy of Naïve Bayes vs. the IDA model]

Generalizations

Learning paradigms: not just the supervised setting, but also:
- Unsupervised / semi-supervised learning
- Active learning (select the most informative labels)
See paper for details.

Other document representations. Other applications:
- Fault detection
- Activity recognition
- …

Topic tracking

[Graphical model: intensity chains L(C)t-1 → L(C)t → L(C)t+1 and L(H)t-1 → L(H)t → L(H)t+1, a chain over topic parameters θt-1 → θt → θt+1, and per step the topic indicator Ct, document Dt, and inter-arrival time Δt]

- Δt: time between subsequent emails
- Ct: topic indicator (e.g., Ct = “Conference”)
- Dt: document representation (LSI)
- L(C)t, L(H)t: intensities for “Conference” and “Hiking”
- θt: topic parameters (mean of the LSI representation); θt tracks the topic means (Kalman filter)
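A one-dimensional sketch of that Kalman-filter update (my illustration only; the process variance q and observation variance r are assumed, and the paper tracks full LSI vectors rather than scalars):

```python
def kalman_step(mean, var, obs, q=0.01, r=1.0):
    """One predict/update step for a topic mean theta_t under a random walk.

    q: assumed drift (process) variance; r: assumed observation variance."""
    var += q                         # predict: theta_t = theta_{t-1} + noise
    gain = var / (var + r)           # Kalman gain
    mean += gain * (obs - mean)      # update toward the observed coordinate
    var *= (1.0 - gain)              # posterior variance shrinks
    return mean, var

mean, var = 0.0, 1.0
for obs in [0.9, 1.1, 1.0, 2.9, 3.1]:   # the topic mean shifts partway in
    mean, var = kalman_step(mean, var, obs)
    print(round(mean, 2))                # tracks ~1.0, then moves toward ~3.0
```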

Conclusion

- A general model for data association in data streams
- A principled model for “changing class priors” over time
- Can be used in supervised, unsupervised, and (semi-supervised) active learning settings
- Surprising performance of the simplified IDA model
- Exponential order statistics enable implicit data association and tractable exact inference
- Synergetic effect between intensity estimation and classification on several real-world data sets
