Searching in Harsh Environments

Ophir Frieder

Computer Science Dept. | Georgetown University & Biostatistics, Bioinformatics, & Biomathematics | Georgetown University Medical Center

[email protected]

March 2016

MIT Program on Information Science Talk -- Ophir Frieder on Searching in Harsh Environments


Correcting the Search Myth

If it’s search, then Google solved it!

Some of what Google solved

Was solved by others first

Google’s focus is computerized data,

Much data are not digitized

Google is hardly a key social media player

Social media data are everywhere


Diverse Search Applications

Complex Document Information Processing

The whole is greater than the sum of its parts

Searching is easy

Unless it is in adverse (misspelled) environments

Social Media Search & Surveillance

Detecting outbreaks in their infancy


Talk Outline

Engineering Research

Searching & Mining Social Media

Searching in Adverse Conditions

Complex Document Information Processing

Computer Science

Documents are Complex!


Complex documents include handwritten notes, diagrams, graphics, printed or formatted text

Point solutions exist: OCR, Information Retrieval, Information Extraction, Image Processing, Text Clustering, Computational Stylistics, …

No definition of state-of-the-art for the integrated problem

Manual partitioning/collating: expensive, time-consuming, error-prone

Some are even more complex!

Existing Technology Point Solutions

Optical character recognition (OCR)

Document clustering and browsing

Document structure extraction

Extraction from tables/lists

Handwriting analysis and signature recognition

Figure caption identification and extraction

Conventional and image retrieval systems

Entity and relationship extraction

CDIP Processing Architecture

[Architecture diagram. Labeled components: Complex Document Images, Layers, OCR, Table Extraction, Logo Extraction, Signature Match, Doc Metadata, Text Extraction, Entity Tagging, CDIP Metadata Database, Analyst, Integrated Retrieval, Data Mining, Enhance]

Enhancement

[Four image-only slides; images not reproduced]

Integration Helps

Without Logos: At which institution?

Without Text: What positions do I hold?

Ophir Frieder

McDevitt Prof. of Comp. Sci. & Inf. Proc. & Prof. of Biostatistics, Bioinformatics, & Biomathematics

Technology comes and goes

but….

Benchmarks (Collections) are ever (forever) lasting


Test Collection Characteristics

Cover the richness of inputs

Range of formats, lengths, & genres

Variance in print and image quality

Documents should include:

Handwritten text and notations

Diverse fonts

Graphical elements: graphs, tables, photos, logos, and diagrams

Test Collection Characteristics

Sufficiently high volume of documents

Vast volume of redundant & irrelevant documents

Support diverse applications

Include private communications within and between groups planning activities and deploying resources

Publicly available data!

Minimal cost

Minimal licensing

CDIP Test Collection

Data made public via legal proceedings

Master Settlement Agreement subset of UCSF Legacy Tobacco Document Library

Documents scanned by individual companies; hence scan quality varies widely

~ 7 million documents

~ 42 million scanned TIFF format pages (~ 1.5 TB)

~ 5 GB Metadata

~ 100 GB OCR

Dataset: https://ir.nist.gov/cdip/cdip-images/

The CDIP Test Collection (NIST TREC V1.0)

Used multiple years in TREC Legal Track

Records (62 GB) made available to TREC participants (through FTP/DVD)

40 queries simulating legal case investigations, with relevance judgments produced by 35 lawyers

Novel queries with relevance judgments generated by tobacco researchers

Evaluation

CDIP Benchmark data – as a novel text test collection for “live scenarios”

NIST TREC Legal Track, 2006 - 2009

Housed permanently at NIST

Complex Document search

Ground truth difficult

800-document hand-checked sub-collection

Preliminary Results

Completed:

Subset of 800 documents

Manually labelled authorship & organizational unit

Evaluated:

Authorship, organizational, monetary, date, and address-based retrieval tasks

Ongoing:

Subset of 20K documents

Open Problem:

Performance evaluation (measures) for larger sets

System Configuration Screen

[Screenshot not reproduced]

Query: ATC Logo + “income forecast” + > $500,000


Query: RJR Logo + “filtration efficiency” + signature


Query: Five signatures with the highest dollar total


Query: Associations of a given person (Dr. D. Stone)


Collaborators

Initial effort:

Gady Agam – Illinois Inst. of Tech.

Shlomo Argamon – Illinois Inst. of Tech.

David Doermann – Univ. of Maryland DARPA

David Grossman – Illinois Inst. of Tech. Grossman Lab

David D. Lewis – DDL Consulting

Sargur Srihari – SUNY Buffalo

Ongoing effort:

Gideon Frieder – George Washington Univ.

Jon Parker – Georgetown Univ. MITRE

References

S. Argamon, G. Agam, O. Frieder, D. Grossman, D. Lewis, G. Sohn, and K. Voorhees, “A Complex Document Information Processing Prototype,” ACM SIGIR, 2006.

D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, and J. Heard, “Building a Test Collection for Complex Document Information Processing,” ACM SIGIR, 2006.

G. Agam, S. Argamon, O. Frieder, D. Grossman, and D. Lewis, “Content-Based Document Image Retrieval in Complex Document Collections,” Document Recognition and Retrieval, 2007.

G. Bal, G. Agam, O. Frieder, and G. Frieder, “Interactive Degraded Document Enhancement and Ground Truth Generation,” Document Recognition and Retrieval, 2008.

T. Obafemi-Ajayi, G. Agam, and O. Frieder, “Historical Document Enhancement Using LUT Classification,” International Journal on Document Analysis and Recognition, 13(1), March 2010.

J. Parker, G. Frieder, and O. Frieder, “Automatic Enhancement and Binarization of Degraded Document Images,” International Conference on Document Analysis and Recognition, 2013.

J. Parker, G. Frieder, and O. Frieder, “Robust Binarization of Degraded Document Images using Heuristics,” Document Recognition and Retrieval XXI, San Francisco, California, February 2014.

Parker, et al., “System and Method for Enhancing the Legibility of Degraded Images,” US Patent #8,995,782, March 31, 2015.

Frieder, et al., “System and Method for Enhancing the Legibility of Images,” US Patent #9,269,126, February 23, 2016.

Talk Outline

Engineering Research

Searching & Mining Social Media

Searching in Adverse Conditions

Complex Document Information Processing

Computer Science

Spelling in Adverse Conditions

Foreign language (Yizkor Books)

User unfamiliar with character pronunciation

Multiple languages within a document

Domain specific (Medical)

Terms unfamiliar to the general audience


Yizkor Books

Yizkor = Hebrew word for “remember”

Firsthand accounts of events that preceded, took place during, and followed the Second World War

The books document destroyed communities and the people who perished

Started in the early 1940s; highest activity in the 1960s and 1970s

Published in 13 languages, across 6 continents

One of the largest collections resides at the USHMM

Access restricted due to limited number, fragile state, and prevention of destruction or theft

Traditional Access

User requested; archivist driven

Requires “complete” understanding of books

High human resource costs

Inefficient & slow

Often fails to obtain complete, if any, results


Metadata Search Access

Provides an intuitive search capability for apprehensive but interested users

Creates and queries collection metadata

Yizkor Interface

Centralized index

Global access

Efficient search

Accurate search

Multi-lingual spelling correction

Search Results


Spell Checker

Upon entering a misspelled query, users are presented with a ranked list of suggestions

Percentages represent similarity to original query as measured by our algorithms

Query Processing

Language independent string manipulation for auto-correction via a voting algorithm

Language Independent Correction

Simplistic Rules Work ! or ?

Replace first and last characters by a wild card, in succession;

Retain only first and last characters and insert a wild card;

Retain only first and last two characters and insert a wild card;

Replace middle n-characters by a wild card, in succession;

Replace first half by a wild card;

Replace second half by a wild card;
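
The six wild-card rules listed on this slide, combined with the voting idea from the Query Processing slide, might be sketched as follows. This is only an illustration under stated assumptions, not the talk's implementation: the function names (`wildcard_patterns`, `to_regex`, `suggest`), the `*` wild card, the reading of "middle n-characters, in succession" as a sliding window of n interior characters, and the one-vote-per-matching-pattern scheme are all hypothetical. Queries are assumed not to contain `*` themselves.

```python
import re
from collections import Counter

WILDCARD = "*"  # assumed wild-card symbol; matches any run of characters


def wildcard_patterns(query, n=2):
    """Generate wild-card patterns from a query using the six slide rules."""
    q = query
    patterns = set()
    if len(q) < 4:
        return patterns  # assumption: very short queries are left alone
    # Rule 1: replace first, then last, character by a wild card
    patterns.add(WILDCARD + q[1:])
    patterns.add(q[:-1] + WILDCARD)
    # Rule 2: retain only first and last characters
    patterns.add(q[0] + WILDCARD + q[-1])
    # Rule 3: retain only first two and last two characters
    patterns.add(q[:2] + WILDCARD + q[-2:])
    # Rule 4 (assumed reading): slide a window of n interior characters
    for i in range(1, len(q) - n):
        patterns.add(q[:i] + WILDCARD + q[i + n:])
    # Rules 5 and 6: replace first half, then second half
    half = len(q) // 2
    patterns.add(WILDCARD + q[half:])
    patterns.add(q[:half] + WILDCARD)
    return patterns


def to_regex(pattern):
    """Turn a wild-card pattern into a compiled regular expression."""
    parts = [re.escape(part) for part in pattern.split(WILDCARD)]
    return re.compile(".*".join(parts))


def suggest(query, lexicon, n=2, top_k=5):
    """Rank lexicon words by how many patterns match them (hypothetical voting)."""
    votes = Counter()
    for pattern in wildcard_patterns(query, n):
        regex = to_regex(pattern)
        for word in lexicon:
            if regex.fullmatch(word):
                votes[word] += 1
    return votes.most_common(top_k)
```

For example, with a toy lexicon `["warsaw", "warszawa", "krakow", "lublin"]`, the misspelled query `"warsow"` ranks `"warsaw"` first because more of its patterns match that entry than any other. Nothing here depends on the alphabet of the query, which is the sense in which such rules are language independent.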


Single Character Correction

Mitton 1996 – “Spellchecking by Computers”

Add Single Random Character

Method      Found (%)   Rank
D-M Sound   41.41       N/A
N-Gram      94.97       2.58
USHMM       100         1.71 / 1.71

Remove Single Random Character

Method      Found (%)   Rank
D-M Sound   41.96       N/A
N-Gram      93.40       3.46
USHMM       99.97       2.54 / 2.57

Replace Single Random Character

Method      Found (%)   Rank
D-M Sound   57.89       N/A
N-Gram      85.02       4.77
USHMM       97.97       3.75 / 3.00

Swap Random Adjacent Pair of Characters

Method      Found (%)   Rank
D-M Sound   31.45       N/A
N-Gram      92.06       3.24
USHMM       100         2.15 / 2.01
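
The four single-character error types named on this slide could be simulated roughly as below. This is a sketch under assumptions rather than the evaluation code behind the reported numbers: the lowercase ASCII alphabet, the function names, and the reading of "Found" as the percentage of corrupted terms whose original appears among the suggestions and "Rank" as the mean position of the original when found are all hypothetical.

```python
import random
import string

ALPHABET = string.ascii_lowercase  # assumed alphabet; the real collections are multilingual


def add_char(term, rng):
    """Insert one random character at a random position."""
    i = rng.randrange(len(term) + 1)
    return term[:i] + rng.choice(ALPHABET) + term[i:]


def remove_char(term, rng):
    """Delete one randomly chosen character."""
    i = rng.randrange(len(term))
    return term[:i] + term[i + 1:]


def replace_char(term, rng):
    """Replace one randomly chosen character with a different one."""
    i = rng.randrange(len(term))
    choices = [c for c in ALPHABET if c != term[i]]
    return term[:i] + rng.choice(choices) + term[i + 1:]


def swap_adjacent(term, rng):
    """Swap one random pair of adjacent characters (term needs length >= 2)."""
    i = rng.randrange(len(term) - 1)
    return term[:i] + term[i + 1] + term[i] + term[i + 2:]


def evaluate(suggest, terms, corrupt, rng):
    """Return (found %, mean rank) for a suggestion function (assumed metrics)."""
    ranks = []
    for term in terms:
        suggestions = suggest(corrupt(term, rng))
        if term in suggestions:
            ranks.append(suggestions.index(term) + 1)
    found_pct = 100.0 * len(ranks) / len(terms)
    mean_rank = sum(ranks) / len(ranks) if ranks else None
    return found_pct, mean_rank
```

Passing a seeded `random.Random` instance as `rng` keeps the corruption reproducible, so different correction methods can be compared on identical misspellings.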


Multiple Character Correction

Add Multiple Characters

Errors    Method     Found (%)   Rank
2 Chars   DM Sound   19.58       N/A
2 Chars   N-Gram     92.00       3.45
2 Chars   USHMM      99.38       2.55 / 2.42
3 Chars   DM Sound   10.69       N/A
3 Chars   N-Gram     87.91       4.20
3 Chars   USHMM      97.53       3.19 / 3.02
4 Chars   DM Sound   6.75        N/A
4 Chars   N-Gram     83.86       4.97
4 Chars   USHMM      95.04       3.87 / 3.80

Remove Multiple Characters

Errors    Method     Found (%)   Rank
2 Chars   DM Sound   20.47       N/A
2 Chars   N-Gram     84.79       4.78
2 Chars   USHMM      97.83       4.62 / 3.88
3 Chars   DM Sound   10.75       N/A
3 Chars   N-Gram     74.48       5.77
3 Chars   USHMM      92.73       6.41 / 4.80
4 Chars   DM Sound   9.70        N/A
4 Chars   N-Gram     69.98       6.04
4 Chars   USHMM      86.34       7.12 / 5.15

Multiple Character Correction

Replace Multiple Characters

Errors    Method     Found (%)   Rank
2 Chars   DM Sound   16.80       N/A
2 Chars   N-Gram     80.73       4.44
2 Chars   USHMM      93.88       4.19 / 3.33
3 Chars   DM Sound   9.11        N/A
3 Chars   N-Gram     69.23       5.15
3 Chars   USHMM      85.83       5.51 / 3.84
4 Chars   DM Sound   5.63        N/A
4 Chars   N-Gram     57.83       5.94
4 Chars   USHMM      75.03       6.79 / 4.78

Swap Multiple Characters

Errors    Method     Found (%)   Rank
2 Chars   DM Sound   17.33       N/A
2 Chars   N-Gram     54.66       6.92
2 Chars   USHMM      71.69       7.55 / 5.46
3 Chars   DM Sound   9.19        N/A
3 Chars   N-Gram     42.91       7.30
3 Chars   USHMM      57.65       8.61 / 6.19
4 Chars   DM Sound   7.15        N/A
4 Chars   N-Gram     34.42       8.60
4 Chars   USHMM      46.31       9.32 / 7.30

Applying operational technology to a medical domain…

Corrected spelling within a Medical Terms Dictionary

Transcription Errors

“What is a prescribing error?”, J. Quality in Health Care, 2000; 9:232–237.

“Reducing medication errors and increasing patient safety: Case studies in clinical pharmacology”, J. Clinical Pharmacology, July 2003, vol. 43 no. 7: 768-783.

“Preventing medication errors in community pharmacy: root-cause analysis of transcription errors”, Quality and Safety in Health Care, 2007;16:285-290.

“10 strategies for minimizing dispensing errors”, Pharmacy Times, Jan. 20th, 2010

Note: Although many of the transcription errors are not spelling errors, some indeed are!

Medical Term Data Set

Hosford Medical Terms Dictionary v.3.0

Number of terms: 9,883

Term characteristics (length in characters):

Average: 10.58

Minimum: 2

Maximum: 30

Median: 10

Mode: 10

Single Character Correction

Add Single Random Character

Method      Found (%)   Rank
D-M Sound   38.54       N/A
3-Gram      99.67       1.08
Med-Find    100         1.03 / 1.03

Remove Single Random Character

Method      Found (%)   Rank
D-M Sound   44.84       N/A
3-Gram      99.52       1.16
Med-Find    100         1.07 / 1.07

Replace Single Random Character

Method      Found (%)   Rank
D-M Sound   62.73       N/A
3-Gram      96.39       1.50
Med-Find    99.54       1.42 / 1.27

Swap Random Adjacent Pair of Characters

Method      Found (%)   Rank
D-M Sound   29.99       N/A
3-Gram      98.76       1.19
Med-Find    99.99       1.10 / 1.08

Multiple Character Correction

Add Multiple Characters

Errors    Method     Found (%)   Rank
2 Chars   DM Sound   16.40       N/A
2 Chars   3-Gram     98.48       1.29
2 Chars   Med-Find   99.55       1.17 / 1.15
3 Chars   DM Sound   7.00        N/A
3 Chars   3-Gram     97.11       1.46
3 Chars   Med-Find   98.36       1.27 / 1.23
4 Chars   DM Sound   3.90        N/A
4 Chars   3-Gram     94.96       1.86
4 Chars   Med-Find   96.79       1.38 / 1.31

Remove Multiple Characters

Errors    Method     Found (%)   Rank
2 Chars   DM Sound   19.49       N/A
2 Chars   3-Gram     96.21       1.67
2 Chars   Med-Find   99.07       1.76 / 1.61
3 Chars   DM Sound   8.52        N/A
3 Chars   3-Gram     90.29       2.40
3 Chars   Med-Find   95.21       2.54 / 2.13
4 Chars   DM Sound   3.84        N/A
4 Chars   3-Gram     81.83       3.08
4 Chars   Med-Find   88.88       3.54 / 2.70

Multiple Character Correction

Replace Multiple Characters

Errors    Method     Found (%)   Rank
2 Chars   DM Sound   12.34       N/A
2 Chars   3-Gram     94.54       1.64
2 Chars   Med-Find   97.88       1.57 / 1.40
3 Chars   DM Sound   5.41        N/A
3 Chars   3-Gram     87.95       2.08
3 Chars   Med-Find   92.58       1.90 / 1.64
4 Chars   DM Sound   3.05        N/A
4 Chars   3-Gram     79.46       2.86
4 Chars   Med-Find   85.42       2.19 / 1.78

Swap Multiple Characters

Errors    Method     Found (%)   Rank
2 Chars   DM Sound   15.11       N/A
2 Chars   3-Gram     76.36       3.99
2 Chars   Med-Find   82.51       3.02 / 2.25
3 Chars   DM Sound   7.60        N/A
3 Chars   3-Gram     61.13       5.95
3 Chars   Med-Find   66.85       3.89 / 2.70
4 Chars   DM Sound   5.08        N/A
4 Chars   3-Gram     48.91       7.51
4 Chars   Med-Find   54.22       4.61 / 2.87

Collaborators

Key Personnel

Michlean Amir – USHMM

Rebecca Cathey – BAE Systems

Gideon Frieder – George Washington Univ.

Jason Soo – Georgetown/MITRE

Many comments by “prototype” users


References

J. Soo, R. Cathey, O. Frieder, M. Amir, and G. Frieder, “Yizkor Books: A Voice for the Silent Past,” ACM Seventeenth Conference on Information and Knowledge Management (CIKM) – Industrial Track, Napa Valley, California, October 2008.

J. Soo and O. Frieder, “On Foreign Name Search,” ACM Thirty-Second European Conference on Information Retrieval (ECIR), Milton Keynes, United Kingdom, March 2010.

J. Soo and O. Frieder, “On Searching Misspelled Collections,” Journal of the Association for Information Science and Technology (JASIST), 66(6), June 2015.

J. Soo and O. Frieder, “Revisiting Known-Item Retrieval in Degraded Document Collections,” Document Recognition and Retrieval (DRR), San Francisco, California, February 2016.

J. Soo and O. Frieder, “Searching Corrupted Document Collections,” Twelfth IAPR Document Analysis Systems (DAS), Santorini, Greece, April 2016.

Talk Outline

Engineering Research

Searching & Mining Social Media

Searching in Adverse Conditions

Complex Document Information Processing

Computer Science

Motivation

Public health surveillance

Demands considerable human efforts

Often delayed identification

Typically: need topic of interest

Ideally: detect without focus

Motivated to expedite detection

Social media the answer?


Related Efforts

Social Media

Known topic problem

Detection of specific disease (Influenza)

Correlate occurrence of flu-related words with official influenza-like-illness data

Summarize influenza-related tweets

Complex solutions

Detect multiple health conditions via complex learning algorithms

Use access-limited resources

Query logs

Hypothesis: Generation vs. Validation

Goal: extract more general health-related information from social media streams

The Old Way:

Evaluate a pre-existing hypothesis using SM data

Q: “Is flu occurring more frequently?”

A: “Yes”

Our Way:

Generate a hypothesis from SM data

Q: “Are any illnesses occurring more frequently? If so, which ones?”

A: “Yes, Flu”

Tweet Corpus

Collected by Johns Hopkins University (JHU)

2 billion tweets (May 2009 - Oct 2010)

Filtered multiple times to yield medically related tweets

Using a 20,000 health-related key-phrase list

High-recall / low-precision health tweets

SVM to increase precision

Framework: High Level View


Partition Corpus By Time


Frequent Word Set Identification

Preprocessing

Punctuation mark removal

Text lower-cased & tokenized

Stop-word removal

Duplicate term removal

Medical synonym expansion (MedSyn)
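
The preprocessing steps listed above can be sketched as a small pipeline. This is an illustrative approximation only: the `STOP_WORDS` set and the `SYNONYMS` table are tiny hypothetical stand-ins (the slide names MedSyn as the synonym resource, which is not reproduced here), and only ASCII punctuation is stripped.

```python
import string

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "to", "with", "up", "this", "i", "for", "about"}

# Hypothetical stand-in for the MedSyn synonym resource named on the slide.
SYNONYMS = {"flu": ["influenza"]}


def preprocess(tweet):
    """Apply the slide's steps in order and return the list of terms."""
    # Punctuation mark removal (ASCII punctuation only)
    no_punct = tweet.translate(str.maketrans("", "", string.punctuation))
    # Lower-casing and tokenization on whitespace
    tokens = no_punct.lower().split()
    # Stop-word removal
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Duplicate term removal, preserving first-seen order
    terms = list(dict.fromkeys(tokens))
    # Synonym expansion: append known synonyms that are not already present
    for term in list(terms):
        for synonym in SYNONYMS.get(term, []):
            if synonym not in terms:
                terms.append(synonym)
    return terms
```

Removing duplicates before counting means each tweet contributes at most once to the support of any term, which matches how support is tallied per tweet in the frequent word set example.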


Frequent Word Set Identification

#    Tweet Content
T1   Pounding headache, sore throat, low grade fever, flu
T2   Sleep, a perfect cure to forget about the pain!
T3   This morning woke up with fever, sore throat, and flu
T4   Cough, flu, sore throat. I couldn’t ask for a better combination
T5   Got you down? Fever, muscle aches, cough,

Term Set           Support
flu, sore throat   3
fever              3
cough              2

Frequent Term Sets: {{flu, sore throat}, {fever}} -- Threshold 3
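
The support counts in this example can be reproduced with a brute-force sketch. Several things here are assumptions rather than the talk's method: multi-word phrases such as "sore throat" are treated as single terms, the five tweets are reduced by hand to the term sets shown in `TWEETS`, the slide is read as listing only maximal frequent sets (which is why {flu} and {sore throat} do not appear separately), and exhaustive enumeration stands in for whatever frequent-itemset algorithm was actually used.

```python
from itertools import combinations

# Hand-reduced term sets for tweets T1..T5 (an assumption about preprocessing output).
TWEETS = [
    {"headache", "sore throat", "fever", "flu"},
    {"sleep", "cure", "pain"},
    {"fever", "sore throat", "flu"},
    {"cough", "flu", "sore throat"},
    {"fever", "muscle aches", "cough"},
]


def support(term_set, tweets):
    """Number of tweets containing every term in term_set."""
    return sum(1 for tweet in tweets if term_set <= tweet)


def maximal_frequent_term_sets(tweets, threshold, max_size=3):
    """Frequent term sets that are not contained in a larger frequent set."""
    vocabulary = sorted(set().union(*tweets))
    frequent = []
    for size in range(1, max_size + 1):
        for combo in combinations(vocabulary, size):
            candidate = frozenset(combo)
            if support(candidate, tweets) >= threshold:
                frequent.append(candidate)
    # Keep only sets with no frequent proper superset
    return {s for s in frequent if not any(s < other for other in frequent)}
```

Calling `maximal_frequent_term_sets(TWEETS, 3)` yields the two sets shown on the slide, {flu, sore throat} and {fever}, while {cough} is excluded because its support is 2. Exhaustive enumeration is only workable for a toy vocabulary like this one.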


Decide “Is Trending”

isTrending(t) = isFrequent(t) AND (prevalence(t) / prevalence(t-1) > growth_rate)

Word sets prevalent throughout are irrelevant

For example: {feel, sick}

Relevancy: “Is Trending” word sets interest us.
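
The trending decision on this slide combines a frequency test with a growth test on prevalence between consecutive time partitions. A minimal sketch follows, with assumptions stated plainly: the slide extraction does not preserve the comparison operator, so a strict "greater than" against `growth_rate` is assumed; `min_prevalence` is a hypothetical name for the frequency threshold; and a word set absent in the previous partition is treated as trending whenever it is frequent now.

```python
def is_trending(prevalence_prev, prevalence_curr, min_prevalence, growth_rate):
    """Decide whether a word set trends between two consecutive partitions.

    prevalence_prev and prevalence_curr are the word set's prevalence at
    times t-1 and t. The strict '>' comparison is an assumption.
    """
    is_frequent = prevalence_curr >= min_prevalence
    if not is_frequent:
        return False
    if prevalence_prev == 0:
        # Assumption: a frequent set that was absent before counts as trending.
        return True
    return prevalence_curr / prevalence_prev > growth_rate
```

Under this sketch, a set like {feel, sick} that is frequent in every partition but barely changes (say 100 then 105 with `growth_rate=1.5`) is not trending, whereas a set that jumps from 20 to 60 is.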


Track Word Set Time Series

Time series used to determine word sets with a significant increase in prevalence

[Figure: two differing word set tracks by month, with the trending decision. {feel, sick}: very frequent, does not trend. {allergies, feel}: trends in April and May.]

Query Wikipedia

Query a trending word set in Wikipedia

Why Wikipedia?

Comprehensive range of topics including health topics

Written in layman’s English resembling the tweets considered

Filter Wikipedia Results

Retrieved articles determine if frequent word set is health-related

Health-related nature judged by two metrics:

Ratio of medical tokens in introduction

Presence of International Statistical Classification of Diseases and Related Health Problems (ICD) codes

Ratio of Medical Tokens

Article health-related if ratio of health tokens in introduction surpasses threshold

Process:

Tokenize introduction

Remove stop words

Count the tokens and medical tokens

If # medical_token / # token > 0.75, then health-related
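
The process above might be sketched as follows. The 0.75 threshold comes from the slide; everything else is an assumption for illustration: the regular-expression tokenizer, the small `STOP_WORDS` set, and the `MEDICAL_TOKENS` lexicon are hypothetical stand-ins for whatever resources the actual system used.

```python
import re

# Hypothetical stop-word list and medical lexicon, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "by", "and", "or", "to"}
MEDICAL_TOKENS = {"influenza", "infectious", "disease", "virus", "fever", "allergy"}


def medical_token_ratio(introduction):
    """Fraction of non-stop-word tokens that appear in the medical lexicon."""
    tokens = re.findall(r"[a-z]+", introduction.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    if not tokens:
        return 0.0
    medical = sum(1 for t in tokens if t in MEDICAL_TOKENS)
    return medical / len(tokens)


def is_health_related(introduction, threshold=0.75):
    """Apply the slide's rule: ratio strictly above the threshold."""
    return medical_token_ratio(introduction) > threshold
```

With this toy lexicon, the introduction "Influenza is an infectious disease caused by a virus" has four medical tokens out of five remaining tokens, so it passes the 0.75 threshold, while a sentence with no medical tokens does not.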


ICD Codes

Health-related Wikipedia articles typically contain info-box with ICD-9 & ICD-10 codes.

ICD code – strong health-related indicator

[Figure: a Wikipedia article’s info box and ICD codes]

Detection – 2010 Flu Season

Tweet time series from June 09 to Oct 10

Weekly flu cases in US from June 09 to Oct 10


Social Media Mining Accuracy

Landing on Hudson and Mumbai Terror Attack

Flu Tweets (Lampos and Cristianini 2010; Culotta 2010)

Hurricane Sandy Coordination Communication

Fake Celebrity Deaths (Jeff Goldblum)


Sinus (Anatomy)

[Chart: fraction of cumulative signal (%), axis from 0 to 0.6. Series: Sinus (anatomy)]

Allergic Response

[Chart: fraction of cumulative signal (%), axis from 0 to 0.6. Series: Allergic Response, Sinus (anatomy)]

Food Allergy

[Chart: fraction of cumulative signal (%), axis from 0 to 0.6. Series: Food Allergy, Allergic Response, Sinus (anatomy)]

Summary

Our Approach:

Filter a corpus to be topic specific

Identify trending word sets

Connect multiple trending word sets to topics of interest

Detect trending topic of interest – Generate Hypotheses


Future Work

Run framework on a larger scale

Increase data volume: 2 billion → 200 billion

Increase temporal resolution: months → weeks → days

Use resources besides Wikipedia and ICD to filter out non-medically related trending topics

Detect other types of trends by changing the filters to suit a new topic of interest

Deploy globally

Collaborators

Key Personnel

Nazli Goharian – Georgetown University

Alek Kolcz – Twitter PushD

Jon Parker – Johns Hopkins/Georgetown MITRE

Andrew Yates – Georgetown University

Many comments by “prototype” users


References

A. Yates, J. Parker, N. Goharian, and O. Frieder, “A Framework for Public Health Surveillance,” 9th Language Resources and Evaluation Conference (LREC-2014), Reykjavik, Iceland, May 2014.

J. Parker, A. Yates, N. Goharian, and O. Frieder, “Health Related Hypothesis Generation using Social Media Data,” Social Network Analysis and Mining, 5(7), March 2015.

A. Yates, N. Goharian, and O. Frieder, “Learning the Relationships between Drug, Symptom, and Medical Condition Mentions in Social Media,” AAAI 10th International Conference on Web and Social Media (ICWSM), Cologne, Germany, May 2016.

A. Yates, A. Kolcz, N. Goharian, and O. Frieder, “Effects of Sampling on Twitter Trend Detection,” 10th Language Resources and Evaluation Conference (LREC-2016), Portoroz, Slovenia, May 2016.


Summary

Complex Document Information Processing

The whole is greater than the sum of its parts

Searching is easy

Unless it is in adverse (misspelled) environments

Social Media Search: Surveillance in a positive light

Detecting outbreaks in their infancy


Thanks!

Questions?
