Searching in Harsh Environments
Ophir Frieder
Computer Science Dept. | Georgetown University &
Biostatistics, Bioinformatics, & Biomathematics | Georgetown University Medical Center
[email protected]
March 2016
Correcting the Search Myth
If it’s search, then Google solved it!
Some of what Google solved was solved by others first
Google’s focus is computerized data, yet much data are not digitized
Google is hardly a key social media player, yet social media data are everywhere
2
Diverse Search Applications
Complex Document Information Processing
The whole is greater than the sum of its parts
Searching is easy
Unless it is in adverse (misspelled) environments
Social Media Search & Surveillance
Detecting outbreaks in their infancy
3
Talk Outline (spanning Engineering Research and Computer Science)
Searching & Mining Social Media
Searching in Adverse Conditions
Complex Document Information Processing
4
Documents are Complex!
5
Complex documents include handwritten notes, diagrams, graphics, and printed or formatted text
Point solutions exist: OCR, Information Retrieval, Information Extraction, Image Processing, Text Clustering, Computational Stylistics, …
No definition of state-of-the-art for the integrated problem
Manual partitioning/collating: expensive, time-consuming, error-prone
Some are even more complex!
6
Existing Technology Point Solutions
Optical character recognition (OCR)
Document clustering and browsing
Document structure extraction; extraction from tables/lists
Handwriting analysis and signature recognition
Figure caption identification and extraction
Conventional and image retrieval systems
Entity and relationship extraction
7
CDIP Processing Architecture
Complex Document Images
Layers: OCR | Table Extraction | Logo Extraction | Signature Match | Doc Metadata | Text Extraction | Entity Tagging
Enhance
CDIP Metadata Database
Analyst: Integrated Retrieval | Data Mining
8
Enhancement
[Image-only slides: before/after examples of degraded-document enhancement]
Integration Helps
Without Logos: At which institution?
Without Text: What positions do I hold?
Ophir Frieder, McDevitt Prof. of Comp. Sci. & Inf. Proc.
& Prof. of Biostatistics, Bioinformatics, & Biomathematics
13
Technology comes and goes, but…
Benchmarks (collections) are ever (forever) lasting
14
Test Collection Characteristics
Cover the richness of inputs: range of formats, lengths, & genres; variance in print and image quality
Documents should include: handwritten text and notations, diverse fonts, and graphical elements (graphs, tables, photos, logos, and diagrams)
15
Test Collection Characteristics (cont.)
Sufficiently high volume of documents, including a vast volume of redundant & irrelevant documents
Support diverse applications: include private communications within and between groups planning activities and deploying resources
Publicly available data! Minimal cost, minimal licensing
16
17
The CDIP Test Collection (NIST TREC V1.0)
Data made public via legal proceedings: the Master Settlement Agreement subset of the UCSF Legacy Tobacco Document Library
Documents scanned by individual companies; hence scan quality varies widely
~7 million documents
~42 million scanned TIFF-format pages (~1.5 TB)
~5 GB metadata
~100 GB OCR output
Dataset: https://ir.nist.gov/cdip/cdip-images/
18
NIST TREC Legal Track, 2006–2009
Used for multiple years in the TREC Legal Track; housed permanently at NIST
Records (62 GB) made available to TREC participants (via ftp/DVD)
40 queries simulating legal case investigations, with relevance judgments produced by 35 lawyers
Novel queries with relevance judgments generated by tobacco researchers
CDIP benchmark data: a novel text test collection for “live scenarios”
Complex document search: ground truth is difficult; an 800-document hand-checked sub-collection
Evaluation
19
Preliminary Results
Completed: subset of 800 documents, manually labelled for authorship & organizational unit
Evaluated: authorship, organizational, monetary, date, and address-based retrieval tasks
Ongoing: subset of 20K documents
Open problem: performance evaluation (measures) for larger sets
20
System Configuration Screen
21
22
Query: ATC Logo + “income forecast” + > $500,000
23
Query: RJR Logo + “filtration efficiency” + signature
24
Query: Five signatures with the highest dollar total
25
Query: Associations of a given person (Dr. D. Stone)
Collaborators
Initial effort:
Gady Agam – Illinois Inst. of Tech.
Shlomo Argamon – Illinois Inst. of Tech.
David Doermann – Univ. of Maryland / DARPA
David Grossman – Illinois Inst. of Tech. (Grossman Lab)
David D. Lewis – DDL Consulting
Sargur Srihari – SUNY Buffalo
Ongoing effort:
Gideon Frieder – George Washington Univ.
Jon Parker – Georgetown Univ. / MITRE
26
References
S. Argamon, G. Agam, O. Frieder, D. Grossman, D. Lewis, G. Sohn, and K. Voorhees, “A Complex Document Information Processing Prototype,” ACM SIGIR, 2006.
D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, and J. Heard, “Building a Test Collection for Complex Document Information Processing,” ACM SIGIR, 2006.
G. Agam, S. Argamon, O. Frieder, D. Grossman, and D. Lewis, “Content-Based Document Image Retrieval in Complex Document Collections,” Document Recognition and Retrieval, 2007.
G. Bal, G. Agam, O. Frieder, and G. Frieder, “Interactive Degraded Document Enhancement and Ground Truth Generation,” Document Recognition and Retrieval, 2008.
T. Obafemi-Ajayi, G. Agam, and O. Frieder, “Historical Document Enhancement Using LUT Classification,” International Journal on Document Analysis and Recognition, 13(1), March 2010.
J. Parker, G. Frieder, and O. Frieder, “Automatic Enhancement and Binarization of Degraded Document Images,” International Conference on Document Analysis and Recognition, 2013.
J. Parker, G. Frieder, and O. Frieder, “Robust Binarization of Degraded Document Images Using Heuristics,” Document Recognition and Retrieval XXI, San Francisco, California, February 2014.
Parker et al., “System and Method for Enhancing the Legibility of Degraded Images,” US Patent #8,995,782, March 31, 2015.
Frieder et al., “System and Method for Enhancing the Legibility of Images,” US Patent #9,269,126, February 23, 2016.
27
Talk Outline (spanning Engineering Research and Computer Science)
Searching & Mining Social Media
Searching in Adverse Conditions
Complex Document Information Processing
28
Spelling in Adverse Conditions
Foreign language (Yizkor Books)
User unfamiliar with character pronunciation
Multiple languages within a document
Domain specific (Medical)
Terms unfamiliar to the general audience
29
Yizkor Books
Yizkor = Hebrew word for “remember”
Firsthand accounts of events that preceded, took place during, and followed the Second World War
Document the destroyed communities and the people who perished
Started in the early 1940’s; highest activity in the 1960’s and 1970’s
Published in 13 languages, across 6 continents
One of the largest collections resides in the USHMM
Access restricted due to the books’ limited number and fragile state, and to prevent destruction or theft
30
Traditional Access
User requested; archivist driven
Requires “complete” understanding of books
High human resource costs
Inefficient & slow
Often fails to obtain complete, if any, results
31
Metadata Search Access
Provides an intuitive search capability for
apprehensive but interested users
Creates and queries collection metadata
32
Yizkor Interface
Centralized index
Global access
Efficient search
Accurate search
Multi-lingual spelling correction
33
Search Results
34
Spell Checker
Upon entering a misspelled query, users are presented with a ranked list of suggestions
Percentages represent similarity to the original query as measured by our algorithms
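One common way to produce such similarity scores is a Dice coefficient over character trigrams; the sketch below is an assumption about the flavor of the algorithm, not the exact USHMM implementation.

```python
def char_ngrams(term, n=3):
    """Character n-grams with '#' padding so word edges count."""
    padded = "#" * (n - 1) + term + "#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def similarity(query, candidate, n=3):
    """Dice coefficient over character trigram sets."""
    a, b = char_ngrams(query, n), char_ngrams(candidate, n)
    return 2 * len(a & b) / (len(a) + len(b))

def suggest(query, lexicon, k=5):
    """Return the top-k lexicon terms ranked by similarity to the query."""
    return sorted(lexicon, key=lambda t: similarity(query, t), reverse=True)[:k]
```

Ranking every lexicon term this way is quadratic in practice; an index over trigrams would be used at scale.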
35
Query Processing
Language-independent string manipulation for auto-correction via a voting algorithm
36
Language Independent Correction
Simplistic rules work! (or do they?)
Replace the first and last characters by a wild card, in succession
Retain only the first and last characters and insert a wild card
Retain only the first and last two characters and insert a wild card
Replace the middle n characters by a wild card, in succession
Replace the first half by a wild card
Replace the second half by a wild card
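A minimal Python sketch of these rules plus the voting step (not the deployed USHMM system): each rule emits wildcard patterns, and lexicon terms are ranked by how many patterns they match. Fixing n = 2 for the middle-characters rule and using fnmatch-style wildcards are assumptions of this sketch.

```python
from fnmatch import fnmatchcase

def wildcard_patterns(word, n=2):
    """Wildcard variants of a possibly misspelled word, per the six rules."""
    half = len(word) // 2
    pats = {
        "*" + word[1:],               # replace the first character
        word[:-1] + "*",              # replace the last character
        word[0] + "*" + word[-1],     # keep only first and last characters
        word[:2] + "*" + word[-2:],   # keep only first and last two characters
        "*" + word[half:],            # replace the first half
        word[:half] + "*",            # replace the second half
    }
    for i in range(1, len(word) - n):  # replace middle n characters, in succession
        pats.add(word[:i] + "*" + word[i + n:])
    return pats

def vote_candidates(misspelled, lexicon):
    """Rank lexicon terms by how many wildcard patterns they match
    (a simple stand-in for the talk's voting algorithm)."""
    pats = wildcard_patterns(misspelled)
    votes = {t: sum(fnmatchcase(t, p) for p in pats) for t in lexicon}
    return sorted((t for t in votes if votes[t]), key=votes.get, reverse=True)
```

For example, `vote_candidates("memoral", [...])` favors "memorial" because it matches several patterns at once while near-misses match only one.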
37
Single Character Correction
Add Single Random Character
Remove Single Random Character
Replace Single Random Character
Swap Random Adjacent Pair of Characters
Mitton 1996 – “Spellchecking by Computers”
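The four corruption operations above can be sketched as a small generator of synthetic misspellings; this is an illustrative reconstruction of the evaluation setup, not the exact harness used in the talk.

```python
import random
import string

def corrupt(word, op, rng=random):
    """Apply one synthetic error of the four kinds listed above."""
    i = rng.randrange(len(word))
    if op == "add":
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
    if op == "remove":
        return word[:i] + word[i + 1:]
    if op == "replace":
        # choose a character guaranteed to differ from the original
        c = rng.choice(string.ascii_lowercase.replace(word[i], ""))
        return word[:i] + c + word[i + 1:]
    if op == "swap":
        i = rng.randrange(len(word) - 1)
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    raise ValueError(f"unknown operation: {op}")
```

Applying each operation to every dictionary term yields the test queries behind the tables that follow.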
Add Single Random Character
            Found (%)   Rank
D-M Sound   41.41       N/A
N-Gram      94.97       2.58
USHMM       100.00      1.71 / 1.71

Remove Single Random Character
            Found (%)   Rank
D-M Sound   41.96       N/A
N-Gram      93.40       3.46
USHMM       99.97       2.54 / 2.57

Replace Single Random Character
            Found (%)   Rank
D-M Sound   57.89       N/A
N-Gram      85.02       4.77
USHMM       97.97       3.75 / 3.00

Swap Random Adjacent Pair of Characters
            Found (%)   Rank
D-M Sound   31.45       N/A
N-Gram      92.06       3.24
USHMM       100.00      2.15 / 2.01
38
Multiple Character Correction

Add Multiple Characters
              Found (%)   Rank
2 Chars
  DM Sound    19.58       N/A
  N-Gram      92.00       3.45
  USHMM       99.38       2.55 / 2.42
3 Chars
  DM Sound    10.69       N/A
  N-Gram      87.91       4.20
  USHMM       97.53       3.19 / 3.02
4 Chars
  DM Sound    6.75        N/A
  N-Gram      83.86       4.97
  USHMM       95.04       3.87 / 3.80

Remove Multiple Characters
              Found (%)   Rank
2 Chars
  DM Sound    20.47       N/A
  N-Gram      84.79       4.78
  USHMM       97.83       4.62 / 3.88
3 Chars
  DM Sound    10.75       N/A
  N-Gram      74.48       5.77
  USHMM       92.73       6.41 / 4.80
4 Chars
  DM Sound    9.70        N/A
  N-Gram      69.98       6.04
  USHMM       86.34       7.12 / 5.15
39
Multiple Character Correction

Replace Multiple Characters
              Found (%)   Rank
2 Chars
  DM Sound    16.80       N/A
  N-Gram      80.73       4.44
  USHMM       93.88       4.19 / 3.33
3 Chars
  DM Sound    9.11        N/A
  N-Gram      69.23       5.15
  USHMM       85.83       5.51 / 3.84
4 Chars
  DM Sound    5.63        N/A
  N-Gram      57.83       5.94
  USHMM       75.03       6.79 / 4.78

Swap Multiple Characters
              Found (%)   Rank
2 Chars
  DM Sound    17.33       N/A
  N-Gram      54.66       6.92
  USHMM       71.69       7.55 / 5.46
3 Chars
  DM Sound    9.19        N/A
  N-Gram      42.91       7.30
  USHMM       57.65       8.61 / 6.19
4 Chars
  DM Sound    7.15        N/A
  N-Gram      34.42       8.60
  USHMM       46.31       9.32 / 7.30
40
Applying operational technology to a medical domain…
Corrected spelling within a Medical Terms Dictionary
41
Transcription Errors
“What is a prescribing error?”, J. Quality in Health Care, 2000; 9:232–237.
“Reducing medication errors and increasing patient safety: Case studies in clinical pharmacology”, J. Clinical Pharmacology, July 2003, 43(7): 768–783.
“Preventing medication errors in community pharmacy: root-cause analysis of transcription errors”, Quality and Safety in Health Care, 2007; 16:285–290.
“10 strategies for minimizing dispensing errors”, Pharmacy Times, Jan. 20, 2010.
Note: Although many transcription errors are not spelling errors, some indeed are!
42
Medical Term Data Set
Hosford Medical Terms Dictionary v.3.0
Number of terms: 9,883
Term length characteristics (in characters):
Average: 10.58
Minimum: 2
Maximum: 30
Median: 10
Mode: 10
43
Single Character Correction
Add Single Random Character
Remove Single Random Character
Replace Single Random Character
Swap Random Adjacent Pair of Characters
Add Single Random Character
            Found (%)   Rank
D-M Sound   38.54       N/A
3-Gram      99.67       1.08
Med-Find    100.00      1.03 / 1.03

Remove Single Random Character
            Found (%)   Rank
D-M Sound   44.84       N/A
3-Gram      99.52       1.16
Med-Find    100.00      1.07 / 1.07

Replace Single Random Character
            Found (%)   Rank
D-M Sound   62.73       N/A
3-Gram      96.39       1.50
Med-Find    99.54       1.42 / 1.27

Swap Random Adjacent Pair of Characters
            Found (%)   Rank
D-M Sound   29.99       N/A
3-Gram      98.76       1.19
Med-Find    99.99       1.10 / 1.08
44
Multiple Character Correction

Add Multiple Characters
              Found (%)   Rank
2 Chars
  DM Sound    16.40       N/A
  3-Gram      98.48       1.29
  Med-Find    99.55       1.17 / 1.15
3 Chars
  DM Sound    7.00        N/A
  3-Gram      97.11       1.46
  Med-Find    98.36       1.27 / 1.23
4 Chars
  DM Sound    3.90        N/A
  3-Gram      94.96       1.86
  Med-Find    96.79       1.38 / 1.31

Remove Multiple Characters
              Found (%)   Rank
2 Chars
  DM Sound    19.49       N/A
  3-Gram      96.21       1.67
  Med-Find    99.07       1.76 / 1.61
3 Chars
  DM Sound    8.52        N/A
  3-Gram      90.29       2.40
  Med-Find    95.21       2.54 / 2.13
4 Chars
  DM Sound    3.84        N/A
  3-Gram      81.83       3.08
  Med-Find    88.88       3.54 / 2.70
45
Multiple Character Correction

Replace Multiple Characters
              Found (%)   Rank
2 Chars
  DM Sound    12.34       N/A
  3-Gram      94.54       1.64
  Med-Find    97.88       1.57 / 1.40
3 Chars
  DM Sound    5.41        N/A
  3-Gram      87.95       2.08
  Med-Find    92.58       1.90 / 1.64
4 Chars
  DM Sound    3.05        N/A
  3-Gram      79.46       2.86
  Med-Find    85.42       2.19 / 1.78

Swap Multiple Characters
              Found (%)   Rank
2 Chars
  DM Sound    15.11       N/A
  3-Gram      76.36       3.99
  Med-Find    82.51       3.02 / 2.25
3 Chars
  DM Sound    7.60        N/A
  3-Gram      61.13       5.95
  Med-Find    66.85       3.89 / 2.70
4 Chars
  DM Sound    5.08        N/A
  3-Gram      48.91       7.51
  Med-Find    54.22       4.61 / 2.87
46
Collaborators
Key Personnel
Michlean Amir – USHMM
Rebecca Cathey – BAE Systems
Gideon Frieder – George Washington Univ.
Jason Soo – Georgetown/MITRE
Many comments by “prototype” users
47
J. Soo, R. Cathey, O. Frieder, M. Amir, and G. Frieder, “Yizkor Books: A Voice for the Silent Past,” ACM
Seventeenth Conference on Information and Knowledge Management (CIKM) – Industrial Track, Napa
Valley, California, October 2008.
J. Soo and O. Frieder, “On Foreign Name Search,” ACM Thirty-Second European Conference on
Information Retrieval (ECIR), Milton Keynes, United Kingdom, March 2010.
J. Soo and O. Frieder, “On Searching Misspelled Collections,” Journal of the Association for Information
Science and Technology (JASIS), 66(6), June 2015.
J. Soo and O. Frieder, “Revisiting Known-Item Retrieval in Degraded Document Collections," Document
Recognition and Retrieval (DRR), San Francisco, California, February 2016.
J. Soo and O. Frieder, “Searching Corrupted Document Collections," Twelfth IAPR Document
Analysis Systems (DAS), Santorini, Greece, April 2016.
References
48
Talk Outline (spanning Engineering Research and Computer Science)
Searching & Mining Social Media
Searching in Adverse Conditions
Complex Document Information Processing
49
Motivation
Public health surveillance demands considerable human effort and often yields delayed identification
Typically: a topic of interest is needed in advance
Ideally: detect without a predefined focus
Motivated to expedite detection
Is social media the answer?
50
Related Efforts
Social Media
Known topic problem
Detection of specific disease (Influenza)
Correlate occurrence of flu-related words with official
Influenza-like-illness data
Summarize influenza-related tweets
Complex solutions
Detect multiple health conditions via complex
learning algorithms
Use access-limited resources
Query logs
51
Hypothesis: Generation vs. Validation
Goal: extract more general health-related information from
social media streams
The Old Way:
Evaluate a pre-existing hypothesis using SM data
Q: “Is flu occurring more frequently?”
A: “Yes”
Our Way:
Generate a hypothesis from SM data
Q: “Are any illnesses occurring more frequently? If so,
which ones?”
A: “Yes, Flu”
52
Tweet Corpus
Collected by Johns Hopkins University (JHU)
2 billion tweets (May 2009 – Oct 2010)
Filtered multiple times to yield medically related tweets
Using a 20,000 health-related key-phrase list
High-recall / low-precision health tweets
An SVM classifier to increase precision
53
Framework: High Level View
54
Partition Corpus By Time
55
Frequent Word Set Identification
Preprocessing
Punctuation mark removal
Text lower-cased & tokenized
Stop-word removal
Duplicate term removal
Medical synonym expansion (MedSyn)
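The preprocessing steps above can be sketched as follows; the tiny stop-word list and the plain dict standing in for MedSyn are assumptions of the sketch.

```python
import string

STOP_WORDS = {"a", "the", "to", "with", "and", "i", "about", "up"}  # tiny illustrative list

def preprocess(tweet, synonyms=None):
    """Strip punctuation, lower-case and tokenize, map medical synonyms
    to a canonical form, and drop stop words and duplicate terms."""
    synonyms = synonyms or {}
    table = str.maketrans("", "", string.punctuation)
    seen, tokens = set(), []
    for tok in tweet.translate(table).lower().split():
        tok = synonyms.get(tok, tok)
        if tok in STOP_WORDS or tok in seen:
            continue
        seen.add(tok)
        tokens.append(tok)
    return tokens
```

Synonym mapping before deduplication means "influenza" and "flu" in the same tweet collapse to a single term, as the MedSyn step intends.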
56
Frequent Word Set Identification

#   Tweet Content
T1  Pounding headache, sore throat, low grade fever, flu
T2  Sleep, a perfect cure to forget about the pain!
T3  This morning woke up with fever, sore throat, and flu
T4  Cough, flu, sore throat. I couldn’t ask for a better combination
T5  Got you down? Fever, muscle aches, cough, …

Term Set            Support
flu, sore throat    3
fever               3
cough               2

Frequent Term Sets: {{flu, sore throat}, {fever}} (support threshold = 3)
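The support counts in this example can be reproduced with a brute-force miner over small term sets; a real system would use an Apriori-style frequent-itemset algorithm instead of enumerating every combination.

```python
from collections import Counter
from itertools import combinations

def frequent_term_sets(tweets, threshold, max_size=2):
    """Count the support of every term set up to max_size and keep
    those meeting the threshold (brute force, for illustration)."""
    support = Counter()
    for terms in tweets:
        for k in range(1, max_size + 1):
            for combo in combinations(sorted(terms), k):
                support[frozenset(combo)] += 1
    return {s: c for s, c in support.items() if c >= threshold}

# Term sets from the five example tweets (after preprocessing)
tweets = [
    {"headache", "sore throat", "fever", "flu"},   # T1
    {"sleep", "cure", "pain"},                     # T2
    {"morning", "fever", "sore throat", "flu"},    # T3
    {"cough", "flu", "sore throat"},               # T4
    {"fever", "muscle aches", "cough"},            # T5
]
frequent = frequent_term_sets(tweets, threshold=3)
```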
57
Decide “Is Trending”
isTrending(t) = isFrequent(t) AND ( prevalence(t) / prevalence(t−1) > growth_rate )
Word sets that are prevalent throughout are irrelevant, for example: {feel, sick}
Only word sets that satisfy “Is Trending” interest us
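The trending test can be sketched as a small function; treating isFrequent as a simple minimum-support check is an assumption, since the talk does not spell out its exact definition.

```python
def is_trending(prevalence, t, growth_rate, min_support):
    """A word set trends at time t if it is frequent (prevalence meets
    min_support) AND its prevalence grew by more than growth_rate
    relative to the previous period."""
    frequent = prevalence[t] >= min_support
    growing = prevalence[t] > growth_rate * prevalence[t - 1]
    return frequent and growing
```

With monthly prevalence counts [10, 11, 40], a growth_rate of 2.0 and min_support of 20, only the jump in the third period trends.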
58
Track Word Set Time Series
Time-series used to determine word sets with a
significant increase in prevalence
Two differing word set tracks by month
59
{feel, sick}: very frequent, does not trend
{allergies, feel}: trends in April and May
Trending Decision
Query a trending word set in Wikipedia
Why Wikipedia? Comprehensive range of topics, including health topics
Written in layman’s English, resembling the tweets considered
60
Query Wikipedia; Filter Wikipedia Results
Retrieved articles determine whether a frequent word set is health-related
Health-related nature judged by two metrics:
Ratio of medical tokens in the introduction
Presence of International Statistical Classification of Diseases and Related Health Problems (ICD) codes
61
Ratio of Medical Tokens
An article is health-related if the ratio of medical tokens in its introduction surpasses a threshold
Process: tokenize the introduction, remove stop words, then count the tokens and the medical tokens
If (# medical tokens / # tokens) > 0.75, then the article is health-related
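The ratio test translates directly into code; modeling the medical vocabulary as a plain set of terms is an assumption of this sketch.

```python
def is_health_related(intro_tokens, medical_vocab, threshold=0.75):
    """After stop-word removal, an article counts as health-related
    when more than 75% of its introduction tokens appear in the
    medical vocabulary."""
    if not intro_tokens:
        return False
    hits = sum(tok in medical_vocab for tok in intro_tokens)
    return hits / len(intro_tokens) > threshold
```

Note the strict inequality: an introduction in which exactly 75% of tokens are medical does not pass.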
62
ICD Codes
Health-related Wikipedia articles typically contain an info-box with ICD-9 & ICD-10 codes
An ICD code is a strong health-related indicator
[Figure: a Wikipedia article’s info-box showing its ICD codes]
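Presence of an ICD code can be checked with a simple pattern; real info-box markup varies across articles, so the regular expression below is an illustrative assumption rather than the system's actual detector.

```python
import re

# Matches info-box fields like "ICD-10: J10.1" or "ICD-9 487".
ICD_FIELD = re.compile(r"ICD-(?:9|10)\b[^A-Za-z0-9]{0,5}[A-Z]?\d{2,3}(?:\.\d{1,2})?")

def has_icd_code(infobox_text):
    """Presence of an ICD code: a strong health-related signal."""
    return bool(ICD_FIELD.search(infobox_text))
```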
63
Detection – 2010 Flu Season
Tweet time series, June 2009 – Oct 2010
Weekly flu cases in the US, June 2009 – Oct 2010
64
Social Media Mining Accuracy
Landing on the Hudson and the Mumbai Terror Attack
Flu Tweets (Lampos and Cristianini 2010; Culotta 2010)
…
Hurricane Sandy Coordination Communication
…
…
Fake Celebrity Deaths (Jeff Goldblum)
65
Sinus (Anatomy)
[Chart: fraction of cumulative signal (%) for Sinus (anatomy)]
66
Allergic Response
[Chart: fraction of cumulative signal (%) for Allergic Response and Sinus (anatomy)]
67
Food Allergy
[Chart: fraction of cumulative signal (%) for Food Allergy, Allergic Response, and Sinus (anatomy)]
68
Summary
Our Approach:
Filter a corpus to be topic specific
Identify trending word sets
Connect multiple trending word sets to topics of interest
Detect trending topic of interest – Generate Hypotheses
69
Future Work
Run the framework on a larger scale
Increase data volume: 2 billion → 200 billion tweets
Increase temporal resolution: months → weeks → days
Use resources besides Wikipedia and ICD to filter out non-medically related trending topics
Detect other types of trends by changing the filters to suit a new topic of interest
Deploy globally
70
Collaborators
Key Personnel
Nazli Goharian – Georgetown University
Alek Kolcz – Twitter / PushD
Jon Parker – Johns Hopkins / Georgetown / MITRE
Andrew Yates – Georgetown University
Many comments by “prototype” users
71
References
A. Yates, J. Parker, N. Goharian, and O. Frieder, “A Framework for Public Health Surveillance,” 9th Language Resources and Evaluation Conference (LREC-2014), Reykjavik, Iceland, May 2014.
J. Parker, A. Yates, N. Goharian, and O. Frieder, “Health Related Hypothesis Generation using Social Media Data,” Social Network Analysis and Mining, 5(7), March 2015.
A. Yates, N. Goharian, and O. Frieder, “Learning the Relationships between Drug, Symptom, and Medical Condition Mentions in Social Media,” AAAI 10th International Conference on Web and Social Media (ICWSM), Cologne, Germany, May 2016.
A. Yates, A. Kolcz, N. Goharian, and O. Frieder, “Effects of Sampling on Twitter Trend Detection,” 10th Language Resources and Evaluation Conference (LREC-2016), Portoroz, Slovenia, May 2016.
72
Summary
Complex Document Information Processing
The whole is greater than the sum of its parts
Searching is easy
Unless it is in adverse (misspelled) environments
Social Media Search: Surveillance in a positive light
Detecting outbreaks in their infancy
73
Thanks!
Questions?
74