Searching in Harsh Environments
Ophir Frieder
Computer Science Dept. | Georgetown University &
Biostatistics, Bioinformatics, & Biomathematics | Georgetown University Medical Center
[email protected]
March 2016
Correcting the Search Myth
If it’s search, then Google solved it!
Some of what Google solved was solved by others first
Google’s focus is computerized data, yet much data are not digitized
Google is hardly a key social media player, yet social media data are everywhere
2
Diverse Search Applications
Complex Document Information Processing
The whole is greater than the sum of its parts
Searching is easy
Unless it is in adverse (misspelled) environments
Social Media Search & Surveillance
Detecting outbreaks in their infancy
3
Talk Outline (spanning Engineering Research and Computer Science)
Searching & Mining Social Media
Searching in Adverse Conditions
Complex Document Information Processing
4
Documents are Complex!
5
Complex documents include handwritten notes, diagrams, graphics, and printed or formatted text
Point solutions exist: OCR, Information Retrieval, Information Extraction, Image Processing, Text Clustering, Computational Stylistics, …
No definition of state-of-the-art for the integrated problem
Manual partitioning/collating: expensive, time-consuming, error-prone
Some are even more complex!
6
Existing Technology Point Solutions
Optical character recognition (OCR)
Document clustering and browsing
Document structure extraction; extraction from tables/lists
Handwriting analysis and signature recognition
Figure caption identification and extraction
Conventional and image retrieval systems
Entity and relationship extraction
7
CDIP Processing Architecture
Complex Document Images
Layers: OCR | Table Extraction | Logo Extraction | Signature Match | Doc Metadata | Text Extraction | Entity Tagging
Enhance
CDIP Metadata Database
Analyst: Integrated Retrieval | Data Mining
8
Enhancement
[Image-only slides: before/after examples of degraded-document enhancement]
Integration Helps
Without Logos: At which institution?
Without Text: What positions do I hold?
Ophir Frieder, McDevitt Prof. of Comp. Sci. & Inf. Proc.
& Prof. of Biostatistics, Bioinformatics, & Biomathematics
13
Technology comes and goes, but…
Benchmarks (collections) are ever (forever) lasting
14
Test Collection Characteristics
Cover the richness of inputs: range of formats, lengths, & genres; variance in print and image quality
Documents should include: handwritten text and notations, diverse fonts, and graphical elements (graphs, tables, photos, logos, and diagrams)
15
Test Collection Characteristics (cont.)
Sufficiently high volume of documents, including a vast volume of redundant & irrelevant documents
Support diverse applications: include private communications within and between groups planning activities and deploying resources
Publicly available data! Minimal cost, minimal licensing
16
17
The CDIP Test Collection (NIST TREC V1.0)
Data made public via legal proceedings: the Master Settlement Agreement subset of the UCSF Legacy Tobacco Document Library
Documents scanned by individual companies; hence scan quality varies widely
~7 million documents
~42 million scanned TIFF-format pages (~1.5 TB)
~5 GB metadata
~100 GB OCR output
Dataset: https://ir.nist.gov/cdip/cdip-images/
18
NIST TREC Legal Track, 2006–2009
Used for multiple years in the TREC Legal Track; housed permanently at NIST
Records (62 GB) made available to TREC participants (via ftp/DVD)
40 queries simulating legal case investigations, with relevance judgments produced by 35 lawyers
Novel queries with relevance judgments generated by tobacco researchers
CDIP benchmark data: a novel text test collection for “live scenarios”
Complex document search: ground truth is difficult; an 800-document hand-checked sub-collection
Evaluation
19
Preliminary Results
Completed: subset of 800 documents, manually labelled for authorship & organizational unit
Evaluated: authorship, organizational, monetary, date, and address-based retrieval tasks
Ongoing: subset of 20K documents
Open problem: performance evaluation (measures) for larger sets
20
System Configuration Screen
21
22
Query: ATC Logo + “income forecast” + > $500,000
23
Query: RJR Logo + “filtration efficiency” + signature
24
Query: Five signatures with the highest dollar total
25
Query: Associations of a given person (Dr. D. Stone)
Collaborators
Initial effort:
Gady Agam – Illinois Inst. of Tech.
Shlomo Argamon – Illinois Inst. of Tech.
David Doermann – Univ. of Maryland / DARPA
David Grossman – Illinois Inst. of Tech. (Grossman Lab)
David D. Lewis – DDL Consulting
Sargur Srihari – SUNY Buffalo
Ongoing effort:
Gideon Frieder – George Washington Univ.
Jon Parker – Georgetown Univ. / MITRE
26
References
S. Argamon, G. Agam, O. Frieder, D. Grossman, D. Lewis, G. Sohn, and K. Voorhees, “A Complex Document Information Processing Prototype,” ACM SIGIR, 2006.
D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, and J. Heard, “Building a Test Collection for Complex Document Information Processing,” ACM SIGIR, 2006.
G. Agam, S. Argamon, O. Frieder, D. Grossman, and D. Lewis, “Content-Based Document Image Retrieval in Complex Document Collections,” Document Recognition and Retrieval, 2007.
G. Bal, G. Agam, O. Frieder, and G. Frieder, “Interactive Degraded Document Enhancement and Ground Truth Generation,” Document Recognition and Retrieval, 2008.
T. Obafemi-Ajayi, G. Agam, and O. Frieder, “Historical Document Enhancement Using LUT Classification,” International Journal on Document Analysis and Recognition, 13(1), March 2010.
J. Parker, G. Frieder, and O. Frieder, “Automatic Enhancement and Binarization of Degraded Document Images,” International Conference on Document Analysis and Recognition, 2013.
J. Parker, G. Frieder, and O. Frieder, “Robust Binarization of Degraded Document Images Using Heuristics,” Document Recognition and Retrieval XXI, San Francisco, California, February 2014.
Parker et al., “System and Method for Enhancing the Legibility of Degraded Images,” US Patent #8,995,782, March 31, 2015.
Frieder et al., “System and Method for Enhancing the Legibility of Images,” US Patent #9,269,126, February 23, 2016.
27
Talk Outline (spanning Engineering Research and Computer Science)
Searching & Mining Social Media
Searching in Adverse Conditions
Complex Document Information Processing
28
Spelling in Adverse Conditions
Foreign language (Yizkor Books)
User unfamiliar with character pronunciation
Multiple languages within a document
Domain specific (Medical)
Terms unfamiliar to the general audience
29
Yizkor Books
Yizkor = Hebrew word for “remember”
Firsthand accounts of events that preceded, took place during, and followed the Second World War
Document the destroyed communities and the people who perished
Started in the early 1940’s; highest activity in the 1960’s and 1970’s
Published in 13 languages, across 6 continents
One of the largest collections resides in the USHMM
Access restricted due to the books’ limited number and fragile state, and to prevent destruction or theft
30
Traditional Access
User requested; archivist driven
Requires “complete” understanding of books
High human resource costs
Inefficient & slow
Often fails to obtain complete, if any, results
31
Metadata Search Access
Provides an intuitive search capability for
apprehensive but interested users
Creates and queries collection metadata
32
Yizkor Interface
Centralized index
Global access
Efficient search
Accurate search
Multi-lingual spelling correction
33
Search Results
34
Spell Checker
Upon entering a misspelled query, users are presented with a ranked list of suggestions
Percentages represent similarity to the original query as measured by our algorithms
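One common way to produce such similarity scores is a Dice coefficient over character trigrams; the sketch below is an assumption about the flavor of the algorithm, not the exact USHMM implementation.

```python
def char_ngrams(term, n=3):
    """Character n-grams with '#' padding so word edges count."""
    padded = "#" * (n - 1) + term + "#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def similarity(query, candidate, n=3):
    """Dice coefficient over character trigram sets."""
    a, b = char_ngrams(query, n), char_ngrams(candidate, n)
    return 2 * len(a & b) / (len(a) + len(b))

def suggest(query, lexicon, k=5):
    """Return the top-k lexicon terms ranked by similarity to the query."""
    return sorted(lexicon, key=lambda t: similarity(query, t), reverse=True)[:k]
```

Ranking every lexicon term this way is quadratic in practice; an index over trigrams would be used at scale.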
35
Query Processing
Language-independent string manipulation for auto-correction via a voting algorithm
36
Language Independent Correction
Simplistic rules work! (or do they?)
Replace the first and last characters by a wild card, in succession
Retain only the first and last characters and insert a wild card
Retain only the first and last two characters and insert a wild card
Replace the middle n characters by a wild card, in succession
Replace the first half by a wild card
Replace the second half by a wild card
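A minimal Python sketch of these rules plus the voting step (not the deployed USHMM system): each rule emits wildcard patterns, and lexicon terms are ranked by how many patterns they match. Fixing n = 2 for the middle-characters rule and using fnmatch-style wildcards are assumptions of this sketch.

```python
from fnmatch import fnmatchcase

def wildcard_patterns(word, n=2):
    """Wildcard variants of a possibly misspelled word, per the six rules."""
    half = len(word) // 2
    pats = {
        "*" + word[1:],               # replace the first character
        word[:-1] + "*",              # replace the last character
        word[0] + "*" + word[-1],     # keep only first and last characters
        word[:2] + "*" + word[-2:],   # keep only first and last two characters
        "*" + word[half:],            # replace the first half
        word[:half] + "*",            # replace the second half
    }
    for i in range(1, len(word) - n):  # replace middle n characters, in succession
        pats.add(word[:i] + "*" + word[i + n:])
    return pats

def vote_candidates(misspelled, lexicon):
    """Rank lexicon terms by how many wildcard patterns they match
    (a simple stand-in for the talk's voting algorithm)."""
    pats = wildcard_patterns(misspelled)
    votes = {t: sum(fnmatchcase(t, p) for p in pats) for t in lexicon}
    return sorted((t for t in votes if votes[t]), key=votes.get, reverse=True)
```

For example, `vote_candidates("memoral", [...])` favors "memorial" because it matches several patterns at once while near-misses match only one.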
37
Single Character Correction
Add Single Random Character
Remove Single Random Character
Replace Single Random Character
Swap Random Adjacent Pair of Characters
Mitton 1996 – “Spellchecking by Computers”
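The four corruption operations above can be sketched as a small generator of synthetic misspellings; this is an illustrative reconstruction of the evaluation setup, not the exact harness used in the talk.

```python
import random
import string

def corrupt(word, op, rng=random):
    """Apply one synthetic error of the four kinds listed above."""
    i = rng.randrange(len(word))
    if op == "add":
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
    if op == "remove":
        return word[:i] + word[i + 1:]
    if op == "replace":
        # choose a character guaranteed to differ from the original
        c = rng.choice(string.ascii_lowercase.replace(word[i], ""))
        return word[:i] + c + word[i + 1:]
    if op == "swap":
        i = rng.randrange(len(word) - 1)
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    raise ValueError(f"unknown operation: {op}")
```

Applying each operation to every dictionary term yields the test queries behind the tables that follow.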
Add Single Random Character
            Found (%)   Rank
D-M Sound   41.41       N/A
N-Gram      94.97       2.58
USHMM       100.00      1.71 / 1.71

Remove Single Random Character
            Found (%)   Rank
D-M Sound   41.96       N/A
N-Gram      93.40       3.46
USHMM       99.97       2.54 / 2.57

Replace Single Random Character
            Found (%)   Rank
D-M Sound   57.89       N/A
N-Gram      85.02       4.77
USHMM       97.97       3.75 / 3.00

Swap Random Adjacent Pair of Characters
            Found (%)   Rank
D-M Sound   31.45       N/A
N-Gram      92.06       3.24
USHMM       100.00      2.15 / 2.01
38
Multiple Character Correction

Add Multiple Characters
              Found (%)   Rank
2 Chars
  DM Sound    19.58       N/A
  N-Gram      92.00       3.45
  USHMM       99.38       2.55 / 2.42
3 Chars
  DM Sound    10.69       N/A
  N-Gram      87.91       4.20
  USHMM       97.53       3.19 / 3.02
4 Chars
  DM Sound    6.75        N/A
  N-Gram      83.86       4.97
  USHMM       95.04       3.87 / 3.80

Remove Multiple Characters
              Found (%)   Rank
2 Chars
  DM Sound    20.47       N/A
  N-Gram      84.79       4.78
  USHMM       97.83       4.62 / 3.88
3 Chars
  DM Sound    10.75       N/A
  N-Gram      74.48       5.77
  USHMM       92.73       6.41 / 4.80
4 Chars
  DM Sound    9.70        N/A
  N-Gram      69.98       6.04
  USHMM       86.34       7.12 / 5.15
39
Multiple Character Correction

Replace Multiple Characters
              Found (%)   Rank
2 Chars
  DM Sound    16.80       N/A
  N-Gram      80.73       4.44
  USHMM       93.88       4.19 / 3.33
3 Chars
  DM Sound    9.11        N/A
  N-Gram      69.23       5.15
  USHMM       85.83       5.51 / 3.84
4 Chars
  DM Sound    5.63        N/A
  N-Gram      57.83       5.94
  USHMM       75.03       6.79 / 4.78

Swap Multiple Characters
              Found (%)   Rank
2 Chars
  DM Sound    17.33       N/A
  N-Gram      54.66       6.92
  USHMM       71.69       7.55 / 5.46
3 Chars
  DM Sound    9.19        N/A
  N-Gram      42.91       7.30
  USHMM       57.65       8.61 / 6.19
4 Chars
  DM Sound    7.15        N/A
  N-Gram      34.42       8.60
  USHMM       46.31       9.32 / 7.30
40
Applying operational technology to a medical domain…
Corrected spelling within a Medical Terms Dictionary
41
Transcription Errors
“What is a prescribing error?”, J. Quality in Health Care, 2000; 9:232–237.
“Reducing medication errors and increasing patient safety: Case studies in clinical pharmacology”, J. Clinical Pharmacology, July 2003, 43(7): 768–783.
“Preventing medication errors in community pharmacy: root-cause analysis of transcription errors”, Quality and Safety in Health Care, 2007; 16:285–290.
“10 strategies for minimizing dispensing errors”, Pharmacy Times, Jan. 20, 2010.
Note: Although many transcription errors are not spelling errors, some indeed are!
42
Medical Term Data Set
Hosford Medical Terms Dictionary v.3.0
Number of terms: 9,883
Term length characteristics (in characters):
Average: 10.58
Minimum: 2
Maximum: 30
Median: 10
Mode: 10
43
Single Character Correction
Add Single Random Character
Remove Single Random Character
Replace Single Random Character
Swap Random Adjacent Pair of Characters
Add Single Random Character
            Found (%)   Rank
D-M Sound   38.54       N/A
3-Gram      99.67       1.08
Med-Find    100.00      1.03 / 1.03

Remove Single Random Character
            Found (%)   Rank
D-M Sound   44.84       N/A
3-Gram      99.52       1.16
Med-Find    100.00      1.07 / 1.07

Replace Single Random Character
            Found (%)   Rank
D-M Sound   62.73       N/A
3-Gram      96.39       1.50
Med-Find    99.54       1.42 / 1.27

Swap Random Adjacent Pair of Characters
            Found (%)   Rank
D-M Sound   29.99       N/A
3-Gram      98.76       1.19
Med-Find    99.99       1.10 / 1.08
44
Multiple Character Correction

Add Multiple Characters
              Found (%)   Rank
2 Chars
  DM Sound    16.40       N/A
  3-Gram      98.48       1.29
  Med-Find    99.55       1.17 / 1.15
3 Chars
  DM Sound    7.00        N/A
  3-Gram      97.11       1.46
  Med-Find    98.36       1.27 / 1.23
4 Chars
  DM Sound    3.90        N/A
  3-Gram      94.96       1.86
  Med-Find    96.79       1.38 / 1.31

Remove Multiple Characters
              Found (%)   Rank
2 Chars
  DM Sound    19.49       N/A
  3-Gram      96.21       1.67
  Med-Find    99.07       1.76 / 1.61
3 Chars
  DM Sound    8.52        N/A
  3-Gram      90.29       2.40
  Med-Find    95.21       2.54 / 2.13
4 Chars
  DM Sound    3.84        N/A
  3-Gram      81.83       3.08
  Med-Find    88.88       3.54 / 2.70
45
Multiple Character Correction

Replace Multiple Characters
              Found (%)   Rank
2 Chars
  DM Sound    12.34       N/A
  3-Gram      94.54       1.64
  Med-Find    97.88       1.57 / 1.40
3 Chars
  DM Sound    5.41        N/A
  3-Gram      87.95       2.08
  Med-Find    92.58       1.90 / 1.64
4 Chars
  DM Sound    3.05        N/A
  3-Gram      79.46       2.86
  Med-Find    85.42       2.19 / 1.78

Swap Multiple Characters
              Found (%)   Rank
2 Chars
  DM Sound    15.11       N/A
  3-Gram      76.36       3.99
  Med-Find    82.51       3.02 / 2.25
3 Chars
  DM Sound    7.60        N/A
  3-Gram      61.13       5.95
  Med-Find    66.85       3.89 / 2.70
4 Chars
  DM Sound    5.08        N/A
  3-Gram      48.91       7.51
  Med-Find    54.22       4.61 / 2.87
46
Collaborators
Key Personnel
Michlean Amir – USHMM
Rebecca Cathey – BAE Systems
Gideon Frieder – George Washington Univ.
Jason Soo – Georgetown/MITRE
Many comments by “prototype” users
47
J. Soo, R. Cathey, O. Frieder, M. Amir, and G. Frieder, “Yizkor Books: A Voice for the Silent Past,” ACM
Seventeenth Conference on Information and Knowledge Management (CIKM) – Industrial Track, Napa
Valley, California, October 2008.
J. Soo and O. Frieder, “On Foreign Name Search,” ACM Thirty-Second European Conference on
Information Retrieval (ECIR), Milton Keynes, United Kingdom, March 2010.
J. Soo and O. Frieder, “On Searching Misspelled Collections,” Journal of the Association for Information
Science and Technology (JASIS), 66(6), June 2015.
J. Soo and O. Frieder, “Revisiting Known-Item Retrieval in Degraded Document Collections," Document
Recognition and Retrieval (DRR), San Francisco, California, February 2016.
J. Soo and O. Frieder, “Searching Corrupted Document Collections," Twelfth IAPR Document
Analysis Systems (DAS), Santorini, Greece, April 2016.
References
48
Talk Outline (spanning Engineering Research and Computer Science)
Searching & Mining Social Media
Searching in Adverse Conditions
Complex Document Information Processing
49
Motivation
Public health surveillance demands considerable human effort and often yields delayed identification
Typically: a topic of interest is needed in advance
Ideally: detect without a predefined focus
Motivated to expedite detection
Is social media the answer?
50
Related Efforts
Social Media
Known topic problem
Detection of specific disease (Influenza)
Correlate occurrence of flu-related words with official
Influenza-like-illness data
Summarize influenza-related tweets
Complex solutions
Detect multiple health conditions via complex
learning algorithms
Use access-limited resources
Query logs
51
Hypothesis: Generation vs. Validation
Goal: extract more general health-related information from
social media streams
The Old Way:
Evaluate a pre-existing hypothesis using SM data
Q: “Is flu occurring more frequently?”
A: “Yes”
Our Way:
Generate a hypothesis from SM data
Q: “Are any illnesses occurring more frequently? If so,
which ones?”
A: “Yes, Flu”
52
Tweet Corpus
Collected by Johns Hopkins University (JHU)
2 billion tweets (May 2009 – Oct 2010)
Filtered multiple times to yield medically related tweets
Using a 20,000 health-related key-phrase list
High-recall / low-precision health tweets
An SVM classifier to increase precision
53
Framework: High Level View
54
Partition Corpus By Time
55
Frequent Word Set Identification
Preprocessing
Punctuation mark removal
Text lower-cased & tokenized
Stop-word removal
Duplicate term removal
Medical synonym expansion (MedSyn)
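The preprocessing steps above can be sketched as follows; the tiny stop-word list and the plain dict standing in for MedSyn are assumptions of the sketch.

```python
import string

STOP_WORDS = {"a", "the", "to", "with", "and", "i", "about", "up"}  # tiny illustrative list

def preprocess(tweet, synonyms=None):
    """Strip punctuation, lower-case and tokenize, map medical synonyms
    to a canonical form, and drop stop words and duplicate terms."""
    synonyms = synonyms or {}
    table = str.maketrans("", "", string.punctuation)
    seen, tokens = set(), []
    for tok in tweet.translate(table).lower().split():
        tok = synonyms.get(tok, tok)
        if tok in STOP_WORDS or tok in seen:
            continue
        seen.add(tok)
        tokens.append(tok)
    return tokens
```

Synonym mapping before deduplication means "influenza" and "flu" in the same tweet collapse to a single term, as the MedSyn step intends.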
56
Frequent Word Set Identification

#   Tweet Content
T1  Pounding headache, sore throat, low grade fever, flu
T2  Sleep, a perfect cure to forget about the pain!
T3  This morning woke up with fever, sore throat, and flu
T4  Cough, flu, sore throat. I couldn’t ask for a better combination
T5  Got you down? Fever, muscle aches, cough, …

Term Set            Support
flu, sore throat    3
fever               3
cough               2

Frequent Term Sets: {{flu, sore throat}, {fever}} (support threshold = 3)
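The support counts in this example can be reproduced with a brute-force miner over small term sets; a real system would use an Apriori-style frequent-itemset algorithm instead of enumerating every combination.

```python
from collections import Counter
from itertools import combinations

def frequent_term_sets(tweets, threshold, max_size=2):
    """Count the support of every term set up to max_size and keep
    those meeting the threshold (brute force, for illustration)."""
    support = Counter()
    for terms in tweets:
        for k in range(1, max_size + 1):
            for combo in combinations(sorted(terms), k):
                support[frozenset(combo)] += 1
    return {s: c for s, c in support.items() if c >= threshold}

# Term sets from the five example tweets (after preprocessing)
tweets = [
    {"headache", "sore throat", "fever", "flu"},   # T1
    {"sleep", "cure", "pain"},                     # T2
    {"morning", "fever", "sore throat", "flu"},    # T3
    {"cough", "flu", "sore throat"},               # T4
    {"fever", "muscle aches", "cough"},            # T5
]
frequent = frequent_term_sets(tweets, threshold=3)
```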
57
Decide “Is Trending”
isTrending(t) = isFrequent(t) AND ( prevalence(t) / prevalence(t−1) > growth_rate )
Word sets that are prevalent throughout are irrelevant, for example: {feel, sick}
Only word sets that satisfy “Is Trending” interest us
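The trending test can be sketched as a small function; treating isFrequent as a simple minimum-support check is an assumption, since the talk does not spell out its exact definition.

```python
def is_trending(prevalence, t, growth_rate, min_support):
    """A word set trends at time t if it is frequent (prevalence meets
    min_support) AND its prevalence grew by more than growth_rate
    relative to the previous period."""
    frequent = prevalence[t] >= min_support
    growing = prevalence[t] > growth_rate * prevalence[t - 1]
    return frequent and growing
```

With monthly prevalence counts [10, 11, 40], a growth_rate of 2.0 and min_support of 20, only the jump in the third period trends.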
58
Track Word Set Time Series
Time-series used to determine word sets with a
significant increase in prevalence
Two differing word set tracks by month
59
{feel, sick}: very frequent, does not trend
{allergies, feel}: trends in April and May
Trending Decision
Query a trending word set in Wikipedia
Why Wikipedia? Comprehensive range of topics, including health topics
Written in layman’s English, resembling the tweets considered
60
Query Wikipedia; Filter Wikipedia Results
Retrieved articles determine whether a frequent word set is health-related
Health-related nature judged by two metrics:
Ratio of medical tokens in the introduction
Presence of International Statistical Classification of Diseases and Related Health Problems (ICD) codes
61
Ratio of Medical Tokens
An article is health-related if the ratio of medical tokens in its introduction surpasses a threshold
Process: tokenize the introduction, remove stop words, then count the tokens and the medical tokens
If (# medical tokens / # tokens) > 0.75, then the article is health-related
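The ratio test translates directly into code; modeling the medical vocabulary as a plain set of terms is an assumption of this sketch.

```python
def is_health_related(intro_tokens, medical_vocab, threshold=0.75):
    """After stop-word removal, an article counts as health-related
    when more than 75% of its introduction tokens appear in the
    medical vocabulary."""
    if not intro_tokens:
        return False
    hits = sum(tok in medical_vocab for tok in intro_tokens)
    return hits / len(intro_tokens) > threshold
```

Note the strict inequality: an introduction in which exactly 75% of tokens are medical does not pass.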
62
ICD Codes
Health-related Wikipedia articles typically contain an info-box with ICD-9 & ICD-10 codes
An ICD code is a strong health-related indicator
[Figure: a Wikipedia article’s info-box showing its ICD codes]
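Presence of an ICD code can be checked with a simple pattern; real info-box markup varies across articles, so the regular expression below is an illustrative assumption rather than the system's actual detector.

```python
import re

# Matches info-box fields like "ICD-10: J10.1" or "ICD-9 487".
ICD_FIELD = re.compile(r"ICD-(?:9|10)\b[^A-Za-z0-9]{0,5}[A-Z]?\d{2,3}(?:\.\d{1,2})?")

def has_icd_code(infobox_text):
    """Presence of an ICD code: a strong health-related signal."""
    return bool(ICD_FIELD.search(infobox_text))
```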
63
Detection – 2010 Flu Season
Tweet time series, June 2009 – Oct 2010
Weekly flu cases in the US, June 2009 – Oct 2010
64
Social Media Mining Accuracy
Landing on the Hudson and the Mumbai Terror Attack
Flu Tweets (Lampos and Cristianini 2010; Culotta 2010)
…
Hurricane Sandy Coordination Communication
…
…
Fake Celebrity Deaths (Jeff Goldblum)
65
Sinus (Anatomy)
[Chart: fraction of cumulative signal (%) for Sinus (anatomy)]
66
Allergic Response
[Chart: fraction of cumulative signal (%) for Allergic Response and Sinus (anatomy)]
67
Food Allergy
[Chart: fraction of cumulative signal (%) for Food Allergy, Allergic Response, and Sinus (anatomy)]
68
Summary
Our Approach:
Filter a corpus to be topic specific
Identify trending word sets
Connect multiple trending word sets to topics of interest
Detect trending topic of interest – Generate Hypotheses
69
Future Work
Run the framework on a larger scale
Increase data volume: 2 billion → 200 billion tweets
Increase temporal resolution: months → weeks → days
Use resources besides Wikipedia and ICD to filter out non-medically related trending topics
Detect other types of trends by changing the filters to suit a new topic of interest
Deploy globally
70
Collaborators
Key Personnel
Nazli Goharian – Georgetown University
Alek Kolcz – Twitter / PushD
Jon Parker – Johns Hopkins / Georgetown / MITRE
Andrew Yates – Georgetown University
Many comments by “prototype” users
71
References
A. Yates, J. Parker, N. Goharian, and O. Frieder, “A Framework for Public Health Surveillance,” 9th Language Resources and Evaluation Conference (LREC-2014), Reykjavik, Iceland, May 2014.
J. Parker, A. Yates, N. Goharian, and O. Frieder, “Health Related Hypothesis Generation using Social Media Data,” Social Network Analysis and Mining, 5(7), March 2015.
A. Yates, N. Goharian, and O. Frieder, “Learning the Relationships between Drug, Symptom, and Medical Condition Mentions in Social Media,” AAAI 10th International Conference on Web and Social Media (ICWSM), Cologne, Germany, May 2016.
A. Yates, A. Kolcz, N. Goharian, and O. Frieder, “Effects of Sampling on Twitter Trend Detection,” 10th Language Resources and Evaluation Conference (LREC-2016), Portoroz, Slovenia, May 2016.
72
Summary
Complex Document Information Processing
The whole is greater than the sum of its parts
Searching is easy
Unless it is in adverse (misspelled) environments
Social Media Search: Surveillance in a positive light
Detecting outbreaks in their infancy
73
Thanks!
Questions?
74