Mining the Medical Literature
Chirag Bhatt, October 14th, 2004
Why MINE data!
Medical, genomics, proteomics research
Find causal links between symptoms or diseases and drugs or chemicals
Gene comparison
An example
Problem: What is causing an uncharacteristic behavior in protein production?
Solution: Find which genes have a role to play in amino acid synthesis.
How? Search through online literature for genes that play a role in amino acid synthesis.
Search vs. Discover
                              Search (goal oriented)    Discover (opportunistic)
Structured data (database)    Data retrieval            Data mining
Unstructured data (text)      Information retrieval     Text mining
Data Retrieval
Company database, e.g. customer records, product inventory
Search entity: (structured) records
Query (goal-driven): What is the address of our client? How many widgets are in stock?
SQL, Oracle, DB2, etc.
Information Retrieval
Google, A9, AltaVista
Query (goal-driven)
Search entity: (unstructured) documents in variable formats (HTML, PDF, etc.)
Data Mining
Structured data set
Generally a large amount of (historical) data
Find relations, patterns, or trends in the database (opportunistic)
E.g. “beer and diapers”
Text Mining
Unstructured data set: documents, publications, abstracts, web pages
Discover useful and previously unknown “gems” of information in large text collections using patterns, trends, and domain knowledge
Need for mining text
Approximately 90% of the world’s data is held in unstructured formats (source: Oracle Corporation)
Why Text Mining in Medical Literature?
Many multi-functional genes: screen for the functionally interesting ones
Complexity of needs is increasing: individual genes -> families of genes
Manual text mining? Not really!
Availability of published literature online
Functionally Coherent Genes
A group of genes that exhibit similar experimental features
E.g. amino acid metabolism, electron transport, stress response
Difficulties
Difficulties faced in finding functionally coherent genes:
Most genes express multi-functionality
Some genes have been studied extensively and some have only just been discovered
Semantic neighbor
Two articles are semantic neighbors if they have similar word usage
Use statistical natural language processing to access and interpret online text
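A minimal sketch of what "similar word usage" could mean in practice, assuming a plain bag-of-words representation and cosine similarity. The slides do not say which weighting or similarity measure was used, so both are assumptions, and the three toy abstracts are made up for illustration.

```python
import math
import re
from collections import Counter

def word_vector(text):
    """Bag-of-words counts for one article (lowercased alphabetic tokens)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_neighbors(articles, k=2):
    """Map each article id to the ids of its k most similar articles."""
    vectors = {aid: word_vector(text) for aid, text in articles.items()}
    neighbors = {}
    for aid, vec in vectors.items():
        ranked = sorted(
            (other for other in vectors if other != aid),
            key=lambda other: cosine(vec, vectors[other]),
            reverse=True,
        )
        neighbors[aid] = ranked[:k]
    return neighbors

# Toy abstracts (hypothetical text, not real articles).
articles = {
    "a1": "arginine biosynthesis gene regulates amino acid metabolism",
    "a2": "amino acid metabolism requires the arginine biosynthesis pathway",
    "a3": "electron transport chain in the mitochondrial membrane",
}
print(semantic_neighbors(articles, k=1))
```

A real system would likely down-weight very common words (for example with inverse document frequency) before comparing articles.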
Methodology
Find semantic neighbors in the document set
If any article about a common functionality refers to at least one gene in the group, then the group is functionally coherent
Neighbor divergence
Scoring method
Each article's relevance to the gene group is scored by the number of its semantic neighbors that have references to the group
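The scoring rule can be sketched as a simple count. This assumes two inputs that the slides do not spell out: a precomputed neighbor list per article and a mapping from each article to the genes it references. The names `neighbors`, `article_genes`, and the article ids are hypothetical, and the gene symbols are used only as labels.

```python
def neighbor_scores(neighbors, article_genes, gene_group):
    """Score each article by how many of its neighbors mention a group gene."""
    group = set(gene_group)
    return {
        aid: sum(1 for n in nbrs if group & article_genes.get(n, set()))
        for aid, nbrs in neighbors.items()
    }

# Hypothetical neighbor lists and gene references.
neighbors = {
    "a1": ["a2", "a3"],
    "a2": ["a1", "a4"],
    "a3": ["a1", "a2"],
    "a4": ["a3"],
}
article_genes = {"a1": {"ARG1"}, "a2": {"ARG3"}, "a3": {"CYC1"}, "a4": set()}

scores = neighbor_scores(neighbors, article_genes, ["ARG1", "ARG3"])
print(scores)
```

Here article a3 scores highest because both of its neighbors reference a gene in the group, while a4 scores zero.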
Neighbor divergence scores
If the score distribution is different from Poisson, then the gene group represents a biological function
The log ratio for a Poisson distribution should be flat along the horizontal axis
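One way to quantify "different from Poisson" is the Kullback-Leibler divergence between the observed score distribution and a Poisson distribution, which averages the log ratio mentioned above. The sketch below assumes the Poisson rate is estimated as the mean article score; the slides do not state how the rate was chosen, and the two score lists are invented for illustration.

```python
import math
from collections import Counter

def poisson_pmf(k, lam):
    """Probability of observing k events under a Poisson with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def neighbor_divergence(scores):
    """KL divergence (in nats) of the empirical score distribution
    from a Poisson distribution with the same mean."""
    n = len(scores)
    lam = sum(scores) / n  # assumption: rate estimated from the data
    divergence = 0.0
    for k, count in Counter(scores).items():
        p = count / n                # observed fraction of articles with score k
        q = poisson_pmf(k, lam)      # fraction expected under Poisson
        divergence += p * math.log(p / q)
    return divergence

# Invented examples: a few articles with very high scores vs. Poisson-like scores.
coherent_like = [0] * 90 + [10] * 10
random_like = [0] * 37 + [1] * 37 + [2] * 18 + [3] * 6 + [4] * 2

print(neighbor_divergence(coherent_like))
print(neighbor_divergence(random_like))
```

The first list has the heavy tail one would expect if a handful of articles sit in a neighborhood dense with group genes, so its divergence is much larger than that of the second list.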
Need to filter results
Generally, well-studied genes tend to have semantic neighbors that refer to the same gene
Such a neighbor may not be relevant to the group function, but it increases the score: a false positive
So only articles that refer to different genes are considered
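A sketch of this filter, under the assumption that "different genes" means a neighbor only counts when it references a group gene that the scored article does not itself reference. That reading, and the names `neighbors` and `article_genes`, are assumptions rather than details taken from the slides.

```python
def filtered_neighbor_scores(neighbors, article_genes, gene_group):
    """Count neighbors that mention a group gene the article itself does not mention."""
    group = set(gene_group)
    scores = {}
    for aid, nbrs in neighbors.items():
        own = article_genes.get(aid, set())
        scores[aid] = sum(
            1 for n in nbrs if (group & article_genes.get(n, set())) - own
        )
    return scores

# Hypothetical data: a2 only repeats the gene a1 already mentions.
neighbors = {"a1": ["a2", "a3"]}
article_genes = {"a1": {"ARG1"}, "a2": {"ARG1"}, "a3": {"ARG3"}}

print(filtered_neighbor_scores(neighbors, article_genes, ["ARG1", "ARG3"]))
```

Without the filter, a1 would score 2; with it, the neighbor that only repeats ARG1 is ignored and a1 scores 1.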
Evaluation
Report the percentile of a functional group of genes
Calculate precision and recall at different cutoff levels (next slide)
Replace legitimate genes with irrelevant genes in the group
Precision and Recall
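As a reminder of how the two measures behave at different cutoffs, here is a small sketch. It assumes each gene group has a single divergence score and that groups at or above the cutoff are predicted to be functional. The group names and scores are invented and are not results from the talk.

```python
def precision_recall(scored_groups, functional, cutoff):
    """Treat groups scoring at or above the cutoff as predicted functional."""
    predicted = {g for g, s in scored_groups.items() if s >= cutoff}
    true_pos = len(predicted & functional)
    precision = true_pos / len(predicted) if predicted else 1.0
    recall = true_pos / len(functional) if functional else 1.0
    return precision, recall

# Invented scores: g* are known functional groups, r* are random groups.
scored_groups = {"g1": 2.1, "g2": 1.4, "g3": 0.3, "r1": 0.9, "r2": 0.1, "r3": 0.05}
functional = {"g1", "g2", "g3"}

for cutoff in (0.2, 1.0):
    print(cutoff, precision_recall(scored_groups, functional, cutoff))
```

Raising the cutoff removes the random group r1 from the predictions, which improves precision, but it also drops the functional group g3, which lowers recall.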
Results
Sample space: 19 known yeast groups and 1900 random groups
Results
Replacing functional genes
Limitations of neighbor divergence
Neighbor divergence helps group genes; it does not tell us their function
Work is based on abstracts only
Searching the entire literature may prove challenging: break it into smaller components
Another mining approach
Extracting synonymous gene and protein terms
Why find synonyms?
Genes and proteins are often associated with multiple names across articles and subdomains
More names keep getting added as new functional or structural information is discovered
Improve search and analysis
Current work
Biological databases such as GenBank and SWISSPROT include synonyms
Not up to date
Disagreement on some synonyms
Laborious manual curation and review
Need for automation
Two-step problem
Identifying gene and protein names: done by state-of-the-art taggers
Determining whether these names are synonymous: we’ll discuss more on this…
Current synonym approaches
Synonymous gene and protein names represent the same biological substance
Exhibit identical biological functions
Same gene or amino acid sequences
Other approaches
String matching
Matching abbreviations to full forms
Gene and Protein Tagging
Identification step
Uses BLAST techniques and domain knowledge to pick out gene and protein terms
Heuristics
Synonyms usually occur within the same sentence
Synonyms are mentioned in the first few pages of an article
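The same-sentence heuristic suggests a simple way to generate candidate pairs: keep only pairs of tagged names that co-occur in one sentence. The sketch below stands in for the tagger with a hard-coded set of names and uses a naive sentence splitter. Both are simplifying assumptions, and the gene names and the example text are made up.

```python
import re
from itertools import combinations

def candidate_pairs(text, known_names):
    """Return unordered name pairs that co-occur within one sentence."""
    pairs = set()
    # Naive splitter: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        found = sorted(
            name for name in known_names
            if re.search(r"\b" + re.escape(name) + r"\b", sentence)
        )
        pairs.update(combinations(found, 2))
    return pairs

# Made-up names standing in for tagger output.
text = "FOO1, also known as BAR2, regulates BAZ3. QUX4 is studied in yeast."
names = {"FOO1", "BAR2", "BAZ3", "QUX4"}
print(candidate_pairs(text, names))
```

Note that co-occurrence alone also pairs FOO1 with BAZ3, which is a regulation relation and not a synonym, so a later step still has to decide which candidates are true synonyms.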
Synonym detection approaches
Unsupervised - ‘Similarity’: based on contextual similarity
Semi-supervised - ‘Snowball’: extracts structured relations using patterns
Supervised - Text Classification/SVM
Hand-crafted extraction - GPE
Combined system
Combined Approach
Combine the output of Snowball, SVM, and GPE
Each system gives a confidence score for each synonym pair
where s = <p1,p2> is a synonym pair and ConfE(s) is the confidence assigned to s by the individual extraction system E
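The combining formula itself is not reproduced in this text. One common way to merge per-system confidences is a noisy-OR, which treats the systems as independent and asks how likely it is that at least one of them is right. The sketch below uses that rule as an assumption; it is not necessarily the formula from the slide, and the confidence values are invented.

```python
def combined_confidence(confidences):
    """Noisy-OR combination (an assumed rule): 1 - product of (1 - ConfE(s)) over systems E."""
    remaining_doubt = 1.0
    for conf in confidences.values():
        remaining_doubt *= 1.0 - conf
    return 1.0 - remaining_doubt

# Invented confidences for one synonym pair s = <p1,p2>.
pair_conf = {"Snowball": 0.6, "SVM": 0.5, "GPE": 0.9}
print(combined_confidence(pair_conf))
```

With this rule the combined confidence is never lower than the most confident individual system, and a pair found by several systems ranks above a pair found by only one of them with the same score.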
Unsupervised - Similarity
Context based: all words occurring within an ‘x’-word window
False positives are very common
Run time: O(|lexicon|^3)
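A sketch of the context-window idea: collect the words within x tokens of each occurrence of a name, then compare names by the similarity of those context counts. Cosine similarity and the window size of 2 are assumptions, and the tokenized text and gene names are made up.

```python
import math
from collections import Counter

def context_vector(tokens, term, window=2):
    """Count words appearing within `window` tokens of each occurrence of `term`."""
    ctx = Counter()
    for i, tok in enumerate(tokens):
        if tok == term:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            ctx.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return ctx

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up, pre-tokenized text.
tokens = (
    "FOO1 binds the promoter region . "
    "BAR2 binds the promoter region . "
    "BAZ3 encodes a kinase"
).split()

foo = context_vector(tokens, "FOO1")
bar = context_vector(tokens, "BAR2")
baz = context_vector(tokens, "BAZ3")
print(cosine(foo, bar), cosine(foo, baz))
```

FOO1 and BAR2 come out as similar because they appear in the same contexts, but two unrelated genes that both bind a promoter would look just as similar, which is one reason false positives are common with this approach.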
Semi-supervised - Snowball
Manual feedback mechanism
Supervised – Text Classification
Input: known synonym pairs
Automatically find contexts and assign weights
Train a classifier to distinguish between ‘positive’ and ‘negative’ contexts
E.g. ‘A also known as B’ (positive) and ‘A regulates B’ (negative)
Why Combined Approach?
Snowball and SVM are machine-learning based: they capture synonyms that may be missed by GPE
GPE is knowledge-based
Snowball and SVM have many false positives
Combine the advantages of both
Results
Summary
Text mining
Semantic neighbor
Neighbor divergence
Precision and Recall
Synonym detection approaches
Comments / Questions?