Mining the Medical Literature Chirag Bhatt October 14th, 2004

Page 1

Mining the Medical Literature

Chirag Bhatt
October 14th, 2004

Page 2

Why MINE data!

  • Medical, genomics, and proteomics research
  • Find causal links between symptoms or diseases and drugs or chemicals
  • Gene comparison

Page 3

An example

  • Problem: What is causing an uncharacteristic behavior in protein production?
  • Solution: Find which genes have a role to play in amino acid synthesis
  • How? Search through online literature for genes that play a role in amino acid synthesis

Page 4

Search vs. Discover

                              Search (goal oriented)    Discover (opportunistic)
Structured data (database)    Data retrieval            Data mining
Unstructured data (text)      Information retrieval     Text mining

Page 5

Data Retrieval

  • Company database, e.g. customer records, product inventory
  • Search entity: (structured) records
  • Query (goal-driven): What is the address of our client? How many widgets are in stock?
  • SQL, Oracle, DB2, etc.
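
A goal-driven query against structured records might look like the minimal sketch below. It uses Python's built-in sqlite3 module only for convenience; the `inventory` table, its columns, and the rows are invented for illustration and do not come from the talk.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (product TEXT PRIMARY KEY, in_stock INTEGER)")
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?)",
    [("widget", 42), ("gadget", 7)],
)

def units_in_stock(connection, product):
    """Goal-driven query: look up one known field of one known record."""
    row = connection.execute(
        "SELECT in_stock FROM inventory WHERE product = ?", (product,)
    ).fetchone()
    return row[0] if row else None

# "How many widgets are in stock?"
widgets = units_in_stock(conn, "widget")
```

The point of the sketch is that the user already knows which record and which field to ask for; nothing new is discovered.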

Page 6

Information Retrieval

  • Google, A9, AltaVista
  • Query (goal-driven)
  • Search entity: (unstructured) documents
  • Variable format: HTML, PDF, etc.

Page 7

Data Mining

  • Structured data set
  • Generally a large amount of (historical) data
  • Find relations, patterns, or trends in the database (opportunistic)
  • E.g. “beer and diaper”
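
The "beer and diaper" pattern is usually described in terms of co-occurrence counts. The sketch below shows one simple way to compute pair support and rule confidence; the baskets are toy data invented for illustration, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions; not data from the talk.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def pair_support(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

def confidence(baskets, antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket)."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

support = pair_support(baskets)
beer_to_diapers = confidence(baskets, "beer", "diapers")
```

No query names the pattern in advance; every pair is counted and the frequent ones surface on their own, which is what makes the search opportunistic.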

Page 8

Text Mining

  • Unstructured data set: documents, publications, abstracts, web pages
  • Discover useful and previously unknown “gems” of information in large text collections using patterns, trends and domain knowledge

Page 9

Need for mining text

  • Approximately 90% of the world’s data is held in unstructured formats (source: Oracle Corporation)

Page 10

Why Text Mining in Medical Literature?

  • Many multi-functional genes: screen functionally interesting ones
  • Complexity of needs increasing: individual genes -> family of genes
  • Manual text mining? Not really!
  • Availability of published literature online

Page 11

Functionally Coherent Genes

  • Group of genes that exhibit similar experimental features
  • Amino acid metabolism, electron transport, stress response

Page 12

Difficulties

  • Difficulties faced in finding functionally coherent genes
  • Most genes express multi-functionality
  • Some genes have been studied extensively and some only just discovered

Page 13

Semantic neighbor

Two articles are semantic neighbors if they have similar word usage

Use statistical natural language processing to access and interpret online text
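
"Similar word usage" can be made concrete in many ways. The sketch below assumes one common choice, cosine similarity between raw word-count vectors; the slides do not say which measure or weighting was used, and the abstracts and function names here are hypothetical.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def semantic_neighbors(query, corpus, k=2):
    """Return ids of the k articles whose word usage is closest to query."""
    q = word_counts(query)
    ranked = sorted(
        corpus,
        key=lambda doc_id: cosine_similarity(q, word_counts(corpus[doc_id])),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical abstracts, invented for illustration.
corpus = {
    "a1": "arginine biosynthesis enzyme in yeast",
    "a2": "yeast enzyme controls arginine biosynthesis pathway",
    "a3": "heat shock stress response protein",
}
nearest = semantic_neighbors("arginine biosynthesis in yeast", corpus, k=2)
```

A production system would normally weight terms (for example by inverse document frequency) and drop stop words; the sketch leaves both out to stay short.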

Page 14

Methodology

Page 15

Methodology

  • Find semantic neighbors in document set
  • If any article about a common function refers to at least one gene in the group, then the group is functionally coherent

Page 16

Neighbor divergence

  • Scoring method
  • Each article’s relevance to the gene group is scored by counting the number of its semantic neighbors that have references to the group
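
The scoring rule on this slide can be sketched as a simple count. The data structures below (a neighbor list per article and a set of referenced genes per article) are assumptions made for illustration, and the article ids and gene labels are placeholders rather than results from the talk.

```python
def article_scores(neighbors, article_genes, gene_group):
    """Score each article by how many of its semantic neighbors
    reference at least one gene in the group."""
    group = set(gene_group)
    scores = {}
    for article, nbrs in neighbors.items():
        scores[article] = sum(
            1 for n in nbrs if group & article_genes.get(n, set())
        )
    return scores

# Hypothetical inputs: which genes each article references, and each
# article's precomputed semantic neighbors.
article_genes = {
    "a1": {"ARG1"},
    "a2": {"ARG3"},
    "a3": {"HSP104"},
    "a4": set(),
    "a5": {"ARG1", "CPA1"},
}
neighbors = {
    "a1": ["a2", "a5"],
    "a2": ["a1", "a3"],
    "a3": ["a4", "a2"],
    "a4": ["a3"],
    "a5": ["a1", "a2"],
}
scores = article_scores(neighbors, article_genes, {"ARG1", "ARG3", "CPA1"})
```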

Page 17

Neighbor divergence scores

  • If the score distribution is different from a Poisson distribution, then the gene group represents a biological function
  • The log ratio for a Poisson distribution should be flat along the horizontal axis
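
One way to quantify "different from Poisson" is the Kullback-Leibler divergence between the observed score distribution and a Poisson with the same mean. The slide does not spell out the exact statistic, so both the choice of KL divergence and the mean-matching step are assumptions in this sketch; the score lists are invented.

```python
import math
from collections import Counter

def poisson_pmf(k, lam):
    """Probability of observing k events under a Poisson with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def divergence_from_poisson(scores):
    """KL divergence (in nats) of the empirical score distribution from a
    Poisson with the same mean; 0 means the two match on observed scores."""
    n = len(scores)
    lam = sum(scores) / n
    if lam == 0:
        return 0.0  # all scores are zero, identical to Poisson(0)
    total = 0.0
    for k, count in Counter(scores).items():
        p = count / n
        # Each term is p * log ratio of observed to expected probability.
        total += p * math.log(p / poisson_pmf(k, lam))
    return total

# Hypothetical article scores for two gene groups.
random_like = [0, 1, 0, 2, 1, 0, 1, 0]
clustered = [0, 0, 0, 0, 0, 0, 5, 5]
d_random = divergence_from_poisson(random_like)
d_clustered = divergence_from_poisson(clustered)
```

Scores that look like random scatter stay close to the Poisson baseline, while a group whose references pile up on a few articles diverges strongly.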

Page 18

Need to filter results

  • Generally, well-studied genes tend to have semantic neighbors that refer to the same gene
  • Such a neighbor may not be relevant to the group function, but it increases the score: a false positive
  • So only articles that refer to different genes are considered
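
The filter can be read as: a neighbor counts only if it references a group gene that the scored article does not already reference. That reading is an interpretation of the slide, not a confirmed detail of the method, and the names and data below are hypothetical.

```python
def filtered_score(article, neighbors, article_genes, gene_group):
    """Count neighbors that reference a group gene which the scored
    article itself does not reference."""
    group = set(gene_group)
    own = article_genes.get(article, set())
    return sum(
        1
        for n in neighbors.get(article, [])
        if (article_genes.get(n, set()) & group) - own
    )

# Hypothetical data: a1 and a2 both discuss the same well-studied gene.
genes_by_article = {
    "a1": {"ARG1"},
    "a2": {"ARG1"},
    "a3": {"ARG3"},
    "a4": {"ARG1", "CPA1"},
}
neighbor_lists = {"a1": ["a2", "a3", "a4"]}
score_a1 = filtered_score(
    "a1", neighbor_lists, genes_by_article, {"ARG1", "ARG3", "CPA1"}
)
```

Here a2 is ignored because it only repeats the gene a1 already mentions, so it cannot inflate the score.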

Page 19

Evaluation

  • Report percentile of a functional group of genes
  • Calculate precision and recall at different cutoff levels (next slide)
  • Replace legitimate genes with irrelevant genes in the group

Page 20

Precision and Recall
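
For reference, precision and recall at a score cutoff can be computed as in the sketch below. The scores and labels are invented, and treating "score at or above the cutoff" as a positive prediction is an assumption about how the cutoff is applied.

```python
def precision_recall(scored_groups, cutoff):
    """scored_groups: list of (score, is_functional) pairs.
    A group is predicted functional when its score is >= cutoff."""
    tp = sum(1 for s, f in scored_groups if s >= cutoff and f)
    fp = sum(1 for s, f in scored_groups if s >= cutoff and not f)
    fn = sum(1 for s, f in scored_groups if s < cutoff and f)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical scores: True marks a known functional group.
scored = [
    (9.0, True), (7.5, True), (6.0, False),
    (3.0, True), (1.0, False), (0.5, False),
]
loose = precision_recall(scored, cutoff=5.0)
strict = precision_recall(scored, cutoff=8.0)
```

Raising the cutoff trades recall for precision, which is why both are reported at several cutoff levels.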

Page 21

Results

  • Sample space: 19 known yeast groups and 1900 random groups

Page 22

Results

Page 23

Replacing functional genes

Page 24

Limitations of neighbor divergence

  • Neighbor divergence helps group genes; it does not tell us their function
  • Work is based on abstracts only
  • Entire literature search may prove challenging
  • Break into smaller components

Page 25

Another mining approach

Extracting synonymous gene and protein terms

Page 26

Why find synonyms?

  • Genes and proteins are often associated with multiple names across articles and subdomains
  • More names keep getting added as new functional or structural information is discovered
  • Improve search and analysis

Page 27

Current work

  • Biological databases such as GenBank and SWISSPROT include synonyms
  • Not up to date
  • Disagreement on some synonyms
  • Laborious manual curation and review
  • Need for automation

Page 28

Two-step problem

  • Identifying gene and protein names: done by state-of-the-art taggers
  • Determining whether these names are synonymous: we’ll discuss more on this…

Page 29

Current synonym approaches

  • Synonymous gene and protein names represent the same biological substance: they exhibit identical biological functions and have the same gene or amino acid sequences
  • Other approaches: string matching; matching abbreviations to full forms
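
Matching abbreviations to full forms can be as simple as comparing initials. The sketch below is a deliberately naive version of that idea, written for illustration only; it is not the matcher used in any of the systems discussed, and the function name is hypothetical.

```python
def matches_abbreviation(abbreviation, full_form):
    """True when the abbreviation's letters are the initials of the
    words in full_form (hyphens are treated as word breaks)."""
    words = full_form.replace("-", " ").split()
    initials = "".join(w[0] for w in words)
    return abbreviation.lower() == initials.lower()

pka_ok = matches_abbreviation("PKA", "protein kinase A")
hsp_ok = matches_abbreviation("HSP", "heat shock protein")
mismatch = matches_abbreviation("PKC", "protein kinase A")
```

Initials alone miss many real cases (abbreviations that skip words or take several letters from one word), which is one reason the context-based approaches on the later slides are needed.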

Page 30

Gene and Protein Tagging

  • Identification step
  • Uses BLAST techniques and domain knowledge to pick out gene and protein terms
  • Heuristics: synonyms usually occur within the same sentence; synonyms are mentioned in the first few pages of an article

Page 31

Synonym detection approaches

  • Unsupervised - ‘Similarity’: based on contextual similarity
  • Semi-supervised - ‘Snowball’: extracts structured relations using patterns
  • Supervised - Text Classification/SVM
  • Hand-crafted extraction - GPE
  • Combined system

Page 32

Combined Approach

  • Combine output of SnowBall, SVM, and GPE
  • Each system gives a confidence score for each synonym pair
  • In the combined score, s = <p1, p2> is a synonym pair and Conf_E(s) is the confidence assigned to s by the individual extraction system E
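
The combination formula itself is not preserved in this transcript. As an illustration only, the sketch below assumes a "noisy-or" rule, 1 minus the product of (1 - Conf_E(s)) over the systems E, which is one standard way to merge independent confidence scores. The per-system values are invented.

```python
def combine_confidences(confidences):
    """Noisy-or combination (an assumed rule, not taken from the slide):
    1 - product over systems E of (1 - Conf_E(s))."""
    remaining_doubt = 1.0
    for system, conf in confidences.items():
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence from {system} must be in [0, 1]")
        remaining_doubt *= 1.0 - conf
    return 1.0 - remaining_doubt

# Hypothetical confidences for one synonym pair s = <p1, p2>.
pair_conf = {"Snowball": 0.6, "SVM": 0.5, "GPE": 0.0}
combined = combine_confidences(pair_conf)
```

Under this rule a pair found by several systems scores higher than any single system's confidence, and a system that did not extract the pair (confidence 0) leaves the score unchanged.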

Page 33

Unsupervised - Similarity

  • Context based: all words occurring within an ‘x’-word window
  • False positives are very common
  • Run time: O(|lexicon|^3)
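
A window-based context comparison can be sketched as follows. The window size, the Jaccard overlap measure, and the placeholder terms `genea`, `geneb`, and `genec` are all assumptions made for illustration; the slide does not say how contexts were compared.

```python
from collections import Counter

def context_vector(tokens, term, window=2):
    """Count words within `window` positions of each occurrence of term."""
    context = Counter()
    for i, tok in enumerate(tokens):
        if tok == term:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            context.update(
                t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
            )
    return context

def overlap(a, b):
    """Jaccard overlap of two context vocabularies (0.0 to 1.0)."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0

# Hypothetical sentence with placeholder gene names.
tokens = (
    "the genea protein binds dna while the geneb protein binds dna "
    "but genec degrades rna"
).split()
ctx_a = context_vector(tokens, "genea")
ctx_b = context_vector(tokens, "geneb")
ctx_c = context_vector(tokens, "genec")
sim_ab = overlap(ctx_a, ctx_b)
sim_ac = overlap(ctx_a, ctx_c)
```

Terms that share a context are not necessarily synonyms (two different genes can both "bind dna"), which illustrates why false positives are so common with this approach.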

Page 34

Semi-supervised - Snowball

Manual feedback mechanism

Page 35

Supervised – Text Classification

  • Input: known synonym pairs
  • Automatically find contexts and assign weights
  • Train classifier to distinguish between ‘positive’ and ‘negative’ contexts
  • E.g. ‘A also known as B’ (positive) and ‘A regulates B’ (negative)
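
To make the idea of weighted contexts concrete, the sketch below swaps the SVM for a much cruder stand-in: each word gets +1 per positive training context and -1 per negative one, and a new context is positive when its summed weight is above zero. This is not the classifier from the talk, and the training contexts are invented.

```python
from collections import Counter

def train_weights(examples):
    """examples: list of (context_text, is_synonym) pairs.
    Each word gains +1 per positive context and -1 per negative one."""
    weights = Counter()
    for text, is_synonym in examples:
        for word in text.lower().split():
            weights[word] += 1 if is_synonym else -1
    return weights

def is_synonym_context(weights, text):
    """Predict 'positive' when the summed word weights are above zero."""
    return sum(weights[w] for w in text.lower().split()) > 0

# Hypothetical contexts: the words found between term A and term B.
training = [
    ("also known as", True),
    ("also called", True),
    ("regulates", False),
    ("binds to", False),
]
weights = train_weights(training)
positive = is_synonym_context(weights, "is also known as")
negative = is_synonym_context(weights, "regulates expression of")
```

An SVM learns the weights by maximizing the margin between the two classes instead of counting, but the prediction step has the same linear form.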

Page 36

Why Combined Approach?

  • SnowBall and SVM are machine-learning based: they capture synonyms that may be missed by GPE
  • GPE is knowledge-based
  • SnowBall and SVM have many false positives
  • Combine the advantages of both

Page 37

Results

Page 38

Summary

  • Text mining
  • Semantic neighbor
  • Neighbor divergence
  • Precision and recall
  • Synonym detection approaches

Comments / Questions?