Unstructured Machine Learning: Providing the link between Genetic Data and Published Research

Dr Tony C Smith
Reel Two, Inc.
9 Hartley Street
Hamilton, New Zealand
+64 7 839 7808
www.reeltwo.com

What is Machine Learning?

creating computer programs that get better with experience

learn how to make expert judgments

discover previously hidden, potentially useful information (data mining)

How does it work?

user provides learning system with examples of concept to be learned

induction algorithm infers a characteristic model of the examples

model is used to predict whether or not future novel instances are also examples – and it does this very consistently, and very, very quickly!

Structured Learning

Mushroom Data

Weight   Damage   Dirt    Firmness   Quality
heavy    high     mild    hard       poor
heavy    high     mild    soft       poor
normal   high     mild    hard       good
light    medium   mild    hard       good
light    clear    clean   hard       good
normal   clear    clean   soft       poor
heavy    medium   mild    hard       poor
. . .

[Figure: decision tree for the mushroom data. Root node: weight (branches: heavy, light, normal). Internal nodes: dirt (branches: mild, clean) and firmness (branches: hard, soft). Leaves: good or poor.]
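A minimal sketch of how an induction algorithm could infer a decision tree from the seven mushroom rows shown above. It assumes an ID3-style learner that splits on whichever attribute leaves the least label entropy; the slides do not name the algorithm actually used, and the function names (`entropy`, `split`, `build`, `predict`) are illustrative.

```python
from collections import Counter
from math import log2

ATTRS = ["weight", "damage", "dirt", "firmness"]
# The seven rows from the slide; the last column is the quality label.
ROWS = [
    ("heavy", "high", "mild", "hard", "poor"),
    ("heavy", "high", "mild", "soft", "poor"),
    ("normal", "high", "mild", "hard", "good"),
    ("light", "medium", "mild", "hard", "good"),
    ("light", "clear", "clean", "hard", "good"),
    ("normal", "clear", "clean", "soft", "poor"),
    ("heavy", "medium", "mild", "hard", "poor"),
]

def entropy(rows):
    """Shannon entropy of the quality labels in rows."""
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def split(rows, i):
    """Group rows by the value they have in column i."""
    groups = {}
    for r in rows:
        groups.setdefault(r[i], []).append(r)
    return groups

def build(rows, attr_idx):
    """Return a label (leaf) or a tuple (attribute name, {value: subtree})."""
    labels = {r[-1] for r in rows}
    if len(labels) == 1 or not attr_idx:
        return Counter(r[-1] for r in rows).most_common(1)[0][0]

    def remainder(i):
        # Weighted label entropy left over after splitting on column i.
        return sum(len(g) / len(rows) * entropy(g) for g in split(rows, i).values())

    best = min(attr_idx, key=remainder)
    rest = [i for i in attr_idx if i != best]
    return (ATTRS[best], {v: build(g, rest) for v, g in split(rows, best).items()})

def predict(tree, example):
    """Follow the branches matching example (a dict) until a leaf is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[example[attr]]  # raises KeyError for an unseen value
    return tree

tree = build(ROWS, list(range(len(ATTRS))))
print(tree[0])
print(predict(tree, {"weight": "heavy"}))
```

On these rows the first split is on `weight`, and every heavy mushroom is predicted `poor`. Ties between equally good attributes are broken by column order, so the lower levels may differ from the tree drawn on the slide.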

Unstructured Learning

data does not have fixed fields with specific values

examples: images, continuous signals, expression data, text

learning proceeds by correlating the presence or absence of any and all salient attributes

Document Classification

given examples of documents covering some topic, learn a semantic model that can recognize whether or not other documents are relevant

prioritize them: i.e. quantify “how relevant” documents are to the topic

not limited to keywords (nor is it misled by them)

adapt to the user’s needs (ephemeral or long-term)
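As a rough illustration of ranking documents by "how relevant" they are, the sketch below scores each document by summing per-word log-odds learned from example documents. This is a simplified, naive-Bayes-style model over word presence, chosen here for brevity; the slides do not say which model the system uses, and the example texts and function names are made up.

```python
import math
import re
from collections import Counter

def words(text):
    """Set of lowercase word tokens present in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def train(relevant_docs, other_docs):
    """Learn a log-odds weight per word from on-topic and off-topic examples."""
    rel = Counter(w for d in relevant_docs for w in words(d))
    oth = Counter(w for d in other_docs for w in words(d))
    n_rel, n_oth = len(relevant_docs), len(other_docs)
    # Add-one smoothing keeps unseen words from producing log(0).
    return {
        w: math.log((rel[w] + 1) / (n_rel + 2)) - math.log((oth[w] + 1) / (n_oth + 2))
        for w in set(rel) | set(oth)
    }

def relevance(model, doc):
    """Sum the weights of the known words present in doc (higher = more relevant)."""
    return sum(model.get(w, 0.0) for w in words(doc))

def rank(model, docs):
    """Order documents from most to least relevant."""
    return sorted(docs, key=lambda d: relevance(model, d), reverse=True)

# Hypothetical training examples and novel documents.
relevant = ["p38 MAP kinase activation by phosphorylation",
            "kinase cascade phosphorylates target protein"]
other = ["patent filing for a new bicycle frame",
         "quarterly sales report for the region"]
model = train(relevant, other)
ranked = rank(model, ["sales of bicycle parts", "protein kinase signalling cascade"])
print(ranked[0])
```

Because every word seen in training contributes a weight, the score is not tied to a fixed keyword list, and retraining on new examples adapts it to a different topic.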

How Text Mining Works

Users supply the system with training data
• Documents that are good examples of the desired category

The system builds ‘classifiers’
• Statistical models based on the training data

The system classifies novel data
• Identifies other documents about the desired category

Results are displayed or stored
• Files can be viewed, routed to end users or stored in databases
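The four steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: each 'classifier' is a plain word-count profile, novel documents are matched by cosine similarity, and the `threshold` of 0.1, the category names and the sample texts are all invented for the example.

```python
import math
import re
from collections import Counter, defaultdict

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_classifiers(training):
    """Step 2: one word-count profile per category, built from its training documents."""
    return {
        category: Counter(t for doc in docs for t in tokens(doc))
        for category, docs in training.items()
    }

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify(classifiers, doc):
    """Step 3: return the best-matching category and its similarity score."""
    vec = Counter(tokens(doc))
    scores = {cat: cosine(vec, profile) for cat, profile in classifiers.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

def run_pipeline(training, novel_docs, threshold=0.1):
    """Steps 1-4: train, classify novel documents, store hits per category."""
    classifiers = build_classifiers(training)
    store = defaultdict(list)
    for doc in novel_docs:
        category, score = classify(classifiers, doc)
        if score >= threshold:  # documents matching no category are dropped
            store[category].append((round(score, 3), doc))
    # Within each category, highest-scoring documents come first.
    return {cat: sorted(hits, reverse=True) for cat, hits in store.items()}

# Step 1: hypothetical training data supplied by the user.
training = {
    "kinases": ["activation of p38 MAP kinase", "protein kinase cascade signalling"],
    "patents": ["patent filing describes prior art", "claims of the patent application"],
}
novel = ["a new kinase cascade in yeast",
         "prior art search for a patent",
         "weather forecast for Hamilton"]
results = run_pipeline(training, novel)
print(results)
```

The returned dictionary stands in for the "displayed or stored" step; in practice the results could equally be written to a database or routed to users.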

Classification System

Client-specific categories

Familiar Windows-style interface

Drag-and-drop documents to create custom categories

Classified documents are ranked by relevance

View contents of individual documents – sentences are highlighted by their relevance to the category

The Gene Ontology – A Good First Step

The Initial Problem: Individual curators evaluate data differently

Protein Modification

MAPK-KK Cascade

Activation of p38 MAP Kinase

While scientists can agree to use the word "kinase," they must also agree to support this by stating how and why they use "kinase," and consistently apply it. Only in this way can they hope to compare gene products and find out if and how they are related.

The Initial Solution: The Gene Ontology (GO) – A controlled vocabulary with defined relationships between items.

GO consists of more than 13,000 nodes, or ‘GO Terms’, divided into three main trees: Biological Process, Cellular Component and Molecular Function

Of these, only about 3,800 GO Terms are ‘active’ – that is, terms annotated with more than just one or two publications.
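To make "a controlled vocabulary with defined relationships" concrete, here is a small sketch that stores terms with links to their parents and counts which terms are 'active'. The five-term hierarchy is a heavily abbreviated toy, not the actual GO graph, and the publication counts are invented purely for illustration.

```python
# Toy fragment of a Molecular Function branch: term -> parent terms.
# Hypothetical and abbreviated; the real ontology has many more levels and links.
PARENTS = {
    "molecular function": [],
    "catalytic activity": ["molecular function"],
    "kinase activity": ["catalytic activity"],
    "protein kinase activity": ["kinase activity"],
    "MAP kinase activity": ["protein kinase activity"],
}

# Invented numbers of publications annotated to each term.
PUBLICATIONS = {
    "molecular function": 40,
    "catalytic activity": 25,
    "kinase activity": 12,
    "protein kinase activity": 2,
    "MAP kinase activity": 1,
}

def ancestors(term):
    """All terms reachable by following parent links upward, nearest first."""
    seen = []
    stack = list(PARENTS[term])
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.append(t)
            stack.extend(PARENTS[t])
    return seen

def active_terms(min_pubs=3):
    """Terms with more than 'one or two' publications, read here as at least three."""
    return sorted(t for t, n in PUBLICATIONS.items() if n >= min_pubs)

print(ancestors("MAP kinase activity"))
print(active_terms())
```

Because each term lists its parents, a document annotated with a specific term can also be found under any of its more general ancestors.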

The Gene Ontology Knowledge Discovery System

GO is only a partial solution
• Enormous gap between GO-annotated docs (27,000) and full MEDLINE database (12 million entries)
• Updates lag behind
• Scientists must understand and agree to use the GO
• Knowledge changes and alters definitions

Using GO “as is” takes too long and delivers too little

GO KDS – Filling the gaps in GO
• GO KDS bridges the gap by classifying all of MEDLINE
• New documents are classified as they’re added
• Scientists can now annotate gene targets quickly and reliably
• GO KDS is updated along with GO and MEDLINE

GO KDS Interface Tour

• Current GO term(s) open
• Location of listed term in GO
• All sub-terms for the listed term: click on a term to further refine your search
• Enter a keyword to search in this GO category
• Opens abstract in separate window
• Color of stars identifies the GO branch; number of stars indicates confidence of category placement
• Original GO classifications (by domain expert)
• KDS discovers novel classifications

GO KDS Key Benefits

Quickly sort documents into the categories most relevant to the user

Replace laborious annotation by domain experts with a trainable, automated system

Discover conceptual links between previously unrelated scientific domains

Identify key articles for pertinent research

Integrate public, private and proprietary documents

www.go-kds.com

How is document classification useful?

Drug Approval
• Collecting information
• Organizing/Collating documents
• Satisfying approval criteria

Life Science Research
• Finding relevant literature
• Prioritizing articles/reports
• Discovering hidden connections
• Distributing information

Patent Preparation
• Searching patent databases
• Collecting relevant documents
• Synthesizing information

Intelligent Text Mining: Therapeutic Courses

One Reel Two client is using Classification System to rapidly sort through large volumes of medical documentation in disparate therapeutic areas.

The Problem: Client must generate E-Learning Courses from hundreds of pages of reports, literature and product documentation supplied by client

Old Solution: Manually read through documents to find paragraphs related to ‘Diagnosis’, ‘Etiology’, ‘Epidemiology’, etc.

New Solution: Use Reel Two Classification System to build a custom taxonomy, then automatically classify and extract relevant document sections into Therapeutic Area categories

Intelligent Text Mining – Patent Analysis

Identifying ‘Mechanism of Action’ in life science patents

Search patent filings for the ideas or concepts behind one’s analysis
– Explore state of prior art, competitive landscape or ‘innovation gaps’
– Overcome intentionally vague language in patent filings

Example Project

Patents are classified according to a taxonomy built by the client:

Alzheimer’s Patents
MoA: 5-HT Inhibitor
MoA: Acetylcholinesterase
MoA: Antioxidant
MoA: Antiviral…

Sample Output

In an in vitro assay, 2-chloro-5-(3-(R)-pyrrolidinylmethoxy)-3-pyridinecarbaldoxime (Ia) exhibited a Ki value for binding to neuronal nicotinic acetylcholine receptors of 0.012 nM.

ACTIVITY - Analgesic; neuroprotective; nootropic; antiparkinsonian; neuroleptic; tranquilizer; antiinflammatory; antidepressant; anabolic; anorectic; anticonvulsant; uropathic; gastrointestinal; antiaddictive; gynecological.

MECHANISM OF ACTION - Neurotransmitter release modulator.

The Mechanism of Action listed for this patent is "Neurotransmitter release modulator." However, Classification System identified that this chemical modulator binds to the acetylcholine receptor, which is the true mechanism of action, and classified this patent in “MoA: Acetylcholinesterase”.

“Life Science Information Management will form the largest unmet need for IT companies in the 21st Century”

Caroline Kovak, General Manager, IBM Life Sciences

Appendix: GO KDS Interface

1. Search for a particular GO term by opening one of the main branches.

2. ‘Drill down’ through the taxonomy to find a term of interest. Click on that term.

3. Select the desired GO term. ‘Open’ the category by clicking on ‘new search with this term.’

4. Scroll down to view abstracts.

5. Discover conceptual links to other GO categories. Click on the category to add the term to your search.

6. View the data intersection between GO categories. Scroll through to view abstract.

7. GO terms identify concepts embodied in the abstracts, enabling quick review.

8. Select an abstract of interest, and click to open the complete abstract.

9. The abstract will open in a new window, allowing you to continue with your search, or to link directly to the journal.
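The "data intersection between GO categories" in step 6 can be pictured as a set intersection over the abstracts assigned to each category. The sketch below assumes classification results are held as sets of abstract identifiers; the identifiers and assignments are hypothetical, and how GO KDS stores its results is not described in the slides.

```python
# Hypothetical classification results: GO category -> set of abstract identifiers.
assignments = {
    "protein kinase activity": {"abs-1", "abs-2", "abs-3"},
    "signal transduction": {"abs-2", "abs-3", "abs-4"},
    "nucleus": {"abs-3", "abs-5"},
}

def intersect(assignments, *categories):
    """Return the abstracts assigned to every one of the given categories."""
    if not categories:
        return set()
    return set.intersection(*(assignments[c] for c in categories))

both = intersect(assignments, "protein kinase activity", "signal transduction")
all_three = intersect(assignments, "protein kinase activity", "signal transduction", "nucleus")
print(sorted(both))
print(sorted(all_three))
```

Adding a further category to the search, as in step 5, can only keep the result the same or narrow it.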
