UNIVERSITÀ DI PAVIA
A UMLS-Based System for Literature-Based Discovery in Medicine
Matteo Gabetta
MEDINFO, Copenhagen, August 21st 2013
Angelo Nuzzo, IIT@SEMM, Milan, 2011 | MEDINFO 2013 - Copenhagen, August 21st 2013 | Matteo Gabetta
Literature Based Discovery (LBD)
Discover unknown relationships among scientific knowledge
Swanson DR: “Fish oil, Raynaud’s syndrome, and undiscovered public knowledge”. Perspectives in Biology and Medicine 1986, 30(1):7-18.
Literature Based Discovery
• Methods of discovery: OPEN vs. CLOSED
• Sources of knowledge: Abstract, Full Text, MeSH, …
• Knowledge representation: Concepts, (groups of) words
• Knowledge extraction: Text mining techniques
• Relationship measurement: Citation frequency, association rules, …
• Process automation: User interaction level
System characteristics
• Methods of discovery: OPEN discovery
• Sources of knowledge: Abstract
• Knowledge representation: UMLS concepts
• Knowledge extraction: Text mining techniques
• Relationship measurement: Support/Confidence from association rule theory
• Process automation: Highly interactive discovery process
System characteristics
Moreover:
• Co-cited UMLS concepts = related concepts
• Semantic Types used for filtering
• Literature-Mining Database as a persistence layer
Technologies:
• Java
• Entrez Programming Utilities – eUtils
• GWT – Google Web Toolkit
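The first two points can be illustrated with a small sketch. It is written in Python for brevity (the system itself is Java-based), and it is not the authors' code: the concept identifiers, the toy abstracts, and the function name `co_cited` are hypothetical. It counts how many abstracts cite a concept together with a starting concept and keeps only the concepts whose Semantic Type is allowed.

```python
from collections import Counter

# Hypothetical toy data: each abstract is the set of UMLS concept IDs
# that a text-mining step is assumed to have found in it.
abstracts = [
    {"C_A", "C_B1", "C_B2"},
    {"C_A", "C_B1"},
    {"C_A", "C_B3"},
    {"C_B2", "C_B3"},
]

# Hypothetical concept -> Semantic Type mapping (T028 is "Gene or Genome";
# the other type codes are used here for illustration only).
semantic_type = {"C_A": "T047", "C_B1": "T028", "C_B2": "T116", "C_B3": "T028"}


def co_cited(abstracts, start, semantic_type, allowed_types):
    """Count abstracts that cite each concept together with `start`,
    keeping only concepts whose Semantic Type is in `allowed_types`."""
    counts = Counter()
    for concepts in abstracts:
        if start not in concepts:
            continue
        for concept in concepts - {start}:
            if semantic_type.get(concept) in allowed_types:
                counts[concept] += 1
    return counts


related = co_cited(abstracts, "C_A", semantic_type, allowed_types={"T028"})
print(dict(related))
```

Treating co-citation in the same abstract as evidence of relatedness keeps the extraction step simple; the Semantic Type filter is what narrows the candidate list to a manageable size.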
System Workflow
System Workflow (AB)
System Workflow (BC)
System Workflow (final)
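The workflow diagrams are not reproduced in this text. As a rough guide, the classic open-discovery scheme (A to B, then B to C) can be sketched as follows. This is a generic Swanson-style illustration over a toy corpus, not the system's actual workflow; the function name `open_discovery` and the abstracts are hypothetical, and excluding concepts already co-cited with A is an assumption.

```python
from collections import defaultdict


def open_discovery(abstracts, a):
    """Open discovery on a toy corpus.

    AB step: B concepts are all concepts co-cited with A.
    BC step: C concepts are co-cited with some B but never with A.
    Returns the B set and a mapping C -> set of linking B concepts.
    """
    literature_a = [s for s in abstracts if a in s]
    b_concepts = set().union(*literature_a) - {a}

    links = defaultdict(set)
    for s in abstracts:
        if a in s:
            continue  # abstracts citing A belong to literature A
        for b in s & b_concepts:
            # anything co-cited with A is already a B concept, so
            # removing B concepts also removes concepts linked to A
            for c in s - b_concepts:
                links[c].add(b)
    return b_concepts, dict(links)


# Hypothetical abstracts, loosely modelled on Swanson's 1986 example.
abstracts = [
    {"raynaud", "blood viscosity"},
    {"raynaud", "platelet aggregation"},
    {"blood viscosity", "fish oil"},
    {"platelet aggregation", "fish oil"},
    {"platelet aggregation", "aspirin"},
]

b_concepts, c_links = open_discovery(abstracts, "raynaud")
print(b_concepts)
print(c_links)
```

Keeping the linking B concepts for every C concept makes the result explainable to the user, which matters in an interactive process.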
Support & Confidence
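The formulas on this slide did not survive extraction. The sketch below uses the standard association-rule definitions, with abstracts playing the role of transactions: support of X → Y is the number of abstracts citing both X and Y, and confidence is that number divided by the number of abstracts citing X. Whether the system uses absolute or relative support is an assumption here; the function names and toy data are hypothetical.

```python
def support(abstracts, x, y):
    """Absolute support of the rule x -> y: abstracts citing both."""
    return sum(1 for s in abstracts if x in s and y in s)


def confidence(abstracts, x, y):
    """Confidence of x -> y: support(x, y) / number of abstracts citing x."""
    n_x = sum(1 for s in abstracts if x in s)
    return support(abstracts, x, y) / n_x if n_x else 0.0


# Hypothetical toy corpus of four abstracts.
abstracts = [{"A", "B"}, {"A", "B"}, {"A", "C"}, {"B"}]

sup_ab = support(abstracts, "A", "B")      # A and B co-cited in 2 abstracts
conf_ab = confidence(abstracts, "A", "B")  # 2 of the 3 abstracts citing A
conf_ba = confidence(abstracts, "B", "A")  # 2 of the 3 abstracts citing B
print(sup_ab, conf_ab, conf_ba)
```

Support is symmetric in X and Y, while confidence is directional, so the two measures can rank the same candidates differently.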
The INHERITANCE project
Integrated Heart Research In Translational Genetics of Cardiomyopathies in Europe
• Dilated cardiomyopathies
• 3-year health research project
• European Commission funding program 7
• 11 European centers
Validation
“Re-discover” DCM/gene association
• Only literature prior to 1st explicit DCM/gene association
TNNT2   TPM1   DES   LMNA
TTN     MYH7   DMD   MVCL
MYBPC3  ABCC9  DSP   PLN
ACTC    CLP    LDB3  SGCD
Validation: idea
“Re-discover” DCM/gene association
• Only literature prior to 1st explicit DCM/gene association
Timeline:
• Nov 1975, DCM: Shors CM, et al. The differential diagnosis of congestive cardiomyopathy and ischemic cardiomyopathy by echocardiography. Angiology. 1975 Nov;26(10):723-33.
• Apr 1982, LMNA: Shelton KR, et al. Oligomeric structure of the major nuclear envelope protein lamin B. J Biol Chem. 1982 Apr 25;257(8):4328-32.
• Dec 1999, LMNA+DCM: Fatkin D, et al. Missense mutations in the rod domain of the lamin A/C gene as causes of dilated cardiomyopathy and conduction-system disease. N Engl J Med. 1999 Dec 2;341(23):1715-24.
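One way to enforce the "only prior literature" rule is to restrict the PubMed query to a publication-date window that ends the month before the first explicit association. The sketch below builds such an ESearch request with the Entrez eUtils parameters `db`, `term`, `datetype`, `mindate`, `maxdate` and `retmax`. The helper names are hypothetical, and the queries the system actually issues are not shown in the slides. The code only builds the URL and performs no network call.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def month_before(year, month):
    """Return the (year, month) immediately preceding the given one."""
    return (year - 1, 12) if month == 1 else (year, month - 1)


def esearch_url(term, start, first_association, retmax=100):
    """Build an ESearch URL limited to publication dates from `start`
    up to the month before `first_association` (both (year, month))."""
    end = month_before(*first_association)
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",  # filter on publication date
        "mindate": f"{start[0]}/{start[1]:02d}",
        "maxdate": f"{end[0]}/{end[1]:02d}",
        "retmax": retmax,
    }
    return ESEARCH + "?" + urlencode(params)


# Dates from the LMNA timeline: first LMNA paper Apr 1982,
# first LMNA+DCM paper Dec 1999, so the window closes in Nov 1999.
url = esearch_url("dilated cardiomyopathy", (1982, 4), (1999, 12))
print(url)
```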
Validation: an example
• A string: “Dilated cardiomyopathy”
• A concept: “Cardiomyopathy, Dilated – (C0007193)”
• Query dates: (Apr 1982 – Nov 1999)
• Literature A obtained
• B concepts:
  o Semantic Type filter (21 types allowed)
  o Support & Confidence (greater than average)
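The "greater than average" criterion can be read as keeping only the B concepts whose support and confidence both exceed the mean over all candidates. The slide does not say whether both measures must pass or only one, so requiring both is an assumption; the function name and the numbers are hypothetical.

```python
from statistics import mean


def above_average(candidates):
    """Keep concepts whose support AND confidence exceed the mean.

    `candidates` maps a concept to a (support, confidence) pair and
    must not be empty.
    """
    avg_support = mean(s for s, _ in candidates.values())
    avg_confidence = mean(c for _, c in candidates.values())
    return {
        concept
        for concept, (s, c) in candidates.items()
        if s > avg_support and c > avg_confidence
    }


# Hypothetical B candidates: concept -> (support, confidence).
candidates = {"B1": (10, 0.50), "B2": (2, 0.10), "B3": (6, 0.05)}

selected = above_average(candidates)
print(selected)
```

A data-driven threshold such as the mean avoids hand-tuning a cutoff for every query, at the price of depending on how the candidate values are distributed.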
Validation: an example
• Query dates: (Apr 1982 – Nov 1999)
• Literature B obtained
• C concepts:
  o One Semantic Type: “Gene or Genome – T028”
Is LMNA among the C concepts?
Evaluation of Support and Score
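Checking whether the target gene is among the C concepts, and how high it ranks, amounts to sorting the candidates by a measure and looking up the target's position. A minimal sketch, with hypothetical gene names and score values (how the heuristic score is computed is not given in the slides):

```python
def rank_of(target, scores):
    """Return (rank, total) of `target` when concepts are sorted by
    decreasing score, or None if the target is not a candidate."""
    if target not in scores:
        return None
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(target) + 1, len(ordered)


# Hypothetical C concepts with made-up scores.
scores = {"GENE_X": 120.0, "GENE_Y": 300.0, "GENE_Z": 45.0}

found = rank_of("GENE_X", scores)    # second highest score of three
missing = rank_of("GENE_W", scores)  # not among the C concepts
print(found, missing)
```

The same function can be applied once with Support and once with Score as the measure, giving two rank/total pairs per gene.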
Validation: results
Gene     First date   First date w/ DCM   B concepts      # Papers
TNNT2    1994 May     2000 Jan            Not Found       5
TTN      1975 Jan     1994 Oct            64              546
MYBPC3   1993 Feb     1997 Mar            Not Found       17
ACTC     1977 Feb     1998 May            98              1313
TPM1     1974 Jan     2000 Jan            Not Found       51
MYH7     1989 Feb     2000 Jan            Not Found       35
ABCC9    2001 Apr     2004 Apr            Not Found       9
CLP      1991 Sep     1997 Feb            Not Found       11
DES      1976 Dec     1990 Jan            82              943
DMD      1978 May     1990 Feb            35              290
DSP      1982 Jan     2000 Oct            189             313
LDB3     1993 Jan     2003 Dec            Not Found       14
LMNA     1983 Jan     1999 Dec            166             214
MVCL     1985 Jan     1997 Jan            Not Found       30
PLN      1975 Jan     1990 May            45              203
SGCD     1999 Aug     1999 Aug            Not Available   2
Validation: results
Gene   Score    Support   Rank Sup   Rank Score
TTN    26832    92        68/542     41/542
ACTC   203577   1025      7/662      6/662
DES    21598    150       11/349     8/349
DMD    15268    300       2/349      21/349
DSP    256598   1115      5/887      8/887
LMNA   252739   752       9/822      5/822
PLN    7906     47        69/380     75/380
Discussion and Future Developments
• Effective in ranking DCM related genes
• Heuristic score good alternative to Support
• Limitation: fails for C concepts with small literature
• Analyze in depth the “threshold problem”
• Overcome the empirical set-up of some parameters
• Practical comparison with other systems
• Improve effectiveness of Text Mining system
Thank You.
In loving memory of Gilles Belley