Mouse-Human Research Classifier

Presented By: Osama Jomaa

Research Adviser: Dr. Iddo Friedberg

Mouse Models in Research

Shares 99% of its genome with humans

Fewer ethical concerns than other mammal models

Inexpensive

Short generation times

Small

The Mouse Trap: The Danger of Using One Lab Animal to Study Every Disease. Daniel Engber. http://www.slate.com/articles/health_and_science/the_mouse_trap/2011/11/lab_mice_are_they_limiting_our_understanding_of_human_disease_.html. November 16, 2011

Designer Mice for Human Research

Photo taken from “Designer mice for human disease - A close view of Nobel Laureate : Oliver Smithies” Yau-Sheng Tsai, Pei-Jane Tsai, Man-Jin Jiang, Cherng-Shyang Chang. http://proj.ncku.edu.tw/research/commentary/e/20071116/2.html December 9, 2014

Mouse Model is Not Perfect Though

Photo taken from: The Mouse Trap: The Danger of Using One Lab Animal to Study Every Disease. Daniel Engber. http://www.slate.com/articles/health_and_science/the_mouse_trap/2011/11/lab_mice_are_they_limiting_our_understanding_of_human_disease_.html. November 16, 2011

Correlation of Mouse Models with the Equivalent Human Diseases

Photo taken from “Genomic responses in mouse models poorly mimic human inflammatory diseases.” Seok, Warren, and Others. Proceedings of the National Academy of Sciences. 110, no. 9 (2013): 3507-3512.

Rank correlation (R²)

Percentage of genes changed in the same direction

Proposed Research

What? Classify the Mouse-Human scientific literature in PubMed into different areas of research

How? Citation Networks + MeSH Thesaurus

Why? Identify and study the popular areas of Mouse-Human research

Proposed Research

What? Classify the proteins in the Mouse-Human citation pairs into different biological systems

How? Protein Co-occurrence Networks + Gene Ontology

Why? Investigate the biological systems and proteins for which Mouse is used as a model organism for Human

Agenda

1. PubMed Articles Classification

    1. Collect Mouse and Human Papers

    2. Build a Citation Network

    3. Classify the Cit-Net Using MeSH Thesaurus

    4. Stats Study on MeSH Disease Classification

2. PubMed Proteins Analysis

    1. Collect Human Proteins and Annotation Data

    2. Build the Entity Co-occurrence Networks

    3. Classify PCoC Networks Using Gene Ontology

3. Summary

Getting Mouse and Human PubMed IDs

UniProt-GOA → Mouse PubMed Identifiers (PMIDs) and Human PubMed Identifiers (PMIDs)

1. Get Mouse and Human papers from UniProt

2. Query the PubMed API for the citation list of each article

3. Parse the PubMed XML response and get the citation list

...

<CitationList>

<PMID> 342342 </PMID>

<PMID> 423545 </PMID>

<PMID> 432598 </PMID>

</CitationList>

...

Very few PubMed articles have the citation list in their XML file!
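Step 3 might look roughly like the Python sketch below. It assumes the simplified response layout drawn on the slide (a CitationList element with PMID children), which is not the actual PubMed XML schema; the function name parse_citation_list and the Article wrapper element are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_citation_list(xml_text):
    """Return the PMIDs found inside every <CitationList> element.

    Assumes the simplified layout from the slide, not the real PubMed schema.
    """
    root = ET.fromstring(xml_text.strip())
    pmids = []
    # iter() also matches the root itself, so a bare <CitationList> works too.
    for citation_list in root.iter("CitationList"):
        for pmid in citation_list.findall("PMID"):
            text = (pmid.text or "").strip()
            if text:
                pmids.append(int(text))
    return pmids


sample = """
<Article>
  <CitationList>
    <PMID> 342342 </PMID><PMID> 423545 </PMID><PMID> 432598 </PMID>
  </CitationList>
</Article>
"""
cited = parse_citation_list(sample)
```

An article whose XML carries no citation list simply yields an empty list, which is the case the slide warns about.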

Getting Mouse and Human Citation Lists from Scopus

UniProt-GOA → Mouse PubMed Identifiers (PMIDs) and Human PubMed Identifiers (PMIDs)

1. Get Mouse and Human papers from UniProt

2. Compose an HTTP GET request with the PMIDs

3. Parse the Scopus JSON response and get the citation list

...

{"CitationList": [{"PMID": 342342}, {"PMID": 423545}, {"PMID": 432598}]}

...
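The parsing step could be sketched in Python as below. This assumes the citation list arrives as a JSON array of objects with a PMID key; that layout is illustrative only and is not the actual Scopus response schema. The function name pmids_from_json is hypothetical.

```python
import json


def pmids_from_json(json_text):
    """Extract cited PMIDs from a response shaped like the illustrative example."""
    payload = json.loads(json_text)
    # A missing "CitationList" key is treated as "no citations available".
    return [entry["PMID"] for entry in payload.get("CitationList", [])]


sample = '{"CitationList": [{"PMID": 342342}, {"PMID": 423545}, {"PMID": 432598}]}'
cited = pmids_from_json(sample)
```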

Building the Citation Network

[Figure: citation network of Human (H) and Mouse (M) papers. Edge types: M → H, H → H, H → M, M → M]

[Pie chart: Mouse Inter and Intra Citations. Slices: Mouse-Human Citations, Mouse-Mouse Citations, Mouse-Others Citations. Values shown: 62%, 3%, 34%]

[Pie chart: Human Inter and Intra Citations. Slices: Human-Others Citations, Human-Human Citations, Human-Mouse Citations. Values shown: 34%, 62%, 4%]
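Counting the edge types of such a network could be sketched in Python as follows. The PMIDs, species labels and citation lists are hypothetical toy data, and labeling papers outside the Mouse and Human sets as "Others" is an assumption based on the chart legends.

```python
from collections import Counter

# Hypothetical toy data: species label per PMID and citing -> cited lists.
species = {101: "M", 102: "M", 201: "H", 202: "H", 203: "H"}
citations = {
    101: [201, 202, 102],  # Mouse paper citing two Human papers and one Mouse paper
    102: [201],
    201: [202, 203, 101],
    202: [203],
}


def edge_type_counts(species, citations):
    """Count citation edges by (citing species, cited species)."""
    counts = Counter()
    for citing, cited_list in citations.items():
        for cited in cited_list:
            # Papers outside the Mouse/Human sets are labeled "Others".
            src = species.get(citing, "Others")
            dst = species.get(cited, "Others")
            counts[f"{src}->{dst}"] += 1
    return counts


counts = edge_type_counts(species, citations)
total = sum(counts.values())
# Share of each edge type, as a fraction of all citation edges.
shares = {edge: n / total for edge, n in counts.items()}
```

Restricting the loop to citing papers of one species would give the per-species breakdown that the two pie charts summarize.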

Medical Subject Headings

Controlled vocabulary to index PubMed articles

Stored in a DAG-like structure

16 top-level concepts at the root

Includes ~27K concepts (MeSH descriptors) altogether

We used MeSH to group the Mouse and Human papers in the citation network into classes of research

MeSH Structure Example

Digestive System Diseases > Gastrointestinal Diseases > Stomach Diseases > Stomach Neoplasms

Neoplasms > Neoplasms by Site > Digestive System Neoplasms > Gastrointestinal Neoplasms > Stomach Neoplasms
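Because a MeSH term can have several parents, finding the research areas of a term means walking every path up to the roots. A minimal Python sketch, with parent links read off the Stomach Neoplasms example on the slide (an assumption wherever the extracted layout is ambiguous) and a hypothetical function name:

```python
# Parent links for the slide's example; a term may have several parents.
parents = {
    "Stomach Neoplasms": ["Stomach Diseases", "Gastrointestinal Neoplasms"],
    "Stomach Diseases": ["Gastrointestinal Diseases"],
    "Gastrointestinal Diseases": ["Digestive System Diseases"],
    "Gastrointestinal Neoplasms": ["Digestive System Neoplasms"],
    "Digestive System Neoplasms": ["Neoplasms by Site"],
    "Neoplasms by Site": ["Neoplasms"],
}


def top_level_terms(term, parents):
    """Return every root reachable from `term` by following parent links."""
    roots, stack, seen = set(), [term], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        ups = parents.get(current, [])
        if not ups:
            roots.add(current)  # no parents: a top-level concept
        stack.extend(ups)
    return roots


areas = top_level_terms("Stomach Neoplasms", parents)
```

With these links, a paper indexed under Stomach Neoplasms falls under both Digestive System Diseases and Neoplasms.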

Classifying the Citation Network

To Do: Place in research areas

[Figure: the Human (H) and Mouse (M) papers of the citation network grouped into MeSH research areas: Digestive System Diseases, Eye Diseases, Virus Diseases, Immune System Diseases, Cardiovascular Diseases, Skin Diseases]

Number of Mouse and Human Papers in the MeSH Disease Categories

Number of Mouse-Human Citation Pairs in the MeSH Disease Categories
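The two counts named above could be computed along these lines. This is a sketch on hypothetical data: it assumes each paper has already been assigned a single MeSH disease category, and it counts a citation pair only when both papers share that category, which may differ from the rule actually used in the study.

```python
from collections import Counter

# Hypothetical inputs: species and MeSH disease category per PMID.
species = {1: "M", 2: "M", 3: "H", 4: "H", 5: "H"}
category = {
    1: "Eye Diseases",
    2: "Virus Diseases",
    3: "Eye Diseases",
    4: "Virus Diseases",
    5: "Eye Diseases",
}
# (citing Mouse PMID, cited Human PMID) pairs.
pairs = [(1, 3), (1, 5), (2, 4), (2, 3)]

# Number of papers per (category, species).
papers_per_category = Counter((category[p], species[p]) for p in species)

# Number of citation pairs whose two papers fall in the same category.
pairs_per_category = Counter(
    category[m] for m, h in pairs if category[m] == category[h]
)
```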

GenBank

Protein: NP_e342 | PMID: 432432

kicgdkssgihygvitcegckgffrrsqqc

Protein: NP_452u1 | PMID: 483232

Adtltytlglsdgqlplgaspdlpeasacp…..

1. Get the Human protein sequences and their papers

...

PMID: 3213414

NP_u4323: sgihygvitcegckgffrrsqqc

NP_i4322: lplgaspdlpeasacfewrwts

NP_w3421: kicgdkssgihygvitceg

PMID: 2346414

NP_ti3423: vitcegckgckgffrrsqqc

NP_q4322f: ygvitcegeasacfewrwts

NP_x342u2: kicgdkssgihygvitceg

2. Group the proteins by their PMID

3. Intersect the GenBank papers with the Scopus citations

NP_u4323: sgihygvitcegckgffrrsqqc

NP_i4322: lplgaspdlpeasacfewrwts

NP_w3421: kicgdkssgihygvitceg

NP_ti3423: vitcegckgckgffrrsqqc

NP_q4322f: ygvitcegeasacfewrwts

NP_x342u2: kicgdkssgihygvitceg
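Grouping the proteins by PMID and intersecting with the citation data might be sketched in Python as follows. The records mimic the slide's placeholder accessions; NP_zz0001, PMID 9999999 and the cited_pmids set are hypothetical additions used only to show a paper being filtered out.

```python
from collections import defaultdict

# Hypothetical (protein accession, PMID, sequence) records in the style of the slide.
records = [
    ("NP_u4323", 3213414, "sgihygvitcegckgffrrsqqc"),
    ("NP_i4322", 3213414, "lplgaspdlpeasacfewrwts"),
    ("NP_ti3423", 2346414, "vitcegckgckgffrrsqqc"),
    ("NP_zz0001", 9999999, "kicgdkssgihygvitceg"),
]
# Hypothetical set of Human PMIDs that appear in the citation data.
cited_pmids = {3213414, 2346414}


def group_by_pmid(records):
    """Map each PMID to the {accession: sequence} dict of its proteins."""
    grouped = defaultdict(dict)
    for accession, pmid, sequence in records:
        grouped[pmid][accession] = sequence
    return grouped


grouped = group_by_pmid(records)
# Keep only the papers that also appear in the citation data.
kept = {pmid: grouped[pmid] for pmid in grouped.keys() & cited_pmids}
```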

Removing Redundancies

Use CD-HIT with similarity threshold = 0.9
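One way to script this step is to build the CD-HIT command line from Python. The sketch below assumes the standard cd-hit executable is installed; the file names and the helper name cd_hit_command are hypothetical, and only the 0.9 threshold comes from the slide.

```python
def cd_hit_command(fasta_in, fasta_out, identity=0.9):
    """Build the argument list for a CD-HIT run at the given identity threshold."""
    return [
        "cd-hit",
        "-i", fasta_in,       # input FASTA file
        "-o", fasta_out,      # output FASTA of cluster representatives
        "-c", str(identity),  # sequence identity threshold
    ]


cmd = cd_hit_command("proteins.fasta", "proteins_nr.fasta")
# Pass `cmd` to subprocess.run(cmd, check=True) on a machine with CD-HIT installed.
```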

Gene Ontology

Photo taken from: Gene Ontology Consortium. Ontology Structure. http://geneontology.org/page/ontology-structure Last access December 13, 2014

Gene Ontology Annotation

Example: cytochrome c

Biological Process: oxidative phosphorylation

Cellular Component: mitochondrial matrix

Molecular Function: oxidoreductase activity

Getting GO Terms

1. Create BLAST query in FASTA format

2. Create BLAST database from the Swiss-Prot Human flat file

3. Run BLAST with e-value = 10⁻⁸

4. Parse the BLAST XML response and get the GO terms for the top hits

[Figure: protein list → FASTA file → BLAST against the DB → GO terms per protein]

NP_u4323: GO1, GO5, GO4

NP_i4322: GO5, GO9

NP_w3421: GO4, GO6

...
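Steps 1 to 3 could be scripted roughly as below. The sketch assumes the NCBI BLAST+ command-line tools (makeblastdb, blastp) and that the Swiss-Prot flat file has already been converted to FASTA; neither detail is stated on the slides. All file names and function names are hypothetical.

```python
def to_fasta(proteins):
    """Format an {accession: sequence} dict as FASTA text."""
    return "".join(f">{acc}\n{seq}\n" for acc, seq in proteins.items())


def blast_commands(query_fasta, db_fasta, db_name, out_xml, evalue="1e-8"):
    """Build BLAST+ argument lists: database creation, then the protein search."""
    make_db = ["makeblastdb", "-in", db_fasta, "-dbtype", "prot", "-out", db_name]
    search = [
        "blastp",
        "-query", query_fasta,
        "-db", db_name,
        "-evalue", evalue,  # e-value cutoff from the slide
        "-outfmt", "5",     # XML output
        "-out", out_xml,
    ]
    return make_db, search


fasta_text = to_fasta({"NP_u4323": "sgihygvitcegckgffrrsqqc"})
make_db, search = blast_commands("query.fasta", "human.fasta", "human_db", "hits.xml")
```

Each list can be handed to subprocess.run on a machine with BLAST+ installed; the resulting XML file is what step 4 parses.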

Building the PCoC Network

[Figure: protein co-occurrence (PCoC) network over Mouse papers (MP) and Human papers (HP), with numeric labels. Edge types: Citation Edge, P-P Edge, P-C-P Edge]
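A minimal Python sketch of how the two protein edge types could be derived from paper-level data. It assumes that a P-P edge joins two proteins reported in the same paper and that a P-C-P edge joins a protein in a citing Mouse paper to a protein in the cited Human paper; the first reading is an interpretation of the legend, not something the slides spell out. Paper IDs and protein labels are hypothetical.

```python
from itertools import combinations, product

# Hypothetical proteins per paper and citing Mouse -> cited Human paper pairs.
proteins = {
    "MP1": ["a", "b"],
    "HP1": ["c", "d"],
    "HP2": ["e"],
}
citation_edges = [("MP1", "HP1"), ("MP1", "HP2")]


def build_pcoc(proteins, citation_edges):
    """Return (P-P edges, P-C-P edges) as sets of protein pairs."""
    # P-P: two proteins reported in the same paper (assumed reading).
    pp = set()
    for plist in proteins.values():
        pp.update(combinations(sorted(plist), 2))
    # P-C-P: a protein in the citing paper paired with one in the cited paper.
    pcp = set()
    for citing, cited in citation_edges:
        pcp.update(product(proteins.get(citing, []), proteins.get(cited, [])))
    return pp, pcp


pp_edges, pcp_edges = build_pcoc(proteins, citation_edges)
```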

To Do: Classifying the PCoC Network

To Do: Place in Protein Biological Systems

[Figure: PCoC network nodes grouped by GO term: lactase activity, serotonin receptor activity, signal sequence binding, signal transducer activity, nucleotide binding, ATP binding]

Summary

Cit-Net connects citing Mouse papers with cited Human papers in the PubMed database

MeSH is used to classify the citation network nodes into different classes of research

PCoC network connects the proteins in the citing Mouse papers with proteins in the cited Human papers

GO is used to group the P-P and P-C-P network nodes into different classes of MFs, BPs and CCs

Timetable

Jan Feb Mar Apr May

Database Creation and Data migration

Citation Network Classification

PCoC Networks Building

PCoC Networks Classification

PCoC Networks Analysis

Thank You!

Q & A