87
1 Gene family classification using a semi-supervised learning method Nan Song Advisors: John Lafferty, Dannie Durand

Gene family classification using a semi-supervised learning method

  • Upload
    kaida

  • View
    61

  • Download
    0

Embed Size (px)

DESCRIPTION

Nan Song Advisors: John Lafferty, Dannie Durand. Gene family classification using a semi-supervised learning method. Outline. Introduction A motivating application: genome annotation A graphical model of sequence relatedness Gene classification using machine learning - PowerPoint PPT Presentation

Citation preview

Page 1: Gene family classification using  a semi-supervised learning method

1

Gene family classification using a semi-supervised learning method

Nan SongAdvisors: John Lafferty, Dannie Durand

Page 2: Gene family classification using  a semi-supervised learning method

2

Outline

• Introduction – A motivating application: genome annotation

• A graphical model of sequence relatedness• Gene classification using machine learning• Empirical evaluation• Conclusion

Page 3: Gene family classification using  a semi-supervised learning method

The complete genetic material

of an organism or species

The Genome

Page 4: Gene family classification using  a semi-supervised learning method

Key genomic component: genes

ACCCTTAGCTAGACCTTTAGGAGG...

A gene is a DNA subsequence

Page 5: Gene family classification using  a semi-supervised learning method

Key genomic component: genes

Genes encode proteins, the building blocks of the cell

ACCCTTAGCTAGACCTTTAGGAGG...

A gene is a DNA subsequence

A protein is an amino acid sequence

V H L T P E...

Genes encode proteins, the building blocks of the cell

ACCCTTAGCTAGACCTTTAGGAGG...

A gene is a DNA subsequence

A protein is an amino acid sequence

V H L T P E...

Page 6: Gene family classification using  a semi-supervised learning method

6

413 whole genome sequences: 41 eukarya, 28 archaea, 344 bacteria

In progress: 1034 prokaryotic genomes, 629 eukaryotic genomes

www.genomesonline.org

Whole Genome Sequencing

Page 7: Gene family classification using  a semi-supervised learning method

• atgcaccttg

Page 8: Gene family classification using  a semi-supervised learning method

8

Gene prediction and annotationInternational Human Genome Consortium, Nature 2001

Predicted genes16,896

Total31,778

Known genes14,882

Page 9: Gene family classification using  a semi-supervised learning method

Gene annotation• We are given a new genome sequence with

predicted genes.

• A few genes are well studied.

• Identify other genes in the same family to predict function.

• Verify predictions experimentally

Two contexts: – Individual scientist

– High throughput

Page 10: Gene family classification using  a semi-supervised learning method

10

Outline

• Introduction – Molecular biology– A motivating application: genome

annotation• A graphical model of sequence relatedness• Gene classification using machine learning• Empirical evaluation• Conclusion

Page 11: Gene family classification using  a semi-supervised learning method

11

Evolutionarily related genes have related functions

Ancestral geneatgccaggactcccagtga…

atgcgccgtctggcatgt…

β-globin

atgcaaggagtcccagagc…γ-globin

atgcgaggtctcccatgt…

ε-globin

Adult Fetal Embryonic

Duplication

Duplication

Page 12: Gene family classification using  a semi-supervised learning method

Evolutionarily related genes have related functions

Gene family classification is a powerful source of information for inferring evolutionary,

functional and structural properties of genes

atgcgaggtctcccatgt…

Ancestral geneatgccaggactcccagtga…

Duplication

Duplication

atgcgccgtctggcatgt… atgcaaggagtcccagagc…

β-globin γ-globin ε-globin

Page 13: Gene family classification using  a semi-supervised learning method

13

Outline

• Introduction • A graphical model of sequence relatedness• Gene classification using machine learning• Empirical evaluation• Conclusion

Page 14: Gene family classification using  a semi-supervised learning method

14

…atgcaaggagtcccagagcc……atgcgaggtctcccagtgtc…xi

xj

A graphical model of sequence relatedness

E: weight of the edge is proportional to the similarity between sequences.

G = (V,E)

V: represent sequences

Page 15: Gene family classification using  a semi-supervised learning method

15

xi

xj

A graphical model of sequence relatedness

E: weight of the edge is proportional to the similarity between sequences.

G = (V,E)

V: represent sequences

Page 16: Gene family classification using  a semi-supervised learning method

16

xi

xj

Gene family classification

Goal: Given known genes, identify genes in the same family.

Biological scenario: • small number of known genes• large number of unknown genes

Page 17: Gene family classification using  a semi-supervised learning method

17

Outline

• Introduction

• A graphical model of sequence relatedness

• Gene classification using machine learning

• Empirical evaluation

• Conclusion

Page 18: Gene family classification using  a semi-supervised learning method

18

Framework: binary classification

Determine which unlabeled genes belong to the family.

Machine learning scenario: • small number of labeled data

• genes known to be in family• genes clearly not in family

• large number of unlabeled data

Page 19: Gene family classification using  a semi-supervised learning method

19

Several challenging problems of gene family classification

Traditionally, similarity is represented by sequence comparison

atgcgccgtctggcatgt…atgcaaggagtcccagagc…

atgcgaggtctcccatgt…

Ancestral gene

Duplication

Duplication

Mutations

atgcgccccccggcatgt… DNA shuffling

atgcgccgtctggcatgt…ggctcgta

Page 20: Gene family classification using  a semi-supervised learning method

20

Several challenging problems of gene family classification

Traditionally, similarity is represented by sequence comparison

atgcgccgtctggcatgt…atgcaaggagtcccagagc…

atgcgaggtctcccatgt…

Ancestral gene

Duplication

Duplication

Mutations

atgcgccccccggcatgt… DNA shuffling

atgcgccgtctggcatgt…ggctcgta

Page 21: Gene family classification using  a semi-supervised learning method

21

Several challenging problems of gene family classification

Families

– do not form a clique– do not form a connected

component– have edges to sequences outside

the family.

Page 22: Gene family classification using  a semi-supervised learning method

22

Outline

• Introduction

• A graphical model of sequence relatedness

• Gene classification using machine learning– Semi-supervised learning algorithm– Supervised learning algorithm

• Empirical evaluation

• Conclusion

Page 23: Gene family classification using  a semi-supervised learning method

23

Gene family classification

Goal: Binary classification

Machine learning scenario: • large number of unlabeled data• small number of labeled data

Semi supervised learning:• Exploit information from both labeled and unlabeled data

• Performed well in many applications

Page 24: Gene family classification using  a semi-supervised learning method

24

Graphical semi-supervised learning (Binary classification)

Notation:

• V: The whole data set

• L: Labeled data set

• U: unlabeled data set

• Each vertex: (xi,yi) or (xk, f(k))

Xiaojin Zhu, Zoubin Ghahramani, John Lafferty The Twentieth International Conference on Machine Learning (ICML-2003)

(xi,yi = 1)

(xj,yj = 0)(xk,f(k))

Page 25: Gene family classification using  a semi-supervised learning method

25

Graphical semi-supervised learning (Binary classification)

(xi,yi = 1)

(xj,yj = 0)(xk,f(k))

• Output: – Assign a real value to every

vertex in the graph– Find a cutoff to separate the

two classes

Xiaojin Zhu, Zoubin Ghahramani, John Lafferty The Twentieth International Conference on Machine Learning (ICML-2003)

• Input: – family members (xi, yi = 1) – nonfamily members: (xj, yj = 0)

Page 26: Gene family classification using  a semi-supervised learning method

26

Graphical semi-supervised learning (Binary classification)

(xi,yi = 0)

G = (V,E)L: Labeled data setU: unlabeled data set

(xn,yp = 1)(xk,f(k))

Assign real values to all vertices in the graph, to minimize E(f):

)( exp where

)( ..

)))()((()( 2

ij

ij

ii

ijVjVi

SW

Lxyifts

jfifWfE

Sij

Page 27: Gene family classification using  a semi-supervised learning method

27

Graph-based semi-supervised learning

f(xk)

http://www.cs.wisc.edu/~jerryzhu/research/ssl/animation.html

Works well

Page 28: Gene family classification using  a semi-supervised learning method

28

Graph-based semi-supervised learning

f(xk)

http://www.cs.wisc.edu/~jerryzhu/research/ssl/animation.html

Works well Works well ?

Page 29: Gene family classification using  a semi-supervised learning method

29

Outline

• Introduction

• A graphical model of sequence relatedness

• Gene classification using machine learning– Semi-supervised learning– Supervised learning

• Empirical evaluation

• Conclusion

Page 30: Gene family classification using  a semi-supervised learning method

Semi-supervised vs kernel-based supervised learning

• Semi-supervised learning:

• Supervised learning:

Lxyifts

jfifWfE

ii

ijVjVi

)( ..

)))()((()( 2

)))()((()( 2jfifWfE ijUjLi

where L is the labeled data set and U is the unlabeled data set

Page 31: Gene family classification using  a semi-supervised learning method

31

Outline

• Introduction

• A graphical model of sequence relatedness

• Gene classification using machine learning

• Empirical evaluation– Methodology– Results

• Conclusion

Page 32: Gene family classification using  a semi-supervised learning method

32

Graph construction

G = (V,E)

V: All mouse sequences from SwissProt (n = 7439)

E: based on newly designed sequence similarity measurement.

0 < S(i, j) < 1

Page 33: Gene family classification using  a semi-supervised learning method

33

Methodology• Graph construction

• Test set construction

• Experiments performed

• Basis for evaluation

Page 34: Gene family classification using  a semi-supervised learning method

Test set construction

18 well studied protein families– Receptors, enzymes, transcription factors,

motor proteins, structural proteins, and extracellular matrix proteins.

ACSL FOX Laminin

PDE TRAF

ADAM GATA

SEMA

T-box

DVL Kinase

Myosin

USP

FGF Kinesin

Notch TNFR

WNT

Page 35: Gene family classification using  a semi-supervised learning method

35

Test set construction

• Retrieved all complete mouse sequences from SwissProt database (7,439)

• Identified sequences for each test family based on

– Nomenclature committee reports

– Structural properties

– Literature surveys

Family size ACSL 5 ADAM 26 DVL 3 FGF 20 FOX 30 GATA 6 Kinase 293 Kinesin 18 Laminin 11 Myosin 12 Notch 4 PDE 15 SEMA 16 TNFR 24 TRAF 9 Tbox 6 USP 18 WNT 19

Page 36: Gene family classification using  a semi-supervised learning method

36

Methodology• Graph construction

• Test set construction

• Experiments performed

• Basis for evaluation

Page 37: Gene family classification using  a semi-supervised learning method

Experiments performed

• Compare semi-supervised with supervised learning algorithm

• Tested parameters:– Scaling parameter,σ, in the kernel function

– Number of Labeled Family members (LF)

– Number of Labeled Nonfamily members(LN)

Page 38: Gene family classification using  a semi-supervised learning method

Tested parameters

number of Labeled Family members

number of Non-labeled Family members

σ

For each set of parameters, 20 tests were performed

Page 39: Gene family classification using  a semi-supervised learning method

Tested parameters (1)

Tested σ values: 0.05, 0.1, 0.5, 1, 2, 10, 100

S

W

0.02

σ=100

σ=10

σ=1

σ=0.5

σ=0.2σ=0.1

0.080.05

10.80.60.40.20

1

0.8

0.6

0.4

0.2

0

)(exp where),))()((()( 2

ij

ijijVjVi

SWjfifWfE

Page 40: Gene family classification using  a semi-supervised learning method

Tested parameters (2)

• Labeled Family members (LF):

10-70% of family size • Labeled Nonfamily members (LN) :

100, 500, 1000

about 1 - 10% of nonfamily size

Family size

LF

ACSL 5 1, 3 ADAM 26 3, 5, 7, 9,15 DVL 3 1 FGF 20 3, 5, 7, 11, 15 FOX 30 3, 9, 15 GATA 6 1,3 Kinase 293 3, 7, 11, 15,

20, 50, 150 Kinesin 18 2, 6, 9 Laminin 11 1, 3, 5, 7 Myosin 12 2, 4, 6, 9 Notch 4 1, 2 PDE 15 2, 5, 7, 10 SEMA 16 2, 5, 8 TNFR 24 2, 4, 8, 12, 18 TRAF 9 1, 3 Tbox 6 2, 5 USP 18 2, 4, 6, 9,13 WNT 19 2, 9

Database size: 7439

Page 41: Gene family classification using  a semi-supervised learning method

41

Methodology• Graph construction

• Test set construction

• Experiments performed

• Basis for evaluation

Page 42: Gene family classification using  a semi-supervised learning method

42

Semi-supervised learning

Goal:

f(i) > f(j) when xi is a family member and xj is not.

Evaluation criteria:

• Visualization

• AUC score

• False negatives

Page 43: Gene family classification using  a semi-supervised learning method

VisualizationSort all unlabeled data by f(x)

f(x)

Rank

Family members

Nonfamily members

Page 44: Gene family classification using  a semi-supervised learning method

1 - specificity

sens

itivi

ty

11 1

ba

n

i

n

jba

nnAUC

a b

ji

f(x)

Rank

Family members

Nonfamily members

AUC (Area Under ROC Curve)

Rank plot

Page 45: Gene family classification using  a semi-supervised learning method

Advantages of rank plot

AUC = 0.9382

Page 46: Gene family classification using  a semi-supervised learning method

AUC scores do not reflect all information we need

• False negatives after the first false positive

• The number of missed data after the first false positive

Page 47: Gene family classification using  a semi-supervised learning method

47

Outline

• Introduction

• A graphical model of sequence relatedness

• Gene classification using machine learning

• Empirical evaluation– Methodology– Results

• Conclusion

Page 48: Gene family classification using  a semi-supervised learning method

48

Several challenging problems of gene family classification

Families

– do not form a clique– do not form a connected

component– have edges to sequences outside

the family.

Edges to sequences outside the family are mainly a problem if they have strong edge weights

Page 49: Gene family classification using  a semi-supervised learning method

49

Test families have different graph properties Family

size Clique Connected NOT

Connected ACSL 5 W DVL 3 W FOX 30 W GATA 6 W SEMA 16 W TRAF 9 W Tbox 6 W WNT 19 W Kinesin 18 S Laminin 12 S Myosin 12 S Notch 4 S ADAM 26 X FGF 20 X Kinase 293 X PDE 15 X X TNFR 24 X X USP 18 X X

W: Edges to sequences outside the family have weak edge weights

S: Edges to sequences outside the family have strong edge weights

Page 50: Gene family classification using  a semi-supervised learning method

Results

• Compare semi-supervised with supervised learning algorithm

• Tested parameters:– Scaling parameter,σ, in the kernel function

– Number of Labeled Family members (LF)

– Number of Labeled Nonfamily members(LN)

Page 51: Gene family classification using  a semi-supervised learning method

Tested parameters

number of Labeled Family members

number of Non-labeled Family members

σ

0.9996

0.9998

1

Notch, Lf = 1, Ln =1000

10.1 0.5 100.2

AU

C (

ave)

Page 52: Gene family classification using  a semi-supervised learning method

The effect of σ

Raw similarity score (s)

W

0.02

σ=100

σ=10

σ=1

σ=0.5

σ=0.2σ=0.1

0.080.05

10.80.60.40.20

1

0.8

0.6

0.4

0.2

0

)(exp where),))()((()( 2

ij

ijijVjVi

SWjfifWfE

Page 53: Gene family classification using  a semi-supervised learning method

53

Test families have different graph properties Family

size Clique Connected NOT

Connected ACSL 5 W DVL 3 W FOX 30 W GATA 6 W SEMA 16 W TRAF 9 W Tbox 6 W WNT 19 W Kinesin 18 S Laminin 12 S Myosin 12 S Notch 4 S ADAM 26 X FGF 20 X Kinase 293 X PDE 15 X X TNFR 24 X X USP 18 X X W: Edges to sequences outside the family have weak edge weights

S: Edges to sequences outside the family have strong edge weights

Page 54: Gene family classification using  a semi-supervised learning method

Edges to sequences outside the family are mainly a problem if they have strong edge weights

Page 55: Gene family classification using  a semi-supervised learning method

Edges to sequences outside the family are mainly a problem if they have strong edge weights

FOX Notch

Nu

mb

er o

f ed

ges

Raw edge weight Raw edge weight

Page 56: Gene family classification using  a semi-supervised learning method

Case study: Rank plots for semi-supervised learning in FOX

σ = 0.1σ =1

σ = 10σ=100

LF = 3, LN = 100, family size: 30

Page 57: Gene family classification using  a semi-supervised learning method

Case study: rank plots for semi-supervised learning in Notch

labeled family seqs: 1 (out of 4) labeled nonfamily seqs: 100(out of 7435)

σ = 0.1

σ = 1

σ = 10

σ= 10

Page 58: Gene family classification using  a semi-supervised learning method

0.9996

0.9998

1

Notch, Lf = 1, Ln =1000

10.1 0.5 100.2

AU

C (

ave)

0.9996

0.9998

1

10.1 0.5 100.2

AU

C (

ave)

FOX, Lf = 3, Ln =1000

σ

Page 59: Gene family classification using  a semi-supervised learning method

Summary on σ

• For most families, the performance is not very sensitive to σ

• For almost all families that form a clique, there is at least one value of sigma (usually many)– such that both semi-supervised and supervised

learning algorithms have perfect classfication performance

Page 60: Gene family classification using  a semi-supervised learning method

Results

• Compare semi-supervised with supervised learning algorithm

• Tested parameters:– Scaling parameter,σ, in the kernel function

– Number of Labeled Family members (LF)

– Number of Labeled Nonfamily members(LN)

Page 61: Gene family classification using  a semi-supervised learning method

61

Test families have different graph properties Family

size Clique Connected NOT

Connected ACSL 5 W DVL 3 W FOX 30 W GATA 6 W SEMA 16 W TRAF 9 W Tbox 6 W WNT 19 W Kinesin 18 S Laminin 12 S Myosin 12 S Notch 4 S ADAM 26 X FGF 20 X Kinase 293 X PDE 15 X X TNFR 24 X X USP 18 X X

W: Edges to sequences outside the family have weak edge weights

S: Edges to sequences outside the family have strong edge weights

Page 62: Gene family classification using  a semi-supervised learning method

The connection among sequences in ADAM family

0

2

4

6

8

10

12

14

16

# of connected ADAM sequences269 24 25

Page 63: Gene family classification using  a semi-supervised learning method

The connection among sequences in ADAM family

Page 64: Gene family classification using  a semi-supervised learning method

Tested parameters

number of Labeled Family members

number of Non-labeled Family members

σ

By taking the maximum

number of Labeled Family members

number of Non-labeled Family members

achieve the best average AUC score

Page 65: Gene family classification using  a semi-supervised learning method

The impact of number of labeled family and nonfamily members on the performance

0.992

0.996

1

73 5 159

AU

C

Supervised, LN =100

# labeled family seqs, LF

ADAM

Page 66: Gene family classification using  a semi-supervised learning method

The impact of number of labeled family and nonfamily members on the performance

0.992

0.996

1

73 5 159

AU

C

Supervised, LN =100

# labeled family seqs, LF

Semi-supervised, LN = 100

ADAM

Performed paired t-test to detect the difference between semi-supervised and supervised method for a set of parameters

Page 67: Gene family classification using  a semi-supervised learning method

The impact of number of labeled family and nonfamily members on the performance

0.992

0.996

1

73 5 159

AU

C

Supervised, ln =100

# labeled family seqs

Supervised, ln =1000

Semi-supervised, ln = 100

ADAM

Page 68: Gene family classification using  a semi-supervised learning method

The impact of number of labeled family and nonfamily members on the performance

0.992

0.996

1

73 5 159

AU

C

# labeled family seqs

Semi-supervised, ln = 1000

Supervised, ln =100

Supervised, ln =1000

Semi-supervised, ln = 100

ADAM

Page 69: Gene family classification using  a semi-supervised learning method

Graph structure of ADAM

• Troublemaker: ADAMTS10 matches with only 8

out of 26 sequences in ADAM family.

• ADAMTS10 is often misclassified

• ADAMTS10 is implicated in a genetic disease

that causes impaired vision and heat defects.

Page 70: Gene family classification using  a semi-supervised learning method

70

Semi-supervised method

Supervised method

Page 71: Gene family classification using  a semi-supervised learning method

71

Several challenging problems of gene family classification

• Sequences in the same family – do not form a clique– do not exist in the same connected component

• Sequences in different families – have significant matches

Page 72: Gene family classification using  a semi-supervised learning method

72

Test families have different graph properties Family

size Clique Connected NOT

Connected ACSL 5 W DVL 3 W FOX 30 W GATA 6 W SEMA 16 W TRAF 9 W Tbox 6 W WNT 19 W Kinesin 18 S Laminin 12 S Myosin 12 S Notch 4 S ADAM 26 X FGF 20 X Kinase 293 X PDE 15 X X TNFR 24 X X USP 18 X X

W: Edges to sequences outside the family have weak edge weights

S: Edges to sequences outside the family have strong edge weights

Page 73: Gene family classification using  a semi-supervised learning method

The connection among sequences in TNFR family

Page 74: Gene family classification using  a semi-supervised learning method

The connection among sequences in TNFR family

10 11 12 13 14 15 16 17 18 19 20

6

4

2

# of connected TNFR sequences

20 TNFR in this connected component

Page 75: Gene family classification using  a semi-supervised learning method

0.92

0.96

1

TNFR (family size 24)

82 4 1812

Semi, ln = 1000

Supervised, ln =100

Supervised, ln=1000Semi, , ln = 100

AU

CThe impact of number of labeled family and

nonfamily members on the performance

Page 76: Gene family classification using  a semi-supervised learning method

Summary for Number of labeled family members

• The performance of both semi-supervised

and supervised learning improves as LF

increases for all families.

• In non-clique families, semi-supervised

learning performs better than supervised

when LF is small.

Page 77: Gene family classification using  a semi-supervised learning method

Rank plots for semi-supervised learning in TNFR

σ= 0.1 Lf = 2, ln = 100

AUC values do not reflect all information that we need

Page 78: Gene family classification using  a semi-supervised learning method

TNFR (family size 24)

82 4 1812

Semi, ln = 1000

Supervised, ln =100

Supervised, ln=1000

Semi, , ln = 100

Nu

mb

er o

f m

isse

d T

NF

RThe impact of number of labeled family and

nonfamily members on the performance

Page 79: Gene family classification using  a semi-supervised learning method

Summary for Number of labeled family members

• The performance of both semi-supervised

and supervised learning improves as LF

increases for all families.

• In non-clique families, semi-supervised

learning performs better than supervised

when LF is small.

Page 80: Gene family classification using  a semi-supervised learning method

Summary for Number of labeled non-family members (LN)

• The performance supervised learning

improves as LN increases for all families.

• For semi-supervised learning, sometimes

LN is sometimes helpful and sometimes

not.

Page 81: Gene family classification using  a semi-supervised learning method

81

Summary of results

Clique

Connected

Small LF Large LF

100 1000 100 1000 100 1000 100 1000

Super Semi Super semi Super Semi Super semi ACSL DVL FOX GATA Kinesin Myosin Laminin 0.9999 0.9999 0.9999 0.9999 1 1 1 1 Notch SEMA TRAF Tbox WNT ADAM 0.9951 1 0.9989 1 1 1 1 1 FGF Kinase 0.9549 0.9644 0.9745 0.9738 0.9656 0.9666 0.9804 0.9771 PDE 0.9181 0.9364 0.9644 0.9589 0.9612 0.9603 0.9850 0.9769 TNFR 0.9297 0.9420 0.9537 0.9526 0.9628 0.9671 0.9845 0.9866 USP 0.9792 0.9798 0.9907 0.9895 0.9827 0.9875 0.9900 0.9895

Page 82: Gene family classification using  a semi-supervised learning method

Insights - 1

• SSL is most effective for families that are not cliques but are connected.

• In test set, 12/18 cliques, 3/18 not connected.

• What fraction of protein families are cliques? Is the large number of cliques in the test set due to sample bias?

Page 83: Gene family classification using  a semi-supervised learning method

Insights - 2

• Performance evaluation measures should match the needs of the user.

– AUC scores penalize all FNs and FPs.

– For experimental biologists, top ranked predictions are of interest

– The number of FNs after the first false positive can reveal some information

Page 84: Gene family classification using  a semi-supervised learning method

Insights - 3

• Semi-supervised learning algorithm provides an appealing visualization tool for identifying family members especially when the number of known family members are small

Page 85: Gene family classification using  a semi-supervised learning method

Acknowledgements

• John Lafferty

• Dannie Durand

• Jerry Zhu

Durand Lab• Robbie Sedgewick

• Rose Hoberman

• Ben Vernot

• Narayanan Raghupathy

• Aiton Goldman

• Jacob Joseph

• Annette McLeod

• Maureen Stolzer

Page 86: Gene family classification using  a semi-supervised learning method

Thank You !

Page 87: Gene family classification using  a semi-supervised learning method