Practical Bioinformatics

Mark Voorhies

4/6/2017


Loading and re-loading your functions

    # Use import the first time you load a module
    # (And keep using import until it loads
    # successfully)
    import my_module

    my_module.my_function(42)

    # Once a module has been loaded, use reload to
    # force python to read your new code
    from importlib import reload
    reload(my_module)


Pearson distances

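Pearson similarity

$$s(x, y) = \frac{\sum_{i}^{N} (x_i - x_{\mathrm{offset}})(y_i - y_{\mathrm{offset}})}{\sqrt{\sum_{i}^{N} (x_i - x_{\mathrm{offset}})^2}\,\sqrt{\sum_{i}^{N} (y_i - y_{\mathrm{offset}})^2}}$$

Pearson distance

$$d(x, y) = 1 - s(x, y)$$

Euclidean distance

$$\frac{\sum_{i}^{N} (x_i - y_i)^2}{N}$$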
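The formulas on the Pearson distances slide translate fairly directly into Python. The sketch below is only one possible reading: the function names are hypothetical, and it assumes that the offset defaults to the mean of each vector (the usual centered Pearson) while an offset of 0 gives the uncentered variant. The Euclidean distance follows the slide as written, i.e. the mean squared difference with no square root.

```python
from math import sqrt


def pearson_similarity(x, y, x_offset=None, y_offset=None):
    """Pearson similarity s(x, y) with explicit offsets.

    Assumption: an offset of None means "use the mean of that vector"
    (centered Pearson); pass 0 for the uncentered variant.
    Raises ZeroDivisionError if either vector equals its offset everywhere.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if x_offset is None:
        x_offset = sum(x) / len(x)
    if y_offset is None:
        y_offset = sum(y) / len(y)
    numerator = sum((xi - x_offset) * (yi - y_offset) for xi, yi in zip(x, y))
    x_norm = sqrt(sum((xi - x_offset) ** 2 for xi in x))
    y_norm = sqrt(sum((yi - y_offset) ** 2 for yi in y))
    return numerator / (x_norm * y_norm)


def pearson_distance(x, y, x_offset=None, y_offset=None):
    """Pearson distance d(x, y) = 1 - s(x, y)."""
    return 1 - pearson_similarity(x, y, x_offset, y_offset)


def euclidean_distance(x, y):
    """Euclidean distance as written on the slide: sum of squared differences over N."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)
```

Two profiles that differ only by a scale factor have a Pearson distance near 0 but a nonzero Euclidean distance, which is one reason the choice of metric matters for clustering.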


Comparing all measurements for two genes

[Figure: scatter plot "Comparing two expression profiles (r = 0.97)". x-axis: TLC1 log2 relative expression; y-axis: YFG1 log2 relative expression.]


Comparing all genes for two measurements

[Figure: scatter plot. x-axis: Array 1, log2 relative expression; y-axis: Array 2, log2 relative expression.]


[Figure: the same Array 1 vs. Array 2 scatter plot (log2 relative expression), panel titled "Euclidean Distance".]


[Figure: the same Array 1 vs. Array 2 scatter plot (log2 relative expression), panel titled "Uncentered Pearson".]


Measure all pairwise distances under distance metric
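As a rough illustration of this step, the sketch below fills a square matrix with the distance between every pair of rows. The function names are hypothetical, and it assumes the metric is symmetric with zero self-distance, so only the upper triangle is computed and then mirrored. The metric used in the example is the mean squared difference from the Euclidean distance formula earlier in these slides.

```python
def mean_squared_difference(x, y):
    """Example metric: sum of squared differences divided by N."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


def distance_matrix(rows, dist):
    """Return an n x n list of lists with dist(rows[i], rows[j]) at [i][j].

    Assumes dist is symmetric and dist(r, r) == 0, so the diagonal is
    left at 0.0 and each off-diagonal pair is computed only once.
    """
    n = len(rows)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(rows[i], rows[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


profiles = [[0, 0], [2, 0], [0, 2]]
pairwise = distance_matrix(profiles, mean_squared_difference)
```

Passing the metric in as a function makes it easy to swap in a Pearson distance without touching the loop.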


Hierarchical Clustering
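The slides show this step only as pictures, so the following is a minimal sketch of agglomerative clustering rather than the method used in class. It assumes average linkage (the distance between two clusters is the mean of all pairwise distances between their members), which the slides do not specify, and the function name is hypothetical. It takes a precomputed square distance matrix and returns the merges in the order they happen.

```python
def average_linkage(dist_matrix):
    """Naive agglomerative clustering on a square distance matrix.

    Returns a list of (members_a, members_b, distance) tuples, one per
    merge, where members are tuples of original row indices.
    Average linkage is an assumption made for this sketch.
    """
    clusters = [(i,) for i in range(len(dist_matrix))]
    merges = []

    def linkage(a, b):
        # Mean of all pairwise distances between members of a and b
        total = sum(dist_matrix[i][j] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of current clusters
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        a, b = clusters[p], clusters[q]
        merges.append((a, b, d))
        # Replace the two merged clusters with their union
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(a + b)
    return merges


# Three points on a line at positions 0, 1 and 10 (absolute differences)
example = [[0, 1, 10], [1, 0, 9], [10, 9, 0]]
merge_order = average_linkage(example)
```

The merge list is enough to draw a tree: each entry joins two existing nodes at a height equal to the merge distance.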


It's hard work at times, but you have to be realistic. If you have a large database with many variables and your goal is to get a good understanding of the interrelationships, then, unless you get lucky, this complex structure is bound to require some hard work to understand.

Bill Cleveland and Rick Becker
http://stat.bell-labs.com/project/trellis/interview.html


Using JavaTreeView


Adjust pixel settings for global view


Select annotation columns


Select URL for gene annotations


Activate and detach annotation window


Clustering exercises – Negative controls

Write functions to reproduce the shuffling controls in figure 3 of the Eisen paper (removing correlations among genes and/or arrays).

    def shuffleGenes(self, seed=None):
        """Shuffle expression matrix by row."""
        import random
        if seed is not None:
            random.seed(seed)
        # list() is needed: random.shuffle cannot shuffle a range object
        indices = list(range(len(self.geneName)))
        random.shuffle(indices)
        # Apply the same permutation to names, annotations, and data rows
        genes = [self.geneName[i] for i in indices]
        self.geneName = genes
        annotations = [self.geneAnn[i] for i in indices]
        self.geneAnn = annotations
        num = [self.num[i] for i in indices]
        self.num = num


    def shuffleRows(self, seed=None):
        """Permute ratio values within rows."""
        import random
        if seed is not None:
            random.seed(seed)
        for i in self.num:
            random.shuffle(i)

    def shuffleCols(self, seed=None):
        """Permute ratio values within columns."""
        import random
        if seed is not None:
            random.seed(seed)
        # Transpose the expression matrix
        cols = []
        for col in range(len(self.num[0])):
            cols.append([row[col] for row in self.num])
        # Shuffle
        for i in cols:
            random.shuffle(i)
        # Transpose back to original orientation
        self.num = []
        for row in range(len(cols[0])):
            self.num.append([col[row] for col in cols])


Homework

1. Explore different clustering methods and/or distance methods
2. Try additional shufflings of the data: how do they affect your ability to cluster the data? Cf. figure 3 of the Eisen paper
   - Permute the columns
   - Independently permute the columns of each row
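One way to read the two shufflings in item 2 is sketched below on a plain list-of-lists matrix. This is an interpretation, not the course solution: the function names are hypothetical, "permute the columns" is taken to mean applying one permutation to every row so that whole columns move together, and "independently permute the columns of each row" is taken to mean shuffling each row on its own. Both functions return a new matrix and leave the input untouched.

```python
import random


def permute_columns(matrix, seed=None):
    """Reorder whole columns: the same permutation is applied to every row."""
    rng = random.Random(seed)  # local generator, so global state is not touched
    order = list(range(len(matrix[0])))
    rng.shuffle(order)
    return [[row[j] for j in order] for row in matrix]


def permute_within_rows(matrix, seed=None):
    """Shuffle the values of each row independently of the other rows."""
    rng = random.Random(seed)
    shuffled = []
    for row in matrix:
        new_row = list(row)  # copy so the caller's data is not modified
        rng.shuffle(new_row)
        shuffled.append(new_row)
    return shuffled


data = [[1, 2, 3], [10, 20, 30]]
by_column = permute_columns(data, seed=0)
by_row = permute_within_rows(data, seed=0)
```

Under this reading, the first shuffle keeps each array (column) intact and only changes column order, while the second breaks the correspondence between arrays across genes.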
