Single nucleotide polymorphisms and applications Usman Roshan BNFO 601


SNPs

• DNA sequence variations that occur when a single nucleotide is altered.

• A variant must be present in at least 1% of the population to be called a SNP.

• Occur every 100 to 300 bases along the 3 billion-base human genome.

• Many have no effect on cell function but some could affect disease risk and drug response.

Toy example

SNPs on the chromosome

[Figure: diagram labeled SNP, Chromosome, Gene]

Bi-allelic SNPs

• Most SNPs have one of two nucleotides at a given position

• For example:
– A/G denotes the varying nucleotide as either A or G. We call each of these an allele
– Most SNPs have two alleles (bi-allelic)

SNP genotype

• We inherit two copies of each chromosome (one from each parent)

• For a given SNP the genotype defines the type of alleles we carry

• Example: for the SNP A/G one's genotype may be
– AA if both copies of the chromosome have A
– GG if both copies of the chromosome have G
– AG or GA if one copy has A and the other has G
– The first two cases are called homozygous and the latter two are heterozygous
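The genotype rule above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_genotype` and the two-letter string encoding of a genotype are assumptions, not something the slides define.

```python
def classify_genotype(genotype: str) -> str:
    """Return 'homozygous' or 'heterozygous' for a two-letter genotype such as 'AG'."""
    if len(genotype) != 2:
        raise ValueError("a genotype has exactly two alleles, one per chromosome copy")
    first, second = genotype.upper()
    # Both chromosome copies carry the same allele -> homozygous.
    return "homozygous" if first == second else "heterozygous"


# The four genotypes listed for the A/G SNP.
for g in ["AA", "GG", "AG", "GA"]:
    print(g, classify_genotype(g))
```

AG and GA are treated identically here because the rule only asks whether the two copies carry the same allele.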

SNP genotyping

Real SNPs

• SNP consortium: snp.cshl.org

• SNPedia: www.snpedia.com

Application of SNPs: association with disease

• Experimental design to detect cancer-associated SNPs:
– Pick random humans with and without cancer (say breast cancer)
– Perform SNP genotyping
– Look for associated SNPs
– Also called a genome-wide association study

Case-control example

• Study of 100 people:
– Case: 50 subjects with cancer
– Control: 50 subjects without cancer

• Count number of alleles and form a contingency table (each subject carries two alleles, so each row sums to 100)

           #Allele1   #Allele2
Case          10         90
Control        2         98
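One common way to judge whether such a table shows an association is Pearson's chi-square test. The slides do not name a test statistic, so treat the choice of test, and the helper names `count_alleles` and `chi_square_2x2`, as assumptions made for this sketch.

```python
def count_alleles(genotypes, allele_1, allele_2):
    """Count each allele across two-letter genotypes such as 'AG' (hypothetical helper)."""
    joined = "".join(genotypes)
    return joined.count(allele_1), joined.count(allele_2)


def chi_square_2x2(table):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if allele and case/control status were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Contingency table from the slide: rows are case/control, columns are allele counts.
table = [[10, 90], [2, 98]]
statistic = chi_square_2x2(table)
print(f"chi-square = {statistic:.3f}")  # prints "chi-square = 5.674"
```

With one degree of freedom the 5% critical value is 3.841, so under this test the table above would be flagged as an association.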

Effect of population structure on genome-wide association studies

• Suppose our sample is drawn from a population of two groups, I and II

• Assume that group I carries mostly the first allele and group II mostly the second allele

• Further assume that most case subjects belong to group I and most control subjects to group II

• This leads to the false conclusion that the first allele is associated with the disease
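The scenario above can be made concrete with a small calculation. All numbers below are hypothetical and chosen only for illustration: within each group, cases and controls share the same allele frequency, so there is no real association, yet the pooled table still looks associated.

```python
# Hypothetical numbers for illustration; they are not from the slides.
FREQ_ALLELE_1_PERCENT = {"I": 80, "II": 20}  # frequency of allele 1 in each group
CASES = {"I": 40, "II": 10}                  # case subjects drawn from each group
CONTROLS = {"I": 10, "II": 40}               # control subjects drawn from each group


def expected_allele_counts(subjects_per_group):
    """Expected (allele 1, allele 2) counts; each subject carries two alleles."""
    allele_1 = sum(
        2 * n * FREQ_ALLELE_1_PERCENT[group] for group, n in subjects_per_group.items()
    ) // 100
    total = 2 * sum(subjects_per_group.values())
    return allele_1, total - allele_1


case_counts = expected_allele_counts(CASES)
control_counts = expected_allele_counts(CONTROLS)
print("Case   ", case_counts)     # (68, 32)
print("Control", control_counts)  # (32, 68)
```

The pooled counts differ sharply between cases and controls only because the two samples are drawn unevenly from groups I and II.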

Effect of population structure on genome-wide association studies

• We can correct this effect if cases and controls are equally sampled from all sub-populations

• To do this we need to know the population structure

Population structure prediction

• Treated as an unsupervised learning problem (i.e. clustering)

Clustering

• Suppose we want to cluster n vectors in $\mathbb{R}^d$ into two groups. Define $C_1$ and $C_2$ as the two groups.

• Our objective is to find $C_1$ and $C_2$ that minimize

$$\sum_{i=1}^{2} \sum_{x_j \in C_i} \|x_j - m_i\|^2$$

where $m_i$ is the mean of class $C_i$

K-means algorithm for two clusters

Input: $x_i \in \mathbb{R}^d$, $i = 1 \ldots n$

Algorithm:

1. Initialize: assign $x_i$ to $C_1$ or $C_2$ with equal probability and compute means:

$$m_1 = \frac{1}{|C_1|} \sum_{x_i \in C_1} x_i \qquad m_2 = \frac{1}{|C_2|} \sum_{x_i \in C_2} x_i$$

2. Recompute clusters: assign $x_i$ to $C_1$ if $\|x_i - m_1\| < \|x_i - m_2\|$, otherwise assign to $C_2$

3. Recompute means $m_1$ and $m_2$

4. Compute objective

$$\sum_{i=1}^{2} \sum_{x_j \in C_i} \|x_j - m_i\|^2$$

5. Compute objective of new clustering. If difference is smaller than $\delta$ then stop, otherwise go to step 2.
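The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the `seed` parameter, the redraw when the random initialization leaves a cluster empty, and keeping the old mean if a cluster later becomes empty are all choices made here that the slides do not specify.

```python
import random


def squared_distance(x, m):
    return sum((a - b) ** 2 for a, b in zip(x, m))


def mean(points):
    return [sum(p[k] for p in points) / len(points) for k in range(len(points[0]))]


def objective(points, labels, means):
    """Sum of squared distances from each point to the mean of its cluster."""
    return sum(squared_distance(x, means[c]) for x, c in zip(points, labels))


def two_means(points, delta=1e-9, seed=0):
    if len(points) < 2:
        raise ValueError("need at least two points to form two clusters")
    rng = random.Random(seed)
    # Step 1: assign each point to cluster 0 or 1 with equal probability
    # (assumption: redraw if one cluster happens to be empty), then compute means.
    labels = [rng.randrange(2) for _ in points]
    while len(set(labels)) < 2:
        labels = [rng.randrange(2) for _ in points]
    means = [mean([x for x, c in zip(points, labels) if c == i]) for i in range(2)]
    previous = objective(points, labels, means)
    while True:
        # Step 2: assign each point to the cluster with the closer mean (ties go to cluster 1).
        labels = [
            0 if squared_distance(x, means[0]) < squared_distance(x, means[1]) else 1
            for x in points
        ]
        # Step 3: recompute means (assumption: keep the old mean if a cluster is empty).
        for i in range(2):
            members = [x for x, c in zip(points, labels) if c == i]
            if members:
                means[i] = mean(members)
        # Steps 4 and 5: stop once the objective improves by less than delta.
        current = objective(points, labels, means)
        if previous - current < delta:
            return labels, means, current
        previous = current


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, means, value = two_means(points)
print(labels, means, value)
```

On this toy data the two well-separated groups of three points end up in different clusters; which group receives label 0 depends on the random initialization.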

K-means

• Is it guaranteed to find the clustering which optimizes the objective?

• No: it is only guaranteed to converge to a local optimum

• We can prove that the objective never increases with subsequent iterations

Proof sketch of convergence of k-means

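Let $C_i^*$ be the clusters after reassignment (step 2) and $m_i^*$ their recomputed means (step 3). Then

$$\sum_{i=1}^{2} \sum_{x_j \in C_i} \|x_j - m_i\|^2 \;\ge\; \sum_{i=1}^{2} \sum_{x_j \in C_i^*} \|x_j - m_i\|^2 \;\ge\; \sum_{i=1}^{2} \sum_{x_j \in C_i^*} \|x_j - m_i^*\|^2$$

Justification of first inequality: by assigning $x_j$ to the closest mean the objective decreases or stays the same

Justification of second inequality: for a given cluster its mean minimizes squared error loss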
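The second justification can be checked with a short derivation. The symbols $C$, $m$ and $c$ below are introduced only for this sketch: $C$ is any fixed cluster, $m$ its mean, and $c$ an arbitrary candidate center. Writing $x_j - c = (x_j - m) + (m - c)$ and expanding gives

$$
\begin{aligned}
\sum_{x_j \in C} \|x_j - c\|^2
&= \sum_{x_j \in C} \|x_j - m\|^2 + 2\,(m - c)^\top \sum_{x_j \in C} (x_j - m) + |C|\,\|m - c\|^2 \\
&= \sum_{x_j \in C} \|x_j - m\|^2 + |C|\,\|m - c\|^2
\end{aligned}
$$

The middle term vanishes because $\sum_{x_j \in C} (x_j - m) = 0$ by the definition of the mean. The remaining extra term $|C|\,\|m - c\|^2$ is non-negative and is zero only when $c = m$, so replacing any center by the cluster mean cannot increase the squared error.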
