Single nucleotide polymorphisms and applications
Usman Roshan
BNFO 601
SNPs
• DNA sequence variations that occur when a single nucleotide is altered.
• Must be present in at least 1% of the population to be a SNP.
• Occur on average once every 100 to 300 bases along the 3 billion-base human genome.
• Many have no effect on cell function but some could affect disease risk and drug response.
Toy example
SNPs on the chromosome
[Figure with labels: SNP, Chromosome, Gene]
Bi-allelic SNPs
• Most SNPs have one of two nucleotides at a given position
• For example:
– A/G denotes the varying nucleotide as either A or G. We call each of these an allele
– Most SNPs have two alleles (bi-allelic)
SNP genotype
• We inherit two copies of each chromosome (one from each parent)
• For a given SNP the genotype defines the type of alleles we carry
• Example: for the SNP A/G one's genotype may be
– AA if both copies of the chromosome have A
– GG if both copies of the chromosome have G
– AG or GA if one copy has A and the other has G
– The first two cases are called homozygous and the latter two are heterozygous
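As a small illustration of the genotype categories above, the sketch below labels a two-letter genotype as homozygous or heterozygous. The function name `classify_genotype` is hypothetical and not part of the slides.

```python
def classify_genotype(genotype: str) -> str:
    """Return 'homozygous' or 'heterozygous' for a two-allele genotype such as 'AG'."""
    if len(genotype) != 2:
        raise ValueError("expected exactly two alleles, e.g. 'AG'")
    first, second = genotype[0], genotype[1]
    # Both chromosome copies carry the same allele -> homozygous.
    return "homozygous" if first == second else "heterozygous"


for g in ["AA", "GG", "AG", "GA"]:
    print(g, classify_genotype(g))
```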
SNP genotyping
Real SNPs
• SNP consortium: snp.cshl.org
• SNPedia: www.snpedia.com
Application of SNPs: association with disease
• Experimental design to detect cancer associated SNPs:
– Pick random humans with and without cancer (say breast cancer)
– Perform SNP genotyping
– Look for associated SNPs
– Also called genome-wide association study
Case-control example
• Study of 100 people:
– Case: 50 subjects with cancer
– Control: 50 subjects without cancer
• Count number of alleles and form a contingency table (each subject carries two alleles, so each row sums to 100)

            #Allele1   #Allele2
Case           10         90
Control         2         98
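The slides do not say which statistic is used to test the table for association. As one common choice, the sketch below computes a Pearson chi-square statistic (no continuity correction), its p-value for one degree of freedom, and the odds ratio for the counts above. The function name `allele_association` and the choice of test are assumptions made for illustration.

```python
import math


def allele_association(table):
    """table = [[case_a1, case_a2], [control_a1, control_a2]] of allele counts.

    Returns (chi-square statistic, p-value for 1 degree of freedom, odds ratio).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if allele and case/control status were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    return chi2, p_value, odds_ratio


chi2, p_value, odds_ratio = allele_association([[10, 90], [2, 98]])
print(f"chi2={chi2:.3f}  p={p_value:.4f}  odds ratio={odds_ratio:.2f}")
```

A genome-wide study repeats such a test for every SNP, so in practice the p-value threshold has to account for the number of tests performed.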
Effect of population structure on genome-wide association studies
• Suppose our sample is drawn from a population of two groups, I and II
• Assume that group I mostly carries the first allele and group II mostly carries the second allele
• Further assume that most case subjects belong to group I and most control subjects to group II
• This leads to a false association between the first allele and the disease
Effect of population structure on genome-wide association studies
• We can correct this effect if cases and controls are equally sampled from all sub-populations
• To do this we need to know the population structure
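The confounding effect and its correction can be sketched numerically. In the example below the allele frequencies (0.8 in group I, 0.2 in group II) and the sample mixes are hypothetical numbers chosen for illustration. Within each group, cases and controls have the same allele frequency, so there is no true association.

```python
def pooled_counts(n_group1, n_group2, freq1, freq2):
    """Expected (allele1, allele2) counts when alleles are drawn from two groups."""
    allele1 = n_group1 * freq1 + n_group2 * freq2
    allele2 = n_group1 * (1 - freq1) + n_group2 * (1 - freq2)
    return allele1, allele2


def odds_ratio(case, control):
    """Odds ratio of allele 1 in cases versus controls; 1.0 means no association."""
    return (case[0] * control[1]) / (case[1] * control[0])


FREQ_I, FREQ_II = 0.8, 0.2  # hypothetical allele-1 frequencies in groups I and II

# Unbalanced sampling: cases mostly from group I, controls mostly from group II.
unbalanced = odds_ratio(
    pooled_counts(80, 20, FREQ_I, FREQ_II),
    pooled_counts(20, 80, FREQ_I, FREQ_II),
)

# Balanced sampling: cases and controls drawn equally from both groups.
balanced = odds_ratio(
    pooled_counts(50, 50, FREQ_I, FREQ_II),
    pooled_counts(50, 50, FREQ_I, FREQ_II),
)

print(f"unbalanced odds ratio: {unbalanced:.2f}")
print(f"balanced odds ratio:   {balanced:.2f}")
```

With unbalanced sampling the pooled odds ratio is well above 1 even though neither group shows any association on its own. With balanced sampling it returns to 1.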
Population structure prediction
• Treated as an unsupervised learning problem (i.e. clustering)
Clustering
• Suppose we want to cluster n vectors in R^d into two groups. Define C1 and C2 as the two groups.
• Our objective is to find C1 and C2 that minimize
$$\sum_{i=1}^{2} \sum_{x_j \in C_i} \| x_j - m_i \|^2$$
where mi is the mean of class Ci
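To make the objective concrete, the sketch below evaluates it in plain Python for two candidate clusterings of four points. The helper names (`mean`, `squared_distance`, `objective`) and the data are illustrative and not from the slides.

```python
def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(p[k] for p in points) / len(points) for k in range(len(points[0]))]


def squared_distance(x, m):
    """Squared Euclidean distance ||x - m||^2."""
    return sum((xk - mk) ** 2 for xk, mk in zip(x, m))


def objective(clusters):
    """Sum over clusters of squared distances from each point to its cluster mean."""
    total = 0.0
    for cluster in clusters:
        if not cluster:
            continue  # an empty cluster contributes nothing
        m = mean(cluster)
        total += sum(squared_distance(x, m) for x in cluster)
    return total


# Grouping nearby points gives a much smaller objective than mixing them.
good = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
print(objective(good), objective(bad))
```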
K-means algorithm for two clusters
Input: $x_i \in \mathbb{R}^d,\ i = 1 \ldots n$
Algorithm:
1. Initialize: assign xi to C1 or C2 with equal probability and compute means:
$$m_1 = \frac{1}{|C_1|} \sum_{x_i \in C_1} x_i, \qquad m_2 = \frac{1}{|C_2|} \sum_{x_i \in C_2} x_i$$
2. Recompute clusters: assign xi to C1 if ||xi-m1||<||xi-m2||, otherwise assign to C2
3. Recompute means m1 and m2
4. Compute objective
$$\sum_{i=1}^{2} \sum_{x_j \in C_i} \| x_j - m_i \|^2$$
5. Compare it with the objective of the previous clustering. If the difference is smaller than $\delta$ then stop, otherwise go to step 2.
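The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation. The function name `two_means`, the `seed` and `max_iter` parameters, and the guards against an empty cluster are assumptions added here. The input is assumed to contain at least two points.

```python
import random


def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(p[k] for p in points) / len(points) for k in range(len(points[0]))]


def sqdist(x, m):
    """Squared Euclidean distance ||x - m||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, m))


def two_means(points, delta=1e-9, seed=0, max_iter=100):
    """Two-cluster k-means; returns (labels, means, objective history)."""
    rng = random.Random(seed)
    # Step 1: assign each point to C1 (label 0) or C2 (label 1) with equal probability.
    labels = [rng.randrange(2) for _ in points]
    if len(set(labels)) < 2:  # assumption: force both clusters to be non-empty
        labels[0] = 1 - labels[0]

    def means_of(lab):
        return [mean([p for p, l in zip(points, lab) if l == c]) for c in (0, 1)]

    def objective(lab, ms):
        return sum(sqdist(p, ms[l]) for p, l in zip(points, lab))

    means = means_of(labels)
    history = [objective(labels, means)]
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the closer mean (ties go to C2).
        new_labels = [0 if sqdist(p, means[0]) < sqdist(p, means[1]) else 1
                      for p in points]
        if len(set(new_labels)) < 2:
            break  # assumption: keep the last clustering with two non-empty clusters
        labels = new_labels
        means = means_of(labels)                   # Step 3: recompute means
        history.append(objective(labels, means))   # Step 4: compute objective
        if history[-2] - history[-1] < delta:      # Step 5: stop on a small decrease
            break
    return labels, means, history


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, means, history = two_means(points)
print(labels)
print(history)
```

The returned `history` lists the objective after initialization and after each iteration, which makes it easy to check that the value never increases.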
K-means
• Is it guaranteed to find the clustering which optimizes the objective?
• No: it is only guaranteed to find a local optimum
• We can prove that the objective never increases in subsequent iterations
Proof sketch of convergence of k-means
$$\sum_{i=1}^{2} \sum_{x_j \in C_i} \| x_j - m_i \|^2 \;\ge\; \sum_{i=1}^{2} \sum_{x_j \in C_i^*} \| x_j - m_i \|^2 \;\ge\; \sum_{i=1}^{2} \sum_{x_j \in C_i^*} \| x_j - m_i^* \|^2$$
Justification of first inequality: the new clusters Ci* are obtained by assigning each xj to its closest mean, so the objective decreases or stays the same
Justification of second inequality: mi* is the mean of the new cluster Ci*, and for a given cluster its mean minimizes squared error loss
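The second justification can be checked with a short derivation. The notation here is an assumption for illustration: C is any fixed cluster, $\bar{x}$ is its mean, and m is an arbitrary candidate center.

```latex
\begin{align*}
\sum_{x_j \in C} \| x_j - m \|^2
  &= \sum_{x_j \in C} \| (x_j - \bar{x}) + (\bar{x} - m) \|^2 \\
  &= \sum_{x_j \in C} \| x_j - \bar{x} \|^2
     + 2\,(\bar{x} - m)^{\top} \sum_{x_j \in C} (x_j - \bar{x})
     + |C| \, \| \bar{x} - m \|^2 \\
  &= \sum_{x_j \in C} \| x_j - \bar{x} \|^2 + |C| \, \| \bar{x} - m \|^2 ,
\end{align*}
```

The cross term vanishes because $\sum_{x_j \in C} (x_j - \bar{x}) = 0$ by the definition of the mean. The remaining term $|C| \, \| \bar{x} - m \|^2$ is non-negative and is zero only when $m = \bar{x}$, so the cluster mean minimizes the squared error.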