1
Genes and MS in Tasmania, completed.
Lecture 7, Statistics 246, February 12, 2004
2
Towards a sharing statistic
Our aim was to come up with a statistic that effectively describes haplotype sharing differences between case and “control” haplotypes.
The sharing statistic should be largest at markers closest to a disease locus, as
- haplotype sharing there should extend the furthest; and
- the association of disease with particular haplotypes should be strongest.
3
Nonparametric haplotype sharing analysis
Why nonparametric, rather than likelihood-based methods?
• Likelihood methods make assumptions regarding the genealogy of the population, and we don’t know how many of these assumptions are robust to violations.
• Likelihood methods are computationally intensive, especially for genome-wide scans, where there is a need to maximize over the very large state space of possible ancestral haplotypes (MCMC)
• Likelihood methods have a hard time at the HLA region, because the LD there is extremely high and non-uniform (block-like structure)
• Simpler statistics will probably do better here, unless we can model background LD
4
Haplotype sharing statistics for genome wide scan data cf. fine mapping
• Previous (usually likelihood-based) statistics have concentrated on fine mapping and the exact localization of a variant allele. They assume a signal exists.
• For us, localization was not the primary interest. Rather, detection was our main interest, using a genome-wide scan
• We needed something that was not as computationally intensive as DHSMAP (McPeek & Strahs, 1999), BLADE (Liu et al., 2001), DMLE+ (Rannala & Reeve, 2001), or the shattered coalescent (Morris et al., 2002).
5
Haplo_clusters (Melanie Bahlo)
• Calculates a sharing statistic at every marker
• Obtains a p-value at every marker using a permutation test
• Allows for several clusters of ancestral haplotypes (allelic heterogeneity)
6
Testing for shared haplotypes
[Figure: haplotypes of cases and controls at 13 markers, ordered from Pter to Qter, with the score for haplotype sharing (-log p) plotted along the chromosome. Haplotype rows as extracted, cases first:]
3 5 9 8 7 6 10 1 5 4 3 2 5
3 2 1 3 7 6 10 1 5 4 1 3 2
1 2 1 3 5 6 10 1 5 2 1 3 4
2 3 7 3 1 6 10 9 1 1 2 5 6
5 9 1 1 4 1 3 1 2 3 1 9 8
7 6 5 3 1 3 2 1 5 9 7 9 1
7 1 2 1 1 3 5 7 1 5 1 3 2
9 3 9 2 1 2 7 5 3 4 2 2 5
7
Sharing drop-off & allelic heterogeneity
[Figure: for markers 1 to 4, the proportions of cases and the proportions of controls carrying cluster 1 haplotypes, cluster 2 haplotypes, or neither cluster 1 nor 2 haplotypes.]
8
Haplo_cluster in action
Example: Sorting on marker 1 for a sample of 3 case and 4 control haplotypes

Cases     Controls
2 1 3     1 1 2
2 1 3     1 2 3
2 1 4     1 3 3
          3 1 2

After sorting on the haplotype consisting only of marker 1, calculate a chi-square statistic, and move on.

Haplotype   1   2   3
Controls    3   0   1
Cases       0   3   0

After sorting on the haplotype consisting of marker 1 and marker 2, calculate a chi-square statistic, and ….

Haplotype   1 1   1 2   1 3   2 1   3 1
Controls     1     1     1     0     1
Cases        0     0     0     3     0

Eventually stop, and sum the chi-square statistics. Then repeat for a suitably large number of random permutations of cases and controls.
9
Statistic to evaluate haplotype sharing
The sharing statistic is χ²-based, using the idea of multiple ancestral haplotypes (clusters) which are “grown” starting at each marker examined in the scan.
Significance is evaluated via a permutation test: choose a random permutation of the pooled cases and controls, and recalculate the statistic; repeat ~20,000 times.
A recursive form for the estimator and the SD of the p-value was used, to enable early termination of the program.
$$S_i = \sum_{j=1}^{K} \sum_{k=1}^{K} \chi^2_{i,j,k}$$
where $\chi^2_{i,j,k}$ is the $\chi^2_1$ test for association between the number of case and control haplotypes still sharing the ancestral haplotype of cluster $k$ at marker $j$, after starting at marker $i$.
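As a rough illustration of a single term of the sum, the sketch below computes a 1-df Pearson chi-square (no continuity correction) for "shares the ancestral haplotype" against case/control status, using the toy sample of 3 case and 4 control haplotypes from the Haplo_cluster example. The function names are hypothetical, the ancestral haplotype is supplied by hand (the slides do not say how Haplo_clusters chooses it), and the haplotype is only extended to the right of the starting marker, which is a simplification.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a margin is empty, so the table carries no information
    return n * (a * d - b * c) ** 2 / denom


def sharing_chi_square(cases, controls, start, length, ancestral):
    """Chi-square for sharing `ancestral` over markers start .. start+length-1.

    Hypothetical helper: `ancestral` is given by the caller, not estimated.
    """
    def shares(haplotype):
        return tuple(haplotype[start:start + length]) == tuple(ancestral)

    a = sum(shares(h) for h in cases)      # cases sharing
    c = sum(shares(h) for h in controls)   # controls sharing
    return chi_square_2x2(a, len(cases) - a, c, len(controls) - c)


# Toy sample from the sorting example: 3 case and 4 control haplotypes.
cases = [(2, 1, 3), (2, 1, 3), (2, 1, 4)]
controls = [(1, 1, 2), (1, 2, 3), (1, 3, 3), (3, 1, 2)]

chi_m1 = sharing_chi_square(cases, controls, start=0, length=1, ancestral=(2,))
chi_m12 = sharing_chi_square(cases, controls, start=0, length=2, ancestral=(2, 1))
partial_sum = chi_m1 + chi_m12  # first two terms of the sum for this start marker
print(chi_m1, chi_m12, partial_sum)
```

In this toy sample all three cases and none of the controls carry allele 2 at marker 1, so each term reaches the largest value a 2×2 table with 7 observations allows.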
10
The permutation test
The idea is this. We have 170 cases and 105 controls, and at any particular locus, we calculate the value of our statistic, calling it S.
Now pool our cases and controls into 275 individuals, and sample 170 to be “cases” at random from the 275, calling the remainder “controls”. For this first artificial set of cases and controls, calculate the value of our statistic, S1 say.
Next, we repeat this procedure 9,999 more times, say, obtaining values S2, S3, S4, …, S10,000. As long as 10,000 is sufficiently many random permutations, we can get a good estimate of the p-value of our initial statistic relative to our empirically estimated null distribution, as p = #{i: Si > S}/10,000.
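The procedure just described can be sketched in a few lines. This is a minimal illustration, not the Haplo_clusters code: `permutation_p_value` and `toy_statistic` are hypothetical names, the statistic is a stand-in for S (a 1-df chi-square for carrying allele 2 at marker 1), and labels are shuffled over whatever units are passed in (haplotypes here, individuals in the study).

```python
import random


def permutation_p_value(cases, controls, statistic, n_perm=10_000, seed=0):
    """Estimate p = #{i: S_i > S} / n_perm by relabelling the pooled sample."""
    rng = random.Random(seed)
    observed = statistic(cases, controls)
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the pooled sample
        if statistic(pooled[:n_cases], pooled[n_cases:]) > observed:
            exceed += 1
    return exceed / n_perm


def toy_statistic(cases, controls):
    """Hypothetical stand-in for S: chi-square (1 df) for allele 2 at marker 1."""
    a = sum(h[0] == 2 for h in cases)
    b = len(cases) - a
    c = sum(h[0] == 2 for h in controls)
    d = len(controls) - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else (a + b + c + d) * (a * d - b * c) ** 2 / denom


# Toy sample from the Haplo_cluster sorting example.
cases = [(2, 1, 3), (2, 1, 3), (2, 1, 4)]
controls = [(1, 1, 2), (1, 2, 3), (1, 3, 3), (3, 1, 2)]
p_toy = permutation_p_value(cases, controls, toy_statistic, n_perm=2000, seed=1)
print(p_toy)
```

With the strict inequality used on the slide, a perfectly separated toy sample gives an estimate of exactly 0, because no relabelling can exceed the observed value. A common variant counts Si ≥ S and uses (count + 1)/(n_perm + 1), which never returns 0.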
11
Exercises
1. How should we decide what number of resamplings is large enough?
2. Explain in the simple case of a 2×2 table of cases and controls cross-classified as diseased and healthy, how using all possible resamplings, rather than a fixed size random sample, leads to the p-value for the exact test.
3. To avoid carrying out an unnecessarily large number of permutations, the proportion of resampled values of our statistic exceeding the value S can be monitored. Can you describe a stopping rule for the random resamplings that should lead to “accurate enough” p-values, without going to the full number each time?
12
Haplo_clusters - Output
-opt 1    Genetic distances used to decide order of markers to sort on
-c 1      The number of clusters of haplotypes to look for = 1
-miss 1   The missing data is replaced randomly using the 2 marker haplotype information.
-share 5  The number of haplotypes needed to share = 5
The standard deviation p values are calculated to 0.01*phat.
Marker names have been provided and will be used in the output files.
# of case haplotypes = 338
# of contol haplotypes = 208
# of markers = 11
# of perms = 100000
Marker    Mapdistance  Chi_Square  p         sd(p)     -log(p)  perms
====================================================
D21S1911  0            5.34        4.44e-01  4.44e-03  0.35     12510
D21S1904  0.85         6.17        3.63e-01  3.63e-03  0.44     17577
D21S1899  10.36        5.89        4.37e-01  4.37e-03  0.36     12876
D21S1922  16.46        2.97        6.83e-01  6.83e-03  0.17     4636
D21S1884  17.26        4.74        4.14e-01  4.14e-03  0.38     14135
D21S1914  20.82        6.49        3.38e-01  3.38e-03  0.47     19571
D21S263   28.97        4.06        5.24e-01  5.24e-03  0.28     9077
D21S1252  39.41        1.18        8.66e-01  8.65e-03  0.06     1553
D21S1919  42.51        1.38        8.51e-01  8.51e-03  0.07     1751
D21S1255  43.81        2.24        7.24e-01  7.24e-03  0.14     3805
D21S266   51.51        3.86        5.70e-01  5.70e-03  0.24     7557
===================================================
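The sd(p) and perms columns appear consistent with a binomial standard error, sqrt(p(1-p)/n), and with stopping once that standard error falls to 0.01 times the current estimate. The sketch below checks this for three rows of the output. It is an interpretation of the printed numbers, not the program's documented rule, and the function names are hypothetical.

```python
import math


def binomial_sd(p_hat, n_perms):
    """Standard error of a proportion estimated from n_perms permutations."""
    return math.sqrt(p_hat * (1.0 - p_hat) / n_perms)


def precise_enough(p_hat, n_perms, rel=0.01):
    """Assumed stopping check: is sd(p) at most rel * p_hat?"""
    return p_hat > 0 and binomial_sd(p_hat, n_perms) <= rel * p_hat


# (marker, p, perms, reported sd(p)) copied from the output above.
rows = [
    ("D21S1911", 0.444, 12510, 4.44e-03),
    ("D21S1922", 0.683, 4636, 6.83e-03),
    ("D21S1252", 0.866, 1553, 8.65e-03),
]
recomputed = {name: binomial_sd(p, n) for name, p, n, _ in rows}
for name, p, n, reported in rows:
    print(f"{name}: recomputed sd = {recomputed[name]:.2e}, reported = {reported:.2e}")
```

Under this reading, markers with large p-values stop after a few thousand permutations, while only markers with small p-values run towards the full 100000.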
13
Haplo_clusters - Output II
Table of haplotypes
Marker Cluster Haplotype Length(Haplotype)
              D21S1911  D21S1904  D21S1899  D21S1922  D21S1884  D21S1914  D21S263   D21S125
===================================================
D21S1884 1    -         -         6         3         3         8         11        -
# of haplos:  -         -         5         82        163       22        2         -
Chi-square:   -         -         0.2       0.0       3.2       0.1       1.2       -

D21S1914 1    -         -         7         3         3         5         2         -
# of haplos:  -         -         4         16        34        58        10        -
Chi-square:   -         -         2.5       0.0       0.5       2.1       1.3       -

D21S263 1     -         -         -         4         3         10        2         5
# of haplos:  -         -         -         3         5         24        138       9
Chi-square:   -         -         -         1.9       1.2       0.7       0.0       0.3

D21S1252 1    -         -         -         -         -         -         4         5
# of haplos:  -         -         -         -         -         -         6         83
Chi-square:   -         -         -         -         -         -         0.6       0.1

D21S1919 1    -         -         -         -         -         -         2         2
# of haplos:  -         -         -         -         -         -         3         7
Chi-square:   -         -         -         -         -         -         0.3       0.0

Etc etc etc
===================================================
Time taken (m) = 55, 23/6/2003, 11:15:12
Haplo_cluster.pl $Revision:1.15$
14
Output for Chromosome 6
HLA Region – p-value <0.00001. Peak contains D6S105, MOGCA,
15
Empirical distributions of statistic, chr 6
Off scale
16
Comparison of Two Positive Controls against Two Negative Controls
Cases versus Controls
Cases versus Untransmitteds
Untransmitted versus Controls
Controls versus Untransmitted
17
Uniform qq-plots and multiple testing
When we carry out ~800 tests, as we have here, we expect to see many quite small p-values under the combined null hypothesis of no case-control haplotype differences anywhere, specifically, about 40 smaller than the usual 5% cutoff. In practice, we believe that at most a few of these 800 nulls will be false. How do we adjust our p-values for this multiplicity of tests?
One fairly severe way is known as the Bonferroni adjustment: multiply all our p-values by 800. Another approach is this: rather than compare our 800 p-values to the single-test 5% cutoff, we compare them all to the value that the smallest of 800 i.i.d. uniforms will exceed 95% of the time.
Exercises
1. Prove that the Bonferroni procedure is conservative, in that the family-wise type 1 error (the chance of one or more type 1 errors), under the assumption that all the null hypotheses are true, is ≤ 5%.
2. Calculate the 5th percentile of the smallest of 800 i.i.d. uniforms. How close is it to the Bonferroni 5th percentile?
18
Uniform qq-plots and multiple testing, cont.
The procedure just described is still conservative, for two reasons.
Firstly, the p-values are not independent, though they should be identically distributed under the null hypothesis. There are ways to incorporate this into our analysis, the most direct being to estimate the joint resampling distribution of the test statistics for every marker. This can be computationally prohibitive, especially if we also want to address the next point, which is:
Only the smallest of the p-values should be compared to the smallest of an i.i.d. or suitably dependent sequence of 800 p-values. The second smallest p-value should more correctly be compared to something slightly different, and so on. This leads us to the notion of step-wise multiple testing procedures. Resampling-based stepwise multiple testing corrections can be very computationally intensive.
In our present case we did no more than create a uniform qq-plot, and look at the number of loci “off the line” at the low end. Why? In part for computational reasons; in part, because we plan to follow up “promising” regions even if they do not have small adjusted p-values.
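A uniform qq-plot of this kind can be sketched as follows: sort the observed p-values and pair the i-th smallest with i/(n+1), the expected value of the i-th smallest of n i.i.d. uniforms. Plotting both on a -log10 scale is an assumption made here so that the low end is visible; the slides do not state the scale used. The function name is hypothetical.

```python
import math


def uniform_qq_points(p_values):
    """Return (expected, observed) pairs on the -log10 scale, smallest p first."""
    observed = sorted(p_values)
    if observed and observed[0] <= 0.0:
        raise ValueError("p-values must be strictly positive for a -log10 plot")
    n = len(observed)
    # Expected value of the i-th smallest of n i.i.d. uniforms is i / (n + 1).
    expected = [(i + 1) / (n + 1) for i in range(n)]
    return [(-math.log10(e), -math.log10(o)) for e, o in zip(expected, observed)]


# Three made-up p-values, purely to show the shape of the output.
points = uniform_qq_points([0.5, 0.02, 0.9])
for exp_val, obs_val in points:
    print(f"expected {exp_val:.3f}   observed {obs_val:.3f}")
```

Points whose observed coordinate lies well above the expected one at the small-p end are the loci "off the line".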
19
Distribution of p-values: uniform qq plots
[Figure: uniform qq-plots of observed against expected p-values.]
20
Reproducibility: same datasets, different random number seeds
21
Similar method/problem
Similar method
Haplotype Pattern Mining (Toivonen et al, 2000). Ingileif Hallgrimsdottir (Statistics, UCB) modified and extended this method, and her (blindly derived) results were very similar to those obtained using Haplo_cluster on the MS data.
Similar problem
A study of bipolar disorder in the Central Valley of Costa Rica (Service et al., 2001; Ophoff et al., 2002) involves an admixed population of Amerindian and Spanish people, with few founders and little immigration, ~300 years old. They used likelihood methods on 3-locus haplotypes, but did not use controls.
22
What next for the MS study (apart from more analysis)?
• A close study of the MHC (HLA) region was conducted and published
• Relatedness of cases and controls was studied more carefully, and a few “too close” relatives identified and removed, but leaving the main conclusions unchanged
• Fine mapping around peaks was carried out: some peaks were strengthened, others disappeared. Further fine mapping under way.
• The Tasmanian cases are being joined by ethnically similar cases from the mainland, and genotyping of these new individuals in candidate regions is under way
• International collaboration is also under way
• We want to find genes and amino acid changes, if at all possible
23
Fine mapping: two regions
[Figure: fine-mapping plots of -log10 P (panels A #3, B #3, A #4, B #4), each showing four comparisons: Case vs. Control, Case vs. Case UT, Control vs. Case UT, and Case UT vs. Control.]
24
Relatedness in cases and controls
We assume that our cases and controls are mostly representative of the “Tasmanian population”.
If they are too closely related (within cases or controls) we might expect bias in our sharing statistic.
If they are not closely enough related (within cases or controls) we might expect trouble detecting a signal.
25
This pedigree is similar to the type of pedigree found in Tasmania. The “affected” individuals are represented by the filled in symbols.
26
Determining the relatedness of Tasmanians based on GWS data
• Determine the level of relatedness of all pairs based on the genome wide scan data (another HMM analysis)
• We found several pairs which were much more closely related than the 10-12 meioses (6-8 generations) expected
  – 10 pairs in the case data
  – 6 pairs in the control data
  – 2 pairs in the case and control data
• Some of these relationships were subsequently verified with further genealogical research
• We re-did the analyses without these people
27
Does having closely related cases or controls make a difference?
Cases versus Controls
Cases versus Controls (relateds removed)
Cases versus Untransmitteds
Cases versus Untransmitteds (relateds removed)
28
HLA Region & MS
• MS is believed to be an autoimmune disease (similar to type I diabetes)
• HLA association with MS previously identified
• One or more genes?
Log-linear modelling with partial haplotypes suggests that two regions were responsible and that these did not interact
29
The HLA complex
Klein J. et al New Eng J Med, 2000; 343:702-709
An extremely gene-rich region.
30
[Figure: -log10 (P-value) across the HLA complex, computed with 1000 and with 5000 permutations; the peak spans an 850-kb segment.]
Microsatellite markers that spanned the HLA complex generated a peak of association in an 850-Kb segment of the class I region
We have implicated an 850-Kb class I region in MS
31
32
Genetic dissection of the HLA region by haplotype analysis
• The TNF locus + 15 other class III genes have no influence on disease; the association is due to strong LD with DR15
• The HLA region encodes at least two independent susceptibility loci for MS
[Figure: map of the HLA region showing MOG, F, G, A, E, C, B, TNF, DRB1, DQB1 and DPB1, with the DRB1*1501-DQB1*0602 haplotype marked.]
(Rubio et al. 2002 AJHG)
33
An extended haplotype across HLA confers increased risk to MS
[Figure: map of the HLA region (classes I, III, II; 3.6 Mb and 5.1 Mb, ~1 cM) with markers D6S299, D6S105, D6S464, D6S2223, MOGCA, D6S2655, HLA-F, D6S510, DQB1, DRB1 and D6S291. Ancestral haplotype: 3 6 5 3 3 5 1 *1501 *0602 (DR15). Relative risks shown: RR=4.3 and RR=5.7.]
Acknowledgments
MCPHR, Hobart
Ingrid van der Mei, Trish Groom, Kristen Hazelwood, Jane Pittaway, Rhonda McCoy, Lyn Hall, Tracy Lowe, Natasha Newton, Emma Stubbs, Michele Sale, Maree Ring, Annette Banks, Joan Clough, Tim Albion, Jo Dickinson, Shelly Brown, Sue Sawbridge, Deirbhile O’Byrne, Bruce Taylor, Stan Sjeica, Andrew Hughes, Bozidar Drulovic, Terry Dwyer
WEHI
Justin Rubio, Laura Johnson, Rachel Burfoot, Stewart Huxtable, Simon Foote
ANRMSF, Canberra
Rex Simmons
MCRI
Funding: The Genes-CRC; The National Multiple Sclerosis Society (USA); Department of Neurosciences RMH; MS Australia; NH&MRC (Australia)
VTIS
The Tasmanian and Victorian public
The MS Societies of Victoria and Tasmania
Brian Tait, Mike Varney
Bob Williamson
The AGRF (Melbourne)
RMH
Niall Tubridy, Jo Baker, John Cary
Trevor Kilpatrick, Helmut Butzkueven, Mark Marriot
Melanie Bahlo, Jim Stankovich, Chris Wilkinson