1 Genes and MS in Tasmania, completed. Lecture 7, Statistics 246 February 12, 2004

1


2

Towards a sharing statistic

Our aim was to come up with a statistic that effectively describes haplotype sharing differences between case and “control” haplotypes.

The sharing statistic should be largest at markers closest to a disease locus, because there

- haplotype sharing should extend the furthest; and

- the association of disease with particular haplotypes should be strongest.

3

Nonparametric haplotype sharing analysis

Why nonparametric, rather than likelihood-based methods?

• Likelihood methods make assumptions regarding the genealogy of the population, and we don’t know how robust the methods are to violations of many of these assumptions.

• Likelihood methods are computationally intensive, especially for genome-wide scans, where there is a need to maximize over the very large state space of possible ancestral haplotypes (MCMC).

• Likelihood methods have a hard time at the HLA region, because the LD there is extremely high and non-uniform (block-like structure).

• Simpler statistics will probably do better here, unless we can model background LD

4

Haplotype sharing statistics for genome wide scan data cf. fine mapping

• Previous (usually likelihood-based) statistics have concentrated on fine mapping and the exact localization of a variant allele. They assume a signal exists.

• For us, localization was not the primary interest. Rather, detection was our main interest, using a genome-wide scan.

• We needed something that was not as computationally intensive as DHSMAP (McPeek & Strahs, 1999), BLADE (Liu et al, 2001), DMLE+ (Rannala & Reeve, 2001), or the shattered coalescent (Morris et al, 2002).

5

Haplo_clusters (Melanie Bahlo)

• Calculates a sharing statistic at every marker

• Obtains a p-value at every marker using a permutation test

• Allows for several clusters of ancestral haplotypes (allelic heterogeneity)

6

Testing for shared haplotypes

Cases

3 5 9 8 7 6 10 1 5 4 3 2 5
3 2 1 3 7 6 10 1 5 4 1 3 2
1 2 1 3 5 6 10 1 5 2 1 3 4
2 3 7 3 1 6 10 9 1 1 2 5 6

Controls

5 9 1 1 4 1 3 1 2 3 1 9 8
7 6 5 3 1 3 2 1 5 9 7 9 1
7 1 2 1 1 3 5 7 1 5 1 3 2
9 3 9 2 1 2 7 5 3 4 2 2 5

[Figure: score for haplotype sharing (-log p) plotted along the chromosome, from Pter to Qter.]

7

Sharing drop-off & allelic heterogeneity

[Figure: for markers 1 to 4, the proportions of case haplotypes and of control haplotypes that are cluster 1 haplotypes, cluster 2 haplotypes, or neither cluster 1 nor 2 haplotypes.]

8

Haplo_cluster in action

Example: Sorting on marker 1 for a sample of 3 case and 4 control haplotypes

Cases

2 1 3
2 1 3
2 1 4

Controls

1 1 2
1 2 3
1 3 3
3 1 2

After sorting on the haplotype consisting only of marker 1, calculate a chi-square statistic, and move on.

Haplotype   1   2   3
Controls    3   0   1
Cases       0   3   0

After sorting on the haplotype consisting of marker 1 and marker 2, calculate a chi-square statistic, and ….

Haplotype   1 1   1 2   1 3   2 1   3 1
Controls     1     1     1     0     1
Cases        0     0     0     3     0

Eventually stop, and sum the chi-square statistics. Then repeat for a suitably large number of random permutations of cases and controls.
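The two tables in the example can be checked with a short Python sketch. It assumes a plain Pearson chi-square on the full cases-by-haplotype table at each step; the slides do not say exactly which chi-square Haplo_cluster computes here, so the function name `chi_square` and this choice of test are illustrative assumptions.

```python
from collections import Counter

def chi_square(case_haps, control_haps, n_markers):
    """Pearson chi-square for the 2 x C table of case/control counts,
    with one column per distinct haplotype over the first n_markers markers.
    (Assumed form of the test; not taken from the Haplo_cluster source.)"""
    cases = Counter(tuple(h[:n_markers]) for h in case_haps)
    controls = Counter(tuple(h[:n_markers]) for h in control_haps)
    n_case, n_ctrl = len(case_haps), len(control_haps)
    total = n_case + n_ctrl
    stat = 0.0
    for hap in set(cases) | set(controls):
        col = cases[hap] + controls[hap]  # column total for this haplotype
        for observed, row in ((cases[hap], n_case), (controls[hap], n_ctrl)):
            expected = row * col / total
            stat += (observed - expected) ** 2 / expected
    return stat

# The 3 case and 4 control haplotypes from the example above.
cases = [(2, 1, 3), (2, 1, 3), (2, 1, 4)]
controls = [(1, 1, 2), (1, 2, 3), (1, 3, 3), (3, 1, 2)]

# Sum the per-step statistics over haplotypes of length 1 and 2.
running_sum = sum(chi_square(cases, controls, m) for m in (1, 2))
print(chi_square(cases, controls, 1), chi_square(cases, controls, 2), running_sum)
```

In this toy sample cases and controls share no haplotype at either step, so each table gives the maximum possible value for 7 haplotypes, about 7.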

9

Statistic to evaluate haplotype sharing

The sharing statistic is χ²-based, using the idea of multiple ancestral haplotypes (clusters) which are “grown” starting at each marker examined in the scan.

Significance is evaluated via a permutation test: choose a random permutation of the pooled cases and controls, and recalculate the statistic; repeat ~20,000 times.

A recursive form for the estimator and the SD of the p-value was used, to enable early termination of the program.

$$S_i = \sum_{j=1}^{K} \sum_{k=1}^{K} \chi^2_{i,j,k}$$

where $\chi^2_{i,j,k}$ is the $\chi^2_1$ test for association between the number of case and control haplotypes still sharing the ancestral haplotype of cluster k at marker j, after starting at marker i.
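A minimal Python sketch of one term of this sum may help fix ideas. It makes several simplifying assumptions that are not from the slides: a single cluster, an ancestral haplotype supplied by the caller rather than found by the program, extension in one direction only, and a Pearson 1-df chi-square without continuity correction. The names `chi2_2x2` and `sharing_statistic` are hypothetical.

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def sharing_statistic(cases, controls, ancestral, start, stop):
    """Sum the 1-df chi-squares over markers start..stop-1 for ONE cluster.

    At each marker j the 2 x 2 table cross-classifies case/control status
    against 'still shares the ancestral haplotype from start through j'.
    """
    total = 0.0
    for j in range(start, stop):
        segment = tuple(ancestral[start:j + 1])
        a = sum(tuple(h[start:j + 1]) == segment for h in cases)     # sharing cases
        c = sum(tuple(h[start:j + 1]) == segment for h in controls)  # sharing controls
        total += chi2_2x2(a, len(cases) - a, c, len(controls) - c)
    return total

# Toy data: 3 case and 4 control haplotypes, ancestral haplotype 2 1 3.
cases = [(2, 1, 3), (2, 1, 3), (2, 1, 4)]
controls = [(1, 1, 2), (1, 2, 3), (1, 3, 3), (3, 1, 2)]
s = sharing_statistic(cases, controls, ancestral=(2, 1, 3), start=0, stop=3)
print(round(s, 3))
```

The contribution drops at the third marker, where one case haplotype stops matching the ancestral haplotype; this is the sharing drop-off the statistic is designed to pick up.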

10

The permutation test

The idea is this. We have 170 cases and 105 controls, and at any particular locus, we calculate the value of our statistic, calling it S.

Now pool our cases and controls into 275 individuals, and sample 170 to be “cases” at random from the 275, calling the remainder “controls”. For this first artificial set of cases and controls, calculate the value of our statistic, S1 say.

Next, we repeat this procedure 9,999 more times, say, obtaining values S2 , S3 , S4 … S10,000 . As long as 10,000 is sufficiently many random permutations, we can get a good estimate of the p-value of our initial statistic relative to our empirically estimated null distribution, as p = #{i: Si > S }/10,000.
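The procedure just described can be sketched in a few lines of Python. This is an illustration under assumptions, not the Haplo_cluster code: `statistic` stands for any function of a case list and a control list, the items being permuted are individuals, and the seed argument is added only to make runs repeatable.

```python
import random

def permutation_p_value(statistic, cases, controls, n_perm=10_000, seed=0):
    """Estimate p = #{i : S_i > S} / n_perm by relabelling the pooled sample."""
    rng = random.Random(seed)
    observed = statistic(cases, controls)       # S, from the real labels
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                     # random relabelling
        s_i = statistic(pooled[:n_cases], pooled[n_cases:])
        if s_i > observed:                      # strict inequality, as in the text
            exceed += 1
    return exceed / n_perm

# Toy usage with a difference-in-means statistic on made-up numbers.
def mean_diff(a, b):
    return sum(a) / len(a) - sum(b) / len(b)

p = permutation_p_value(mean_diff, [10, 11, 12], [1, 2, 3], n_perm=2000)
print(p)
```

With the strict inequality an estimate of exactly 0 is possible, as in this toy case where no relabelling can beat the observed split. Counting ties and the observed labelling itself, (#{i : S_i ≥ S} + 1) / (n_perm + 1), is a common alternative that avoids zero p-values.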

11

Exercises

1. How should we decide what number of resamplings is large enough?

2. Explain in the simple case of a 2×2 table of cases and controls cross-classified as diseased and healthy, how using all possible resamplings, rather than a fixed size random sample, leads to the p-value for the exact test.

3. To avoid carrying out an unnecessarily large number of permutations, the proportion of resampled values of our statistic exceeding the value S can be monitored. Can you describe a stopping rule for the random resamplings that should lead to “accurate enough” p-values, without going to the full number each time?

12

Haplo_clusters - Output

-opt 1 Genetic distances used to decide order of markers to sort on
-c 1 The number of clusters of haplotypes to look for = 1
-miss 1 The missing data is replaced randomly using the 2 marker haplotype information.
-share 5 The number of haplotypes needed to share = 5
The standard deviation p values are calculated to 0.01*phat.
Marker names have been provided and will be used in the output files.
# of case haplotypes = 338
# of contol haplotypes = 208
# of markers = 11
# of perms = 100000

Marker     Mapdistance  Chi_Square  p         sd(p)     -log(p)  perms
====================================================
D21S1911   0            5.34        4.44e-01  4.44e-03  0.35     12510
D21S1904   0.85         6.17        3.63e-01  3.63e-03  0.44     17577
D21S1899   10.36        5.89        4.37e-01  4.37e-03  0.36     12876
D21S1922   16.46        2.97        6.83e-01  6.83e-03  0.17     4636
D21S1884   17.26        4.74        4.14e-01  4.14e-03  0.38     14135
D21S1914   20.82        6.49        3.38e-01  3.38e-03  0.47     19571
D21S263    28.97        4.06        5.24e-01  5.24e-03  0.28     9077
D21S1252   39.41        1.18        8.66e-01  8.65e-03  0.06     1553
D21S1919   42.51        1.38        8.51e-01  8.51e-03  0.07     1751
D21S1255   43.81        2.24        7.24e-01  7.24e-03  0.14     3805
D21S266    51.51        3.86        5.70e-01  5.70e-03  0.24     7557
===================================================
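The sd(p) and perms columns are consistent with a binomial standard deviation, sd = sqrt(p̂(1 − p̂)/n), and with stopping as soon as sd ≤ 0.01 × p̂: for D21S1911, sqrt(0.444 × 0.556 / 12510) ≈ 0.00444. The Python sketch below implements that rule directly rather than the recursive form used by the program. The function name, the `draw_exceeds` callback and the `min_perms` guard are assumptions made for illustration.

```python
import math
import random

def sequential_p_value(draw_exceeds, rel_sd=0.01, max_perms=100_000, min_perms=100):
    """Run permutations until sd(p_hat) <= rel_sd * p_hat, or max_perms is reached.

    draw_exceeds() performs one permutation and returns True when the
    permuted statistic exceeds the observed one (hypothetical callback).
    min_perms is an assumed guard against stopping on very few draws.
    """
    exceed = 0
    for n in range(1, max_perms + 1):
        exceed += bool(draw_exceeds())
        if n >= min_perms and exceed > 0:
            p_hat = exceed / n
            sd = math.sqrt(p_hat * (1 - p_hat) / n)
            if sd <= rel_sd * p_hat:
                return p_hat, sd, n
    p_hat = exceed / max_perms
    return p_hat, math.sqrt(p_hat * (1 - p_hat) / max_perms), max_perms

# Simulated permutations whose true exceedance probability is 0.444.
rng = random.Random(1)
p_hat, sd, n_used = sequential_p_value(lambda: rng.random() < 0.444)
print(p_hat, sd, n_used)
```

Under this rule the number of permutations needed is roughly 10,000 × (1 − p)/p, which is why markers with large p-values in the table stop after a few thousand permutations, while a very small p-value would use the full allowance.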

13

Haplo_clusters - Output II

Table of haplotypes

Marker Cluster Haplotype Length(Haplotype)

                  D21S1911 D21S1904 D21S1899 D21S1922 D21S1884 D21S1914 D21S263  D21S125
===================================================
D21S1884     1    -        -        6        3        3        8        11       -
# of haplos:      -        -        5        82       163      22       2        -
Chi-square:       -        -        0.2      0.0      3.2      0.1      1.2      -

D21S1914     1    -        -        7        3        3        5        2        -
# of haplos:      -        -        4        16       34       58       10       -
Chi-square:       -        -        2.5      0.0      0.5      2.1      1.3      -

D21S263      1    -        -        -        4        3        10       2        5
# of haplos:      -        -        -        3        5        24       138      9
Chi-square:       -        -        -        1.9      1.2      0.7      0.0      0.3

D21S1252     1    -        -        -        -        -        -        4        5
# of haplos:      -        -        -        -        -        -        6        83
Chi-square:       -        -        -        -        -        -        0.6      0.1

D21S1919     1    -        -        -        -        -        -        2        2
# of haplos:      -        -        -        -        -        -        3        7
Chi-square:       -        -        -        -        -        -        0.3      0.0

Etc etc etc

===================================================
Time taken (m) = 55, 23/6/2003, 11:15:12
Haplo_cluster.pl$Revision:1.15$

14

Output for Chromosome 6

HLA Region – p-value <0.00001. Peak contains D6S105, MOGCA,

15

Empirical distributions of statistic, chr 6

Off scale

16

Comparison of Two Positive Controls against Two Negative Controls

Cases versus Controls

Cases versus Untransmitteds

Untransmitted versus Controls

Controls versus Untransmitted

17

Uniform qq-plots and multiple testing

When we carry out ~800 tests, as we have here, we expect to see many quite small p-values under the combined null hypothesis of no case-control haplotype differences anywhere, specifically, about 40 smaller than the usual 5% cutoff. In practice, we believe that at most a few of these 800 nulls will be false. How do we adjust our p-values for this multiplicity of tests?

One fairly severe way is known as the Bonferroni adjustment: multiply all our p-values by 800. Another approach is this: rather than compare our 800 p-values to the single-test 5% cutoff, we compare them all to the value which the smallest of 800 i.i.d. uniforms will exceed 95% of the time.

Exercises

1. Prove that the Bonferroni procedure is conservative, in that the family-wise type 1 error (the chance of one or more type 1 errors), under the assumption that all the null hypotheses are true, is ≤ 5%.

2. Calculate the 5th percentile of the smallest of 800 i.i.d. uniforms. How close is it to the Bonferroni 5th percentile?
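As a numerical check on the two cutoffs, note that the smallest of m i.i.d. uniforms is at most c with probability 1 − (1 − c)^m, so the cutoff with family-wise level α is c = 1 − (1 − α)^(1/m). A small Python sketch follows; the function name `min_uniform_cutoff` is an illustrative choice.

```python
import math

def min_uniform_cutoff(m, alpha=0.05):
    """alpha-quantile of the minimum of m i.i.d. Uniform(0, 1) variables,
    i.e. 1 - (1 - alpha)**(1/m), computed stably for tiny values."""
    return -math.expm1(math.log1p(-alpha) / m)

m, alpha = 800, 0.05
bonferroni = alpha / m                     # per-test Bonferroni cutoff
min_cutoff = min_uniform_cutoff(m, alpha)  # cutoff from the smallest uniform
print(bonferroni, min_cutoff, min_cutoff / bonferroni)
```

The two cutoffs differ by under 3%, with Bonferroni the slightly smaller, hence slightly more conservative, of the two.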

18

Uniform qq-plots and multiple testing, cont.

The procedure just described is still conservative, for two reasons. Firstly, the p-values are not independent, though they should be identically distributed under the null hypothesis. There are ways to incorporate this into our analysis, the most direct being to estimate the joint resampling distribution of the test statistics for every marker. This can be computationally prohibitive, especially if we also want to address the next point, which is:

Only the smallest of the p-values should be compared to the smallest of an i.i.d. or suitably dependent sequence of 800 p-values. The second smallest p-value should more correctly be compared to something slightly different, and so on. This leads us to the notion of step-wise multiple testing procedures. Resampling-based stepwise multiple testing corrections can be very computationally intensive.

In our present case we did no more than create a uniform qq-plot, and look at the number of loci “off the line” at the low end. Why? In part for computational reasons; in part, because we plan to follow up “promising” regions even if they do not have small adjusted p-values.
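A uniform qq-plot of this kind can be built from the sorted p-values alone. The Python sketch below returns the points to plot, under assumptions the slides do not confirm: expected quantiles i/(n + 1), a −log10 scale on both axes so that small p-values stand out, and a floor for p-values estimated as zero. The function name and the `floor` parameter are hypothetical.

```python
import math

def uniform_qq_points(p_values, floor=1e-6):
    """Return (expected, observed) pairs on the -log10 scale.

    The i-th smallest observed p-value is paired with i / (n + 1), the
    expected i-th smallest of n i.i.d. uniforms. p-values below `floor`
    (e.g. zeros from a permutation test) are clipped before taking logs.
    """
    n = len(p_values)
    observed = sorted(max(p, floor) for p in p_values)
    expected = [(i + 1) / (n + 1) for i in range(n)]
    return [(-math.log10(e), -math.log10(o)) for e, o in zip(expected, observed)]

# Toy usage: p-values lying exactly on the expected quantiles fall on the line y = x.
points = uniform_qq_points([0.5, 0.25, 0.75])
print(points)
```

Loci “off the line” at the low end are those whose observed coordinate is well above the expected one in the right-hand tail of this plot.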

19

Distribution of p-values: uniform qq plots

[Figure: two uniform qq-plots of the p-values, with axes Expected and Observed.]

20

Reproducibility: same datasets, different random number seeds

21

Similar method/problem

Similar method

Haplotype Pattern Mining (Toivonen et al, 2000). Ingileif Hallgrimsdottir (Statistics, UCB) modified and extended this method, and her (blindly derived) results were very similar to those obtained using Haplo_cluster on the MS data.

Similar problem

A study of bipolar disorder in the Central Valley of Costa Rica (Service et al, 2001; Ophoff et al, 2002) involves an admixed population of Amerindian and Spanish people, with few founders and little immigration, ~300 years old. They used likelihood methods on 3-locus haplotypes, but didn’t use controls.

22

What next for the MS study (apart from more analysis)?

• A close study of the MHC (HLA) region was conducted and published

• Relatedness of cases and controls was studied more carefully, and a few “too close” relatives identified and removed, but leaving the main conclusions unchanged

• Fine mapping around peaks was carried out: some peaks were strengthened, others disappeared. Further fine mapping under way.

• The Tasmanian cases are being joined by ethnically similar cases from the mainland, and genotyping of these new individuals in candidate regions is under way

• International collaboration is also under way

• We want to find genes and amino acid changes, if at all possible

23

Fine mapping: two regions

[Figure: fine mapping plots in four panels (A #3, B #3, A #4, B #4), with -log10 P on the vertical axis. Series: Case vs. Control; Case vs. Case UT; Control vs. Case UT; Case UT vs. Control.]

24

Relatedness in cases and controls

We assume that our cases and controls are mostly representative of the “Tasmanian population”.

If they are too closely related (within cases or controls) we might expect bias in our sharing statistic.

If they are not closely enough related (within cases or controls) we might expect trouble detecting a signal.

25

This pedigree is similar to the type of pedigree found in Tasmania. The “affected” individuals are represented by the filled in symbols.

26

Determining the relatedness of Tasmanians based on GWS data

• Determine the level of relatedness of all pairs based on the genome wide scan data (another HMM analysis)

• We found several pairs which were much more closely related than the 10-12 meioses (6-8 generations) expected

– 10 pairs in the case data

– 6 pairs in the control data

– 2 pairs in the case and control data

• Some of these relationships were subsequently verified with further genealogical research

• We re-did the analyses without these people

27

Does having closely related cases or controls make a difference?

Cases versus Controls

Cases versus Controls (relateds removed)

Cases versus Untransmitteds

Cases versus Untransmitteds (relateds removed)

28

HLA Region & MS

• MS is believed to be an autoimmune disease (similar to type I diabetes)

• HLA association with MS previously identified

• One or more genes?

Log linear modelling with partial haplotypes suggests that two regions were responsible and that these did not interact

29

The HLA complex

Klein J. et al New Eng J Med, 2000; 343:702-709

An extremely gene-rich region.

30

[Figure: -log10 (P-value) for markers across the HLA complex, with two series (1000 perms and 5000 perms); an 850-kb segment is marked.]

Microsatellite markers that spanned the HLA complex generated a peak of association in an 850-Kb segment of the class I region

We have implicated an 850-Kb class I region in MS

31

32

Genetic dissection of the HLA region by haplotype analysis

• The TNF locus + 15 other class III genes have no influence on disease - association due to strong LD with DR15

• The HLA region encodes at least two independent susceptibility loci for MS

[Figure: map of the HLA region with loci MOG, F, G, A, E, C, B, TNF, DRB1, DQB1 and DPB1, the DRB1*1501-DQB1*0602 haplotype, and tick (√) and cross (X) marks over the regions.]

(Rubio et al. 2002 AJHG)

33

An extended haplotype across HLA confers increased risk to MS

[Figure: map across HLA regions I, III and II (spans of 3.6 Mb and 5.1 Mb, ~1 cM) with markers D6S299, D6S105, D6S464, D6S2223, MOGCA, D6S2655, HLA-F, D6S510, DQB1, DRB1 and D6S291. Ancestral haplotype: 3 6 5 3 3 5 1 *1501 *0602 (DR15). Annotations: RR=4.3 and RR=5.7.]

34

Acknowledgments

MCPHR, Hobart: Ingrid van der Mei, Trish Groom, Kristen Hazelwood, Jane Pittaway, Rhonda McCoy, Lyn Hall, Tracy Lowe, Natasha Newton, Emma Stubbs, Michele Sale, Maree Ring, Annette Banks, Joan Clough, Tim Albion, Jo Dickinson, Shelly Brown, Sue Sawbridge, Deirbhile O’Byrne, Bruce Taylor, Stan Sjeica, Andrew Hughes, Bozidar Drulovic, Terry Dwyer

WEHI: Justin Rubio, Laura Johnson, Rachel Burfoot, Stewart Huxtable, Simon Foote

ANRMSF, Canberra: Rex Simmons

MCRI

Funding: The Genes-CRC; The National Multiple Sclerosis Society (USA); Department of Neurosciences RMH; MS Australia; NH&MRC (Australia)

VTIS

The Tasmanian and Victorian public

The MS Societies of Victoria and Tasmania

Brian Tait, Mike Varney

Bob Williamson

The AGRF (Melbourne)

RMH: Niall Tubridy, Jo Baker, John Cary

Trevor Kilpatrick, Helmut Butzkueven, Mark Marriot

Melanie Bahlo, Jim Stankovich, Chris Wilkinson
