Data Standards and Statistical Issues for Immunogenetic Data
Richard M. Single
Associate Professor of Statistics
Department of Mathematics & Statistics
University of Vermont
• HLA nomenclature: Why it matters for analysis and interpretation
– Challenges for combining HLA data from different sources
• Data standardization to facilitate meta-analyses and reproducibility
– Developing a community standard for HLA & KIR data reporting
• Overview of HLA data curation & ambiguity resolution
– Example, ImmPort, Next steps: GL strings & QR codes
• HLA (chrom 6) and KIR (chrom 19) interactions
– A brief overview
• HLA and KIR: population-level evidence of co-evolution
– Population-genetic evidence of co-evolution
– Randomization tests and genomic controls
Outline
HLA Nomenclature and why it matters
• Challenges for HLA data management and analysis
– The HLA genes are very polymorphic;
– HLA nomenclature is complicated;
– There are multiple ways to generate HLA data;
– All common typing systems generate ambiguous data;
– There are multiple ways to report alleles and ambiguities.
These issues make meta-analyses of HLA data from different sources very difficult.
[Figure: HLA class I and HLA class II molecules, each presenting a peptide fragment to a T-cell receptor. TCR = T-cell receptor; β2-m = β2-microglobulin.]
Structure of HLA molecules
• HLA molecules are cell-surface proteins that present peptide fragments to T-cells
• They bind specific sets of peptides based on structure
[Figure: HLA-C binding pocket, with positions 7, 73, 77, 80, and 90 marked. Ribbon drawing from Hedrick et al. PNAS, 88, 5897-5901.]
HLA classical loci and polymorphism
[Figure: map of the classical HLA loci. Class II loci: DP (B1, A1), DQ (B1, A1), DR (B1, A). Class I loci: B, C, A. Distances marked on the map: 50 kb, 850 kb, 100 kb, 1270 kb, 400 kb, 250 kb. Protein-level allele numbers are shown for each locus, from IMGT/HLA Database Release 3.12.0, April 17, 2013.]
HLA-A*24:02:01:02L
• Locus: HLA-A
• Field 1 (2-digit), here 24: serological level (where possible)
• Field 2 (4-digit), here 02: peptide level (amino acid difference)
• Field 3 (6-digit), here 01: nucleotide level, silent (synonymous substitutions)
• Field 4 (8-digit), here 02: intron level (3’ or 5’ polymorphism)
• Expression suffix, here L: N = null, L = low, S = soluble, …
• For most analyses, we want to distinguish among unique peptide sequences, i.e., 2 fields (“4-digit”) level
• This level of resolution treats alleles with the same peptide sequence for exons 2 & 3 (class I) or exon 2 (class II) as being equivalent [“binning” alleles]
HLA Allele Nomenclature
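As a minimal sketch of the field structure above, the following Python splits a colon-delimited allele name into locus, fields, and expression suffix, and reduces it to the 2-field ("4-digit") level. The helper names `parse_allele` and `to_two_fields` are hypothetical, and dropping the expression suffix when reducing is an assumption of this sketch, not a rule stated on the slides.

```python
import re
from typing import NamedTuple, Optional, Tuple


class HlaAllele(NamedTuple):
    locus: str                 # e.g. "A" or "DRB1"
    fields: Tuple[str, ...]    # one to four colon-delimited fields
    suffix: Optional[str]      # expression suffix such as "N", "L", "S", or None


# Optional "HLA-" prefix, locus, "*", 1-4 numeric fields, optional suffix letter.
_ALLELE_RE = re.compile(
    r"^(?:HLA-)?(?P<locus>[A-Z0-9]+)\*"
    r"(?P<fields>\d+(?::\d+){0,3})"
    r"(?P<suffix>[A-Z])?$"
)


def parse_allele(name: str) -> HlaAllele:
    """Split an allele name such as 'HLA-A*24:02:01:02L' into its parts."""
    m = _ALLELE_RE.match(name.strip())
    if m is None:
        raise ValueError(f"not a colon-delimited HLA allele name: {name!r}")
    return HlaAllele(m["locus"], tuple(m["fields"].split(":")), m["suffix"])


def to_two_fields(name: str) -> str:
    """Reduce an allele name to the 2-field (peptide) level.

    Assumption of this sketch: the expression suffix is dropped.
    """
    allele = parse_allele(name)
    return f"{allele.locus}*{':'.join(allele.fields[:2])}"


example = parse_allele("HLA-A*24:02:01:02L")
reduced = to_two_fields("DRB1*14:01:01")
```

Note that this only truncates names; it does not perform the exon-based binning described above, which needs sequence information.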
• HLA alleles are defined by a “patchwork” of sequence-level polymorphisms.
• Most typing systems do not interrogate the same set of polymorphisms
- e.g., DRB1*14:01:01 vs. *14:54 differ only in exon 3
• There is currently no simple way to identify which alleles could (or could not) have been detected by a given typing system.
HLA Nomenclature & Polymorphism
Distinctive Geographical Distribution of subtypes of HLA-DRB1*08
Data Standardization to facilitate Meta-analyses
Data standardization methods …
• Document the typing method (SSOP, SSP, SBT, …), version, exons interrogated, and the set of detectable alleles.
• Perform data validation by checking against IMGT/HLA & IPD-KIR allele lists.
These steps:
– allow re-evaluation of raw data in future contexts
– allow information/results to be combined across datasets more easily
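The validation step can be sketched as a simple membership check. The reference set below is a small hypothetical excerpt written inline for illustration; in practice it would be loaded from the allele list of a specific IMGT/HLA or IPD-KIR release, whose file format is not assumed here.

```python
def validate_alleles(observed, reference):
    """Split observed allele names into (known, unknown) against a reference list."""
    ref = set(reference)
    known = [a for a in observed if a in ref]
    unknown = [a for a in observed if a not in ref]
    return known, unknown


# Hypothetical excerpt of a reference allele list (illustration only).
reference_alleles = {"A*02:01", "A*02:09", "B*27:05", "B*44:02"}

# "B*27:5" is a deliberately malformed name that should be flagged.
known, unknown = validate_alleles(["A*02:01", "B*27:5", "B*44:02"], reference_alleles)
```

Recording which release the reference list came from, alongside the flagged names, is what makes later re-evaluation possible.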
Extending STREGA to Immunogenomic Studies
• The STrengthening the REporting of Genetic Association studies (STREGA) statement provides community-based data reporting and analysis standards for genomic disease association studies
• The IDAWG (immunogenomics.org) has proposed an extension of STREGA: STrengthening the REporting of Immunogenomic Studies (STREIS)
From STREGA to STREIS
Extensions to the STREGA guidelines for immunogenomic data include:
• Describing the system(s) used to store, manage, and validate genotype and allele data
• Documenting all methods applied to resolve ambiguity
• Defining any codes used to represent ambiguities
• Describing any binning or combining of alleles into common categories
• Avoiding the use of subjective terms (e.g. high-resolution typing) that may change over time
Allele-level Ambiguity
Group codes (“g”-codes) for alleles identical in exons 2 & 3 for class I, or exon 2 for class II.
A*0201/ 0209/ 0243N/ 0266/ 0275/ 0283N/ 0289 = “A020101g”
NMDP ambiguity codes for 4-digit non-null alleles
A*0201/0209 = A*02AF
A*0201/0209/0266 = A*02AJEY
A*0201/0209/0266/0275/0289 = A*02BSFJ
Ambiguous allele sets: A*0201/ 0209/ 0243N/ 0266/ 0275/ 0283N/ 0289
Ambiguous alleles result from polymorphisms outside of assessed regions:
• outside of exons 2 & 3, or
• in sections of those exons that were not interrogated.
Genotype-level Ambiguity
Ambiguous genotypes result from an inability to establish the phase of individual polymorphisms or entire exons.
Different combinations of alleles can lead to the same typing result.
Example: A typing result for one individual that could be explained by any of four different possible genotype sets at HLA-B.
Genotype 1: 2705 4402
Genotype 2: 2705 4411
Genotype 3: 2709 4402
Genotype 4: 2709 4411
B*2705 + B*4402 or B*2705 + B*4411 or B*2709 + B*4402 or B*2709 + B*4411
Most analytical methods require a single genotype call for each individual sample.
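To illustrate why a single call is not available without further steps, the sketch below enumerates the genotypes consistent with the HLA-B example. It assumes the ambiguity is a full cross of the two allele lists, which holds for this example but not for every typing result; `possible_genotypes` is a hypothetical helper.

```python
from itertools import product


def possible_genotypes(allele1_options, allele2_options):
    """List the distinct unordered genotypes formed by crossing two allele lists."""
    seen = set()
    genotypes = []
    for a, b in product(allele1_options, allele2_options):
        genotype = tuple(sorted((a, b)))  # a genotype is unordered
        if genotype not in seen:
            seen.add(genotype)
            genotypes.append(genotype)
    return genotypes


genotypes = possible_genotypes(["B*2705", "B*2709"], ["B*4402", "B*4411"])
for g in genotypes:
    print(" + ".join(g))
```

Any downstream analysis has to either pick one of these four genotypes by a documented rule or carry all of them forward.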
Standardized Ambiguity Reduction
Sample #001, possible genotypes (HLA-B allele 1 + HLA-B allele 2):
Genotype 1: 2703, 270502, 270503, 270504, 270505, 270506, 270508, 2710, 2713, 2717 + 44020101, 44020102S, 440203, 4419N, 4423N, 4424, 4427, 4433
Genotype 2: 2703, 270502, 270503, 270504, 270505, 270506, 270508, 2710, 2713, 2717 + 440202, 4411
Genotype 3: 2709 + 44020101, 44020102S, 440203, 4419N, 4423N, 4424, 4427, 4433
Genotype 4: 2709 + 440202, 4411
Step 1: Peptide-level filtering, removing non-CWD (common and well-documented) alleles, and binning alleles identical over exons 2 & 3 leaves: 2703, 2705 + 4402
Step 2: Regional population-level frequency data eliminate 2703, leaving: 2705 + 4402
Result: unambiguous data
immunogenomics.org
Genotype List (GL) Strings
• Use a hierarchical set of operators to describe the relationships between alleles, lists of possible alleles, phased alleles, genotypes, lists of possible genotypes, and multilocus unphased genotypes, without losing typing information or increasing typing ambiguity.
• Are proposed to replace NMDP codes
Milius et al. (2013) Tissue Antigens
Genotype List (GL) Strings
• Example GL string for the genotype:
A*02:69 + A*23:30 or A*02:302 + A*23:26 or A*02:302 + A*23:39
and
B*44:02 + B*49:08
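The GL string itself is not reproduced in the text, so the sketch below writes one out for this genotype. It assumes the delimiter set described by Milius et al. (2013): `^` between loci, `|` between possible genotypes, `+` between the two gene copies, `~` between phased alleles, and `/` between possible alleles. `parse_gl_string` is a hypothetical helper, not part of any published toolkit.

```python
def parse_gl_string(gl):
    """Parse a GL string into nested lists.

    Nesting order (outermost first): loci, possible genotypes, gene copies,
    phased alleles, possible alleles.
    """
    return [
        [
            [
                [phased.split("/") for phased in copy.split("~")]
                for copy in genotype.split("+")
            ]
            for genotype in locus.split("|")
        ]
        for locus in gl.split("^")
    ]


# The example genotype above, written with the assumed delimiters.
gl = (
    "HLA-A*02:69+HLA-A*23:30|HLA-A*02:302+HLA-A*23:26|HLA-A*02:302+HLA-A*23:39"
    "^HLA-B*44:02+HLA-B*49:08"
)
parsed = parse_gl_string(gl)
n_loci = len(parsed)
n_genotypes_at_a = len(parsed[0])
```

Splitting from the lowest-precedence delimiter outward is what keeps the three possible HLA-A genotypes separate from the single HLA-B genotype.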
• Immunology Database and Analysis Portal (www.ImmPort.org)
Developed under the Bioinformatics Integration Support Contract (BISC) for NIH, NIAID, & DAIT (Division of Allergy, Immunology, and Transplantation)
– Data validation pipeline
– Analysis tools
– Standardized ambiguity reduction tools
– Data from a large number of immunogenomic studies
• ImmunoGenomics Data Analysis Working Group (www.immunogenomics.org) (www.IgDAWG.org)
An international collaborative group working to …
– facilitate the sharing of immunogenomic data (HLA, KIR, etc.) and
– foster consistent analysis and interpretation of immunogenomic data
Resources for HLA Data Validation & Analysis
• The KIR gene complex is located on Chromosome 19 (19q13.4)
• KIR are expressed on natural killer (NK) cells and a subset of T cells
• Certain HLA alleles serve as ligands for KIR
KIR Gene    Function     Ligand
2DL1        Inhibitory   HLA-C group 2
2DS1        Activating   HLA-C group 2
2DL2/3      Inhibitory   HLA-C group 1
2DS2        Activating   HLA-C group 1
3DL1        Inhibitory   HLA-Bw4
3DS1        Activating   HLA-Bw4
Killer cell Immunoglobulin-like Receptor (KIR)
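[Figure: two panels. Dominant inhibition: an NK cell meets a normal cell; labels iKIR, HLA, activating receptor (Act. rec.), and ligand; outcome: no lysis (protection). Missing-self recognition: an NK cell meets HIV+ targets; labels iKIR, Act. rec., and ligand; outcome: lysis and cytokines.]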
KIR regulate NK cell activity
HLA-C alleles can be divided into two groups based on the amino acid at position 80 (& 77),
which determines KIR recognition
• HLA-C1 (Ser77, Asn80): Cw1, Cw3, Cw7, Cw8, Cw12, Cw13, Cw14; recognized by KIR2DL3/2DL2 (NK cell inhibition)
• HLA-C2 (Asn77, Lys80): Cw2, Cw4, Cw5, Cw6, Cw15, Cw17; recognized by KIR2DL1 (NK cell inhibition)
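A minimal sketch of the grouping above as a lookup, using only the antigen lists on this slide and the KIR/ligand table given earlier. The function names are hypothetical, and antigens not listed on the slide are returned as unclassified rather than guessed.

```python
# Antigen lists as given on the slide.
C1_ANTIGENS = {"Cw1", "Cw3", "Cw7", "Cw8", "Cw12", "Cw13", "Cw14"}
C2_ANTIGENS = {"Cw2", "Cw4", "Cw5", "Cw6", "Cw15", "Cw17"}

# KIR genes by ligand group, from the KIR gene / function / ligand table.
KIR_BY_LIGAND = {
    "C1": {"inhibitory": ["KIR2DL2", "KIR2DL3"], "activating": ["KIR2DS2"]},
    "C2": {"inhibitory": ["KIR2DL1"], "activating": ["KIR2DS1"]},
}


def hla_c_group(antigen):
    """Return 'C1', 'C2', or None for an antigen not listed on the slide."""
    if antigen in C1_ANTIGENS:
        return "C1"
    if antigen in C2_ANTIGENS:
        return "C2"
    return None


def kir_for_antigen(antigen):
    """Return the KIR genes whose ligand group contains this HLA-C antigen."""
    group = hla_c_group(antigen)
    return KIR_BY_LIGAND.get(group, {"inhibitory": [], "activating": []})


group_cw4 = hla_c_group("Cw4")
kir_cw7 = kir_for_antigen("Cw7")
```

Summing allele frequencies within each group gives the population-level ligand frequencies used in the correlation analyses later in the talk.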
Bifurcation of HLA-B allotypes
• Bw4 (40%): KIR3DL1 ligands (KIR3DS1 is also labelled in the figure); subdivided by position 80 into 80I and 80T
• Bw6 (60%): not a ligand for KIR
KIR & HLA in 30 Global Populations
• Several studies hypothesized selection for KIR that suit the locale-specific HLA repertoire.
• Disease association studies point to HLA-Bw4 alleles with Isoleucine at position 80 (“Bw4-80I”) as the strongest ligand for KIR3DS1
Population-level evidence for Co-evolution & Natural Selection for KIR and HLA
Correlations between frequencies for KIR and HLA Ligands
Inhibitory KIR
• KIR2DL3 vs. HLA-C group 1: r = 0.184
• KIR3DL1 vs. HLA-Bw4: r = 0.426
• KIR2DL1 vs. HLA-C group 2: r = 0.046
Activating KIR
• KIR3DS1 vs. HLA-Bw4: r = -0.632
• KIR2DS1 vs. HLA-C group 2: r = -0.478
• KIR2DS2 vs. HLA-C group 1: r = -0.371
Activating KIR3DS1: subsets of Bw4 alleles based on amino acid position 80
• KIR3DS1 vs. HLA-Bw4: r = -0.632
• KIR3DS1 vs. HLA-Bw4-80I: r = -0.657
• KIR3DS1 vs. HLA-Bw4-80T: r = -0.190
Single et al., Nature Genetics
• Challenges for these and other population studies
– Demographic history shapes patterns of variation & can mimic the effects of selection.
– Gene frequencies are not statistically independent among populations, due to shared demographic history.
• Ordinary Pearson correlation p-values assume independence among the observations.
• We constructed a randomization test to account for the demographic histories of the populations and focus on the genetic effect.
Statistical Issues
Assessing the significance of ρ = cor(X,Y)
• Null Hypothesis: H0: ρ = 0
• Statistic: Pearson’s correlation coefficient
Hypothesis Test for a Correlation Coefficient
\[ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}} \]
X     Y
4.1   4.9
8.6   5.4
2.3   4.2
5.4   7.4
9.2   8.8
7.7   6.7
6.4   8.8
4.3   5.1
7.6   9.4
3.4   5.3
r_observed = .674
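As a check on the worked example, the following sketch computes Pearson's r for the ten (X, Y) pairs on this slide, using only the standard library. `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# The ten (X, Y) pairs from the slide.
X = [4.1, 8.6, 2.3, 5.4, 9.2, 7.7, 6.4, 4.3, 7.6, 3.4]
Y = [4.9, 5.4, 4.2, 7.4, 8.8, 6.7, 8.8, 5.1, 9.4, 5.3]

r_observed = pearson_r(X, Y)
print(round(r_observed, 3))  # 0.674, matching the value on the slide
```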
Randomization Test
Population  HLA-B (1)  HLA-B (2)  B-grp (1)  B-grp (2)  HLA-C (1)  HLA-C (2)  C-grp (1)  C-grp (2)
Biaka       0702       1503       Bw6        Bw6        0202       0702       C2         C1
Biaka       0702       4403       Bw6        Bw4        0401       0702       C2         C1
Biaka       1302       3701       Bw4        Bw4        0202       0602       C2         C2
Biaka       4901       5301       Bw4        Bw4        0401       0701       C2         C1
Biaka       3701       3910       Bw4        Bw6        0202       1203       C2         C1
…           …          …          …          …          …          …          …          …
• Bw4 alleles: 1301, 1302, 1516, 1517, 2702, 2703, 2704, 2705, 3701, 3801, 3802, 4402, 4403, 4404, 4405, ...
• Bw6 alleles: 0702, 0705, 0799, 0801, 1401, 1402, 1403, 1501, 1502, 1503, 1504, 1506, 1507, 1508, 1510, ...
• Reassign Bw4/Bw6 status to simulate the null hypothesis
• Compute correlation of frequencies for KIR-3DS1 & reassigned HLA
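One way to read the two steps above is as a label-permutation test: keep every population's allele frequencies fixed, reassign which alleles count as Bw4, and recompute the correlation each time. The sketch below follows that reading. All frequencies are made up for illustration, the reassignment is done per allele (the slides do not say whether it was per allele or per individual), and the two-sided p-value with the +1 correction is an assumption of this sketch.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def group_frequency(allele_freqs, group):
    """Summed frequency of the alleles that belong to `group` in one population."""
    return sum(f for allele, f in allele_freqs.items() if allele in group)


def label_permutation_test(pop_allele_freqs, kir_freqs, group, n_perm=2000, seed=0):
    """Return (observed r, two-sided permutation p-value)."""
    rng = random.Random(seed)
    alleles = sorted({a for freqs in pop_allele_freqs for a in freqs})
    k = len(set(group) & set(alleles))  # number of alleles carrying the label
    observed = pearson_r(kir_freqs, [group_frequency(f, group) for f in pop_allele_freqs])
    hits = 0
    for _ in range(n_perm):
        relabelled = set(rng.sample(alleles, k))  # random reassignment of the label
        r = pearson_r(kir_freqs, [group_frequency(f, relabelled) for f in pop_allele_freqs])
        if abs(r) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical frequencies for six populations (illustration only).
populations = [
    {"0702": 0.20, "0801": 0.10, "1501": 0.08, "1503": 0.02, "1302": 0.05, "2705": 0.04, "4402": 0.10, "4403": 0.06},
    {"0702": 0.12, "0801": 0.05, "1501": 0.10, "1503": 0.01, "1302": 0.09, "2705": 0.06, "4402": 0.12, "4403": 0.10},
    {"0702": 0.05, "0801": 0.02, "1501": 0.04, "1503": 0.15, "1302": 0.03, "2705": 0.01, "4402": 0.02, "4403": 0.08},
    {"0702": 0.15, "0801": 0.12, "1501": 0.06, "1503": 0.00, "1302": 0.02, "2705": 0.05, "4402": 0.14, "4403": 0.04},
    {"0702": 0.08, "0801": 0.03, "1501": 0.12, "1503": 0.05, "1302": 0.11, "2705": 0.02, "4402": 0.05, "4403": 0.12},
    {"0702": 0.10, "0801": 0.07, "1501": 0.09, "1503": 0.03, "1302": 0.06, "2705": 0.03, "4402": 0.08, "4403": 0.09},
]
kir3ds1 = [0.30, 0.18, 0.35, 0.25, 0.15, 0.22]  # hypothetical KIR3DS1 frequencies
bw4 = {"1302", "2705", "4402", "4403"}          # Bw4 alleles taken from the slide's list

r_obs, p_perm = label_permutation_test(populations, kir3ds1, bw4)
```

Because the population frequencies themselves are never shuffled, the shared demographic structure among populations is preserved in every permuted data set.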
Permutation Distribution
[Figure: density of the permutation distribution of the correlation, with the observed KIR3DS1 – HLA-Bw4 correlation (r = -0.632) marked. Permutation p-value = 0.012.]
• Empirical comparisons based on genomic data or other methods that incorporate information about the demographic histories of populations (Pritchard and Donnelly, 2001).
– Our study used data from the ALFRED database to assess statistical significance http://alfred.med.yale.edu
– We selected 538 neutral sites from 202 genes typed in the same individuals
Genomic Controls
Genomic Data
• Randomly select two SNP sites from different chromosomes
• Find the frequencies in each population and compute the correlation
• Repeat
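The three steps above can be sketched as follows. The SNP frequencies are randomly generated stand-ins, not ALFRED data, so they carry no shared demographic history and the resulting p-value is not expected to match the one reported in the talk; the function names and the two-sided comparison are assumptions of this sketch.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def empirical_null(snp_freqs, snp_chrom, n_pairs=5000, seed=0):
    """Correlations for random pairs of SNPs located on different chromosomes.

    snp_freqs: {snp_id: [frequency in population 1, population 2, ...]}
    snp_chrom: {snp_id: chromosome}
    Assumes the SNPs span at least two chromosomes.
    """
    rng = random.Random(seed)
    ids = sorted(snp_freqs)
    correlations = []
    while len(correlations) < n_pairs:
        a, b = rng.sample(ids, 2)
        if snp_chrom[a] == snp_chrom[b]:
            continue  # keep only unlinked pairs
        correlations.append(pearson_r(snp_freqs[a], snp_freqs[b]))
    return correlations


def empirical_p(observed_r, null_correlations):
    """Share of null correlations at least as extreme as the observed one."""
    extreme = sum(abs(r) >= abs(observed_r) for r in null_correlations)
    return extreme / len(null_correlations)


# Synthetic stand-in data: 40 SNPs on 4 chromosomes, typed in 30 populations.
data_rng = random.Random(1)
snp_chrom = {f"snp{i}": 1 + i % 4 for i in range(40)}
snp_freqs = {snp: [data_rng.random() for _ in range(30)] for snp in snp_chrom}

null_rs = empirical_null(snp_freqs, snp_chrom, n_pairs=500)
p_emp = empirical_p(-0.632, null_rs)  # -0.632 is the observed KIR3DS1 - HLA-Bw4 correlation
```

With real data typed in the same individuals, the null distribution is wider than under independence, which is the point of the genomic control.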
Genomic Data for Empirical Tests
[Figure: scatter plot of population frequencies at SNP site 1 (x-axis) against SNP site 2 (y-axis).]
Genomic Data – Empirical Distribution
[Figure: density of the empirical distribution for correlations among unlinked SNPs, with the observed KIR3DS1 – HLA-Bw4 correlation (r = -0.632) marked. Empirical p-value = 0.041.]
Significance of Correlations *
locus pair      Correlation   p-value (1)     p-value (2)     p-value (3)
                              (correlation)   (permutation)   (empirical)
3DS1 - Bw4      -0.632        0.000           0.012           0.041
3DS1 - Bw480I   -0.657        0.000           0.009           0.038
3DS1 - Bw480T   -0.190        0.316           0.532           0.534
3DL1 - Bw4       0.426        0.019           0.106           0.218
3DL1 - Bw480I    0.416        0.022           0.115           0.191
3DL1 - Bw480T    0.171        0.367           0.540           0.758
2DS1 - C2       -0.478        0.008           0.243           0.149
2DL1 - C2        0.046        0.810           0.891           0.924
2DL2 - C1       -0.366        0.047           0.193           0.542
2DL3 - C1        0.184        0.331           0.458           0.328
2DS2 - C1       -0.371        0.044           0.170           0.479
(1) P-correlation is the ordinary Pearson product-moment correlation p-value.
(2) P-permutation is based on the permutation distribution under the null hypothesis.
(3) P-empirical is based on the empirical distribution for unlinked SNPs from ALFRED.
* Ordinary Pearson p-values (in red on the original slide) overestimate the significance of trends.
Acknowledgements
NCI: Mary Carrington, Pat Martin, Gao Xiaojiang
USP: Diogo Meyer, Rodrigo dos Santos Francisco
Yale University: Ken and Judy Kidd
Children's Hospital Oakland Research Inst.: Steven J. Mack, Jill A. Hollenbach
Harvard Medical School: Alex Lancaster
UC San Francisco: Owen Solberg
Roche Molecular Systems: Henry A. Erlich
Anthony Nolan Research Inst.: Steven G.E. Marsh
NCBI/NIH: Mike Feolo
NGIT: Jeff Wiser, Patrick Dunn, Tom Smith
If time allows …
The two most common measures of the strength of LD are:
(1) the normalized measure of the individual LD values, namely D'_ij = D_ij / D_max (Lewontin 1964); and
(2) the correlation coefficient r for bi-allelic data, which is most often reported as r^2 = D^2 / (p_A1 p_A2 p_B1 p_B2).
r = 1 only when the allelic variations at the two loci show 100% correlation.
Their multi-allelic extensions are:
\[ D' = \sum_{i=1}^{I} \sum_{j=1}^{J} p_i \, q_j \, |D'_{ij}| \]
\[ W_n = \left[ \frac{\sum_{i=1}^{I} \sum_{j=1}^{J} D_{ij}^2 / (p_i q_j)}{\min(I-1,\, J-1)} \right]^{1/2} = \left[ \frac{X^2_{LD} / 2N}{\min(I-1,\, J-1)} \right]^{1/2} \]
Linkage Disequilibrium (LD) Measures
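A minimal sketch of the multi-allelic D' and Wn, computed from a table of haplotype frequencies. The input layout (`f[i][j]` is the frequency of the haplotype carrying allele i at the first locus and allele j at the second) and the function name `ld_measures` are assumptions of this sketch; D_max follows the usual Lewontin bounds for each allele pair.

```python
import math


def ld_measures(f):
    """Return (overall D', Wn) for a haplotype frequency table f[i][j]."""
    p = [sum(row) for row in f]            # allele frequencies at locus 1
    q = [sum(col) for col in zip(*f)]      # allele frequencies at locus 2
    d_prime = 0.0
    chi_sum = 0.0                          # sum of D_ij^2 / (p_i q_j)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            if p_i == 0 or q_j == 0:
                continue                   # allele absent: no contribution
            d = f[i][j] - p_i * q_j
            if d < 0:
                d_max = min(p_i * q_j, (1 - p_i) * (1 - q_j))
            else:
                d_max = min(p_i * (1 - q_j), (1 - p_i) * q_j)
            if d_max > 0:
                d_prime += p_i * q_j * abs(d / d_max)
            chi_sum += d * d / (p_i * q_j)
    wn = math.sqrt(chi_sum / min(len(p) - 1, len(q) - 1))
    return d_prime, wn


# Bi-allelic check: D = 0.15 and r^2 = D^2 / (pA1 pA2 pB1 pB2) = 0.36.
d_prime, wn = ld_measures([[0.4, 0.1], [0.1, 0.4]])
```

For two bi-allelic loci Wn reduces to |r|, which the example uses as a sanity check.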
Standard LD measures D’ and Wn
Standard LD measures (overall D’ & Wn) impose symmetry, even though LD is generally not symmetric when there are more than two alleles per locus.
Data Source: ImmPort Study #SDY26: Identifying polymorphisms associated with risk for the development of myopericarditis following smallpox vaccine
Asymmetric Linkage Disequilibrium (ALD)
Interpretation:
ALD for HLA-DRB1 conditioning on HLA-DQA1: W_DRB1/DQA1 = .58
ALD for HLA-DQA1 conditioning on HLA-DRB1: W_DQA1/DRB1 = .95
The overall variation for DRB1 is relatively high given specific DQA1 alleles.
The overall variation for DQA1 is relatively low given specific DRB1 alleles.
[Figure legend: ALD, row gene conditional on column gene]
Thomson and Single, 2014 Genetics
• Balancing selection can result from:
- Overdominance/Heterozygote advantage
- Frequency-dependent selection
- Selective regimes that change over time/space
• For HLA, the common factor in these models is rare allele advantage, which is consistent with a pathogen-directed frequency-dependent selection model.
• At the Amino Acid (AA) level we see
- High AA variability at antigen recognition sites (ARS)
- Relatively even AA frequencies at ARS sites
- Higher rates of non-synonymous vs. synonymous changes at ARS
Balancing Selection Operates at Most HLA Loci
Homozygosity (F) and the Normalized Deviate (Fnd)
\[ F = \sum_{i=1}^{k} p_i^2 \]
Fnd = (FOBS - FEQ) / SD(FEQ)
[Figure: three bar charts of allele frequency by allele, one per scenario below.]
• Neutrality: FOBS ≈ FEQ, Fnd ≈ 0
• Directional Selection: FOBS > FEQ, Fnd > 0
• Balancing Selection: FOBS < FEQ, Fnd < 0
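The two formulas can be sketched directly. The expected homozygosity under neutrality (FEQ) and its standard deviation are taken as inputs, since the slides do not say how they were obtained; the numeric values below are hypothetical and chosen only to show the sign convention.

```python
def homozygosity(allele_freqs):
    """Expected homozygosity F: the sum of squared allele frequencies."""
    return sum(p * p for p in allele_freqs)


def normalized_deviate(f_obs, f_eq, sd_eq):
    """Fnd = (F_OBS - F_EQ) / SD(F_EQ)."""
    return (f_obs - f_eq) / sd_eq


# Four equally frequent alleles: a very even distribution.
f_obs = homozygosity([0.25, 0.25, 0.25, 0.25])

# Hypothetical neutral expectation and standard deviation (illustration only).
fnd = normalized_deviate(f_obs, f_eq=0.40, sd_eq=0.10)
```

Here the observed homozygosity is below the assumed expectation, so Fnd is negative, the direction associated with balancing selection.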
Fnd for DRB1 AA sites in a EUR population
• Fnd << 0 gives evidence of possible balancing selection.
• Fnd >> 0 gives evidence of possible directional selection.
LD for DRB1 AAs
[Figure: two panels, Wn (symmetric) and asymmetric LD (ALD; row gene conditional on column gene).]
Fnd for DRB1 AA sites (Meta-Analysis)
Fnd for all polymorphic sites in a meta-analysis of 57 populations
Asymmetric Linkage Disequilibrium (ALD)
Table 1. Linkage disequilibrium and genetic diversity measures
Description and definition of measures:
1. Single locus homozygosity (F):
\[ F_A = \sum_i p_{Ai}^2 \]
2. Haplotype specific homozygosity (HSF):
\[ F_{A/B_j} = \sum_i (f_{ij} / p_{Bj})^2 \]
3. Overall weighted HSF values F_{A/B} (and F_{B/A}):
\[ F_{A/B} = \sum_j F_{A/B_j} \, p_{Bj} = F_A + \sum_i \sum_j D_{ij}^2 / p_{Bj} \]
4. Multi-allelic ALD squared W_{A/B} (and W_{B/A}):
\[ W_{A/B}^2 = (F_{A/B} - F_A) / (1 - F_A) \]
If both loci are bi-allelic:
\[ W_{A/B}^2 = \frac{\sum_i \sum_j D_{ij}^2 / p_{Bj}}{1 - F_A} = \frac{D^2}{p_{A1} \, p_{A2} \, p_{B1} \, p_{B2}} = r^2, \quad \text{since } D_{11} = -D_{12} = -D_{21} = D_{22} = D \]
Thomson and Single (2014) Genetics
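The definitions in Table 1 translate into a short calculation. The sketch below assumes the same haplotype-table layout as before (`f[i][j]` for allele i at locus A and allele j at locus B), assumes both loci are polymorphic so that 1 − F is non-zero, and uses a hypothetical function name; the small 3 × 2 table is made up to show the asymmetry.

```python
import math


def ald(f):
    """Return (W_A/B, W_B/A) for a haplotype frequency table f[i][j]."""
    p_a = [sum(row) for row in f]
    p_b = [sum(col) for col in zip(*f)]
    f_a = sum(p * p for p in p_a)          # single locus homozygosity, locus A
    f_b = sum(p * p for p in p_b)          # single locus homozygosity, locus B
    # Overall weighted haplotype-specific homozygosity:
    # F_A/B = sum_j sum_i (f_ij / p_Bj)^2 * p_Bj = sum_ij f_ij^2 / p_Bj
    f_a_given_b = sum(
        f[i][j] ** 2 / p_b[j]
        for i in range(len(p_a)) for j in range(len(p_b)) if p_b[j] > 0
    )
    f_b_given_a = sum(
        f[i][j] ** 2 / p_a[i]
        for i in range(len(p_a)) for j in range(len(p_b)) if p_a[i] > 0
    )
    # max(0.0, ...) guards against tiny negative values from rounding.
    w_a_b = math.sqrt(max(0.0, (f_a_given_b - f_a) / (1 - f_a)))
    w_b_a = math.sqrt(max(0.0, (f_b_given_a - f_b) / (1 - f_b)))
    return w_a_b, w_b_a


# Bi-allelic check: both values should equal |r| = 0.6.
w_sym = ald([[0.4, 0.1], [0.1, 0.4]])

# Made-up 3 x 2 table: each A allele occurs with a single B allele.
w_a_b, w_b_a = ald([[0.3, 0.0], [0.2, 0.0], [0.0, 0.5]])
```

In the 3 × 2 example, knowing the A allele fixes the B allele, so W_B/A is 1, while knowing the B allele still leaves variation at A, so W_A/B is below 1.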