Data Standards and Statistical Issues for Immunogenetic Data

Richard M. Single

Associate Professor of Statistics

Department of Mathematics & Statistics

University of Vermont

• HLA nomenclature: Why it matters for analysis and interpretation
– Challenges for combining HLA data from different sources

• Data Standardization to facilitate meta-analyses and reproducibility
– Developing a community standard for HLA & KIR data reporting

• Overview of HLA data curation & ambiguity resolution
– Example, ImmPort, Next steps: GL strings & QR codes

• HLA (chrom 6) and KIR (chrom 19) interactions
– A brief overview

• HLA and KIR: population-level evidence of co-evolution
– Population-genetic evidence of co-evolution
– Randomization tests and genomic controls

Outline

HLA Nomenclature and why it matters

• Challenges for HLA data management and analysis
– The HLA genes are very polymorphic;
– HLA nomenclature is complicated;
– There are multiple ways to generate HLA data;
– All common typing systems generate ambiguous data;
– There are multiple ways to report alleles and ambiguities.

These issues make meta-analyses of HLA data from different sources very difficult.

Structure of HLA molecules

[Figure: HLA class I and HLA class II molecules, each presenting a peptide fragment. Legend: TCR = T-cell receptor; β2-m = β2-microglobulin.]

• HLA molecules are cell-surface proteins that present peptide fragments to T-cells
• They bind specific sets of peptides based on structure

[Figure: HLA-C binding pocket with positions 73, 77, and 80 marked. Ribbon drawing from Hedrick et al. PNAS, 88, 5897-5901.]

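HLA classical loci and polymorphism

[Figure: map of the HLA classical loci. Class II loci: DP, DQ, DR (genes labeled B1 A1, B1 A1, B1 A); class I loci: B, C, A. Inter-locus distance labels: 50 kb, 850 kb, 100 kb, 1270 kb, 400 kb, 250 kb. Protein-level allele numbers are shown per locus, from IMGT/HLA Database Release 3.12.0, April 17, 2013.]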

HLA-A*24:02:01:02L

• Locus: HLA-A
• Field 1 (2-digit): serological level (where possible)
• Field 2 (4-digit): peptide level (amino acid difference)
• Field 3 (6-digit): nucleotide level [silent] (synonymous substitutions)
• Field 4 (8-digit): intron level (3’ or 5’ polymorphism)
• Expression suffix: N = null, L = low, S = soluble, …

• For most analyses, we want to distinguish among unique peptide sequences, i.e., 2 fields (“4-digit”) level

• This level of resolution treats alleles with the same peptide sequence for exons 2 & 3 (class I) or exon 2 (class II) as being equivalent [“binning” alleles]
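The field structure above lends itself to a small parser. The following is a minimal sketch, assuming colon-delimited names and only the N/L/S suffixes listed on the slide; the function names `parse_allele` and `to_two_fields` are hypothetical, and the sketch simply drops the expression suffix when truncating, which a real pipeline may not want to do.

```python
import re

# Colon-delimited allele name, e.g. "HLA-A*24:02:01:02L".
# Assumption: optional "HLA-" prefix, 1 to 4 numeric fields, optional N/L/S suffix.
ALLELE_RE = re.compile(
    r"^(?:HLA-)?(?P<locus>[A-Z0-9]+)\*"
    r"(?P<fields>\d+(?::\d+){0,3})"
    r"(?P<suffix>[NLS]?)$"
)

def parse_allele(name):
    """Split an allele name into locus, list of fields, and expression suffix."""
    m = ALLELE_RE.match(name.replace(" ", ""))
    if m is None:
        raise ValueError(f"unrecognized allele name: {name!r}")
    return {
        "locus": m["locus"],
        "fields": m["fields"].split(":"),
        "suffix": m["suffix"],
    }

def to_two_fields(name):
    """Truncate an allele name to the 2-field ("4-digit") peptide level."""
    a = parse_allele(name)
    return f'{a["locus"]}*{":".join(a["fields"][:2])}'

print(to_two_fields("HLA-A*24:02:01:02L"))  # A*24:02
```

Truncating by field count alone is not the same as binning by exon 2 & 3 sequence identity; the binning described above needs sequence information that this sketch does not use.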

HLA Allele Nomenclature

• HLA alleles are defined by a “patchwork” of sequence-level polymorphisms.

• Most typing systems do not interrogate the same set of polymorphisms
– e.g., DRB1*14:01:01 vs. *14:54 differ only in exon 3

• There is currently no simple way to identify which alleles could (or could not) have been detected by a given typing system.

HLA Nomenclature & Polymorphism

Distinctive Geographical Distribution of subtypes of HLA-DRB1*08


Data Standardization to facilitate Meta-analyses

Data standardization methods …

• Document the typing method (SSOP, SSP, SBT, …), version, exons interrogated, and the set of detectable alleles

• Perform data validation by checking against IMGT & IPD-KIR allele lists

These practices:
– allow re-evaluation of raw data in future contexts
– allow information/results to be combined across datasets more easily

Extending STREGA to Immunogenomic Studies

• The STrengthening the REporting of Genetic Association studies (STREGA) statement provides community-based data reporting and analysis standards for genomic disease association studies

• The IDAWG (immunogenomics.org) has proposed an extension of STREGA: STrengthening the REporting of Immunogenomic Studies (STREIS)

From STREGA to STREIS

Extensions to the STREGA guidelines for immunogenomic data include:

• Describing the system(s) used to store, manage, and validate genotype and allele data

• Documenting all methods applied to resolve ambiguity
• Defining any codes used to represent ambiguities
• Describing any binning or combining of alleles into common categories
• Avoiding the use of subjective terms (e.g. high-resolution typing) that may change over time


Allele-level Ambiguity

Group codes (“g”-codes) for alleles identical in exons 2 & 3 for class I, or exon 2 for class II.

A*0201/ 0209/ 0243N/ 0266/ 0275/ 0283N/ 0289 = “A020101g”

NMDP ambiguity codes for 4-digit non-null alleles

A*0201/0209 = A*02AF
A*0201/0209/0266 = A*02AJEY
A*0201/0209/0266/0275/0289 = A*02BSFJ

Ambiguous allele sets A*0201/ 0209/ 0243N/ 0266/ 0275/ 0283N/ 0289

Ambiguous alleles result from polymorphisms outside of assessed regions:
• outside of exons 2 & 3, or
• in sections of those exons that were not interrogated.

Genotype-level Ambiguity

Ambiguous genotypes result from an inability to establish the phase of individual polymorphisms or entire exons.

Different combinations of alleles can lead to the same typing result.

Example: A typing result for one individual that could be explained by any of four different possible genotype sets at HLA-B.

Genotype 1: 2705 4402
Genotype 2: 2705 4411
Genotype 3: 2709 4402
Genotype 4: 2709 4411

B*2705 + B*4402 or B*2705 + B*4411 or B*2709 + B*4402 or B*2709 + B*4411

Most analytical methods require a single genotype call for each individual sample.
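The HLA-B example can be reproduced by enumerating every pairing of the candidate alleles for each gene copy. This is only a sketch under a simplifying assumption: it treats the ambiguity as a full cross of two candidate lists, which happens to match this example but is not true of every typing result. The function name `possible_genotypes` is hypothetical.

```python
from itertools import product

def possible_genotypes(candidates_1, candidates_2):
    """Enumerate unordered genotypes from two lists of candidate alleles.

    Assumption: every pairing of a candidate from list 1 with a candidate
    from list 2 is consistent with the typing result.
    """
    genotypes = []
    for a, b in product(candidates_1, candidates_2):
        g = tuple(sorted((a, b)))  # order within a genotype is irrelevant
        if g not in genotypes:
            genotypes.append(g)
    return genotypes

for g in possible_genotypes(["B*2705", "B*2709"], ["B*4402", "B*4411"]):
    print(" + ".join(g))
```

Running it lists the four genotype sets shown above, which makes the need for an explicit, documented rule to pick a single call per sample concrete.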

Standardized Ambiguity Reduction

Sample #001: possible genotypes at HLA-B

Genotype 1
  HLA-B allele 1: 2703, 270502, 270503, 270504, 270505, 270506, 270508, 2710, 2713, 2717
  HLA-B allele 2: 44020101, 44020102S, 440203, 4419N, 4423N, 4424, 4427, 4433
Genotype 2
  HLA-B allele 1: 2703, 270502, 270503, 270504, 270505, 270506, 270508, 2710, 2713, 2717
  HLA-B allele 2: 440202, 4411
Genotype 3
  HLA-B allele 1: 2709
  HLA-B allele 2: 44020101, 44020102S, 440203, 4419N, 4423N, 4424, 4427, 4433
Genotype 4
  HLA-B allele 1: 2709
  HLA-B allele 2: 440202, 4411

Reduction steps:
• Peptide-level filtering, removal of non-CWD alleles, and binning of alleles identical over exons 2 & 3: 2703, 2705 + 4402
• Regional population-level frequency data: 2703 eliminated, leaving 2705 + 4402
• Result (unambiguous data): 2705 + 4402

immunogenomics.org

Genotype List (GL) Strings

• Use a hierarchical set of operators to describe the relationships between
– alleles, lists of possible alleles, phased alleles, genotypes, lists of possible genotypes, and multilocus unphased genotypes,
– without losing typing information or increasing typing ambiguity.

• Are proposed to replace NMDP codes

Milius et al. (2013) Tissue Antigens

Genotype List (GL) Strings

• Example GL string for the genotype:

A*02:69 + A*23:30 or A*02:302 + A*23:26 or A*02:302 + A*23:39

and

B*44:02 + B*49:08
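The genotype above can be written as a GL string by nesting the operators. The sketch below assumes the delimiter set described by Milius et al. (2013): `/` between possible alleles, `+` between gene copies in a genotype, `|` between possible genotypes, and `^` between loci (the phase operator `~` is not needed here). The helper name `gl_string` and the nested-list input layout are hypothetical, and the `HLA-` prefix on allele names is an assumption about the expected naming.

```python
def gl_string(loci):
    """Build a GL string from nested lists.

    loci: list of loci
      each locus: list of possible genotypes
        each genotype: list of gene copies
          each gene copy: list of possible alleles
    """
    locus_parts = []
    for genotypes in loci:
        genotype_parts = []
        for genotype in genotypes:
            copies = ["/".join(alleles) for alleles in genotype]
            genotype_parts.append("+".join(copies))
        locus_parts.append("|".join(genotype_parts))
    return "^".join(locus_parts)

example = [
    [  # HLA-A: three possible genotypes
        [["HLA-A*02:69"], ["HLA-A*23:30"]],
        [["HLA-A*02:302"], ["HLA-A*23:26"]],
        [["HLA-A*02:302"], ["HLA-A*23:39"]],
    ],
    [  # HLA-B: one genotype
        [["HLA-B*44:02"], ["HLA-B*49:08"]],
    ],
]
print(gl_string(example))
```

Each "or" in the plain-language description becomes `|`, each "+" stays `+`, and the "and" between loci becomes `^`.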

• Immunology Database and Analysis Portal (www.ImmPort.org) Developed under the Bioinformatics Integration Support Contract (BISC) for NIH, NIAID, & DAIT (Division of Allergy, Immunology, and Transplantation)

– Data validation pipeline
– Analysis tools
– Standardized ambiguity reduction tools
– Data from a large number of immunogenomic studies

• ImmunoGenomics Data Analysis Working Group (www.immunogenomics.org) (www.IgDAWG.org)

An international collaborative group working to …
– facilitate the sharing of immunogenomic data (HLA, KIR, etc.) and
– foster consistent analysis and interpretation of immunogenomic data

Resources for HLA Data Validation & Analysis


• The KIR gene complex is located on Chromosome 19 (19q13.4)

• KIR are expressed on natural killer (NK) cells and a subset of T cells

• Certain HLA alleles serve as ligands for KIR

KIR Gene    Function      Ligand
2DL1        Inhibitory    HLA-C group 2
2DS1        Activating    HLA-C group 2
2DL2/3      Inhibitory    HLA-C group 1
2DS2        Activating    HLA-C group 1
3DL1        Inhibitory    HLA-Bw4
3DS1        Activating    HLA-Bw4

Killer cell Immunoglobulin-like Receptor (KIR)

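KIR regulate NK cell activity

[Figure, two panels. "Dominant inhibition": an inhibitory KIR (iKIR) on the NK cell binds HLA on a normal cell while an activating receptor binds its ligand; labels: protection, no lysis. "Missing-self recognition": an activating receptor on the NK cell binds its ligand on a target; labels: targets, cytokines.]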

HLA-C alleles can be divided into two groups based on the amino acid at position 80 (& 77), which determines KIR recognition:

• HLA-C1 (Ser77, Asn80): Cw1, Cw3, Cw7, Cw8, Cw12, Cw13, Cw14; recognized by KIR2DL3/2DL2
• HLA-C2 (Asn77, Lys80): Cw2, Cw4, Cw5, Cw6, Cw15, Cw17; recognized by KIR2DL1

Engagement of these receptors on the NK cell leads to inhibition.

Bifurcation of HLA-B allotypes

• Bw4 (40%): KIR3DL1 ligands; KIR3DS1; subsets 80I and 80T
• Bw6 (60%): not a ligand for KIR


KIR & HLA in 30 Global Populations

• Several studies hypothesized selection for KIR that suit the locale-specific HLA repertoire.

• Disease association studies point to HLA-Bw4 alleles with Isoleucine at position 80 (“Bw4-80I”) as the strongest ligand for KIR3DS1

Population-level evidence for Co-evolution & Natural Selection for KIR and HLA

Correlations between frequencies for KIR and HLA Ligands

Inhibitory KIR
• KIR2DL3 vs. HLA-C group 1: r = 0.184
• KIR3DL1 vs. HLA-Bw4: r = 0.426
• KIR2DL1 vs. HLA-C group 2: r = 0.046

Activating KIR
• KIR3DS1 vs. HLA-Bw4: r = -0.632
• KIR2DS1 vs. HLA-C group 2: r = -0.478
• KIR2DS2 vs. HLA-C group 1: r = -0.371

Activating KIR3DS1: subsets of Bw4 alleles based on amino acid position 80

• KIR3DS1 vs. HLA-Bw4: r = -0.632
• KIR3DS1 vs. HLA-Bw4-80I: r = -0.657
• KIR3DS1 vs. HLA-Bw4-80T: r = -0.190

Single et al., Nature Genetics

• Challenges for these and other population studies
– Demographic history shapes patterns of variation & can mimic the effects of selection.
– Gene frequencies are not statistically independent among populations, due to shared demographic history.

• Ordinary Pearson correlation p-values assume independence among the observations.

• We constructed a randomization test to account for the demographic histories of the populations and focus on the genetic effect.

Statistical Issues

Hypothesis Test for a Correlation Coefficient

Assessing the significance of ρ = cor(X,Y)

• Null Hypothesis: H0: ρ = 0

• Statistic: Pearson’s correlation coefficient

$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}} $$

X      Y
4.1    4.9
8.6    5.4
2.3    4.2
5.4    7.4
9.2    8.8
7.7    6.7
6.4    8.8
4.3    5.1
7.6    9.4
3.4    5.3

r_observed = 0.674
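The ten (X, Y) pairs from the slide can be used to check the observed correlation and to illustrate the general idea of a randomization test. This is a minimal sketch of a generic test that shuffles Y against X; it is not the study's randomization scheme, and the two-sided comparison, permutation count, and seed are assumptions.

```python
import math
import random

x = [4.1, 8.6, 2.3, 5.4, 9.2, 7.7, 6.4, 4.3, 7.6, 3.4]
y = [4.9, 5.4, 4.2, 7.4, 8.8, 6.7, 8.8, 5.1, 9.4, 5.3]

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(xs, ys, n_perm=5000, seed=0):
    """Two-sided p-value from shuffling ys relative to xs (generic sketch)."""
    rng = random.Random(seed)
    r_obs = pearson_r(xs, ys)
    shuffled = list(ys)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= abs(r_obs):
            extreme += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (extreme + 1) / (n_perm + 1)

print(round(pearson_r(x, y), 3))  # 0.674
print(permutation_p_value(x, y))
```

Shuffling breaks any association between X and Y while keeping both marginal distributions, so the shuffled correlations approximate the null distribution of r.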

Randomization Test

Population  HLA-B (1)  HLA-B (2)  B-grp (1)  B-grp (2)  HLA-C (1)  HLA-C (2)  C-grp (1)  C-grp (2)
Biaka       0702       1503       Bw6        Bw6        0202       0702       C2         C1
Biaka       0702       4403       Bw6        Bw4        0401       0702       C2         C1
Biaka       1302       3701       Bw4        Bw4        0202       0602       C2         C2
Biaka       4901       5301       Bw4        Bw4        0401       0701       C2         C1
Biaka       3701       3910       Bw4        Bw6        0202       1203       C2         C1
…           …          …          …          …          …          …          …          …

• Bw4 alleles: 1301, 1302, 1516, 1517, 2702, 2703, 2704, 2705, 3701, 3801, 3802, 4402, 4403, 4404, 4405, ...

• Bw6 alleles: 0702, 0705, 0799, 0801, 1401, 1402, 1403, 1501, 1502, 1503, 1504, 1506, 1507, 1508, 1510, ...

• Reassign Bw4/Bw6 status to simulate the null hypothesis
• Compute correlation of frequencies for KIR3DS1 & reassigned HLA
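The two steps above can be sketched as follows. Everything numeric here is invented for illustration: the allele counts, the KIR3DS1 frequencies, and the choice of a two-sided comparison are assumptions, and the function names are hypothetical. The sketch shuffles Bw4/Bw6 labels among alleles (keeping the number of Bw4 alleles fixed), so each population keeps its own allele frequencies and only the ligand assignment changes.

```python
import math
import random

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0  # guard against constant input

def group_frequency(allele_counts, group_of, group="Bw4"):
    """Frequency of alleles carrying the given group label in one population."""
    total = sum(allele_counts.values())
    in_group = sum(c for a, c in allele_counts.items() if group_of[a] == group)
    return in_group / total

def label_permutation_test(pop_counts, kir_freqs, group_of, n_perm=2000, seed=1):
    rng = random.Random(seed)
    alleles = sorted(group_of)
    labels = [group_of[a] for a in alleles]

    def corr(mapping):
        hla = [group_frequency(c, mapping) for c in pop_counts]
        return pearson_r(kir_freqs, hla)

    r_obs = corr(group_of)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # reassign Bw4/Bw6 status among alleles
        if abs(corr(dict(zip(alleles, labels)))) >= abs(r_obs):
            extreme += 1
    return r_obs, (extreme + 1) / (n_perm + 1)

# Hypothetical data: allele counts per population and KIR3DS1 frequencies.
group_of = {"2705": "Bw4", "4402": "Bw4", "3701": "Bw4",
            "0702": "Bw6", "0801": "Bw6", "1501": "Bw6"}
pop_counts = [
    {"2705": 10, "4402": 20, "3701": 10, "0702": 30, "0801": 20, "1501": 10},
    {"2705": 5, "4402": 10, "3701": 5, "0702": 40, "0801": 25, "1501": 15},
    {"2705": 20, "4402": 25, "3701": 15, "0702": 15, "0801": 15, "1501": 10},
    {"2705": 15, "4402": 15, "3701": 10, "0702": 25, "0801": 20, "1501": 15},
]
kir_freqs = [0.30, 0.45, 0.10, 0.25]

r_obs, p_perm = label_permutation_test(pop_counts, kir_freqs, group_of)
print(r_obs, p_perm)
```

Because the populations and their allele frequencies are held fixed, shared demographic history is preserved in every permuted data set, which is the point of permuting labels rather than populations.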

Permutation Distribution

[Figure: permutation distribution of the KIR3DS1 – HLA-Bw4 correlation (x-axis: correlation, -0.5 to 0.5). Observed r = -0.632; permutation p-value = 0.012.]

• Empirical comparisons based on genomic data or other methods that incorporate information about the demographic histories of populations (Pritchard and Donnelly, 2001).

– Our study used data from the ALFRED database to assess statistical significance http://alfred.med.yale.edu

– We selected 538 neutral sites from 202 genes typed in the same individuals

Genomic Controls

Genomic Data

• Randomly select two SNP sites from different chromosomes
• Find the frequencies in each population and compute the correlation
• Repeat
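The sampling loop above can be sketched as follows. The SNP frequencies are synthetic and independent across populations, so unlike real data they carry no shared demographic signal; the dictionary layout keyed by `(chromosome, snp_id)`, the function name, and the two-sided comparison are all assumptions made for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0

def empirical_p_value(r_obs, snp_freqs, n_pairs=2000, seed=2):
    """Compare r_obs with correlations between SNPs on different chromosomes.

    snp_freqs maps (chromosome, snp_id) -> list of per-population frequencies.
    """
    keys = sorted(snp_freqs)
    if len({chrom for chrom, _ in keys}) < 2:
        raise ValueError("need SNPs on at least two chromosomes")
    rng = random.Random(seed)
    null_r = []
    while len(null_r) < n_pairs:
        a, b = rng.sample(keys, 2)
        if a[0] == b[0]:
            continue  # skip pairs on the same chromosome
        null_r.append(pearson_r(snp_freqs[a], snp_freqs[b]))
    extreme = sum(1 for r in null_r if abs(r) >= abs(r_obs))
    return extreme / len(null_r), null_r

# Synthetic stand-in for population frequencies at unlinked neutral SNPs.
gen = random.Random(0)
snp_freqs = {(chrom, i): [gen.random() for _ in range(30)]
             for chrom in (1, 2, 3, 4) for i in range(10)}

p_emp, null_r = empirical_p_value(-0.632, snp_freqs)
print(p_emp)
```

With real frequency data, correlations between unlinked neutral sites reflect demography alone, so the empirical distribution is typically wider than the one implied by the ordinary Pearson test.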

Genomic Data for Empirical Tests

[Figure: population frequencies for a selected pair of SNP sites; axis label: SNP site 1 (0.2 to 1.0).]

Genomic Data – Empirical Distribution

[Figure: empirical distribution for correlations among unlinked SNPs (x-axis: correlation, -1.0 to 1.0). KIR3DS1 – HLA-Bw4 correlation: r = -0.632; empirical p-value = 0.041.]

Significance of Correlations *

locus pair       Correlation   p-value (1)     p-value (2)     p-value (3)
                               (correlation)   (permutation)   (empirical)
3DS1 - Bw4       -0.632        0.000           0.012           0.041
3DS1 - Bw4-80I   -0.657        0.000           0.009           0.038
3DS1 - Bw4-80T   -0.190        0.316           0.532           0.534
3DL1 - Bw4        0.426        0.019           0.106           0.218
3DL1 - Bw4-80I    0.416        0.022           0.115           0.191
3DL1 - Bw4-80T    0.171        0.367           0.540           0.758
2DS1 - C2        -0.478        0.008           0.243           0.149
2DL1 - C2         0.046        0.810           0.891           0.924
2DL2 - C1        -0.366        0.047           0.193           0.542
2DL3 - C1         0.184        0.331           0.458           0.328
2DS2 - C1        -0.371        0.044           0.170           0.479

(1) P-correlation is the ordinary Pearson product-moment correlation p-value.
(2) P-permutation is based on the permutation distribution under the null hypothesis.
(3) P-empirical is based on the empirical distribution for unlinked SNPs from ALFRED.

* Ordinary Pearson p-values (shown in red on the slide) overestimate the significance of trends.


Acknowledgements

NCI: Mary Carrington, Pat Martin, Gao Xiaojiang

USP: Diogo Meyer, Rodrigo dos Santos Francisco

Yale University: Ken and Judy Kidd

Children's Hospital Oakland Research Inst.: Steven J. Mack, Jill A. Hollenbach

Harvard Medical School: Alex Lancaster

UC San Francisco: Owen Solberg

Roche Molecular Systems: Henry A. Erlich

Anthony Nolan Research Inst.: Steven G.E. Marsh

NCBI/NIH: Mike Feolo

NGIT: Jeff Wiser, Patrick Dunn, Tom Smith

If time allows …

The two most common measures of the strength of LD are:

(1) the normalized measure of the individual LD values, namely $D'_{ij} = D_{ij} / D_{max}$ (Lewontin 1964); and

(2) the correlation coefficient r for bi-allelic data, which is most often reported as $r^2 = D^2 / (p_{A1} p_{A2} p_{B1} p_{B2})$.

r = 1 only when the allelic variations at the two loci show 100% correlation.

Their multi-allelic extensions (for loci with I and J alleles) are:

$$ D' = \sum_i \sum_j p_i q_j \, |D'_{ij}| $$

$$ W_n = \left[ \frac{\sum_i \sum_j D_{ij}^2 / (p_i q_j)}{\min(I-1,\, J-1)} \right]^{1/2} = \left[ \frac{X^2_{LD} / N}{\min(I-1,\, J-1)} \right]^{1/2} $$

Linkage Disequilibrium (LD) Measures
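For the bi-allelic case, D, D' and r² follow directly from one haplotype frequency and the two allele frequencies. The sketch below assumes the usual Lewontin normalization, where D_max depends on the sign of D; the function name and the example numbers are hypothetical, and both loci are assumed to be polymorphic.

```python
def biallelic_ld(f11, p_a1, p_b1):
    """Return (D, D', r^2) for two bi-allelic loci.

    f11  : frequency of the A1-B1 haplotype
    p_a1 : frequency of allele A1
    p_b1 : frequency of allele B1
    Assumes 0 < p_a1 < 1 and 0 < p_b1 < 1.
    """
    p_a2, p_b2 = 1.0 - p_a1, 1.0 - p_b1
    d = f11 - p_a1 * p_b1
    # D_max is the largest |D| compatible with the allele frequencies.
    if d >= 0:
        d_max = min(p_a1 * p_b2, p_a2 * p_b1)
    else:
        d_max = min(p_a1 * p_b1, p_a2 * p_b2)
    d_prime = d / d_max if d_max > 0 else 0.0
    r2 = d ** 2 / (p_a1 * p_a2 * p_b1 * p_b2)
    return d, d_prime, r2

d, d_prime, r2 = biallelic_ld(f11=0.4, p_a1=0.5, p_b1=0.5)
print(d, d_prime, r2)  # approximately 0.15, 0.6, 0.36
```

D' reaches 1 whenever at least one of the four haplotypes is absent, whereas r² reaches 1 only when two of the four are absent, which is why the two measures can disagree.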

Standard LD measures D’ and Wn

Standard LD measures (overall D’ & Wn) assume/force symmetry, even though with >2 alleles per locus that is not the case

Data Source: ImmPort Study #SDY26: Identifying polymorphisms associated with risk for the development of myopericarditis following smallpox vaccine

Asymmetric Linkage Disequilibrium (ALD)

Interpretation:

ALD for HLA-DRB1 conditioning on HLA-DQA1: W_{DRB1/DQA1} = 0.58

ALD for HLA-DQA1 conditioning on HLA-DRB1: W_{DQA1/DRB1} = 0.95

The overall variation for DRB1 is relatively high given specific DQA1 alleles.

The overall variation for DQA1 is relatively low given specific DRB1 alleles.

ALD: row gene conditional on column gene

Thomson and Single, 2014 Genetics

• Balancing selection can result from:

– Overdominance/Heterozygote advantage
– Frequency-dependent selection
– Selective regimes that change over time/space

• For HLA, the common factor in these models is rare allele advantage, which is consistent with a pathogen-directed frequency-dependent selection model.

• At the Amino Acid (AA) level we see
– High AA variability at antigen recognition sites (ARS)
– Relatively even AA frequencies at ARS sites
– Higher rates of non-synonymous vs. synonymous changes at ARS

Balancing Selection Operates at Most HLA Loci

Homozygosity (F) and the Normalized Deviate (Fnd)

[Figure: allele frequency plots (axis label: allele) for the three cases below.]

• Neutrality: FOBS ≈ FEQ, Fnd ≈ 0
• Directional Selection: FOBS > FEQ, Fnd > 0
• Balancing Selection: FOBS < FEQ, Fnd < 0

Fnd = (FOBS - FEQ) / SD(FEQ)
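The normalized deviate is a one-line calculation once the observed homozygosity and the equilibrium expectation are available. In this sketch F_EQ and SD(F_EQ) are passed in as given numbers; computing them from a neutral model is outside its scope, and the values used in the example are invented.

```python
def homozygosity(freqs):
    """Expected homozygosity F: the sum of squared allele frequencies."""
    return sum(p * p for p in freqs)

def normalized_deviate(f_obs, f_eq, sd_f_eq):
    """Fnd = (F_OBS - F_EQ) / SD(F_EQ)."""
    return (f_obs - f_eq) / sd_f_eq

# Four equally frequent alleles: a very even distribution.
f_obs = homozygosity([0.25, 0.25, 0.25, 0.25])

# Hypothetical neutral expectation and standard deviation.
fnd = normalized_deviate(f_obs, f_eq=0.40, sd_f_eq=0.10)
print(f_obs, fnd)  # 0.25 and approximately -1.5
```

An even allele frequency distribution gives a low F_OBS and therefore a negative Fnd, the direction associated above with balancing selection.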

Fnd for DRB1 AA sites in a EUR population

• Fnd << 0 gives evidence of possible balancing selection.
• Fnd >> 0 gives evidence of possible directional selection.

LD for DRB1 AAs

[Figure, two panels of LD among DRB1 amino acid sites: Wn (symmetric) and Asymmetric LD (ALD; row gene conditional on column gene).]

Fnd for DRB1 AA sites (Meta-Analysis)

Fnd for all polymorphic sites in a meta-analysis of 57 populations

• Fnd << 0 gives evidence of possible balancing selection.
• Fnd >> 0 gives evidence of possible directional selection.


Asymmetric Linkage Disequilibrium (ALD)

Table 1. Linkage disequilibrium and genetic diversity measures

Description and definition of measures:

1. Single locus homozygosity (F)

$$ F_A = \sum_i p_{Ai}^2 $$

2. Haplotype specific homozygosity (HSF)

$$ F_{A/B_j} = \sum_i (f_{ij} / p_{Bj})^2 $$

3. Overall weighted HSF values $F_{A/B}$ (and $F_{B/A}$)

$$ F_{A/B} = \sum_j (F_{A/B_j})(p_{Bj}) = F_A + \sum_i \sum_j D_{ij}^2 / p_{Bj} $$

4. Multi-allelic ALD squared $W_{A/B}^2$ (and $W_{B/A}^2$)

$$ W_{A/B}^2 = (F_{A/B} - F_A) / (1 - F_A) $$

If both loci are bi-allelic:

$$ W_{A/B}^2 = \frac{\sum_i \sum_j (D_{ij}^2 / p_{Bj})}{1 - F_A} = \frac{D^2}{p_{A1} p_{A2} p_{B1} p_{B2}} = r^2, \quad \text{since } D_{11} = -D_{12} = -D_{21} = D_{22} = D $$

Thomson and Single (2014) Genetics
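The Table 1 definitions translate directly into a few sums over a haplotype frequency table. The sketch below is a plain-Python illustration under the assumption that `hap[i][j]` holds the frequency of haplotype A_i-B_j and that the table sums to 1; the function name and both example tables are invented.

```python
def ald_squared(hap):
    """Return (W^2_{A/B}, W^2_{B/A}) from a haplotype frequency table.

    hap[i][j] is the frequency of the haplotype carrying allele i at
    locus A and allele j at locus B. Both loci must be polymorphic.
    """
    p_a = [sum(row) for row in hap]        # allele frequencies at A
    p_b = [sum(col) for col in zip(*hap)]  # allele frequencies at B
    f_a = sum(p * p for p in p_a)          # single locus homozygosity F_A
    f_b = sum(p * p for p in p_b)          # single locus homozygosity F_B
    rows, cols = range(len(p_a)), range(len(p_b))
    # Overall weighted HSF: F_{A/B} = sum_ij f_ij^2 / p_Bj (and likewise B/A).
    f_a_given_b = sum(hap[i][j] ** 2 / p_b[j] for i in rows for j in cols if p_b[j] > 0)
    f_b_given_a = sum(hap[i][j] ** 2 / p_a[i] for i in rows for j in cols if p_a[i] > 0)
    w2_ab = (f_a_given_b - f_a) / (1.0 - f_a)
    w2_ba = (f_b_given_a - f_b) / (1.0 - f_b)
    return w2_ab, w2_ba

# Bi-allelic check: both values should equal r^2 (0.36 for this table).
print(ald_squared([[0.4, 0.1], [0.1, 0.4]]))

# Three alleles at A, two at B: B is fully determined by A, but not the reverse.
print(ald_squared([[0.3, 0.0], [0.2, 0.0], [0.0, 0.5]]))
```

The second table shows the asymmetry: knowing the A allele fixes the B allele, so W²_{B/A} is 1, while two A alleles share the same B allele, so W²_{A/B} stays below 1.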
