74
Machine Learning for Biomarker Discovery Karsten Borgwardt ETH Z¨ urich, Department Biosystems Workshop with Huawei, May 25, 2018 Department Biosystems

Karsten Borgwardt ETH Zurich, Department Biosystems

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Karsten Borgwardt ETH Zurich, Department Biosystems

Machine Learning for Biomarker Discovery

Karsten Borgwardt

ETH Zurich, Department Biosystems Workshop with Huawei, May 25, 2018

Department Biosystems

Page 2: Karsten Borgwardt ETH Zurich, Department Biosystems

Machine Learning and Personalized Medicine

Goals

Machine Learning tries to detect statistical dependencies in large datasets.

Personalized Medicine tries to exploit wealth of health data for improved diagnosis,prognosis and therapy decisions, tailored to the properties of each patient.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 2 / 46

Page 3: Karsten Borgwardt ETH Zurich, Department Biosystems

Machine Learning and Personalized Medicine

Goals

Machine Learning tries to detect statistical dependencies in large datasets.

Personalized Medicine tries to exploit wealth of health data for improved diagnosis,prognosis and therapy decisions, tailored to the properties of each patient.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 2 / 46

Page 4: Karsten Borgwardt ETH Zurich, Department Biosystems

Machine Learning and Personalized Medicine

Goals

Machine Learning tries to detect statistical dependencies in large datasets.

Personalized Medicine tries to exploit wealth of health data for improved diagnosis,prognosis and therapy decisions, tailored to the properties of each patient.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 2 / 46

Page 5: Karsten Borgwardt ETH Zurich, Department Biosystems

Machine Learning in Medicine

Key Topics

Automation of diagnoses

Biomarker discovery

1 A new technique:Combinatorial AssociationMapping

Biomedical data management

2 Software development for theLife Sciences: easyGWAS

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 3 / 46

Page 6: Karsten Borgwardt ETH Zurich, Department Biosystems

Machine Learning in Medicine

Key Topics

Automation of diagnoses

Biomarker discovery

1 A new technique:Combinatorial AssociationMapping

Biomedical data management

2 Software development for theLife Sciences: easyGWAS

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 3 / 46

Page 7: Karsten Borgwardt ETH Zurich, Department Biosystems

Machine Learning in Medicine

Key Topics

Automation of diagnoses

Biomarker discovery

1 A new technique:Combinatorial AssociationMapping

Biomedical data management

2 Software development for theLife Sciences: easyGWAS

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 3 / 46

Page 8: Karsten Borgwardt ETH Zurich, Department Biosystems

Machine Learning in Medicine

Key Topics

Automation of diagnoses

Biomarker discovery

1 A new technique:Combinatorial AssociationMapping

Biomedical data management

2 Software development for theLife Sciences: easyGWAS

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 3 / 46

Page 9: Karsten Borgwardt ETH Zurich, Department Biosystems

Machine Learning in Medicine

Key Topics

Automation of diagnoses

Biomarker discovery

1 A new technique:Combinatorial AssociationMapping

Biomedical data management

2 Software development for theLife Sciences: easyGWAS

ATGCATGCATGCATCACCATGCATGCTAGCTACG

ATGCAGGCATGCATCCCCATGCATGCTAGCGACG

ATGCATGCATGCATCACCATGCATGCTAGCGACG

ATGCAGGCATGCATCACCATGCATGCTAGCTACG

ATGCATGCATGCATCACCATGCATGCTAGCGACG

Sequence variationSignificant association betweenphenotype and genotype?

Disease

0

0

0

1

1

GenotypeIndividual 1

Individual 2

Individual 3

Individual 4

Individual 5

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 3 / 46

Page 10: Karsten Borgwardt ETH Zurich, Department Biosystems

Machine Learning in Medicine

Key Topics

Automation of diagnoses

Biomarker discovery

1 A new technique:Combinatorial AssociationMapping

Biomedical data management

2 Software development for theLife Sciences: easyGWAS

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 3 / 46

Page 11: Karsten Borgwardt ETH Zurich, Department Biosystems

Combinatorial Association Mapping

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 4 / 46

Page 12: Karsten Borgwardt ETH Zurich, Department Biosystems

Association Mapping: Mapping Phenotypes to the Genome

ATGCATGCATGCATCACCATGCATGCTAGCTACG

ATGCAGGCATGCATCCCCATGCATGCTAGCGACG

ATGCATGCATGCATCACCATGCATGCTAGCGACG

ATGCAGGCATGCATCACCATGCATGCTAGCTACG

ATGCATGCATGCATCACCATGCATGCTAGCGACG

Sequence variationSignificant association betweenphenotype and genotype?

Disease

0

0

0

1

1

GenotypeIndividual 1

Individual 2

Individual 3

Individual 4

Individual 5

A genome-wide association study (GWAS) examines whether variation in the genome (inform of single nucleotide polymorphisms, SNPs) correlates with variation in thephenotype.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 5 / 46

Page 13: Karsten Borgwardt ETH Zurich, Department Biosystems

Association Mapping: Missing Heritability

Since 2001: More than 59,000 trait-related loci from GWAS (GWAS catalog, March 6, 2018)

Problem: Phenotypic variance explained still disappointingly low

REVIEWS

Finding the missing heritability of complexdiseasesTeri A. Manolio1, Francis S. Collins2, Nancy J. Cox3, David B. Goldstein4, Lucia A. Hindorff5, David J. Hunter6,Mark I. McCarthy7, Erin M. Ramos5, Lon R. Cardon8, Aravinda Chakravarti9, Judy H. Cho10, Alan E. Guttmacher1,Augustine Kong11, Leonid Kruglyak12, Elaine Mardis13, Charles N. Rotimi14, Montgomery Slatkin15, David Valle9,Alice S. Whittemore16, Michael Boehnke17, Andrew G. Clark18, Evan E. Eichler19, Greg Gibson20, Jonathan L. Haines21,Trudy F. C. Mackay22, Steven A. McCarroll23 & Peter M. Visscher24

Genome-wide association studies have identified hundreds of genetic variants associated with complex human diseases andtraits, and have provided valuable insights into their genetic architecture. Most variants identified so far confer relativelysmall increments in risk, and explain only a small proportion of familial clustering, leading many to question how theremaining, ‘missing’ heritability can be explained. Here we examine potential sources of missing heritability and proposeresearch strategies, including and extending beyond current genome-wide association approaches, to illuminate the geneticsof complex diseases and enhance its potential to enable effective disease prevention or treatment.

Many common human diseases and traits are known tocluster in families and are believed to be influenced byseveral genetic and environmental factors, but untilrecently the identification of genetic variants contributing

to these ‘complex diseases’ has been slow and arduous1. Genome-wideassociation studies (GWAS), in which several hundred thousand tomore than a million single nucleotide polymorphisms (SNPs) areassayed in thousands of individuals, represent a powerful new tool forinvestigating the genetic architecture of complex diseases1,2. In the pastfew years, these studies have identified hundreds of genetic variantsassociated with such conditions and have provided valuable insightsinto the complexities of their genetic architecture3,4.

The genome-wide association (GWA) method represents animportant advance compared to ‘candidate gene’ studies, in whichsample sizes are generally smaller and the variants assayed are limitedto a selected few, often on the basis of imperfect understanding ofbiological pathways and often yielding associations that are difficultto replicate5,6. GWAS are also an important step beyond family-basedlinkage studies, in which inheritance patterns are related to severalhundreds to thousands of genomic markers. Despite many clearsuccesses in single-gene ‘Mendelian’ disorders7,8, the limited successof linkage studies in complex diseases has been attributed to their lowpower and resolution for variants of modest effect9–11.

The underlying rationale for GWAS is the ‘common disease,common variant’ hypothesis, positing that common diseases areattributable in part to allelic variants present in more than 1–5% ofthe population12–14. They have been facilitated by the development ofcommercial ‘SNP chips’ or arrays that capture most, although not all,common variation in the genome. Although the allelic architecture ofsome conditions, notably age-related macular degeneration, for themost part reflects the contributions of several variants of large effect(defined loosely here as those increasing disease risk by twofold ormore), most common variants individually or in combination conferrelatively small increments in risk (1.1–1.5-fold) and explain only asmall proportion of heritability—the portion of phenotypic variancein a population attributable to additive genetic factors3. For example,at least 40 loci have been associated with human height, a classiccomplex trait with an estimated heritability of about 80%, yet theyexplain only about 5% of phenotypic variance despite studies of tensof thousands of people15. Although disease-associated variants occurmore frequently in protein-coding regions than expected from theirrepresentation on genotyping arrays, in which over-representation ofcommon and functional variants may introduce analytical biases, thevast majority (.80%) of associated variants fall outside codingregions, emphasizing the importance of including both coding andnon-coding regions in the search for disease-associated variants3.

1National Human Genome Research Institute, Building 31, Room 4B09, 31 Center Drive, MSC 2152, Bethesda, Maryland 20892-2152, USA. 2National Institutes of Health, Building 1,Room 126, MSC 0148, Bethesda, Maryland 20892-0148, USA. 3Departments of Medicine and Human Genetics, University of Chicago, Room A612, MC 6091, 5841 South MarylandAvenue, Chicago, Illinois 60637, USA. 4Duke University, The Institute for Genome Sciences and Policy (IGSP), Box 91009, Durham, North Carolina 27708, USA. 5National HumanGenome Research Institute, Office of Population Genomics, Suite 4076, MSC 9305, 5635 Fishers Lane, Rockville, Maryland 20892-9305, USA. 6Department of Epidemiology, HarvardSchool of Public Health, 677 Huntington Avenue, Boston, Massachusetts 02115, USA. 7University of Oxford, Oxford Centre for Diabetes, Endocrinology and Metabolism, ChurchillHospital, Old Road, Oxford OX3 7LJ, UK, and Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford OX3 7BN, UK. 8GlaxoSmithKline, 709Swedeland Road, King of Prussia, Pennsylvania 19406, USA. 9McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, 733 North BroadwayBRB579, Baltimore, Maryland 21205, USA. 10Yale University, Department of Medicine, Division of Digestive Diseases, 333 Cedar Street, New Haven, Connecticut 06520-8019, USA.11deCODE Genetics, Sturlugata 8, Reykjavik IS-101, Iceland. 12Lewis-Sigler Institute for Integrative Genomics, Howard Hughes Medical Institute, and Department of Ecology andEvolutionary Biology, Princeton University, Princeton, New Jersey 08544, USA. 13The Genome Center, Washington University School of Medicine, 4444 Forest Park Avenue, CampusBox 8501, Saint Louis, Missouri 63108, USA. 14National Human Genome Research Institute, Center for Research on Genomics and Global Health, Building 12A, Room 4047, 12 SouthDrive, MSC 5635, Bethesda, Maryland 20892-5635, USA. 15Department of Integrative Biology, University of California, 3060 Valley Life Science Building, Berkeley, California 94720-3140, USA. 16Stanford University, Health Research and Policy, Redwood Building, Room T204, 259 Campus Drive, Stanford, California 94305, USA. 17Department of Biostatistics,University of Michigan, 1420 Washington Heights, Ann Arbor, Michigan 48109-2029, USA. 18Department of Molecular Biology and Genetics, 107 Biotechnology Building, CornellUniversity, Ithaca, New York 14853, USA. 19Howard Hughes Medical Institute and University of Washington, Department of Genome Sciences, 1705 North-East Pacific Street, FoegeBuilding, Box 355065, Seattle, Washington 98195-5065, USA. 20University of Queensland, School of Biological Sciences, Goddard Building, Saint Lucia Campus, Brisbane, Queensland4072, Australia. 21Vanderbilt University, Center for Human Genetics Research, 519 Light Hall, Nashville, Tennessee 37232-0700, USA. 22Department of Genetics, North CarolinaState University, Box 7614, Raleigh, North Carolina 27695, USA. 23Department of Genetics, Harvard Medical School, 77 Avenue Louis Pasteur, NRB 0330, Boston, Massachusetts02115, USA. 24Queensland Institute of Medical Research, 300 Herston Road, Brisbane, Queensland 4006, Australia.

Vol 461j8 October 2009jdoi:10.1038/nature08494

747 Macmillan Publishers Limited. All rights reserved©2009

Potential reasons:

Small effect sizesIncomplete integration of important genetic, epigenetic or non-genetic properties

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 6 / 46

Page 14: Karsten Borgwardt ETH Zurich, Department Biosystems

Association Mapping: Missing Heritability

Since 2001: More than 59,000 trait-related loci from GWAS (GWAS catalog, March 6, 2018)

Problem: Phenotypic variance explained still disappointingly low

REVIEWS

Finding the missing heritability of complexdiseasesTeri A. Manolio1, Francis S. Collins2, Nancy J. Cox3, David B. Goldstein4, Lucia A. Hindorff5, David J. Hunter6,Mark I. McCarthy7, Erin M. Ramos5, Lon R. Cardon8, Aravinda Chakravarti9, Judy H. Cho10, Alan E. Guttmacher1,Augustine Kong11, Leonid Kruglyak12, Elaine Mardis13, Charles N. Rotimi14, Montgomery Slatkin15, David Valle9,Alice S. Whittemore16, Michael Boehnke17, Andrew G. Clark18, Evan E. Eichler19, Greg Gibson20, Jonathan L. Haines21,Trudy F. C. Mackay22, Steven A. McCarroll23 & Peter M. Visscher24

Genome-wide association studies have identified hundreds of genetic variants associated with complex human diseases andtraits, and have provided valuable insights into their genetic architecture. Most variants identified so far confer relativelysmall increments in risk, and explain only a small proportion of familial clustering, leading many to question how theremaining, ‘missing’ heritability can be explained. Here we examine potential sources of missing heritability and proposeresearch strategies, including and extending beyond current genome-wide association approaches, to illuminate the geneticsof complex diseases and enhance its potential to enable effective disease prevention or treatment.

Many common human diseases and traits are known tocluster in families and are believed to be influenced byseveral genetic and environmental factors, but untilrecently the identification of genetic variants contributing

to these ‘complex diseases’ has been slow and arduous1. Genome-wideassociation studies (GWAS), in which several hundred thousand tomore than a million single nucleotide polymorphisms (SNPs) areassayed in thousands of individuals, represent a powerful new tool forinvestigating the genetic architecture of complex diseases1,2. In the pastfew years, these studies have identified hundreds of genetic variantsassociated with such conditions and have provided valuable insightsinto the complexities of their genetic architecture3,4.

The genome-wide association (GWA) method represents animportant advance compared to ‘candidate gene’ studies, in whichsample sizes are generally smaller and the variants assayed are limitedto a selected few, often on the basis of imperfect understanding ofbiological pathways and often yielding associations that are difficultto replicate5,6. GWAS are also an important step beyond family-basedlinkage studies, in which inheritance patterns are related to severalhundreds to thousands of genomic markers. Despite many clearsuccesses in single-gene ‘Mendelian’ disorders7,8, the limited successof linkage studies in complex diseases has been attributed to their lowpower and resolution for variants of modest effect9–11.

The underlying rationale for GWAS is the ‘common disease,common variant’ hypothesis, positing that common diseases areattributable in part to allelic variants present in more than 1–5% ofthe population12–14. They have been facilitated by the development ofcommercial ‘SNP chips’ or arrays that capture most, although not all,common variation in the genome. Although the allelic architecture ofsome conditions, notably age-related macular degeneration, for themost part reflects the contributions of several variants of large effect(defined loosely here as those increasing disease risk by twofold ormore), most common variants individually or in combination conferrelatively small increments in risk (1.1–1.5-fold) and explain only asmall proportion of heritability—the portion of phenotypic variancein a population attributable to additive genetic factors3. For example,at least 40 loci have been associated with human height, a classiccomplex trait with an estimated heritability of about 80%, yet theyexplain only about 5% of phenotypic variance despite studies of tensof thousands of people15. Although disease-associated variants occurmore frequently in protein-coding regions than expected from theirrepresentation on genotyping arrays, in which over-representation ofcommon and functional variants may introduce analytical biases, thevast majority (.80%) of associated variants fall outside codingregions, emphasizing the importance of including both coding andnon-coding regions in the search for disease-associated variants3.

1National Human Genome Research Institute, Building 31, Room 4B09, 31 Center Drive, MSC 2152, Bethesda, Maryland 20892-2152, USA. 2National Institutes of Health, Building 1,Room 126, MSC 0148, Bethesda, Maryland 20892-0148, USA. 3Departments of Medicine and Human Genetics, University of Chicago, Room A612, MC 6091, 5841 South MarylandAvenue, Chicago, Illinois 60637, USA. 4Duke University, The Institute for Genome Sciences and Policy (IGSP), Box 91009, Durham, North Carolina 27708, USA. 5National HumanGenome Research Institute, Office of Population Genomics, Suite 4076, MSC 9305, 5635 Fishers Lane, Rockville, Maryland 20892-9305, USA. 6Department of Epidemiology, HarvardSchool of Public Health, 677 Huntington Avenue, Boston, Massachusetts 02115, USA. 7University of Oxford, Oxford Centre for Diabetes, Endocrinology and Metabolism, ChurchillHospital, Old Road, Oxford OX3 7LJ, UK, and Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford OX3 7BN, UK. 8GlaxoSmithKline, 709Swedeland Road, King of Prussia, Pennsylvania 19406, USA. 9McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, 733 North BroadwayBRB579, Baltimore, Maryland 21205, USA. 10Yale University, Department of Medicine, Division of Digestive Diseases, 333 Cedar Street, New Haven, Connecticut 06520-8019, USA.11deCODE Genetics, Sturlugata 8, Reykjavik IS-101, Iceland. 12Lewis-Sigler Institute for Integrative Genomics, Howard Hughes Medical Institute, and Department of Ecology andEvolutionary Biology, Princeton University, Princeton, New Jersey 08544, USA. 13The Genome Center, Washington University School of Medicine, 4444 Forest Park Avenue, CampusBox 8501, Saint Louis, Missouri 63108, USA. 14National Human Genome Research Institute, Center for Research on Genomics and Global Health, Building 12A, Room 4047, 12 SouthDrive, MSC 5635, Bethesda, Maryland 20892-5635, USA. 15Department of Integrative Biology, University of California, 3060 Valley Life Science Building, Berkeley, California 94720-3140, USA. 16Stanford University, Health Research and Policy, Redwood Building, Room T204, 259 Campus Drive, Stanford, California 94305, USA. 17Department of Biostatistics,University of Michigan, 1420 Washington Heights, Ann Arbor, Michigan 48109-2029, USA. 18Department of Molecular Biology and Genetics, 107 Biotechnology Building, CornellUniversity, Ithaca, New York 14853, USA. 19Howard Hughes Medical Institute and University of Washington, Department of Genome Sciences, 1705 North-East Pacific Street, FoegeBuilding, Box 355065, Seattle, Washington 98195-5065, USA. 20University of Queensland, School of Biological Sciences, Goddard Building, Saint Lucia Campus, Brisbane, Queensland4072, Australia. 21Vanderbilt University, Center for Human Genetics Research, 519 Light Hall, Nashville, Tennessee 37232-0700, USA. 22Department of Genetics, North CarolinaState University, Box 7614, Raleigh, North Carolina 27695, USA. 23Department of Genetics, Harvard Medical School, 77 Avenue Louis Pasteur, NRB 0330, Boston, Massachusetts02115, USA. 24Queensland Institute of Medical Research, 300 Herston Road, Brisbane, Queensland 4006, Australia.

Vol 461j8 October 2009jdoi:10.1038/nature08494

747 Macmillan Publishers Limited. All rights reserved©2009

Potential reasons:Polygenic architectures of complex diseasesSmall effect sizesIncomplete integration of important genetic, epigenetic or non-genetic properties

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 6 / 46

Page 15: Karsten Borgwardt ETH Zurich, Department Biosystems

Association Mapping: Missing Heritability

Since 2001: More than 59,000 trait-related loci from GWAS (GWAS catalog, March 6, 2018)

Problem: Phenotypic variance explained still disappointingly low

REVIEWS

Finding the missing heritability of complexdiseasesTeri A. Manolio1, Francis S. Collins2, Nancy J. Cox3, David B. Goldstein4, Lucia A. Hindorff5, David J. Hunter6,Mark I. McCarthy7, Erin M. Ramos5, Lon R. Cardon8, Aravinda Chakravarti9, Judy H. Cho10, Alan E. Guttmacher1,Augustine Kong11, Leonid Kruglyak12, Elaine Mardis13, Charles N. Rotimi14, Montgomery Slatkin15, David Valle9,Alice S. Whittemore16, Michael Boehnke17, Andrew G. Clark18, Evan E. Eichler19, Greg Gibson20, Jonathan L. Haines21,Trudy F. C. Mackay22, Steven A. McCarroll23 & Peter M. Visscher24

Genome-wide association studies have identified hundreds of genetic variants associated with complex human diseases andtraits, and have provided valuable insights into their genetic architecture. Most variants identified so far confer relativelysmall increments in risk, and explain only a small proportion of familial clustering, leading many to question how theremaining, ‘missing’ heritability can be explained. Here we examine potential sources of missing heritability and proposeresearch strategies, including and extending beyond current genome-wide association approaches, to illuminate the geneticsof complex diseases and enhance its potential to enable effective disease prevention or treatment.

Many common human diseases and traits are known tocluster in families and are believed to be influenced byseveral genetic and environmental factors, but untilrecently the identification of genetic variants contributing

to these ‘complex diseases’ has been slow and arduous1. Genome-wideassociation studies (GWAS), in which several hundred thousand tomore than a million single nucleotide polymorphisms (SNPs) areassayed in thousands of individuals, represent a powerful new tool forinvestigating the genetic architecture of complex diseases1,2. In the pastfew years, these studies have identified hundreds of genetic variantsassociated with such conditions and have provided valuable insightsinto the complexities of their genetic architecture3,4.

The genome-wide association (GWA) method represents animportant advance compared to ‘candidate gene’ studies, in whichsample sizes are generally smaller and the variants assayed are limitedto a selected few, often on the basis of imperfect understanding ofbiological pathways and often yielding associations that are difficultto replicate5,6. GWAS are also an important step beyond family-basedlinkage studies, in which inheritance patterns are related to severalhundreds to thousands of genomic markers. Despite many clearsuccesses in single-gene ‘Mendelian’ disorders7,8, the limited successof linkage studies in complex diseases has been attributed to their lowpower and resolution for variants of modest effect9–11.

The underlying rationale for GWAS is the ‘common disease,common variant’ hypothesis, positing that common diseases areattributable in part to allelic variants present in more than 1–5% ofthe population12–14. They have been facilitated by the development ofcommercial ‘SNP chips’ or arrays that capture most, although not all,common variation in the genome. Although the allelic architecture ofsome conditions, notably age-related macular degeneration, for themost part reflects the contributions of several variants of large effect(defined loosely here as those increasing disease risk by twofold ormore), most common variants individually or in combination conferrelatively small increments in risk (1.1–1.5-fold) and explain only asmall proportion of heritability—the portion of phenotypic variancein a population attributable to additive genetic factors3. For example,at least 40 loci have been associated with human height, a classiccomplex trait with an estimated heritability of about 80%, yet theyexplain only about 5% of phenotypic variance despite studies of tensof thousands of people15. Although disease-associated variants occurmore frequently in protein-coding regions than expected from theirrepresentation on genotyping arrays, in which over-representation ofcommon and functional variants may introduce analytical biases, thevast majority (.80%) of associated variants fall outside codingregions, emphasizing the importance of including both coding andnon-coding regions in the search for disease-associated variants3.

1National Human Genome Research Institute, Building 31, Room 4B09, 31 Center Drive, MSC 2152, Bethesda, Maryland 20892-2152, USA. 2National Institutes of Health, Building 1,Room 126, MSC 0148, Bethesda, Maryland 20892-0148, USA. 3Departments of Medicine and Human Genetics, University of Chicago, Room A612, MC 6091, 5841 South MarylandAvenue, Chicago, Illinois 60637, USA. 4Duke University, The Institute for Genome Sciences and Policy (IGSP), Box 91009, Durham, North Carolina 27708, USA. 5National HumanGenome Research Institute, Office of Population Genomics, Suite 4076, MSC 9305, 5635 Fishers Lane, Rockville, Maryland 20892-9305, USA. 6Department of Epidemiology, HarvardSchool of Public Health, 677 Huntington Avenue, Boston, Massachusetts 02115, USA. 7University of Oxford, Oxford Centre for Diabetes, Endocrinology and Metabolism, ChurchillHospital, Old Road, Oxford OX3 7LJ, UK, and Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford OX3 7BN, UK. 8GlaxoSmithKline, 709Swedeland Road, King of Prussia, Pennsylvania 19406, USA. 9McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, 733 North BroadwayBRB579, Baltimore, Maryland 21205, USA. 10Yale University, Department of Medicine, Division of Digestive Diseases, 333 Cedar Street, New Haven, Connecticut 06520-8019, USA.11deCODE Genetics, Sturlugata 8, Reykjavik IS-101, Iceland. 12Lewis-Sigler Institute for Integrative Genomics, Howard Hughes Medical Institute, and Department of Ecology andEvolutionary Biology, Princeton University, Princeton, New Jersey 08544, USA. 13The Genome Center, Washington University School of Medicine, 4444 Forest Park Avenue, CampusBox 8501, Saint Louis, Missouri 63108, USA. 14National Human Genome Research Institute, Center for Research on Genomics and Global Health, Building 12A, Room 4047, 12 SouthDrive, MSC 5635, Bethesda, Maryland 20892-5635, USA. 15Department of Integrative Biology, University of California, 3060 Valley Life Science Building, Berkeley, California 94720-3140, USA. 16Stanford University, Health Research and Policy, Redwood Building, Room T204, 259 Campus Drive, Stanford, California 94305, USA. 17Department of Biostatistics,University of Michigan, 1420 Washington Heights, Ann Arbor, Michigan 48109-2029, USA. 18Department of Molecular Biology and Genetics, 107 Biotechnology Building, CornellUniversity, Ithaca, New York 14853, USA. 19Howard Hughes Medical Institute and University of Washington, Department of Genome Sciences, 1705 North-East Pacific Street, FoegeBuilding, Box 355065, Seattle, Washington 98195-5065, USA. 20University of Queensland, School of Biological Sciences, Goddard Building, Saint Lucia Campus, Brisbane, Queensland4072, Australia. 21Vanderbilt University, Center for Human Genetics Research, 519 Light Hall, Nashville, Tennessee 37232-0700, USA. 22Department of Genetics, North CarolinaState University, Box 7614, Raleigh, North Carolina 27695, USA. 23Department of Genetics, Harvard Medical School, 77 Avenue Louis Pasteur, NRB 0330, Boston, Massachusetts02115, USA. 24Queensland Institute of Medical Research, 300 Herston Road, Brisbane, Queensland 4006, Australia.

Vol 461j8 October 2009jdoi:10.1038/nature08494

747 Macmillan Publishers Limited. All rights reserved©2009

Potential reasons:Polygenic architectures of complex diseases → EpistasisSmall effect sizesIncomplete integration of important genetic, epigenetic or non-genetic properties

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 6 / 46

Page 16: Karsten Borgwardt ETH Zurich, Department Biosystems

Association Mapping: Missing Heritability

Epistasis as a Potential Reason

Most current analyses neglect interactive effects between loci

Need for approaches for combinatorial association mapping

COMMENT

Why epistasis is important for tackling complexhuman disease geneticsTrudy FC Mackay1* and Jason H Moore2

Editorial summary

Epistasis has been dismissed by some as having littlerole in the genetic architecture of complex humandisease. The authors argue that this view is the resultof a misconception and explain why exploring epistasisis likely to be crucial to understanding and predictingcomplex disease.

What is epistasis?The goal of human genetics is to specify the genotype-phenotype map; that is, to understand how naturallyoccurring genetic variants jointly act to modulate diseaserisk. In a typical genome scan (for example, a genome-wide association study), the effect of each variant on thedisease trait of interest is interrogated one at a time. Theeffects of all variants are then summed to deduce thetotal amount of genetic variation explained by DNApolymorphisms that affect the trait. This additive modelof inheritance assumes that the effects of individual vari-ants are independent of the effects of other contributingloci (the genetic background). Epistasis occurs if the ef-fect of one variant affecting a complex trait depends onthe genotype of a second variant affecting the trait. Forexample, consider two loci (A, B), each with two alleles(A1, A2, B1, B2). Epistasis would occur, for example, if theA2A2B2B2 genotype had a high disease risk, but theeight other possible two-locus genotypes had no effecton risk. This is only one of many possible forms ofepistatic interactions between two loci.

Is there evidence for epistasis for quantitativetraits?Many human diseases and disease-related phenotypes(for example, blood pressure) are quantitative traits. Thatis, their variation is due to many interacting genetic loci,

* Correspondence: [email protected] of Biological Sciences, North Carolina State University, Raleigh,NC 27695, USAFull list of author information is available at the end of the article

and the effects of alleles at these loci are highly sensitiveto the environmental circumstances to which the indivi-duals are exposed. Quantitative variation in phenotypesand disease risk must result in part from the perturbationof highly dynamic, interconnected and non-linear net-works (for example, developmental, neural, transcrip-tional, metabolic and biochemical networks) by multiplegenetic variants [1]; thus, gene-gene interactions arelikely. Most evidence for epistatic interactions comesfrom studies in model organisms. In yeast, nematodesand flies, systematic screens for genetic interactions af-fecting fitness and quantitative traits have revealed theubiquity of epistasis [2]. Arguably, though, these inter-actions could be specific for the large phenotypic effectsof mutations and knockdown by RNA interference, notthe variants with more subtle effects that segregate innatural populations. However, studies mapping quantita-tive trait loci (QTLs) in model organisms have oftenfound QTL ×QTL interactions, even between QTLs thathave no significant effects when these are averaged overall genetic backgrounds. The ability to transfer genomicfragments (entire chromosomes or smaller intervals)between two inbred strains has further revealed perva-sive epistasis [3]. Finally, the effects of induced mutationsare highly variable in different genetic backgrounds, aphenomenon that can be used to map genes interactingwith the focal mutation [2]. If epistatic interactions areso common in ‘simple’ model organisms, it seems un-reasonable to assume that they do not occur in humans.

Why has epistasis been largely ignored in humangenetics?Historically, the genetic analysis of quantitative traits hasbeen purely statistical. The magnitude of variation in acomplex trait phenotype can be partitioned into threedifferent types of component: additive components,non-additive components (dominance and epistatic) andenvironmental variance components [4]. Most quantita-tive genetic variation is additive, and this has been usedto dismiss the relevance of epistasis [5]. However,

© 2014 Mackay and Moore; licensee BioMed Central Ltd. The licensee has exclusive rights to distribute this article, in anymedium, for 12 months following its publication. After this time, the article is available under the terms of the Creative CommonsAttribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction inany medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Mackay and Moore Genome Medicine 2014, 6:42http://genomemedicine.com/content/6/6/42

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 7 / 46

Page 17: Karsten Borgwardt ETH Zurich, Department Biosystems

Combinatorial Association Mapping

Cas

esC

on

tro

ls

0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0

SNPs

1

1

1

0

0

0

SNP set

0 0 0 1 0 1 0 0 1 1 0 0 0 1 1 0 1 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 1 1 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0

Computational challenge: Combinatorial explosion of the number of candidate sets

Statistical challenge: Combinatorial explosion of the number of association tests

Concrete example: Even for pairs of 106 features, order 1012 hypotheses!Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 8 / 46

Page 18: Karsten Borgwardt ETH Zurich, Department Biosystems

Combinatorial Association Mapping

Multiple Hypothesis Testing Problem

What if we consider associations of groups of s SNPs with the phenotype?

This leads to an enormous multiple testing problem: Any of the k SNP sets wouldcorrespond to a hypothesis that is tested (k ∈ O(f s)), where f is the number of SNPs.

If unaccounted for, α per cent of all SNP sets might be considered significantlyassociated by random chance.

It is imperative to control for multiple testing, e.g. the family-wise error rate!

If accounted for, e.g. by Bonferroni correction (α

k), we might lose all statistical power.

Long considered unsolvable dilemma

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 9 / 46

Page 19: Karsten Borgwardt ETH Zurich, Department Biosystems

Combinatorial Association Mapping

Multiple Hypothesis Testing Problem

What if we consider associations of groups of s SNPs with the phenotype?

This leads to an enormous multiple testing problem: Any of the k SNP sets wouldcorrespond to a hypothesis that is tested (k ∈ O(f s)), where f is the number of SNPs.

If unaccounted for, α per cent of all SNP sets might be considered significantlyassociated by random chance.

It is imperative to control for multiple testing, e.g. the family-wise error rate!

If accounted for, e.g. by Bonferroni correction (α

k), we might lose all statistical power.

Long considered unsolvable dilemma

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 9 / 46

Page 20: Karsten Borgwardt ETH Zurich, Department Biosystems

Combinatorial Association Mapping as a Data Mining Problem

Cases

Controls

0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0

SNPs

1

1

1

0

0

0

pattern

0 0 0 1 0 1 0 0 1 1 0 0 0 1 1 0 1 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 1 1 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0

Feature Selection: Find features that distinguish classes of objects

Pattern Mining: Find higher-order combinations of binary features, so-called patterns,to distinguish one class from another

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 10 / 46

Page 21: Karsten Borgwardt ETH Zurich, Department Biosystems

Combinatorial Association Mapping as a Data Mining Problem

Pattern

D is a dataset of n patients. The i-th patient is representedby a binary vector d(i) ∈ {0, 1}f and a class label yi ∈ {0, 1}.We choose a subset S of all features F in a dataset: S ⊆ F .

Then an object d(i) includes the pattern S if∏t∈S

d (i)(t) = 1,

otherwise not.

Cases

Controls

0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0

SNPs

1

1

1

0

0

0

pattern

0 0 0 1 0 1 0 0 1 1 0 0 0 1 1 0 1 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 1 1 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0

Problem Statement: Significant Pattern Mining

We want to find all subsets S such that there is a statistically significant association

between∏t∈S

d (i)(t) and yi for i ∈ {1, . . . , n}, while controlling the family-wise error rate

at level α.Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 11 / 46

Page 22: Karsten Borgwardt ETH Zurich, Department Biosystems

Significant Pattern Mining

Tarone’s trick

Contingency table for testing enrichment of a pattern in one of two classes

Pattern present Pattern absent

y=0 a n1 − a n1

y=1 x − a n − n1 − x + a n − n1

x n − x n

A popular choice is Fisher’s exact test to test whether the pattern is overrepresented inone of the two classes.

The common way to compute p-values for Fisher’s exact test is based on thehypergeometric distribution and assumes fixed total marginals (x , n1, n).

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 12 / 46

Page 23: Karsten Borgwardt ETH Zurich, Department Biosystems

Significant Pattern Mining

Tarone’s trick

Contingency table for testing enrichment of a pattern in a class

Pattern present Pattern absent

y=0 a n1 − a n1

y=1 x − a n − n1 − x + a n − n1

x n − x n

Tarone (1990) noted that when working with discrete test statistics, e.g. Fisher’s exacttest, there is a minimum p-value that a pattern can achieve.

There are many untestable hypotheses whose minimum p-value is not smaller thanα

k.

Only the remaining m(k) testable hypotheses can reach significance at all.

One can correct for m(k) instead of k . As often m(k) << k , this greatly improvesstatistical power.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 13 / 46

Page 24: Karsten Borgwardt ETH Zurich, Department Biosystems

Example: PTC dataset (Helma et al., 2001)

Maximum size of subgraph nodes

Corr

ectio

n fa

ctor

5 10 15 Limitless103

105

107

BonferroniOurs

Maximum size of subgraph nodes5 10 15 Limitless

Num

ber o

f sig

ni�c

ant

subg

raph

s

0

1

2

3

4 BonferroniOurs

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 14 / 46

Page 25: Karsten Borgwardt ETH Zurich, Department Biosystems

Significant Pattern Mining

Tarone’s approach (1990)

Assume k is the number of tests that we correct for.

m(k) is the number of testable hypotheses at significance levelα

k.

m(k) is a function of k and we require k ≥ m(k) to correct for all testable hypotheses.

Then the optimization problem is

min k

s. t. k ≥ m(k)

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 15 / 46

Page 26: Karsten Borgwardt ETH Zurich, Department Biosystems

Significant Pattern Mining

Tarone’s approach (1990)

Assume k is the number of tests that we correct for.

m(k) is the number of testable hypotheses at levelα

k.

procedure Tarone

k := 1;while k < m(k) do

k := k + 1;

return k

How to efficiently compute m(k) without running through all O(f s) possible hypotheses?

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 16 / 46

Page 27: Karsten Borgwardt ETH Zurich, Department Biosystems

Significant Pattern Mining

Tarone’s approach (1990)

Assume k is the number of tests that we correct for.

m(k) is the number of testable hypotheses at levelα

k.

procedure Tarone

k := 1;while k < m(k) do

k := k + 1;

return k

How to efficiently compute m(k) without running through all O(f s) possible hypotheses?

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 16 / 46

Page 28: Karsten Borgwardt ETH Zurich, Department Biosystems

Significant Pattern Mining

Data mining challenge

How to efficiently find m(k) without running through all O(f s) possible hypotheses?

Solution: Minimum p-value is determined by the frequency of a pattern.

One can use frequent pattern mining algorithms from Data Mining to enumerate allpatterns that pass a certain p-value threshold (Terada et al., PNAS 2013):

frequent itemset mining(D, θ) enumerates all patterns in a dataset D of frequency atleast θ.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 17 / 46

Page 29: Karsten Borgwardt ETH Zurich, Department Biosystems

Significant Pattern Mining

Frequency versus minimum p-value

0 xmin( ) n1 n/2 n n1 n xmin( ) n

x

12

10

8

6

4

2

0

log 1

0(p m

in)

Minimum attainable P-valueUntestable at level Testable at level log10( )

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 18 / 46

Page 30: Karsten Borgwardt ETH Zurich, Department Biosystems

Significant Pattern Mining

Tarone’s approach with frequent itemset mining

Assume k is the number of tests that we correct for.

m(k) is the number of testable hypotheses at significance levelα

k.

procedure Tarone(D, α)k := 1;while k < m(k) do

k := k + 1;m(k) := frequent itemset mining(D, φ(

α

k));

return k

Note: φ(α

k) is the minimum frequency of a pattern that is testable at level

α

k.

For small k, φ(α

k) is small. Frequent itemset mining will be extremely expensive!

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 19 / 46

Page 31: Karsten Borgwardt ETH Zurich, Department Biosystems

Significant Pattern Mining

Tarone’s approach with frequent itemset mining

Assume k is the number of tests that we correct for.

m(k) is the number of testable hypotheses at significance levelα

k.

procedure Tarone(D, α)k := 1;while k < m(k) do

k := k + 1;m(k) := frequent itemset mining(D, φ(

α

k));

return k

Note: φ(α

k) is the minimum frequency of a pattern that is testable at level

α

k.

For small k, φ(α

k) is small. Frequent itemset mining will be extremely expensive!

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 19 / 46

Page 32: Karsten Borgwardt ETH Zurich, Department Biosystems

Significant Pattern Mining

Tarone’s approach with frequent itemset mining

Assume k is the number of tests that we correct for.

m(k) is the number of testable hypotheses at significance levelα

k.

procedure Tarone(D, α)k := 1;while k < m(k) do

k := k + 1;m(k) := frequent itemset mining(D, φ(

α

k));

return k

Note: φ(α

k) is the minimum frequency of a pattern that is testable at level

α

k.

For small k, φ(α

k) is small. Frequent itemset mining will be extremely expensive!

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 19 / 46

Page 33: Karsten Borgwardt ETH Zurich, Department Biosystems

From Significant Pattern Mining to Combinatorial Association Mapping

Questions unanswered in 2014

1 How to efficiently find the optimal k? (SDM 2015)

We proposed an efficient search strategy with early termination criterion (when m(k) > k).

2 Patterns are in subset/superset relationships. How to account for this dependencebetween tests? (KDD 2015)

We perform Westfall-Young Permutations to take the dependence into account.By dynamically updating the frequency threshold, we only require 1 single application offrequent itemset mining even for 10,000 permutations.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 20 / 46

Page 34: Karsten Borgwardt ETH Zurich, Department Biosystems

From Significant Pattern Mining to Combinatorial Association Mapping

Questions unanswered in 2014

1 How to efficiently find the optimal k? (SDM 2015)

We proposed an efficient search strategy with early termination criterion (when m(k) > k).

2 Patterns are in subset/superset relationships. How to account for this dependencebetween tests? (KDD 2015)

We perform Westfall-Young Permutations to take the dependence into account.By dynamically updating the frequency threshold, we only require 1 single application offrequent itemset mining even for 10,000 permutations.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 20 / 46

Page 35: Karsten Borgwardt ETH Zurich, Department Biosystems

From Significant Pattern Mining to Combinatorial Association Mapping

Questions unanswered in 2014

1 How to efficiently find the optimal k? (SDM 2015)

We proposed an efficient search strategy with early termination criterion (when m(k) > k).

2 Patterns are in subset/superset relationships. How to account for this dependencebetween tests? (KDD 2015)

We perform Westfall-Young Permutations to take the dependence into account.By dynamically updating the frequency threshold, we only require 1 single application offrequent itemset mining even for 10,000 permutations.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 20 / 46

Page 36: Karsten Borgwardt ETH Zurich, Department Biosystems

From Significant Pattern Mining to Combinatorial Association Mapping

Questions unanswered in 2014

1 How to efficiently find the optimal k? (SDM 2015)

We proposed an efficient search strategy with early termination criterion (when m(k) > k).

2 Patterns are in subset/superset relationships. How to account for this dependencebetween tests? (KDD 2015)

We perform Westfall-Young Permutations to take the dependence into account.By dynamically updating the frequency threshold, we only require 1 single application offrequent itemset mining even for 10,000 permutations.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 20 / 46

Page 37: Karsten Borgwardt ETH Zurich, Department Biosystems

From Significant Pattern Mining to Combinatorial Association Mapping

Questions unanswered in 2014

3 Can we retain efficiency and statistical power when accounting for categorical covariatessuch as age and gender? (NIPS 2016)

We extended Tarone’s trick to the Cochran-Mantel-Haenszel-test for stratified contingencytables.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 21 / 46

Page 38: Karsten Borgwardt ETH Zurich, Department Biosystems

From Significant Pattern Mining to Combinatorial Association Mapping

Questions unanswered in 2014

3 Can we retain efficiency and statistical power when accounting for categorical covariatessuch as age and gender? (NIPS 2016)

We extended Tarone’s trick to the Cochran-Mantel-Haenszel-test for stratified contingencytables.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 21 / 46

Page 39: Karsten Borgwardt ETH Zurich, Department Biosystems

From Significant Pattern Mining to Combinatorial Association Mapping

Questions unanswered in 2014

3 Can we retain efficiency and statistical power when accounting for categorical covariatessuch as age and gender? (NIPS 2016)

We extended Tarone’s trick to the Cochran-Mantel-Haenszel-test for stratified contingencytables.

4 Can we develop new combinatorial association mapping approaches based on Tarone’strick? (ISMB 2015, OUP Bioinformatics 2017, ISMB 2018)

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 21 / 46

Page 40: Karsten Borgwardt ETH Zurich, Department Biosystems

From Significant Pattern Mining to Combinatorial Association Mapping

Questions unanswered in 2014

3 Can we retain efficiency and statistical power when accounting for categorical covariatessuch as age and gender? (NIPS 2016)

We extended Tarone’s trick to the Cochran-Mantel-Haenszel-test for stratified contingencytables.

4 Can we develop new combinatorial association mapping approaches based on Tarone’strick? (ISMB 2015, OUP Bioinformatics 2017, ISMB 2018)

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 21 / 46

Page 41: Karsten Borgwardt ETH Zurich, Department Biosystems

Combinatorial Association Mapping for Genetic Heterogeneity Discovery

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 22 / 46

Page 42: Karsten Borgwardt ETH Zurich, Department Biosystems

Genetic Heterogeneity Discovery

Cases

Controls

0 0 0 1 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0

SNPs

1

1

1

0

0

0

pattern

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 23 / 46

Page 43: Karsten Borgwardt ETH Zurich, Department Biosystems

Genetic Heterogeneity Discovery

Genetic heterogeneity

Genetic heterogeneity refers to the phenomenon that several different genes or sequencevariants may give rise to the same phenotype.

Cases

Controls

0 0 0 1 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0

SNPs

1

1

1

0

0

0

pattern

Specific problem studied here: Find a genomic interval such that having

a rare variant,a recessive genotype, ora minor allele

in this interval is associated with the disease phenotype.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 24 / 46

Page 44: Karsten Borgwardt ETH Zurich, Department Biosystems

Genetic Heterogeneity Discovery

Fast Automatic Interval Search (Llinares-Lopez et al., ISMB 2015)

Current state of the art: Restrict search to intervals that correspond to genes or exons(Lee et al., AJHG 2014).

Our goal is to search for intervals that may exhibit genetic heterogeneity, while

allowing for arbitrary start and end points of the intervals,properly correcting for the inherent multiple testing problem, andretaining statistical power and computational efficiency.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 25 / 46

Page 45: Karsten Borgwardt ETH Zurich, Department Biosystems

FAIS: Finding Intervals That Exhibit Genetic Heterogeneity

Genetic Heterogeneity Discovery as a Pattern Mining Problem

Cases

Controls

0 0 0 1 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0

SNPs

1

1

1

0

0

0

pattern

We model the search as a pattern mining problem: Given an interval, an individualcontains a pattern, if it has at least one minor allele in this interval.

An interval is represented by its maximum value. The longer an interval, the more likelyit is that this maximum is 1.

Association is measured by Fisher’s exact test, and we control the family-wise error rateusing Tarone’s method.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 26 / 46

Page 46: Karsten Borgwardt ETH Zurich, Department Biosystems

FAIS: Finding Intervals That Exhibit Genetic Heterogeneity

0 xmin( ) n1 n/2 n n1 n xmin( ) n

x

12

10

8

6

4

2

0

log 1

0(p m

in)

Minimum attainable P-valueUntestable at level Testable at level log10( )

If too many individuals have a particular pattern, the corresponding interval is nottestable.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 27 / 46

Page 47: Karsten Borgwardt ETH Zurich, Department Biosystems

FAIS: Finding Intervals That Exhibit Genetic Heterogeneity

Cases

Controls

0 0 0 1 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0

0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0

SNPs

1

1

1

0

0

0

pattern

Pruning criterion: If a pattern is too frequent to be testable, then none of thesuperintervals of the corresponding interval is testable.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 28 / 46

Page 48: Karsten Borgwardt ETH Zurich, Department Biosystems

FAIS: Finding Intervals That Exhibit Genetic Heterogeneity

J1, 1K J2, 2K J3, 3K J4, 4K J5, 5K J6, 6K

J1, 2K J2, 3K J3, 4K J4, 5K J5, 6K

J1, 3K J2, 4K J3, 5K J4, 6K

J1, 4K J2, 5K J3, 6K

J1, 5K J2, 6K

J1, 6K

Search strategy: We search intervals of increasing length l and prune untestablesuperintervals.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 29 / 46

Page 49: Karsten Borgwardt ETH Zurich, Department Biosystems

FAIS: Finding Intervals That Exhibit Genetic Heterogeneity

J1, 1K J2, 2K J3, 3K J4, 4K J5, 5K J6, 6K

J1, 2K J2, 3K J3, 4K J4, 5K J5, 6K

J1, 3K J2, 4K J3, 5K J4, 6K

J1, 4K J2, 5K J3, 6K

J1, 5K J2, 6K

J1, 6K

Search strategy: Specifically, for each interval of length l , we prune it if at least oneof its two length l − 1 subintervals is too frequent to be testable.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 29 / 46

Page 50: Karsten Borgwardt ETH Zurich, Department Biosystems

FAIS: Finding Intervals That Exhibit Genetic Heterogeneity

●●

●●

0.1

1.0

10.0

100.0

1000.0

10000.0

100000.0

1000 5000 10000 50000 100000Length of sequence

Tim

e ta

ken

(sec

onds

)Method

BRUTE−WY

BRUTE

FAIS−WY

FAIS

Our method FAIS (Fast Automatic Interval Search) improves over the brute-forceinterval search in terms of runtime in simulations.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 30 / 46

Page 51: Karsten Borgwardt ETH Zurich, Department Biosystems

FAIS: Finding Intervals That Exhibit Genetic Heterogeneity

● ● ●

● ● ●

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9p_case

Pow

er

Method

FAIS−WY

FAIS

BRUTE

UFE

Our method FAIS (Fast Automatic Interval Search) improves over brute-force intervalsearch and univariate approaches in terms of power in simulations.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 31 / 46

Page 52: Karsten Borgwardt ETH Zurich, Department Biosystems

FAIS: Finding Intervals That Exhibit Genetic Heterogeneity

70% (152)

18.9% (41)

6.9% (15)

4.2 % (9)

Novel Intervals

UFE ± 10kb \ LMM ± 10kb

LMM ± 10kb \ UFE ± 10kb

LMM ± 10kb � UFE ± 10kb

FAIS detects 217 significant intervals on 21 binary phenotypes from Arabidopsisthaliana, with 214,051 SNPs and up to 194 lines (Atwell et al., Nature 2010).

70% would have been missed by univariate approaches (UFE and LMM).Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 32 / 46

Page 53: Karsten Borgwardt ETH Zurich, Department Biosystems

Combinatorial Association Mapping: Summary and Outlook

Summary

Significant Pattern Mining was long considered an unsolvable problem.

We have improved Significant Pattern Mining on several levels that allow us to use it forinstances of Combinatorial Association Mapping at a genome-wide scale, e.g. forGenetic Heterogeneity Discovery.

www.significant-patterns.org

Next challenges

How to detect genetic heterogeneity in biological pathways?

How to control the False Discovery Rate?

How to deal with non-binary features and non-binary phenotypes?

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 33 / 46

Page 54: Karsten Borgwardt ETH Zurich, Department Biosystems

Combinatorial Association Mapping: Summary and Outlook

Summary

Significant Pattern Mining was long considered an unsolvable problem.

We have improved Significant Pattern Mining on several levels that allow us to use it forinstances of Combinatorial Association Mapping at a genome-wide scale, e.g. forGenetic Heterogeneity Discovery.

www.significant-patterns.org

Next challenges

How to detect genetic heterogeneity in biological pathways?

How to control the False Discovery Rate?

How to deal with non-binary features and non-binary phenotypes?

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 33 / 46

Page 55: Karsten Borgwardt ETH Zurich, Department Biosystems

Combinatorial Association Mapping: Summary and Outlook

Summary

Significant Pattern Mining was long considered an unsolvable problem.

We have improved Significant Pattern Mining on several levels that allow us to use it forinstances of Combinatorial Association Mapping at a genome-wide scale, e.g. forGenetic Heterogeneity Discovery.

www.significant-patterns.org

Next challenges

How to detect genetic heterogeneity in biological pathways?

How to control the False Discovery Rate?

How to deal with non-binary features and non-binary phenotypes?

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 33 / 46

Page 56: Karsten Borgwardt ETH Zurich, Department Biosystems

Combinatorial Association Mapping: Summary and Outlook

Summary

Significant Pattern Mining was long considered an unsolvable problem.

We have improved Significant Pattern Mining on several levels that allow us to use it forinstances of Combinatorial Association Mapping at a genome-wide scale, e.g. forGenetic Heterogeneity Discovery.

www.significant-patterns.org

Next challenges

How to detect genetic heterogeneity in biological pathways?

How to control the False Discovery Rate?

How to deal with non-binary features and non-binary phenotypes?

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 33 / 46

Page 57: Karsten Borgwardt ETH Zurich, Department Biosystems

What’s next?

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 34 / 46

Page 58: Karsten Borgwardt ETH Zurich, Department Biosystems

Biomedical Software Development

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 35 / 46

Page 59: Karsten Borgwardt ETH Zurich, Department Biosystems

easyGWAS

We have been developing easygwas.org (Grimm et al., 2017), a cloud platform forgenome-wide association studies (1362 users as of April 11, 2018):

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 36 / 46

Page 60: Karsten Borgwardt ETH Zurich, Department Biosystems

Biomarker Discovery for Sepsis

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 37 / 46

Page 61: Karsten Borgwardt ETH Zurich, Department Biosystems

Personalized Swiss Sepsis Study

Consortium of 22 research labs and 5 university hospitals in SwitzerlandGoal: Predict sepsis and sepsis-related mortalityApproach: Integrate clinical data and molecular data for joint biomarker discovery

Duration:3 years(2018-2021)

Total funding:5.3 Million CHF

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 38 / 46

Page 62: Karsten Borgwardt ETH Zurich, Department Biosystems

Predicting Sepsis

Background: What is sepsis and why is it relevant?

Sepsis is a life-threatening organ dysfunction, caused by a dysregulated host response toinfection (Singer et al., 2016).

Identification of a bacterial species in blood still takes between 24h and 48h after bloodsampling (Osthoff et al., 2017).

From onset each hour of delayed effective antibiotic treatment increases mortality (Ferrer et

al., 2014).

→ The first hours of sepsis are of critical importance.→ Currently, when sepsis is detected, organ damage has already progressed.→ Detecting and treating sepsis earlier and better identifying high-risk subgroups

could be of highest clinical impact.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 39 / 46

Page 63: Karsten Borgwardt ETH Zurich, Department Biosystems

Predicting Sepsis

Background: What is sepsis and why is it relevant?

Sepsis is a life-threatening organ dysfunction, caused by a dysregulated host response toinfection (Singer et al., 2016).

Identification of a bacterial species in blood still takes between 24h and 48h after bloodsampling (Osthoff et al., 2017).

From onset each hour of delayed effective antibiotic treatment increases mortality (Ferrer et

al., 2014).

→ The first hours of sepsis are of critical importance.→ Currently, when sepsis is detected, organ damage has already progressed.→ Detecting and treating sepsis earlier and better identifying high-risk subgroups

could be of highest clinical impact.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 39 / 46

Page 64: Karsten Borgwardt ETH Zurich, Department Biosystems

Predicting Sepsis

Dataset: MIMIC (https://mimic.physionet.org)

Labels:

Case if sepsis-3 criteria (Singer et al., 2016) fulfilled during ICU stay (at least 4 hours afteradmission) using the notion ’suspicion of infection’ as defined in (Seymour et al., 2016)

Control if no suspicion of infection (at least not during ICU stay and the 2 preceding weeks)SOFA increase evaluated as the maximum SOFA of the 3d window around suspicion ofinfection (–2 to +1 days) compared to baseline SOFA (3d window before that).

Features: In-ICU time series of heart rate, systolic blood pressure, and respiratory rate.

Sample size: 355 case ICU stays, 21,079 controls (sampling 355)

Exclusion criteria: Age < 15, CareVue logging (insufficient detail), chartvalues orin/out-time unavailable.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 40 / 46

Page 65: Karsten Borgwardt ETH Zurich, Department Biosystems

Predicting Sepsis

0

30

0

30

0

30

0

30

0

30

0

30

Res

pir

ator

yR

ate

0

30

0

30

0

30

0 20 40 60 80Hours since ICU admission

0

30

Shapelet Cases Controls

We detect patterns in respiratory rate time series that are statistically significantlyassociated with sepsis (Bock et al., ISMB 2018 - in press).

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 41 / 46

Page 66: Karsten Borgwardt ETH Zurich, Department Biosystems

Data Mining in the Life Sciences

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 42 / 46

Page 67: Karsten Borgwardt ETH Zurich, Department Biosystems

Data Mining in Genetics, Medicine and the Life Sciences

Outlook

Automation, biomarker discovery, and biomedical data management will remain keyresearch topics.

Data growth in three dimensions will pose extreme new challenges in Data Mining inGenetics and Medicine:

Population-scale datasets of individualsLife-long recordings of health stateHighest-resolution information of the health state

How to mine (handle and use) this data?

Many branches of the Life Sciences face very similar or analogous problems.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 43 / 46

Page 68: Karsten Borgwardt ETH Zurich, Department Biosystems

Data Mining in Genetics, Medicine and the Life Sciences

Outlook

Automation, biomarker discovery, and biomedical data management will remain keyresearch topics.

Data growth in three dimensions will pose extreme new challenges in Data Mining inGenetics and Medicine:

Population-scale datasets of individualsLife-long recordings of health stateHighest-resolution information of the health state

How to mine (handle and use) this data?

Many branches of the Life Sciences face very similar or analogous problems.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 43 / 46

Page 69: Karsten Borgwardt ETH Zurich, Department Biosystems

Data Mining in Genetics, Medicine and the Life Sciences

Outlook

Automation, biomarker discovery, and biomedical data management will remain keyresearch topics.

Data growth in three dimensions will pose extreme new challenges in Data Mining inGenetics and Medicine:

Population-scale datasets of individualsLife-long recordings of health stateHighest-resolution information of the health state

How to mine (handle and use) this data?

Many branches of the Life Sciences face very similar or analogous problems.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 43 / 46

Page 70: Karsten Borgwardt ETH Zurich, Department Biosystems

Data Mining in Genetics, Medicine and the Life Sciences

Outlook

Automation, biomarker discovery, and biomedical data management will remain keyresearch topics.

Data growth in three dimensions will pose extreme new challenges in Data Mining inGenetics and Medicine:

Population-scale datasets of individualsLife-long recordings of health stateHighest-resolution information of the health state

How to mine (handle and use) this data?

Many branches of the Life Sciences face very similar or analogous problems.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 43 / 46

Page 71: Karsten Borgwardt ETH Zurich, Department Biosystems

Data Mining in Genetics, Medicine and the Life Sciences

Outlook

Automation, biomarker discovery, and biomedical data management will remain keyresearch topics.

Data growth in three dimensions will pose extreme new challenges in Data Mining inGenetics and Medicine:

Population-scale datasets of individualsLife-long recordings of health stateHighest-resolution information of the health state

How to mine (handle and use) this data?

Many branches of the Life Sciences face very similar or analogous problems.

Plenty of opportunities for Data Mining in the Life Sciences

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 43 / 46

Page 72: Karsten Borgwardt ETH Zurich, Department Biosystems

Thank you

Marie-Curie-Initial Training Network for ‘Machine Learning for Personalized Medicine’(mlpm.eu, 2013-2016)

Starting Grant (ERC-Backup Scheme of the SNSF)

Alfried-Krupp-Award for Young Professors

SPHN-PHRT Driver Project ‘Personalized Swiss Sepsis Study’

http://www.bsse.ethz.ch/mlcb

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 44 / 46

Page 73: Karsten Borgwardt ETH Zurich, Department Biosystems

References I

C. E. Bonferroni, Teoria statistica delle classi e calcolo delle probabilita (1936). Published: (Pubbl. d. R. Ist.Super. di Sci. Econom. e Commerciali di Firenze. 8) Firenze: Libr. Internaz. Seeber. 62 S. (1936).

R. Ferrer, et al., Critical Care Medicine 42, 1749 (2014).

D. G. Grimm, et al., The Plant Cell 29, 5 (2017).

T. F. Mackay, J. H. Moore, Genome Medicine 6, 42 (2014).

T. A. Manolio, et al., Nature 461, 747 (2009).

S. Lee, et al., The American Journal of Human Genetics 95, 5 (2014).

F. Llinares-Lopez, et al., KDD (2015), pp. 725–734.

F. Llinares-Lopez, et al., Bioinformatics 31, i240 (2015).

F. Llinares-Lopez, et al., Bioinformatics (Oxford, England) (2017).

M. Osthoff, et al., Clinical Microbiology and Infection: The Official Publication of the European Society ofClinical Microbiology and Infectious Diseases 23, 78 (2017).

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 45 / 46

Page 74: Karsten Borgwardt ETH Zurich, Department Biosystems

References II

L. Papaxanthos, et al., NIPS , D. D. Lee, et al., eds. (2016), pp. 2271–2279.

C. W. Seymour, et al., JAMA 315, 762 (2016).

M. Singer, et al., JAMA 315, 801 (2016).

M. Sugiyama, et al., SIAM Data Mining , S. Venkatasubramanian, J. Ye, eds. (SIAM, 2015), pp. 37–45.

R. E. Tarone, Biometrics 46, 515 (1990).

A. Terada, et al., Proceedings of the National Academy of Sciences 110, 12996 (2013).

P. H. Westfall, et al., Biometrics 49, 941 (1993).

Icon source: Icons made by Freepik from www.flaticon.com, licensed under CC BY 3.0.

Department Biosystems Karsten Borgwardt Workshop with Huawei Zurich May 25, 2018 46 / 46