Rozen 2016-10-05-ieee-cibcb-big-genome-data-to-share


Big Genome Data Sheds Light on Cancer Causes and Can Improve Prevention and Treatment

Steven G Rozen
steverozen [at] gmail.com
https://www.duke-nus.edu.sg/content/rozen-steve
Director, Duke-NUS Centre for Computational Biology
https://www.duke-nus.edu.sg/research/centres/centre-computational-biology
Professor, Program in Cancer and Stem Cell Biology
https://admissions.duke-nus.edu.sg/adm/

IEEE International Conference on Computational Intelligence in Bioinformatics and Computational Biology (IEEE CIBCB 2016)
http://www.cibcb2016.org/
October 5-7, 2016, Chiang Mai, Thailand

Singapore
• Small: 50 km (31 mi) from east to west and 26 km (16 mi) from north to south
• 5.7 million people
• Rich: per capita GDP > US
• Good public health
• Life expectancy at birth 84.7 (US is 79.7)
• Infant mortality: 2.5/1,000 (US is 6/1,000)

http://kids.britannica.com/comptons/art-55105/Singapore

Duke-NUS Graduate Medical School Singapore
• Nonprofit joint venture:
  – Duke in North Carolina, USA
  – National University of Singapore (NUS)
• On campus of 1,700 bed public hospital and the National Cancer Center
• 60 MDs / year
• 15 PhD students / year
• PhD program in biostatistics and bioinformatics: https://admissions.duke-nus.edu.sg/adm/

Duke-NUS Centre for Computational Biology

Many areas of computational biology and bioinformatics, not just cancer genomics

https://www.duke-nus.edu.sg/research/centres/centre-computational-biology

Outline
• Revolution in sequencing technology
  – “Next generation sequencing” underlies tectonic shifts in approaches to biomedical research, especially in cancer
  – Makes much heavier demands on computation, data storage, and bioinformatics
• Cancer genomics and finding mutations that cause cancer
  – Cancer genomics: a study of “global” characteristics of tumors
  – Finding mutations that cause cancer is an important use of next generation sequencing
• Finding signatures of processes that cause mutations
  – Mutations cause cancer (partly)
  – Now we can find out what causes mutations
• Applications and challenges for cancer prevention and treatment
  – How genomics can improve cancer prevention and treatment
  – How you can help

Revolution in sequencing technology

[Figure: cost per human genome (US $) by year. Source: http://www.genome.gov/sequencingcosts/]
• $10 million in 2007
• $1.5 thousand today
• 10,000X drop in price

Immense innovation in research techniques

Next generation sequencing (NGS) is a foundational technology

Shendure & Aiden, Nat. Biotech. 2012

• Use biochemical tricks to select for DNA or RNA segments physically associated with the “biological process” of interest
• Then sequence the DNA or RNA segments
• Snippets of DNA or RNA are self-identifying (self-barcoded)

Applications of NGS

Method: To determine…
• DNA-Seq: A genome sequence
• Targeted DNA-Seq: A subset of a genome (for example, an exome)
• RNA-Seq: RNA (that is, the transcriptome)
• Methyl-Seq: Sites of DNA methylation, genome-wide
• Targeted methyl-Seq: DNA methylation in a subset of the genome
• DNase-Seq, Sono-Seq: Active regulatory chromatin (nucleosome-depleted)
• FAIRE-Seq (formaldehyde-assisted isolation of regulatory elements): Active regulatory chromatin (nucleosome-depleted)
• MAINE-Seq (MNase-assisted isolation of nucleosomes): Histone-bound DNA (nucleosome positioning)
• ChIP-Seq: Protein-DNA interactions (using chromatin immunoprecipitation)
• RIP-Seq (RNA-binding protein immunoprecipitation): Protein-RNA interactions
• CLIP-Seq (cross-linking IP): Protein-RNA interactions

Adapted from Shendure and Aiden, Nat. Biotech. 2012


Many applications
• Human variation / human genetics
• Cancer genetics
• Clinical applications in human genetics and cancer genetics
• Plant and animal breeding for agriculture
• Sequencing new genomes
• Metagenomics
• Pathogen discovery
• Pathogen evolution

Bioinformatics technologies for next-generation sequencing
• More data means need for more computing power
• Mixture of
  1. On-premises (Duke-NUS) high-performance computing (HPC)
  2. National Supercomputing Centre
  3. Cloud computing

High performance computing at Duke-NUS
• 0.5 petabytes
• 720 cores

1. On-premises high-performance computing (HPC)
• Advantages:
  – Full range of analytical pipelines / software
  – Easy software configuration, development and debugging
  – Less need for data transfers
• However, processing and disk storage are limited

2. National Supercomputing Centre
• 10 petabytes
• 30,000 cores
• 128 nodes with NVIDIA K40 GPUs for molecular dynamics, deep learning, etc.
• 100 Gbps data network to US

3. High performance cloud computing

DNAnexus cloud computing

[Screenshot residue: 21h 59m, $176.29]

Cancer genomics and finding mutations that cause cancer

The broader vision for personalized cancer therapy
• If we can understand each tumor in detail, we can select the therapies that are most likely to be effective
• Will likely involve combinations of treatment because of tumors’ abilities to develop resistance
• I.e. a few cells in the tumor already have mutations that make them resistant to one treatment, but no cells have mutations that make them resistant to two treatments
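
The reasoning in the last bullet can be put into rough numbers. The sketch below is only an illustration: the tumor size, the per-cell resistance probability, and the assumption that resistance to each drug arises independently are all hypothetical choices, not figures from the talk.

```python
# Hypothetical numbers chosen only for illustration; none come from the talk.
def expected_resistant_cells(n_cells: float, p_resistant: float, n_drugs: int) -> float:
    """Expected number of cells resistant to all n_drugs at once.

    Assumes resistance to each drug arises independently with the same
    per-cell probability p_resistant (a strong simplification).
    """
    return n_cells * p_resistant ** n_drugs


n_cells = 1e9        # assumed tumor size in cells
p_resistant = 1e-7   # assumed per-cell probability of resistance to one drug

one_drug = expected_resistant_cells(n_cells, p_resistant, 1)
two_drugs = expected_resistant_cells(n_cells, p_resistant, 2)
print(f"one drug:  {one_drug:g} resistant cells expected")
print(f"two drugs: {two_drugs:g} resistant cells expected")
```

Under these assumptions a single drug leaves many pre-existing resistant cells, while the expected number of cells resistant to both drugs is far below one, which is the intuition behind combination therapy.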

Example: BRAF V600E mutations cause cancer, leading to treatment

http://www.nzmu.co.nz/braf-inhibitors (New Zealand)

[Figure: a 38-year-old man with BRAF-mutant melanoma and subcutaneous metastatic deposits. Panels: before treatment; after 15 weeks; after relapse (resistance) at 23 weeks]

Nikhil Wagle et al. JCO 2011;29:3085-3096

©2011 by American Society of Clinical Oncology

Our study of genomics of bile duct cancer (cholangiocarcinoma, CCA)

http://www.hopkinsmedicine.org/liver_tumor_center/conditions/bile_duct_cancer.html

Some CCA is caused by liver flukes (worms)
• “OV” (Opisthorchis viverrini) liver fluke
• Koi pla (raw fresh-water fish tartare / salad)

Sripa et al. 2007

OV (liver fluke) infection and CCA

[Figure; labeled location: Chiang Mai]

Sripa et al. 2007

BAP1 confirmed as a tumor suppressor in non-liver-fluke bile-duct cancer

Chan-On et al., Nat. Genet. 2013

MLL3, a driver in liver-fluke bile-duct cancer

Ong et al., Nat. Genet. 2012

Liver-fluke versus non-fluke bile-duct cancers

[Figure panels: Non-fluke; Fluke]

Chan-On et al., Nat. Genet. 2013

Implications for treatment?

• We are now working on new classifications, drawing on more extensive studies and based partly on differences between fluke and non-fluke CCAs

• Findings suggest potential avenues for treatments

Fibroadenoma: benign but very common breast tumor

Hot spot mutations in MED12 in 60% of tumors

Lim et al., Nat. Genet. 2014

Breast fibroepithelial tumors (fibroadenoma [FA] and phyllodes)

[Figure label: Phyllodes]

Tan et al., Nature Genet. 2015

RARA (retinoic acid receptor alpha)

NR-LBD: ligand binding domain
NR-DBD: DNA binding domain

Tan et al., Nature Genet. 2015

Finding signatures of mutagenic processes

Carcinogenic exposures are important

[Figure: US male deaths / 100,000, age-adjusted. Labeled curves: Lung and bronchus; Stomach?]

Some carcinogenic exposures are mutagens, so recognizing the sources of mutagenic exposures by examining the mutations they cause can identify causes of cancer

*Per 100,000, age adjusted to the 2000 US standard population. Note: Due to changes in ICD coding, numerator information has changed over time. Rates for cancer of the liver, lung and bronchus, and colon and rectum are affected by these coding changes. Source: US Mortality Volumes 1930 to 1959, US Mortality Data 1960 to 2008, National Center for Health Statistics, Centers for Disease Control and Prevention.

©2012, American Cancer Society, Inc., Surveillance Research

Mutation signature example: An herbal remedy

• Aristolochia plants
• Herbal remedies
• Aristolochic acid I (AA)
• Adenine DNA adducts
• Adenine > Thymine mutations

Adapted from Poon et al, Science Translational Medicine, 2013

AA causes kidney failure

AA causes extremely high numbers of somatic (i.e. tumor-specific) mutations in upper tract urothelial carcinomas (UTUCs)

[Figure: whole genome somatic mutation counts. Groups: AA UTUC; UV melanoma; Tobacco lung. Additional label: Location of UTUCs]

Poon et al, Science Translational Medicine, 2013

Theoretically banned/restricted, but multiple AA plant species readily available on web (马兜铃, mǎ dōu líng)

Acute kidney toxin and potent mutagen: 300g, $27.52, free shipping

青木香 (qīng mù xiāng), contains AA

天仙藤 (tiānxiān téng), contains AA

广防己 (guǎng fáng jǐ), contains AA

What does AA signature look like?

[Figure: mutation spectrum. Mutation classes: A>T, A>G, A>C, G>A, G>C, G>T. Highlighted contexts: CAG>CTG, CAA>CTA, TAG>TTG. Annotation: transcribed versus non-transcribed strand bias]

Poon et al, Science Translational Medicine, 2013

Remember: we are focusing on what causes the mutations, not how the mutations cause cancer
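
A signature plot like the one above starts from a catalog that counts each mutation by its trinucleotide context (for example CAG>CTG). The following is a minimal sketch of that counting step, using made-up mutation calls. The function names `mutation_type` and `mutation_catalog` are hypothetical, and the sketch ignores strand handling (transcribed versus non-transcribed, or collapsing complementary contexts), which real analyses need to address.

```python
from collections import Counter

BASES = "ACGT"


def mutation_type(ref_trinucleotide: str, alt_base: str) -> str:
    """Label a single-base substitution by its trinucleotide context,
    e.g. ("CAG", "T") -> "CAG>CTG" (the mutated base is the middle one)."""
    ref = ref_trinucleotide.upper()
    alt = alt_base.upper()
    if len(ref) != 3 or any(b not in BASES for b in ref):
        raise ValueError(f"bad trinucleotide: {ref_trinucleotide!r}")
    if alt not in BASES or alt == ref[1]:
        raise ValueError(f"bad alternate base: {alt_base!r}")
    return f"{ref}>{ref[0]}{alt}{ref[2]}"


def mutation_catalog(mutations):
    """Count mutations per type; `mutations` is an iterable of
    (reference trinucleotide, alternate base) pairs."""
    return Counter(mutation_type(ref, alt) for ref, alt in mutations)


# Made-up calls for illustration only.
calls = [("CAG", "T"), ("CAG", "T"), ("CAA", "T"), ("TAG", "T"), ("ACG", "T")]
catalog = mutation_catalog(calls)

# In a label such as "CAG>CTG", index 1 is the reference base and index 5 the new base.
a_to_t = sum(n for label, n in catalog.items() if label[1] == "A" and label[5] == "T")
print(catalog)
print("A>T fraction:", a_to_t / sum(catalog.values()))
```

A catalog dominated by A>T changes in contexts such as CAG would be the kind of pattern the slide attributes to AA.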

Some tumors contain both AA mutations and mutations due to other causes

Poon et al, Genome Medicine, 2015

Overlays of mutations due to different exogenous mutagens or endogenous mutagenic processes: computational separation

[Figure labels: AA; APOBEC; CpG > TpG (deamination of 5-methylcytosine)]

Poon et al, Genome Medicine, 2015

Non-negative matrix factorization (NMF)

V ≈ W × H
• V (observed mutations): K mutation types × T tumors; each column is the observed somatic mutation catalog of a tumor genome
• W (mutation signatures): K mutation types × N signatures
• H (mutations contributed by each signature): N signatures × T tumors; each entry is the level of exposure of 1 tumor to 1 signature
• Mutations not present in the reconstructed catalog: the part of V that W × H does not reproduce

Adapted from Alexandrov 2013
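
The factorization of V into W and H can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the pipeline used in the cited studies: it uses Lee-Seung multiplicative updates for squared error, runs on synthetic data, and the names `nmf`, `true_W`, and `true_H` are hypothetical.

```python
import numpy as np


def nmf(V, n_signatures, n_iter=500, seed=0, eps=1e-12):
    """Approximate V (K x T) as W (K x N) @ H (N x T), all non-negative,
    using Lee-Seung multiplicative updates for the Frobenius error."""
    rng = np.random.default_rng(seed)
    K, T = V.shape
    W = rng.random((K, n_signatures)) + eps
    H = rng.random((n_signatures, T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    # Rescale so each signature (column of W) sums to 1; H absorbs the scale,
    # so its entries become mutation counts attributed to each signature.
    scale = W.sum(axis=0)
    return W / scale, H * scale[:, None]


# Synthetic example: 2 made-up signatures over 6 mutation types, 20 tumors.
rng = np.random.default_rng(1)
true_W = np.array([[0.7, 0.0], [0.2, 0.1], [0.1, 0.1],
                   [0.0, 0.3], [0.0, 0.4], [0.0, 0.1]])
true_H = rng.integers(50, 500, size=(2, 20)).astype(float)
V = true_W @ true_H

W, H = nmf(V, n_signatures=2)
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print("relative reconstruction error:", rel_error)
```

The number of signatures N is an input here; choosing it, and checking that the extracted signatures are stable across random starts, is a separate problem that this sketch does not address.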

One can estimate the contribution of each mutagen to the mutations in each tumor

Poon et al, Genome Medicine, 2015
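
When the signatures are already known, estimating each one's contribution to a single tumor amounts to a non-negative least-squares fit. The sketch below assumes two made-up signatures (not real COSMIC profiles) and a synthetic tumor, and solves the fit with multiplicative updates while holding the signatures fixed. The function name `fit_exposures` is hypothetical, and published analyses may use other solvers.

```python
import numpy as np


def fit_exposures(v, W, n_iter=2000, eps=1e-12):
    """Estimate non-negative exposures h so that W @ h approximates v.

    W is K x N (columns are known signatures, each summing to 1) and v is
    a length-K mutation catalog. Uses multiplicative updates with W fixed.
    """
    h = np.full(W.shape[1], v.sum() / W.shape[1])
    for _ in range(n_iter):
        h *= (W.T @ v) / (W.T @ W @ h + eps)
    return h


# Two made-up signatures over 4 mutation types (not real COSMIC profiles).
W = np.array([[0.8, 0.1],
              [0.1, 0.2],
              [0.1, 0.3],
              [0.0, 0.4]])
v = W @ np.array([900.0, 100.0])   # synthetic tumor: 900 + 100 mutations

h = fit_exposures(v, W)
print("estimated mutations per signature:", np.round(h, 1))
print("fraction per signature:", np.round(h / h.sum(), 3))
```

With noisy counts or similar-looking signatures the fit can be much less clear-cut than in this synthetic case.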

Important points on NMF
• NMF is only a tool: the best (lowest error) approximate factorization does not necessarily correspond to any biological reality
• Models derived by NMF should be useful, i.e. provide information on exposures or mutational processes; "All models are wrong but some are useful" [1]
• Signature extraction and activity (exposure) assignment are separate; one can have good signature extraction but poor activity assignment, because the factorization is usually underdetermined
• Must combine NMF with additional information to find useful models

[1] George E. P. Box (1979), "Robustness in the strategy of scientific model building", in Launer, R. L.; Wilkinson, G. N., Robustness in Statistics, Academic Press, pp. 201–236.

Mutation signatures, state of the art

[Figure: example signatures: CpG > TpG; Tobacco smoke; Activated APOBEC (two signatures); BRCA1/2 deficiency; . . . (14 more)]

COSMIC http://cancer.sanger.ac.uk/cosmic/signatures, Ludmil Alexandrov, Serena Nik-Zainal, Mike Stratton

• ~30 signatures in ~10,000 tumors (COSMIC web page http://cancer.sanger.ac.uk/cosmic/signatures)
• 6 due to known exogenous exposures
• 14 have no known cause
• Remainder due to endogenous factors (e.g. polymerase epsilon mutation, DNA mismatch repair deficiency, other repair deficiency such as BRCA1/2)
• Extended mutation signatures of many known carcinogens are unknown

Molecular epidemiology of AA

• Aristolochia plants
• Herbal remedies
• Aristolochic acid I (AA)
• Adenine DNA adducts
• Adenine > Thymine mutations

Adapted from Poon et al, Science Translational Medicine, 2013

A few years ago: AA exposure in upper tract urothelial cancer

Today: AA exposure in multiple tumor types (whole genome)
• Bladder: Poon et al., 2015
• Upper tract urothelial (UTUC): Poon et al., 2013; Hoang et al., 2013
• Kidney (RCC): Scelo et al., 2014; Jelakovic et al., 2014
• Bile duct (CCA): Zou et al., 2015
• Liver (HCC): Poon et al., 2013 and subsequent data

A few years ago: AA thought to be confined to a few geographical hot spots
• Taiwan (herbal medicine)
• Some areas of the Balkans (contamination of wheat flour with AA-containing seeds), in association with Balkan endemic nephropathy

Grollman, 2013

Today: Molecular evidence (AA signature) for much more widespread exposure
• China, Taiwan, Singapore: multiple cancer types, population > 1 billion
• AA in renal cell carcinomas from regions of Romania and Croatia not associated with Balkan endemic nephropathy (Scelo et al., 2014 and Jelakovic et al., 2014)

Non-molecular evidence that AA exposure is even more widespread

• India / South Asia: Aristolochia indica plants used in traditional medicine (population > 1 billion)

• South / Central America: Aristolochia plants used in traditional medicine; extent unclear

Evidence of widespread use of AA-containing plants in South Asia

[Figure legend: cultivated AA plant, or AA plant product purchased in market]

AA in Central American “snake bottle”

[Photographs: Aristolochia trilobata; Battus polydamas. Credits: Donald Hall, University of Florida; Kimera Corporation]

Clinical implications of widespread AA exposure
• Primary prevention (avoiding AA)
  – Regulation, education
  – Possible: unlike tobacco, presumably non-addictive
  – Possible: unlike aflatoxin, ingested deliberately
• Secondary prevention: focused screening for people with known or likely exposure based on
  – Geography
  – Use of AA-containing remedies
  – Kidney failure
  – Previous AA-related cancer (e.g. based on signature)

Next steps: causes of many mutation signatures are unknown

[Figure label: Unknown]

• … and we do NOT know the signatures of 100s of known or suspected mutagenic carcinogens
• To resolve this: experimental studies
• Need experience and robust experimental systems

Applications and challenges for cancer prevention and treatment

Primary prevention

Treatment

• Using genomic information (including mutations and gene expression) to select the best treatment

• Combination therapy seems to be the only plausible way around resistance

• Need a larger ‘n’: how did patients with particular genomic characteristics respond to (combination) therapy?
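
One way to see why a larger ‘n’ matters is to look at how the uncertainty in an observed response rate shrinks as the cohort grows. The sketch below uses the Wilson score interval with hypothetical cohort sizes and a hypothetical 40% response rate; none of these numbers come from the talk.

```python
import math


def wilson_interval(responders: int, n: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a response rate."""
    if n <= 0 or not 0 <= responders <= n:
        raise ValueError("need 0 <= responders <= n and n > 0")
    p = responders / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical cohorts with the same observed 40% response rate.
for n in (10, 100, 1000):
    low, high = wilson_interval(round(0.4 * n), n)
    print(f"n={n:5d}: response rate 40%, interval {low:.2f} to {high:.2f}")
```

Patients sharing a particular genomic profile and treatment combination are rare at any single institution, so intervals this wide are a realistic concern without pooled data.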

How to get a larger ‘n’

• Will require sharing genomic and treatment outcome data across organizations

• Several efforts in this direction
• One example…

• International, multiphase, multiyear project

• Provide the “critical mass” of genomic and clinical data necessary to improve clinical decision making – i.e. large ‘n’

• Catalyze new clinical and translational research.

GENIE Use Cases

How you can help
• Bioinformatics expertise is critical, and often limiting, in much of basic and clinical research
• The more you can take a broad perspective and “own” the basic science or clinical questions, the more you will contribute
• More broadly, your skills and advocacy are needed to “turn the health-care system into a learning system” [1]
  – Overcome technical, administrative, and political obstacles to data sharing for projects like GENIE

[1] Goal articulated by Eric Lander

Acknowledgements (partial)

Singapore / Duke-NUS, National Cancer Centre, Singapore General Hospital and others
Song Ling Poon, Mi Ni Huang, Apinya Jusakul, John McPherson, Weng Khong Lim, Patrick Tan, Ben Tean Teh, Choon Kiat Ong, Jing Tan, Puay Hoon Tan, Benita K T Tan, Aye Aye Thike, Cedric Chuan Young Ng, Waraporn Chan-on, Maaja-Liisa Nairismagi

Taiwan / Chang Gung Memorial Hospital and others
See-Tong Pang, Wen-Hui Weng, Cheng-Keng Chuang

Thailand / Khon Kaen University and others
Vajarabhongsa Bhudhisawasdi, Chutima Subimerb, Chawalit Pairojkul, Sopit Wongkham, Paiboon Sithithawor, Puangrat Yongvanit

Funding
Singapore National Medical Research Council; A*STAR and the Singapore Ministry of Health through the Duke-NUS Signature Research Programs; Singapore Millennium Foundation; Lee Foundation, the National Cancer Centre Research Fund; The Verdant Foundation; Cancer Science Institute Singapore; Chang Gung Memorial Hospital; Taiwan National Science Council; Research Team Strengthening Grant, the National Genetic Engineering and Biotechnology Center and the National Science and Technology Development Agency, Thailand.
