19
University of Groningen Return to the sea, get huge, beat cancer Tollis, Marc; Robbins, Jooke; Webb, Andrew E; Kuderna, Lukas F K; Caulin, Aleah F; Garcia, Jacinda D; Bérubé, Martine; Pourmand, Nader; Marques-Bonet, Tomas; O'Connell, Mary J Published in: Molecular Biology and Evolution DOI: 10.1093/molbev/msz099 IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below. Document Version Publisher's PDF, also known as Version of record Publication date: 2019 Link to publication in University of Groningen/UMCG research database Citation for published version (APA): Tollis, M., Robbins, J., Webb, A. E., Kuderna, L. F. K., Caulin, A. F., Garcia, J. D., ... Maley, C. C. (2019). Return to the sea, get huge, beat cancer: An analysis of cetacean genomes including an assembly for the humpback whale (Megaptera novaeangliae). Molecular Biology and Evolution, 36(8), 1746–1763. https://doi.org/10.1093/molbev/msz099 Copyright Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons). Take-down policy If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim. Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the number of authors shown on this cover page is limited to 10 maximum. Download date: 21-09-2020

University of Groningen Return to the sea, get huge, …...Return to the Sea, Get Huge, Beat Cancer: An Analysis of Cetacean Genomes Including an Assembly for the Humpback Whale (Megaptera

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: University of Groningen Return to the sea, get huge, …...Return to the Sea, Get Huge, Beat Cancer: An Analysis of Cetacean Genomes Including an Assembly for the Humpback Whale (Megaptera

University of Groningen

Return to the sea get huge beat cancerTollis Marc Robbins Jooke Webb Andrew E Kuderna Lukas F K Caulin Aleah F GarciaJacinda D Beacuterubeacute Martine Pourmand Nader Marques-Bonet Tomas OConnell Mary JPublished inMolecular Biology and Evolution

DOI101093molbevmsz099

IMPORTANT NOTE You are advised to consult the publishers version (publishers PDF) if you wish to cite fromit Please check the document version below

Document VersionPublishers PDF also known as Version of record

Publication date2019

Link to publication in University of GroningenUMCG research database

Citation for published version (APA)Tollis M Robbins J Webb A E Kuderna L F K Caulin A F Garcia J D Maley C C (2019)Return to the sea get huge beat cancer An analysis of cetacean genomes including an assembly for thehumpback whale (Megaptera novaeangliae) Molecular Biology and Evolution 36(8) 1746ndash1763httpsdoiorg101093molbevmsz099

CopyrightOther than for strictly personal use it is not permitted to download or to forwarddistribute the text or part of it without the consent of theauthor(s) andor copyright holder(s) unless the work is under an open content license (like Creative Commons)

Take-down policyIf you believe that this document breaches copyright please contact us providing details and we will remove access to the work immediatelyand investigate your claim

Downloaded from the University of GroningenUMCG research database (Pure) httpwwwrugnlresearchportal For technical reasons thenumber of authors shown on this cover page is limited to 10 maximum

Download date 21-09-2020

Return to the Sea Get Huge Beat Cancer An Analysis ofCetacean Genomes Including an Assembly for the HumpbackWhale (Megaptera novaeangliae)

Marc Tollis123 Jooke Robbins4 Andrew E Webb5 Lukas FK Kuderna6 Aleah F Caulin7

Jacinda D Garcia2 Martine Berube48 Nader Pourmand9 Tomas Marques-Bonet6101112

Mary J OrsquoConnell13 Per J Palsboslashlldagger48 and Carlo C Maleydagger12

1Biodesign Institute Arizona State University Tempe AZ2School of Life Sciences Arizona State University Tempe AZ3School of Informatics Computing and Cyber Systems Northern Arizona University Flagstaff AZ4Center for Coastal Studies Provincetown MA5Center for Computational Genetics and Genomics Temple University Philadelphia PA6Instituto de Biologia Evolutiva (UPF-CSIC) PRBB Barcelona Spain7Genomics and Computational Biology Program University of Pennsylvania Philadelphia PA8Groningen Institute of Evolutionary Life Sciences University of Groningen Groningen The Netherlands9Jack Baskin School of Engineering University of California Santa Cruz Santa Cruz CA10CNAG-CRG Centre for Genomic Regulation (CRG) The Barcelona Institute of Science and Technology Barcelona Spain11Institucio Catalana de Recerca i Estudis Avancats (ICREA) Barcelona Catalonia Spain12Institut Catala de Paleontologia Miquel Crusafont Universitat Autonoma de Barcelona Edifici ICTA-ICP Barcelona Spain13Computational and Molecular Evolutionary Biology Research Group School of Life Sciences University of Nottingham NottinghamUnited KingdomdaggerThese authors contributed equally to this work

Corresponding author E-mail marctollisnauedu

Associate editor Beth Shapiro

Abstract

Cetaceans are a clade of highly specialized aquatic mammals that include the largest animals that have ever lived Thelargest whales can have1000more cells than a human with long lifespans leaving them theoretically susceptible tocancer However large-bodied and long-lived animals do not suffer higher risks of cancer mortality than humansmdashanobservation known as Petorsquos Paradox To investigate the genomic bases of gigantism and other cetacean adaptations wegenerated a de novo genome assembly for the humpback whale (Megaptera novaeangliae) and incorporated the genomesof ten cetacean species in a comparative analysis We found further evidence that rorquals (family Balaenopteridae)radiated during the Miocene or earlier and inferred that perturbations in abundance andor the interocean connectivityof North Atlantic humpback whale populations likely occurred throughout the Pleistocene Our comparative genomicresults suggest that the evolution of cetacean gigantism was accompanied by strong selection on pathways that aredirectly linked to cancer Large segmental duplications in whale genomes contained genes controlling the apoptoticpathway and genes inferred to be under accelerated evolution and positive selection in cetaceans were enriched forbiological processes such as cell cycle checkpoint cell signaling and proliferation We also inferred positive selection ongenes controlling the mammalian appendicular and cranial skeletal elements in the cetacean lineage which are relevantto extensive anatomical changes during cetacean evolution Genomic analyses shed light on the molecular mechanismsunderlying cetacean traits including gigantism and will contribute to the development of future targets for humancancer therapies

Key words cetaceans humpback whale evolution genome cancer

IntroductionCetaceans (whales dolphins and porpoises) are highly spe-cialized mammals adapted to an aquatic lifestyle Divergingfrom land-dwelling artiodactyls during the late Paleocene or

early Eocene 55 Ma (Thewissen et al 2007 OrsquoLeary andGatesy 2008) cetaceans diversified throughout theCenozoic and include two extant groups Mysticeti or thebaleen whales and Odontoceti or the toothed whales

Article

The Author(s) 2019 Published by Oxford University Press on behalf of the Society for Molecular Biology and EvolutionThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (httpcreativecommonsorglicensesby40) which permits unrestricted reuse distribution and reproduction in any medium provided the original work isproperly cited Open Access1746 Mol Biol Evol 36(8)1746ndash1763 doi101093molbevmsz099 Advance Access publication May 9 2019

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Traits evolved for life in the ocean including the loss of hindlimbs changes in skull morphology physiological adaptationsfor deep diving and underwater acoustic abilities includingecholocation make these species among the most divergedmammals from the ancestral eutherian (Berta et al 2015)One striking aspect of cetacean evolution is the large bodysizes achieved by some lineages rivaled only by the giganticterrestrial sauropod dinosaurs (Benson et al 2014) Cetaceanswere not limited by gravity in the buoyant marine environ-ment and evolved multiple giant forms exemplified today bythe largest animal that has ever lived the blue whale(Balaenoptera musculus) Based on evidence from fossils mol-ecules and historical climate data it has been hypothesizedthat oceanic upwelling during the PliocenendashPleistocene sup-ported the suspension feeding typical of modern baleenwhales allowing them to reach their gigantic sizes surprisinglyclose to the present time (Slater et al 2017)

Although the largest whales arose relatively recently largebody size has evolved multiple times throughout the historyof life (Heim et al 2015) including in 10 out of 11 mammalianorders (Baker et al 2015) Animal gigantism is therefore arecurring phenomenon that is seemingly governed by avail-able resources and natural selection (Vermeij 2016) wherepositive fitness consequences lead to repeated directionalselection toward larger bodies within populations(Kingsolver and Pfennig 2004) However there are tradeoffsassociated with large body size including a higher lifetime riskof cancer due to a greater number of somatic cell divisionsover time (Peto et al 1975 Nunney 2018) Surprisingly al-though cancer should be a body mass- and age-related dis-ease large and long-lived animals do not suffer higher cancermortality rates than smaller shorter-lived animals (Abegglenet al 2015) This is a phenomenon known as Petorsquos Paradox(Peto et al 1975) To the extent that there has been selectionfor large body size there likely has also been selection forcancer suppression mechanisms that allow an organism togrow large and successfully reproduce Recent efforts havesought to understand the genomic mechanisms responsiblefor cancer suppression in gigantic species (Abegglen et al2015 Caulin et al 2015 Keane et al 2015 Sulak et al 2016)An enhanced DNA damage response in elephant cells hasbeen attributed to20 duplications of the tumor suppressorgene TP53 in elephant genomes (Abegglen et al 2015 Sulaket al 2016) The bowhead whale (Balaena mysticetus) is alarge whale that may live more than 200 years (Georgeet al 1999) and its genome shows evidence of positive selec-tion in many cancer- and aging-associated genes includingERCC1 which is part of the DNA repair pathway (Keane et al2015) Additionally the bowhead whale genome containsduplications of the DNA repair gene PCNA as well asLAMTOR1 which helps control cellular growth (Keane et al2015) Altogether these results suggest that 1) the genomesof larger and longer-lived mammals may hold the key tomultiple mechanisms for suppressing cancer and 2) as thelargest animals on Earth whales make very promising sourcesof insight for cancer suppression research

Cetacean comparative genomics is a rapidly growing fieldwith 13 complete genome assemblies available on NCBI as of

late 2018 including the following that were available at theonset of this study the common minke whale (Balaenopteraacutorostrata) (Yim et al 2014) bottlenose dolphin (Tursiopstruncatus) orca (Orcinus orca) (Foote et al 2015) and spermwhale (Physeter macrocephalus) (Warren et al 2017) In ad-dition the Bowhead Whale Genome Resource has supportedthe genome assembly for that species since 2015 (Keane et al2015) However to date few studies have used multiple ce-tacean genomes to address questions about genetic changesthat have controlled adaptations during cetacean evolutionincluding the evolution of cancer suppression Here we pro-vide a comparative analysis that is novel in scope leveragingwhole-genome data from ten cetacean species including sixcetacean genome assemblies and a de novo genome assem-bly for the humpback whale (Megaptera novaeangliae)Humpback whales are members of the familyBalaenopteridae (rorquals) and share a recent evolutionaryhistory with other ocean giants such as the blue whale and finwhale (Balaenoptera physalus) (Arnason et al 2018) Theyhave an average adult length of more than 13 m (Claphamand Mead 1999) and a lifespan that may extend to 95 years(Chittleborough 1959 Gabriele et al 2010) making the spe-cies an excellent model for Petorsquos Paradox research

Our goals in this study were 3-fold 1) to provide a de novogenome assembly and annotation for the humpback whalethat will be useful to the cetacean research and mammaliancomparative genomics communities 2) to leverage the geno-mic resource and investigate the molecular evolution of ceta-ceans in terms of their population demographicsphylogenetic relationships and species divergence timesand the genomics underlying cetacean-specific adaptationsand 3) to determine how selective pressure variation on genesinvolved with cell cycle control cell signaling and prolifera-tion and many other pathways relevant to cancer may havecontributed to the evolution of cetacean gigantism The latterhas the potential to generate research avenues for improvinghuman cancer prevention and perhaps even therapies

Results and Discussion

Sequencing Assembly and Annotation of theHumpback Whale GenomeWe sequenced and assembled a reference genome for thehumpback whale using high-coverage paired-end and mate-pair libraries (table 1 NCBI BioProject PRJNA509641) andobtained an initial assembly that was 227 Gb in lengthwith 24319 scaffolds a contig N50 length of 125 kb and ascaffold N50 length of 198 kb Final sequence coverage forthe initial assembly was 76 assuming an estimated ge-nome size of 274 Gb from a 27-mer spectrum analysis Hi Risescaffolding using proximity ligation (Chicago) libraries(Putnam et al 2016 table 1 NCBI BioProject PRJNA509641)resulted in a final sequence coverage of 102 greatly im-proving the contiguity of the assembly by reducing the num-ber of scaffolds to 2558 and increasing the scaffold N50length 46-fold to 914 Mb (table 2) The discrepancy betweenestimated genome size and assembly length has been ob-served in other cetacean genome efforts (Keane et al 2015)

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1747

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

and is likely due to the highly repetitive nature of cetaceangenomes (Arnason and Widegren 1989)

With 95ndash96 of near-universal orthologs from OrthoDBv9 (Sim~ao et al 2015) present in the assembly as well as 97of a set of core eukaryotic genes (Parra et al 2009) the esti-mated gene content of the humpback whale genome assem-bly suggests a high-quality genome with good generepresentation (table 1) To aid in genome annotation wecarried out skin transcriptome sequencing which resulted in281642354 reads (NCBI BioProject PRJNA509641) Thesewere assembled into a transcriptome that includes 67 ofboth vertebrate and laurasiatherian orthologs and we pre-dicted 10167 protein-coding genes with likely ORFs that con-tain BLAST homology to SwissProt proteins (UniProtConsortium 2015) The large number of missing genes from

the transcriptome may be due to the small proportion ofgenes expressed in skin Therefore we also assessed homologywith ten mammalian proteomes from NCBI and the entireSwissProt database and ab initio gene predictors (seeMaterials and Methods supplementary Methods and sup-plementary fig 1 Supplementary Material online) for gene-calling The final genome annotation resulted in 24140 pro-tein-coding genes including 5446 with 50-untranslatedregions (UTRs) and 6863 with 30-UTRs We detected 15465one-to-one orthologs shared with human and 14718 withcow When we compared gene annotations across a sampleof mammalian genomes the humpback whale and bottle-nose dolphin genome assemblies had on average significantlyshorter introns (Pfrac14 004 unpaired T-test supplementary ta-ble 1 Supplementary Material online) which may in partexplain the smaller genome size of cetaceans comparedwith most other mammals (Zhang and Edwards 2012)

We estimated that between 30 and 39 of thehumpback whale genome comprised repetitive elements(table 3) Masking the assembly with a library of knownmammalian elements resulted in the identification of morerepeats than a de novo method suggesting that clade-specific repeat libraries are highly valuable when assessingrepetitive content The most abundant group of transpos-able elements in the humpback whale genome was theautonomous non-long terminal repeat (LTR) retrotranspo-sons (long interspersed nuclear elements or LINEs) whichcomprised nearly 20 of the genome most of which be-long to the LINE-1 clade as is typical of placental mammals(Boissinot and Sookdeo 2016) Large numbers of nonau-tonomous non-LTR retrotransposons in the form of shortinterspersed nuclear elements (SINEs) were also detectedin particular over 3 of the genome belonged to mamma-lian inverted repeats Although the divergence profile of denovo-derived repeat annotations in humpback whale in-cluded a decreased average genetic distance within trans-posable element subfamilies compared with the database-derived repeat landscape both repeat libraries displayed aspike in the numbers of LINE-1 and SINE retrotransposonsubfamilies near 5 divergence as did the repeat land-scapes of the bowhead whale orca and dolphin suggestingrecent retrotransposon activity in cetaceans (supplemen-tary figs 2 and 3 Supplementary Material online)

Slow DNA Substitution Rates in Cetaceans and theDivergence of Modern Whale LineagesWe computed a whole-genome alignment (WGA) of 12mammals including opossum elephant human mousedog cow sperm whale bottlenose dolphin orca bowheadwhale common minke whale and humpback whale (supple-mentary table 2 Supplementary Material online) andemployed human gene annotations to extract 2763828 ho-mologous 4-fold degenerate (4D) sites A phylogenetic anal-ysis of the 4D sites yielded the recognized evolutionaryrelationships (fig 1A) including reciprocally monophyleticMysticeti and Odontoceti When we compared the substitu-tions per site along the branches of the phylogeny we founda larger number of substitutions along the mouse

Table 1 Genomic Sequence Data Obtained for the Humpback WhaleGenome

Libraries Est Numberof Reads

Avg ReadLength

(bp)

EstDepth(total)

180 bp paired-end 1211320000 94 413300 bp paired-end 25820000 123 12500 bp paired-end 112400000 123 50600 bp paired-end 395500000 93 1342 kb mate-paired 348080000 49 6210 kb mate-paired 279000000 94 90

Subtotal for WGS libraries 2372120000 761Chicago Library 1 72000000 100 53Chicago Library 2 6000000 151 07Chicago Library 3 190000000 100 139Chicago Library 4 79000000 100 58Subtotal for Chicago Libraries 347000000 256Total for all sequence

libraries2719120000 1017

NOTEmdashWGS whole-genome shotgun

Table 2 Statistics for the Humpback Whale Genome Assembly

Feature Initial Assembly Final Assembly

Assembly length 227 Gb 227 GbContig N50 124 kb 123 kbLongest scaffold 22 Mb 294 MbNumber of scaffolds 24319 2558Scaffold N50 198 kb 914 MbScaffold N90 53 kb 235 MbScaffold L50 3214 79Scaffold L90 11681 273Percent genome

in gaps536 545

BUSCOa resultsmdashvertebrata

C 85[D 18] F 15 M 49 n 3023

BUSCOa resultsmdashlaurasiatheria

C 912[D 08] F 48 M 40 n 6253

CEGMAa results C 226 (9113) P 240 (9677)

NOTEmdashBUSCO Benchmarking Universal Single Copy Orthologs (Sim~ao et al 2015)C complete D duplicated F fragmented M missing CEGMA Core EukaryoticGenes Mapping Approach (Parra et al 2009) C complete P complete andorpartialaBUSCO and CEGMA results for final assembly only

Tollis et al doi101093molbevmsz099 MBE

1748

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

(rodent) branch supporting negative relationships betweengeneration time (Wilson Sayres et al 2011) speciation (Pagelet al 2006) and substitution rates When applying a semi-parametric penalized likelihood (PL) method to estimate sub-stitution rate variation at 4D sites across the 12 mammals we

found that cetaceans have accumulated the lowest numberof DNA substitutions per site per million years (fig 1B andsupplementary table 3 Supplementary Material online)which may be attributed to long generation times or slowermutation rates in cetaceans (Jackson et al 2009) Germlinemutation rates are related to somatic mutation rates withinspecies (Milholland et al 2017) therefore it is possible thatslow mutation rates may limit neoplastic progression andcontribute to cancer suppression in cetaceans which is aprediction of Petorsquos Paradox (Caulin and Maley 2011)

We also obtained 152 single-copy orthologs (single-geneortholog families or SGOs see Materials and Methods andsupplementary Methods Supplementary Material online)identified in at least 24 out of 28 species totaling314844 bp and reconstructed gene trees that were binnedand analyzed using a species tree method that incorporatesincomplete lineage sorting (see Materials and MethodsZhang et al 2018) The species tree topology (supplementaryfig 4 Supplementary Material online) also included full sup-port for the accepted phylogenetic relationships withinCetacea as well as within Mysticeti and Odontoceti Lowerlocal posterior probabilities for two of the internal brancheswithin laurasiatherian mammals were likely due to the exten-sive gene tree heterogeneity that has complicated phyloge-netic reconstruction of the placental mammalian lineages(Tarver et al 2016)

We estimated divergence times in a Bayesian frameworkusing the 4D and SGO data sets independently in MCMCtree(Yang and Rannala 2006) resulting in similar posterior distri-butions and parameter estimates with overlapping highestposterior densities for the estimated divergence times ofshared nodes across the 4D and SGO phylogenies (supple-mentary figs 5 and 6 and tables 4 and 5 SupplementaryMaterial online) We estimated that the time to the mostrecent common ancestor (TMRCA) of placental mammalswas 100ndash114 Ma during the late Cretaceous the TMRCA ofcow and cetaceans (Cetartiodactyla) was 52ndash65 Ma duringthe Eocene or Paleocene the TMRCA of extant cetaceans was29ndash35 Ma during the early Oligocene or late Eocene (betweenthe two data sets) the TMRCA of baleen whales was placed9ndash26 Ma in the early Miocene or middle Oligocene and theTMRCA of humpback and common minke whales (familyBalaenopteridae) was 4ndash22 Ma during the early Pliocene orthe Miocene (fig 2A)

Table 3 Repetitive Content of the Humpback Whale (Megaptera novaeangliae) Genome Estimated with a Library of Known Mammalian Repeats(RepBase) and De Novo Repeat Identification (RepeatModeler)

RepBase RepeatModeler

Repeat Type Length (bp) Genome(3885 total)

Length (bp) Genome(3025 total)

SINEs 137574621 607 75509694 333LINEs 440955223 1946 432017456 1907LTR 142117286 627 94177184 416DNA transposons 84243186 372 54015996 238Unclassified 1303231 006 4339463 019Satellites 48894580 216 197862 001Simple repeats 20779839 092 20848394 092Low complexity 4167187 018 4281173 019

A

B

FIG 1 Substitution rates in cetacean genomes (A) Maximum likeli-hood phylogeny of 12 mammals based on 2763828 fourfold degen-erate sites Branch lengths are given in terms of substitutions per siteexcept for branches with hatched lines which are shortened for visualconvenience All branches received 100 bootstrap support (B)Based on the phylogeny in (A) a comparison of the estimated DNAsubstitution rates (in terms of substitutions per site per million years)between terminal and internal cetacean branches and terminal andinternal branches of all other mammals

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1749

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

FIG 2 Timescale of humpback whale evolution (A) Species phylogeny of 28 mammals constructed from 152 orthologs and time-calibrated usingMCMCtree Branch lengths are in terms of millions of years Node bars indicate 95 highest posterior densities of divergence times Cetaceans arehighlighted in the gray box with mean estimates of divergence times included (B) The effective population size (Ne) changes over timeDemographic histories of two North Atlantic humpback whales estimated from the PSMC analysis including 100 bootstrap replicates per analysisMutation rate used was 154e-9 per year and generation time used was 215 years

Tollis et al doi101093molbevmsz099 MBE

1750

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

A Complex Demographic History of North AtlanticHumpback WhalesWe estimated the demographic history of the North Atlantichumpback whale population applying the Pairwise SequentialMarkovian Coalescent (PSMC) (Li and Durbin 2011) to theshort-insert libraries generated during this study as well assequence reads from a second North Atlantic humpbackwhale (Arnason et al 2018) (fig 2B supplementary figs 7and 8 Supplementary Material online) Consistent with thefindings of Arnason et al (2018) we estimated that the largesthumpback whale population sizes were 2 Ma during thePliocenendashPleistocene transition followed by a steady declineuntil 1 Ma The PSMC trajectories of the two humpbackwhales began to diverge 100000 years ago and the esti-mated confidence intervals from 100 bootstraps for eachPSMC analysis were nonoverlapping in the more recentbins Both humpback PSMC trajectories suggested sharp pop-ulation declines beginning 25000ndash45000 years agoHowever interpreting inferred PSMC plots of pastldquodemographicrdquo changes is nontrivial in a globally distributedspecies connected by repeated occasional gene flow such ashumpback whales (Baker et al 1993 Palsboslashll et al 1995Jackson et al 2014) The apparent changes in effective pop-ulation size may represent changes in abundance interoceanconnectivity or a combination of both (Hudson 1990 Palsboslashllet al 2013) Several genetic and genome-based studies ofcetaceans have demonstrated how past large-scale oceanicchanges have affected the evolution of cetaceans (Steemanet al 2009) including baleen whales (Arnason et al 2018)Although the population genetic structure of humpbackwhales in the North Atlantic is not fully resolved the levelof genetic divergence among areas is very low (Larsen et al1996 Valsecchi et al 1997) Therefore the difference betweenthe two humpback whale PSMC trajectories may be due torecent admixture (Baker et al 1993 Palsboslashll et al 1995 Ruegget al 2013 Jackson et al 2014) intraspecific variation andpopulation structure (Mazet et al 2016) as well as errorsdue to differences in sequence coverage (Nadachowska-Brzyska et al 2016)

Segmental Duplications in Cetacean GenomesContain Genes Involved in Apoptosis and TumorSuppressionMammalian genomes contain gene-rich segmental duplica-tions (Alkan et al 2009) which may represent a powerfulmechanism by which new biological functions can arise(Kaessmann 2010) We employed a read-mapping approachto annotate large segmental duplications (LSDs) 10 kb inthe humpback whale genome assembly and ten additionalcetaceans for which whole-genome shotgun data were avail-able (see Materials and Methods supplementary Methodsand supplementary table 6 Supplementary Material online)We found that cetacean genomes contained on average 318LSDs (656 SD) which comprised 99 Mb (618 Mb) andaveraged 31 kb in length (624 kb) We identified10128534 bp (04) of the humpback whale genome assem-bly that comprised 293 LSDs averaging 34568 bp in length

Fifty-one of the LSDs were shared across all 11 cetaceangenomes (supplementary fig 9 Supplementary Material on-line) In order to determine the potential role of segmentalduplications during the evolution of cetacean-specific pheno-types we identified 426 gene annotations that overlappedcetacean LSDs including several genes annotated for viralresponse Other genes on cetacean LSDs were involved inaging in particular DLD in the bowhead whale andKCNMB1 in the blue whale this may reflect relevant adapta-tions contributing to longevity in two of the largest andlongest-lived mammals (Ohsumi 1979 George et al 1999)Multiple tumor suppressor genes were located on cetaceanLSDs including 1) SALL4 in the sei whale 2) TGM3 andSEMA3B in the orca 3) UVRAG in the sperm whale NorthAtlantic right whale and bowhead whale and 4) PDCD5which is upregulated during apoptosis (Zhao et al 2015)and was found in LSDs of all 11 queried cetacean genomesPDCD5 pseudogenes have been identified in the human ge-nome and several Ensembl-hosted mammalian genomescontain one-to-many PDCD5 orthologs however we anno-tated only a single copy of PDCD5 in the humpback whaleassembly This suggests that in many cases gene duplicationsare collapsed during reference assembly but can be retrievedthrough shotgun read-mapping methods (Carbone et al2014) We annotated fully resolved SALL4 and UVRAG copynumber variants in the humpback whale genome assemblyand by mapping the RNA-Seq data from skin to the genomeassembly and annotation (see Materials and methods) wefound that three annotated copies of SALL4 were expressed inhumpback whale skin as were two copies of UVRAG

We also found that145 Mb (6923 kb) of each cetaceangenome consists of LSDs not found in other cetaceans mak-ing them species-specific which averaged 244 kb (6146kb) in length (supplementary table 7 and fig 9Supplementary Material online) The minke whale genomecontained the highest number of genes on its species-specificLSDs (32) After merging the LSD annotations for the twohumpback whales we identified 57 species-specific LSDs forthis species comprising 977 kb and containing nine dupli-cated genes Humpback whale-specific duplications includedthe genes PRMT2 which is involved in growth and regulationand promotes apoptosis SLC25A6 which may be responsiblefor the release of mitochondrial products that trigger apopto-sis and NOX5 which plays a role in cell growth and apoptosis(UniProt Consortium 2015) Another tumor suppressor geneTPM3 was duplicated in the humpback whale assemblybased on our gene annotation However these extranumer-ary copies of TPM3 were not annotated on any humpbackwhale LSDs lacked introns and contained mostly the sameexons suggesting retrotransposition rather than segmentalduplication as a mechanism for their copy number expansion(Kaessmann 2010) According to the RNA-Seq data all sevencopies of TPM3 are expressed in humpback whale skin

Duplications of the tumor suppressor gene TP53 havebeen inferred as evidence for cancer suppression in elephants(Abegglen et al 2015 Caulin et al 2015 Sulak et al 2016)During our initial scans for segmental duplications we no-ticed a large pileup of reads in the MAKER-annotated

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1751

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

humpback whale TP53 (data not shown) We PCR-amplifiedcloned and sequenced this region from a humpback whaleDNA sample inferring four haplotypes that differ at two bases(supplementary Methods Supplementary Material online)After manually annotating TP53 in the humpback whalewe determined that these nucleotide variants fell in noncod-ing regions of the gene one occurred upstream of the startcodon whereas the other occurred between the first andsecond coding exons Other genomic studies have concludedthat TP53 is not duplicated in cetaceans (Yim et al 2014Keane et al 2015 Sulak et al 2016) We consider the possi-bility of at least two TP53 homologs in the genome of thehumpback whale although more data are required to resolvethis Regardless cancer suppression likely arose in differentmammalian lineages via multiple molecular etiologiesOverall our results reveal several copy number expansionsin cetaceans related to immunity aging and cancer suggest-ing that cetaceans are among the large mammals that haveevolved specific adaptations related to cancer resistance

Accelerated Regions in Cetacean Genomes AreSignificantly Enriched with Pathways Relevant toCancerIn order to determine genomic loci underlying cetacean adap-tations we estimated regions in the 12-mammal WGA withelevated substitution rates that were specific to the cetaceanbranches of the mammalian phylogeny These genomicregions departed from neutral expectations in a manner con-sistent with either positive selection or relaxed purifying se-lection along the cetacean lineage (Pollard et al 2010) Wesuccessfully mapped 3260 protein-coding genes with func-tional annotations that overlap cetacean-specific acceleratedregions which were significantly enriched for Gene Ontology(GO) categories such as cell-cell signaling (GO0007267) andcell adhesion (GO0007155) (table 4) Adaptive change in cellsignaling pathways could have maintained the ability of ceta-ceans to prevent neoplastic progression as they evolved largerbody sizes Adhesion molecules are integral to the develop-ment of cancer invasion and metastasis and these resultssuggest that cetacean evolution was accompanied by selec-tion pressure changes on both intra- and extracellular inter-actions Cetacean-specific genomic regions with elevatedsubstitution rates were also significantly enriched in genesinvolved in B-cell-mediated immunity (GO0019724) likely

due to the important role of regulatory cells which modulateimmune response to not only pathogens but perhaps tumorsas well In addition cetacean-specific acceleration in regionscontrolling complement activation (GO0006956) may haveprovided better immunosurveillance against cancer and fur-ther protective measures against malignancies (Pio et al2014) We also found that accelerated regions in cetaceangenomes were significantly enriched for genes controllingsensory perception of smell (GO0007608) perhaps due tothe relaxation of purifying selection in olfactory regions whichwere found to be underrepresented in cetacean genomes(Yim et al 2014)

Selection Pressures on Protein-Coding Genes duringCetacean Evolution Point to Many CetaceanAdaptations Including Cancer SuppressionTo gain further insight into the genomic changes underlyingthe evolution of large body sizes in cetaceans we employedphylogenetic targeting to maximize statistical power in pair-wise evolutionary genomic analyses (Arnold and Nunn 2010)This resulted in maximal comparisons between 1) the orcaand the bottlenose dolphin and 2) the humpback whale andcommon minke whale Despite their relatively recent diver-gences (eg the orcabottlenose dolphin divergence is similarin age to that of the humanchimpanzee divergence seefig 2) the species pairs of common minkehumpback andorcadolphin have each undergone extremely divergent evo-lution in body size and longevity (fig 3) Humpback whalesare estimated to weigh up to four times as much as commonminke whales and are reported to have almost double thelongevity and orcas may weigh almost 20 times as much asbottlenose dolphins also with almost double the lifespan(Tacutu et al 2012) In order to offset the tradeoffs associatedwith the evolution of large body size with the addition ofmany more cells and longer lifespans since the divergence ofeach species pair we hypothesize that necessary adaptationsfor cancer suppression should be encoded in the genomes aspredicted by Petorsquos Paradox (Tollis et al 2017)

For each pairwise comparison we inferred pairwise ge-nome alignments with the common minke whale and orcagenome assemblies as targets respectively and extractedprotein-coding orthologous genes We then estimated theratio of nonsynonymous substitutions per synonymous siteto synonymous substitutions per synonymous site (dNdS) in

Table 4 GO Terms for Biological Processes Overrepresented by Genes Overlapping Genomic Regions with Elevated Substitution Rates That AreUnique to the Cetacean Lineage

Go Term Description Number of Genes Fold Enrichment P-Valuea

GO0007608 Sensory perception of smell 157 458 124E-40GO0006956 Complement activation 39 291 491E-05GO0019724 B-cell-mediated immunity 39 291 491E-05GO0032989 G-protein-coupled receptor signaling pathway 159 244 196E-17GO0042742 Defense response to bacterium 39 244 220E-03GO0009607 Response to biotic stimulus 43 209 185E-02GO0007155 Cell adhesion 101 199 174E-06GO0007267 Cellndashcell signaling 137 169 284E-05

aAfter Bonferonni correction for multiple testing

Tollis et al doi101093molbevmsz099 MBE

1752

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

order to measure selective pressures acting on each ortholo-gous gene pair during cetacean evolution A dNdSgt1 is usedto infer potentially functional amino acid changes in candi-date genes subjected to positive selection (Fay and Wu 2003)Among an estimated 435 genes with dNdS gt1 in the com-mon minkehumpback pairwise comparison we detectedeight genes belonging to the JAK-STAT signaling pathway(39-fold enrichment Pfrac14 11E-3 Fisherrsquos exact test) and seveninvolved in cytokinendashcytokine receptor interaction (41-foldenrichment Pfrac14 17E-2 Fisherrsquos exact test) suggesting positiveselection acting on pathways involved in cell proliferationThese genes included multiple members of the tumor necro-sis factor subfamily such as TNFSF15 which inhibits angiogen-esis and promotes the activation of caspases and apoptosis(Yu et al 2001) A dNdS gt1 was also detected in seven genesinvolved in the negative regulation of cell growth(GO0030308 31-fold enrichment Pfrac14 803E-3 Fisherrsquos exacttest) and five genes involved in double-strand break repair(GO0006302 40-fold enrichment Pfrac14 803E-3 Fisherrsquos exacttest) Although these results suggest the evolution of aminoacid differences since the split between common minke andhumpback whales in genes affecting cell growth proliferationand maintenance the GO category enrichment tests did notpass significance criteria after Bonferroni corrections for mul-tiple testing We found 18 genes that are mutated in cancersaccording to the COSMIC v85 database (Forbes et al 2015) inthe common minkehumpback comparison including a sub-set of five annotated as tumor suppressor genes oncogenesor fusion genes in the Cancer Gene Census (CGC Futreal et al2004) which are highlighted in table 5 The complete list ofCOSMIC genes with elevated dNdS in the pairwise compar-isons is given in supplementary table 8 SupplementaryMaterial online We detected 555 orthologous genes withdNdS gt1 in the orcadolphin comparison which are signifi-cantly enriched (after Bonferroni correction for multiple test-ing) for biological processes such as immune response cellactivation and regulation of cytokines (table 6) and 41 of

which are known cancer genes according to COSMIC andCGC (table 5 supplementary table 8 Supplementary Materialonline) These results are consistent with our accelerated re-gion analysis based on the WGA which showed acceleratedevolution in immunity pathways (above see table 4) Forinstance eight genes (CD58 CD84 KLF13 SAMSN1 CTSGGPC3 LTF and SPG21) annotated for immune system process(GO0002376) were found in cetacean-specific acceleratedgenomic regions and also had a pairwise dNdS gt1 in theorcadolphin comparison mirroring other recent genomicanalyses of immunity genes in orcas (Ferris et al 2018) Ourresults also suggest that the evolution of gigantism and longlifespans in cetaceans was accompanied by selection actingon many genes related to somatic maintenance and cellsignaling

As a more accurate assessment of selection pressure var-iation acting on protein-coding genes across cetacean evolu-tion we conducted an additional assessment of dNdS usingbranch-site codon models implemented in codeml (Yang1998) We employed extensive filtering of the branch siteresults including both false discovery rate (FDR) andBonferroni corrections for multiple testing (see Materialsand methods) and conservatively estimated that 450protein-coding genes were subjected to positive selection incetaceans These include 54 genes along the ancestral ceta-cean branch 12 along the ancestral toothed whale branch 84along the ancestral baleen whale branch 74 in the ancestor ofcommon minke and humpback whales and 212 unique tothe humpback whale branch (fig 4A) Cetacean positivelyselected genes were annotated for functions related to exten-sive changes in anatomy growth cell signaling and cell pro-liferation (fig 4B) For instance in the branch-site models forhumpback whale positively selected genes are enriched forseveral higher-level mouse limb phenotypes including thoseaffecting the limb long bones (MP0011504 15 genes FDR-corrected P-valuefrac14 0001) and more specifically the hindlimb stylopod (MP0003856 seven genes FDRfrac14 0024) or

FIG 3 Diversity in both body size and lifespan within rorqual baleen whales (Balaenopteridae) and dolphins (Delphinidae) Maximal pairings usingphylogenetic targeting (Arnold and Nunn 2010) of genome assembly-enabled cetaceans resulted in the most extreme divergence in both body sizeand lifespan between humpback whale (Megaptera novaeangliae) and common minke whale (Balaenoptera acutorostrata) within theBalaenopteridae facing right and orca (Orcinus orca) and bottlenose dolphin (Tursiops truncatus) within the Delphinidae facing left Traitdata were collected from the panTHERIA (Jones et al 2009) and AnAge (Tacutu et al 2012) databases

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1753

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Table 5 CGC Genes with dNdS gt 1 as Revealed by Pairwise Comparisons of Cetacean Genomes

Comparison Gene Symbol Gene Name Role in Cancer Function

Minkehumpback CD274 CD274 molecule TSG Plays a critical role in induction and maintenance ofimmune tolerance to selfa

ETNK1 Ethanolamine kinase 1 TSG Suppresses escaping of programmed cell deathb

IL21R Interleukin 21 receptor Fusion The ligand binding of this receptor leads to the acti-vation of multiple downstream signaling mole-cules including JAK1 JAK3 STAT1 and STAT32

MYOD1 Myogenic differentiation 1 Fusion Regulates muscle cell differentiation by inducing cellcycle arrest a prerequisite for myogenic initiationa

PHF6 PHD finger protein 6 TSG Encodes a protein with two PHD-type zinc fingerdomains indicating a potential role in transcrip-tional regulation that localizes to the nucleolusa

Orcadolphin BTG1 B-cell translocation gene 1antiproliferative

TSG fusion Member of an antiproliferative gene family that reg-ulates cell growth and differentiationa

CD274 CD274 molecule TSG fusion Plays a critical role in induction and maintenance ofimmune tolerance to selfa

FANCD2 Fanconi anemia comple-mentation group D2

TSG Suppresses genome instability and mutations pro-motes escaping programmed cell death suppressesproliferative signaling suppresses invasion andmetastasisb

FAS Fas cell surface deathreceptor

TSG Promotes cell replicative immortality promotesproliferative signaling promotes invasion and me-tastasis suppresses escaping programmed celldeathb

FGFR4 Fibroblast growth factor re-ceptor 4

Oncogene Promotes proliferative signaling promotes invasionand metastasisb

GPC3 Glypican 3 Oncogene TSG Promotes invasion and metastasis promotes sup-pression of growthb

HOXD11 Homeobox D11 Oncogene fusion The homeobox genes encode a highly conservedfamily of transcription factors that play an impor-tant role in morphogenesis in all multicellularorganismsa

HOXD13 Homeobox D13 Oncogene fusion

LASP1 LIM and SH3 protein 1 Fusion The encoded protein has been linked to metastaticbreast cancer hematopoetic tumors such as B-celllymphomas and colorectal cancera

MLF1 Myeloid leukemia factor 1 TSG fusion This gene encodes an oncoprotein which is thought toplay a role in the phenotypic determination ofhemopoetic cells Translocations between this geneand nucleophosmin have been associated withmyelodysplastic syndrome and acute myeloidleukemiaa

MYB v-myb myeloblastosis viraloncogene homolog

Oncogene fusion This gene may be aberrantly expressed or rearrangedor undergo translocation in leukemias and lym-phomas and is considered to be an oncogenea

MYD88 Myeloid differentiation pri-mary response gene (88)

Oncogene Promotes escaping programmed cell death promotesproliferative signaling promotes invasion and me-tastasis promotes tumor promotinginflammationb

NR4A3 Nuclear receptor subfamily4 group A member 3(NOR1)

Oncogene fusion Encodes a member of the steroidndashthyroid hormonendashretinoid receptor superfamily that may act as atranscriptional activatora

PALB2 Partner and localizer ofBRCA2

TSG This protein binds to and colocalizes with the breastcancer 2 early onset protein (BRCA2) in nuclear fociand likely permits the stable intranuclear localiza-tion and accumulation of BRCA2a

PML Promyelocytic leukemia TSG fusion Expression is cell-cycle related and it regulates the p53response to oncogenic signalsa

RAD21 RAD21 homolog(Schizosaccharomycespombe)

Oncogene TSG Promotes invasion and metastasis suppresses ge-nome instability and mutations suppresses escap-ing programmed cell deathb

STIL SCLTAL1 interruptinglocus

Oncogene fusion Encodes a cytoplasmic protein implicated in regula-tion of the mitotic spindle checkpoint a regulatorypathway that monitors chromosome segregationduring cell division to ensure the proper distribu-tion of chromosomes to daughter cellsa

(continued)

Tollis et al doi101093molbevmsz099 MBE

1754

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

femur (MP0000559 six genes FDRfrac14 0019) These pheno-types are reminiscent of the developmental basis of hind limbloss in cetaceans embryonic studies show that hind limbbuds are initially formed but disappear by the fifth gestationalweek (Thewissen et al 2006) Enriched mouse phenotypes arealso related to the unique cetacean axial skeleton(MP0002114 25 genes FDRfrac14 0016) most notably in theskull including craniofacial bones (MP0002116 17 genesFDRfrac14 0018) teeth (MP0002100 nine genes FDRfrac14 0004)and the presphenoid (MP0030383 three genesFDRfrac14 0003) Past analyses of the cetacean basicranial ele-ments revealed that the presphenoid was extensively modi-fied along the cetacean lineage (Ichishima 2016)

Positively selected genes unique to the humpback whalewere significantly enriched for a single biological process reg-ulation of cell cycle checkpoint (GO1901976 1857-fold en-richment Pfrac14 002 after Bonferonni correction for multipletesting) suggesting positive selection in pathways that con-trol responses to endogenous or exogenous sources of DNAdamage and limit cancer progression (Kastan and Bartek2004) We detected a significant number of proteinndashproteininteractions among humpback whale-specific positively se-lected genes (number of nodesfrac14 204 number of edgesfrac14 71expected number of edgesfrac14 51 Pfrac14 0004 supplementaryfig 10 Supplementary Material online) including genes thatare often coexpressed and involved in DNA repair DNA rep-lication and cell differentiation For instance we identifiedsignificant interactions between DNA2 which encodes a heli-case involved in the maintenance of DNA stability andWDHD1 which acts as a replication initiation factorAnother robust protein interaction network was detected

between a number of genes involved in the genesis and main-tenance of primary cilia The highest scoring functional an-notation clusters resulted in key words such as ciliopathy(seven genes) and cell projection (16 genes) and GO termssuch as cilium morphogenesis cilium assembly ciliary basalbody and centriole The primary cilia of multicellular eukar-yotes control cell proliferation by mediating cell-extrinsic sig-nals and regulating cell cycle entry and defects in ciliaryregulation are common in many cancers (Michaud andYoder 2006)

Our branch-site test results indicated that the evolution ofcetacean gigantism was accompanied by strong selection onmany pathways that are directly linked to cancer (fig 4C) Weidentified 33 genes that are mutated in human cancers(according to the COSMIC database) that were inferred assubjected to positive selection in the humpback whale line-age including the known tumor suppressor genes ATR whichis a protein kinase that senses DNA damage upon genotixicstress and activates cell cycle arrest and RECK which sup-presses metastasis (Forbes et al 2015) Multiple members ofthe PR domain-containing gene family (PRDM) evolved un-der positive selection across cetaceans including the tumorsuppressor genes PRDM1 whose truncation leads to B-cellmalignancies and PRDM2 which regulates the expressionand degradation of TP53 (Shadat et al 2010) and whoseforced expression causes apoptosis and cell cycle arrest incancer cell lines (Fog et al 2012) In baleen whales ERCC5which is a DNA repair protein that partners with BRCA1 andBRCA2 to maintain genomic stability (Trego et al 2016) andsuppresses UV-induced apoptosis (Clement et al 2006)appeared to have been subjected to positive selection as

Table 5 Continued

Comparison Gene Symbol Gene Name Role in Cancer Function

TAL1 T-cell acute lymphocyticleukemia 1 (SCL)

Oncogene fusion Implicated in the genesis of hemopoietic malignan-cies and may play an important role in hemopoieticdifferentiationa

TNFRSF14 Tumor necrosis factor re-ceptor superfamilymember 14 (herpesvirusentry mediator)

TSG The encoded protein functions in signal transductionpathways that activate inflammatory and inhibi-tory T-cell immune responsea

TNFRSF17 Tumor necrosis factor re-ceptor superfamilymember 17

Oncogene fusion This receptor also binds to various TRAF familymembers and thus may transduce signals for cellsurvival and proliferationa

NOTEmdashTSG tumor suppressor geneaSource RefSeqbSource Cancer Hallmark from CGC

Table 6 GO Terms for Biological Processes Overrepresented by Genes with Pairwise dNdS gt 1 in the Orca Bottlenose Dolphin Comparison

GO Term Description Number of Genes Fold Enrichment P-Valuea

GO0031347 Regulation of defense response 36 270 291E-2GO0050776 Regulation of immune response 50 218 195E-3GO0002694 Regulation of leukocyte activation 31 249 189E-2GO0002275 Myeloid cell activation involved in immune response 29 253 237E-2GO0002699 Positive regulation of immune effector process 19 446 101-E3GO0042108 Positive regulation of cytokine biosynthetic process 9 687 223E-2

aAfter Bonferonni correction for multiple testing

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1755

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

well Across all the branch-site models positively selectedgenes represented multiple functional categories relevant tocancer and Petorsquos Paradox

Among the cancer-related genes subjected to positiveselection in cetaceans we identified two with identicalamino acid changes among disparate taxa united by thetraits of large body size andor extreme longevitySpecifically PRDM13 is a tumor suppressor gene that actsas a transcriptional repressor and we found identical DEamino acid substitutions in this gene in sperm whale dol-phin orca and humpback but also manatee (Trichechusmanatus) and African elephant (Loxodonta africana) whichare large-bodied afrotherian mammals that have been thefocus of cancer suppression research (Abegglen et al 2015Sulak et al 2016) Secondly POLE is a cancer-related genethat participates in DNA repair and replication and weobserved one IV substitution shared among orca dol-phin bowhead humpback and common minke whalebut also elephant as well as a second IV substitutionshared with these cetaceans and the little brown bat(Myotis lucifugus) Vesper bats such as M lucifugus areknown for their exceptional longevity relative to theirbody size and have been proposed as model organisms insenescence and cancer research (Foley et al 2018) Parallelchanges in cancer-related genes across these

phylogenetically distinct mammals suggest natural selectionhas acted on similar pathways that limit neoplastic progres-sion in large and long-lived species (Tollis et al 2017)

Petorsquos Paradox and Cancer in Whales and Other LargeMammalsLarge body size has evolved numerous times in mammalsand although it is exemplified in some extant cetaceans gi-gantism is also found in afrotherians perissodactyls and car-nivores (Baker et al 2015) Our results suggest that cancersuppression in large and long-lived mammals has also evolvednumerous times However none of these species iscompletely immune to cancer Elephants have at least a 5lifetime risk of cancer mortality (Abegglen et al 2015) whichis far less than humans but detecting cancer and estimatingcancer incidence and mortality rates in wild cetaceans is morechallenging Mathematical modeling predicting the lifetimerisk of colorectal cancer in mice and humans yielded a rate ofcolorectal cancer at 50 in blue whales by age 50 and 100by age 90 (Caulin et al 2015) This high rate of cancer mor-tality is an unlikely scenario and taken with our genomicresults presented here it suggests that cetaceans have evolvedmechanisms to limit their overall risk of cancer Among ba-leen whales benign neoplasms of the skin tongue and centralnervous system have been reported in humpback whales and

Orca

Minke whale

Cow

Bowhead whale

Bottlenose dolphin

Humpback whale

Sperm whale54

12

84

74

212A

ATRT Suppression of growth escaping programmedcell death change of cellular energetics

Humpback whale

BCORL1TF Invasion and metastasis

PICALMT Change of cellular energetics

PRDM2T Regulation of TP53 expression and degredation

TPRF Involved in the activation of oncogenic kinases

AMER1T Promotes suppression of growth

CCNB1IP1T Suppresses invasion and metastasis

POLET Participates in DNA repair and DNA replication

Cetacea ancestor

KMT2CT Methylates rsquoLys-4rsquo of histone H3

POU2AF1 Promotes proliferative signalingO

PRDM1T Regulation of TP53 expression anddegradation

ERCC5T Excision repair DNA double-strand breakrepair following UV damage

Mysticeti ancestor

PMLT Positively regulates TP53 by inhibiting MDM2-dependent degredation

Odontoceti ancestor

MUC1F Overexpressed in epithelial tumors

Minke and humpback whale ancestor

RETO Plays a role in cell differentiation and growth

C COSMIC GENES

anatomicalstructure

formation involvedin morphogenesis

autophagy

cellproliferation

cellminuscellsignaling

circulatorysystemprocess

growth

pigmentation

proteinmaturation

ribonucleoproteincomplexassembly

symbiosisencompassing

mutualismthrough

parasitismvacuolartransport

B

20

FIG 4 Positively selected genes during cetacean evolution (A) Species tree relationships of six modern cetaceans with complete genomeassemblies estimated from 152 single-copy orthologs Branch lengths are given in coalescent units Outgroup taxa are not shown The completespecies tree of 28 mammals is shown in supplementary figure 4 Supplementary Material online Boxes with numbers indicate the number ofpositively selected genes passing filters and a Bonferroni correction detected on each branch (B) TreeMap from REVIGO for GO biologicalprocesses terms represented by genes evolving under positive selection across all cetaceans Rectangle size reflects semantic uniqueness of GOterm which measures the degree to which the term is an outlier when compared semantically to the whole list of GO terms (C) Cancer gene namesand functions from COSMIC found to be evolving under positive selection in the cetacean branch-site models Superscripts for gene namesindicate as follows T tumor suppressor gene O oncogene F fusion gene Asterisks indicate P-value following FDR correction for multiple testingPlt 001 Plt 0001 Plt 00001

Tollis et al doi101093molbevmsz099 MBE

1756

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

ovarian carcinomas and lymphomas have been detected infin whales (Newman and Smith 2006) Among smaller ceta-ceans one unusually well-documented case study concludedthat 27 of beluga whales (Delphinapterus leucas) found deadin the St Lawrence estuary had cancer which may have con-tributed to 18 of the total mortality in that population(Martineau et al 2002) The authors suggested that thehigh degree of polycyclic aromatic hydrocarbons releasedinto the estuary by nearby industry may have contributedto this elevated cancer risk (Martineau et al 2002) Bycontrast the larger baleen whale species in the Gulf ofSt Lawrence appear to have lower contaminant burdenslikely due to ecological differences (Gauthier et al 1997)Interestingly unlike in human cells homologous recom-bination is uninhibited in North Atlantic right whale lungcells following prolonged exposure to the human lungcarcinogen particulate hexavalent chromium (Browninget al 2017) suggesting adaptations for high-fidelity DNArepair in whales

In this study we provide a de novo reference assemblyfor the humpback whalemdashone of the more well-studiedgiants living on Earth today The humpback whale genomeassembly is highly contiguous and contains a comparablenumber of orthologous genes to other mammalian ge-nome projects Our comparisons with other complete ce-tacean genomes confirm the results of other studies whichconcluded that rorqual whales likely began diversifyingduring the Miocene (Slater et al 2017 Arnason et al2018) We found indications of positive selection onmany protein-coding genes suggestive of adaptive changein pathways controlling the mammalian appendicular andcranial skeletal elements which are relevant to highly spe-cialized cetacean phenotypes as well as in many immunitygenes and pathways that are known to place checks onneoplastic progression LSDs in cetacean genomes containmany genes involved in the control of apoptosis includingknown tumor suppressor genes and skin transcriptomeresults from humpback whale suggest many gene duplica-tions whether through segmental duplication or retro-transposition are transcribed and hence likely functionalWe also use genome-wide evidence to show that germlinemutation rates may be slower in cetaceans than in othermammals which has been suggested in previous studies(Jackson et al 2009) and we suggest as a corollary thatcetacean somatic mutations rates may be lower as wellThese results are consistent with predictions stemmingfrom Petorsquos Paradox (Peto et al 1975 Caulin and Maley2011) which posited that gigantic animals have evolvedcompensatory adaptations to cope with the negativeeffects of orders of magnitude more cells and long lifespansthat increase the number of cell divisions and cancer riskover time Altogether the humpback whale genome as-sembly will aid comparative oncology research that seeksto improve therapeutic targets for human cancers as wellas provide a resource for developing useful genomicmarkers that will aid in the population management andconservation of whales

Materials and Methods

Tissue Collection and DNA ExtractionBiopsy tissue was collected from an adult female humpbackwhale (ldquoSaltrdquo NCBI BioSample SAMN1058501) in the Gulf ofMaine western North Atlantic Ocean using previously de-scribed techniques (Lambertsen 1987 Palsboslashll et al 1991) andflash frozen in liquid nitrogen We extracted DNA from skinusing the protocol for high-molecular-weight genomic DNAisolation with the DNeasy Blood and Tissue purification kit(Qiagen) Humpback whales can be individually identifiedand studied over time based on their unique ventral flukepigmentation (Katona and Whitehead 1981) Salt was specif-ically selected for this study because of her 35-year priorsighting history which is among the lengthiest and detailedfor an individual humpback whale (Center for Coastal Studiesunpublished data)

De Novo Assembly of the Humpback Whale GenomeUsing a combination of paired-end and mate-pair libraries denovo assembly was performed using Meraculous 204(Chapman et al 2011) with a kmer size of 47 Reads weretrimmed for quality sequencing adapters and mate-pairadapters using Trimmomatic (Bolger et al 2014) The genomesize of the humpback whale was estimated using the shortreads by counting the frequency of kmers of length 27 oc-curring in the 180-bp data set estimating the kmer coverageand using the following formula genome size frac14 total kmers kmer coverage

Chicago Library Preparation and SequencingFour Chicago libraries were prepared as described previously(Putnam et al 2016) Briefly for each library500 ng of high-molecular-weight genomic DNA (mean fragment lengthgt50kb) was reconstituted into chromatin in vitro and fixed withformaldehyde Fixed chromatin was digested with MboI orDpnII the 50-overhangs were repaired with biotinylatednucleotides and blunt ends were ligated After ligation cross-links were reversed and the DNA purified from protein Biotinthat was not internal to ligated fragments was removed fromthe purified DNA The DNA was then sheared to 350 bpmean fragment size and sequencing libraries were generatedusing NEBNext Ultra (New England BioLabs) enzymes andIllumina-compatible adapters Biotin-containing fragmentswere isolated using streptavidin beads before PCR enrichmentof each library The libraries were sequenced on an IlluminaHiSeq 2500 platform

Scaffolding the De Novo Assembly with HiRiseThe input de novo assembly shotgun reads and Chicagolibrary reads were used as input data for HiRise a softwarepipeline designed specifically for using Chicago data to scaf-fold genome assemblies (Putnam et al 2016) Shotgun andChicago library sequences were aligned to the draft inputassembly using a modified SNAP read mapper (httpsnapcsberkeleyedu) The separations of Chicago read pairsmapped within draft scaffolds were analyzed by HiRise toproduce a likelihood model for genomic distance between

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1757

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

read pairs and the model was used to identify putative mis-joins and to score prospective joins After scaffolding shotgunsequences were used to close gaps between contigs

Assessing the Gene Content of the Humpback WhaleAssemblyThe expected gene content of the assembly was evaluatedusing the Core Eukaryotic Genes Mapping Approach (Parraet al 2009) which searches the assembly for 458 highly con-served proteins and reports the proportion of 248 of the mosthighly conserved orthologs that are present in the assemblyWe also used the Benchmarking Universal Single CopyOrthologs (BUSCO v201 Sim~ao et al 2015) which analyzesgenome assemblies for the presence of 3023 genes conservedacross vertebrates as well as a set of 6253 genes conservedacross laurasiatherian mammals

Transcriptome Sequencing and AssemblyIn order to aid in our gene-finding efforts for the humpbackwhale genome assembly and to measure gene expression wegenerated transcripts from skin tissue by extracting total RNAusing the QIAzol Lysis Reagent (Qiagen) followed by purifi-cation on RNeasy spin columns (Qiagen) RNA integrity andquantity were determined on the Agilent 2100 Bioanalyzer(Agilent) using the manufacturerrsquos protocol The total RNAwas treated with DNase using DNase mix from the RecoverAllTotal Nucleic Acid Isolation kit (Applied BiosystemsAmbion) The RNA library was prepared and sequenced bythe Genome Technology Center at the University ofCalifornia Santa Cruz including cDNA synthesis with theOvation RNA-Seq system V2 (Nugen) and RNA amplificationas described previously (Tariq et al 2011) We used 05ndash1mgof double-stranded cDNA for library preparation shearedusing the Covaris S2 size-selected for 350ndash450 bp using au-tomated electrophoretic DNA fractionation system(LabChipXT Caliper Life Sciences) Paired-end sequencing li-braries were constructed using Illumina TruSeq DNA SamplePreparation Kit Following library construction samples werequantified using the Bioanalyzer and sequenced on theIllumina HiSeq 2000 platform to produce 2 100 bp se-quencing reads We then used Trinity (Grabherr et al 2011)to assemble the adapter-trimmed RNA-Seq reads intotranscripts

Genome AnnotationWe generated gene models for the humpback whale usingmultiple iterations of MAKER2 (Holt and Yandell 2011) whichincorporated 1) direct evidence from the Trinity-assembledtranscripts 2) homology to NCBI proteins from ten mammals(human mouse dog cow sperm whale bottlenose dolphinorca bowhead whale common minke whale and baiji) andUniProtKBSwiss-Prot (UniProt Consortium 2015) and 3) abinitio gene predictions using SNAP (11292013 release Korf2004) and Augustus v302 (Stanke et al 2008) A detaileddescription of the annotation pipeline is provided in the sup-plementary Methods Supplementary Material online Finalgene calls were annotated functionally by BlastP similarity to

UniProt proteins (UniProt Consortium 2015) with an e-valuecutoff of 1e-6

Repeat Annotation and Evolutionary AnalysisTo analyze the repetitive landscape of the humpback whalegenome we used both database and de novo modelingmethods For the database method we ran RepeatMaskerv405 (httpwwwrepeatmaskerorg accessed August 212017) (Smit et al 2015a) on the final assembly indicatingthe ldquomammaliardquo repeat library from RepBase (Jurka et al2005) For the de novo method we scanned the assemblyfor repeats using RepeatModeler v108 (httpwwwrepeat-maskerorg) (Smit et al 2015b) the results of which were thenclassified using RepeatMasker To estimate evolutionary di-vergence within repeat subfamilies in the humpback whalegenome we generated repeat-family-specific alignments andcalculated the average Kimura-2-parameter divergence fromconsensus within each family correcting for high mutationrates at CpG sites with the calcDivergenceFromAlignplRepeatMasker tool We compared the divergence profile ofhumpback whale and bowhead whale by completing parallelanalyses and the repetitive landscapes of orca and bottlenosedolphin are available from the RepeatMasker server (httpwwwrepeatmaskerorgspecies accessed August 21 2017)

Analysis of Gene Expression Using RNA-SeqSplice-wise mapping of RNA-Seq reads against the humpbackwhale genome assembly and annotation was carried out us-ing STAR v24 (Dobin et al 2013) and we counted the num-ber of reads mapping to gene annotations We also mappedthe skin RNA-Seq data to the database of annotated hump-back whale transcripts using local alignments with bowtiev225 (Langmead and Salzberg 2012) and used stringtiev134 (Pertea et al 2015) to calculate gene abundances bytranscripts per million

Analysis of Segmental Duplications in CetaceanGenomesIn order to detect LSDs in several cetacean genomes we ap-plied an approach based on depth of coverage (Alkan et al2009) To this end we used whole-genome shotgun sequencedata from the current study as well as from other cetaceangenomics projects All data were mapped against the hump-back whale reference assembly A detailed description of thesegmental duplication analysis is provided in the supplemen-tary Methods Supplementary Material online

Whole-Genome AlignmentsWe generated WGAs of 12 mammals (supplementary table 2Supplementary Material online) First we generated pairwisesyntenic alignments of each species as a query to the humangenome (hg19) as a target using LASTZ v102 (Harris 2007)followed by chaining to form gapless blocks and netting torank the highest scoring chains (Kent et al 2003) The pairwisealignments were used to construct a multiple sequence align-ment with MULTIZ v112 (Blanchette et al 2004) with humanas the reference species We filtered the MULTIZ alignment to

Tollis et al doi101093molbevmsz099 MBE

1758

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

only contain aligned blocks from at least 10 out of the 12species (81 complete)

Phylogenetic Reconstruction Using Single-CopyOrthologsWe downloaded the coding DNA sequences from 28 publiclyavailable mammalian genome assemblies (supplementary ta-ble 9 Supplementary Material online) and used VESPA(Webb et al 2017) to obtain high-confidence SGOs (supple-mentary Methods Supplementary Material online) For phy-logenetic analysis we filtered the SGO data set to include onlyloci that were represented by at least 24 out of the 28 mam-malian species (86 complete) and reconstructed each genetree using maximum likelihood in PhyML v30 (Guindon et al2010) with an HKY85 substitution model and 100 bootstrapreplicates to assess branch support The gene trees were thenbinned and used to reconstruct a species tree using the ac-curate species tree algorithm (ASTRAL-III v56 Zhang et al2018) ASTRAL utilizes the multispecies coalescent modelthat incorporates incomplete lineage sorting and finds thespecies tree stemming from bipartitions predefined by thegene trees Branch support for the species tree was assessedwith local posterior probabilities and branch lengths werepresented in coalescent units where shorter branch lengthsindicate greater gene tree discordance (Sayyari and Mirarab2016)

Rates of Molecular Evolution and Divergence TimeEstimationWe used multiple approaches on independent data sets toestimate rates of molecular evolution and the divergencetimes of the major mammalian lineages including six modernwhales with complete genome assemblies We first focusedon 4-fold degenerate (4D) sites which are positions withincodon alignments where substitutions result in no aminoacid change and can be used to approximate the neutralrate of evolution (Kumar and Subramanian 2002) We usedthe Ensembl human gene annotation to extract codingregions from the 12-mammal WGA using msa_view inPHAST v14 (Hubisz et al 2011) We reconstructed the phy-logeny with the 4D data as a single partition in RAxML v83(Stamatakis 2014) under the GTRGAMMA substitutionmodel and assessed branch support with 10000 bootstrapsRates of molecular evolution were estimated on the 4D dataset with the semiparametric PL method implemented in r8sv18 (Sanderson 2002 2003) A detailed description of the PLmethod is given in the supplementary MethodsSupplementary Material online

We also used the approximate likelihood calculation inMCMCtree (Yang and Rannala 2006) to estimate divergencetimes using independent data sets 1) the above-mentioned4D data set derived from the WGA as well as 2) a set of theSGOs that included 24 out of 28 sampled taxa (86) and waspartitioned into three codon positions We implemented theHKY85 substitution model multiple fossil-based priors (sup-plementary table 10 Supplementary Material online Mitchell1989 Benton et al 2015 Hedges et al 2015) and independentrates (ldquoclockfrac14 3rdquo) along branches All other parameters were

set as defaults For each MCMCtree analysis we ran the anal-ysis three times with different starting seeds and modified theMarkov chain Monte Carlo (MCMC) length and samplingfrequency in order to achieve proper chain convergencemonitored with Tracer v17 We achieved proper MCMCconvergence on the 4D data set after discarding the first500000 steps as burn-in and sampling every 2000 steps untilwe collected 20000 samples We achieved proper MCMCconvergence on the SGO data set after discarding the first500000 steps as burn-in and sampling every 10000 steps untilwe collected 10000 samples

Demographic AnalysisWe used the PSMC (Li and Durbin 2011) to reconstruct thepopulation history of North Atlantic humpback whales in-cluding the individual sequenced in the current study (down-sampled to 20 coverage) and a second individualsequenced at 17 coverage in Arnason et al (2018) Adetailed description of the PSMC analysis is provided in thesupplementary Methods Supplementary Material online

Nonneutral Substitution Rates in Cetacean GenomesIn order to identify genomic regions controlling cetacean-specific adaptations we used phyloP (Pollard et al 2010) todetect loci in the 12-mammal WGA that depart from neutralexpectations (see supplementary Methods SupplementaryMaterial online) We then collected accelerated regions thatoverlapped human whole gene annotations (hg19) using bed-tools intersect (Quinlan and Hall 2010) and tested for theenrichment of GO terms using the PANTHER analysis toolavailable at the Gene Ontology Consortium website (GOOntology database last accessed June 2017) (GeneOntology Consortium 2015)

Detection of Protein-Coding Genes Subjected toPositive SelectionIn order to measure selective pressures acting on protein-coding genes during cetacean evolution with an emphasison the evolution of cancer suppression we estimated theratio of nonsynonymous to synonymous substitutions (dNdS) To maximize statistical power in pairwise comparisonsgiven the number of available cetacean genomes (six lastaccessed September 2017) we implemented phylogenetictargeting (Arnold and Nunn 2010) assuming a phylogenyfrom a mammalian supertree (Fritz et al 2009) To selectgenome assemblies most suitable for assessing PetorsquosParadox we weighted scores for contrasts with a lot of changein the same direction for both body mass and maximumlongevity Trait values were taken from panTHERIA (Joneset al 2009) and we selected maximal pairings based on thestandardized summed scores We then generated pairwisegenome alignments as described above based on the phylo-genetic targeting results For each pairwise genome align-ment we stitched gene blocks in Galaxy (Blankenberg et al2011) according to the target genome annotations produc-ing alignments of one-to-one orthologs which were filteredto delete frameshift mutations and replace internal stopcodons with gaps We then estimated pairwise dNdS for every

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1759

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

orthologous gene pair with KaKs_Calculator v20 (Wang et al2010) To link genes with dNdS gt1 to potential phenotypeswe used orthologous human Ensembl gene IDs to collect GOterms in BioMart (Kinsella et al 2011) and tested for enrich-ment of overrepresented GO terms

We also used codon-based models to test for selectivepressure variation along branches of the cetacean phylog-eny in comparison to other mammal lineages also knownas the branch-site test (Yang 2007) First the known spe-cies phylogeny (Morgan et al 2013 Tarver et al 2016) waspruned to correspond to the species present in each SGOfamily SGO nucleotide alignments that contained morethan seven species were analyzed for selective pressurevariation This is to reduce the risk of detecting falsepositives (Anisimova et al 2001 2002) In general thebranch-site test is a powerful yet conservative approach(Gharib and Robinson-Rechavi 2013) although modelmisspecification and alignment errors can greatly increasethe number of false positives (Anisimova et al 20012002) Recent studies have concluded that many pub-lished inferences of adaptive evolution using the branchsite test may be affected by artifacts (Venkat et al 2018)Therefore extensive filtering is necessary in order to makereasonably sound conclusions from results of the branchsite test A detailed description of all tested models andthe filtering process are given in the supplementaryMethods Supplementary Material online In total 1152gene families were analyzed We carried out the branch-site test using PAML v44e (Yang 2007) The following fivebranches were assessed as foreground humpback whalethe most recent common ancestor (MRCA) of the com-mon minke and humpback whales MRCA of baleenwhales MRCA of toothed whales and the MRCA of allwhales (cetacean stem lineage) For each model we keptall genes that met a significance threshold of Plt 005 aftera Bonferonni correction for multiple hypothesis testingusing the total number of branch genes (five foregroundbranches1152 genes) We also corrected the raw P-val-ues from the likelihood ratio tests of every gene by theFDR correction where qfrac14 005 (Benjamini and Hochberg1995) The Bonferroni correction is more conservativethan the FDR but sufficient for multiple hypothesis test-ing of lineage-specific positive selection in these casesthe FDR results in higher probabilities of rejecting truenull hypotheses (Anisimova and Yang 2007) Thereforewe used Bonferroni-corrected results in downstreamanalyses but also report FDR-corrected P-values infigure 4C Genes were identified based on the humanortholog (Ensembl gene ID) and we performed gene an-notation enrichment analysis and functional annotationclustering with DAVID v68 (Huang et al 2009a 2009b) aswell as semantic clustering of GO terms using REVIGO(Supek et al 2011) We also searched for interactions ofpositively selected proteins using STRING v105(Szklarczyk et al 2017) with default parameters andtested for the enrichment of overrepresented GO termsas above and for associated mouse phenotypes usingmodPhEA (Weng and Liao 2017)

Data AvailabilityAll data that contributed to the results of the study are madepublicly available The genomic sequencing RNA sequencingas well as the genome assembly for humpback whale(GCA_0043293851) are available under NCBI BioProjectPRJNA509641 The gene annotation orthologous gene setspositive selection results segmental duplication annotationsand whole-genome alignments used in this study are availableat the Harvard Dataverse (httpsdoiorg107910DVNADHX1O)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsField data collection was supported by the Center for CoastalStudies and conducted under NOAA research permit 16325This work was initiated with funding from Dr Jeffrey Pearl atUC San Fransisco This work was supported in part by NIH(Grants U54 CA217376 U2C CA233254 P01 CA91955 R01CA149566 R01 CA170595 R01 CA185138 and R01CA140657) as well as CDMRP Breast Cancer ResearchProgram (Award BC132057) and the Arizona BiomedicalResearch Commission (Grant ADHS18-198847) to CCMThe findings opinions and recommendations expressedhere are those of the authors and not necessarily those ofthe universities where the research was performed or theNational Institutes of Health PJP would like to acknowledgefunding from Stockholm University and the University ofGroningen Startup funds were provided by the School ofInformatics Computing and Cyber Systems at NorthernArizona University (MT) MJOrsquoC would like to thank theUniversity of Leeds funding her 250 Great Minds UniversityAcademic Fellowship The authors thank Wensi Hao(University of Groningen) for technical support AndreaCabrera (University of Groningen) for helpful conversationsabout mutation rates and demography and Richard E Green(UC Santa Cruz) for consultations regarding genome se-quencing assembly and annotation The authors acknowl-edge Research Computing at Arizona State University (httpwwwresearchcomputingasuedu) the Advanced ResearchComputing resources at the University of Leeds and theMonsoon computing cluster at Northern ArizonaUniversity (httpsnaueduhigh-performance-computing)for providing high-performance computing and storageresources that have contributed to this study

ReferencesAbegglen LM Caulin AF Chan A Lee K Robinson R Campbell MS Kiso

WK Schmitt DL Waddell PJ Bhaskara S 2015 Potential mechanismsfor cancer resistance in elephants and comparative cellular responseto DNA damage in humans JAMA 314(17)1850ndash1860

Alkan C Kidd JM Marques-Bonet T Aksay G Antonacci F HormozdiariF Kitzman JO Baker C Malig M Mutlu O 2009 Personalized copynumber and segmental duplication maps using next-generation se-quencing Nat Genet 41(10)1061ndash1067

Tollis et al doi101093molbevmsz099 MBE

1760

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Anisimova M Bielawski JP Yang Z 2001 Accuracy and power of thelikelihood ratio test in detecting adaptive molecular evolution MolBiol Evol 18(8)1585ndash1592

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19(6)950ndash958

Anisimova M Yang Z 2007 Multiple hypothesis testing to detect line-ages under positive selection that affects only a few sites Mol BiolEvol 24(5)1219ndash1228

Arnason U Lammers F Kumar V Nilsson MA Janke A 2018 Whole-genome sequencing of the blue whale and other rorquals findssignatures for introgressive gene flow Sci Adv 4(4)eaap9873

Arnason U Widegren B 1989 Composition and chromosomal localiza-tion of cetacean highly repetitive DNA with special reference to theblue whale Balaenoptera musculus Chromosoma 98(5)323ndash329

Arnold C Nunn CL 2010 Phylogenetic targeting of research effort inevolutionary biology Am Nat 176(5)601ndash612

Baker CS Perry A Bannister JL Weinrich MT Abernethy RBCalambokidis J Lien J Lambertsen RH Ramirez JU Vasquez O1993 Abundant mitochondrial DNA variation and world-wide pop-ulation structure in humpback whales Proc Natl Acad Sci U S A90(17)8239ndash8243

Baker J Meade A Pagel M Venditti C 2015 Adaptive evolution towardlarger size in mammals Proc Natl Acad Sci U S A 112(16)5093ndash5098

Benjamini Y Hochberg Y 1995 Controlling the false discovery rate apractical and powerful approach to multiple testing J R Stat Soc57289ndash300

Benson RBJ Campione NE Carrano MT Mannion PD Sullivan CUpchurch P Evans DC 2014 Rates of dinosaur body mass evolutionindicate 170 million years of sustained ecological innovation on theavian stem lineage PLoS Biol 12(5)e1001853

Benton MJ Donoghue PCJ Asher RJ Friedman M Near TJ Vinther J2015 Constraints on the timescale of animal evolutionary historyPalaeontol Electron 181ndash106

Berta A Sumich JL Kovacs KM 2015 Marine mammals evolutionarybiology 3rd ed Cambridge (MA) Academic Press

Blanchette M Kent WJ Riemer C Elnitski L Smit AFA Roskin KMBaertsch R Rosenbloom K Clawson H Green ED 2004 Aligningmultiple genomic sequences with the threaded blockset alignerGenome Res 14(4)708ndash715

Blankenberg D Taylor J Nekrutenko A Galaxy T 2011 Making wholegenome multiple alignments usable for biologists Bioinformatics27(17)2426ndash2428

Boissinot S Sookdeo A 2016 The evolution of LINE-1 in vertebratesGenome Biol Evol 8(12)3485ndash3507

Bolger AM Lohse M Usadel B 2014 Trimmomatic a flexible trimmer forIllumina sequence data Bioinformatics 30(15)2114ndash2120

Browning CL Wise CF Wise JP 2017 Prolonged particulate chromateexposure does not inhibit homologous recombination repair inNorth Atlantic right whale (Eubalaena glacialis) lung cells ToxicolAppl Pharmacol 33118ndash23

Carbone L Harris RA Gnerre S Veeramah KR Lorente-Galdos BHuddleston J Meyer TJ Herrero J Roos C Aken B et al 2014Gibbon genome and the fast karyotype evolution of small apesNature 513(7517)195ndash201

Caulin AF Graham TA Wang L-S Maley CC 2015 Solutions to Petorsquosparadox revealed by mathematical modelling and cross-species can-cer gene analysis Philos Trans R Soc B 370(1673)20140222

Caulin AF Maley CC 2011 Petorsquos Paradox evolutionrsquos prescription forcancer prevention Trends Ecol Evol (Amst) 26(4)175ndash182

Chapman JA Ho I Sunkara S Luo S Schroth GP Rokhsar DS 2011Meraculous de novo genome assembly with short paired-end readsPLoS One 6(8)e23501

Chittleborough RG 1959 Determination of age in the humpback whaleMegaptera nodosa (Bonnaterre) Aust J Mar Freshw Res10(2)125ndash143

Clapham PJ Mead JG 1999 Megaptera novaeangliae Mamm Species6041ndash9

Clement V Dunand-Sauthier I Clarkson SG 2006 Suppression of UV-induced apoptosis by the human DNA repair protein XPG CellDeath Differ 13(3)478ndash488

Dobin A Davis CA Schlesinger F Drenkow J Zaleski C Jha S Batut PChaisson M Gingeras TR 2013 STAR ultrafast universal RNA-Seqaligner Bioinformatics 29(1)15ndash21

Fay JC Wu C-I 2003 Sequence divergence functional constraint andselection in protein evolution Annu Rev Genomics Hum Genet4213ndash235

Ferris E Abegglen LM Schiffman JD Gregg C 2018 Accelerated evolu-tion in distinctive species reveals candidate elements for clinicallyrelevant traits including mutation and cancer resistance Cell Rep22(10)2742ndash2755

Fog CK Galli GG Lund AH 2012 PRDM proteins important players indifferentiation and disease Bioessays 34(1)50ndash60

Foley NM Hughes GM Huang Z Clarke M Jebb D Whelan CV Petit EJTouzalin F Farcy O Jones G et al 2018 Growing old yet stayingyoung the role of telomeres in batsrsquo exceptional longevity Sci Adv4(2)eaao0926

Foote AD Liu Y Thomas GWC Vinar T Alfoldi J Deng J Dugan Svan Elk CE Hunter ME Joshi V et al 2015 Convergent evolu-tion of the genomes of marine mammals Nat Genet47(3)272ndash275

Forbes SA Beare D Gunasekaran P Leung K Bindal N Boutselakis HDing M Bamford S Cole C Ward S et al 2015 COSMIC exploringthe worldrsquos knowledge of somatic mutations in human cancerNucleic Acids Res 43(D1)D805ndashD811

Fritz SA Bininda-Emonds ORP Purvis A 2009 Geographical variation inpredictors of mammalian extinction risk big is bad but only in thetropics Ecol Lett 12(6)538ndash549

Futreal PA Coin L Marshall M Down T Hubbard T Wooster R RahmanN Stratton MR 2004 A census of human cancer genes Nat RevCancer 4(3)177ndash183

Gabriele C Lockyer C Straley JM Jurasz CM Kato H 2010 Sightinghistory of a naturally marked humpback whale (Megaptera novaean-gliae) suggests ear plug growth layer groups are deposited annuallyMar Mam Sci 26(2)443ndash450

Gauthier JM Metcalfe CD Sears R 1997 Chlorinated organic contam-inants in blubber biopsies from northwestern Atlantic balaenopteridwhales summering in the Gulf of St Lawrence Mar Environ Res44(2)201ndash223

Gene Ontology Consortium 2015 Gene Ontology Consortium goingforward Nucleic Acids Res 43D1049ndashD1056

George JC Bada J Zeh J Scott L Brown SE OrsquoHara T Suydam R 1999Age and growth estimates of bowhead whales (Balaena mysticetus)via aspartic acid racemization Can J Zool 77(4)571ndash580

Gharib WH Robinson-Rechavi M 2013 The branch-site test of positiveselection is surprisingly robust but lacks power under synonymoussubstitution saturation and variation in GC Mol Biol Evol30(7)1675ndash1686

Grabherr MG Haas BJ Yassour M Levin JZ Thompson DA Amit IAdiconis X Fan L Raychowdhury R Zeng Q et al 2011 Full-lengthtranscriptome assembly from RNA-Seq data without a referencegenome Nat Biotechnol 29(7)644ndash652

Guindon S Dufayard J-F Lefort V Anisimova M Hordijk W Gascuel O2010 New algorithms and methods to estimate maximum-likelihood phylogenies assessing the performance of PhyML 30Syst Biol 59(3)307ndash321

Harris RS 2007 Improved pairwise alignment of genomic DNA [PhDthesis] The Pennsylvania State University State College PA USA

Hedges SB Marin J Suleski M Paymer M Kumar S 2015 Tree of lifereveals clock-like speciation and diversification Mol Biol Evol32(4)835ndash845

Heim NA Knope ML Schaal EK Wang SC Payne JL 2015 Copersquos rule inthe evolution of marine animals Science 347(6224)867ndash870

Holt C Yandell M 2011 MAKER2 an annotation pipeline and genome-database management tool for second-generation genome projectsBMC Bioinformatics 12491

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1761

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Huang DW Sherman BT Lempicki RA 2009a Bioinformatics enrich-ment tools paths toward the comprehensive functional analysis oflarge gene lists Nucleic Acids Res 37(1)1ndash13

Huang DW Sherman BT Lempicki RA 2009b Systematic and integra-tive analysis of large gene lists using DAVID bioinformatics resourcesNat Protoc 4(1)44ndash57

Hubisz MJ Pollard KS Siepel A 2011 PHAST and RPHAST phylogeneticanalysis with spacetime models Brief Bioinform 12(1)41ndash51

Hudson RR 1990 Gene genealogies and the coalescent process InFutuyama DJ Antonovics J editors Oxfords surveys in evolutionarybiology Oxford Oxford University Press p 1ndash44

Ichishima H 2016 The ethmoid and presphenoid of cetaceans JMorphol 277(12)1661ndash1674

Jackson JA Baker CS Vant M Steel DJ Medrano-Gonzalez LPalumbi SR 2009 Big and slow phylogenetic estimates of mo-lecular evolution in baleen whales (suborder Mysticeti) MolBiol Evol 26(11)2427ndash2440

Jackson JA Steel DJ Beerli P Congdon BC Olavarria C Leslie MS PomillaC Rosenbaum H Baker CS 2014 Global diversity and oceanic di-vergence of humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 281(1786)20133222

Jones KE Bielby J Cardillo M Fritz SA OrsquoDell J Orme CDL Safi KSechrest W Boakes EH Carbone C et al 2009 PanTHERIA a spe-cies-level database of life history ecology and geography of extantand recently extinct mammals Ecology 90(9)2648ndash2648

Jurka J Kapitonov VV Pavlicek A Klonowski P Kohany O WalichiewiczJ 2005 Repbase Update a database of eukaryotic repetitive ele-ments Cytogenet Genome Res 110(1ndash4)462ndash467

Kaessmann H 2010 Origins evolution and phenotypic impact of newgenes Genome Res 20(10)1313ndash1326

Kastan MB Bartek J 2004 Cell-cycle checkpoints and cancer Nature432(7015)316ndash323

Katona SK Whitehead HP 1981 Identifying humpback whales usingtheir natural markings Polar Rec 20(128)439ndash444

Keane M Semeiks J Webb AE Li YI Quesada V Craig T Madsen LB vanDam S Brawand D Marques PI et al 2015 Insights into the evolu-tion of longevity from the bowhead whale genome Cell Rep10(1)112ndash122

Kent WJ Baertsch R Hinrichs A Miller W Haussler D 2003 Evolutionrsquoscauldron duplication deletion and rearrangement in the mouseand human genomes Proc Natl Acad Sci U S A100(20)11484ndash11489

Kingsolver JG Pfennig DW 2004 Individual-level selection as acause of Copersquos rule of phyletic size increase Evolution58(7)1608ndash1612

Kinsella RJ Keuroaheuroari A Haider S Zamora J Proctor G Spudich G Almeida-King J Staines D Derwent P Kerhornou A et al 2011 EnsemblBioMarts a hub for data retrieval across taxonomic spaceDatabase (Oxford) 2011bar030

Korf I 2004 Gene finding in novel genomes BMC Bioinformatics 559Kumar S Subramanian S 2002 Mutation rates in mammalian genomes

Proc Natl Acad Sci U S A 99(2)803ndash808Lambertsen RH 1987 A biopsy system for large whales and its use for

cytogenetics J Mammal 68(2)443ndash445Langmead B Salzberg SL 2012 Fast gapped-read alignment with Bowtie

2 Nat Methods 9(4)357ndash359Larsen AH Sigurjonsson J Oslashien N Vikingsson G Palsboslashll PJ 1996

Populations genetic analysis of nuclear and mitochondrial loci inskin biopsies collected from central and northeastern NorthAtlantic humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 2631611ndash1618

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357)493ndash496

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup 2009 The Sequence AlignmentMap format andSAMtools Bioinformatics 25(16)2078ndash2079

Martineau D Lemberger K Dallaire A Labelle P Lipscomb TP Michel PMikaelian I 2002 Cancer in wildlife a case study beluga from the St

Lawrence estuary Quebec Canada Environ Health Perspect110(3)285ndash292

Mazet O Rodrıguez W Grusea S Boitard S Chikhi L 2016 On theimportance of being structured instantaneous coalescence ratesand human evolution ndash lesson for ancestral population size infer-ence Heredity 116(4)362ndash371

Michaud EJ Yoder BK 2006 The primary cilium in cell signaling andcancer Cancer Res 66(13)6463ndash6467

Milholland B Dong X Zhang L Hao X Suh Y Vijg J 2017 Differencesbetween germline and somatic mutation rates in humans and miceNat Commun 815183

Mitchell ED 1989 A new cetacean from the Late Eocene La MesetaFormation Seymour Island Antarctic Peninsula Can J Fish Aquat Sci46(12)2219ndash2235

Morgan CC Foster PG Webb AE Pisani D McInerney JO OrsquoConnell MJ2013 Heterogeneous models place the root of the placental mam-mal phylogeny Mol Biol Evol 30(9)2145ndash2156

Nadachowska-Brzyska K Burri R Smeds L Ellegren H 2016 PSMC anal-ysis of effective population sizes in molecular ecology and its appli-cation to black-and-white Ficedula flycatchers Mol Ecol25(5)1058ndash1072

Newman SJ Smith SA 2006 Marine mammal neoplasia a review VetPathol 43(6)865ndash880

Nunney L 2018 Size matters height cell number and a personrsquos risk ofcancer Proc R Soc B 285(1889)20181743

Ohsumi S 1979 Interspecies relationships among some biologicalparameters in cetaceans and estimation of the natural mortalitycoefficient of the Southern Hemisphere minke whales Rep IntWhaling Comm 29397ndash406

OrsquoLeary MA Gatesy J 2008 Impact of increased character sampling onthe phylogeny of Cetartiodactyla (Mammalia) combined analysisincluding fossils Cladistics 24397ndash442

Pagel M Venditti C Meade A 2006 Large punctuational contribution ofspeciation to evolutionary divergence at the molecular level Science314(5796)119ndash121

Palsboslashll PJ Clapham PJ Mattila DK Larsen F Sears R Siegismund HRSigurjonsson J Vasquez O Arctander P 1995 Distribution ofmtDNA haplotypes in North Atlantic humpback whales the influ-ence of behavior on population structure Mar Ecol Prog Ser1161ndash10

Palsboslashll PJ Larsen F Hansen ES 1991 Sampling of skin biopsies fromfree-raging large cetaceans in West Greenland development of newbiopsy tips and bolt designs Int Whaling Comm Spec Issue Ser13311

Palsboslashll PJ Zachariah Peery M Olsen MT Beissinger SR Berube M 2013Inferring recent historic abundance from current genetic diversityMol Ecol 22(1)22ndash40

Parra G Bradnam K Ning Z Keane T Korf I 2009 Assessing the genespace in draft genomes Nucleic Acids Res 37(1)289ndash297

Pertea M Pertea GM Antonescu CM Chang T-C Mendell JT SalzbergSL 2015 StringTie enables improved reconstruction of a transcrip-tome from RNA-Seq reads Nat Biotechnol 33(3)290ndash295

Peto R Roe FJ Lee PN Levy L Clack J 1975 Cancer and ageing in miceand men Br J Cancer 32(4)411ndash426

Pio R Corrales L Lambris JD 2014 The role of complement in tumorgrowth Adv Exp Med Biol 772229ndash262

Pollard KS Hubisz MJ Rosenbloom KR Siepel A 2010 Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 20(1)110ndash121

Putnam NH OrsquoConnell BL Stites JC Rice BJ Blanchette M Calef R TrollCJ Fields A Hartley PD Sugnet CW et al 2016 Chromosome-scaleshotgun assembly using an in vitro method for long-range linkageGenome Res 26(3)342ndash350

Quinlan AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26(6)841ndash842

Ruegg K Rosenbaum HC Anderson EC Engel M Rothschild A Baker CSPalumbi SR 2013 Long-term population size of the North Atlantichumpback whale within the context of worldwide population struc-ture Conserv Genet 14(1)103ndash114

Tollis et al doi101093molbevmsz099 MBE

1762

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Sanderson MJ 2002 Estimating absolute rates of molecular evolutionand divergence times a penalized likelihood approach Mol Biol Evol19(1)101ndash109

Sanderson MJ 2003 r8s inferring absolute rates of molecular evolutionand divergence times in the absence of a molecular clockBioinformatics 19(2)301ndash302

Sayyari E Mirarab S 2016 Fast coalescent-based computation of localbranch support from quartet frequencies Mol Biol Evol33(7)1654ndash1668

Shadat NMA Koide N Khuda II-E Dagvadorj J Tumurkhuu G Naiki YKomatsu T Yoshida T Yokochi T 2010 Retinoblastoma protein-interacting zinc finger 1 (RIZ1) regulates the proliferation of mono-cytic leukemia cells via activation of p53 Cancer Invest28(8)806ndash812

Sim~ao FA Waterhouse RM Ioannidis P Kriventseva EV Zdobnov EM2015 BUSCO assessing genome assembly and annotation com-pleteness with single-copy orthologs Bioinformatics31btv351ndashbtv3212

Slater GJ Goldbogen JA Pyenson ND 2017 Independent evolution ofbaleen whale gigantism linked to Plio-Pleistocene ocean dynamicsProc R Soc B 284(1855)20170546

Smit AFA Hubley RM Green P 2015a RepeatMasker Open-40 2013-2015 Available from httpwwwrepeatmaskerorg

Smit AFA Hubley RM Green P 2015b RepeatModeler Open-10 2008-2015 Available from httpwwwrepeatmaskerorg

Smith TD Allen J Clapham PJ Hammond PS Katona S Larsen F Lien JMattila D Palsboslashll PJ Sigurjonsson J et al 1999 An ocean-basin-widemark-recapture study of the North Atlantic humpback whale(Megaptera novaeangliae) Mar Mamm Sci 51ndash32

Stamatakis A 2014 RAxML version 8 a tool for phylogenetic analysisand post-analysis of large phylogenies Bioinformatics30(9)1312ndash1313

Stanke M Diekhans M Baertsch R Haussler D 2008 Using native andsyntenically mapped cDNA alignments to improve de novo genefinding Bioinformatics 24(5)637ndash644

Steeman ME Hebsgaard MB Fordyce RE Ho SYW Rabosky DL NielsenR Rahbek C Glenner H Soslashrensen MV Willerslev E 2009 Radiationof extant cetaceans driven by restructuring of the oceans Syst Biol58(6)573ndash585

Sulak M Fong L Mika K Chigurupati S Yon L Mongan NP Emes RDLynch VJ 2016 TP53 copy number expansion is associated with theevolution of increased body size and an enhanced DNA damageresponse in elephants Elife 51850

Supek F Bosnjak M Skunca N Smuc T 2011 REVIGO summarizes andvisualizes long lists of gene ontology terms PLoS One 6(7)e21800

Szklarczyk D Morris JH Cook H Kuhn M Wyder S Simonovic M SantosA Doncheva NT Roth A Bork P et al 2017 The STRING database in2017 quality-controlled protein-protein association networks madebroadly accessible Nucleic Acids Res 45(D1)D362ndashD368

Tacutu R Craig T Budovsky A Wuttke D Lehmann G Taranukha DCosta J Fraifeld VE de Magalh~aes JP 2012 Human Ageing GenomicResources integrated databases and tools for the biology and genet-ics of ageing Nucleic Acids Res 41(D1)D1027ndashD1033

Tariq MA Kim HJ Jejelowo O Pourmand N 2011 Whole-transcriptomeRNAseq analysis from minute amount of total RNA Nucleic AcidsRes 39(18)e120

Tarver JE Dos Reis M Mirarab S Moran RJ Parker S OrsquoReilly JE King BLOrsquoConnell MJ Asher RJ Warnow T et al 2016 The interrelationshipsof placental mammals and the limits of phylogenetic inferenceGenome Biol Evol 8(2)330ndash344

Thewissen JGM Cohn MJ Stevens LS Bajpai S Heyning J Horton WE2006 Developmental basis for hind-limb loss in dolphins and origin

of the cetacean bodyplan Proc Natl Acad Sci U S A103(22)8414ndash8418

Thewissen JGM Cooper LN Clementz MT Bajpai S Tiwari BN 2007Whales originated from aquatic artiodactyls in the Eocene epoch ofIndia Nature 450(7173)1190ndash1194

Tollis M Schiffman JD Boddy AM 2017 Evolution of cancer suppressionas revealed by mammalian comparative genomics Curr Opin GenetDev 4240ndash47

Trego KS Groesser T Davalos AR Parplys AC Zhao W Nelson MRHlaing A Shih B Rydberg B Pluth JM et al 2016 Non-catalytic rolesfor XPG with BRCA1 and BRCA2 in homologous recombination andgenome stability Mol Cell 61(4)535ndash546

UniProt Consortium 2015 UniProt a hub for protein informationNucleic Acids Res 43D204ndashD212

Valsecchi E Palsboslashll PJ Hale P Glockner-Ferrari D Ferrari M Clapham PLarsen F Mattila DK Sears R Sigurjonsson J et al 1997 Microsatellitegenetic distances between oceanic opulations of the humpbackwhale (Megaptera novaeangliae) Mol Biol Evol 14355ndash362

Venkat A Hahn MW Thornton JW 2018 Multinucleotide mutationscause false inferences of lineage-specific positive selection Nat EcolEvol 2(8)1280ndash1288

Vermeij GJ 2016 Gigantism and its implications for the history of lifePLoS One 11(1)e0146092

Wang D Zhang Y Zhang Z Zhu J Yu J 2010 KaKs_Calculator 20 atoolkit incorporating gamma-series methods and sliding windowstrategies Genomics Proteomics Bioinformatics 8(1)77ndash80

Warren WC Kuderna L Alexander A Catchen J Perez-Silva JG Lopez-Otın C Quesada V Minx P Tomlinson C Montague MJ et al 2017The novel evolution of the sperm whale genome Genome Biol Evol9(12)3260ndash3264

Webb AE Walsh TA OrsquoConnell MJ 2017 VESPA very large-scale evo-lutionary and selective pressure analyses PeerJ Comput Sci 3e118

Weng M-P Liao B-Y 2017 modPhEA model organism PhenotypeEnrichment Analysis of eukaryotic gene sets Bioinformatics33(21)3505ndash3507

Wilson Sayres MA Venditti C Pagel M Makova KD 2011 Do variationsin substitution rates and male mutation bias correlate with life-history traits A study of 32 mammalian genomes Evolution65(10)2800ndash2815

Yang Z 1998 Likelihood ratio tests for detecting positive selection andapplication to primate lysozyme evolution Mol Biol Evol15(5)568ndash573

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 24(8)1586ndash1591

Yang Z Rannala B 2006 Bayesian estimation of species divergence timesunder a molecular clock using multiple fossil calibrations with softbounds Mol Biol Evol 23(1)212ndash226

Yim H-S Cho YS Guang X Kang SG Jeong J-Y Cha S-S Oh H-M Lee J-HYang EC Kwon KK et al 2014 Minke whale genome and aquaticadaptation in cetaceans Nat Genet 46(1)88ndash92

Yu J Tian S Metheny-Barlow L Chew L-J Hayes AJ Pan H Yu G-L LiL-Y 2001 Modulation of endothelial cell growth arrest andapoptosis by vascular endothelial growth inhibitor Circ Res89(12)1161ndash1167

Zhang C Rabiee M Sayyari E Mirarab S 2018 ASTRAL-III polynomialtime species tree reconstruction from partially resolved gene treesBMC Bioinformatics 19(Suppl 6)153

Zhang Q Edwards SV 2012 The evolution of intron size in amniotes arole for powered flight Genome Biol Evol 4(10)1033ndash1043

Zhao M Kim P Mitra R Zhao J Zhao Z 2015 TSGene 20 an updatedliterature-based knowledgebase for tumor suppressor genes NucleicAcids Res 44gkv1268ndashgkD1031

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1763

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

  • msz099-TF1
  • msz099-TF2
  • msz099-TF3
  • msz099-TF4
  • msz099-TF5
  • msz099-TF6
  • msz099-TF7
  • msz099-TF8
Page 2: University of Groningen Return to the sea, get huge, …...Return to the Sea, Get Huge, Beat Cancer: An Analysis of Cetacean Genomes Including an Assembly for the Humpback Whale (Megaptera

Return to the Sea Get Huge Beat Cancer An Analysis ofCetacean Genomes Including an Assembly for the HumpbackWhale (Megaptera novaeangliae)

Marc Tollis123 Jooke Robbins4 Andrew E Webb5 Lukas FK Kuderna6 Aleah F Caulin7

Jacinda D Garcia2 Martine Berube48 Nader Pourmand9 Tomas Marques-Bonet6101112

Mary J OrsquoConnell13 Per J Palsboslashlldagger48 and Carlo C Maleydagger12

1Biodesign Institute Arizona State University Tempe AZ2School of Life Sciences Arizona State University Tempe AZ3School of Informatics Computing and Cyber Systems Northern Arizona University Flagstaff AZ4Center for Coastal Studies Provincetown MA5Center for Computational Genetics and Genomics Temple University Philadelphia PA6Instituto de Biologia Evolutiva (UPF-CSIC) PRBB Barcelona Spain7Genomics and Computational Biology Program University of Pennsylvania Philadelphia PA8Groningen Institute of Evolutionary Life Sciences University of Groningen Groningen The Netherlands9Jack Baskin School of Engineering University of California Santa Cruz Santa Cruz CA10CNAG-CRG Centre for Genomic Regulation (CRG) The Barcelona Institute of Science and Technology Barcelona Spain11Institucio Catalana de Recerca i Estudis Avancats (ICREA) Barcelona Catalonia Spain12Institut Catala de Paleontologia Miquel Crusafont Universitat Autonoma de Barcelona Edifici ICTA-ICP Barcelona Spain13Computational and Molecular Evolutionary Biology Research Group School of Life Sciences University of Nottingham NottinghamUnited KingdomdaggerThese authors contributed equally to this work

Corresponding author E-mail marctollisnauedu

Associate editor Beth Shapiro

Abstract

Cetaceans are a clade of highly specialized aquatic mammals that include the largest animals that have ever lived Thelargest whales can have1000more cells than a human with long lifespans leaving them theoretically susceptible tocancer However large-bodied and long-lived animals do not suffer higher risks of cancer mortality than humansmdashanobservation known as Petorsquos Paradox To investigate the genomic bases of gigantism and other cetacean adaptations wegenerated a de novo genome assembly for the humpback whale (Megaptera novaeangliae) and incorporated the genomesof ten cetacean species in a comparative analysis We found further evidence that rorquals (family Balaenopteridae)radiated during the Miocene or earlier and inferred that perturbations in abundance andor the interocean connectivityof North Atlantic humpback whale populations likely occurred throughout the Pleistocene Our comparative genomicresults suggest that the evolution of cetacean gigantism was accompanied by strong selection on pathways that aredirectly linked to cancer Large segmental duplications in whale genomes contained genes controlling the apoptoticpathway and genes inferred to be under accelerated evolution and positive selection in cetaceans were enriched forbiological processes such as cell cycle checkpoint cell signaling and proliferation We also inferred positive selection ongenes controlling the mammalian appendicular and cranial skeletal elements in the cetacean lineage which are relevantto extensive anatomical changes during cetacean evolution Genomic analyses shed light on the molecular mechanismsunderlying cetacean traits including gigantism and will contribute to the development of future targets for humancancer therapies

Key words cetaceans humpback whale evolution genome cancer

IntroductionCetaceans (whales dolphins and porpoises) are highly spe-cialized mammals adapted to an aquatic lifestyle Divergingfrom land-dwelling artiodactyls during the late Paleocene or

early Eocene 55 Ma (Thewissen et al 2007 OrsquoLeary andGatesy 2008) cetaceans diversified throughout theCenozoic and include two extant groups Mysticeti or thebaleen whales and Odontoceti or the toothed whales

Article

The Author(s) 2019 Published by Oxford University Press on behalf of the Society for Molecular Biology and EvolutionThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (httpcreativecommonsorglicensesby40) which permits unrestricted reuse distribution and reproduction in any medium provided the original work isproperly cited Open Access1746 Mol Biol Evol 36(8)1746ndash1763 doi101093molbevmsz099 Advance Access publication May 9 2019

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Traits evolved for life in the ocean including the loss of hindlimbs changes in skull morphology physiological adaptationsfor deep diving and underwater acoustic abilities includingecholocation make these species among the most divergedmammals from the ancestral eutherian (Berta et al 2015)One striking aspect of cetacean evolution is the large bodysizes achieved by some lineages rivaled only by the giganticterrestrial sauropod dinosaurs (Benson et al 2014) Cetaceanswere not limited by gravity in the buoyant marine environ-ment and evolved multiple giant forms exemplified today bythe largest animal that has ever lived the blue whale(Balaenoptera musculus) Based on evidence from fossils mol-ecules and historical climate data it has been hypothesizedthat oceanic upwelling during the PliocenendashPleistocene sup-ported the suspension feeding typical of modern baleenwhales allowing them to reach their gigantic sizes surprisinglyclose to the present time (Slater et al 2017)

Although the largest whales arose relatively recently largebody size has evolved multiple times throughout the historyof life (Heim et al 2015) including in 10 out of 11 mammalianorders (Baker et al 2015) Animal gigantism is therefore arecurring phenomenon that is seemingly governed by avail-able resources and natural selection (Vermeij 2016) wherepositive fitness consequences lead to repeated directionalselection toward larger bodies within populations(Kingsolver and Pfennig 2004) However there are tradeoffsassociated with large body size including a higher lifetime riskof cancer due to a greater number of somatic cell divisionsover time (Peto et al 1975 Nunney 2018) Surprisingly al-though cancer should be a body mass- and age-related dis-ease large and long-lived animals do not suffer higher cancermortality rates than smaller shorter-lived animals (Abegglenet al 2015) This is a phenomenon known as Petorsquos Paradox(Peto et al 1975) To the extent that there has been selectionfor large body size there likely has also been selection forcancer suppression mechanisms that allow an organism togrow large and successfully reproduce Recent efforts havesought to understand the genomic mechanisms responsiblefor cancer suppression in gigantic species (Abegglen et al2015 Caulin et al 2015 Keane et al 2015 Sulak et al 2016)An enhanced DNA damage response in elephant cells hasbeen attributed to20 duplications of the tumor suppressorgene TP53 in elephant genomes (Abegglen et al 2015 Sulaket al 2016) The bowhead whale (Balaena mysticetus) is alarge whale that may live more than 200 years (Georgeet al 1999) and its genome shows evidence of positive selec-tion in many cancer- and aging-associated genes includingERCC1 which is part of the DNA repair pathway (Keane et al2015) Additionally the bowhead whale genome containsduplications of the DNA repair gene PCNA as well asLAMTOR1 which helps control cellular growth (Keane et al2015) Altogether these results suggest that 1) the genomesof larger and longer-lived mammals may hold the key tomultiple mechanisms for suppressing cancer and 2) as thelargest animals on Earth whales make very promising sourcesof insight for cancer suppression research

Cetacean comparative genomics is a rapidly growing fieldwith 13 complete genome assemblies available on NCBI as of

late 2018 including the following that were available at theonset of this study the common minke whale (Balaenopteraacutorostrata) (Yim et al 2014) bottlenose dolphin (Tursiopstruncatus) orca (Orcinus orca) (Foote et al 2015) and spermwhale (Physeter macrocephalus) (Warren et al 2017) In ad-dition the Bowhead Whale Genome Resource has supportedthe genome assembly for that species since 2015 (Keane et al2015) However to date few studies have used multiple ce-tacean genomes to address questions about genetic changesthat have controlled adaptations during cetacean evolutionincluding the evolution of cancer suppression Here we pro-vide a comparative analysis that is novel in scope leveragingwhole-genome data from ten cetacean species including sixcetacean genome assemblies and a de novo genome assem-bly for the humpback whale (Megaptera novaeangliae)Humpback whales are members of the familyBalaenopteridae (rorquals) and share a recent evolutionaryhistory with other ocean giants such as the blue whale and finwhale (Balaenoptera physalus) (Arnason et al 2018) Theyhave an average adult length of more than 13 m (Claphamand Mead 1999) and a lifespan that may extend to 95 years(Chittleborough 1959 Gabriele et al 2010) making the spe-cies an excellent model for Petorsquos Paradox research

Our goals in this study were 3-fold 1) to provide a de novogenome assembly and annotation for the humpback whalethat will be useful to the cetacean research and mammaliancomparative genomics communities 2) to leverage the geno-mic resource and investigate the molecular evolution of ceta-ceans in terms of their population demographicsphylogenetic relationships and species divergence timesand the genomics underlying cetacean-specific adaptationsand 3) to determine how selective pressure variation on genesinvolved with cell cycle control cell signaling and prolifera-tion and many other pathways relevant to cancer may havecontributed to the evolution of cetacean gigantism The latterhas the potential to generate research avenues for improvinghuman cancer prevention and perhaps even therapies

Results and Discussion

Sequencing Assembly and Annotation of theHumpback Whale GenomeWe sequenced and assembled a reference genome for thehumpback whale using high-coverage paired-end and mate-pair libraries (table 1 NCBI BioProject PRJNA509641) andobtained an initial assembly that was 227 Gb in lengthwith 24319 scaffolds a contig N50 length of 125 kb and ascaffold N50 length of 198 kb Final sequence coverage forthe initial assembly was 76 assuming an estimated ge-nome size of 274 Gb from a 27-mer spectrum analysis Hi Risescaffolding using proximity ligation (Chicago) libraries(Putnam et al 2016 table 1 NCBI BioProject PRJNA509641)resulted in a final sequence coverage of 102 greatly im-proving the contiguity of the assembly by reducing the num-ber of scaffolds to 2558 and increasing the scaffold N50length 46-fold to 914 Mb (table 2) The discrepancy betweenestimated genome size and assembly length has been ob-served in other cetacean genome efforts (Keane et al 2015)

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1747

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

and is likely due to the highly repetitive nature of cetaceangenomes (Arnason and Widegren 1989)

With 95ndash96 of near-universal orthologs from OrthoDBv9 (Sim~ao et al 2015) present in the assembly as well as 97of a set of core eukaryotic genes (Parra et al 2009) the esti-mated gene content of the humpback whale genome assem-bly suggests a high-quality genome with good generepresentation (table 1) To aid in genome annotation wecarried out skin transcriptome sequencing which resulted in281642354 reads (NCBI BioProject PRJNA509641) Thesewere assembled into a transcriptome that includes 67 ofboth vertebrate and laurasiatherian orthologs and we pre-dicted 10167 protein-coding genes with likely ORFs that con-tain BLAST homology to SwissProt proteins (UniProtConsortium 2015) The large number of missing genes from

the transcriptome may be due to the small proportion ofgenes expressed in skin Therefore we also assessed homologywith ten mammalian proteomes from NCBI and the entireSwissProt database and ab initio gene predictors (seeMaterials and Methods supplementary Methods and sup-plementary fig 1 Supplementary Material online) for gene-calling The final genome annotation resulted in 24140 pro-tein-coding genes including 5446 with 50-untranslatedregions (UTRs) and 6863 with 30-UTRs We detected 15465one-to-one orthologs shared with human and 14718 withcow When we compared gene annotations across a sampleof mammalian genomes the humpback whale and bottle-nose dolphin genome assemblies had on average significantlyshorter introns (Pfrac14 004 unpaired T-test supplementary ta-ble 1 Supplementary Material online) which may in partexplain the smaller genome size of cetaceans comparedwith most other mammals (Zhang and Edwards 2012)

We estimated that between 30 and 39 of thehumpback whale genome comprised repetitive elements(table 3) Masking the assembly with a library of knownmammalian elements resulted in the identification of morerepeats than a de novo method suggesting that clade-specific repeat libraries are highly valuable when assessingrepetitive content The most abundant group of transpos-able elements in the humpback whale genome was theautonomous non-long terminal repeat (LTR) retrotranspo-sons (long interspersed nuclear elements or LINEs) whichcomprised nearly 20 of the genome most of which be-long to the LINE-1 clade as is typical of placental mammals(Boissinot and Sookdeo 2016) Large numbers of nonau-tonomous non-LTR retrotransposons in the form of shortinterspersed nuclear elements (SINEs) were also detectedin particular over 3 of the genome belonged to mamma-lian inverted repeats Although the divergence profile of denovo-derived repeat annotations in humpback whale in-cluded a decreased average genetic distance within trans-posable element subfamilies compared with the database-derived repeat landscape both repeat libraries displayed aspike in the numbers of LINE-1 and SINE retrotransposonsubfamilies near 5 divergence as did the repeat land-scapes of the bowhead whale orca and dolphin suggestingrecent retrotransposon activity in cetaceans (supplemen-tary figs 2 and 3 Supplementary Material online)

Slow DNA Substitution Rates in Cetaceans and theDivergence of Modern Whale LineagesWe computed a whole-genome alignment (WGA) of 12mammals including opossum elephant human mousedog cow sperm whale bottlenose dolphin orca bowheadwhale common minke whale and humpback whale (supple-mentary table 2 Supplementary Material online) andemployed human gene annotations to extract 2763828 ho-mologous 4-fold degenerate (4D) sites A phylogenetic anal-ysis of the 4D sites yielded the recognized evolutionaryrelationships (fig 1A) including reciprocally monophyleticMysticeti and Odontoceti When we compared the substitu-tions per site along the branches of the phylogeny we founda larger number of substitutions along the mouse

Table 1 Genomic Sequence Data Obtained for the Humpback WhaleGenome

Libraries Est Numberof Reads

Avg ReadLength

(bp)

EstDepth(total)

180 bp paired-end 1211320000 94 413300 bp paired-end 25820000 123 12500 bp paired-end 112400000 123 50600 bp paired-end 395500000 93 1342 kb mate-paired 348080000 49 6210 kb mate-paired 279000000 94 90

Subtotal for WGS libraries 2372120000 761Chicago Library 1 72000000 100 53Chicago Library 2 6000000 151 07Chicago Library 3 190000000 100 139Chicago Library 4 79000000 100 58Subtotal for Chicago Libraries 347000000 256Total for all sequence

libraries2719120000 1017

NOTEmdashWGS whole-genome shotgun

Table 2 Statistics for the Humpback Whale Genome Assembly

Feature Initial Assembly Final Assembly

Assembly length 227 Gb 227 GbContig N50 124 kb 123 kbLongest scaffold 22 Mb 294 MbNumber of scaffolds 24319 2558Scaffold N50 198 kb 914 MbScaffold N90 53 kb 235 MbScaffold L50 3214 79Scaffold L90 11681 273Percent genome

in gaps536 545

BUSCOa resultsmdashvertebrata

C 85[D 18] F 15 M 49 n 3023

BUSCOa resultsmdashlaurasiatheria

C 912[D 08] F 48 M 40 n 6253

CEGMAa results C 226 (9113) P 240 (9677)

NOTEmdashBUSCO Benchmarking Universal Single Copy Orthologs (Sim~ao et al 2015)C complete D duplicated F fragmented M missing CEGMA Core EukaryoticGenes Mapping Approach (Parra et al 2009) C complete P complete andorpartialaBUSCO and CEGMA results for final assembly only

Tollis et al doi101093molbevmsz099 MBE

1748

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

(rodent) branch supporting negative relationships betweengeneration time (Wilson Sayres et al 2011) speciation (Pagelet al 2006) and substitution rates When applying a semi-parametric penalized likelihood (PL) method to estimate sub-stitution rate variation at 4D sites across the 12 mammals we

found that cetaceans have accumulated the lowest numberof DNA substitutions per site per million years (fig 1B andsupplementary table 3 Supplementary Material online)which may be attributed to long generation times or slowermutation rates in cetaceans (Jackson et al 2009) Germlinemutation rates are related to somatic mutation rates withinspecies (Milholland et al 2017) therefore it is possible thatslow mutation rates may limit neoplastic progression andcontribute to cancer suppression in cetaceans which is aprediction of Petorsquos Paradox (Caulin and Maley 2011)

We also obtained 152 single-copy orthologs (single-geneortholog families or SGOs see Materials and Methods andsupplementary Methods Supplementary Material online)identified in at least 24 out of 28 species totaling314844 bp and reconstructed gene trees that were binnedand analyzed using a species tree method that incorporatesincomplete lineage sorting (see Materials and MethodsZhang et al 2018) The species tree topology (supplementaryfig 4 Supplementary Material online) also included full sup-port for the accepted phylogenetic relationships withinCetacea as well as within Mysticeti and Odontoceti Lowerlocal posterior probabilities for two of the internal brancheswithin laurasiatherian mammals were likely due to the exten-sive gene tree heterogeneity that has complicated phyloge-netic reconstruction of the placental mammalian lineages(Tarver et al 2016)

We estimated divergence times in a Bayesian frameworkusing the 4D and SGO data sets independently in MCMCtree(Yang and Rannala 2006) resulting in similar posterior distri-butions and parameter estimates with overlapping highestposterior densities for the estimated divergence times ofshared nodes across the 4D and SGO phylogenies (supple-mentary figs 5 and 6 and tables 4 and 5 SupplementaryMaterial online) We estimated that the time to the mostrecent common ancestor (TMRCA) of placental mammalswas 100ndash114 Ma during the late Cretaceous the TMRCA ofcow and cetaceans (Cetartiodactyla) was 52ndash65 Ma duringthe Eocene or Paleocene the TMRCA of extant cetaceans was29ndash35 Ma during the early Oligocene or late Eocene (betweenthe two data sets) the TMRCA of baleen whales was placed9ndash26 Ma in the early Miocene or middle Oligocene and theTMRCA of humpback and common minke whales (familyBalaenopteridae) was 4ndash22 Ma during the early Pliocene orthe Miocene (fig 2A)

Table 3 Repetitive Content of the Humpback Whale (Megaptera novaeangliae) Genome Estimated with a Library of Known Mammalian Repeats(RepBase) and De Novo Repeat Identification (RepeatModeler)

RepBase RepeatModeler

Repeat Type Length (bp) Genome(3885 total)

Length (bp) Genome(3025 total)

SINEs 137574621 607 75509694 333LINEs 440955223 1946 432017456 1907LTR 142117286 627 94177184 416DNA transposons 84243186 372 54015996 238Unclassified 1303231 006 4339463 019Satellites 48894580 216 197862 001Simple repeats 20779839 092 20848394 092Low complexity 4167187 018 4281173 019

A

B

FIG 1 Substitution rates in cetacean genomes (A) Maximum likeli-hood phylogeny of 12 mammals based on 2763828 fourfold degen-erate sites Branch lengths are given in terms of substitutions per siteexcept for branches with hatched lines which are shortened for visualconvenience All branches received 100 bootstrap support (B)Based on the phylogeny in (A) a comparison of the estimated DNAsubstitution rates (in terms of substitutions per site per million years)between terminal and internal cetacean branches and terminal andinternal branches of all other mammals

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1749

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

FIG 2 Timescale of humpback whale evolution (A) Species phylogeny of 28 mammals constructed from 152 orthologs and time-calibrated usingMCMCtree Branch lengths are in terms of millions of years Node bars indicate 95 highest posterior densities of divergence times Cetaceans arehighlighted in the gray box with mean estimates of divergence times included (B) The effective population size (Ne) changes over timeDemographic histories of two North Atlantic humpback whales estimated from the PSMC analysis including 100 bootstrap replicates per analysisMutation rate used was 154e-9 per year and generation time used was 215 years

Tollis et al doi101093molbevmsz099 MBE

1750

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

A Complex Demographic History of North AtlanticHumpback WhalesWe estimated the demographic history of the North Atlantichumpback whale population applying the Pairwise SequentialMarkovian Coalescent (PSMC) (Li and Durbin 2011) to theshort-insert libraries generated during this study as well assequence reads from a second North Atlantic humpbackwhale (Arnason et al 2018) (fig 2B supplementary figs 7and 8 Supplementary Material online) Consistent with thefindings of Arnason et al (2018) we estimated that the largesthumpback whale population sizes were 2 Ma during thePliocenendashPleistocene transition followed by a steady declineuntil 1 Ma The PSMC trajectories of the two humpbackwhales began to diverge 100000 years ago and the esti-mated confidence intervals from 100 bootstraps for eachPSMC analysis were nonoverlapping in the more recentbins Both humpback PSMC trajectories suggested sharp pop-ulation declines beginning 25000ndash45000 years agoHowever interpreting inferred PSMC plots of pastldquodemographicrdquo changes is nontrivial in a globally distributedspecies connected by repeated occasional gene flow such ashumpback whales (Baker et al 1993 Palsboslashll et al 1995Jackson et al 2014) The apparent changes in effective pop-ulation size may represent changes in abundance interoceanconnectivity or a combination of both (Hudson 1990 Palsboslashllet al 2013) Several genetic and genome-based studies ofcetaceans have demonstrated how past large-scale oceanicchanges have affected the evolution of cetaceans (Steemanet al 2009) including baleen whales (Arnason et al 2018)Although the population genetic structure of humpbackwhales in the North Atlantic is not fully resolved the levelof genetic divergence among areas is very low (Larsen et al1996 Valsecchi et al 1997) Therefore the difference betweenthe two humpback whale PSMC trajectories may be due torecent admixture (Baker et al 1993 Palsboslashll et al 1995 Ruegget al 2013 Jackson et al 2014) intraspecific variation andpopulation structure (Mazet et al 2016) as well as errorsdue to differences in sequence coverage (Nadachowska-Brzyska et al 2016)

Segmental Duplications in Cetacean GenomesContain Genes Involved in Apoptosis and TumorSuppressionMammalian genomes contain gene-rich segmental duplica-tions (Alkan et al 2009) which may represent a powerfulmechanism by which new biological functions can arise(Kaessmann 2010) We employed a read-mapping approachto annotate large segmental duplications (LSDs) 10 kb inthe humpback whale genome assembly and ten additionalcetaceans for which whole-genome shotgun data were avail-able (see Materials and Methods supplementary Methodsand supplementary table 6 Supplementary Material online)We found that cetacean genomes contained on average 318LSDs (656 SD) which comprised 99 Mb (618 Mb) andaveraged 31 kb in length (624 kb) We identified10128534 bp (04) of the humpback whale genome assem-bly that comprised 293 LSDs averaging 34568 bp in length

Fifty-one of the LSDs were shared across all 11 cetaceangenomes (supplementary fig 9 Supplementary Material on-line) In order to determine the potential role of segmentalduplications during the evolution of cetacean-specific pheno-types we identified 426 gene annotations that overlappedcetacean LSDs including several genes annotated for viralresponse Other genes on cetacean LSDs were involved inaging in particular DLD in the bowhead whale andKCNMB1 in the blue whale this may reflect relevant adapta-tions contributing to longevity in two of the largest andlongest-lived mammals (Ohsumi 1979 George et al 1999)Multiple tumor suppressor genes were located on cetaceanLSDs including 1) SALL4 in the sei whale 2) TGM3 andSEMA3B in the orca 3) UVRAG in the sperm whale NorthAtlantic right whale and bowhead whale and 4) PDCD5which is upregulated during apoptosis (Zhao et al 2015)and was found in LSDs of all 11 queried cetacean genomesPDCD5 pseudogenes have been identified in the human ge-nome and several Ensembl-hosted mammalian genomescontain one-to-many PDCD5 orthologs however we anno-tated only a single copy of PDCD5 in the humpback whaleassembly This suggests that in many cases gene duplicationsare collapsed during reference assembly but can be retrievedthrough shotgun read-mapping methods (Carbone et al2014) We annotated fully resolved SALL4 and UVRAG copynumber variants in the humpback whale genome assemblyand by mapping the RNA-Seq data from skin to the genomeassembly and annotation (see Materials and methods) wefound that three annotated copies of SALL4 were expressed inhumpback whale skin as were two copies of UVRAG

We also found that145 Mb (6923 kb) of each cetaceangenome consists of LSDs not found in other cetaceans mak-ing them species-specific which averaged 244 kb (6146kb) in length (supplementary table 7 and fig 9Supplementary Material online) The minke whale genomecontained the highest number of genes on its species-specificLSDs (32) After merging the LSD annotations for the twohumpback whales we identified 57 species-specific LSDs forthis species comprising 977 kb and containing nine dupli-cated genes Humpback whale-specific duplications includedthe genes PRMT2 which is involved in growth and regulationand promotes apoptosis SLC25A6 which may be responsiblefor the release of mitochondrial products that trigger apopto-sis and NOX5 which plays a role in cell growth and apoptosis(UniProt Consortium 2015) Another tumor suppressor geneTPM3 was duplicated in the humpback whale assemblybased on our gene annotation However these extranumer-ary copies of TPM3 were not annotated on any humpbackwhale LSDs lacked introns and contained mostly the sameexons suggesting retrotransposition rather than segmentalduplication as a mechanism for their copy number expansion(Kaessmann 2010) According to the RNA-Seq data all sevencopies of TPM3 are expressed in humpback whale skin

Duplications of the tumor suppressor gene TP53 havebeen inferred as evidence for cancer suppression in elephants(Abegglen et al 2015 Caulin et al 2015 Sulak et al 2016)During our initial scans for segmental duplications we no-ticed a large pileup of reads in the MAKER-annotated

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1751

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

humpback whale TP53 (data not shown) We PCR-amplifiedcloned and sequenced this region from a humpback whaleDNA sample inferring four haplotypes that differ at two bases(supplementary Methods Supplementary Material online)After manually annotating TP53 in the humpback whalewe determined that these nucleotide variants fell in noncod-ing regions of the gene one occurred upstream of the startcodon whereas the other occurred between the first andsecond coding exons Other genomic studies have concludedthat TP53 is not duplicated in cetaceans (Yim et al 2014Keane et al 2015 Sulak et al 2016) We consider the possi-bility of at least two TP53 homologs in the genome of thehumpback whale although more data are required to resolvethis Regardless cancer suppression likely arose in differentmammalian lineages via multiple molecular etiologiesOverall our results reveal several copy number expansionsin cetaceans related to immunity aging and cancer suggest-ing that cetaceans are among the large mammals that haveevolved specific adaptations related to cancer resistance

Accelerated Regions in Cetacean Genomes AreSignificantly Enriched with Pathways Relevant toCancerIn order to determine genomic loci underlying cetacean adap-tations we estimated regions in the 12-mammal WGA withelevated substitution rates that were specific to the cetaceanbranches of the mammalian phylogeny These genomicregions departed from neutral expectations in a manner con-sistent with either positive selection or relaxed purifying se-lection along the cetacean lineage (Pollard et al 2010) Wesuccessfully mapped 3260 protein-coding genes with func-tional annotations that overlap cetacean-specific acceleratedregions which were significantly enriched for Gene Ontology(GO) categories such as cell-cell signaling (GO0007267) andcell adhesion (GO0007155) (table 4) Adaptive change in cellsignaling pathways could have maintained the ability of ceta-ceans to prevent neoplastic progression as they evolved largerbody sizes Adhesion molecules are integral to the develop-ment of cancer invasion and metastasis and these resultssuggest that cetacean evolution was accompanied by selec-tion pressure changes on both intra- and extracellular inter-actions Cetacean-specific genomic regions with elevatedsubstitution rates were also significantly enriched in genesinvolved in B-cell-mediated immunity (GO0019724) likely

due to the important role of regulatory cells which modulateimmune response to not only pathogens but perhaps tumorsas well In addition cetacean-specific acceleration in regionscontrolling complement activation (GO0006956) may haveprovided better immunosurveillance against cancer and fur-ther protective measures against malignancies (Pio et al2014) We also found that accelerated regions in cetaceangenomes were significantly enriched for genes controllingsensory perception of smell (GO0007608) perhaps due tothe relaxation of purifying selection in olfactory regions whichwere found to be underrepresented in cetacean genomes(Yim et al 2014)

Selection Pressures on Protein-Coding Genes duringCetacean Evolution Point to Many CetaceanAdaptations Including Cancer SuppressionTo gain further insight into the genomic changes underlyingthe evolution of large body sizes in cetaceans we employedphylogenetic targeting to maximize statistical power in pair-wise evolutionary genomic analyses (Arnold and Nunn 2010)This resulted in maximal comparisons between 1) the orcaand the bottlenose dolphin and 2) the humpback whale andcommon minke whale Despite their relatively recent diver-gences (eg the orcabottlenose dolphin divergence is similarin age to that of the humanchimpanzee divergence seefig 2) the species pairs of common minkehumpback andorcadolphin have each undergone extremely divergent evo-lution in body size and longevity (fig 3) Humpback whalesare estimated to weigh up to four times as much as commonminke whales and are reported to have almost double thelongevity and orcas may weigh almost 20 times as much asbottlenose dolphins also with almost double the lifespan(Tacutu et al 2012) In order to offset the tradeoffs associatedwith the evolution of large body size with the addition ofmany more cells and longer lifespans since the divergence ofeach species pair we hypothesize that necessary adaptationsfor cancer suppression should be encoded in the genomes aspredicted by Petorsquos Paradox (Tollis et al 2017)

For each pairwise comparison we inferred pairwise ge-nome alignments with the common minke whale and orcagenome assemblies as targets respectively and extractedprotein-coding orthologous genes We then estimated theratio of nonsynonymous substitutions per synonymous siteto synonymous substitutions per synonymous site (dNdS) in

Table 4 GO Terms for Biological Processes Overrepresented by Genes Overlapping Genomic Regions with Elevated Substitution Rates That AreUnique to the Cetacean Lineage

Go Term Description Number of Genes Fold Enrichment P-Valuea

GO0007608 Sensory perception of smell 157 458 124E-40GO0006956 Complement activation 39 291 491E-05GO0019724 B-cell-mediated immunity 39 291 491E-05GO0032989 G-protein-coupled receptor signaling pathway 159 244 196E-17GO0042742 Defense response to bacterium 39 244 220E-03GO0009607 Response to biotic stimulus 43 209 185E-02GO0007155 Cell adhesion 101 199 174E-06GO0007267 Cellndashcell signaling 137 169 284E-05

aAfter Bonferonni correction for multiple testing

Tollis et al doi101093molbevmsz099 MBE

1752

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

order to measure selective pressures acting on each ortholo-gous gene pair during cetacean evolution A dNdSgt1 is usedto infer potentially functional amino acid changes in candi-date genes subjected to positive selection (Fay and Wu 2003)Among an estimated 435 genes with dNdS gt1 in the com-mon minkehumpback pairwise comparison we detectedeight genes belonging to the JAK-STAT signaling pathway(39-fold enrichment Pfrac14 11E-3 Fisherrsquos exact test) and seveninvolved in cytokinendashcytokine receptor interaction (41-foldenrichment Pfrac14 17E-2 Fisherrsquos exact test) suggesting positiveselection acting on pathways involved in cell proliferationThese genes included multiple members of the tumor necro-sis factor subfamily such as TNFSF15 which inhibits angiogen-esis and promotes the activation of caspases and apoptosis(Yu et al 2001) A dNdS gt1 was also detected in seven genesinvolved in the negative regulation of cell growth(GO0030308 31-fold enrichment Pfrac14 803E-3 Fisherrsquos exacttest) and five genes involved in double-strand break repair(GO0006302 40-fold enrichment Pfrac14 803E-3 Fisherrsquos exacttest) Although these results suggest the evolution of aminoacid differences since the split between common minke andhumpback whales in genes affecting cell growth proliferationand maintenance the GO category enrichment tests did notpass significance criteria after Bonferroni corrections for mul-tiple testing We found 18 genes that are mutated in cancersaccording to the COSMIC v85 database (Forbes et al 2015) inthe common minkehumpback comparison including a sub-set of five annotated as tumor suppressor genes oncogenesor fusion genes in the Cancer Gene Census (CGC Futreal et al2004) which are highlighted in table 5 The complete list ofCOSMIC genes with elevated dNdS in the pairwise compar-isons is given in supplementary table 8 SupplementaryMaterial online We detected 555 orthologous genes withdNdS gt1 in the orcadolphin comparison which are signifi-cantly enriched (after Bonferroni correction for multiple test-ing) for biological processes such as immune response cellactivation and regulation of cytokines (table 6) and 41 of

which are known cancer genes according to COSMIC andCGC (table 5 supplementary table 8 Supplementary Materialonline) These results are consistent with our accelerated re-gion analysis based on the WGA which showed acceleratedevolution in immunity pathways (above see table 4) Forinstance eight genes (CD58 CD84 KLF13 SAMSN1 CTSGGPC3 LTF and SPG21) annotated for immune system process(GO0002376) were found in cetacean-specific acceleratedgenomic regions and also had a pairwise dNdS gt1 in theorcadolphin comparison mirroring other recent genomicanalyses of immunity genes in orcas (Ferris et al 2018) Ourresults also suggest that the evolution of gigantism and longlifespans in cetaceans was accompanied by selection actingon many genes related to somatic maintenance and cellsignaling

As a more accurate assessment of selection pressure var-iation acting on protein-coding genes across cetacean evolu-tion we conducted an additional assessment of dNdS usingbranch-site codon models implemented in codeml (Yang1998) We employed extensive filtering of the branch siteresults including both false discovery rate (FDR) andBonferroni corrections for multiple testing (see Materialsand methods) and conservatively estimated that 450protein-coding genes were subjected to positive selection incetaceans These include 54 genes along the ancestral ceta-cean branch 12 along the ancestral toothed whale branch 84along the ancestral baleen whale branch 74 in the ancestor ofcommon minke and humpback whales and 212 unique tothe humpback whale branch (fig 4A) Cetacean positivelyselected genes were annotated for functions related to exten-sive changes in anatomy growth cell signaling and cell pro-liferation (fig 4B) For instance in the branch-site models forhumpback whale positively selected genes are enriched forseveral higher-level mouse limb phenotypes including thoseaffecting the limb long bones (MP0011504 15 genes FDR-corrected P-valuefrac14 0001) and more specifically the hindlimb stylopod (MP0003856 seven genes FDRfrac14 0024) or

FIG 3 Diversity in both body size and lifespan within rorqual baleen whales (Balaenopteridae) and dolphins (Delphinidae) Maximal pairings usingphylogenetic targeting (Arnold and Nunn 2010) of genome assembly-enabled cetaceans resulted in the most extreme divergence in both body sizeand lifespan between humpback whale (Megaptera novaeangliae) and common minke whale (Balaenoptera acutorostrata) within theBalaenopteridae facing right and orca (Orcinus orca) and bottlenose dolphin (Tursiops truncatus) within the Delphinidae facing left Traitdata were collected from the panTHERIA (Jones et al 2009) and AnAge (Tacutu et al 2012) databases

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1753

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Table 5 CGC Genes with dNdS gt 1 as Revealed by Pairwise Comparisons of Cetacean Genomes

Comparison Gene Symbol Gene Name Role in Cancer Function

Minkehumpback CD274 CD274 molecule TSG Plays a critical role in induction and maintenance ofimmune tolerance to selfa

ETNK1 Ethanolamine kinase 1 TSG Suppresses escaping of programmed cell deathb

IL21R Interleukin 21 receptor Fusion The ligand binding of this receptor leads to the acti-vation of multiple downstream signaling mole-cules including JAK1 JAK3 STAT1 and STAT32

MYOD1 Myogenic differentiation 1 Fusion Regulates muscle cell differentiation by inducing cellcycle arrest a prerequisite for myogenic initiationa

PHF6 PHD finger protein 6 TSG Encodes a protein with two PHD-type zinc fingerdomains indicating a potential role in transcrip-tional regulation that localizes to the nucleolusa

Orcadolphin BTG1 B-cell translocation gene 1antiproliferative

TSG fusion Member of an antiproliferative gene family that reg-ulates cell growth and differentiationa

CD274 CD274 molecule TSG fusion Plays a critical role in induction and maintenance ofimmune tolerance to selfa

FANCD2 Fanconi anemia comple-mentation group D2

TSG Suppresses genome instability and mutations pro-motes escaping programmed cell death suppressesproliferative signaling suppresses invasion andmetastasisb

FAS Fas cell surface deathreceptor

TSG Promotes cell replicative immortality promotesproliferative signaling promotes invasion and me-tastasis suppresses escaping programmed celldeathb

FGFR4 Fibroblast growth factor re-ceptor 4

Oncogene Promotes proliferative signaling promotes invasionand metastasisb

GPC3 Glypican 3 Oncogene TSG Promotes invasion and metastasis promotes sup-pression of growthb

HOXD11 Homeobox D11 Oncogene fusion The homeobox genes encode a highly conservedfamily of transcription factors that play an impor-tant role in morphogenesis in all multicellularorganismsa

HOXD13 Homeobox D13 Oncogene fusion

LASP1 LIM and SH3 protein 1 Fusion The encoded protein has been linked to metastaticbreast cancer hematopoetic tumors such as B-celllymphomas and colorectal cancera

MLF1 Myeloid leukemia factor 1 TSG fusion This gene encodes an oncoprotein which is thought toplay a role in the phenotypic determination ofhemopoetic cells Translocations between this geneand nucleophosmin have been associated withmyelodysplastic syndrome and acute myeloidleukemiaa

MYB v-myb myeloblastosis viraloncogene homolog

Oncogene fusion This gene may be aberrantly expressed or rearrangedor undergo translocation in leukemias and lym-phomas and is considered to be an oncogenea

MYD88 Myeloid differentiation pri-mary response gene (88)

Oncogene Promotes escaping programmed cell death promotesproliferative signaling promotes invasion and me-tastasis promotes tumor promotinginflammationb

NR4A3 Nuclear receptor subfamily4 group A member 3(NOR1)

Oncogene fusion Encodes a member of the steroidndashthyroid hormonendashretinoid receptor superfamily that may act as atranscriptional activatora

PALB2 Partner and localizer ofBRCA2

TSG This protein binds to and colocalizes with the breastcancer 2 early onset protein (BRCA2) in nuclear fociand likely permits the stable intranuclear localiza-tion and accumulation of BRCA2a

PML Promyelocytic leukemia TSG fusion Expression is cell-cycle related and it regulates the p53response to oncogenic signalsa

RAD21 RAD21 homolog(Schizosaccharomycespombe)

Oncogene TSG Promotes invasion and metastasis suppresses ge-nome instability and mutations suppresses escap-ing programmed cell deathb

STIL SCLTAL1 interruptinglocus

Oncogene fusion Encodes a cytoplasmic protein implicated in regula-tion of the mitotic spindle checkpoint a regulatorypathway that monitors chromosome segregationduring cell division to ensure the proper distribu-tion of chromosomes to daughter cellsa

(continued)

Tollis et al doi101093molbevmsz099 MBE

1754

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

femur (MP0000559 six genes FDRfrac14 0019) These pheno-types are reminiscent of the developmental basis of hind limbloss in cetaceans embryonic studies show that hind limbbuds are initially formed but disappear by the fifth gestationalweek (Thewissen et al 2006) Enriched mouse phenotypes arealso related to the unique cetacean axial skeleton(MP0002114 25 genes FDRfrac14 0016) most notably in theskull including craniofacial bones (MP0002116 17 genesFDRfrac14 0018) teeth (MP0002100 nine genes FDRfrac14 0004)and the presphenoid (MP0030383 three genesFDRfrac14 0003) Past analyses of the cetacean basicranial ele-ments revealed that the presphenoid was extensively modi-fied along the cetacean lineage (Ichishima 2016)

Positively selected genes unique to the humpback whalewere significantly enriched for a single biological process reg-ulation of cell cycle checkpoint (GO1901976 1857-fold en-richment Pfrac14 002 after Bonferonni correction for multipletesting) suggesting positive selection in pathways that con-trol responses to endogenous or exogenous sources of DNAdamage and limit cancer progression (Kastan and Bartek2004) We detected a significant number of proteinndashproteininteractions among humpback whale-specific positively se-lected genes (number of nodesfrac14 204 number of edgesfrac14 71expected number of edgesfrac14 51 Pfrac14 0004 supplementaryfig 10 Supplementary Material online) including genes thatare often coexpressed and involved in DNA repair DNA rep-lication and cell differentiation For instance we identifiedsignificant interactions between DNA2 which encodes a heli-case involved in the maintenance of DNA stability andWDHD1 which acts as a replication initiation factorAnother robust protein interaction network was detected

between a number of genes involved in the genesis and main-tenance of primary cilia The highest scoring functional an-notation clusters resulted in key words such as ciliopathy(seven genes) and cell projection (16 genes) and GO termssuch as cilium morphogenesis cilium assembly ciliary basalbody and centriole The primary cilia of multicellular eukar-yotes control cell proliferation by mediating cell-extrinsic sig-nals and regulating cell cycle entry and defects in ciliaryregulation are common in many cancers (Michaud andYoder 2006)

Our branch-site test results indicated that the evolution ofcetacean gigantism was accompanied by strong selection onmany pathways that are directly linked to cancer (fig 4C) Weidentified 33 genes that are mutated in human cancers(according to the COSMIC database) that were inferred assubjected to positive selection in the humpback whale line-age including the known tumor suppressor genes ATR whichis a protein kinase that senses DNA damage upon genotixicstress and activates cell cycle arrest and RECK which sup-presses metastasis (Forbes et al 2015) Multiple members ofthe PR domain-containing gene family (PRDM) evolved un-der positive selection across cetaceans including the tumorsuppressor genes PRDM1 whose truncation leads to B-cellmalignancies and PRDM2 which regulates the expressionand degradation of TP53 (Shadat et al 2010) and whoseforced expression causes apoptosis and cell cycle arrest incancer cell lines (Fog et al 2012) In baleen whales ERCC5which is a DNA repair protein that partners with BRCA1 andBRCA2 to maintain genomic stability (Trego et al 2016) andsuppresses UV-induced apoptosis (Clement et al 2006)appeared to have been subjected to positive selection as

Table 5 Continued

Comparison Gene Symbol Gene Name Role in Cancer Function

TAL1 T-cell acute lymphocyticleukemia 1 (SCL)

Oncogene fusion Implicated in the genesis of hemopoietic malignan-cies and may play an important role in hemopoieticdifferentiationa

TNFRSF14 Tumor necrosis factor re-ceptor superfamilymember 14 (herpesvirusentry mediator)

TSG The encoded protein functions in signal transductionpathways that activate inflammatory and inhibi-tory T-cell immune responsea

TNFRSF17 Tumor necrosis factor re-ceptor superfamilymember 17

Oncogene fusion This receptor also binds to various TRAF familymembers and thus may transduce signals for cellsurvival and proliferationa

NOTEmdashTSG tumor suppressor geneaSource RefSeqbSource Cancer Hallmark from CGC

Table 6 GO Terms for Biological Processes Overrepresented by Genes with Pairwise dNdS gt 1 in the Orca Bottlenose Dolphin Comparison

GO Term Description Number of Genes Fold Enrichment P-Valuea

GO0031347 Regulation of defense response 36 270 291E-2GO0050776 Regulation of immune response 50 218 195E-3GO0002694 Regulation of leukocyte activation 31 249 189E-2GO0002275 Myeloid cell activation involved in immune response 29 253 237E-2GO0002699 Positive regulation of immune effector process 19 446 101-E3GO0042108 Positive regulation of cytokine biosynthetic process 9 687 223E-2

aAfter Bonferonni correction for multiple testing

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1755

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

well Across all the branch-site models positively selectedgenes represented multiple functional categories relevant tocancer and Petorsquos Paradox

Among the cancer-related genes subjected to positiveselection in cetaceans we identified two with identicalamino acid changes among disparate taxa united by thetraits of large body size andor extreme longevitySpecifically PRDM13 is a tumor suppressor gene that actsas a transcriptional repressor and we found identical DEamino acid substitutions in this gene in sperm whale dol-phin orca and humpback but also manatee (Trichechusmanatus) and African elephant (Loxodonta africana) whichare large-bodied afrotherian mammals that have been thefocus of cancer suppression research (Abegglen et al 2015Sulak et al 2016) Secondly POLE is a cancer-related genethat participates in DNA repair and replication and weobserved one IV substitution shared among orca dol-phin bowhead humpback and common minke whalebut also elephant as well as a second IV substitutionshared with these cetaceans and the little brown bat(Myotis lucifugus) Vesper bats such as M lucifugus areknown for their exceptional longevity relative to theirbody size and have been proposed as model organisms insenescence and cancer research (Foley et al 2018) Parallelchanges in cancer-related genes across these

phylogenetically distinct mammals suggest natural selectionhas acted on similar pathways that limit neoplastic progres-sion in large and long-lived species (Tollis et al 2017)

Petorsquos Paradox and Cancer in Whales and Other LargeMammalsLarge body size has evolved numerous times in mammalsand although it is exemplified in some extant cetaceans gi-gantism is also found in afrotherians perissodactyls and car-nivores (Baker et al 2015) Our results suggest that cancersuppression in large and long-lived mammals has also evolvednumerous times However none of these species iscompletely immune to cancer Elephants have at least a 5lifetime risk of cancer mortality (Abegglen et al 2015) whichis far less than humans but detecting cancer and estimatingcancer incidence and mortality rates in wild cetaceans is morechallenging Mathematical modeling predicting the lifetimerisk of colorectal cancer in mice and humans yielded a rate ofcolorectal cancer at 50 in blue whales by age 50 and 100by age 90 (Caulin et al 2015) This high rate of cancer mor-tality is an unlikely scenario and taken with our genomicresults presented here it suggests that cetaceans have evolvedmechanisms to limit their overall risk of cancer Among ba-leen whales benign neoplasms of the skin tongue and centralnervous system have been reported in humpback whales and

Orca

Minke whale

Cow

Bowhead whale

Bottlenose dolphin

Humpback whale

Sperm whale54

12

84

74

212A

ATRT Suppression of growth escaping programmedcell death change of cellular energetics

Humpback whale

BCORL1TF Invasion and metastasis

PICALMT Change of cellular energetics

PRDM2T Regulation of TP53 expression and degredation

TPRF Involved in the activation of oncogenic kinases

AMER1T Promotes suppression of growth

CCNB1IP1T Suppresses invasion and metastasis

POLET Participates in DNA repair and DNA replication

Cetacea ancestor

KMT2CT Methylates rsquoLys-4rsquo of histone H3

POU2AF1 Promotes proliferative signalingO

PRDM1T Regulation of TP53 expression anddegradation

ERCC5T Excision repair DNA double-strand breakrepair following UV damage

Mysticeti ancestor

PMLT Positively regulates TP53 by inhibiting MDM2-dependent degredation

Odontoceti ancestor

MUC1F Overexpressed in epithelial tumors

Minke and humpback whale ancestor

RETO Plays a role in cell differentiation and growth

C COSMIC GENES

anatomicalstructure

formation involvedin morphogenesis

autophagy

cellproliferation

cellminuscellsignaling

circulatorysystemprocess

growth

pigmentation

proteinmaturation

ribonucleoproteincomplexassembly

symbiosisencompassing

mutualismthrough

parasitismvacuolartransport

B

20

FIG 4 Positively selected genes during cetacean evolution (A) Species tree relationships of six modern cetaceans with complete genomeassemblies estimated from 152 single-copy orthologs Branch lengths are given in coalescent units Outgroup taxa are not shown The completespecies tree of 28 mammals is shown in supplementary figure 4 Supplementary Material online Boxes with numbers indicate the number ofpositively selected genes passing filters and a Bonferroni correction detected on each branch (B) TreeMap from REVIGO for GO biologicalprocesses terms represented by genes evolving under positive selection across all cetaceans Rectangle size reflects semantic uniqueness of GOterm which measures the degree to which the term is an outlier when compared semantically to the whole list of GO terms (C) Cancer gene namesand functions from COSMIC found to be evolving under positive selection in the cetacean branch-site models Superscripts for gene namesindicate as follows T tumor suppressor gene O oncogene F fusion gene Asterisks indicate P-value following FDR correction for multiple testingPlt 001 Plt 0001 Plt 00001

Tollis et al doi101093molbevmsz099 MBE

1756

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

ovarian carcinomas and lymphomas have been detected infin whales (Newman and Smith 2006) Among smaller ceta-ceans one unusually well-documented case study concludedthat 27 of beluga whales (Delphinapterus leucas) found deadin the St Lawrence estuary had cancer which may have con-tributed to 18 of the total mortality in that population(Martineau et al 2002) The authors suggested that thehigh degree of polycyclic aromatic hydrocarbons releasedinto the estuary by nearby industry may have contributedto this elevated cancer risk (Martineau et al 2002) Bycontrast the larger baleen whale species in the Gulf ofSt Lawrence appear to have lower contaminant burdenslikely due to ecological differences (Gauthier et al 1997)Interestingly unlike in human cells homologous recom-bination is uninhibited in North Atlantic right whale lungcells following prolonged exposure to the human lungcarcinogen particulate hexavalent chromium (Browninget al 2017) suggesting adaptations for high-fidelity DNArepair in whales

In this study we provide a de novo reference assemblyfor the humpback whalemdashone of the more well-studiedgiants living on Earth today The humpback whale genomeassembly is highly contiguous and contains a comparablenumber of orthologous genes to other mammalian ge-nome projects Our comparisons with other complete ce-tacean genomes confirm the results of other studies whichconcluded that rorqual whales likely began diversifyingduring the Miocene (Slater et al 2017 Arnason et al2018) We found indications of positive selection onmany protein-coding genes suggestive of adaptive changein pathways controlling the mammalian appendicular andcranial skeletal elements which are relevant to highly spe-cialized cetacean phenotypes as well as in many immunitygenes and pathways that are known to place checks onneoplastic progression LSDs in cetacean genomes containmany genes involved in the control of apoptosis includingknown tumor suppressor genes and skin transcriptomeresults from humpback whale suggest many gene duplica-tions whether through segmental duplication or retro-transposition are transcribed and hence likely functionalWe also use genome-wide evidence to show that germlinemutation rates may be slower in cetaceans than in othermammals which has been suggested in previous studies(Jackson et al 2009) and we suggest as a corollary thatcetacean somatic mutations rates may be lower as wellThese results are consistent with predictions stemmingfrom Petorsquos Paradox (Peto et al 1975 Caulin and Maley2011) which posited that gigantic animals have evolvedcompensatory adaptations to cope with the negativeeffects of orders of magnitude more cells and long lifespansthat increase the number of cell divisions and cancer riskover time Altogether the humpback whale genome as-sembly will aid comparative oncology research that seeksto improve therapeutic targets for human cancers as wellas provide a resource for developing useful genomicmarkers that will aid in the population management andconservation of whales

Materials and Methods

Tissue Collection and DNA ExtractionBiopsy tissue was collected from an adult female humpbackwhale (ldquoSaltrdquo NCBI BioSample SAMN1058501) in the Gulf ofMaine western North Atlantic Ocean using previously de-scribed techniques (Lambertsen 1987 Palsboslashll et al 1991) andflash frozen in liquid nitrogen We extracted DNA from skinusing the protocol for high-molecular-weight genomic DNAisolation with the DNeasy Blood and Tissue purification kit(Qiagen) Humpback whales can be individually identifiedand studied over time based on their unique ventral flukepigmentation (Katona and Whitehead 1981) Salt was specif-ically selected for this study because of her 35-year priorsighting history which is among the lengthiest and detailedfor an individual humpback whale (Center for Coastal Studiesunpublished data)

De Novo Assembly of the Humpback Whale GenomeUsing a combination of paired-end and mate-pair libraries denovo assembly was performed using Meraculous 204(Chapman et al 2011) with a kmer size of 47 Reads weretrimmed for quality sequencing adapters and mate-pairadapters using Trimmomatic (Bolger et al 2014) The genomesize of the humpback whale was estimated using the shortreads by counting the frequency of kmers of length 27 oc-curring in the 180-bp data set estimating the kmer coverageand using the following formula genome size frac14 total kmers kmer coverage

Chicago Library Preparation and SequencingFour Chicago libraries were prepared as described previously(Putnam et al 2016) Briefly for each library500 ng of high-molecular-weight genomic DNA (mean fragment lengthgt50kb) was reconstituted into chromatin in vitro and fixed withformaldehyde Fixed chromatin was digested with MboI orDpnII the 50-overhangs were repaired with biotinylatednucleotides and blunt ends were ligated After ligation cross-links were reversed and the DNA purified from protein Biotinthat was not internal to ligated fragments was removed fromthe purified DNA The DNA was then sheared to 350 bpmean fragment size and sequencing libraries were generatedusing NEBNext Ultra (New England BioLabs) enzymes andIllumina-compatible adapters Biotin-containing fragmentswere isolated using streptavidin beads before PCR enrichmentof each library The libraries were sequenced on an IlluminaHiSeq 2500 platform

Scaffolding the De Novo Assembly with HiRiseThe input de novo assembly shotgun reads and Chicagolibrary reads were used as input data for HiRise a softwarepipeline designed specifically for using Chicago data to scaf-fold genome assemblies (Putnam et al 2016) Shotgun andChicago library sequences were aligned to the draft inputassembly using a modified SNAP read mapper (httpsnapcsberkeleyedu) The separations of Chicago read pairsmapped within draft scaffolds were analyzed by HiRise toproduce a likelihood model for genomic distance between

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1757

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

read pairs and the model was used to identify putative mis-joins and to score prospective joins After scaffolding shotgunsequences were used to close gaps between contigs

Assessing the Gene Content of the Humpback WhaleAssemblyThe expected gene content of the assembly was evaluatedusing the Core Eukaryotic Genes Mapping Approach (Parraet al 2009) which searches the assembly for 458 highly con-served proteins and reports the proportion of 248 of the mosthighly conserved orthologs that are present in the assemblyWe also used the Benchmarking Universal Single CopyOrthologs (BUSCO v201 Sim~ao et al 2015) which analyzesgenome assemblies for the presence of 3023 genes conservedacross vertebrates as well as a set of 6253 genes conservedacross laurasiatherian mammals

Transcriptome Sequencing and AssemblyIn order to aid in our gene-finding efforts for the humpbackwhale genome assembly and to measure gene expression wegenerated transcripts from skin tissue by extracting total RNAusing the QIAzol Lysis Reagent (Qiagen) followed by purifi-cation on RNeasy spin columns (Qiagen) RNA integrity andquantity were determined on the Agilent 2100 Bioanalyzer(Agilent) using the manufacturerrsquos protocol The total RNAwas treated with DNase using DNase mix from the RecoverAllTotal Nucleic Acid Isolation kit (Applied BiosystemsAmbion) The RNA library was prepared and sequenced bythe Genome Technology Center at the University ofCalifornia Santa Cruz including cDNA synthesis with theOvation RNA-Seq system V2 (Nugen) and RNA amplificationas described previously (Tariq et al 2011) We used 05ndash1mgof double-stranded cDNA for library preparation shearedusing the Covaris S2 size-selected for 350ndash450 bp using au-tomated electrophoretic DNA fractionation system(LabChipXT Caliper Life Sciences) Paired-end sequencing li-braries were constructed using Illumina TruSeq DNA SamplePreparation Kit Following library construction samples werequantified using the Bioanalyzer and sequenced on theIllumina HiSeq 2000 platform to produce 2 100 bp se-quencing reads We then used Trinity (Grabherr et al 2011)to assemble the adapter-trimmed RNA-Seq reads intotranscripts

Genome AnnotationWe generated gene models for the humpback whale usingmultiple iterations of MAKER2 (Holt and Yandell 2011) whichincorporated 1) direct evidence from the Trinity-assembledtranscripts 2) homology to NCBI proteins from ten mammals(human mouse dog cow sperm whale bottlenose dolphinorca bowhead whale common minke whale and baiji) andUniProtKBSwiss-Prot (UniProt Consortium 2015) and 3) abinitio gene predictions using SNAP (11292013 release Korf2004) and Augustus v302 (Stanke et al 2008) A detaileddescription of the annotation pipeline is provided in the sup-plementary Methods Supplementary Material online Finalgene calls were annotated functionally by BlastP similarity to

UniProt proteins (UniProt Consortium 2015) with an e-valuecutoff of 1e-6

Repeat Annotation and Evolutionary AnalysisTo analyze the repetitive landscape of the humpback whalegenome we used both database and de novo modelingmethods For the database method we ran RepeatMaskerv405 (httpwwwrepeatmaskerorg accessed August 212017) (Smit et al 2015a) on the final assembly indicatingthe ldquomammaliardquo repeat library from RepBase (Jurka et al2005) For the de novo method we scanned the assemblyfor repeats using RepeatModeler v108 (httpwwwrepeat-maskerorg) (Smit et al 2015b) the results of which were thenclassified using RepeatMasker To estimate evolutionary di-vergence within repeat subfamilies in the humpback whalegenome we generated repeat-family-specific alignments andcalculated the average Kimura-2-parameter divergence fromconsensus within each family correcting for high mutationrates at CpG sites with the calcDivergenceFromAlignplRepeatMasker tool We compared the divergence profile ofhumpback whale and bowhead whale by completing parallelanalyses and the repetitive landscapes of orca and bottlenosedolphin are available from the RepeatMasker server (httpwwwrepeatmaskerorgspecies accessed August 21 2017)

Analysis of Gene Expression Using RNA-SeqSplice-wise mapping of RNA-Seq reads against the humpbackwhale genome assembly and annotation was carried out us-ing STAR v24 (Dobin et al 2013) and we counted the num-ber of reads mapping to gene annotations We also mappedthe skin RNA-Seq data to the database of annotated hump-back whale transcripts using local alignments with bowtiev225 (Langmead and Salzberg 2012) and used stringtiev134 (Pertea et al 2015) to calculate gene abundances bytranscripts per million

Analysis of Segmental Duplications in CetaceanGenomesIn order to detect LSDs in several cetacean genomes we ap-plied an approach based on depth of coverage (Alkan et al2009) To this end we used whole-genome shotgun sequencedata from the current study as well as from other cetaceangenomics projects All data were mapped against the hump-back whale reference assembly A detailed description of thesegmental duplication analysis is provided in the supplemen-tary Methods Supplementary Material online

Whole-Genome AlignmentsWe generated WGAs of 12 mammals (supplementary table 2Supplementary Material online) First we generated pairwisesyntenic alignments of each species as a query to the humangenome (hg19) as a target using LASTZ v102 (Harris 2007)followed by chaining to form gapless blocks and netting torank the highest scoring chains (Kent et al 2003) The pairwisealignments were used to construct a multiple sequence align-ment with MULTIZ v112 (Blanchette et al 2004) with humanas the reference species We filtered the MULTIZ alignment to

Tollis et al doi101093molbevmsz099 MBE

1758

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

only contain aligned blocks from at least 10 out of the 12species (81 complete)

Phylogenetic Reconstruction Using Single-CopyOrthologsWe downloaded the coding DNA sequences from 28 publiclyavailable mammalian genome assemblies (supplementary ta-ble 9 Supplementary Material online) and used VESPA(Webb et al 2017) to obtain high-confidence SGOs (supple-mentary Methods Supplementary Material online) For phy-logenetic analysis we filtered the SGO data set to include onlyloci that were represented by at least 24 out of the 28 mam-malian species (86 complete) and reconstructed each genetree using maximum likelihood in PhyML v30 (Guindon et al2010) with an HKY85 substitution model and 100 bootstrapreplicates to assess branch support The gene trees were thenbinned and used to reconstruct a species tree using the ac-curate species tree algorithm (ASTRAL-III v56 Zhang et al2018) ASTRAL utilizes the multispecies coalescent modelthat incorporates incomplete lineage sorting and finds thespecies tree stemming from bipartitions predefined by thegene trees Branch support for the species tree was assessedwith local posterior probabilities and branch lengths werepresented in coalescent units where shorter branch lengthsindicate greater gene tree discordance (Sayyari and Mirarab2016)

Rates of Molecular Evolution and Divergence TimeEstimationWe used multiple approaches on independent data sets toestimate rates of molecular evolution and the divergencetimes of the major mammalian lineages including six modernwhales with complete genome assemblies We first focusedon 4-fold degenerate (4D) sites which are positions withincodon alignments where substitutions result in no aminoacid change and can be used to approximate the neutralrate of evolution (Kumar and Subramanian 2002) We usedthe Ensembl human gene annotation to extract codingregions from the 12-mammal WGA using msa_view inPHAST v14 (Hubisz et al 2011) We reconstructed the phy-logeny with the 4D data as a single partition in RAxML v83(Stamatakis 2014) under the GTRGAMMA substitutionmodel and assessed branch support with 10000 bootstrapsRates of molecular evolution were estimated on the 4D dataset with the semiparametric PL method implemented in r8sv18 (Sanderson 2002 2003) A detailed description of the PLmethod is given in the supplementary MethodsSupplementary Material online

We also used the approximate likelihood calculation inMCMCtree (Yang and Rannala 2006) to estimate divergencetimes using independent data sets 1) the above-mentioned4D data set derived from the WGA as well as 2) a set of theSGOs that included 24 out of 28 sampled taxa (86) and waspartitioned into three codon positions We implemented theHKY85 substitution model multiple fossil-based priors (sup-plementary table 10 Supplementary Material online Mitchell1989 Benton et al 2015 Hedges et al 2015) and independentrates (ldquoclockfrac14 3rdquo) along branches All other parameters were

set as defaults For each MCMCtree analysis we ran the anal-ysis three times with different starting seeds and modified theMarkov chain Monte Carlo (MCMC) length and samplingfrequency in order to achieve proper chain convergencemonitored with Tracer v17 We achieved proper MCMCconvergence on the 4D data set after discarding the first500000 steps as burn-in and sampling every 2000 steps untilwe collected 20000 samples We achieved proper MCMCconvergence on the SGO data set after discarding the first500000 steps as burn-in and sampling every 10000 steps untilwe collected 10000 samples

Demographic AnalysisWe used the PSMC (Li and Durbin 2011) to reconstruct thepopulation history of North Atlantic humpback whales in-cluding the individual sequenced in the current study (down-sampled to 20 coverage) and a second individualsequenced at 17 coverage in Arnason et al (2018) Adetailed description of the PSMC analysis is provided in thesupplementary Methods Supplementary Material online

Nonneutral Substitution Rates in Cetacean GenomesIn order to identify genomic regions controlling cetacean-specific adaptations we used phyloP (Pollard et al 2010) todetect loci in the 12-mammal WGA that depart from neutralexpectations (see supplementary Methods SupplementaryMaterial online) We then collected accelerated regions thatoverlapped human whole gene annotations (hg19) using bed-tools intersect (Quinlan and Hall 2010) and tested for theenrichment of GO terms using the PANTHER analysis toolavailable at the Gene Ontology Consortium website (GOOntology database last accessed June 2017) (GeneOntology Consortium 2015)

Detection of Protein-Coding Genes Subjected toPositive SelectionIn order to measure selective pressures acting on protein-coding genes during cetacean evolution with an emphasison the evolution of cancer suppression we estimated theratio of nonsynonymous to synonymous substitutions (dNdS) To maximize statistical power in pairwise comparisonsgiven the number of available cetacean genomes (six lastaccessed September 2017) we implemented phylogenetictargeting (Arnold and Nunn 2010) assuming a phylogenyfrom a mammalian supertree (Fritz et al 2009) To selectgenome assemblies most suitable for assessing PetorsquosParadox we weighted scores for contrasts with a lot of changein the same direction for both body mass and maximumlongevity Trait values were taken from panTHERIA (Joneset al 2009) and we selected maximal pairings based on thestandardized summed scores We then generated pairwisegenome alignments as described above based on the phylo-genetic targeting results For each pairwise genome align-ment we stitched gene blocks in Galaxy (Blankenberg et al2011) according to the target genome annotations produc-ing alignments of one-to-one orthologs which were filteredto delete frameshift mutations and replace internal stopcodons with gaps We then estimated pairwise dNdS for every

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1759

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

orthologous gene pair with KaKs_Calculator v20 (Wang et al2010) To link genes with dNdS gt1 to potential phenotypeswe used orthologous human Ensembl gene IDs to collect GOterms in BioMart (Kinsella et al 2011) and tested for enrich-ment of overrepresented GO terms

We also used codon-based models to test for selectivepressure variation along branches of the cetacean phylog-eny in comparison to other mammal lineages also knownas the branch-site test (Yang 2007) First the known spe-cies phylogeny (Morgan et al 2013 Tarver et al 2016) waspruned to correspond to the species present in each SGOfamily SGO nucleotide alignments that contained morethan seven species were analyzed for selective pressurevariation This is to reduce the risk of detecting falsepositives (Anisimova et al 2001 2002) In general thebranch-site test is a powerful yet conservative approach(Gharib and Robinson-Rechavi 2013) although modelmisspecification and alignment errors can greatly increasethe number of false positives (Anisimova et al 20012002) Recent studies have concluded that many pub-lished inferences of adaptive evolution using the branchsite test may be affected by artifacts (Venkat et al 2018)Therefore extensive filtering is necessary in order to makereasonably sound conclusions from results of the branchsite test A detailed description of all tested models andthe filtering process are given in the supplementaryMethods Supplementary Material online In total 1152gene families were analyzed We carried out the branch-site test using PAML v44e (Yang 2007) The following fivebranches were assessed as foreground humpback whalethe most recent common ancestor (MRCA) of the com-mon minke and humpback whales MRCA of baleenwhales MRCA of toothed whales and the MRCA of allwhales (cetacean stem lineage) For each model we keptall genes that met a significance threshold of Plt 005 aftera Bonferonni correction for multiple hypothesis testingusing the total number of branch genes (five foregroundbranches1152 genes) We also corrected the raw P-val-ues from the likelihood ratio tests of every gene by theFDR correction where qfrac14 005 (Benjamini and Hochberg1995) The Bonferroni correction is more conservativethan the FDR but sufficient for multiple hypothesis test-ing of lineage-specific positive selection in these casesthe FDR results in higher probabilities of rejecting truenull hypotheses (Anisimova and Yang 2007) Thereforewe used Bonferroni-corrected results in downstreamanalyses but also report FDR-corrected P-values infigure 4C Genes were identified based on the humanortholog (Ensembl gene ID) and we performed gene an-notation enrichment analysis and functional annotationclustering with DAVID v68 (Huang et al 2009a 2009b) aswell as semantic clustering of GO terms using REVIGO(Supek et al 2011) We also searched for interactions ofpositively selected proteins using STRING v105(Szklarczyk et al 2017) with default parameters andtested for the enrichment of overrepresented GO termsas above and for associated mouse phenotypes usingmodPhEA (Weng and Liao 2017)

Data AvailabilityAll data that contributed to the results of the study are madepublicly available The genomic sequencing RNA sequencingas well as the genome assembly for humpback whale(GCA_0043293851) are available under NCBI BioProjectPRJNA509641 The gene annotation orthologous gene setspositive selection results segmental duplication annotationsand whole-genome alignments used in this study are availableat the Harvard Dataverse (httpsdoiorg107910DVNADHX1O)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsField data collection was supported by the Center for CoastalStudies and conducted under NOAA research permit 16325This work was initiated with funding from Dr Jeffrey Pearl atUC San Fransisco This work was supported in part by NIH(Grants U54 CA217376 U2C CA233254 P01 CA91955 R01CA149566 R01 CA170595 R01 CA185138 and R01CA140657) as well as CDMRP Breast Cancer ResearchProgram (Award BC132057) and the Arizona BiomedicalResearch Commission (Grant ADHS18-198847) to CCMThe findings opinions and recommendations expressedhere are those of the authors and not necessarily those ofthe universities where the research was performed or theNational Institutes of Health PJP would like to acknowledgefunding from Stockholm University and the University ofGroningen Startup funds were provided by the School ofInformatics Computing and Cyber Systems at NorthernArizona University (MT) MJOrsquoC would like to thank theUniversity of Leeds funding her 250 Great Minds UniversityAcademic Fellowship The authors thank Wensi Hao(University of Groningen) for technical support AndreaCabrera (University of Groningen) for helpful conversationsabout mutation rates and demography and Richard E Green(UC Santa Cruz) for consultations regarding genome se-quencing assembly and annotation The authors acknowl-edge Research Computing at Arizona State University (httpwwwresearchcomputingasuedu) the Advanced ResearchComputing resources at the University of Leeds and theMonsoon computing cluster at Northern ArizonaUniversity (httpsnaueduhigh-performance-computing)for providing high-performance computing and storageresources that have contributed to this study

ReferencesAbegglen LM Caulin AF Chan A Lee K Robinson R Campbell MS Kiso

WK Schmitt DL Waddell PJ Bhaskara S 2015 Potential mechanismsfor cancer resistance in elephants and comparative cellular responseto DNA damage in humans JAMA 314(17)1850ndash1860

Alkan C Kidd JM Marques-Bonet T Aksay G Antonacci F HormozdiariF Kitzman JO Baker C Malig M Mutlu O 2009 Personalized copynumber and segmental duplication maps using next-generation se-quencing Nat Genet 41(10)1061ndash1067

Tollis et al doi101093molbevmsz099 MBE

1760

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Anisimova M Bielawski JP Yang Z 2001 Accuracy and power of thelikelihood ratio test in detecting adaptive molecular evolution MolBiol Evol 18(8)1585ndash1592

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19(6)950ndash958

Anisimova M Yang Z 2007 Multiple hypothesis testing to detect line-ages under positive selection that affects only a few sites Mol BiolEvol 24(5)1219ndash1228

Arnason U Lammers F Kumar V Nilsson MA Janke A 2018 Whole-genome sequencing of the blue whale and other rorquals findssignatures for introgressive gene flow Sci Adv 4(4)eaap9873

Arnason U Widegren B 1989 Composition and chromosomal localiza-tion of cetacean highly repetitive DNA with special reference to theblue whale Balaenoptera musculus Chromosoma 98(5)323ndash329

Arnold C Nunn CL 2010 Phylogenetic targeting of research effort inevolutionary biology Am Nat 176(5)601ndash612

Baker CS Perry A Bannister JL Weinrich MT Abernethy RBCalambokidis J Lien J Lambertsen RH Ramirez JU Vasquez O1993 Abundant mitochondrial DNA variation and world-wide pop-ulation structure in humpback whales Proc Natl Acad Sci U S A90(17)8239ndash8243

Baker J Meade A Pagel M Venditti C 2015 Adaptive evolution towardlarger size in mammals Proc Natl Acad Sci U S A 112(16)5093ndash5098

Benjamini Y Hochberg Y 1995 Controlling the false discovery rate apractical and powerful approach to multiple testing J R Stat Soc57289ndash300

Benson RBJ Campione NE Carrano MT Mannion PD Sullivan CUpchurch P Evans DC 2014 Rates of dinosaur body mass evolutionindicate 170 million years of sustained ecological innovation on theavian stem lineage PLoS Biol 12(5)e1001853

Benton MJ Donoghue PCJ Asher RJ Friedman M Near TJ Vinther J2015 Constraints on the timescale of animal evolutionary historyPalaeontol Electron 181ndash106

Berta A Sumich JL Kovacs KM 2015 Marine mammals evolutionarybiology 3rd ed Cambridge (MA) Academic Press

Blanchette M Kent WJ Riemer C Elnitski L Smit AFA Roskin KMBaertsch R Rosenbloom K Clawson H Green ED 2004 Aligningmultiple genomic sequences with the threaded blockset alignerGenome Res 14(4)708ndash715

Blankenberg D Taylor J Nekrutenko A Galaxy T 2011 Making wholegenome multiple alignments usable for biologists Bioinformatics27(17)2426ndash2428

Boissinot S Sookdeo A 2016 The evolution of LINE-1 in vertebratesGenome Biol Evol 8(12)3485ndash3507

Bolger AM Lohse M Usadel B 2014 Trimmomatic a flexible trimmer forIllumina sequence data Bioinformatics 30(15)2114ndash2120

Browning CL Wise CF Wise JP 2017 Prolonged particulate chromateexposure does not inhibit homologous recombination repair inNorth Atlantic right whale (Eubalaena glacialis) lung cells ToxicolAppl Pharmacol 33118ndash23

Carbone L Harris RA Gnerre S Veeramah KR Lorente-Galdos BHuddleston J Meyer TJ Herrero J Roos C Aken B et al 2014Gibbon genome and the fast karyotype evolution of small apesNature 513(7517)195ndash201

Caulin AF Graham TA Wang L-S Maley CC 2015 Solutions to Petorsquosparadox revealed by mathematical modelling and cross-species can-cer gene analysis Philos Trans R Soc B 370(1673)20140222

Caulin AF Maley CC 2011 Petorsquos Paradox evolutionrsquos prescription forcancer prevention Trends Ecol Evol (Amst) 26(4)175ndash182

Chapman JA Ho I Sunkara S Luo S Schroth GP Rokhsar DS 2011Meraculous de novo genome assembly with short paired-end readsPLoS One 6(8)e23501

Chittleborough RG 1959 Determination of age in the humpback whaleMegaptera nodosa (Bonnaterre) Aust J Mar Freshw Res10(2)125ndash143

Clapham PJ Mead JG 1999 Megaptera novaeangliae Mamm Species6041ndash9

Clement V Dunand-Sauthier I Clarkson SG 2006 Suppression of UV-induced apoptosis by the human DNA repair protein XPG CellDeath Differ 13(3)478ndash488

Dobin A Davis CA Schlesinger F Drenkow J Zaleski C Jha S Batut PChaisson M Gingeras TR 2013 STAR ultrafast universal RNA-Seqaligner Bioinformatics 29(1)15ndash21

Fay JC Wu C-I 2003 Sequence divergence functional constraint andselection in protein evolution Annu Rev Genomics Hum Genet4213ndash235

Ferris E Abegglen LM Schiffman JD Gregg C 2018 Accelerated evolu-tion in distinctive species reveals candidate elements for clinicallyrelevant traits including mutation and cancer resistance Cell Rep22(10)2742ndash2755

Fog CK Galli GG Lund AH 2012 PRDM proteins important players indifferentiation and disease Bioessays 34(1)50ndash60

Foley NM Hughes GM Huang Z Clarke M Jebb D Whelan CV Petit EJTouzalin F Farcy O Jones G et al 2018 Growing old yet stayingyoung the role of telomeres in batsrsquo exceptional longevity Sci Adv4(2)eaao0926

Foote AD Liu Y Thomas GWC Vinar T Alfoldi J Deng J Dugan Svan Elk CE Hunter ME Joshi V et al 2015 Convergent evolu-tion of the genomes of marine mammals Nat Genet47(3)272ndash275

Forbes SA Beare D Gunasekaran P Leung K Bindal N Boutselakis HDing M Bamford S Cole C Ward S et al 2015 COSMIC exploringthe worldrsquos knowledge of somatic mutations in human cancerNucleic Acids Res 43(D1)D805ndashD811

Fritz SA Bininda-Emonds ORP Purvis A 2009 Geographical variation inpredictors of mammalian extinction risk big is bad but only in thetropics Ecol Lett 12(6)538ndash549

Futreal PA Coin L Marshall M Down T Hubbard T Wooster R RahmanN Stratton MR 2004 A census of human cancer genes Nat RevCancer 4(3)177ndash183

Gabriele C Lockyer C Straley JM Jurasz CM Kato H 2010 Sightinghistory of a naturally marked humpback whale (Megaptera novaean-gliae) suggests ear plug growth layer groups are deposited annuallyMar Mam Sci 26(2)443ndash450

Gauthier JM Metcalfe CD Sears R 1997 Chlorinated organic contam-inants in blubber biopsies from northwestern Atlantic balaenopteridwhales summering in the Gulf of St Lawrence Mar Environ Res44(2)201ndash223

Gene Ontology Consortium 2015 Gene Ontology Consortium goingforward Nucleic Acids Res 43D1049ndashD1056

George JC Bada J Zeh J Scott L Brown SE OrsquoHara T Suydam R 1999Age and growth estimates of bowhead whales (Balaena mysticetus)via aspartic acid racemization Can J Zool 77(4)571ndash580

Gharib WH Robinson-Rechavi M 2013 The branch-site test of positiveselection is surprisingly robust but lacks power under synonymoussubstitution saturation and variation in GC Mol Biol Evol30(7)1675ndash1686

Grabherr MG Haas BJ Yassour M Levin JZ Thompson DA Amit IAdiconis X Fan L Raychowdhury R Zeng Q et al 2011 Full-lengthtranscriptome assembly from RNA-Seq data without a referencegenome Nat Biotechnol 29(7)644ndash652

Guindon S Dufayard J-F Lefort V Anisimova M Hordijk W Gascuel O2010 New algorithms and methods to estimate maximum-likelihood phylogenies assessing the performance of PhyML 30Syst Biol 59(3)307ndash321

Harris RS 2007 Improved pairwise alignment of genomic DNA [PhDthesis] The Pennsylvania State University State College PA USA

Hedges SB Marin J Suleski M Paymer M Kumar S 2015 Tree of lifereveals clock-like speciation and diversification Mol Biol Evol32(4)835ndash845

Heim NA Knope ML Schaal EK Wang SC Payne JL 2015 Copersquos rule inthe evolution of marine animals Science 347(6224)867ndash870

Holt C Yandell M 2011 MAKER2 an annotation pipeline and genome-database management tool for second-generation genome projectsBMC Bioinformatics 12491

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1761

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Huang DW Sherman BT Lempicki RA 2009a Bioinformatics enrich-ment tools paths toward the comprehensive functional analysis oflarge gene lists Nucleic Acids Res 37(1)1ndash13

Huang DW Sherman BT Lempicki RA 2009b Systematic and integra-tive analysis of large gene lists using DAVID bioinformatics resourcesNat Protoc 4(1)44ndash57

Hubisz MJ Pollard KS Siepel A 2011 PHAST and RPHAST phylogeneticanalysis with spacetime models Brief Bioinform 12(1)41ndash51

Hudson RR 1990 Gene genealogies and the coalescent process InFutuyama DJ Antonovics J editors Oxfords surveys in evolutionarybiology Oxford Oxford University Press p 1ndash44

Ichishima H 2016 The ethmoid and presphenoid of cetaceans JMorphol 277(12)1661ndash1674

Jackson JA Baker CS Vant M Steel DJ Medrano-Gonzalez LPalumbi SR 2009 Big and slow phylogenetic estimates of mo-lecular evolution in baleen whales (suborder Mysticeti) MolBiol Evol 26(11)2427ndash2440

Jackson JA Steel DJ Beerli P Congdon BC Olavarria C Leslie MS PomillaC Rosenbaum H Baker CS 2014 Global diversity and oceanic di-vergence of humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 281(1786)20133222

Jones KE Bielby J Cardillo M Fritz SA OrsquoDell J Orme CDL Safi KSechrest W Boakes EH Carbone C et al 2009 PanTHERIA a spe-cies-level database of life history ecology and geography of extantand recently extinct mammals Ecology 90(9)2648ndash2648

Jurka J Kapitonov VV Pavlicek A Klonowski P Kohany O WalichiewiczJ 2005 Repbase Update a database of eukaryotic repetitive ele-ments Cytogenet Genome Res 110(1ndash4)462ndash467

Kaessmann H 2010 Origins evolution and phenotypic impact of newgenes Genome Res 20(10)1313ndash1326

Kastan MB Bartek J 2004 Cell-cycle checkpoints and cancer Nature432(7015)316ndash323

Katona SK Whitehead HP 1981 Identifying humpback whales usingtheir natural markings Polar Rec 20(128)439ndash444

Keane M Semeiks J Webb AE Li YI Quesada V Craig T Madsen LB vanDam S Brawand D Marques PI et al 2015 Insights into the evolu-tion of longevity from the bowhead whale genome Cell Rep10(1)112ndash122

Kent WJ Baertsch R Hinrichs A Miller W Haussler D 2003 Evolutionrsquoscauldron duplication deletion and rearrangement in the mouseand human genomes Proc Natl Acad Sci U S A100(20)11484ndash11489

Kingsolver JG Pfennig DW 2004 Individual-level selection as acause of Copersquos rule of phyletic size increase Evolution58(7)1608ndash1612

Kinsella RJ Keuroaheuroari A Haider S Zamora J Proctor G Spudich G Almeida-King J Staines D Derwent P Kerhornou A et al 2011 EnsemblBioMarts a hub for data retrieval across taxonomic spaceDatabase (Oxford) 2011bar030

Korf I 2004 Gene finding in novel genomes BMC Bioinformatics 559Kumar S Subramanian S 2002 Mutation rates in mammalian genomes

Proc Natl Acad Sci U S A 99(2)803ndash808Lambertsen RH 1987 A biopsy system for large whales and its use for

cytogenetics J Mammal 68(2)443ndash445Langmead B Salzberg SL 2012 Fast gapped-read alignment with Bowtie

2 Nat Methods 9(4)357ndash359Larsen AH Sigurjonsson J Oslashien N Vikingsson G Palsboslashll PJ 1996

Populations genetic analysis of nuclear and mitochondrial loci inskin biopsies collected from central and northeastern NorthAtlantic humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 2631611ndash1618

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357)493ndash496

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup 2009 The Sequence AlignmentMap format andSAMtools Bioinformatics 25(16)2078ndash2079

Martineau D Lemberger K Dallaire A Labelle P Lipscomb TP Michel PMikaelian I 2002 Cancer in wildlife a case study beluga from the St

Lawrence estuary Quebec Canada Environ Health Perspect110(3)285ndash292

Mazet O Rodrıguez W Grusea S Boitard S Chikhi L 2016 On theimportance of being structured instantaneous coalescence ratesand human evolution ndash lesson for ancestral population size infer-ence Heredity 116(4)362ndash371

Michaud EJ Yoder BK 2006 The primary cilium in cell signaling andcancer Cancer Res 66(13)6463ndash6467

Milholland B Dong X Zhang L Hao X Suh Y Vijg J 2017 Differencesbetween germline and somatic mutation rates in humans and miceNat Commun 815183

Mitchell ED 1989 A new cetacean from the Late Eocene La MesetaFormation Seymour Island Antarctic Peninsula Can J Fish Aquat Sci46(12)2219ndash2235

Morgan CC Foster PG Webb AE Pisani D McInerney JO OrsquoConnell MJ2013 Heterogeneous models place the root of the placental mam-mal phylogeny Mol Biol Evol 30(9)2145ndash2156

Nadachowska-Brzyska K Burri R Smeds L Ellegren H 2016 PSMC anal-ysis of effective population sizes in molecular ecology and its appli-cation to black-and-white Ficedula flycatchers Mol Ecol25(5)1058ndash1072

Newman SJ Smith SA 2006 Marine mammal neoplasia a review VetPathol 43(6)865ndash880

Nunney L 2018 Size matters height cell number and a personrsquos risk ofcancer Proc R Soc B 285(1889)20181743

Ohsumi S 1979 Interspecies relationships among some biologicalparameters in cetaceans and estimation of the natural mortalitycoefficient of the Southern Hemisphere minke whales Rep IntWhaling Comm 29397ndash406

OrsquoLeary MA Gatesy J 2008 Impact of increased character sampling onthe phylogeny of Cetartiodactyla (Mammalia) combined analysisincluding fossils Cladistics 24397ndash442

Pagel M Venditti C Meade A 2006 Large punctuational contribution ofspeciation to evolutionary divergence at the molecular level Science314(5796)119ndash121

Palsboslashll PJ Clapham PJ Mattila DK Larsen F Sears R Siegismund HRSigurjonsson J Vasquez O Arctander P 1995 Distribution ofmtDNA haplotypes in North Atlantic humpback whales the influ-ence of behavior on population structure Mar Ecol Prog Ser1161ndash10

Palsboslashll PJ Larsen F Hansen ES 1991 Sampling of skin biopsies fromfree-raging large cetaceans in West Greenland development of newbiopsy tips and bolt designs Int Whaling Comm Spec Issue Ser13311

Palsboslashll PJ Zachariah Peery M Olsen MT Beissinger SR Berube M 2013Inferring recent historic abundance from current genetic diversityMol Ecol 22(1)22ndash40

Parra G Bradnam K Ning Z Keane T Korf I 2009 Assessing the genespace in draft genomes Nucleic Acids Res 37(1)289ndash297

Pertea M Pertea GM Antonescu CM Chang T-C Mendell JT SalzbergSL 2015 StringTie enables improved reconstruction of a transcrip-tome from RNA-Seq reads Nat Biotechnol 33(3)290ndash295

Peto R Roe FJ Lee PN Levy L Clack J 1975 Cancer and ageing in miceand men Br J Cancer 32(4)411ndash426

Pio R Corrales L Lambris JD 2014 The role of complement in tumorgrowth Adv Exp Med Biol 772229ndash262

Pollard KS Hubisz MJ Rosenbloom KR Siepel A 2010 Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 20(1)110ndash121

Putnam NH OrsquoConnell BL Stites JC Rice BJ Blanchette M Calef R TrollCJ Fields A Hartley PD Sugnet CW et al 2016 Chromosome-scaleshotgun assembly using an in vitro method for long-range linkageGenome Res 26(3)342ndash350

Quinlan AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26(6)841ndash842

Ruegg K Rosenbaum HC Anderson EC Engel M Rothschild A Baker CSPalumbi SR 2013 Long-term population size of the North Atlantichumpback whale within the context of worldwide population struc-ture Conserv Genet 14(1)103ndash114

Tollis et al doi101093molbevmsz099 MBE

1762

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Sanderson MJ 2002 Estimating absolute rates of molecular evolutionand divergence times a penalized likelihood approach Mol Biol Evol19(1)101ndash109

Sanderson MJ 2003 r8s inferring absolute rates of molecular evolutionand divergence times in the absence of a molecular clockBioinformatics 19(2)301ndash302

Sayyari E Mirarab S 2016 Fast coalescent-based computation of localbranch support from quartet frequencies Mol Biol Evol33(7)1654ndash1668

Shadat NMA Koide N Khuda II-E Dagvadorj J Tumurkhuu G Naiki YKomatsu T Yoshida T Yokochi T 2010 Retinoblastoma protein-interacting zinc finger 1 (RIZ1) regulates the proliferation of mono-cytic leukemia cells via activation of p53 Cancer Invest28(8)806ndash812

Sim~ao FA Waterhouse RM Ioannidis P Kriventseva EV Zdobnov EM2015 BUSCO assessing genome assembly and annotation com-pleteness with single-copy orthologs Bioinformatics31btv351ndashbtv3212

Slater GJ Goldbogen JA Pyenson ND 2017 Independent evolution ofbaleen whale gigantism linked to Plio-Pleistocene ocean dynamicsProc R Soc B 284(1855)20170546

Smit AFA Hubley RM Green P 2015a RepeatMasker Open-40 2013-2015 Available from httpwwwrepeatmaskerorg

Smit AFA Hubley RM Green P 2015b RepeatModeler Open-10 2008-2015 Available from httpwwwrepeatmaskerorg

Smith TD Allen J Clapham PJ Hammond PS Katona S Larsen F Lien JMattila D Palsboslashll PJ Sigurjonsson J et al 1999 An ocean-basin-widemark-recapture study of the North Atlantic humpback whale(Megaptera novaeangliae) Mar Mamm Sci 51ndash32

Stamatakis A 2014 RAxML version 8 a tool for phylogenetic analysisand post-analysis of large phylogenies Bioinformatics30(9)1312ndash1313

Stanke M Diekhans M Baertsch R Haussler D 2008 Using native andsyntenically mapped cDNA alignments to improve de novo genefinding Bioinformatics 24(5)637ndash644

Steeman ME Hebsgaard MB Fordyce RE Ho SYW Rabosky DL NielsenR Rahbek C Glenner H Soslashrensen MV Willerslev E 2009 Radiationof extant cetaceans driven by restructuring of the oceans Syst Biol58(6)573ndash585

Sulak M Fong L Mika K Chigurupati S Yon L Mongan NP Emes RDLynch VJ 2016 TP53 copy number expansion is associated with theevolution of increased body size and an enhanced DNA damageresponse in elephants Elife 51850

Supek F Bosnjak M Skunca N Smuc T 2011 REVIGO summarizes andvisualizes long lists of gene ontology terms PLoS One 6(7)e21800

Szklarczyk D Morris JH Cook H Kuhn M Wyder S Simonovic M SantosA Doncheva NT Roth A Bork P et al 2017 The STRING database in2017 quality-controlled protein-protein association networks madebroadly accessible Nucleic Acids Res 45(D1)D362ndashD368

Tacutu R Craig T Budovsky A Wuttke D Lehmann G Taranukha DCosta J Fraifeld VE de Magalh~aes JP 2012 Human Ageing GenomicResources integrated databases and tools for the biology and genet-ics of ageing Nucleic Acids Res 41(D1)D1027ndashD1033

Tariq MA Kim HJ Jejelowo O Pourmand N 2011 Whole-transcriptomeRNAseq analysis from minute amount of total RNA Nucleic AcidsRes 39(18)e120

Tarver JE Dos Reis M Mirarab S Moran RJ Parker S OrsquoReilly JE King BLOrsquoConnell MJ Asher RJ Warnow T et al 2016 The interrelationshipsof placental mammals and the limits of phylogenetic inferenceGenome Biol Evol 8(2)330ndash344

Thewissen JGM Cohn MJ Stevens LS Bajpai S Heyning J Horton WE2006 Developmental basis for hind-limb loss in dolphins and origin

of the cetacean bodyplan Proc Natl Acad Sci U S A103(22)8414ndash8418

Thewissen JGM Cooper LN Clementz MT Bajpai S Tiwari BN 2007Whales originated from aquatic artiodactyls in the Eocene epoch ofIndia Nature 450(7173)1190ndash1194

Tollis M Schiffman JD Boddy AM 2017 Evolution of cancer suppressionas revealed by mammalian comparative genomics Curr Opin GenetDev 4240ndash47

Trego KS Groesser T Davalos AR Parplys AC Zhao W Nelson MRHlaing A Shih B Rydberg B Pluth JM et al 2016 Non-catalytic rolesfor XPG with BRCA1 and BRCA2 in homologous recombination andgenome stability Mol Cell 61(4)535ndash546

UniProt Consortium 2015 UniProt a hub for protein informationNucleic Acids Res 43D204ndashD212

Valsecchi E Palsboslashll PJ Hale P Glockner-Ferrari D Ferrari M Clapham PLarsen F Mattila DK Sears R Sigurjonsson J et al 1997 Microsatellitegenetic distances between oceanic opulations of the humpbackwhale (Megaptera novaeangliae) Mol Biol Evol 14355ndash362

Venkat A Hahn MW Thornton JW 2018 Multinucleotide mutationscause false inferences of lineage-specific positive selection Nat EcolEvol 2(8)1280ndash1288

Vermeij GJ 2016 Gigantism and its implications for the history of lifePLoS One 11(1)e0146092

Wang D Zhang Y Zhang Z Zhu J Yu J 2010 KaKs_Calculator 20 atoolkit incorporating gamma-series methods and sliding windowstrategies Genomics Proteomics Bioinformatics 8(1)77ndash80

Warren WC Kuderna L Alexander A Catchen J Perez-Silva JG Lopez-Otın C Quesada V Minx P Tomlinson C Montague MJ et al 2017The novel evolution of the sperm whale genome Genome Biol Evol9(12)3260ndash3264

Webb AE Walsh TA OrsquoConnell MJ 2017 VESPA very large-scale evo-lutionary and selective pressure analyses PeerJ Comput Sci 3e118

Weng M-P Liao B-Y 2017 modPhEA model organism PhenotypeEnrichment Analysis of eukaryotic gene sets Bioinformatics33(21)3505ndash3507

Wilson Sayres MA Venditti C Pagel M Makova KD 2011 Do variationsin substitution rates and male mutation bias correlate with life-history traits A study of 32 mammalian genomes Evolution65(10)2800ndash2815

Yang Z 1998 Likelihood ratio tests for detecting positive selection andapplication to primate lysozyme evolution Mol Biol Evol15(5)568ndash573

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 24(8)1586ndash1591

Yang Z Rannala B 2006 Bayesian estimation of species divergence timesunder a molecular clock using multiple fossil calibrations with softbounds Mol Biol Evol 23(1)212ndash226

Yim H-S Cho YS Guang X Kang SG Jeong J-Y Cha S-S Oh H-M Lee J-HYang EC Kwon KK et al 2014 Minke whale genome and aquaticadaptation in cetaceans Nat Genet 46(1)88ndash92

Yu J Tian S Metheny-Barlow L Chew L-J Hayes AJ Pan H Yu G-L LiL-Y 2001 Modulation of endothelial cell growth arrest andapoptosis by vascular endothelial growth inhibitor Circ Res89(12)1161ndash1167

Zhang C Rabiee M Sayyari E Mirarab S 2018 ASTRAL-III polynomialtime species tree reconstruction from partially resolved gene treesBMC Bioinformatics 19(Suppl 6)153

Zhang Q Edwards SV 2012 The evolution of intron size in amniotes arole for powered flight Genome Biol Evol 4(10)1033ndash1043

Zhao M Kim P Mitra R Zhao J Zhao Z 2015 TSGene 20 an updatedliterature-based knowledgebase for tumor suppressor genes NucleicAcids Res 44gkv1268ndashgkD1031

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1763

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

  • msz099-TF1
  • msz099-TF2
  • msz099-TF3
  • msz099-TF4
  • msz099-TF5
  • msz099-TF6
  • msz099-TF7
  • msz099-TF8
Page 3: University of Groningen Return to the sea, get huge, …...Return to the Sea, Get Huge, Beat Cancer: An Analysis of Cetacean Genomes Including an Assembly for the Humpback Whale (Megaptera

Traits evolved for life in the ocean including the loss of hindlimbs changes in skull morphology physiological adaptationsfor deep diving and underwater acoustic abilities includingecholocation make these species among the most divergedmammals from the ancestral eutherian (Berta et al 2015)One striking aspect of cetacean evolution is the large bodysizes achieved by some lineages rivaled only by the giganticterrestrial sauropod dinosaurs (Benson et al 2014) Cetaceanswere not limited by gravity in the buoyant marine environ-ment and evolved multiple giant forms exemplified today bythe largest animal that has ever lived the blue whale(Balaenoptera musculus) Based on evidence from fossils mol-ecules and historical climate data it has been hypothesizedthat oceanic upwelling during the PliocenendashPleistocene sup-ported the suspension feeding typical of modern baleenwhales allowing them to reach their gigantic sizes surprisinglyclose to the present time (Slater et al 2017)

Although the largest whales arose relatively recently largebody size has evolved multiple times throughout the historyof life (Heim et al 2015) including in 10 out of 11 mammalianorders (Baker et al 2015) Animal gigantism is therefore arecurring phenomenon that is seemingly governed by avail-able resources and natural selection (Vermeij 2016) wherepositive fitness consequences lead to repeated directionalselection toward larger bodies within populations(Kingsolver and Pfennig 2004) However there are tradeoffsassociated with large body size including a higher lifetime riskof cancer due to a greater number of somatic cell divisionsover time (Peto et al 1975 Nunney 2018) Surprisingly al-though cancer should be a body mass- and age-related dis-ease large and long-lived animals do not suffer higher cancermortality rates than smaller shorter-lived animals (Abegglenet al 2015) This is a phenomenon known as Petorsquos Paradox(Peto et al 1975) To the extent that there has been selectionfor large body size there likely has also been selection forcancer suppression mechanisms that allow an organism togrow large and successfully reproduce Recent efforts havesought to understand the genomic mechanisms responsiblefor cancer suppression in gigantic species (Abegglen et al2015 Caulin et al 2015 Keane et al 2015 Sulak et al 2016)An enhanced DNA damage response in elephant cells hasbeen attributed to20 duplications of the tumor suppressorgene TP53 in elephant genomes (Abegglen et al 2015 Sulaket al 2016) The bowhead whale (Balaena mysticetus) is alarge whale that may live more than 200 years (Georgeet al 1999) and its genome shows evidence of positive selec-tion in many cancer- and aging-associated genes includingERCC1 which is part of the DNA repair pathway (Keane et al2015) Additionally the bowhead whale genome containsduplications of the DNA repair gene PCNA as well asLAMTOR1 which helps control cellular growth (Keane et al2015) Altogether these results suggest that 1) the genomesof larger and longer-lived mammals may hold the key tomultiple mechanisms for suppressing cancer and 2) as thelargest animals on Earth whales make very promising sourcesof insight for cancer suppression research

Cetacean comparative genomics is a rapidly growing fieldwith 13 complete genome assemblies available on NCBI as of

late 2018 including the following that were available at theonset of this study the common minke whale (Balaenopteraacutorostrata) (Yim et al 2014) bottlenose dolphin (Tursiopstruncatus) orca (Orcinus orca) (Foote et al 2015) and spermwhale (Physeter macrocephalus) (Warren et al 2017) In ad-dition the Bowhead Whale Genome Resource has supportedthe genome assembly for that species since 2015 (Keane et al2015) However to date few studies have used multiple ce-tacean genomes to address questions about genetic changesthat have controlled adaptations during cetacean evolutionincluding the evolution of cancer suppression Here we pro-vide a comparative analysis that is novel in scope leveragingwhole-genome data from ten cetacean species including sixcetacean genome assemblies and a de novo genome assem-bly for the humpback whale (Megaptera novaeangliae)Humpback whales are members of the familyBalaenopteridae (rorquals) and share a recent evolutionaryhistory with other ocean giants such as the blue whale and finwhale (Balaenoptera physalus) (Arnason et al 2018) Theyhave an average adult length of more than 13 m (Claphamand Mead 1999) and a lifespan that may extend to 95 years(Chittleborough 1959 Gabriele et al 2010) making the spe-cies an excellent model for Petorsquos Paradox research

Our goals in this study were 3-fold 1) to provide a de novogenome assembly and annotation for the humpback whalethat will be useful to the cetacean research and mammaliancomparative genomics communities 2) to leverage the geno-mic resource and investigate the molecular evolution of ceta-ceans in terms of their population demographicsphylogenetic relationships and species divergence timesand the genomics underlying cetacean-specific adaptationsand 3) to determine how selective pressure variation on genesinvolved with cell cycle control cell signaling and prolifera-tion and many other pathways relevant to cancer may havecontributed to the evolution of cetacean gigantism The latterhas the potential to generate research avenues for improvinghuman cancer prevention and perhaps even therapies

Results and Discussion

Sequencing Assembly and Annotation of theHumpback Whale GenomeWe sequenced and assembled a reference genome for thehumpback whale using high-coverage paired-end and mate-pair libraries (table 1 NCBI BioProject PRJNA509641) andobtained an initial assembly that was 227 Gb in lengthwith 24319 scaffolds a contig N50 length of 125 kb and ascaffold N50 length of 198 kb Final sequence coverage forthe initial assembly was 76 assuming an estimated ge-nome size of 274 Gb from a 27-mer spectrum analysis Hi Risescaffolding using proximity ligation (Chicago) libraries(Putnam et al 2016 table 1 NCBI BioProject PRJNA509641)resulted in a final sequence coverage of 102 greatly im-proving the contiguity of the assembly by reducing the num-ber of scaffolds to 2558 and increasing the scaffold N50length 46-fold to 914 Mb (table 2) The discrepancy betweenestimated genome size and assembly length has been ob-served in other cetacean genome efforts (Keane et al 2015)

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1747

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

and is likely due to the highly repetitive nature of cetaceangenomes (Arnason and Widegren 1989)

With 95ndash96 of near-universal orthologs from OrthoDBv9 (Sim~ao et al 2015) present in the assembly as well as 97of a set of core eukaryotic genes (Parra et al 2009) the esti-mated gene content of the humpback whale genome assem-bly suggests a high-quality genome with good generepresentation (table 1) To aid in genome annotation wecarried out skin transcriptome sequencing which resulted in281642354 reads (NCBI BioProject PRJNA509641) Thesewere assembled into a transcriptome that includes 67 ofboth vertebrate and laurasiatherian orthologs and we pre-dicted 10167 protein-coding genes with likely ORFs that con-tain BLAST homology to SwissProt proteins (UniProtConsortium 2015) The large number of missing genes from

the transcriptome may be due to the small proportion ofgenes expressed in skin Therefore we also assessed homologywith ten mammalian proteomes from NCBI and the entireSwissProt database and ab initio gene predictors (seeMaterials and Methods supplementary Methods and sup-plementary fig 1 Supplementary Material online) for gene-calling The final genome annotation resulted in 24140 pro-tein-coding genes including 5446 with 50-untranslatedregions (UTRs) and 6863 with 30-UTRs We detected 15465one-to-one orthologs shared with human and 14718 withcow When we compared gene annotations across a sampleof mammalian genomes the humpback whale and bottle-nose dolphin genome assemblies had on average significantlyshorter introns (Pfrac14 004 unpaired T-test supplementary ta-ble 1 Supplementary Material online) which may in partexplain the smaller genome size of cetaceans comparedwith most other mammals (Zhang and Edwards 2012)

We estimated that between 30 and 39 of thehumpback whale genome comprised repetitive elements(table 3) Masking the assembly with a library of knownmammalian elements resulted in the identification of morerepeats than a de novo method suggesting that clade-specific repeat libraries are highly valuable when assessingrepetitive content The most abundant group of transpos-able elements in the humpback whale genome was theautonomous non-long terminal repeat (LTR) retrotranspo-sons (long interspersed nuclear elements or LINEs) whichcomprised nearly 20 of the genome most of which be-long to the LINE-1 clade as is typical of placental mammals(Boissinot and Sookdeo 2016) Large numbers of nonau-tonomous non-LTR retrotransposons in the form of shortinterspersed nuclear elements (SINEs) were also detectedin particular over 3 of the genome belonged to mamma-lian inverted repeats Although the divergence profile of denovo-derived repeat annotations in humpback whale in-cluded a decreased average genetic distance within trans-posable element subfamilies compared with the database-derived repeat landscape both repeat libraries displayed aspike in the numbers of LINE-1 and SINE retrotransposonsubfamilies near 5 divergence as did the repeat land-scapes of the bowhead whale orca and dolphin suggestingrecent retrotransposon activity in cetaceans (supplemen-tary figs 2 and 3 Supplementary Material online)

Slow DNA Substitution Rates in Cetaceans and theDivergence of Modern Whale LineagesWe computed a whole-genome alignment (WGA) of 12mammals including opossum elephant human mousedog cow sperm whale bottlenose dolphin orca bowheadwhale common minke whale and humpback whale (supple-mentary table 2 Supplementary Material online) andemployed human gene annotations to extract 2763828 ho-mologous 4-fold degenerate (4D) sites A phylogenetic anal-ysis of the 4D sites yielded the recognized evolutionaryrelationships (fig 1A) including reciprocally monophyleticMysticeti and Odontoceti When we compared the substitu-tions per site along the branches of the phylogeny we founda larger number of substitutions along the mouse

Table 1 Genomic Sequence Data Obtained for the Humpback WhaleGenome

Libraries Est Numberof Reads

Avg ReadLength

(bp)

EstDepth(total)

180 bp paired-end 1211320000 94 413300 bp paired-end 25820000 123 12500 bp paired-end 112400000 123 50600 bp paired-end 395500000 93 1342 kb mate-paired 348080000 49 6210 kb mate-paired 279000000 94 90

Subtotal for WGS libraries 2372120000 761Chicago Library 1 72000000 100 53Chicago Library 2 6000000 151 07Chicago Library 3 190000000 100 139Chicago Library 4 79000000 100 58Subtotal for Chicago Libraries 347000000 256Total for all sequence

libraries2719120000 1017

NOTEmdashWGS whole-genome shotgun

Table 2 Statistics for the Humpback Whale Genome Assembly

Feature Initial Assembly Final Assembly

Assembly length 227 Gb 227 GbContig N50 124 kb 123 kbLongest scaffold 22 Mb 294 MbNumber of scaffolds 24319 2558Scaffold N50 198 kb 914 MbScaffold N90 53 kb 235 MbScaffold L50 3214 79Scaffold L90 11681 273Percent genome

in gaps536 545

BUSCOa resultsmdashvertebrata

C 85[D 18] F 15 M 49 n 3023

BUSCOa resultsmdashlaurasiatheria

C 912[D 08] F 48 M 40 n 6253

CEGMAa results C 226 (9113) P 240 (9677)

NOTEmdashBUSCO Benchmarking Universal Single Copy Orthologs (Sim~ao et al 2015)C complete D duplicated F fragmented M missing CEGMA Core EukaryoticGenes Mapping Approach (Parra et al 2009) C complete P complete andorpartialaBUSCO and CEGMA results for final assembly only

Tollis et al doi101093molbevmsz099 MBE

1748

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

(rodent) branch supporting negative relationships betweengeneration time (Wilson Sayres et al 2011) speciation (Pagelet al 2006) and substitution rates When applying a semi-parametric penalized likelihood (PL) method to estimate sub-stitution rate variation at 4D sites across the 12 mammals we

found that cetaceans have accumulated the lowest numberof DNA substitutions per site per million years (fig 1B andsupplementary table 3 Supplementary Material online)which may be attributed to long generation times or slowermutation rates in cetaceans (Jackson et al 2009) Germlinemutation rates are related to somatic mutation rates withinspecies (Milholland et al 2017) therefore it is possible thatslow mutation rates may limit neoplastic progression andcontribute to cancer suppression in cetaceans which is aprediction of Petorsquos Paradox (Caulin and Maley 2011)

We also obtained 152 single-copy orthologs (single-geneortholog families or SGOs see Materials and Methods andsupplementary Methods Supplementary Material online)identified in at least 24 out of 28 species totaling314844 bp and reconstructed gene trees that were binnedand analyzed using a species tree method that incorporatesincomplete lineage sorting (see Materials and MethodsZhang et al 2018) The species tree topology (supplementaryfig 4 Supplementary Material online) also included full sup-port for the accepted phylogenetic relationships withinCetacea as well as within Mysticeti and Odontoceti Lowerlocal posterior probabilities for two of the internal brancheswithin laurasiatherian mammals were likely due to the exten-sive gene tree heterogeneity that has complicated phyloge-netic reconstruction of the placental mammalian lineages(Tarver et al 2016)

We estimated divergence times in a Bayesian frameworkusing the 4D and SGO data sets independently in MCMCtree(Yang and Rannala 2006) resulting in similar posterior distri-butions and parameter estimates with overlapping highestposterior densities for the estimated divergence times ofshared nodes across the 4D and SGO phylogenies (supple-mentary figs 5 and 6 and tables 4 and 5 SupplementaryMaterial online) We estimated that the time to the mostrecent common ancestor (TMRCA) of placental mammalswas 100ndash114 Ma during the late Cretaceous the TMRCA ofcow and cetaceans (Cetartiodactyla) was 52ndash65 Ma duringthe Eocene or Paleocene the TMRCA of extant cetaceans was29ndash35 Ma during the early Oligocene or late Eocene (betweenthe two data sets) the TMRCA of baleen whales was placed9ndash26 Ma in the early Miocene or middle Oligocene and theTMRCA of humpback and common minke whales (familyBalaenopteridae) was 4ndash22 Ma during the early Pliocene orthe Miocene (fig 2A)

Table 3 Repetitive Content of the Humpback Whale (Megaptera novaeangliae) Genome Estimated with a Library of Known Mammalian Repeats(RepBase) and De Novo Repeat Identification (RepeatModeler)

RepBase RepeatModeler

Repeat Type Length (bp) Genome(3885 total)

Length (bp) Genome(3025 total)

SINEs 137574621 607 75509694 333LINEs 440955223 1946 432017456 1907LTR 142117286 627 94177184 416DNA transposons 84243186 372 54015996 238Unclassified 1303231 006 4339463 019Satellites 48894580 216 197862 001Simple repeats 20779839 092 20848394 092Low complexity 4167187 018 4281173 019

A

B

FIG 1 Substitution rates in cetacean genomes (A) Maximum likeli-hood phylogeny of 12 mammals based on 2763828 fourfold degen-erate sites Branch lengths are given in terms of substitutions per siteexcept for branches with hatched lines which are shortened for visualconvenience All branches received 100 bootstrap support (B)Based on the phylogeny in (A) a comparison of the estimated DNAsubstitution rates (in terms of substitutions per site per million years)between terminal and internal cetacean branches and terminal andinternal branches of all other mammals

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1749

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

FIG 2 Timescale of humpback whale evolution (A) Species phylogeny of 28 mammals constructed from 152 orthologs and time-calibrated usingMCMCtree Branch lengths are in terms of millions of years Node bars indicate 95 highest posterior densities of divergence times Cetaceans arehighlighted in the gray box with mean estimates of divergence times included (B) The effective population size (Ne) changes over timeDemographic histories of two North Atlantic humpback whales estimated from the PSMC analysis including 100 bootstrap replicates per analysisMutation rate used was 154e-9 per year and generation time used was 215 years

Tollis et al doi101093molbevmsz099 MBE

1750

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

A Complex Demographic History of North AtlanticHumpback WhalesWe estimated the demographic history of the North Atlantichumpback whale population applying the Pairwise SequentialMarkovian Coalescent (PSMC) (Li and Durbin 2011) to theshort-insert libraries generated during this study as well assequence reads from a second North Atlantic humpbackwhale (Arnason et al 2018) (fig 2B supplementary figs 7and 8 Supplementary Material online) Consistent with thefindings of Arnason et al (2018) we estimated that the largesthumpback whale population sizes were 2 Ma during thePliocenendashPleistocene transition followed by a steady declineuntil 1 Ma The PSMC trajectories of the two humpbackwhales began to diverge 100000 years ago and the esti-mated confidence intervals from 100 bootstraps for eachPSMC analysis were nonoverlapping in the more recentbins Both humpback PSMC trajectories suggested sharp pop-ulation declines beginning 25000ndash45000 years agoHowever interpreting inferred PSMC plots of pastldquodemographicrdquo changes is nontrivial in a globally distributedspecies connected by repeated occasional gene flow such ashumpback whales (Baker et al 1993 Palsboslashll et al 1995Jackson et al 2014) The apparent changes in effective pop-ulation size may represent changes in abundance interoceanconnectivity or a combination of both (Hudson 1990 Palsboslashllet al 2013) Several genetic and genome-based studies ofcetaceans have demonstrated how past large-scale oceanicchanges have affected the evolution of cetaceans (Steemanet al 2009) including baleen whales (Arnason et al 2018)Although the population genetic structure of humpbackwhales in the North Atlantic is not fully resolved the levelof genetic divergence among areas is very low (Larsen et al1996 Valsecchi et al 1997) Therefore the difference betweenthe two humpback whale PSMC trajectories may be due torecent admixture (Baker et al 1993 Palsboslashll et al 1995 Ruegget al 2013 Jackson et al 2014) intraspecific variation andpopulation structure (Mazet et al 2016) as well as errorsdue to differences in sequence coverage (Nadachowska-Brzyska et al 2016)

Segmental Duplications in Cetacean GenomesContain Genes Involved in Apoptosis and TumorSuppressionMammalian genomes contain gene-rich segmental duplica-tions (Alkan et al 2009) which may represent a powerfulmechanism by which new biological functions can arise(Kaessmann 2010) We employed a read-mapping approachto annotate large segmental duplications (LSDs) 10 kb inthe humpback whale genome assembly and ten additionalcetaceans for which whole-genome shotgun data were avail-able (see Materials and Methods supplementary Methodsand supplementary table 6 Supplementary Material online)We found that cetacean genomes contained on average 318LSDs (656 SD) which comprised 99 Mb (618 Mb) andaveraged 31 kb in length (624 kb) We identified10128534 bp (04) of the humpback whale genome assem-bly that comprised 293 LSDs averaging 34568 bp in length

Fifty-one of the LSDs were shared across all 11 cetaceangenomes (supplementary fig 9 Supplementary Material on-line) In order to determine the potential role of segmentalduplications during the evolution of cetacean-specific pheno-types we identified 426 gene annotations that overlappedcetacean LSDs including several genes annotated for viralresponse Other genes on cetacean LSDs were involved inaging in particular DLD in the bowhead whale andKCNMB1 in the blue whale this may reflect relevant adapta-tions contributing to longevity in two of the largest andlongest-lived mammals (Ohsumi 1979 George et al 1999)Multiple tumor suppressor genes were located on cetaceanLSDs including 1) SALL4 in the sei whale 2) TGM3 andSEMA3B in the orca 3) UVRAG in the sperm whale NorthAtlantic right whale and bowhead whale and 4) PDCD5which is upregulated during apoptosis (Zhao et al 2015)and was found in LSDs of all 11 queried cetacean genomesPDCD5 pseudogenes have been identified in the human ge-nome and several Ensembl-hosted mammalian genomescontain one-to-many PDCD5 orthologs however we anno-tated only a single copy of PDCD5 in the humpback whaleassembly This suggests that in many cases gene duplicationsare collapsed during reference assembly but can be retrievedthrough shotgun read-mapping methods (Carbone et al2014) We annotated fully resolved SALL4 and UVRAG copynumber variants in the humpback whale genome assemblyand by mapping the RNA-Seq data from skin to the genomeassembly and annotation (see Materials and methods) wefound that three annotated copies of SALL4 were expressed inhumpback whale skin as were two copies of UVRAG

We also found that145 Mb (6923 kb) of each cetaceangenome consists of LSDs not found in other cetaceans mak-ing them species-specific which averaged 244 kb (6146kb) in length (supplementary table 7 and fig 9Supplementary Material online) The minke whale genomecontained the highest number of genes on its species-specificLSDs (32) After merging the LSD annotations for the twohumpback whales we identified 57 species-specific LSDs forthis species comprising 977 kb and containing nine dupli-cated genes Humpback whale-specific duplications includedthe genes PRMT2 which is involved in growth and regulationand promotes apoptosis SLC25A6 which may be responsiblefor the release of mitochondrial products that trigger apopto-sis and NOX5 which plays a role in cell growth and apoptosis(UniProt Consortium 2015) Another tumor suppressor geneTPM3 was duplicated in the humpback whale assemblybased on our gene annotation However these extranumer-ary copies of TPM3 were not annotated on any humpbackwhale LSDs lacked introns and contained mostly the sameexons suggesting retrotransposition rather than segmentalduplication as a mechanism for their copy number expansion(Kaessmann 2010) According to the RNA-Seq data all sevencopies of TPM3 are expressed in humpback whale skin

Duplications of the tumor suppressor gene TP53 havebeen inferred as evidence for cancer suppression in elephants(Abegglen et al 2015 Caulin et al 2015 Sulak et al 2016)During our initial scans for segmental duplications we no-ticed a large pileup of reads in the MAKER-annotated

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1751

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

humpback whale TP53 (data not shown) We PCR-amplifiedcloned and sequenced this region from a humpback whaleDNA sample inferring four haplotypes that differ at two bases(supplementary Methods Supplementary Material online)After manually annotating TP53 in the humpback whalewe determined that these nucleotide variants fell in noncod-ing regions of the gene one occurred upstream of the startcodon whereas the other occurred between the first andsecond coding exons Other genomic studies have concludedthat TP53 is not duplicated in cetaceans (Yim et al 2014Keane et al 2015 Sulak et al 2016) We consider the possi-bility of at least two TP53 homologs in the genome of thehumpback whale although more data are required to resolvethis Regardless cancer suppression likely arose in differentmammalian lineages via multiple molecular etiologiesOverall our results reveal several copy number expansionsin cetaceans related to immunity aging and cancer suggest-ing that cetaceans are among the large mammals that haveevolved specific adaptations related to cancer resistance

Accelerated Regions in Cetacean Genomes AreSignificantly Enriched with Pathways Relevant toCancerIn order to determine genomic loci underlying cetacean adap-tations we estimated regions in the 12-mammal WGA withelevated substitution rates that were specific to the cetaceanbranches of the mammalian phylogeny These genomicregions departed from neutral expectations in a manner con-sistent with either positive selection or relaxed purifying se-lection along the cetacean lineage (Pollard et al 2010) Wesuccessfully mapped 3260 protein-coding genes with func-tional annotations that overlap cetacean-specific acceleratedregions which were significantly enriched for Gene Ontology(GO) categories such as cell-cell signaling (GO0007267) andcell adhesion (GO0007155) (table 4) Adaptive change in cellsignaling pathways could have maintained the ability of ceta-ceans to prevent neoplastic progression as they evolved largerbody sizes Adhesion molecules are integral to the develop-ment of cancer invasion and metastasis and these resultssuggest that cetacean evolution was accompanied by selec-tion pressure changes on both intra- and extracellular inter-actions Cetacean-specific genomic regions with elevatedsubstitution rates were also significantly enriched in genesinvolved in B-cell-mediated immunity (GO0019724) likely

due to the important role of regulatory cells which modulateimmune response to not only pathogens but perhaps tumorsas well In addition cetacean-specific acceleration in regionscontrolling complement activation (GO0006956) may haveprovided better immunosurveillance against cancer and fur-ther protective measures against malignancies (Pio et al2014) We also found that accelerated regions in cetaceangenomes were significantly enriched for genes controllingsensory perception of smell (GO0007608) perhaps due tothe relaxation of purifying selection in olfactory regions whichwere found to be underrepresented in cetacean genomes(Yim et al 2014)

Selection Pressures on Protein-Coding Genes duringCetacean Evolution Point to Many CetaceanAdaptations Including Cancer SuppressionTo gain further insight into the genomic changes underlyingthe evolution of large body sizes in cetaceans we employedphylogenetic targeting to maximize statistical power in pair-wise evolutionary genomic analyses (Arnold and Nunn 2010)This resulted in maximal comparisons between 1) the orcaand the bottlenose dolphin and 2) the humpback whale andcommon minke whale Despite their relatively recent diver-gences (eg the orcabottlenose dolphin divergence is similarin age to that of the humanchimpanzee divergence seefig 2) the species pairs of common minkehumpback andorcadolphin have each undergone extremely divergent evo-lution in body size and longevity (fig 3) Humpback whalesare estimated to weigh up to four times as much as commonminke whales and are reported to have almost double thelongevity and orcas may weigh almost 20 times as much asbottlenose dolphins also with almost double the lifespan(Tacutu et al 2012) In order to offset the tradeoffs associatedwith the evolution of large body size with the addition ofmany more cells and longer lifespans since the divergence ofeach species pair we hypothesize that necessary adaptationsfor cancer suppression should be encoded in the genomes aspredicted by Petorsquos Paradox (Tollis et al 2017)

For each pairwise comparison we inferred pairwise ge-nome alignments with the common minke whale and orcagenome assemblies as targets respectively and extractedprotein-coding orthologous genes We then estimated theratio of nonsynonymous substitutions per synonymous siteto synonymous substitutions per synonymous site (dNdS) in

Table 4 GO Terms for Biological Processes Overrepresented by Genes Overlapping Genomic Regions with Elevated Substitution Rates That AreUnique to the Cetacean Lineage

Go Term Description Number of Genes Fold Enrichment P-Valuea

GO0007608 Sensory perception of smell 157 458 124E-40GO0006956 Complement activation 39 291 491E-05GO0019724 B-cell-mediated immunity 39 291 491E-05GO0032989 G-protein-coupled receptor signaling pathway 159 244 196E-17GO0042742 Defense response to bacterium 39 244 220E-03GO0009607 Response to biotic stimulus 43 209 185E-02GO0007155 Cell adhesion 101 199 174E-06GO0007267 Cellndashcell signaling 137 169 284E-05

aAfter Bonferonni correction for multiple testing

Tollis et al doi101093molbevmsz099 MBE

1752

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

order to measure selective pressures acting on each ortholo-gous gene pair during cetacean evolution A dNdSgt1 is usedto infer potentially functional amino acid changes in candi-date genes subjected to positive selection (Fay and Wu 2003)Among an estimated 435 genes with dNdS gt1 in the com-mon minkehumpback pairwise comparison we detectedeight genes belonging to the JAK-STAT signaling pathway(39-fold enrichment Pfrac14 11E-3 Fisherrsquos exact test) and seveninvolved in cytokinendashcytokine receptor interaction (41-foldenrichment Pfrac14 17E-2 Fisherrsquos exact test) suggesting positiveselection acting on pathways involved in cell proliferationThese genes included multiple members of the tumor necro-sis factor subfamily such as TNFSF15 which inhibits angiogen-esis and promotes the activation of caspases and apoptosis(Yu et al 2001) A dNdS gt1 was also detected in seven genesinvolved in the negative regulation of cell growth(GO0030308 31-fold enrichment Pfrac14 803E-3 Fisherrsquos exacttest) and five genes involved in double-strand break repair(GO0006302 40-fold enrichment Pfrac14 803E-3 Fisherrsquos exacttest) Although these results suggest the evolution of aminoacid differences since the split between common minke andhumpback whales in genes affecting cell growth proliferationand maintenance the GO category enrichment tests did notpass significance criteria after Bonferroni corrections for mul-tiple testing We found 18 genes that are mutated in cancersaccording to the COSMIC v85 database (Forbes et al 2015) inthe common minkehumpback comparison including a sub-set of five annotated as tumor suppressor genes oncogenesor fusion genes in the Cancer Gene Census (CGC Futreal et al2004) which are highlighted in table 5 The complete list ofCOSMIC genes with elevated dNdS in the pairwise compar-isons is given in supplementary table 8 SupplementaryMaterial online We detected 555 orthologous genes withdNdS gt1 in the orcadolphin comparison which are signifi-cantly enriched (after Bonferroni correction for multiple test-ing) for biological processes such as immune response cellactivation and regulation of cytokines (table 6) and 41 of

which are known cancer genes according to COSMIC andCGC (table 5 supplementary table 8 Supplementary Materialonline) These results are consistent with our accelerated re-gion analysis based on the WGA which showed acceleratedevolution in immunity pathways (above see table 4) Forinstance eight genes (CD58 CD84 KLF13 SAMSN1 CTSGGPC3 LTF and SPG21) annotated for immune system process(GO0002376) were found in cetacean-specific acceleratedgenomic regions and also had a pairwise dNdS gt1 in theorcadolphin comparison mirroring other recent genomicanalyses of immunity genes in orcas (Ferris et al 2018) Ourresults also suggest that the evolution of gigantism and longlifespans in cetaceans was accompanied by selection actingon many genes related to somatic maintenance and cellsignaling

As a more accurate assessment of selection pressure var-iation acting on protein-coding genes across cetacean evolu-tion we conducted an additional assessment of dNdS usingbranch-site codon models implemented in codeml (Yang1998) We employed extensive filtering of the branch siteresults including both false discovery rate (FDR) andBonferroni corrections for multiple testing (see Materialsand methods) and conservatively estimated that 450protein-coding genes were subjected to positive selection incetaceans These include 54 genes along the ancestral ceta-cean branch 12 along the ancestral toothed whale branch 84along the ancestral baleen whale branch 74 in the ancestor ofcommon minke and humpback whales and 212 unique tothe humpback whale branch (fig 4A) Cetacean positivelyselected genes were annotated for functions related to exten-sive changes in anatomy growth cell signaling and cell pro-liferation (fig 4B) For instance in the branch-site models forhumpback whale positively selected genes are enriched forseveral higher-level mouse limb phenotypes including thoseaffecting the limb long bones (MP0011504 15 genes FDR-corrected P-valuefrac14 0001) and more specifically the hindlimb stylopod (MP0003856 seven genes FDRfrac14 0024) or

FIG 3 Diversity in both body size and lifespan within rorqual baleen whales (Balaenopteridae) and dolphins (Delphinidae) Maximal pairings usingphylogenetic targeting (Arnold and Nunn 2010) of genome assembly-enabled cetaceans resulted in the most extreme divergence in both body sizeand lifespan between humpback whale (Megaptera novaeangliae) and common minke whale (Balaenoptera acutorostrata) within theBalaenopteridae facing right and orca (Orcinus orca) and bottlenose dolphin (Tursiops truncatus) within the Delphinidae facing left Traitdata were collected from the panTHERIA (Jones et al 2009) and AnAge (Tacutu et al 2012) databases

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1753

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Table 5 CGC Genes with dNdS gt 1 as Revealed by Pairwise Comparisons of Cetacean Genomes

Comparison Gene Symbol Gene Name Role in Cancer Function

Minkehumpback CD274 CD274 molecule TSG Plays a critical role in induction and maintenance ofimmune tolerance to selfa

ETNK1 Ethanolamine kinase 1 TSG Suppresses escaping of programmed cell deathb

IL21R Interleukin 21 receptor Fusion The ligand binding of this receptor leads to the acti-vation of multiple downstream signaling mole-cules including JAK1 JAK3 STAT1 and STAT32

MYOD1 Myogenic differentiation 1 Fusion Regulates muscle cell differentiation by inducing cellcycle arrest a prerequisite for myogenic initiationa

PHF6 PHD finger protein 6 TSG Encodes a protein with two PHD-type zinc fingerdomains indicating a potential role in transcrip-tional regulation that localizes to the nucleolusa

Orcadolphin BTG1 B-cell translocation gene 1antiproliferative

TSG fusion Member of an antiproliferative gene family that reg-ulates cell growth and differentiationa

CD274 CD274 molecule TSG fusion Plays a critical role in induction and maintenance ofimmune tolerance to selfa

FANCD2 Fanconi anemia comple-mentation group D2

TSG Suppresses genome instability and mutations pro-motes escaping programmed cell death suppressesproliferative signaling suppresses invasion andmetastasisb

FAS Fas cell surface deathreceptor

TSG Promotes cell replicative immortality promotesproliferative signaling promotes invasion and me-tastasis suppresses escaping programmed celldeathb

FGFR4 Fibroblast growth factor re-ceptor 4

Oncogene Promotes proliferative signaling promotes invasionand metastasisb

GPC3 Glypican 3 Oncogene TSG Promotes invasion and metastasis promotes sup-pression of growthb

HOXD11 Homeobox D11 Oncogene fusion The homeobox genes encode a highly conservedfamily of transcription factors that play an impor-tant role in morphogenesis in all multicellularorganismsa

HOXD13 Homeobox D13 Oncogene fusion

LASP1 LIM and SH3 protein 1 Fusion The encoded protein has been linked to metastaticbreast cancer hematopoetic tumors such as B-celllymphomas and colorectal cancera

MLF1 Myeloid leukemia factor 1 TSG fusion This gene encodes an oncoprotein which is thought toplay a role in the phenotypic determination ofhemopoetic cells Translocations between this geneand nucleophosmin have been associated withmyelodysplastic syndrome and acute myeloidleukemiaa

MYB v-myb myeloblastosis viraloncogene homolog

Oncogene fusion This gene may be aberrantly expressed or rearrangedor undergo translocation in leukemias and lym-phomas and is considered to be an oncogenea

MYD88 Myeloid differentiation pri-mary response gene (88)

Oncogene Promotes escaping programmed cell death promotesproliferative signaling promotes invasion and me-tastasis promotes tumor promotinginflammationb

NR4A3 Nuclear receptor subfamily4 group A member 3(NOR1)

Oncogene fusion Encodes a member of the steroidndashthyroid hormonendashretinoid receptor superfamily that may act as atranscriptional activatora

PALB2 Partner and localizer ofBRCA2

TSG This protein binds to and colocalizes with the breastcancer 2 early onset protein (BRCA2) in nuclear fociand likely permits the stable intranuclear localiza-tion and accumulation of BRCA2a

PML Promyelocytic leukemia TSG fusion Expression is cell-cycle related and it regulates the p53response to oncogenic signalsa

RAD21 RAD21 homolog(Schizosaccharomycespombe)

Oncogene TSG Promotes invasion and metastasis suppresses ge-nome instability and mutations suppresses escap-ing programmed cell deathb

STIL SCLTAL1 interruptinglocus

Oncogene fusion Encodes a cytoplasmic protein implicated in regula-tion of the mitotic spindle checkpoint a regulatorypathway that monitors chromosome segregationduring cell division to ensure the proper distribu-tion of chromosomes to daughter cellsa

(continued)

Tollis et al doi101093molbevmsz099 MBE

1754

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

femur (MP0000559 six genes FDRfrac14 0019) These pheno-types are reminiscent of the developmental basis of hind limbloss in cetaceans embryonic studies show that hind limbbuds are initially formed but disappear by the fifth gestationalweek (Thewissen et al 2006) Enriched mouse phenotypes arealso related to the unique cetacean axial skeleton(MP0002114 25 genes FDRfrac14 0016) most notably in theskull including craniofacial bones (MP0002116 17 genesFDRfrac14 0018) teeth (MP0002100 nine genes FDRfrac14 0004)and the presphenoid (MP0030383 three genesFDRfrac14 0003) Past analyses of the cetacean basicranial ele-ments revealed that the presphenoid was extensively modi-fied along the cetacean lineage (Ichishima 2016)

Positively selected genes unique to the humpback whalewere significantly enriched for a single biological process reg-ulation of cell cycle checkpoint (GO1901976 1857-fold en-richment Pfrac14 002 after Bonferonni correction for multipletesting) suggesting positive selection in pathways that con-trol responses to endogenous or exogenous sources of DNAdamage and limit cancer progression (Kastan and Bartek2004) We detected a significant number of proteinndashproteininteractions among humpback whale-specific positively se-lected genes (number of nodesfrac14 204 number of edgesfrac14 71expected number of edgesfrac14 51 Pfrac14 0004 supplementaryfig 10 Supplementary Material online) including genes thatare often coexpressed and involved in DNA repair DNA rep-lication and cell differentiation For instance we identifiedsignificant interactions between DNA2 which encodes a heli-case involved in the maintenance of DNA stability andWDHD1 which acts as a replication initiation factorAnother robust protein interaction network was detected

between a number of genes involved in the genesis and main-tenance of primary cilia The highest scoring functional an-notation clusters resulted in key words such as ciliopathy(seven genes) and cell projection (16 genes) and GO termssuch as cilium morphogenesis cilium assembly ciliary basalbody and centriole The primary cilia of multicellular eukar-yotes control cell proliferation by mediating cell-extrinsic sig-nals and regulating cell cycle entry and defects in ciliaryregulation are common in many cancers (Michaud andYoder 2006)

Our branch-site test results indicated that the evolution ofcetacean gigantism was accompanied by strong selection onmany pathways that are directly linked to cancer (fig 4C) Weidentified 33 genes that are mutated in human cancers(according to the COSMIC database) that were inferred assubjected to positive selection in the humpback whale line-age including the known tumor suppressor genes ATR whichis a protein kinase that senses DNA damage upon genotixicstress and activates cell cycle arrest and RECK which sup-presses metastasis (Forbes et al 2015) Multiple members ofthe PR domain-containing gene family (PRDM) evolved un-der positive selection across cetaceans including the tumorsuppressor genes PRDM1 whose truncation leads to B-cellmalignancies and PRDM2 which regulates the expressionand degradation of TP53 (Shadat et al 2010) and whoseforced expression causes apoptosis and cell cycle arrest incancer cell lines (Fog et al 2012) In baleen whales ERCC5which is a DNA repair protein that partners with BRCA1 andBRCA2 to maintain genomic stability (Trego et al 2016) andsuppresses UV-induced apoptosis (Clement et al 2006)appeared to have been subjected to positive selection as

Table 5 Continued

Comparison Gene Symbol Gene Name Role in Cancer Function

TAL1 T-cell acute lymphocyticleukemia 1 (SCL)

Oncogene fusion Implicated in the genesis of hemopoietic malignan-cies and may play an important role in hemopoieticdifferentiationa

TNFRSF14 Tumor necrosis factor re-ceptor superfamilymember 14 (herpesvirusentry mediator)

TSG The encoded protein functions in signal transductionpathways that activate inflammatory and inhibi-tory T-cell immune responsea

TNFRSF17 Tumor necrosis factor re-ceptor superfamilymember 17

Oncogene fusion This receptor also binds to various TRAF familymembers and thus may transduce signals for cellsurvival and proliferationa

NOTEmdashTSG tumor suppressor geneaSource RefSeqbSource Cancer Hallmark from CGC

Table 6 GO Terms for Biological Processes Overrepresented by Genes with Pairwise dNdS gt 1 in the Orca Bottlenose Dolphin Comparison

GO Term Description Number of Genes Fold Enrichment P-Valuea

GO0031347 Regulation of defense response 36 270 291E-2GO0050776 Regulation of immune response 50 218 195E-3GO0002694 Regulation of leukocyte activation 31 249 189E-2GO0002275 Myeloid cell activation involved in immune response 29 253 237E-2GO0002699 Positive regulation of immune effector process 19 446 101-E3GO0042108 Positive regulation of cytokine biosynthetic process 9 687 223E-2

aAfter Bonferonni correction for multiple testing

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1755

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

well Across all the branch-site models positively selectedgenes represented multiple functional categories relevant tocancer and Petorsquos Paradox

Among the cancer-related genes subjected to positiveselection in cetaceans we identified two with identicalamino acid changes among disparate taxa united by thetraits of large body size andor extreme longevitySpecifically PRDM13 is a tumor suppressor gene that actsas a transcriptional repressor and we found identical DEamino acid substitutions in this gene in sperm whale dol-phin orca and humpback but also manatee (Trichechusmanatus) and African elephant (Loxodonta africana) whichare large-bodied afrotherian mammals that have been thefocus of cancer suppression research (Abegglen et al 2015Sulak et al 2016) Secondly POLE is a cancer-related genethat participates in DNA repair and replication and weobserved one IV substitution shared among orca dol-phin bowhead humpback and common minke whalebut also elephant as well as a second IV substitutionshared with these cetaceans and the little brown bat(Myotis lucifugus) Vesper bats such as M lucifugus areknown for their exceptional longevity relative to theirbody size and have been proposed as model organisms insenescence and cancer research (Foley et al 2018) Parallelchanges in cancer-related genes across these

phylogenetically distinct mammals suggest natural selectionhas acted on similar pathways that limit neoplastic progres-sion in large and long-lived species (Tollis et al 2017)

Petorsquos Paradox and Cancer in Whales and Other LargeMammalsLarge body size has evolved numerous times in mammalsand although it is exemplified in some extant cetaceans gi-gantism is also found in afrotherians perissodactyls and car-nivores (Baker et al 2015) Our results suggest that cancersuppression in large and long-lived mammals has also evolvednumerous times However none of these species iscompletely immune to cancer Elephants have at least a 5lifetime risk of cancer mortality (Abegglen et al 2015) whichis far less than humans but detecting cancer and estimatingcancer incidence and mortality rates in wild cetaceans is morechallenging Mathematical modeling predicting the lifetimerisk of colorectal cancer in mice and humans yielded a rate ofcolorectal cancer at 50 in blue whales by age 50 and 100by age 90 (Caulin et al 2015) This high rate of cancer mor-tality is an unlikely scenario and taken with our genomicresults presented here it suggests that cetaceans have evolvedmechanisms to limit their overall risk of cancer Among ba-leen whales benign neoplasms of the skin tongue and centralnervous system have been reported in humpback whales and

Orca

Minke whale

Cow

Bowhead whale

Bottlenose dolphin

Humpback whale

Sperm whale54

12

84

74

212A

ATRT Suppression of growth escaping programmedcell death change of cellular energetics

Humpback whale

BCORL1TF Invasion and metastasis

PICALMT Change of cellular energetics

PRDM2T Regulation of TP53 expression and degredation

TPRF Involved in the activation of oncogenic kinases

AMER1T Promotes suppression of growth

CCNB1IP1T Suppresses invasion and metastasis

POLET Participates in DNA repair and DNA replication

Cetacea ancestor

KMT2CT Methylates rsquoLys-4rsquo of histone H3

POU2AF1 Promotes proliferative signalingO

PRDM1T Regulation of TP53 expression anddegradation

ERCC5T Excision repair DNA double-strand breakrepair following UV damage

Mysticeti ancestor

PMLT Positively regulates TP53 by inhibiting MDM2-dependent degredation

Odontoceti ancestor

MUC1F Overexpressed in epithelial tumors

Minke and humpback whale ancestor

RETO Plays a role in cell differentiation and growth

C COSMIC GENES

anatomicalstructure

formation involvedin morphogenesis

autophagy

cellproliferation

cellminuscellsignaling

circulatorysystemprocess

growth

pigmentation

proteinmaturation

ribonucleoproteincomplexassembly

symbiosisencompassing

mutualismthrough

parasitismvacuolartransport

B

20

FIG 4 Positively selected genes during cetacean evolution (A) Species tree relationships of six modern cetaceans with complete genomeassemblies estimated from 152 single-copy orthologs Branch lengths are given in coalescent units Outgroup taxa are not shown The completespecies tree of 28 mammals is shown in supplementary figure 4 Supplementary Material online Boxes with numbers indicate the number ofpositively selected genes passing filters and a Bonferroni correction detected on each branch (B) TreeMap from REVIGO for GO biologicalprocesses terms represented by genes evolving under positive selection across all cetaceans Rectangle size reflects semantic uniqueness of GOterm which measures the degree to which the term is an outlier when compared semantically to the whole list of GO terms (C) Cancer gene namesand functions from COSMIC found to be evolving under positive selection in the cetacean branch-site models Superscripts for gene namesindicate as follows T tumor suppressor gene O oncogene F fusion gene Asterisks indicate P-value following FDR correction for multiple testingPlt 001 Plt 0001 Plt 00001

Tollis et al doi101093molbevmsz099 MBE

1756

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

ovarian carcinomas and lymphomas have been detected infin whales (Newman and Smith 2006) Among smaller ceta-ceans one unusually well-documented case study concludedthat 27 of beluga whales (Delphinapterus leucas) found deadin the St Lawrence estuary had cancer which may have con-tributed to 18 of the total mortality in that population(Martineau et al 2002) The authors suggested that thehigh degree of polycyclic aromatic hydrocarbons releasedinto the estuary by nearby industry may have contributedto this elevated cancer risk (Martineau et al 2002) Bycontrast the larger baleen whale species in the Gulf ofSt Lawrence appear to have lower contaminant burdenslikely due to ecological differences (Gauthier et al 1997)Interestingly unlike in human cells homologous recom-bination is uninhibited in North Atlantic right whale lungcells following prolonged exposure to the human lungcarcinogen particulate hexavalent chromium (Browninget al 2017) suggesting adaptations for high-fidelity DNArepair in whales

In this study we provide a de novo reference assemblyfor the humpback whalemdashone of the more well-studiedgiants living on Earth today The humpback whale genomeassembly is highly contiguous and contains a comparablenumber of orthologous genes to other mammalian ge-nome projects Our comparisons with other complete ce-tacean genomes confirm the results of other studies whichconcluded that rorqual whales likely began diversifyingduring the Miocene (Slater et al 2017 Arnason et al2018) We found indications of positive selection onmany protein-coding genes suggestive of adaptive changein pathways controlling the mammalian appendicular andcranial skeletal elements which are relevant to highly spe-cialized cetacean phenotypes as well as in many immunitygenes and pathways that are known to place checks onneoplastic progression LSDs in cetacean genomes containmany genes involved in the control of apoptosis includingknown tumor suppressor genes and skin transcriptomeresults from humpback whale suggest many gene duplica-tions whether through segmental duplication or retro-transposition are transcribed and hence likely functionalWe also use genome-wide evidence to show that germlinemutation rates may be slower in cetaceans than in othermammals which has been suggested in previous studies(Jackson et al 2009) and we suggest as a corollary thatcetacean somatic mutations rates may be lower as wellThese results are consistent with predictions stemmingfrom Petorsquos Paradox (Peto et al 1975 Caulin and Maley2011) which posited that gigantic animals have evolvedcompensatory adaptations to cope with the negativeeffects of orders of magnitude more cells and long lifespansthat increase the number of cell divisions and cancer riskover time Altogether the humpback whale genome as-sembly will aid comparative oncology research that seeksto improve therapeutic targets for human cancers as wellas provide a resource for developing useful genomicmarkers that will aid in the population management andconservation of whales

Materials and Methods

Tissue Collection and DNA ExtractionBiopsy tissue was collected from an adult female humpbackwhale (ldquoSaltrdquo NCBI BioSample SAMN1058501) in the Gulf ofMaine western North Atlantic Ocean using previously de-scribed techniques (Lambertsen 1987 Palsboslashll et al 1991) andflash frozen in liquid nitrogen We extracted DNA from skinusing the protocol for high-molecular-weight genomic DNAisolation with the DNeasy Blood and Tissue purification kit(Qiagen) Humpback whales can be individually identifiedand studied over time based on their unique ventral flukepigmentation (Katona and Whitehead 1981) Salt was specif-ically selected for this study because of her 35-year priorsighting history which is among the lengthiest and detailedfor an individual humpback whale (Center for Coastal Studiesunpublished data)

De Novo Assembly of the Humpback Whale GenomeUsing a combination of paired-end and mate-pair libraries denovo assembly was performed using Meraculous 204(Chapman et al 2011) with a kmer size of 47 Reads weretrimmed for quality sequencing adapters and mate-pairadapters using Trimmomatic (Bolger et al 2014) The genomesize of the humpback whale was estimated using the shortreads by counting the frequency of kmers of length 27 oc-curring in the 180-bp data set estimating the kmer coverageand using the following formula genome size frac14 total kmers kmer coverage

Chicago Library Preparation and SequencingFour Chicago libraries were prepared as described previously(Putnam et al 2016) Briefly for each library500 ng of high-molecular-weight genomic DNA (mean fragment lengthgt50kb) was reconstituted into chromatin in vitro and fixed withformaldehyde Fixed chromatin was digested with MboI orDpnII the 50-overhangs were repaired with biotinylatednucleotides and blunt ends were ligated After ligation cross-links were reversed and the DNA purified from protein Biotinthat was not internal to ligated fragments was removed fromthe purified DNA The DNA was then sheared to 350 bpmean fragment size and sequencing libraries were generatedusing NEBNext Ultra (New England BioLabs) enzymes andIllumina-compatible adapters Biotin-containing fragmentswere isolated using streptavidin beads before PCR enrichmentof each library The libraries were sequenced on an IlluminaHiSeq 2500 platform

Scaffolding the De Novo Assembly with HiRiseThe input de novo assembly shotgun reads and Chicagolibrary reads were used as input data for HiRise a softwarepipeline designed specifically for using Chicago data to scaf-fold genome assemblies (Putnam et al 2016) Shotgun andChicago library sequences were aligned to the draft inputassembly using a modified SNAP read mapper (httpsnapcsberkeleyedu) The separations of Chicago read pairsmapped within draft scaffolds were analyzed by HiRise toproduce a likelihood model for genomic distance between

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1757

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

read pairs and the model was used to identify putative mis-joins and to score prospective joins After scaffolding shotgunsequences were used to close gaps between contigs

Assessing the Gene Content of the Humpback WhaleAssemblyThe expected gene content of the assembly was evaluatedusing the Core Eukaryotic Genes Mapping Approach (Parraet al 2009) which searches the assembly for 458 highly con-served proteins and reports the proportion of 248 of the mosthighly conserved orthologs that are present in the assemblyWe also used the Benchmarking Universal Single CopyOrthologs (BUSCO v201 Sim~ao et al 2015) which analyzesgenome assemblies for the presence of 3023 genes conservedacross vertebrates as well as a set of 6253 genes conservedacross laurasiatherian mammals

Transcriptome Sequencing and AssemblyIn order to aid in our gene-finding efforts for the humpbackwhale genome assembly and to measure gene expression wegenerated transcripts from skin tissue by extracting total RNAusing the QIAzol Lysis Reagent (Qiagen) followed by purifi-cation on RNeasy spin columns (Qiagen) RNA integrity andquantity were determined on the Agilent 2100 Bioanalyzer(Agilent) using the manufacturerrsquos protocol The total RNAwas treated with DNase using DNase mix from the RecoverAllTotal Nucleic Acid Isolation kit (Applied BiosystemsAmbion) The RNA library was prepared and sequenced bythe Genome Technology Center at the University ofCalifornia Santa Cruz including cDNA synthesis with theOvation RNA-Seq system V2 (Nugen) and RNA amplificationas described previously (Tariq et al 2011) We used 05ndash1mgof double-stranded cDNA for library preparation shearedusing the Covaris S2 size-selected for 350ndash450 bp using au-tomated electrophoretic DNA fractionation system(LabChipXT Caliper Life Sciences) Paired-end sequencing li-braries were constructed using Illumina TruSeq DNA SamplePreparation Kit Following library construction samples werequantified using the Bioanalyzer and sequenced on theIllumina HiSeq 2000 platform to produce 2 100 bp se-quencing reads We then used Trinity (Grabherr et al 2011)to assemble the adapter-trimmed RNA-Seq reads intotranscripts

Genome AnnotationWe generated gene models for the humpback whale usingmultiple iterations of MAKER2 (Holt and Yandell 2011) whichincorporated 1) direct evidence from the Trinity-assembledtranscripts 2) homology to NCBI proteins from ten mammals(human mouse dog cow sperm whale bottlenose dolphinorca bowhead whale common minke whale and baiji) andUniProtKBSwiss-Prot (UniProt Consortium 2015) and 3) abinitio gene predictions using SNAP (11292013 release Korf2004) and Augustus v302 (Stanke et al 2008) A detaileddescription of the annotation pipeline is provided in the sup-plementary Methods Supplementary Material online Finalgene calls were annotated functionally by BlastP similarity to

UniProt proteins (UniProt Consortium 2015) with an e-valuecutoff of 1e-6

Repeat Annotation and Evolutionary AnalysisTo analyze the repetitive landscape of the humpback whalegenome we used both database and de novo modelingmethods For the database method we ran RepeatMaskerv405 (httpwwwrepeatmaskerorg accessed August 212017) (Smit et al 2015a) on the final assembly indicatingthe ldquomammaliardquo repeat library from RepBase (Jurka et al2005) For the de novo method we scanned the assemblyfor repeats using RepeatModeler v108 (httpwwwrepeat-maskerorg) (Smit et al 2015b) the results of which were thenclassified using RepeatMasker To estimate evolutionary di-vergence within repeat subfamilies in the humpback whalegenome we generated repeat-family-specific alignments andcalculated the average Kimura-2-parameter divergence fromconsensus within each family correcting for high mutationrates at CpG sites with the calcDivergenceFromAlignplRepeatMasker tool We compared the divergence profile ofhumpback whale and bowhead whale by completing parallelanalyses and the repetitive landscapes of orca and bottlenosedolphin are available from the RepeatMasker server (httpwwwrepeatmaskerorgspecies accessed August 21 2017)

Analysis of Gene Expression Using RNA-SeqSplice-wise mapping of RNA-Seq reads against the humpbackwhale genome assembly and annotation was carried out us-ing STAR v24 (Dobin et al 2013) and we counted the num-ber of reads mapping to gene annotations We also mappedthe skin RNA-Seq data to the database of annotated hump-back whale transcripts using local alignments with bowtiev225 (Langmead and Salzberg 2012) and used stringtiev134 (Pertea et al 2015) to calculate gene abundances bytranscripts per million

Analysis of Segmental Duplications in CetaceanGenomesIn order to detect LSDs in several cetacean genomes we ap-plied an approach based on depth of coverage (Alkan et al2009) To this end we used whole-genome shotgun sequencedata from the current study as well as from other cetaceangenomics projects All data were mapped against the hump-back whale reference assembly A detailed description of thesegmental duplication analysis is provided in the supplemen-tary Methods Supplementary Material online

Whole-Genome AlignmentsWe generated WGAs of 12 mammals (supplementary table 2Supplementary Material online) First we generated pairwisesyntenic alignments of each species as a query to the humangenome (hg19) as a target using LASTZ v102 (Harris 2007)followed by chaining to form gapless blocks and netting torank the highest scoring chains (Kent et al 2003) The pairwisealignments were used to construct a multiple sequence align-ment with MULTIZ v112 (Blanchette et al 2004) with humanas the reference species We filtered the MULTIZ alignment to

Tollis et al doi101093molbevmsz099 MBE

1758

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

only contain aligned blocks from at least 10 out of the 12species (81 complete)

Phylogenetic Reconstruction Using Single-CopyOrthologsWe downloaded the coding DNA sequences from 28 publiclyavailable mammalian genome assemblies (supplementary ta-ble 9 Supplementary Material online) and used VESPA(Webb et al 2017) to obtain high-confidence SGOs (supple-mentary Methods Supplementary Material online) For phy-logenetic analysis we filtered the SGO data set to include onlyloci that were represented by at least 24 out of the 28 mam-malian species (86 complete) and reconstructed each genetree using maximum likelihood in PhyML v30 (Guindon et al2010) with an HKY85 substitution model and 100 bootstrapreplicates to assess branch support The gene trees were thenbinned and used to reconstruct a species tree using the ac-curate species tree algorithm (ASTRAL-III v56 Zhang et al2018) ASTRAL utilizes the multispecies coalescent modelthat incorporates incomplete lineage sorting and finds thespecies tree stemming from bipartitions predefined by thegene trees Branch support for the species tree was assessedwith local posterior probabilities and branch lengths werepresented in coalescent units where shorter branch lengthsindicate greater gene tree discordance (Sayyari and Mirarab2016)

Rates of Molecular Evolution and Divergence TimeEstimationWe used multiple approaches on independent data sets toestimate rates of molecular evolution and the divergencetimes of the major mammalian lineages including six modernwhales with complete genome assemblies We first focusedon 4-fold degenerate (4D) sites which are positions withincodon alignments where substitutions result in no aminoacid change and can be used to approximate the neutralrate of evolution (Kumar and Subramanian 2002) We usedthe Ensembl human gene annotation to extract codingregions from the 12-mammal WGA using msa_view inPHAST v14 (Hubisz et al 2011) We reconstructed the phy-logeny with the 4D data as a single partition in RAxML v83(Stamatakis 2014) under the GTRGAMMA substitutionmodel and assessed branch support with 10000 bootstrapsRates of molecular evolution were estimated on the 4D dataset with the semiparametric PL method implemented in r8sv18 (Sanderson 2002 2003) A detailed description of the PLmethod is given in the supplementary MethodsSupplementary Material online

We also used the approximate likelihood calculation inMCMCtree (Yang and Rannala 2006) to estimate divergencetimes using independent data sets 1) the above-mentioned4D data set derived from the WGA as well as 2) a set of theSGOs that included 24 out of 28 sampled taxa (86) and waspartitioned into three codon positions We implemented theHKY85 substitution model multiple fossil-based priors (sup-plementary table 10 Supplementary Material online Mitchell1989 Benton et al 2015 Hedges et al 2015) and independentrates (ldquoclockfrac14 3rdquo) along branches All other parameters were

set as defaults For each MCMCtree analysis we ran the anal-ysis three times with different starting seeds and modified theMarkov chain Monte Carlo (MCMC) length and samplingfrequency in order to achieve proper chain convergencemonitored with Tracer v17 We achieved proper MCMCconvergence on the 4D data set after discarding the first500000 steps as burn-in and sampling every 2000 steps untilwe collected 20000 samples We achieved proper MCMCconvergence on the SGO data set after discarding the first500000 steps as burn-in and sampling every 10000 steps untilwe collected 10000 samples

Demographic AnalysisWe used the PSMC (Li and Durbin 2011) to reconstruct thepopulation history of North Atlantic humpback whales in-cluding the individual sequenced in the current study (down-sampled to 20 coverage) and a second individualsequenced at 17 coverage in Arnason et al (2018) Adetailed description of the PSMC analysis is provided in thesupplementary Methods Supplementary Material online

Nonneutral Substitution Rates in Cetacean GenomesIn order to identify genomic regions controlling cetacean-specific adaptations we used phyloP (Pollard et al 2010) todetect loci in the 12-mammal WGA that depart from neutralexpectations (see supplementary Methods SupplementaryMaterial online) We then collected accelerated regions thatoverlapped human whole gene annotations (hg19) using bed-tools intersect (Quinlan and Hall 2010) and tested for theenrichment of GO terms using the PANTHER analysis toolavailable at the Gene Ontology Consortium website (GOOntology database last accessed June 2017) (GeneOntology Consortium 2015)

Detection of Protein-Coding Genes Subjected toPositive SelectionIn order to measure selective pressures acting on protein-coding genes during cetacean evolution with an emphasison the evolution of cancer suppression we estimated theratio of nonsynonymous to synonymous substitutions (dNdS) To maximize statistical power in pairwise comparisonsgiven the number of available cetacean genomes (six lastaccessed September 2017) we implemented phylogenetictargeting (Arnold and Nunn 2010) assuming a phylogenyfrom a mammalian supertree (Fritz et al 2009) To selectgenome assemblies most suitable for assessing PetorsquosParadox we weighted scores for contrasts with a lot of changein the same direction for both body mass and maximumlongevity Trait values were taken from panTHERIA (Joneset al 2009) and we selected maximal pairings based on thestandardized summed scores We then generated pairwisegenome alignments as described above based on the phylo-genetic targeting results For each pairwise genome align-ment we stitched gene blocks in Galaxy (Blankenberg et al2011) according to the target genome annotations produc-ing alignments of one-to-one orthologs which were filteredto delete frameshift mutations and replace internal stopcodons with gaps We then estimated pairwise dNdS for every

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1759

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

orthologous gene pair with KaKs_Calculator v20 (Wang et al2010) To link genes with dNdS gt1 to potential phenotypeswe used orthologous human Ensembl gene IDs to collect GOterms in BioMart (Kinsella et al 2011) and tested for enrich-ment of overrepresented GO terms

We also used codon-based models to test for selectivepressure variation along branches of the cetacean phylog-eny in comparison to other mammal lineages also knownas the branch-site test (Yang 2007) First the known spe-cies phylogeny (Morgan et al 2013 Tarver et al 2016) waspruned to correspond to the species present in each SGOfamily SGO nucleotide alignments that contained morethan seven species were analyzed for selective pressurevariation This is to reduce the risk of detecting falsepositives (Anisimova et al 2001 2002) In general thebranch-site test is a powerful yet conservative approach(Gharib and Robinson-Rechavi 2013) although modelmisspecification and alignment errors can greatly increasethe number of false positives (Anisimova et al 20012002) Recent studies have concluded that many pub-lished inferences of adaptive evolution using the branchsite test may be affected by artifacts (Venkat et al 2018)Therefore extensive filtering is necessary in order to makereasonably sound conclusions from results of the branchsite test A detailed description of all tested models andthe filtering process are given in the supplementaryMethods Supplementary Material online In total 1152gene families were analyzed We carried out the branch-site test using PAML v44e (Yang 2007) The following fivebranches were assessed as foreground humpback whalethe most recent common ancestor (MRCA) of the com-mon minke and humpback whales MRCA of baleenwhales MRCA of toothed whales and the MRCA of allwhales (cetacean stem lineage) For each model we keptall genes that met a significance threshold of Plt 005 aftera Bonferonni correction for multiple hypothesis testingusing the total number of branch genes (five foregroundbranches1152 genes) We also corrected the raw P-val-ues from the likelihood ratio tests of every gene by theFDR correction where qfrac14 005 (Benjamini and Hochberg1995) The Bonferroni correction is more conservativethan the FDR but sufficient for multiple hypothesis test-ing of lineage-specific positive selection in these casesthe FDR results in higher probabilities of rejecting truenull hypotheses (Anisimova and Yang 2007) Thereforewe used Bonferroni-corrected results in downstreamanalyses but also report FDR-corrected P-values infigure 4C Genes were identified based on the humanortholog (Ensembl gene ID) and we performed gene an-notation enrichment analysis and functional annotationclustering with DAVID v68 (Huang et al 2009a 2009b) aswell as semantic clustering of GO terms using REVIGO(Supek et al 2011) We also searched for interactions ofpositively selected proteins using STRING v105(Szklarczyk et al 2017) with default parameters andtested for the enrichment of overrepresented GO termsas above and for associated mouse phenotypes usingmodPhEA (Weng and Liao 2017)

Data AvailabilityAll data that contributed to the results of the study are madepublicly available The genomic sequencing RNA sequencingas well as the genome assembly for humpback whale(GCA_0043293851) are available under NCBI BioProjectPRJNA509641 The gene annotation orthologous gene setspositive selection results segmental duplication annotationsand whole-genome alignments used in this study are availableat the Harvard Dataverse (httpsdoiorg107910DVNADHX1O)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsField data collection was supported by the Center for CoastalStudies and conducted under NOAA research permit 16325This work was initiated with funding from Dr Jeffrey Pearl atUC San Fransisco This work was supported in part by NIH(Grants U54 CA217376 U2C CA233254 P01 CA91955 R01CA149566 R01 CA170595 R01 CA185138 and R01CA140657) as well as CDMRP Breast Cancer ResearchProgram (Award BC132057) and the Arizona BiomedicalResearch Commission (Grant ADHS18-198847) to CCMThe findings opinions and recommendations expressedhere are those of the authors and not necessarily those ofthe universities where the research was performed or theNational Institutes of Health PJP would like to acknowledgefunding from Stockholm University and the University ofGroningen Startup funds were provided by the School ofInformatics Computing and Cyber Systems at NorthernArizona University (MT) MJOrsquoC would like to thank theUniversity of Leeds funding her 250 Great Minds UniversityAcademic Fellowship The authors thank Wensi Hao(University of Groningen) for technical support AndreaCabrera (University of Groningen) for helpful conversationsabout mutation rates and demography and Richard E Green(UC Santa Cruz) for consultations regarding genome se-quencing assembly and annotation The authors acknowl-edge Research Computing at Arizona State University (httpwwwresearchcomputingasuedu) the Advanced ResearchComputing resources at the University of Leeds and theMonsoon computing cluster at Northern ArizonaUniversity (httpsnaueduhigh-performance-computing)for providing high-performance computing and storageresources that have contributed to this study

ReferencesAbegglen LM Caulin AF Chan A Lee K Robinson R Campbell MS Kiso

WK Schmitt DL Waddell PJ Bhaskara S 2015 Potential mechanismsfor cancer resistance in elephants and comparative cellular responseto DNA damage in humans JAMA 314(17)1850ndash1860

Alkan C Kidd JM Marques-Bonet T Aksay G Antonacci F HormozdiariF Kitzman JO Baker C Malig M Mutlu O 2009 Personalized copynumber and segmental duplication maps using next-generation se-quencing Nat Genet 41(10)1061ndash1067

Tollis et al doi101093molbevmsz099 MBE

1760

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Anisimova M Bielawski JP Yang Z 2001 Accuracy and power of thelikelihood ratio test in detecting adaptive molecular evolution MolBiol Evol 18(8)1585ndash1592

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19(6)950ndash958

Anisimova M Yang Z 2007 Multiple hypothesis testing to detect line-ages under positive selection that affects only a few sites Mol BiolEvol 24(5)1219ndash1228

Arnason U Lammers F Kumar V Nilsson MA Janke A 2018 Whole-genome sequencing of the blue whale and other rorquals findssignatures for introgressive gene flow Sci Adv 4(4)eaap9873

Arnason U Widegren B 1989 Composition and chromosomal localiza-tion of cetacean highly repetitive DNA with special reference to theblue whale Balaenoptera musculus Chromosoma 98(5)323ndash329

Arnold C Nunn CL 2010 Phylogenetic targeting of research effort inevolutionary biology Am Nat 176(5)601ndash612

Baker CS Perry A Bannister JL Weinrich MT Abernethy RBCalambokidis J Lien J Lambertsen RH Ramirez JU Vasquez O1993 Abundant mitochondrial DNA variation and world-wide pop-ulation structure in humpback whales Proc Natl Acad Sci U S A90(17)8239ndash8243

Baker J Meade A Pagel M Venditti C 2015 Adaptive evolution towardlarger size in mammals Proc Natl Acad Sci U S A 112(16)5093ndash5098

Benjamini Y Hochberg Y 1995 Controlling the false discovery rate apractical and powerful approach to multiple testing J R Stat Soc57289ndash300

Benson RBJ Campione NE Carrano MT Mannion PD Sullivan CUpchurch P Evans DC 2014 Rates of dinosaur body mass evolutionindicate 170 million years of sustained ecological innovation on theavian stem lineage PLoS Biol 12(5)e1001853

Benton MJ Donoghue PCJ Asher RJ Friedman M Near TJ Vinther J2015 Constraints on the timescale of animal evolutionary historyPalaeontol Electron 181ndash106

Berta A Sumich JL Kovacs KM 2015 Marine mammals evolutionarybiology 3rd ed Cambridge (MA) Academic Press

Blanchette M Kent WJ Riemer C Elnitski L Smit AFA Roskin KMBaertsch R Rosenbloom K Clawson H Green ED 2004 Aligningmultiple genomic sequences with the threaded blockset alignerGenome Res 14(4)708ndash715

Blankenberg D Taylor J Nekrutenko A Galaxy T 2011 Making wholegenome multiple alignments usable for biologists Bioinformatics27(17)2426ndash2428

Boissinot S Sookdeo A 2016 The evolution of LINE-1 in vertebratesGenome Biol Evol 8(12)3485ndash3507

Bolger AM Lohse M Usadel B 2014 Trimmomatic a flexible trimmer forIllumina sequence data Bioinformatics 30(15)2114ndash2120

Browning CL Wise CF Wise JP 2017 Prolonged particulate chromateexposure does not inhibit homologous recombination repair inNorth Atlantic right whale (Eubalaena glacialis) lung cells ToxicolAppl Pharmacol 33118ndash23

Carbone L Harris RA Gnerre S Veeramah KR Lorente-Galdos BHuddleston J Meyer TJ Herrero J Roos C Aken B et al 2014Gibbon genome and the fast karyotype evolution of small apesNature 513(7517)195ndash201

Caulin AF Graham TA Wang L-S Maley CC 2015 Solutions to Petorsquosparadox revealed by mathematical modelling and cross-species can-cer gene analysis Philos Trans R Soc B 370(1673)20140222

Caulin AF Maley CC 2011 Petorsquos Paradox evolutionrsquos prescription forcancer prevention Trends Ecol Evol (Amst) 26(4)175ndash182

Chapman JA Ho I Sunkara S Luo S Schroth GP Rokhsar DS 2011Meraculous de novo genome assembly with short paired-end readsPLoS One 6(8)e23501

Chittleborough RG 1959 Determination of age in the humpback whaleMegaptera nodosa (Bonnaterre) Aust J Mar Freshw Res10(2)125ndash143

Clapham PJ Mead JG 1999 Megaptera novaeangliae Mamm Species6041ndash9

Clement V Dunand-Sauthier I Clarkson SG 2006 Suppression of UV-induced apoptosis by the human DNA repair protein XPG CellDeath Differ 13(3)478ndash488

Dobin A Davis CA Schlesinger F Drenkow J Zaleski C Jha S Batut PChaisson M Gingeras TR 2013 STAR ultrafast universal RNA-Seqaligner Bioinformatics 29(1)15ndash21

Fay JC Wu C-I 2003 Sequence divergence functional constraint andselection in protein evolution Annu Rev Genomics Hum Genet4213ndash235

Ferris E Abegglen LM Schiffman JD Gregg C 2018 Accelerated evolu-tion in distinctive species reveals candidate elements for clinicallyrelevant traits including mutation and cancer resistance Cell Rep22(10)2742ndash2755

Fog CK Galli GG Lund AH 2012 PRDM proteins important players indifferentiation and disease Bioessays 34(1)50ndash60

Foley NM Hughes GM Huang Z Clarke M Jebb D Whelan CV Petit EJTouzalin F Farcy O Jones G et al 2018 Growing old yet stayingyoung the role of telomeres in batsrsquo exceptional longevity Sci Adv4(2)eaao0926

Foote AD Liu Y Thomas GWC Vinar T Alfoldi J Deng J Dugan Svan Elk CE Hunter ME Joshi V et al 2015 Convergent evolu-tion of the genomes of marine mammals Nat Genet47(3)272ndash275

Forbes SA Beare D Gunasekaran P Leung K Bindal N Boutselakis HDing M Bamford S Cole C Ward S et al 2015 COSMIC exploringthe worldrsquos knowledge of somatic mutations in human cancerNucleic Acids Res 43(D1)D805ndashD811

Fritz SA Bininda-Emonds ORP Purvis A 2009 Geographical variation inpredictors of mammalian extinction risk big is bad but only in thetropics Ecol Lett 12(6)538ndash549

Futreal PA Coin L Marshall M Down T Hubbard T Wooster R RahmanN Stratton MR 2004 A census of human cancer genes Nat RevCancer 4(3)177ndash183

Gabriele C Lockyer C Straley JM Jurasz CM Kato H 2010 Sightinghistory of a naturally marked humpback whale (Megaptera novaean-gliae) suggests ear plug growth layer groups are deposited annuallyMar Mam Sci 26(2)443ndash450

Gauthier JM Metcalfe CD Sears R 1997 Chlorinated organic contam-inants in blubber biopsies from northwestern Atlantic balaenopteridwhales summering in the Gulf of St Lawrence Mar Environ Res44(2)201ndash223

Gene Ontology Consortium 2015 Gene Ontology Consortium goingforward Nucleic Acids Res 43D1049ndashD1056

George JC Bada J Zeh J Scott L Brown SE OrsquoHara T Suydam R 1999Age and growth estimates of bowhead whales (Balaena mysticetus)via aspartic acid racemization Can J Zool 77(4)571ndash580

Gharib WH Robinson-Rechavi M 2013 The branch-site test of positiveselection is surprisingly robust but lacks power under synonymoussubstitution saturation and variation in GC Mol Biol Evol30(7)1675ndash1686

Grabherr MG Haas BJ Yassour M Levin JZ Thompson DA Amit IAdiconis X Fan L Raychowdhury R Zeng Q et al 2011 Full-lengthtranscriptome assembly from RNA-Seq data without a referencegenome Nat Biotechnol 29(7)644ndash652

Guindon S Dufayard J-F Lefort V Anisimova M Hordijk W Gascuel O2010 New algorithms and methods to estimate maximum-likelihood phylogenies assessing the performance of PhyML 30Syst Biol 59(3)307ndash321

Harris RS 2007 Improved pairwise alignment of genomic DNA [PhDthesis] The Pennsylvania State University State College PA USA

Hedges SB Marin J Suleski M Paymer M Kumar S 2015 Tree of lifereveals clock-like speciation and diversification Mol Biol Evol32(4)835ndash845

Heim NA Knope ML Schaal EK Wang SC Payne JL 2015 Copersquos rule inthe evolution of marine animals Science 347(6224)867ndash870

Holt C Yandell M 2011 MAKER2 an annotation pipeline and genome-database management tool for second-generation genome projectsBMC Bioinformatics 12491

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1761

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Huang DW Sherman BT Lempicki RA 2009a Bioinformatics enrich-ment tools paths toward the comprehensive functional analysis oflarge gene lists Nucleic Acids Res 37(1)1ndash13

Huang DW Sherman BT Lempicki RA 2009b Systematic and integra-tive analysis of large gene lists using DAVID bioinformatics resourcesNat Protoc 4(1)44ndash57

Hubisz MJ Pollard KS Siepel A 2011 PHAST and RPHAST phylogeneticanalysis with spacetime models Brief Bioinform 12(1)41ndash51

Hudson RR 1990 Gene genealogies and the coalescent process InFutuyama DJ Antonovics J editors Oxfords surveys in evolutionarybiology Oxford Oxford University Press p 1ndash44

Ichishima H 2016 The ethmoid and presphenoid of cetaceans JMorphol 277(12)1661ndash1674

Jackson JA Baker CS Vant M Steel DJ Medrano-Gonzalez LPalumbi SR 2009 Big and slow phylogenetic estimates of mo-lecular evolution in baleen whales (suborder Mysticeti) MolBiol Evol 26(11)2427ndash2440

Jackson JA Steel DJ Beerli P Congdon BC Olavarria C Leslie MS PomillaC Rosenbaum H Baker CS 2014 Global diversity and oceanic di-vergence of humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 281(1786)20133222

Jones KE Bielby J Cardillo M Fritz SA OrsquoDell J Orme CDL Safi KSechrest W Boakes EH Carbone C et al 2009 PanTHERIA a spe-cies-level database of life history ecology and geography of extantand recently extinct mammals Ecology 90(9)2648ndash2648

Jurka J Kapitonov VV Pavlicek A Klonowski P Kohany O WalichiewiczJ 2005 Repbase Update a database of eukaryotic repetitive ele-ments Cytogenet Genome Res 110(1ndash4)462ndash467

Kaessmann H 2010 Origins evolution and phenotypic impact of newgenes Genome Res 20(10)1313ndash1326

Kastan MB Bartek J 2004 Cell-cycle checkpoints and cancer Nature432(7015)316ndash323

Katona SK Whitehead HP 1981 Identifying humpback whales usingtheir natural markings Polar Rec 20(128)439ndash444

Keane M Semeiks J Webb AE Li YI Quesada V Craig T Madsen LB vanDam S Brawand D Marques PI et al 2015 Insights into the evolu-tion of longevity from the bowhead whale genome Cell Rep10(1)112ndash122

Kent WJ Baertsch R Hinrichs A Miller W Haussler D 2003 Evolutionrsquoscauldron duplication deletion and rearrangement in the mouseand human genomes Proc Natl Acad Sci U S A100(20)11484ndash11489

Kingsolver JG Pfennig DW 2004 Individual-level selection as acause of Copersquos rule of phyletic size increase Evolution58(7)1608ndash1612

Kinsella RJ Keuroaheuroari A Haider S Zamora J Proctor G Spudich G Almeida-King J Staines D Derwent P Kerhornou A et al 2011 EnsemblBioMarts a hub for data retrieval across taxonomic spaceDatabase (Oxford) 2011bar030

Korf I 2004 Gene finding in novel genomes BMC Bioinformatics 559Kumar S Subramanian S 2002 Mutation rates in mammalian genomes

Proc Natl Acad Sci U S A 99(2)803ndash808Lambertsen RH 1987 A biopsy system for large whales and its use for

cytogenetics J Mammal 68(2)443ndash445Langmead B Salzberg SL 2012 Fast gapped-read alignment with Bowtie

2 Nat Methods 9(4)357ndash359Larsen AH Sigurjonsson J Oslashien N Vikingsson G Palsboslashll PJ 1996

Populations genetic analysis of nuclear and mitochondrial loci inskin biopsies collected from central and northeastern NorthAtlantic humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 2631611ndash1618

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357)493ndash496

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup 2009 The Sequence AlignmentMap format andSAMtools Bioinformatics 25(16)2078ndash2079

Martineau D Lemberger K Dallaire A Labelle P Lipscomb TP Michel PMikaelian I 2002 Cancer in wildlife a case study beluga from the St

Lawrence estuary Quebec Canada Environ Health Perspect110(3)285ndash292

Mazet O Rodrıguez W Grusea S Boitard S Chikhi L 2016 On theimportance of being structured instantaneous coalescence ratesand human evolution ndash lesson for ancestral population size infer-ence Heredity 116(4)362ndash371

Michaud EJ Yoder BK 2006 The primary cilium in cell signaling andcancer Cancer Res 66(13)6463ndash6467

Milholland B Dong X Zhang L Hao X Suh Y Vijg J 2017 Differencesbetween germline and somatic mutation rates in humans and miceNat Commun 815183

Mitchell ED 1989 A new cetacean from the Late Eocene La MesetaFormation Seymour Island Antarctic Peninsula Can J Fish Aquat Sci46(12)2219ndash2235

Morgan CC Foster PG Webb AE Pisani D McInerney JO OrsquoConnell MJ2013 Heterogeneous models place the root of the placental mam-mal phylogeny Mol Biol Evol 30(9)2145ndash2156

Nadachowska-Brzyska K Burri R Smeds L Ellegren H 2016 PSMC anal-ysis of effective population sizes in molecular ecology and its appli-cation to black-and-white Ficedula flycatchers Mol Ecol25(5)1058ndash1072

Newman SJ Smith SA 2006 Marine mammal neoplasia a review VetPathol 43(6)865ndash880

Nunney L 2018 Size matters height cell number and a personrsquos risk ofcancer Proc R Soc B 285(1889)20181743

Ohsumi S 1979 Interspecies relationships among some biologicalparameters in cetaceans and estimation of the natural mortalitycoefficient of the Southern Hemisphere minke whales Rep IntWhaling Comm 29397ndash406

OrsquoLeary MA Gatesy J 2008 Impact of increased character sampling onthe phylogeny of Cetartiodactyla (Mammalia) combined analysisincluding fossils Cladistics 24397ndash442

Pagel M Venditti C Meade A 2006 Large punctuational contribution ofspeciation to evolutionary divergence at the molecular level Science314(5796)119ndash121

Palsboslashll PJ Clapham PJ Mattila DK Larsen F Sears R Siegismund HRSigurjonsson J Vasquez O Arctander P 1995 Distribution ofmtDNA haplotypes in North Atlantic humpback whales the influ-ence of behavior on population structure Mar Ecol Prog Ser1161ndash10

Palsboslashll PJ Larsen F Hansen ES 1991 Sampling of skin biopsies fromfree-raging large cetaceans in West Greenland development of newbiopsy tips and bolt designs Int Whaling Comm Spec Issue Ser13311

Palsboslashll PJ Zachariah Peery M Olsen MT Beissinger SR Berube M 2013Inferring recent historic abundance from current genetic diversityMol Ecol 22(1)22ndash40

Parra G Bradnam K Ning Z Keane T Korf I 2009 Assessing the genespace in draft genomes Nucleic Acids Res 37(1)289ndash297

Pertea M Pertea GM Antonescu CM Chang T-C Mendell JT SalzbergSL 2015 StringTie enables improved reconstruction of a transcrip-tome from RNA-Seq reads Nat Biotechnol 33(3)290ndash295

Peto R Roe FJ Lee PN Levy L Clack J 1975 Cancer and ageing in miceand men Br J Cancer 32(4)411ndash426

Pio R Corrales L Lambris JD 2014 The role of complement in tumorgrowth Adv Exp Med Biol 772229ndash262

Pollard KS Hubisz MJ Rosenbloom KR Siepel A 2010 Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 20(1)110ndash121

Putnam NH OrsquoConnell BL Stites JC Rice BJ Blanchette M Calef R TrollCJ Fields A Hartley PD Sugnet CW et al 2016 Chromosome-scaleshotgun assembly using an in vitro method for long-range linkageGenome Res 26(3)342ndash350

Quinlan AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26(6)841ndash842

Ruegg K Rosenbaum HC Anderson EC Engel M Rothschild A Baker CSPalumbi SR 2013 Long-term population size of the North Atlantichumpback whale within the context of worldwide population struc-ture Conserv Genet 14(1)103ndash114

Tollis et al doi101093molbevmsz099 MBE

1762

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Sanderson MJ 2002 Estimating absolute rates of molecular evolutionand divergence times a penalized likelihood approach Mol Biol Evol19(1)101ndash109

Sanderson MJ 2003 r8s inferring absolute rates of molecular evolutionand divergence times in the absence of a molecular clockBioinformatics 19(2)301ndash302

Sayyari E Mirarab S 2016 Fast coalescent-based computation of localbranch support from quartet frequencies Mol Biol Evol33(7)1654ndash1668

Shadat NMA Koide N Khuda II-E Dagvadorj J Tumurkhuu G Naiki YKomatsu T Yoshida T Yokochi T 2010 Retinoblastoma protein-interacting zinc finger 1 (RIZ1) regulates the proliferation of mono-cytic leukemia cells via activation of p53 Cancer Invest28(8)806ndash812

Sim~ao FA Waterhouse RM Ioannidis P Kriventseva EV Zdobnov EM2015 BUSCO assessing genome assembly and annotation com-pleteness with single-copy orthologs Bioinformatics31btv351ndashbtv3212

Slater GJ Goldbogen JA Pyenson ND 2017 Independent evolution ofbaleen whale gigantism linked to Plio-Pleistocene ocean dynamicsProc R Soc B 284(1855)20170546

Smit AFA Hubley RM Green P 2015a RepeatMasker Open-40 2013-2015 Available from httpwwwrepeatmaskerorg

Smit AFA Hubley RM Green P 2015b RepeatModeler Open-10 2008-2015 Available from httpwwwrepeatmaskerorg

Smith TD Allen J Clapham PJ Hammond PS Katona S Larsen F Lien JMattila D Palsboslashll PJ Sigurjonsson J et al 1999 An ocean-basin-widemark-recapture study of the North Atlantic humpback whale(Megaptera novaeangliae) Mar Mamm Sci 51ndash32

Stamatakis A 2014 RAxML version 8 a tool for phylogenetic analysisand post-analysis of large phylogenies Bioinformatics30(9)1312ndash1313

Stanke M Diekhans M Baertsch R Haussler D 2008 Using native andsyntenically mapped cDNA alignments to improve de novo genefinding Bioinformatics 24(5)637ndash644

Steeman ME Hebsgaard MB Fordyce RE Ho SYW Rabosky DL NielsenR Rahbek C Glenner H Soslashrensen MV Willerslev E 2009 Radiationof extant cetaceans driven by restructuring of the oceans Syst Biol58(6)573ndash585

Sulak M Fong L Mika K Chigurupati S Yon L Mongan NP Emes RDLynch VJ 2016 TP53 copy number expansion is associated with theevolution of increased body size and an enhanced DNA damageresponse in elephants Elife 51850

Supek F Bosnjak M Skunca N Smuc T 2011 REVIGO summarizes andvisualizes long lists of gene ontology terms PLoS One 6(7)e21800

Szklarczyk D Morris JH Cook H Kuhn M Wyder S Simonovic M SantosA Doncheva NT Roth A Bork P et al 2017 The STRING database in2017 quality-controlled protein-protein association networks madebroadly accessible Nucleic Acids Res 45(D1)D362ndashD368

Tacutu R Craig T Budovsky A Wuttke D Lehmann G Taranukha DCosta J Fraifeld VE de Magalh~aes JP 2012 Human Ageing GenomicResources integrated databases and tools for the biology and genet-ics of ageing Nucleic Acids Res 41(D1)D1027ndashD1033

Tariq MA Kim HJ Jejelowo O Pourmand N 2011 Whole-transcriptomeRNAseq analysis from minute amount of total RNA Nucleic AcidsRes 39(18)e120

Tarver JE Dos Reis M Mirarab S Moran RJ Parker S OrsquoReilly JE King BLOrsquoConnell MJ Asher RJ Warnow T et al 2016 The interrelationshipsof placental mammals and the limits of phylogenetic inferenceGenome Biol Evol 8(2)330ndash344

Thewissen JGM Cohn MJ Stevens LS Bajpai S Heyning J Horton WE2006 Developmental basis for hind-limb loss in dolphins and origin

of the cetacean bodyplan Proc Natl Acad Sci U S A103(22)8414ndash8418

Thewissen JGM Cooper LN Clementz MT Bajpai S Tiwari BN 2007Whales originated from aquatic artiodactyls in the Eocene epoch ofIndia Nature 450(7173)1190ndash1194

Tollis M Schiffman JD Boddy AM 2017 Evolution of cancer suppressionas revealed by mammalian comparative genomics Curr Opin GenetDev 4240ndash47

Trego KS Groesser T Davalos AR Parplys AC Zhao W Nelson MRHlaing A Shih B Rydberg B Pluth JM et al 2016 Non-catalytic rolesfor XPG with BRCA1 and BRCA2 in homologous recombination andgenome stability Mol Cell 61(4)535ndash546

UniProt Consortium 2015 UniProt a hub for protein informationNucleic Acids Res 43D204ndashD212

Valsecchi E Palsboslashll PJ Hale P Glockner-Ferrari D Ferrari M Clapham PLarsen F Mattila DK Sears R Sigurjonsson J et al 1997 Microsatellitegenetic distances between oceanic opulations of the humpbackwhale (Megaptera novaeangliae) Mol Biol Evol 14355ndash362

Venkat A Hahn MW Thornton JW 2018 Multinucleotide mutationscause false inferences of lineage-specific positive selection Nat EcolEvol 2(8)1280ndash1288

Vermeij GJ 2016 Gigantism and its implications for the history of lifePLoS One 11(1)e0146092

Wang D Zhang Y Zhang Z Zhu J Yu J 2010 KaKs_Calculator 20 atoolkit incorporating gamma-series methods and sliding windowstrategies Genomics Proteomics Bioinformatics 8(1)77ndash80

Warren WC Kuderna L Alexander A Catchen J Perez-Silva JG Lopez-Otın C Quesada V Minx P Tomlinson C Montague MJ et al 2017The novel evolution of the sperm whale genome Genome Biol Evol9(12)3260ndash3264

Webb AE Walsh TA OrsquoConnell MJ 2017 VESPA very large-scale evo-lutionary and selective pressure analyses PeerJ Comput Sci 3e118

Weng M-P Liao B-Y 2017 modPhEA model organism PhenotypeEnrichment Analysis of eukaryotic gene sets Bioinformatics33(21)3505ndash3507

Wilson Sayres MA Venditti C Pagel M Makova KD 2011 Do variationsin substitution rates and male mutation bias correlate with life-history traits A study of 32 mammalian genomes Evolution65(10)2800ndash2815

Yang Z 1998 Likelihood ratio tests for detecting positive selection andapplication to primate lysozyme evolution Mol Biol Evol15(5)568ndash573

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 24(8)1586ndash1591

Yang Z Rannala B 2006 Bayesian estimation of species divergence timesunder a molecular clock using multiple fossil calibrations with softbounds Mol Biol Evol 23(1)212ndash226

Yim H-S Cho YS Guang X Kang SG Jeong J-Y Cha S-S Oh H-M Lee J-HYang EC Kwon KK et al 2014 Minke whale genome and aquaticadaptation in cetaceans Nat Genet 46(1)88ndash92

Yu J Tian S Metheny-Barlow L Chew L-J Hayes AJ Pan H Yu G-L LiL-Y 2001 Modulation of endothelial cell growth arrest andapoptosis by vascular endothelial growth inhibitor Circ Res89(12)1161ndash1167

Zhang C Rabiee M Sayyari E Mirarab S 2018 ASTRAL-III polynomialtime species tree reconstruction from partially resolved gene treesBMC Bioinformatics 19(Suppl 6)153

Zhang Q Edwards SV 2012 The evolution of intron size in amniotes arole for powered flight Genome Biol Evol 4(10)1033ndash1043

Zhao M Kim P Mitra R Zhao J Zhao Z 2015 TSGene 20 an updatedliterature-based knowledgebase for tumor suppressor genes NucleicAcids Res 44gkv1268ndashgkD1031

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1763

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

  • msz099-TF1
  • msz099-TF2
  • msz099-TF3
  • msz099-TF4
  • msz099-TF5
  • msz099-TF6
  • msz099-TF7
  • msz099-TF8
Page 4: University of Groningen Return to the sea, get huge, …...Return to the Sea, Get Huge, Beat Cancer: An Analysis of Cetacean Genomes Including an Assembly for the Humpback Whale (Megaptera

and is likely due to the highly repetitive nature of cetaceangenomes (Arnason and Widegren 1989)

With 95ndash96 of near-universal orthologs from OrthoDBv9 (Sim~ao et al 2015) present in the assembly as well as 97of a set of core eukaryotic genes (Parra et al 2009) the esti-mated gene content of the humpback whale genome assem-bly suggests a high-quality genome with good generepresentation (table 1) To aid in genome annotation wecarried out skin transcriptome sequencing which resulted in281642354 reads (NCBI BioProject PRJNA509641) Thesewere assembled into a transcriptome that includes 67 ofboth vertebrate and laurasiatherian orthologs and we pre-dicted 10167 protein-coding genes with likely ORFs that con-tain BLAST homology to SwissProt proteins (UniProtConsortium 2015) The large number of missing genes from

the transcriptome may be due to the small proportion ofgenes expressed in skin Therefore we also assessed homologywith ten mammalian proteomes from NCBI and the entireSwissProt database and ab initio gene predictors (seeMaterials and Methods supplementary Methods and sup-plementary fig 1 Supplementary Material online) for gene-calling The final genome annotation resulted in 24140 pro-tein-coding genes including 5446 with 50-untranslatedregions (UTRs) and 6863 with 30-UTRs We detected 15465one-to-one orthologs shared with human and 14718 withcow When we compared gene annotations across a sampleof mammalian genomes the humpback whale and bottle-nose dolphin genome assemblies had on average significantlyshorter introns (Pfrac14 004 unpaired T-test supplementary ta-ble 1 Supplementary Material online) which may in partexplain the smaller genome size of cetaceans comparedwith most other mammals (Zhang and Edwards 2012)

We estimated that between 30 and 39 of thehumpback whale genome comprised repetitive elements(table 3) Masking the assembly with a library of knownmammalian elements resulted in the identification of morerepeats than a de novo method suggesting that clade-specific repeat libraries are highly valuable when assessingrepetitive content The most abundant group of transpos-able elements in the humpback whale genome was theautonomous non-long terminal repeat (LTR) retrotranspo-sons (long interspersed nuclear elements or LINEs) whichcomprised nearly 20 of the genome most of which be-long to the LINE-1 clade as is typical of placental mammals(Boissinot and Sookdeo 2016) Large numbers of nonau-tonomous non-LTR retrotransposons in the form of shortinterspersed nuclear elements (SINEs) were also detectedin particular over 3 of the genome belonged to mamma-lian inverted repeats Although the divergence profile of denovo-derived repeat annotations in humpback whale in-cluded a decreased average genetic distance within trans-posable element subfamilies compared with the database-derived repeat landscape both repeat libraries displayed aspike in the numbers of LINE-1 and SINE retrotransposonsubfamilies near 5 divergence as did the repeat land-scapes of the bowhead whale orca and dolphin suggestingrecent retrotransposon activity in cetaceans (supplemen-tary figs 2 and 3 Supplementary Material online)

Slow DNA Substitution Rates in Cetaceans and theDivergence of Modern Whale LineagesWe computed a whole-genome alignment (WGA) of 12mammals including opossum elephant human mousedog cow sperm whale bottlenose dolphin orca bowheadwhale common minke whale and humpback whale (supple-mentary table 2 Supplementary Material online) andemployed human gene annotations to extract 2763828 ho-mologous 4-fold degenerate (4D) sites A phylogenetic anal-ysis of the 4D sites yielded the recognized evolutionaryrelationships (fig 1A) including reciprocally monophyleticMysticeti and Odontoceti When we compared the substitu-tions per site along the branches of the phylogeny we founda larger number of substitutions along the mouse

Table 1 Genomic Sequence Data Obtained for the Humpback WhaleGenome

Libraries Est Numberof Reads

Avg ReadLength

(bp)

EstDepth(total)

180 bp paired-end 1211320000 94 413300 bp paired-end 25820000 123 12500 bp paired-end 112400000 123 50600 bp paired-end 395500000 93 1342 kb mate-paired 348080000 49 6210 kb mate-paired 279000000 94 90

Subtotal for WGS libraries 2372120000 761Chicago Library 1 72000000 100 53Chicago Library 2 6000000 151 07Chicago Library 3 190000000 100 139Chicago Library 4 79000000 100 58Subtotal for Chicago Libraries 347000000 256Total for all sequence

libraries2719120000 1017

NOTEmdashWGS whole-genome shotgun

Table 2 Statistics for the Humpback Whale Genome Assembly

Feature Initial Assembly Final Assembly

Assembly length 227 Gb 227 GbContig N50 124 kb 123 kbLongest scaffold 22 Mb 294 MbNumber of scaffolds 24319 2558Scaffold N50 198 kb 914 MbScaffold N90 53 kb 235 MbScaffold L50 3214 79Scaffold L90 11681 273Percent genome

in gaps536 545

BUSCOa resultsmdashvertebrata

C 85[D 18] F 15 M 49 n 3023

BUSCOa resultsmdashlaurasiatheria

C 912[D 08] F 48 M 40 n 6253

CEGMAa results C 226 (9113) P 240 (9677)

NOTEmdashBUSCO Benchmarking Universal Single Copy Orthologs (Sim~ao et al 2015)C complete D duplicated F fragmented M missing CEGMA Core EukaryoticGenes Mapping Approach (Parra et al 2009) C complete P complete andorpartialaBUSCO and CEGMA results for final assembly only

Tollis et al doi101093molbevmsz099 MBE

1748

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

(rodent) branch supporting negative relationships betweengeneration time (Wilson Sayres et al 2011) speciation (Pagelet al 2006) and substitution rates When applying a semi-parametric penalized likelihood (PL) method to estimate sub-stitution rate variation at 4D sites across the 12 mammals we

found that cetaceans have accumulated the lowest numberof DNA substitutions per site per million years (fig 1B andsupplementary table 3 Supplementary Material online)which may be attributed to long generation times or slowermutation rates in cetaceans (Jackson et al 2009) Germlinemutation rates are related to somatic mutation rates withinspecies (Milholland et al 2017) therefore it is possible thatslow mutation rates may limit neoplastic progression andcontribute to cancer suppression in cetaceans which is aprediction of Petorsquos Paradox (Caulin and Maley 2011)

We also obtained 152 single-copy orthologs (single-geneortholog families or SGOs see Materials and Methods andsupplementary Methods Supplementary Material online)identified in at least 24 out of 28 species totaling314844 bp and reconstructed gene trees that were binnedand analyzed using a species tree method that incorporatesincomplete lineage sorting (see Materials and MethodsZhang et al 2018) The species tree topology (supplementaryfig 4 Supplementary Material online) also included full sup-port for the accepted phylogenetic relationships withinCetacea as well as within Mysticeti and Odontoceti Lowerlocal posterior probabilities for two of the internal brancheswithin laurasiatherian mammals were likely due to the exten-sive gene tree heterogeneity that has complicated phyloge-netic reconstruction of the placental mammalian lineages(Tarver et al 2016)

We estimated divergence times in a Bayesian frameworkusing the 4D and SGO data sets independently in MCMCtree(Yang and Rannala 2006) resulting in similar posterior distri-butions and parameter estimates with overlapping highestposterior densities for the estimated divergence times ofshared nodes across the 4D and SGO phylogenies (supple-mentary figs 5 and 6 and tables 4 and 5 SupplementaryMaterial online) We estimated that the time to the mostrecent common ancestor (TMRCA) of placental mammalswas 100ndash114 Ma during the late Cretaceous the TMRCA ofcow and cetaceans (Cetartiodactyla) was 52ndash65 Ma duringthe Eocene or Paleocene the TMRCA of extant cetaceans was29ndash35 Ma during the early Oligocene or late Eocene (betweenthe two data sets) the TMRCA of baleen whales was placed9ndash26 Ma in the early Miocene or middle Oligocene and theTMRCA of humpback and common minke whales (familyBalaenopteridae) was 4ndash22 Ma during the early Pliocene orthe Miocene (fig 2A)

Table 3 Repetitive Content of the Humpback Whale (Megaptera novaeangliae) Genome Estimated with a Library of Known Mammalian Repeats(RepBase) and De Novo Repeat Identification (RepeatModeler)

RepBase RepeatModeler

Repeat Type Length (bp) Genome(3885 total)

Length (bp) Genome(3025 total)

SINEs 137574621 607 75509694 333LINEs 440955223 1946 432017456 1907LTR 142117286 627 94177184 416DNA transposons 84243186 372 54015996 238Unclassified 1303231 006 4339463 019Satellites 48894580 216 197862 001Simple repeats 20779839 092 20848394 092Low complexity 4167187 018 4281173 019

A

B

FIG 1 Substitution rates in cetacean genomes (A) Maximum likeli-hood phylogeny of 12 mammals based on 2763828 fourfold degen-erate sites Branch lengths are given in terms of substitutions per siteexcept for branches with hatched lines which are shortened for visualconvenience All branches received 100 bootstrap support (B)Based on the phylogeny in (A) a comparison of the estimated DNAsubstitution rates (in terms of substitutions per site per million years)between terminal and internal cetacean branches and terminal andinternal branches of all other mammals

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1749

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

FIG 2 Timescale of humpback whale evolution (A) Species phylogeny of 28 mammals constructed from 152 orthologs and time-calibrated usingMCMCtree Branch lengths are in terms of millions of years Node bars indicate 95 highest posterior densities of divergence times Cetaceans arehighlighted in the gray box with mean estimates of divergence times included (B) The effective population size (Ne) changes over timeDemographic histories of two North Atlantic humpback whales estimated from the PSMC analysis including 100 bootstrap replicates per analysisMutation rate used was 154e-9 per year and generation time used was 215 years

Tollis et al doi101093molbevmsz099 MBE

1750

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

A Complex Demographic History of North AtlanticHumpback WhalesWe estimated the demographic history of the North Atlantichumpback whale population applying the Pairwise SequentialMarkovian Coalescent (PSMC) (Li and Durbin 2011) to theshort-insert libraries generated during this study as well assequence reads from a second North Atlantic humpbackwhale (Arnason et al 2018) (fig 2B supplementary figs 7and 8 Supplementary Material online) Consistent with thefindings of Arnason et al (2018) we estimated that the largesthumpback whale population sizes were 2 Ma during thePliocenendashPleistocene transition followed by a steady declineuntil 1 Ma The PSMC trajectories of the two humpbackwhales began to diverge 100000 years ago and the esti-mated confidence intervals from 100 bootstraps for eachPSMC analysis were nonoverlapping in the more recentbins Both humpback PSMC trajectories suggested sharp pop-ulation declines beginning 25000ndash45000 years agoHowever interpreting inferred PSMC plots of pastldquodemographicrdquo changes is nontrivial in a globally distributedspecies connected by repeated occasional gene flow such ashumpback whales (Baker et al 1993 Palsboslashll et al 1995Jackson et al 2014) The apparent changes in effective pop-ulation size may represent changes in abundance interoceanconnectivity or a combination of both (Hudson 1990 Palsboslashllet al 2013) Several genetic and genome-based studies ofcetaceans have demonstrated how past large-scale oceanicchanges have affected the evolution of cetaceans (Steemanet al 2009) including baleen whales (Arnason et al 2018)Although the population genetic structure of humpbackwhales in the North Atlantic is not fully resolved the levelof genetic divergence among areas is very low (Larsen et al1996 Valsecchi et al 1997) Therefore the difference betweenthe two humpback whale PSMC trajectories may be due torecent admixture (Baker et al 1993 Palsboslashll et al 1995 Ruegget al 2013 Jackson et al 2014) intraspecific variation andpopulation structure (Mazet et al 2016) as well as errorsdue to differences in sequence coverage (Nadachowska-Brzyska et al 2016)

Segmental Duplications in Cetacean GenomesContain Genes Involved in Apoptosis and TumorSuppressionMammalian genomes contain gene-rich segmental duplica-tions (Alkan et al 2009) which may represent a powerfulmechanism by which new biological functions can arise(Kaessmann 2010) We employed a read-mapping approachto annotate large segmental duplications (LSDs) 10 kb inthe humpback whale genome assembly and ten additionalcetaceans for which whole-genome shotgun data were avail-able (see Materials and Methods supplementary Methodsand supplementary table 6 Supplementary Material online)We found that cetacean genomes contained on average 318LSDs (656 SD) which comprised 99 Mb (618 Mb) andaveraged 31 kb in length (624 kb) We identified10128534 bp (04) of the humpback whale genome assem-bly that comprised 293 LSDs averaging 34568 bp in length

Fifty-one of the LSDs were shared across all 11 cetaceangenomes (supplementary fig 9 Supplementary Material on-line) In order to determine the potential role of segmentalduplications during the evolution of cetacean-specific pheno-types we identified 426 gene annotations that overlappedcetacean LSDs including several genes annotated for viralresponse Other genes on cetacean LSDs were involved inaging in particular DLD in the bowhead whale andKCNMB1 in the blue whale this may reflect relevant adapta-tions contributing to longevity in two of the largest andlongest-lived mammals (Ohsumi 1979 George et al 1999)Multiple tumor suppressor genes were located on cetaceanLSDs including 1) SALL4 in the sei whale 2) TGM3 andSEMA3B in the orca 3) UVRAG in the sperm whale NorthAtlantic right whale and bowhead whale and 4) PDCD5which is upregulated during apoptosis (Zhao et al 2015)and was found in LSDs of all 11 queried cetacean genomesPDCD5 pseudogenes have been identified in the human ge-nome and several Ensembl-hosted mammalian genomescontain one-to-many PDCD5 orthologs however we anno-tated only a single copy of PDCD5 in the humpback whaleassembly This suggests that in many cases gene duplicationsare collapsed during reference assembly but can be retrievedthrough shotgun read-mapping methods (Carbone et al2014) We annotated fully resolved SALL4 and UVRAG copynumber variants in the humpback whale genome assemblyand by mapping the RNA-Seq data from skin to the genomeassembly and annotation (see Materials and methods) wefound that three annotated copies of SALL4 were expressed inhumpback whale skin as were two copies of UVRAG

We also found that145 Mb (6923 kb) of each cetaceangenome consists of LSDs not found in other cetaceans mak-ing them species-specific which averaged 244 kb (6146kb) in length (supplementary table 7 and fig 9Supplementary Material online) The minke whale genomecontained the highest number of genes on its species-specificLSDs (32) After merging the LSD annotations for the twohumpback whales we identified 57 species-specific LSDs forthis species comprising 977 kb and containing nine dupli-cated genes Humpback whale-specific duplications includedthe genes PRMT2 which is involved in growth and regulationand promotes apoptosis SLC25A6 which may be responsiblefor the release of mitochondrial products that trigger apopto-sis and NOX5 which plays a role in cell growth and apoptosis(UniProt Consortium 2015) Another tumor suppressor geneTPM3 was duplicated in the humpback whale assemblybased on our gene annotation However these extranumer-ary copies of TPM3 were not annotated on any humpbackwhale LSDs lacked introns and contained mostly the sameexons suggesting retrotransposition rather than segmentalduplication as a mechanism for their copy number expansion(Kaessmann 2010) According to the RNA-Seq data all sevencopies of TPM3 are expressed in humpback whale skin

Duplications of the tumor suppressor gene TP53 havebeen inferred as evidence for cancer suppression in elephants(Abegglen et al 2015 Caulin et al 2015 Sulak et al 2016)During our initial scans for segmental duplications we no-ticed a large pileup of reads in the MAKER-annotated

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1751

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

humpback whale TP53 (data not shown) We PCR-amplifiedcloned and sequenced this region from a humpback whaleDNA sample inferring four haplotypes that differ at two bases(supplementary Methods Supplementary Material online)After manually annotating TP53 in the humpback whalewe determined that these nucleotide variants fell in noncod-ing regions of the gene one occurred upstream of the startcodon whereas the other occurred between the first andsecond coding exons Other genomic studies have concludedthat TP53 is not duplicated in cetaceans (Yim et al 2014Keane et al 2015 Sulak et al 2016) We consider the possi-bility of at least two TP53 homologs in the genome of thehumpback whale although more data are required to resolvethis Regardless cancer suppression likely arose in differentmammalian lineages via multiple molecular etiologiesOverall our results reveal several copy number expansionsin cetaceans related to immunity aging and cancer suggest-ing that cetaceans are among the large mammals that haveevolved specific adaptations related to cancer resistance

Accelerated Regions in Cetacean Genomes AreSignificantly Enriched with Pathways Relevant toCancerIn order to determine genomic loci underlying cetacean adap-tations we estimated regions in the 12-mammal WGA withelevated substitution rates that were specific to the cetaceanbranches of the mammalian phylogeny These genomicregions departed from neutral expectations in a manner con-sistent with either positive selection or relaxed purifying se-lection along the cetacean lineage (Pollard et al 2010) Wesuccessfully mapped 3260 protein-coding genes with func-tional annotations that overlap cetacean-specific acceleratedregions which were significantly enriched for Gene Ontology(GO) categories such as cell-cell signaling (GO0007267) andcell adhesion (GO0007155) (table 4) Adaptive change in cellsignaling pathways could have maintained the ability of ceta-ceans to prevent neoplastic progression as they evolved largerbody sizes Adhesion molecules are integral to the develop-ment of cancer invasion and metastasis and these resultssuggest that cetacean evolution was accompanied by selec-tion pressure changes on both intra- and extracellular inter-actions Cetacean-specific genomic regions with elevatedsubstitution rates were also significantly enriched in genesinvolved in B-cell-mediated immunity (GO0019724) likely

due to the important role of regulatory cells which modulateimmune response to not only pathogens but perhaps tumorsas well In addition cetacean-specific acceleration in regionscontrolling complement activation (GO0006956) may haveprovided better immunosurveillance against cancer and fur-ther protective measures against malignancies (Pio et al2014) We also found that accelerated regions in cetaceangenomes were significantly enriched for genes controllingsensory perception of smell (GO0007608) perhaps due tothe relaxation of purifying selection in olfactory regions whichwere found to be underrepresented in cetacean genomes(Yim et al 2014)

Selection Pressures on Protein-Coding Genes duringCetacean Evolution Point to Many CetaceanAdaptations Including Cancer SuppressionTo gain further insight into the genomic changes underlyingthe evolution of large body sizes in cetaceans we employedphylogenetic targeting to maximize statistical power in pair-wise evolutionary genomic analyses (Arnold and Nunn 2010)This resulted in maximal comparisons between 1) the orcaand the bottlenose dolphin and 2) the humpback whale andcommon minke whale Despite their relatively recent diver-gences (eg the orcabottlenose dolphin divergence is similarin age to that of the humanchimpanzee divergence seefig 2) the species pairs of common minkehumpback andorcadolphin have each undergone extremely divergent evo-lution in body size and longevity (fig 3) Humpback whalesare estimated to weigh up to four times as much as commonminke whales and are reported to have almost double thelongevity and orcas may weigh almost 20 times as much asbottlenose dolphins also with almost double the lifespan(Tacutu et al 2012) In order to offset the tradeoffs associatedwith the evolution of large body size with the addition ofmany more cells and longer lifespans since the divergence ofeach species pair we hypothesize that necessary adaptationsfor cancer suppression should be encoded in the genomes aspredicted by Petorsquos Paradox (Tollis et al 2017)

For each pairwise comparison we inferred pairwise ge-nome alignments with the common minke whale and orcagenome assemblies as targets respectively and extractedprotein-coding orthologous genes We then estimated theratio of nonsynonymous substitutions per synonymous siteto synonymous substitutions per synonymous site (dNdS) in

Table 4 GO Terms for Biological Processes Overrepresented by Genes Overlapping Genomic Regions with Elevated Substitution Rates That AreUnique to the Cetacean Lineage

Go Term Description Number of Genes Fold Enrichment P-Valuea

GO0007608 Sensory perception of smell 157 458 124E-40GO0006956 Complement activation 39 291 491E-05GO0019724 B-cell-mediated immunity 39 291 491E-05GO0032989 G-protein-coupled receptor signaling pathway 159 244 196E-17GO0042742 Defense response to bacterium 39 244 220E-03GO0009607 Response to biotic stimulus 43 209 185E-02GO0007155 Cell adhesion 101 199 174E-06GO0007267 Cellndashcell signaling 137 169 284E-05

aAfter Bonferonni correction for multiple testing

Tollis et al doi101093molbevmsz099 MBE

1752

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

order to measure selective pressures acting on each ortholo-gous gene pair during cetacean evolution A dNdSgt1 is usedto infer potentially functional amino acid changes in candi-date genes subjected to positive selection (Fay and Wu 2003)Among an estimated 435 genes with dNdS gt1 in the com-mon minkehumpback pairwise comparison we detectedeight genes belonging to the JAK-STAT signaling pathway(39-fold enrichment Pfrac14 11E-3 Fisherrsquos exact test) and seveninvolved in cytokinendashcytokine receptor interaction (41-foldenrichment Pfrac14 17E-2 Fisherrsquos exact test) suggesting positiveselection acting on pathways involved in cell proliferationThese genes included multiple members of the tumor necro-sis factor subfamily such as TNFSF15 which inhibits angiogen-esis and promotes the activation of caspases and apoptosis(Yu et al 2001) A dNdS gt1 was also detected in seven genesinvolved in the negative regulation of cell growth(GO0030308 31-fold enrichment Pfrac14 803E-3 Fisherrsquos exacttest) and five genes involved in double-strand break repair(GO0006302 40-fold enrichment Pfrac14 803E-3 Fisherrsquos exacttest) Although these results suggest the evolution of aminoacid differences since the split between common minke andhumpback whales in genes affecting cell growth proliferationand maintenance the GO category enrichment tests did notpass significance criteria after Bonferroni corrections for mul-tiple testing We found 18 genes that are mutated in cancersaccording to the COSMIC v85 database (Forbes et al 2015) inthe common minkehumpback comparison including a sub-set of five annotated as tumor suppressor genes oncogenesor fusion genes in the Cancer Gene Census (CGC Futreal et al2004) which are highlighted in table 5 The complete list ofCOSMIC genes with elevated dNdS in the pairwise compar-isons is given in supplementary table 8 SupplementaryMaterial online We detected 555 orthologous genes withdNdS gt1 in the orcadolphin comparison which are signifi-cantly enriched (after Bonferroni correction for multiple test-ing) for biological processes such as immune response cellactivation and regulation of cytokines (table 6) and 41 of

which are known cancer genes according to COSMIC andCGC (table 5 supplementary table 8 Supplementary Materialonline) These results are consistent with our accelerated re-gion analysis based on the WGA which showed acceleratedevolution in immunity pathways (above see table 4) Forinstance eight genes (CD58 CD84 KLF13 SAMSN1 CTSGGPC3 LTF and SPG21) annotated for immune system process(GO0002376) were found in cetacean-specific acceleratedgenomic regions and also had a pairwise dNdS gt1 in theorcadolphin comparison mirroring other recent genomicanalyses of immunity genes in orcas (Ferris et al 2018) Ourresults also suggest that the evolution of gigantism and longlifespans in cetaceans was accompanied by selection actingon many genes related to somatic maintenance and cellsignaling

As a more accurate assessment of selection pressure var-iation acting on protein-coding genes across cetacean evolu-tion we conducted an additional assessment of dNdS usingbranch-site codon models implemented in codeml (Yang1998) We employed extensive filtering of the branch siteresults including both false discovery rate (FDR) andBonferroni corrections for multiple testing (see Materialsand methods) and conservatively estimated that 450protein-coding genes were subjected to positive selection incetaceans These include 54 genes along the ancestral ceta-cean branch 12 along the ancestral toothed whale branch 84along the ancestral baleen whale branch 74 in the ancestor ofcommon minke and humpback whales and 212 unique tothe humpback whale branch (fig 4A) Cetacean positivelyselected genes were annotated for functions related to exten-sive changes in anatomy growth cell signaling and cell pro-liferation (fig 4B) For instance in the branch-site models forhumpback whale positively selected genes are enriched forseveral higher-level mouse limb phenotypes including thoseaffecting the limb long bones (MP0011504 15 genes FDR-corrected P-valuefrac14 0001) and more specifically the hindlimb stylopod (MP0003856 seven genes FDRfrac14 0024) or

FIG 3 Diversity in both body size and lifespan within rorqual baleen whales (Balaenopteridae) and dolphins (Delphinidae) Maximal pairings usingphylogenetic targeting (Arnold and Nunn 2010) of genome assembly-enabled cetaceans resulted in the most extreme divergence in both body sizeand lifespan between humpback whale (Megaptera novaeangliae) and common minke whale (Balaenoptera acutorostrata) within theBalaenopteridae facing right and orca (Orcinus orca) and bottlenose dolphin (Tursiops truncatus) within the Delphinidae facing left Traitdata were collected from the panTHERIA (Jones et al 2009) and AnAge (Tacutu et al 2012) databases

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1753

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Table 5 CGC Genes with dNdS gt 1 as Revealed by Pairwise Comparisons of Cetacean Genomes

Comparison Gene Symbol Gene Name Role in Cancer Function

Minkehumpback CD274 CD274 molecule TSG Plays a critical role in induction and maintenance ofimmune tolerance to selfa

ETNK1 Ethanolamine kinase 1 TSG Suppresses escaping of programmed cell deathb

IL21R Interleukin 21 receptor Fusion The ligand binding of this receptor leads to the acti-vation of multiple downstream signaling mole-cules including JAK1 JAK3 STAT1 and STAT32

MYOD1 Myogenic differentiation 1 Fusion Regulates muscle cell differentiation by inducing cellcycle arrest a prerequisite for myogenic initiationa

PHF6 PHD finger protein 6 TSG Encodes a protein with two PHD-type zinc fingerdomains indicating a potential role in transcrip-tional regulation that localizes to the nucleolusa

Orcadolphin BTG1 B-cell translocation gene 1antiproliferative

TSG fusion Member of an antiproliferative gene family that reg-ulates cell growth and differentiationa

CD274 CD274 molecule TSG fusion Plays a critical role in induction and maintenance ofimmune tolerance to selfa

FANCD2 Fanconi anemia comple-mentation group D2

TSG Suppresses genome instability and mutations pro-motes escaping programmed cell death suppressesproliferative signaling suppresses invasion andmetastasisb

FAS Fas cell surface deathreceptor

TSG Promotes cell replicative immortality promotesproliferative signaling promotes invasion and me-tastasis suppresses escaping programmed celldeathb

FGFR4 Fibroblast growth factor re-ceptor 4

Oncogene Promotes proliferative signaling promotes invasionand metastasisb

GPC3 Glypican 3 Oncogene TSG Promotes invasion and metastasis promotes sup-pression of growthb

HOXD11 Homeobox D11 Oncogene fusion The homeobox genes encode a highly conservedfamily of transcription factors that play an impor-tant role in morphogenesis in all multicellularorganismsa

HOXD13 Homeobox D13 Oncogene fusion

LASP1 LIM and SH3 protein 1 Fusion The encoded protein has been linked to metastaticbreast cancer hematopoetic tumors such as B-celllymphomas and colorectal cancera

MLF1 Myeloid leukemia factor 1 TSG fusion This gene encodes an oncoprotein which is thought toplay a role in the phenotypic determination ofhemopoetic cells Translocations between this geneand nucleophosmin have been associated withmyelodysplastic syndrome and acute myeloidleukemiaa

MYB v-myb myeloblastosis viraloncogene homolog

Oncogene fusion This gene may be aberrantly expressed or rearrangedor undergo translocation in leukemias and lym-phomas and is considered to be an oncogenea

MYD88 Myeloid differentiation pri-mary response gene (88)

Oncogene Promotes escaping programmed cell death promotesproliferative signaling promotes invasion and me-tastasis promotes tumor promotinginflammationb

NR4A3 Nuclear receptor subfamily4 group A member 3(NOR1)

Oncogene fusion Encodes a member of the steroidndashthyroid hormonendashretinoid receptor superfamily that may act as atranscriptional activatora

PALB2 Partner and localizer ofBRCA2

TSG This protein binds to and colocalizes with the breastcancer 2 early onset protein (BRCA2) in nuclear fociand likely permits the stable intranuclear localiza-tion and accumulation of BRCA2a

PML Promyelocytic leukemia TSG fusion Expression is cell-cycle related and it regulates the p53response to oncogenic signalsa

RAD21 RAD21 homolog(Schizosaccharomycespombe)

Oncogene TSG Promotes invasion and metastasis suppresses ge-nome instability and mutations suppresses escap-ing programmed cell deathb

STIL SCLTAL1 interruptinglocus

Oncogene fusion Encodes a cytoplasmic protein implicated in regula-tion of the mitotic spindle checkpoint a regulatorypathway that monitors chromosome segregationduring cell division to ensure the proper distribu-tion of chromosomes to daughter cellsa

(continued)

Tollis et al doi101093molbevmsz099 MBE

1754

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

femur (MP0000559 six genes FDRfrac14 0019) These pheno-types are reminiscent of the developmental basis of hind limbloss in cetaceans embryonic studies show that hind limbbuds are initially formed but disappear by the fifth gestationalweek (Thewissen et al 2006) Enriched mouse phenotypes arealso related to the unique cetacean axial skeleton(MP0002114 25 genes FDRfrac14 0016) most notably in theskull including craniofacial bones (MP0002116 17 genesFDRfrac14 0018) teeth (MP0002100 nine genes FDRfrac14 0004)and the presphenoid (MP0030383 three genesFDRfrac14 0003) Past analyses of the cetacean basicranial ele-ments revealed that the presphenoid was extensively modi-fied along the cetacean lineage (Ichishima 2016)

Positively selected genes unique to the humpback whalewere significantly enriched for a single biological process reg-ulation of cell cycle checkpoint (GO1901976 1857-fold en-richment Pfrac14 002 after Bonferonni correction for multipletesting) suggesting positive selection in pathways that con-trol responses to endogenous or exogenous sources of DNAdamage and limit cancer progression (Kastan and Bartek2004) We detected a significant number of proteinndashproteininteractions among humpback whale-specific positively se-lected genes (number of nodesfrac14 204 number of edgesfrac14 71expected number of edgesfrac14 51 Pfrac14 0004 supplementaryfig 10 Supplementary Material online) including genes thatare often coexpressed and involved in DNA repair DNA rep-lication and cell differentiation For instance we identifiedsignificant interactions between DNA2 which encodes a heli-case involved in the maintenance of DNA stability andWDHD1 which acts as a replication initiation factorAnother robust protein interaction network was detected

between a number of genes involved in the genesis and main-tenance of primary cilia The highest scoring functional an-notation clusters resulted in key words such as ciliopathy(seven genes) and cell projection (16 genes) and GO termssuch as cilium morphogenesis cilium assembly ciliary basalbody and centriole The primary cilia of multicellular eukar-yotes control cell proliferation by mediating cell-extrinsic sig-nals and regulating cell cycle entry and defects in ciliaryregulation are common in many cancers (Michaud andYoder 2006)

Our branch-site test results indicated that the evolution ofcetacean gigantism was accompanied by strong selection onmany pathways that are directly linked to cancer (fig 4C) Weidentified 33 genes that are mutated in human cancers(according to the COSMIC database) that were inferred assubjected to positive selection in the humpback whale line-age including the known tumor suppressor genes ATR whichis a protein kinase that senses DNA damage upon genotixicstress and activates cell cycle arrest and RECK which sup-presses metastasis (Forbes et al 2015) Multiple members ofthe PR domain-containing gene family (PRDM) evolved un-der positive selection across cetaceans including the tumorsuppressor genes PRDM1 whose truncation leads to B-cellmalignancies and PRDM2 which regulates the expressionand degradation of TP53 (Shadat et al 2010) and whoseforced expression causes apoptosis and cell cycle arrest incancer cell lines (Fog et al 2012) In baleen whales ERCC5which is a DNA repair protein that partners with BRCA1 andBRCA2 to maintain genomic stability (Trego et al 2016) andsuppresses UV-induced apoptosis (Clement et al 2006)appeared to have been subjected to positive selection as

Table 5 Continued

Comparison Gene Symbol Gene Name Role in Cancer Function

TAL1 T-cell acute lymphocyticleukemia 1 (SCL)

Oncogene fusion Implicated in the genesis of hemopoietic malignan-cies and may play an important role in hemopoieticdifferentiationa

TNFRSF14 Tumor necrosis factor re-ceptor superfamilymember 14 (herpesvirusentry mediator)

TSG The encoded protein functions in signal transductionpathways that activate inflammatory and inhibi-tory T-cell immune responsea

TNFRSF17 Tumor necrosis factor re-ceptor superfamilymember 17

Oncogene fusion This receptor also binds to various TRAF familymembers and thus may transduce signals for cellsurvival and proliferationa

NOTEmdashTSG tumor suppressor geneaSource RefSeqbSource Cancer Hallmark from CGC

Table 6 GO Terms for Biological Processes Overrepresented by Genes with Pairwise dNdS gt 1 in the Orca Bottlenose Dolphin Comparison

GO Term Description Number of Genes Fold Enrichment P-Valuea

GO0031347 Regulation of defense response 36 270 291E-2GO0050776 Regulation of immune response 50 218 195E-3GO0002694 Regulation of leukocyte activation 31 249 189E-2GO0002275 Myeloid cell activation involved in immune response 29 253 237E-2GO0002699 Positive regulation of immune effector process 19 446 101-E3GO0042108 Positive regulation of cytokine biosynthetic process 9 687 223E-2

aAfter Bonferonni correction for multiple testing

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1755

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

well Across all the branch-site models positively selectedgenes represented multiple functional categories relevant tocancer and Petorsquos Paradox

Among the cancer-related genes subjected to positiveselection in cetaceans we identified two with identicalamino acid changes among disparate taxa united by thetraits of large body size andor extreme longevitySpecifically PRDM13 is a tumor suppressor gene that actsas a transcriptional repressor and we found identical DEamino acid substitutions in this gene in sperm whale dol-phin orca and humpback but also manatee (Trichechusmanatus) and African elephant (Loxodonta africana) whichare large-bodied afrotherian mammals that have been thefocus of cancer suppression research (Abegglen et al 2015Sulak et al 2016) Secondly POLE is a cancer-related genethat participates in DNA repair and replication and weobserved one IV substitution shared among orca dol-phin bowhead humpback and common minke whalebut also elephant as well as a second IV substitutionshared with these cetaceans and the little brown bat(Myotis lucifugus) Vesper bats such as M lucifugus areknown for their exceptional longevity relative to theirbody size and have been proposed as model organisms insenescence and cancer research (Foley et al 2018) Parallelchanges in cancer-related genes across these

phylogenetically distinct mammals suggest natural selectionhas acted on similar pathways that limit neoplastic progres-sion in large and long-lived species (Tollis et al 2017)

Petorsquos Paradox and Cancer in Whales and Other LargeMammalsLarge body size has evolved numerous times in mammalsand although it is exemplified in some extant cetaceans gi-gantism is also found in afrotherians perissodactyls and car-nivores (Baker et al 2015) Our results suggest that cancersuppression in large and long-lived mammals has also evolvednumerous times However none of these species iscompletely immune to cancer Elephants have at least a 5lifetime risk of cancer mortality (Abegglen et al 2015) whichis far less than humans but detecting cancer and estimatingcancer incidence and mortality rates in wild cetaceans is morechallenging Mathematical modeling predicting the lifetimerisk of colorectal cancer in mice and humans yielded a rate ofcolorectal cancer at 50 in blue whales by age 50 and 100by age 90 (Caulin et al 2015) This high rate of cancer mor-tality is an unlikely scenario and taken with our genomicresults presented here it suggests that cetaceans have evolvedmechanisms to limit their overall risk of cancer Among ba-leen whales benign neoplasms of the skin tongue and centralnervous system have been reported in humpback whales and

Orca

Minke whale

Cow

Bowhead whale

Bottlenose dolphin

Humpback whale

Sperm whale54

12

84

74

212A

ATRT Suppression of growth escaping programmedcell death change of cellular energetics

Humpback whale

BCORL1TF Invasion and metastasis

PICALMT Change of cellular energetics

PRDM2T Regulation of TP53 expression and degredation

TPRF Involved in the activation of oncogenic kinases

AMER1T Promotes suppression of growth

CCNB1IP1T Suppresses invasion and metastasis

POLET Participates in DNA repair and DNA replication

Cetacea ancestor

KMT2CT Methylates rsquoLys-4rsquo of histone H3

POU2AF1 Promotes proliferative signalingO

PRDM1T Regulation of TP53 expression anddegradation

ERCC5T Excision repair DNA double-strand breakrepair following UV damage

Mysticeti ancestor

PMLT Positively regulates TP53 by inhibiting MDM2-dependent degredation

Odontoceti ancestor

MUC1F Overexpressed in epithelial tumors

Minke and humpback whale ancestor

RETO Plays a role in cell differentiation and growth

C COSMIC GENES

anatomicalstructure

formation involvedin morphogenesis

autophagy

cellproliferation

cellminuscellsignaling

circulatorysystemprocess

growth

pigmentation

proteinmaturation

ribonucleoproteincomplexassembly

symbiosisencompassing

mutualismthrough

parasitismvacuolartransport

B

20

FIG 4 Positively selected genes during cetacean evolution (A) Species tree relationships of six modern cetaceans with complete genomeassemblies estimated from 152 single-copy orthologs Branch lengths are given in coalescent units Outgroup taxa are not shown The completespecies tree of 28 mammals is shown in supplementary figure 4 Supplementary Material online Boxes with numbers indicate the number ofpositively selected genes passing filters and a Bonferroni correction detected on each branch (B) TreeMap from REVIGO for GO biologicalprocesses terms represented by genes evolving under positive selection across all cetaceans Rectangle size reflects semantic uniqueness of GOterm which measures the degree to which the term is an outlier when compared semantically to the whole list of GO terms (C) Cancer gene namesand functions from COSMIC found to be evolving under positive selection in the cetacean branch-site models Superscripts for gene namesindicate as follows T tumor suppressor gene O oncogene F fusion gene Asterisks indicate P-value following FDR correction for multiple testingPlt 001 Plt 0001 Plt 00001

Tollis et al doi101093molbevmsz099 MBE

1756

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

ovarian carcinomas and lymphomas have been detected infin whales (Newman and Smith 2006) Among smaller ceta-ceans one unusually well-documented case study concludedthat 27 of beluga whales (Delphinapterus leucas) found deadin the St Lawrence estuary had cancer which may have con-tributed to 18 of the total mortality in that population(Martineau et al 2002) The authors suggested that thehigh degree of polycyclic aromatic hydrocarbons releasedinto the estuary by nearby industry may have contributedto this elevated cancer risk (Martineau et al 2002) Bycontrast the larger baleen whale species in the Gulf ofSt Lawrence appear to have lower contaminant burdenslikely due to ecological differences (Gauthier et al 1997)Interestingly unlike in human cells homologous recom-bination is uninhibited in North Atlantic right whale lungcells following prolonged exposure to the human lungcarcinogen particulate hexavalent chromium (Browninget al 2017) suggesting adaptations for high-fidelity DNArepair in whales

In this study we provide a de novo reference assemblyfor the humpback whalemdashone of the more well-studiedgiants living on Earth today The humpback whale genomeassembly is highly contiguous and contains a comparablenumber of orthologous genes to other mammalian ge-nome projects Our comparisons with other complete ce-tacean genomes confirm the results of other studies whichconcluded that rorqual whales likely began diversifyingduring the Miocene (Slater et al 2017 Arnason et al2018) We found indications of positive selection onmany protein-coding genes suggestive of adaptive changein pathways controlling the mammalian appendicular andcranial skeletal elements which are relevant to highly spe-cialized cetacean phenotypes as well as in many immunitygenes and pathways that are known to place checks onneoplastic progression LSDs in cetacean genomes containmany genes involved in the control of apoptosis includingknown tumor suppressor genes and skin transcriptomeresults from humpback whale suggest many gene duplica-tions whether through segmental duplication or retro-transposition are transcribed and hence likely functionalWe also use genome-wide evidence to show that germlinemutation rates may be slower in cetaceans than in othermammals which has been suggested in previous studies(Jackson et al 2009) and we suggest as a corollary thatcetacean somatic mutations rates may be lower as wellThese results are consistent with predictions stemmingfrom Petorsquos Paradox (Peto et al 1975 Caulin and Maley2011) which posited that gigantic animals have evolvedcompensatory adaptations to cope with the negativeeffects of orders of magnitude more cells and long lifespansthat increase the number of cell divisions and cancer riskover time Altogether the humpback whale genome as-sembly will aid comparative oncology research that seeksto improve therapeutic targets for human cancers as wellas provide a resource for developing useful genomicmarkers that will aid in the population management andconservation of whales

Materials and Methods

Tissue Collection and DNA ExtractionBiopsy tissue was collected from an adult female humpbackwhale (ldquoSaltrdquo NCBI BioSample SAMN1058501) in the Gulf ofMaine western North Atlantic Ocean using previously de-scribed techniques (Lambertsen 1987 Palsboslashll et al 1991) andflash frozen in liquid nitrogen We extracted DNA from skinusing the protocol for high-molecular-weight genomic DNAisolation with the DNeasy Blood and Tissue purification kit(Qiagen) Humpback whales can be individually identifiedand studied over time based on their unique ventral flukepigmentation (Katona and Whitehead 1981) Salt was specif-ically selected for this study because of her 35-year priorsighting history which is among the lengthiest and detailedfor an individual humpback whale (Center for Coastal Studiesunpublished data)

De Novo Assembly of the Humpback Whale GenomeUsing a combination of paired-end and mate-pair libraries denovo assembly was performed using Meraculous 204(Chapman et al 2011) with a kmer size of 47 Reads weretrimmed for quality sequencing adapters and mate-pairadapters using Trimmomatic (Bolger et al 2014) The genomesize of the humpback whale was estimated using the shortreads by counting the frequency of kmers of length 27 oc-curring in the 180-bp data set estimating the kmer coverageand using the following formula genome size frac14 total kmers kmer coverage

Chicago Library Preparation and SequencingFour Chicago libraries were prepared as described previously(Putnam et al 2016) Briefly for each library500 ng of high-molecular-weight genomic DNA (mean fragment lengthgt50kb) was reconstituted into chromatin in vitro and fixed withformaldehyde Fixed chromatin was digested with MboI orDpnII the 50-overhangs were repaired with biotinylatednucleotides and blunt ends were ligated After ligation cross-links were reversed and the DNA purified from protein Biotinthat was not internal to ligated fragments was removed fromthe purified DNA The DNA was then sheared to 350 bpmean fragment size and sequencing libraries were generatedusing NEBNext Ultra (New England BioLabs) enzymes andIllumina-compatible adapters Biotin-containing fragmentswere isolated using streptavidin beads before PCR enrichmentof each library The libraries were sequenced on an IlluminaHiSeq 2500 platform

Scaffolding the De Novo Assembly with HiRiseThe input de novo assembly shotgun reads and Chicagolibrary reads were used as input data for HiRise a softwarepipeline designed specifically for using Chicago data to scaf-fold genome assemblies (Putnam et al 2016) Shotgun andChicago library sequences were aligned to the draft inputassembly using a modified SNAP read mapper (httpsnapcsberkeleyedu) The separations of Chicago read pairsmapped within draft scaffolds were analyzed by HiRise toproduce a likelihood model for genomic distance between

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1757

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

read pairs and the model was used to identify putative mis-joins and to score prospective joins After scaffolding shotgunsequences were used to close gaps between contigs

Assessing the Gene Content of the Humpback WhaleAssemblyThe expected gene content of the assembly was evaluatedusing the Core Eukaryotic Genes Mapping Approach (Parraet al 2009) which searches the assembly for 458 highly con-served proteins and reports the proportion of 248 of the mosthighly conserved orthologs that are present in the assemblyWe also used the Benchmarking Universal Single CopyOrthologs (BUSCO v201 Sim~ao et al 2015) which analyzesgenome assemblies for the presence of 3023 genes conservedacross vertebrates as well as a set of 6253 genes conservedacross laurasiatherian mammals

Transcriptome Sequencing and AssemblyIn order to aid in our gene-finding efforts for the humpbackwhale genome assembly and to measure gene expression wegenerated transcripts from skin tissue by extracting total RNAusing the QIAzol Lysis Reagent (Qiagen) followed by purifi-cation on RNeasy spin columns (Qiagen) RNA integrity andquantity were determined on the Agilent 2100 Bioanalyzer(Agilent) using the manufacturerrsquos protocol The total RNAwas treated with DNase using DNase mix from the RecoverAllTotal Nucleic Acid Isolation kit (Applied BiosystemsAmbion) The RNA library was prepared and sequenced bythe Genome Technology Center at the University ofCalifornia Santa Cruz including cDNA synthesis with theOvation RNA-Seq system V2 (Nugen) and RNA amplificationas described previously (Tariq et al 2011) We used 05ndash1mgof double-stranded cDNA for library preparation shearedusing the Covaris S2 size-selected for 350ndash450 bp using au-tomated electrophoretic DNA fractionation system(LabChipXT Caliper Life Sciences) Paired-end sequencing li-braries were constructed using Illumina TruSeq DNA SamplePreparation Kit Following library construction samples werequantified using the Bioanalyzer and sequenced on theIllumina HiSeq 2000 platform to produce 2 100 bp se-quencing reads We then used Trinity (Grabherr et al 2011)to assemble the adapter-trimmed RNA-Seq reads intotranscripts

Genome AnnotationWe generated gene models for the humpback whale usingmultiple iterations of MAKER2 (Holt and Yandell 2011) whichincorporated 1) direct evidence from the Trinity-assembledtranscripts 2) homology to NCBI proteins from ten mammals(human mouse dog cow sperm whale bottlenose dolphinorca bowhead whale common minke whale and baiji) andUniProtKBSwiss-Prot (UniProt Consortium 2015) and 3) abinitio gene predictions using SNAP (11292013 release Korf2004) and Augustus v302 (Stanke et al 2008) A detaileddescription of the annotation pipeline is provided in the sup-plementary Methods Supplementary Material online Finalgene calls were annotated functionally by BlastP similarity to

UniProt proteins (UniProt Consortium 2015) with an e-valuecutoff of 1e-6

Repeat Annotation and Evolutionary AnalysisTo analyze the repetitive landscape of the humpback whalegenome we used both database and de novo modelingmethods For the database method we ran RepeatMaskerv405 (httpwwwrepeatmaskerorg accessed August 212017) (Smit et al 2015a) on the final assembly indicatingthe ldquomammaliardquo repeat library from RepBase (Jurka et al2005) For the de novo method we scanned the assemblyfor repeats using RepeatModeler v108 (httpwwwrepeat-maskerorg) (Smit et al 2015b) the results of which were thenclassified using RepeatMasker To estimate evolutionary di-vergence within repeat subfamilies in the humpback whalegenome we generated repeat-family-specific alignments andcalculated the average Kimura-2-parameter divergence fromconsensus within each family correcting for high mutationrates at CpG sites with the calcDivergenceFromAlignplRepeatMasker tool We compared the divergence profile ofhumpback whale and bowhead whale by completing parallelanalyses and the repetitive landscapes of orca and bottlenosedolphin are available from the RepeatMasker server (httpwwwrepeatmaskerorgspecies accessed August 21 2017)

Analysis of Gene Expression Using RNA-SeqSplice-wise mapping of RNA-Seq reads against the humpbackwhale genome assembly and annotation was carried out us-ing STAR v24 (Dobin et al 2013) and we counted the num-ber of reads mapping to gene annotations We also mappedthe skin RNA-Seq data to the database of annotated hump-back whale transcripts using local alignments with bowtiev225 (Langmead and Salzberg 2012) and used stringtiev134 (Pertea et al 2015) to calculate gene abundances bytranscripts per million

Analysis of Segmental Duplications in CetaceanGenomesIn order to detect LSDs in several cetacean genomes we ap-plied an approach based on depth of coverage (Alkan et al2009) To this end we used whole-genome shotgun sequencedata from the current study as well as from other cetaceangenomics projects All data were mapped against the hump-back whale reference assembly A detailed description of thesegmental duplication analysis is provided in the supplemen-tary Methods Supplementary Material online

Whole-Genome AlignmentsWe generated WGAs of 12 mammals (supplementary table 2Supplementary Material online) First we generated pairwisesyntenic alignments of each species as a query to the humangenome (hg19) as a target using LASTZ v102 (Harris 2007)followed by chaining to form gapless blocks and netting torank the highest scoring chains (Kent et al 2003) The pairwisealignments were used to construct a multiple sequence align-ment with MULTIZ v112 (Blanchette et al 2004) with humanas the reference species We filtered the MULTIZ alignment to

Tollis et al doi101093molbevmsz099 MBE

1758

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

only contain aligned blocks from at least 10 out of the 12species (81 complete)

Phylogenetic Reconstruction Using Single-CopyOrthologsWe downloaded the coding DNA sequences from 28 publiclyavailable mammalian genome assemblies (supplementary ta-ble 9 Supplementary Material online) and used VESPA(Webb et al 2017) to obtain high-confidence SGOs (supple-mentary Methods Supplementary Material online) For phy-logenetic analysis we filtered the SGO data set to include onlyloci that were represented by at least 24 out of the 28 mam-malian species (86 complete) and reconstructed each genetree using maximum likelihood in PhyML v30 (Guindon et al2010) with an HKY85 substitution model and 100 bootstrapreplicates to assess branch support The gene trees were thenbinned and used to reconstruct a species tree using the ac-curate species tree algorithm (ASTRAL-III v56 Zhang et al2018) ASTRAL utilizes the multispecies coalescent modelthat incorporates incomplete lineage sorting and finds thespecies tree stemming from bipartitions predefined by thegene trees Branch support for the species tree was assessedwith local posterior probabilities and branch lengths werepresented in coalescent units where shorter branch lengthsindicate greater gene tree discordance (Sayyari and Mirarab2016)

Rates of Molecular Evolution and Divergence TimeEstimationWe used multiple approaches on independent data sets toestimate rates of molecular evolution and the divergencetimes of the major mammalian lineages including six modernwhales with complete genome assemblies We first focusedon 4-fold degenerate (4D) sites which are positions withincodon alignments where substitutions result in no aminoacid change and can be used to approximate the neutralrate of evolution (Kumar and Subramanian 2002) We usedthe Ensembl human gene annotation to extract codingregions from the 12-mammal WGA using msa_view inPHAST v14 (Hubisz et al 2011) We reconstructed the phy-logeny with the 4D data as a single partition in RAxML v83(Stamatakis 2014) under the GTRGAMMA substitutionmodel and assessed branch support with 10000 bootstrapsRates of molecular evolution were estimated on the 4D dataset with the semiparametric PL method implemented in r8sv18 (Sanderson 2002 2003) A detailed description of the PLmethod is given in the supplementary MethodsSupplementary Material online

We also used the approximate likelihood calculation inMCMCtree (Yang and Rannala 2006) to estimate divergencetimes using independent data sets 1) the above-mentioned4D data set derived from the WGA as well as 2) a set of theSGOs that included 24 out of 28 sampled taxa (86) and waspartitioned into three codon positions We implemented theHKY85 substitution model multiple fossil-based priors (sup-plementary table 10 Supplementary Material online Mitchell1989 Benton et al 2015 Hedges et al 2015) and independentrates (ldquoclockfrac14 3rdquo) along branches All other parameters were

set as defaults For each MCMCtree analysis we ran the anal-ysis three times with different starting seeds and modified theMarkov chain Monte Carlo (MCMC) length and samplingfrequency in order to achieve proper chain convergencemonitored with Tracer v17 We achieved proper MCMCconvergence on the 4D data set after discarding the first500000 steps as burn-in and sampling every 2000 steps untilwe collected 20000 samples We achieved proper MCMCconvergence on the SGO data set after discarding the first500000 steps as burn-in and sampling every 10000 steps untilwe collected 10000 samples

Demographic AnalysisWe used the PSMC (Li and Durbin 2011) to reconstruct thepopulation history of North Atlantic humpback whales in-cluding the individual sequenced in the current study (down-sampled to 20 coverage) and a second individualsequenced at 17 coverage in Arnason et al (2018) Adetailed description of the PSMC analysis is provided in thesupplementary Methods Supplementary Material online

Nonneutral Substitution Rates in Cetacean GenomesIn order to identify genomic regions controlling cetacean-specific adaptations we used phyloP (Pollard et al 2010) todetect loci in the 12-mammal WGA that depart from neutralexpectations (see supplementary Methods SupplementaryMaterial online) We then collected accelerated regions thatoverlapped human whole gene annotations (hg19) using bed-tools intersect (Quinlan and Hall 2010) and tested for theenrichment of GO terms using the PANTHER analysis toolavailable at the Gene Ontology Consortium website (GOOntology database last accessed June 2017) (GeneOntology Consortium 2015)

Detection of Protein-Coding Genes Subjected toPositive SelectionIn order to measure selective pressures acting on protein-coding genes during cetacean evolution with an emphasison the evolution of cancer suppression we estimated theratio of nonsynonymous to synonymous substitutions (dNdS) To maximize statistical power in pairwise comparisonsgiven the number of available cetacean genomes (six lastaccessed September 2017) we implemented phylogenetictargeting (Arnold and Nunn 2010) assuming a phylogenyfrom a mammalian supertree (Fritz et al 2009) To selectgenome assemblies most suitable for assessing PetorsquosParadox we weighted scores for contrasts with a lot of changein the same direction for both body mass and maximumlongevity Trait values were taken from panTHERIA (Joneset al 2009) and we selected maximal pairings based on thestandardized summed scores We then generated pairwisegenome alignments as described above based on the phylo-genetic targeting results For each pairwise genome align-ment we stitched gene blocks in Galaxy (Blankenberg et al2011) according to the target genome annotations produc-ing alignments of one-to-one orthologs which were filteredto delete frameshift mutations and replace internal stopcodons with gaps We then estimated pairwise dNdS for every

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1759

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

orthologous gene pair with KaKs_Calculator v20 (Wang et al2010) To link genes with dNdS gt1 to potential phenotypeswe used orthologous human Ensembl gene IDs to collect GOterms in BioMart (Kinsella et al 2011) and tested for enrich-ment of overrepresented GO terms

We also used codon-based models to test for selectivepressure variation along branches of the cetacean phylog-eny in comparison to other mammal lineages also knownas the branch-site test (Yang 2007) First the known spe-cies phylogeny (Morgan et al 2013 Tarver et al 2016) waspruned to correspond to the species present in each SGOfamily SGO nucleotide alignments that contained morethan seven species were analyzed for selective pressurevariation This is to reduce the risk of detecting falsepositives (Anisimova et al 2001 2002) In general thebranch-site test is a powerful yet conservative approach(Gharib and Robinson-Rechavi 2013) although modelmisspecification and alignment errors can greatly increasethe number of false positives (Anisimova et al 20012002) Recent studies have concluded that many pub-lished inferences of adaptive evolution using the branchsite test may be affected by artifacts (Venkat et al 2018)Therefore extensive filtering is necessary in order to makereasonably sound conclusions from results of the branchsite test A detailed description of all tested models andthe filtering process are given in the supplementaryMethods Supplementary Material online In total 1152gene families were analyzed We carried out the branch-site test using PAML v44e (Yang 2007) The following fivebranches were assessed as foreground humpback whalethe most recent common ancestor (MRCA) of the com-mon minke and humpback whales MRCA of baleenwhales MRCA of toothed whales and the MRCA of allwhales (cetacean stem lineage) For each model we keptall genes that met a significance threshold of Plt 005 aftera Bonferonni correction for multiple hypothesis testingusing the total number of branch genes (five foregroundbranches1152 genes) We also corrected the raw P-val-ues from the likelihood ratio tests of every gene by theFDR correction where qfrac14 005 (Benjamini and Hochberg1995) The Bonferroni correction is more conservativethan the FDR but sufficient for multiple hypothesis test-ing of lineage-specific positive selection in these casesthe FDR results in higher probabilities of rejecting truenull hypotheses (Anisimova and Yang 2007) Thereforewe used Bonferroni-corrected results in downstreamanalyses but also report FDR-corrected P-values infigure 4C Genes were identified based on the humanortholog (Ensembl gene ID) and we performed gene an-notation enrichment analysis and functional annotationclustering with DAVID v68 (Huang et al 2009a 2009b) aswell as semantic clustering of GO terms using REVIGO(Supek et al 2011) We also searched for interactions ofpositively selected proteins using STRING v105(Szklarczyk et al 2017) with default parameters andtested for the enrichment of overrepresented GO termsas above and for associated mouse phenotypes usingmodPhEA (Weng and Liao 2017)

Data AvailabilityAll data that contributed to the results of the study are madepublicly available The genomic sequencing RNA sequencingas well as the genome assembly for humpback whale(GCA_0043293851) are available under NCBI BioProjectPRJNA509641 The gene annotation orthologous gene setspositive selection results segmental duplication annotationsand whole-genome alignments used in this study are availableat the Harvard Dataverse (httpsdoiorg107910DVNADHX1O)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsField data collection was supported by the Center for CoastalStudies and conducted under NOAA research permit 16325This work was initiated with funding from Dr Jeffrey Pearl atUC San Fransisco This work was supported in part by NIH(Grants U54 CA217376 U2C CA233254 P01 CA91955 R01CA149566 R01 CA170595 R01 CA185138 and R01CA140657) as well as CDMRP Breast Cancer ResearchProgram (Award BC132057) and the Arizona BiomedicalResearch Commission (Grant ADHS18-198847) to CCMThe findings opinions and recommendations expressedhere are those of the authors and not necessarily those ofthe universities where the research was performed or theNational Institutes of Health PJP would like to acknowledgefunding from Stockholm University and the University ofGroningen Startup funds were provided by the School ofInformatics Computing and Cyber Systems at NorthernArizona University (MT) MJOrsquoC would like to thank theUniversity of Leeds funding her 250 Great Minds UniversityAcademic Fellowship The authors thank Wensi Hao(University of Groningen) for technical support AndreaCabrera (University of Groningen) for helpful conversationsabout mutation rates and demography and Richard E Green(UC Santa Cruz) for consultations regarding genome se-quencing assembly and annotation The authors acknowl-edge Research Computing at Arizona State University (httpwwwresearchcomputingasuedu) the Advanced ResearchComputing resources at the University of Leeds and theMonsoon computing cluster at Northern ArizonaUniversity (httpsnaueduhigh-performance-computing)for providing high-performance computing and storageresources that have contributed to this study

ReferencesAbegglen LM Caulin AF Chan A Lee K Robinson R Campbell MS Kiso

WK Schmitt DL Waddell PJ Bhaskara S 2015 Potential mechanismsfor cancer resistance in elephants and comparative cellular responseto DNA damage in humans JAMA 314(17)1850ndash1860

Alkan C Kidd JM Marques-Bonet T Aksay G Antonacci F HormozdiariF Kitzman JO Baker C Malig M Mutlu O 2009 Personalized copynumber and segmental duplication maps using next-generation se-quencing Nat Genet 41(10)1061ndash1067

Tollis et al doi101093molbevmsz099 MBE

1760

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Anisimova M Bielawski JP Yang Z 2001 Accuracy and power of thelikelihood ratio test in detecting adaptive molecular evolution MolBiol Evol 18(8)1585ndash1592

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19(6)950ndash958

Anisimova M Yang Z 2007 Multiple hypothesis testing to detect line-ages under positive selection that affects only a few sites Mol BiolEvol 24(5)1219ndash1228

Arnason U Lammers F Kumar V Nilsson MA Janke A 2018 Whole-genome sequencing of the blue whale and other rorquals findssignatures for introgressive gene flow Sci Adv 4(4)eaap9873

Arnason U Widegren B 1989 Composition and chromosomal localiza-tion of cetacean highly repetitive DNA with special reference to theblue whale Balaenoptera musculus Chromosoma 98(5)323ndash329

Arnold C Nunn CL 2010 Phylogenetic targeting of research effort inevolutionary biology Am Nat 176(5)601ndash612

Baker CS Perry A Bannister JL Weinrich MT Abernethy RBCalambokidis J Lien J Lambertsen RH Ramirez JU Vasquez O1993 Abundant mitochondrial DNA variation and world-wide pop-ulation structure in humpback whales Proc Natl Acad Sci U S A90(17)8239ndash8243

Baker J Meade A Pagel M Venditti C 2015 Adaptive evolution towardlarger size in mammals Proc Natl Acad Sci U S A 112(16)5093ndash5098

Benjamini Y Hochberg Y 1995 Controlling the false discovery rate apractical and powerful approach to multiple testing J R Stat Soc57289ndash300

Benson RBJ Campione NE Carrano MT Mannion PD Sullivan CUpchurch P Evans DC 2014 Rates of dinosaur body mass evolutionindicate 170 million years of sustained ecological innovation on theavian stem lineage PLoS Biol 12(5)e1001853

Benton MJ Donoghue PCJ Asher RJ Friedman M Near TJ Vinther J2015 Constraints on the timescale of animal evolutionary historyPalaeontol Electron 181ndash106

Berta A Sumich JL Kovacs KM 2015 Marine mammals evolutionarybiology 3rd ed Cambridge (MA) Academic Press

Blanchette M Kent WJ Riemer C Elnitski L Smit AFA Roskin KMBaertsch R Rosenbloom K Clawson H Green ED 2004 Aligningmultiple genomic sequences with the threaded blockset alignerGenome Res 14(4)708ndash715

Blankenberg D Taylor J Nekrutenko A Galaxy T 2011 Making wholegenome multiple alignments usable for biologists Bioinformatics27(17)2426ndash2428

Boissinot S Sookdeo A 2016 The evolution of LINE-1 in vertebratesGenome Biol Evol 8(12)3485ndash3507

Bolger AM Lohse M Usadel B 2014 Trimmomatic a flexible trimmer forIllumina sequence data Bioinformatics 30(15)2114ndash2120

Browning CL Wise CF Wise JP 2017 Prolonged particulate chromateexposure does not inhibit homologous recombination repair inNorth Atlantic right whale (Eubalaena glacialis) lung cells ToxicolAppl Pharmacol 33118ndash23

Carbone L Harris RA Gnerre S Veeramah KR Lorente-Galdos BHuddleston J Meyer TJ Herrero J Roos C Aken B et al 2014Gibbon genome and the fast karyotype evolution of small apesNature 513(7517)195ndash201

Caulin AF Graham TA Wang L-S Maley CC 2015 Solutions to Petorsquosparadox revealed by mathematical modelling and cross-species can-cer gene analysis Philos Trans R Soc B 370(1673)20140222

Caulin AF Maley CC 2011 Petorsquos Paradox evolutionrsquos prescription forcancer prevention Trends Ecol Evol (Amst) 26(4)175ndash182

Chapman JA Ho I Sunkara S Luo S Schroth GP Rokhsar DS 2011Meraculous de novo genome assembly with short paired-end readsPLoS One 6(8)e23501

Chittleborough RG 1959 Determination of age in the humpback whaleMegaptera nodosa (Bonnaterre) Aust J Mar Freshw Res10(2)125ndash143

Clapham PJ Mead JG 1999 Megaptera novaeangliae Mamm Species6041ndash9

Clement V Dunand-Sauthier I Clarkson SG 2006 Suppression of UV-induced apoptosis by the human DNA repair protein XPG CellDeath Differ 13(3)478ndash488

Dobin A Davis CA Schlesinger F Drenkow J Zaleski C Jha S Batut PChaisson M Gingeras TR 2013 STAR ultrafast universal RNA-Seqaligner Bioinformatics 29(1)15ndash21

Fay JC Wu C-I 2003 Sequence divergence functional constraint andselection in protein evolution Annu Rev Genomics Hum Genet4213ndash235

Ferris E Abegglen LM Schiffman JD Gregg C 2018 Accelerated evolu-tion in distinctive species reveals candidate elements for clinicallyrelevant traits including mutation and cancer resistance Cell Rep22(10)2742ndash2755

Fog CK Galli GG Lund AH 2012 PRDM proteins important players indifferentiation and disease Bioessays 34(1)50ndash60

Foley NM Hughes GM Huang Z Clarke M Jebb D Whelan CV Petit EJTouzalin F Farcy O Jones G et al 2018 Growing old yet stayingyoung the role of telomeres in batsrsquo exceptional longevity Sci Adv4(2)eaao0926

Foote AD Liu Y Thomas GWC Vinar T Alfoldi J Deng J Dugan Svan Elk CE Hunter ME Joshi V et al 2015 Convergent evolu-tion of the genomes of marine mammals Nat Genet47(3)272ndash275

Forbes SA Beare D Gunasekaran P Leung K Bindal N Boutselakis HDing M Bamford S Cole C Ward S et al 2015 COSMIC exploringthe worldrsquos knowledge of somatic mutations in human cancerNucleic Acids Res 43(D1)D805ndashD811

Fritz SA Bininda-Emonds ORP Purvis A 2009 Geographical variation inpredictors of mammalian extinction risk big is bad but only in thetropics Ecol Lett 12(6)538ndash549

Futreal PA Coin L Marshall M Down T Hubbard T Wooster R RahmanN Stratton MR 2004 A census of human cancer genes Nat RevCancer 4(3)177ndash183

Gabriele C Lockyer C Straley JM Jurasz CM Kato H 2010 Sightinghistory of a naturally marked humpback whale (Megaptera novaean-gliae) suggests ear plug growth layer groups are deposited annuallyMar Mam Sci 26(2)443ndash450

Gauthier JM Metcalfe CD Sears R 1997 Chlorinated organic contam-inants in blubber biopsies from northwestern Atlantic balaenopteridwhales summering in the Gulf of St Lawrence Mar Environ Res44(2)201ndash223

Gene Ontology Consortium 2015 Gene Ontology Consortium goingforward Nucleic Acids Res 43D1049ndashD1056

George JC Bada J Zeh J Scott L Brown SE OrsquoHara T Suydam R 1999Age and growth estimates of bowhead whales (Balaena mysticetus)via aspartic acid racemization Can J Zool 77(4)571ndash580

Gharib WH Robinson-Rechavi M 2013 The branch-site test of positiveselection is surprisingly robust but lacks power under synonymoussubstitution saturation and variation in GC Mol Biol Evol30(7)1675ndash1686

Grabherr MG Haas BJ Yassour M Levin JZ Thompson DA Amit IAdiconis X Fan L Raychowdhury R Zeng Q et al 2011 Full-lengthtranscriptome assembly from RNA-Seq data without a referencegenome Nat Biotechnol 29(7)644ndash652

Guindon S Dufayard J-F Lefort V Anisimova M Hordijk W Gascuel O2010 New algorithms and methods to estimate maximum-likelihood phylogenies assessing the performance of PhyML 30Syst Biol 59(3)307ndash321

Harris RS 2007 Improved pairwise alignment of genomic DNA [PhDthesis] The Pennsylvania State University State College PA USA

Hedges SB Marin J Suleski M Paymer M Kumar S 2015 Tree of lifereveals clock-like speciation and diversification Mol Biol Evol32(4)835ndash845

Heim NA Knope ML Schaal EK Wang SC Payne JL 2015 Copersquos rule inthe evolution of marine animals Science 347(6224)867ndash870

Holt C Yandell M 2011 MAKER2 an annotation pipeline and genome-database management tool for second-generation genome projectsBMC Bioinformatics 12491

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1761

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Huang DW Sherman BT Lempicki RA 2009a Bioinformatics enrich-ment tools paths toward the comprehensive functional analysis oflarge gene lists Nucleic Acids Res 37(1)1ndash13

Huang DW Sherman BT Lempicki RA 2009b Systematic and integra-tive analysis of large gene lists using DAVID bioinformatics resourcesNat Protoc 4(1)44ndash57

Hubisz MJ Pollard KS Siepel A 2011 PHAST and RPHAST phylogeneticanalysis with spacetime models Brief Bioinform 12(1)41ndash51

Hudson RR 1990 Gene genealogies and the coalescent process InFutuyama DJ Antonovics J editors Oxfords surveys in evolutionarybiology Oxford Oxford University Press p 1ndash44

Ichishima H 2016 The ethmoid and presphenoid of cetaceans JMorphol 277(12)1661ndash1674

Jackson JA Baker CS Vant M Steel DJ Medrano-Gonzalez LPalumbi SR 2009 Big and slow phylogenetic estimates of mo-lecular evolution in baleen whales (suborder Mysticeti) MolBiol Evol 26(11)2427ndash2440

Jackson JA Steel DJ Beerli P Congdon BC Olavarria C Leslie MS PomillaC Rosenbaum H Baker CS 2014 Global diversity and oceanic di-vergence of humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 281(1786)20133222

Jones KE Bielby J Cardillo M Fritz SA OrsquoDell J Orme CDL Safi KSechrest W Boakes EH Carbone C et al 2009 PanTHERIA a spe-cies-level database of life history ecology and geography of extantand recently extinct mammals Ecology 90(9)2648ndash2648

Jurka J Kapitonov VV Pavlicek A Klonowski P Kohany O WalichiewiczJ 2005 Repbase Update a database of eukaryotic repetitive ele-ments Cytogenet Genome Res 110(1ndash4)462ndash467

Kaessmann H 2010 Origins evolution and phenotypic impact of newgenes Genome Res 20(10)1313ndash1326

Kastan MB Bartek J 2004 Cell-cycle checkpoints and cancer Nature432(7015)316ndash323

Katona SK Whitehead HP 1981 Identifying humpback whales usingtheir natural markings Polar Rec 20(128)439ndash444

Keane M Semeiks J Webb AE Li YI Quesada V Craig T Madsen LB vanDam S Brawand D Marques PI et al 2015 Insights into the evolu-tion of longevity from the bowhead whale genome Cell Rep10(1)112ndash122

Kent WJ Baertsch R Hinrichs A Miller W Haussler D 2003 Evolutionrsquoscauldron duplication deletion and rearrangement in the mouseand human genomes Proc Natl Acad Sci U S A100(20)11484ndash11489

Kingsolver JG Pfennig DW 2004 Individual-level selection as acause of Copersquos rule of phyletic size increase Evolution58(7)1608ndash1612

Kinsella RJ Keuroaheuroari A Haider S Zamora J Proctor G Spudich G Almeida-King J Staines D Derwent P Kerhornou A et al 2011 EnsemblBioMarts a hub for data retrieval across taxonomic spaceDatabase (Oxford) 2011bar030

Korf I 2004 Gene finding in novel genomes BMC Bioinformatics 559Kumar S Subramanian S 2002 Mutation rates in mammalian genomes

Proc Natl Acad Sci U S A 99(2)803ndash808Lambertsen RH 1987 A biopsy system for large whales and its use for

cytogenetics J Mammal 68(2)443ndash445Langmead B Salzberg SL 2012 Fast gapped-read alignment with Bowtie

2 Nat Methods 9(4)357ndash359Larsen AH Sigurjonsson J Oslashien N Vikingsson G Palsboslashll PJ 1996

Populations genetic analysis of nuclear and mitochondrial loci inskin biopsies collected from central and northeastern NorthAtlantic humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 2631611ndash1618

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357)493ndash496

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup 2009 The Sequence AlignmentMap format andSAMtools Bioinformatics 25(16)2078ndash2079

Martineau D Lemberger K Dallaire A Labelle P Lipscomb TP Michel PMikaelian I 2002 Cancer in wildlife a case study beluga from the St

Lawrence estuary Quebec Canada Environ Health Perspect110(3)285ndash292

Mazet O Rodrıguez W Grusea S Boitard S Chikhi L 2016 On theimportance of being structured instantaneous coalescence ratesand human evolution ndash lesson for ancestral population size infer-ence Heredity 116(4)362ndash371

Michaud EJ Yoder BK 2006 The primary cilium in cell signaling andcancer Cancer Res 66(13)6463ndash6467

Milholland B Dong X Zhang L Hao X Suh Y Vijg J 2017 Differencesbetween germline and somatic mutation rates in humans and miceNat Commun 815183

Mitchell ED 1989 A new cetacean from the Late Eocene La MesetaFormation Seymour Island Antarctic Peninsula Can J Fish Aquat Sci46(12)2219ndash2235

Morgan CC Foster PG Webb AE Pisani D McInerney JO OrsquoConnell MJ2013 Heterogeneous models place the root of the placental mam-mal phylogeny Mol Biol Evol 30(9)2145ndash2156

Nadachowska-Brzyska K Burri R Smeds L Ellegren H 2016 PSMC anal-ysis of effective population sizes in molecular ecology and its appli-cation to black-and-white Ficedula flycatchers Mol Ecol25(5)1058ndash1072

Newman SJ Smith SA 2006 Marine mammal neoplasia a review VetPathol 43(6)865ndash880

Nunney L 2018 Size matters height cell number and a personrsquos risk ofcancer Proc R Soc B 285(1889)20181743

Ohsumi S 1979 Interspecies relationships among some biologicalparameters in cetaceans and estimation of the natural mortalitycoefficient of the Southern Hemisphere minke whales Rep IntWhaling Comm 29397ndash406

OrsquoLeary MA Gatesy J 2008 Impact of increased character sampling onthe phylogeny of Cetartiodactyla (Mammalia) combined analysisincluding fossils Cladistics 24397ndash442

Pagel M Venditti C Meade A 2006 Large punctuational contribution ofspeciation to evolutionary divergence at the molecular level Science314(5796)119ndash121

Palsboslashll PJ Clapham PJ Mattila DK Larsen F Sears R Siegismund HRSigurjonsson J Vasquez O Arctander P 1995 Distribution ofmtDNA haplotypes in North Atlantic humpback whales the influ-ence of behavior on population structure Mar Ecol Prog Ser1161ndash10

Palsboslashll PJ Larsen F Hansen ES 1991 Sampling of skin biopsies fromfree-raging large cetaceans in West Greenland development of newbiopsy tips and bolt designs Int Whaling Comm Spec Issue Ser13311

Palsboslashll PJ Zachariah Peery M Olsen MT Beissinger SR Berube M 2013Inferring recent historic abundance from current genetic diversityMol Ecol 22(1)22ndash40

Parra G Bradnam K Ning Z Keane T Korf I 2009 Assessing the genespace in draft genomes Nucleic Acids Res 37(1)289ndash297

Pertea M Pertea GM Antonescu CM Chang T-C Mendell JT SalzbergSL 2015 StringTie enables improved reconstruction of a transcrip-tome from RNA-Seq reads Nat Biotechnol 33(3)290ndash295

Peto R Roe FJ Lee PN Levy L Clack J 1975 Cancer and ageing in miceand men Br J Cancer 32(4)411ndash426

Pio R Corrales L Lambris JD 2014 The role of complement in tumorgrowth Adv Exp Med Biol 772229ndash262

Pollard KS Hubisz MJ Rosenbloom KR Siepel A 2010 Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 20(1)110ndash121

Putnam NH OrsquoConnell BL Stites JC Rice BJ Blanchette M Calef R TrollCJ Fields A Hartley PD Sugnet CW et al 2016 Chromosome-scaleshotgun assembly using an in vitro method for long-range linkageGenome Res 26(3)342ndash350

Quinlan AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26(6)841ndash842

Ruegg K Rosenbaum HC Anderson EC Engel M Rothschild A Baker CSPalumbi SR 2013 Long-term population size of the North Atlantichumpback whale within the context of worldwide population struc-ture Conserv Genet 14(1)103ndash114

Tollis et al doi101093molbevmsz099 MBE

1762

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Sanderson MJ 2002 Estimating absolute rates of molecular evolutionand divergence times a penalized likelihood approach Mol Biol Evol19(1)101ndash109

Sanderson MJ 2003 r8s inferring absolute rates of molecular evolutionand divergence times in the absence of a molecular clockBioinformatics 19(2)301ndash302

Sayyari E Mirarab S 2016 Fast coalescent-based computation of localbranch support from quartet frequencies Mol Biol Evol33(7)1654ndash1668

Shadat NMA Koide N Khuda II-E Dagvadorj J Tumurkhuu G Naiki YKomatsu T Yoshida T Yokochi T 2010 Retinoblastoma protein-interacting zinc finger 1 (RIZ1) regulates the proliferation of mono-cytic leukemia cells via activation of p53 Cancer Invest28(8)806ndash812

Sim~ao FA Waterhouse RM Ioannidis P Kriventseva EV Zdobnov EM2015 BUSCO assessing genome assembly and annotation com-pleteness with single-copy orthologs Bioinformatics31btv351ndashbtv3212

Slater GJ Goldbogen JA Pyenson ND 2017 Independent evolution ofbaleen whale gigantism linked to Plio-Pleistocene ocean dynamicsProc R Soc B 284(1855)20170546

Smit AFA Hubley RM Green P 2015a RepeatMasker Open-40 2013-2015 Available from httpwwwrepeatmaskerorg

Smit AFA Hubley RM Green P 2015b RepeatModeler Open-10 2008-2015 Available from httpwwwrepeatmaskerorg

Smith TD Allen J Clapham PJ Hammond PS Katona S Larsen F Lien JMattila D Palsboslashll PJ Sigurjonsson J et al 1999 An ocean-basin-widemark-recapture study of the North Atlantic humpback whale(Megaptera novaeangliae) Mar Mamm Sci 51ndash32

Stamatakis A 2014 RAxML version 8 a tool for phylogenetic analysisand post-analysis of large phylogenies Bioinformatics30(9)1312ndash1313

Stanke M Diekhans M Baertsch R Haussler D 2008 Using native andsyntenically mapped cDNA alignments to improve de novo genefinding Bioinformatics 24(5)637ndash644

Steeman ME Hebsgaard MB Fordyce RE Ho SYW Rabosky DL NielsenR Rahbek C Glenner H Soslashrensen MV Willerslev E 2009 Radiationof extant cetaceans driven by restructuring of the oceans Syst Biol58(6)573ndash585

Sulak M Fong L Mika K Chigurupati S Yon L Mongan NP Emes RDLynch VJ 2016 TP53 copy number expansion is associated with theevolution of increased body size and an enhanced DNA damageresponse in elephants Elife 51850

Supek F Bosnjak M Skunca N Smuc T 2011 REVIGO summarizes andvisualizes long lists of gene ontology terms PLoS One 6(7)e21800

Szklarczyk D Morris JH Cook H Kuhn M Wyder S Simonovic M SantosA Doncheva NT Roth A Bork P et al 2017 The STRING database in2017 quality-controlled protein-protein association networks madebroadly accessible Nucleic Acids Res 45(D1)D362ndashD368

Tacutu R Craig T Budovsky A Wuttke D Lehmann G Taranukha DCosta J Fraifeld VE de Magalh~aes JP 2012 Human Ageing GenomicResources integrated databases and tools for the biology and genet-ics of ageing Nucleic Acids Res 41(D1)D1027ndashD1033

Tariq MA Kim HJ Jejelowo O Pourmand N 2011 Whole-transcriptomeRNAseq analysis from minute amount of total RNA Nucleic AcidsRes 39(18)e120

Tarver JE Dos Reis M Mirarab S Moran RJ Parker S OrsquoReilly JE King BLOrsquoConnell MJ Asher RJ Warnow T et al 2016 The interrelationshipsof placental mammals and the limits of phylogenetic inferenceGenome Biol Evol 8(2)330ndash344

Thewissen JGM Cohn MJ Stevens LS Bajpai S Heyning J Horton WE2006 Developmental basis for hind-limb loss in dolphins and origin

of the cetacean bodyplan Proc Natl Acad Sci U S A103(22)8414ndash8418

Thewissen JGM Cooper LN Clementz MT Bajpai S Tiwari BN 2007Whales originated from aquatic artiodactyls in the Eocene epoch ofIndia Nature 450(7173)1190ndash1194

Tollis M Schiffman JD Boddy AM 2017 Evolution of cancer suppressionas revealed by mammalian comparative genomics Curr Opin GenetDev 4240ndash47

Trego KS Groesser T Davalos AR Parplys AC Zhao W Nelson MRHlaing A Shih B Rydberg B Pluth JM et al 2016 Non-catalytic rolesfor XPG with BRCA1 and BRCA2 in homologous recombination andgenome stability Mol Cell 61(4)535ndash546

UniProt Consortium 2015 UniProt a hub for protein informationNucleic Acids Res 43D204ndashD212

Valsecchi E Palsboslashll PJ Hale P Glockner-Ferrari D Ferrari M Clapham PLarsen F Mattila DK Sears R Sigurjonsson J et al 1997 Microsatellitegenetic distances between oceanic opulations of the humpbackwhale (Megaptera novaeangliae) Mol Biol Evol 14355ndash362

Venkat A Hahn MW Thornton JW 2018 Multinucleotide mutationscause false inferences of lineage-specific positive selection Nat EcolEvol 2(8)1280ndash1288

Vermeij GJ 2016 Gigantism and its implications for the history of lifePLoS One 11(1)e0146092

Wang D Zhang Y Zhang Z Zhu J Yu J 2010 KaKs_Calculator 20 atoolkit incorporating gamma-series methods and sliding windowstrategies Genomics Proteomics Bioinformatics 8(1)77ndash80

Warren WC Kuderna L Alexander A Catchen J Perez-Silva JG Lopez-Otın C Quesada V Minx P Tomlinson C Montague MJ et al 2017The novel evolution of the sperm whale genome Genome Biol Evol9(12)3260ndash3264

Webb AE Walsh TA OrsquoConnell MJ 2017 VESPA very large-scale evo-lutionary and selective pressure analyses PeerJ Comput Sci 3e118

Weng M-P Liao B-Y 2017 modPhEA model organism PhenotypeEnrichment Analysis of eukaryotic gene sets Bioinformatics33(21)3505ndash3507

Wilson Sayres MA Venditti C Pagel M Makova KD 2011 Do variationsin substitution rates and male mutation bias correlate with life-history traits A study of 32 mammalian genomes Evolution65(10)2800ndash2815

Yang Z 1998 Likelihood ratio tests for detecting positive selection andapplication to primate lysozyme evolution Mol Biol Evol15(5)568ndash573

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 24(8)1586ndash1591

Yang Z Rannala B 2006 Bayesian estimation of species divergence timesunder a molecular clock using multiple fossil calibrations with softbounds Mol Biol Evol 23(1)212ndash226

Yim H-S Cho YS Guang X Kang SG Jeong J-Y Cha S-S Oh H-M Lee J-HYang EC Kwon KK et al 2014 Minke whale genome and aquaticadaptation in cetaceans Nat Genet 46(1)88ndash92

Yu J Tian S Metheny-Barlow L Chew L-J Hayes AJ Pan H Yu G-L LiL-Y 2001 Modulation of endothelial cell growth arrest andapoptosis by vascular endothelial growth inhibitor Circ Res89(12)1161ndash1167

Zhang C Rabiee M Sayyari E Mirarab S 2018 ASTRAL-III polynomialtime species tree reconstruction from partially resolved gene treesBMC Bioinformatics 19(Suppl 6)153

Zhang Q Edwards SV 2012 The evolution of intron size in amniotes arole for powered flight Genome Biol Evol 4(10)1033ndash1043

Zhao M Kim P Mitra R Zhao J Zhao Z 2015 TSGene 20 an updatedliterature-based knowledgebase for tumor suppressor genes NucleicAcids Res 44gkv1268ndashgkD1031

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1763

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

  • msz099-TF1
  • msz099-TF2
  • msz099-TF3
  • msz099-TF4
  • msz099-TF5
  • msz099-TF6
  • msz099-TF7
  • msz099-TF8
Page 5: University of Groningen Return to the sea, get huge, …...Return to the Sea, Get Huge, Beat Cancer: An Analysis of Cetacean Genomes Including an Assembly for the Humpback Whale (Megaptera

(rodent) branch supporting negative relationships betweengeneration time (Wilson Sayres et al 2011) speciation (Pagelet al 2006) and substitution rates When applying a semi-parametric penalized likelihood (PL) method to estimate sub-stitution rate variation at 4D sites across the 12 mammals we

found that cetaceans have accumulated the lowest numberof DNA substitutions per site per million years (fig 1B andsupplementary table 3 Supplementary Material online)which may be attributed to long generation times or slowermutation rates in cetaceans (Jackson et al 2009) Germlinemutation rates are related to somatic mutation rates withinspecies (Milholland et al 2017) therefore it is possible thatslow mutation rates may limit neoplastic progression andcontribute to cancer suppression in cetaceans which is aprediction of Petorsquos Paradox (Caulin and Maley 2011)

We also obtained 152 single-copy orthologs (single-geneortholog families or SGOs see Materials and Methods andsupplementary Methods Supplementary Material online)identified in at least 24 out of 28 species totaling314844 bp and reconstructed gene trees that were binnedand analyzed using a species tree method that incorporatesincomplete lineage sorting (see Materials and MethodsZhang et al 2018) The species tree topology (supplementaryfig 4 Supplementary Material online) also included full sup-port for the accepted phylogenetic relationships withinCetacea as well as within Mysticeti and Odontoceti Lowerlocal posterior probabilities for two of the internal brancheswithin laurasiatherian mammals were likely due to the exten-sive gene tree heterogeneity that has complicated phyloge-netic reconstruction of the placental mammalian lineages(Tarver et al 2016)

We estimated divergence times in a Bayesian frameworkusing the 4D and SGO data sets independently in MCMCtree(Yang and Rannala 2006) resulting in similar posterior distri-butions and parameter estimates with overlapping highestposterior densities for the estimated divergence times ofshared nodes across the 4D and SGO phylogenies (supple-mentary figs 5 and 6 and tables 4 and 5 SupplementaryMaterial online) We estimated that the time to the mostrecent common ancestor (TMRCA) of placental mammalswas 100ndash114 Ma during the late Cretaceous the TMRCA ofcow and cetaceans (Cetartiodactyla) was 52ndash65 Ma duringthe Eocene or Paleocene the TMRCA of extant cetaceans was29ndash35 Ma during the early Oligocene or late Eocene (betweenthe two data sets) the TMRCA of baleen whales was placed9ndash26 Ma in the early Miocene or middle Oligocene and theTMRCA of humpback and common minke whales (familyBalaenopteridae) was 4ndash22 Ma during the early Pliocene orthe Miocene (fig 2A)

Table 3 Repetitive Content of the Humpback Whale (Megaptera novaeangliae) Genome Estimated with a Library of Known Mammalian Repeats(RepBase) and De Novo Repeat Identification (RepeatModeler)

RepBase RepeatModeler

Repeat Type Length (bp) Genome(3885 total)

Length (bp) Genome(3025 total)

SINEs 137574621 607 75509694 333LINEs 440955223 1946 432017456 1907LTR 142117286 627 94177184 416DNA transposons 84243186 372 54015996 238Unclassified 1303231 006 4339463 019Satellites 48894580 216 197862 001Simple repeats 20779839 092 20848394 092Low complexity 4167187 018 4281173 019

A

B

FIG 1 Substitution rates in cetacean genomes (A) Maximum likeli-hood phylogeny of 12 mammals based on 2763828 fourfold degen-erate sites Branch lengths are given in terms of substitutions per siteexcept for branches with hatched lines which are shortened for visualconvenience All branches received 100 bootstrap support (B)Based on the phylogeny in (A) a comparison of the estimated DNAsubstitution rates (in terms of substitutions per site per million years)between terminal and internal cetacean branches and terminal andinternal branches of all other mammals

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1749

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

FIG 2 Timescale of humpback whale evolution (A) Species phylogeny of 28 mammals constructed from 152 orthologs and time-calibrated usingMCMCtree Branch lengths are in terms of millions of years Node bars indicate 95 highest posterior densities of divergence times Cetaceans arehighlighted in the gray box with mean estimates of divergence times included (B) The effective population size (Ne) changes over timeDemographic histories of two North Atlantic humpback whales estimated from the PSMC analysis including 100 bootstrap replicates per analysisMutation rate used was 154e-9 per year and generation time used was 215 years

Tollis et al doi101093molbevmsz099 MBE

1750

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

A Complex Demographic History of North AtlanticHumpback WhalesWe estimated the demographic history of the North Atlantichumpback whale population applying the Pairwise SequentialMarkovian Coalescent (PSMC) (Li and Durbin 2011) to theshort-insert libraries generated during this study as well assequence reads from a second North Atlantic humpbackwhale (Arnason et al 2018) (fig 2B supplementary figs 7and 8 Supplementary Material online) Consistent with thefindings of Arnason et al (2018) we estimated that the largesthumpback whale population sizes were 2 Ma during thePliocenendashPleistocene transition followed by a steady declineuntil 1 Ma The PSMC trajectories of the two humpbackwhales began to diverge 100000 years ago and the esti-mated confidence intervals from 100 bootstraps for eachPSMC analysis were nonoverlapping in the more recentbins Both humpback PSMC trajectories suggested sharp pop-ulation declines beginning 25000ndash45000 years agoHowever interpreting inferred PSMC plots of pastldquodemographicrdquo changes is nontrivial in a globally distributedspecies connected by repeated occasional gene flow such ashumpback whales (Baker et al 1993 Palsboslashll et al 1995Jackson et al 2014) The apparent changes in effective pop-ulation size may represent changes in abundance interoceanconnectivity or a combination of both (Hudson 1990 Palsboslashllet al 2013) Several genetic and genome-based studies ofcetaceans have demonstrated how past large-scale oceanicchanges have affected the evolution of cetaceans (Steemanet al 2009) including baleen whales (Arnason et al 2018)Although the population genetic structure of humpbackwhales in the North Atlantic is not fully resolved the levelof genetic divergence among areas is very low (Larsen et al1996 Valsecchi et al 1997) Therefore the difference betweenthe two humpback whale PSMC trajectories may be due torecent admixture (Baker et al 1993 Palsboslashll et al 1995 Ruegget al 2013 Jackson et al 2014) intraspecific variation andpopulation structure (Mazet et al 2016) as well as errorsdue to differences in sequence coverage (Nadachowska-Brzyska et al 2016)

Segmental Duplications in Cetacean GenomesContain Genes Involved in Apoptosis and TumorSuppressionMammalian genomes contain gene-rich segmental duplica-tions (Alkan et al 2009) which may represent a powerfulmechanism by which new biological functions can arise(Kaessmann 2010) We employed a read-mapping approachto annotate large segmental duplications (LSDs) 10 kb inthe humpback whale genome assembly and ten additionalcetaceans for which whole-genome shotgun data were avail-able (see Materials and Methods supplementary Methodsand supplementary table 6 Supplementary Material online)We found that cetacean genomes contained on average 318LSDs (656 SD) which comprised 99 Mb (618 Mb) andaveraged 31 kb in length (624 kb) We identified10128534 bp (04) of the humpback whale genome assem-bly that comprised 293 LSDs averaging 34568 bp in length

Fifty-one of the LSDs were shared across all 11 cetaceangenomes (supplementary fig 9 Supplementary Material on-line) In order to determine the potential role of segmentalduplications during the evolution of cetacean-specific pheno-types we identified 426 gene annotations that overlappedcetacean LSDs including several genes annotated for viralresponse Other genes on cetacean LSDs were involved inaging in particular DLD in the bowhead whale andKCNMB1 in the blue whale this may reflect relevant adapta-tions contributing to longevity in two of the largest andlongest-lived mammals (Ohsumi 1979 George et al 1999)Multiple tumor suppressor genes were located on cetaceanLSDs including 1) SALL4 in the sei whale 2) TGM3 andSEMA3B in the orca 3) UVRAG in the sperm whale NorthAtlantic right whale and bowhead whale and 4) PDCD5which is upregulated during apoptosis (Zhao et al 2015)and was found in LSDs of all 11 queried cetacean genomesPDCD5 pseudogenes have been identified in the human ge-nome and several Ensembl-hosted mammalian genomescontain one-to-many PDCD5 orthologs however we anno-tated only a single copy of PDCD5 in the humpback whaleassembly This suggests that in many cases gene duplicationsare collapsed during reference assembly but can be retrievedthrough shotgun read-mapping methods (Carbone et al2014) We annotated fully resolved SALL4 and UVRAG copynumber variants in the humpback whale genome assemblyand by mapping the RNA-Seq data from skin to the genomeassembly and annotation (see Materials and methods) wefound that three annotated copies of SALL4 were expressed inhumpback whale skin as were two copies of UVRAG

We also found that145 Mb (6923 kb) of each cetaceangenome consists of LSDs not found in other cetaceans mak-ing them species-specific which averaged 244 kb (6146kb) in length (supplementary table 7 and fig 9Supplementary Material online) The minke whale genomecontained the highest number of genes on its species-specificLSDs (32) After merging the LSD annotations for the twohumpback whales we identified 57 species-specific LSDs forthis species comprising 977 kb and containing nine dupli-cated genes Humpback whale-specific duplications includedthe genes PRMT2 which is involved in growth and regulationand promotes apoptosis SLC25A6 which may be responsiblefor the release of mitochondrial products that trigger apopto-sis and NOX5 which plays a role in cell growth and apoptosis(UniProt Consortium 2015) Another tumor suppressor geneTPM3 was duplicated in the humpback whale assemblybased on our gene annotation However these extranumer-ary copies of TPM3 were not annotated on any humpbackwhale LSDs lacked introns and contained mostly the sameexons suggesting retrotransposition rather than segmentalduplication as a mechanism for their copy number expansion(Kaessmann 2010) According to the RNA-Seq data all sevencopies of TPM3 are expressed in humpback whale skin

Duplications of the tumor suppressor gene TP53 havebeen inferred as evidence for cancer suppression in elephants(Abegglen et al 2015 Caulin et al 2015 Sulak et al 2016)During our initial scans for segmental duplications we no-ticed a large pileup of reads in the MAKER-annotated

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1751

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

humpback whale TP53 (data not shown) We PCR-amplifiedcloned and sequenced this region from a humpback whaleDNA sample inferring four haplotypes that differ at two bases(supplementary Methods Supplementary Material online)After manually annotating TP53 in the humpback whalewe determined that these nucleotide variants fell in noncod-ing regions of the gene one occurred upstream of the startcodon whereas the other occurred between the first andsecond coding exons Other genomic studies have concludedthat TP53 is not duplicated in cetaceans (Yim et al 2014Keane et al 2015 Sulak et al 2016) We consider the possi-bility of at least two TP53 homologs in the genome of thehumpback whale although more data are required to resolvethis Regardless cancer suppression likely arose in differentmammalian lineages via multiple molecular etiologiesOverall our results reveal several copy number expansionsin cetaceans related to immunity aging and cancer suggest-ing that cetaceans are among the large mammals that haveevolved specific adaptations related to cancer resistance

Accelerated Regions in Cetacean Genomes AreSignificantly Enriched with Pathways Relevant toCancerIn order to determine genomic loci underlying cetacean adap-tations we estimated regions in the 12-mammal WGA withelevated substitution rates that were specific to the cetaceanbranches of the mammalian phylogeny These genomicregions departed from neutral expectations in a manner con-sistent with either positive selection or relaxed purifying se-lection along the cetacean lineage (Pollard et al 2010) Wesuccessfully mapped 3260 protein-coding genes with func-tional annotations that overlap cetacean-specific acceleratedregions which were significantly enriched for Gene Ontology(GO) categories such as cell-cell signaling (GO0007267) andcell adhesion (GO0007155) (table 4) Adaptive change in cellsignaling pathways could have maintained the ability of ceta-ceans to prevent neoplastic progression as they evolved largerbody sizes Adhesion molecules are integral to the develop-ment of cancer invasion and metastasis and these resultssuggest that cetacean evolution was accompanied by selec-tion pressure changes on both intra- and extracellular inter-actions Cetacean-specific genomic regions with elevatedsubstitution rates were also significantly enriched in genesinvolved in B-cell-mediated immunity (GO0019724) likely

due to the important role of regulatory cells which modulateimmune response to not only pathogens but perhaps tumorsas well In addition cetacean-specific acceleration in regionscontrolling complement activation (GO0006956) may haveprovided better immunosurveillance against cancer and fur-ther protective measures against malignancies (Pio et al2014) We also found that accelerated regions in cetaceangenomes were significantly enriched for genes controllingsensory perception of smell (GO0007608) perhaps due tothe relaxation of purifying selection in olfactory regions whichwere found to be underrepresented in cetacean genomes(Yim et al 2014)

Selection Pressures on Protein-Coding Genes duringCetacean Evolution Point to Many CetaceanAdaptations Including Cancer SuppressionTo gain further insight into the genomic changes underlyingthe evolution of large body sizes in cetaceans we employedphylogenetic targeting to maximize statistical power in pair-wise evolutionary genomic analyses (Arnold and Nunn 2010)This resulted in maximal comparisons between 1) the orcaand the bottlenose dolphin and 2) the humpback whale andcommon minke whale Despite their relatively recent diver-gences (eg the orcabottlenose dolphin divergence is similarin age to that of the humanchimpanzee divergence seefig 2) the species pairs of common minkehumpback andorcadolphin have each undergone extremely divergent evo-lution in body size and longevity (fig 3) Humpback whalesare estimated to weigh up to four times as much as commonminke whales and are reported to have almost double thelongevity and orcas may weigh almost 20 times as much asbottlenose dolphins also with almost double the lifespan(Tacutu et al 2012) In order to offset the tradeoffs associatedwith the evolution of large body size with the addition ofmany more cells and longer lifespans since the divergence ofeach species pair we hypothesize that necessary adaptationsfor cancer suppression should be encoded in the genomes aspredicted by Petorsquos Paradox (Tollis et al 2017)

For each pairwise comparison we inferred pairwise ge-nome alignments with the common minke whale and orcagenome assemblies as targets respectively and extractedprotein-coding orthologous genes We then estimated theratio of nonsynonymous substitutions per synonymous siteto synonymous substitutions per synonymous site (dNdS) in

Table 4 GO Terms for Biological Processes Overrepresented by Genes Overlapping Genomic Regions with Elevated Substitution Rates That AreUnique to the Cetacean Lineage

Go Term Description Number of Genes Fold Enrichment P-Valuea

GO0007608 Sensory perception of smell 157 458 124E-40GO0006956 Complement activation 39 291 491E-05GO0019724 B-cell-mediated immunity 39 291 491E-05GO0032989 G-protein-coupled receptor signaling pathway 159 244 196E-17GO0042742 Defense response to bacterium 39 244 220E-03GO0009607 Response to biotic stimulus 43 209 185E-02GO0007155 Cell adhesion 101 199 174E-06GO0007267 Cellndashcell signaling 137 169 284E-05

aAfter Bonferonni correction for multiple testing

Tollis et al doi101093molbevmsz099 MBE

1752

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

order to measure selective pressures acting on each ortholo-gous gene pair during cetacean evolution A dNdSgt1 is usedto infer potentially functional amino acid changes in candi-date genes subjected to positive selection (Fay and Wu 2003)Among an estimated 435 genes with dNdS gt1 in the com-mon minkehumpback pairwise comparison we detectedeight genes belonging to the JAK-STAT signaling pathway(39-fold enrichment Pfrac14 11E-3 Fisherrsquos exact test) and seveninvolved in cytokinendashcytokine receptor interaction (41-foldenrichment Pfrac14 17E-2 Fisherrsquos exact test) suggesting positiveselection acting on pathways involved in cell proliferationThese genes included multiple members of the tumor necro-sis factor subfamily such as TNFSF15 which inhibits angiogen-esis and promotes the activation of caspases and apoptosis(Yu et al 2001) A dNdS gt1 was also detected in seven genesinvolved in the negative regulation of cell growth(GO0030308 31-fold enrichment Pfrac14 803E-3 Fisherrsquos exacttest) and five genes involved in double-strand break repair(GO0006302 40-fold enrichment Pfrac14 803E-3 Fisherrsquos exacttest) Although these results suggest the evolution of aminoacid differences since the split between common minke andhumpback whales in genes affecting cell growth proliferationand maintenance the GO category enrichment tests did notpass significance criteria after Bonferroni corrections for mul-tiple testing We found 18 genes that are mutated in cancersaccording to the COSMIC v85 database (Forbes et al 2015) inthe common minkehumpback comparison including a sub-set of five annotated as tumor suppressor genes oncogenesor fusion genes in the Cancer Gene Census (CGC Futreal et al2004) which are highlighted in table 5 The complete list ofCOSMIC genes with elevated dNdS in the pairwise compar-isons is given in supplementary table 8 SupplementaryMaterial online We detected 555 orthologous genes withdNdS gt1 in the orcadolphin comparison which are signifi-cantly enriched (after Bonferroni correction for multiple test-ing) for biological processes such as immune response cellactivation and regulation of cytokines (table 6) and 41 of

which are known cancer genes according to COSMIC andCGC (table 5 supplementary table 8 Supplementary Materialonline) These results are consistent with our accelerated re-gion analysis based on the WGA which showed acceleratedevolution in immunity pathways (above see table 4) Forinstance eight genes (CD58 CD84 KLF13 SAMSN1 CTSGGPC3 LTF and SPG21) annotated for immune system process(GO0002376) were found in cetacean-specific acceleratedgenomic regions and also had a pairwise dNdS gt1 in theorcadolphin comparison mirroring other recent genomicanalyses of immunity genes in orcas (Ferris et al 2018) Ourresults also suggest that the evolution of gigantism and longlifespans in cetaceans was accompanied by selection actingon many genes related to somatic maintenance and cellsignaling

As a more accurate assessment of selection pressure var-iation acting on protein-coding genes across cetacean evolu-tion we conducted an additional assessment of dNdS usingbranch-site codon models implemented in codeml (Yang1998) We employed extensive filtering of the branch siteresults including both false discovery rate (FDR) andBonferroni corrections for multiple testing (see Materialsand methods) and conservatively estimated that 450protein-coding genes were subjected to positive selection incetaceans These include 54 genes along the ancestral ceta-cean branch 12 along the ancestral toothed whale branch 84along the ancestral baleen whale branch 74 in the ancestor ofcommon minke and humpback whales and 212 unique tothe humpback whale branch (fig 4A) Cetacean positivelyselected genes were annotated for functions related to exten-sive changes in anatomy growth cell signaling and cell pro-liferation (fig 4B) For instance in the branch-site models forhumpback whale positively selected genes are enriched forseveral higher-level mouse limb phenotypes including thoseaffecting the limb long bones (MP0011504 15 genes FDR-corrected P-valuefrac14 0001) and more specifically the hindlimb stylopod (MP0003856 seven genes FDRfrac14 0024) or

FIG 3 Diversity in both body size and lifespan within rorqual baleen whales (Balaenopteridae) and dolphins (Delphinidae) Maximal pairings usingphylogenetic targeting (Arnold and Nunn 2010) of genome assembly-enabled cetaceans resulted in the most extreme divergence in both body sizeand lifespan between humpback whale (Megaptera novaeangliae) and common minke whale (Balaenoptera acutorostrata) within theBalaenopteridae facing right and orca (Orcinus orca) and bottlenose dolphin (Tursiops truncatus) within the Delphinidae facing left Traitdata were collected from the panTHERIA (Jones et al 2009) and AnAge (Tacutu et al 2012) databases

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1753

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Table 5 CGC Genes with dNdS gt 1 as Revealed by Pairwise Comparisons of Cetacean Genomes

Comparison Gene Symbol Gene Name Role in Cancer Function

Minkehumpback CD274 CD274 molecule TSG Plays a critical role in induction and maintenance ofimmune tolerance to selfa

ETNK1 Ethanolamine kinase 1 TSG Suppresses escaping of programmed cell deathb

IL21R Interleukin 21 receptor Fusion The ligand binding of this receptor leads to the acti-vation of multiple downstream signaling mole-cules including JAK1 JAK3 STAT1 and STAT32

MYOD1 Myogenic differentiation 1 Fusion Regulates muscle cell differentiation by inducing cellcycle arrest a prerequisite for myogenic initiationa

PHF6 PHD finger protein 6 TSG Encodes a protein with two PHD-type zinc fingerdomains indicating a potential role in transcrip-tional regulation that localizes to the nucleolusa

Orcadolphin BTG1 B-cell translocation gene 1antiproliferative

TSG fusion Member of an antiproliferative gene family that reg-ulates cell growth and differentiationa

CD274 CD274 molecule TSG fusion Plays a critical role in induction and maintenance ofimmune tolerance to selfa

FANCD2 Fanconi anemia comple-mentation group D2

TSG Suppresses genome instability and mutations pro-motes escaping programmed cell death suppressesproliferative signaling suppresses invasion andmetastasisb

FAS Fas cell surface deathreceptor

TSG Promotes cell replicative immortality promotesproliferative signaling promotes invasion and me-tastasis suppresses escaping programmed celldeathb

FGFR4 Fibroblast growth factor re-ceptor 4

Oncogene Promotes proliferative signaling promotes invasionand metastasisb

GPC3 Glypican 3 Oncogene TSG Promotes invasion and metastasis promotes sup-pression of growthb

HOXD11 Homeobox D11 Oncogene fusion The homeobox genes encode a highly conservedfamily of transcription factors that play an impor-tant role in morphogenesis in all multicellularorganismsa

HOXD13 Homeobox D13 Oncogene fusion

LASP1 LIM and SH3 protein 1 Fusion The encoded protein has been linked to metastaticbreast cancer hematopoetic tumors such as B-celllymphomas and colorectal cancera

MLF1 Myeloid leukemia factor 1 TSG fusion This gene encodes an oncoprotein which is thought toplay a role in the phenotypic determination ofhemopoetic cells Translocations between this geneand nucleophosmin have been associated withmyelodysplastic syndrome and acute myeloidleukemiaa

MYB v-myb myeloblastosis viraloncogene homolog

Oncogene fusion This gene may be aberrantly expressed or rearrangedor undergo translocation in leukemias and lym-phomas and is considered to be an oncogenea

MYD88 Myeloid differentiation pri-mary response gene (88)

Oncogene Promotes escaping programmed cell death promotesproliferative signaling promotes invasion and me-tastasis promotes tumor promotinginflammationb

NR4A3 Nuclear receptor subfamily4 group A member 3(NOR1)

Oncogene fusion Encodes a member of the steroidndashthyroid hormonendashretinoid receptor superfamily that may act as atranscriptional activatora

PALB2 Partner and localizer ofBRCA2

TSG This protein binds to and colocalizes with the breastcancer 2 early onset protein (BRCA2) in nuclear fociand likely permits the stable intranuclear localiza-tion and accumulation of BRCA2a

PML Promyelocytic leukemia TSG fusion Expression is cell-cycle related and it regulates the p53response to oncogenic signalsa

RAD21 RAD21 homolog(Schizosaccharomycespombe)

Oncogene TSG Promotes invasion and metastasis suppresses ge-nome instability and mutations suppresses escap-ing programmed cell deathb

STIL SCLTAL1 interruptinglocus

Oncogene fusion Encodes a cytoplasmic protein implicated in regula-tion of the mitotic spindle checkpoint a regulatorypathway that monitors chromosome segregationduring cell division to ensure the proper distribu-tion of chromosomes to daughter cellsa

(continued)

Tollis et al doi101093molbevmsz099 MBE

1754

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

femur (MP0000559 six genes FDRfrac14 0019) These pheno-types are reminiscent of the developmental basis of hind limbloss in cetaceans embryonic studies show that hind limbbuds are initially formed but disappear by the fifth gestationalweek (Thewissen et al 2006) Enriched mouse phenotypes arealso related to the unique cetacean axial skeleton(MP0002114 25 genes FDRfrac14 0016) most notably in theskull including craniofacial bones (MP0002116 17 genesFDRfrac14 0018) teeth (MP0002100 nine genes FDRfrac14 0004)and the presphenoid (MP0030383 three genesFDRfrac14 0003) Past analyses of the cetacean basicranial ele-ments revealed that the presphenoid was extensively modi-fied along the cetacean lineage (Ichishima 2016)

Positively selected genes unique to the humpback whalewere significantly enriched for a single biological process reg-ulation of cell cycle checkpoint (GO1901976 1857-fold en-richment Pfrac14 002 after Bonferonni correction for multipletesting) suggesting positive selection in pathways that con-trol responses to endogenous or exogenous sources of DNAdamage and limit cancer progression (Kastan and Bartek2004) We detected a significant number of proteinndashproteininteractions among humpback whale-specific positively se-lected genes (number of nodesfrac14 204 number of edgesfrac14 71expected number of edgesfrac14 51 Pfrac14 0004 supplementaryfig 10 Supplementary Material online) including genes thatare often coexpressed and involved in DNA repair DNA rep-lication and cell differentiation For instance we identifiedsignificant interactions between DNA2 which encodes a heli-case involved in the maintenance of DNA stability andWDHD1 which acts as a replication initiation factorAnother robust protein interaction network was detected

between a number of genes involved in the genesis and main-tenance of primary cilia The highest scoring functional an-notation clusters resulted in key words such as ciliopathy(seven genes) and cell projection (16 genes) and GO termssuch as cilium morphogenesis cilium assembly ciliary basalbody and centriole The primary cilia of multicellular eukar-yotes control cell proliferation by mediating cell-extrinsic sig-nals and regulating cell cycle entry and defects in ciliaryregulation are common in many cancers (Michaud andYoder 2006)

Our branch-site test results indicated that the evolution ofcetacean gigantism was accompanied by strong selection onmany pathways that are directly linked to cancer (fig 4C) Weidentified 33 genes that are mutated in human cancers(according to the COSMIC database) that were inferred assubjected to positive selection in the humpback whale line-age including the known tumor suppressor genes ATR whichis a protein kinase that senses DNA damage upon genotixicstress and activates cell cycle arrest and RECK which sup-presses metastasis (Forbes et al 2015) Multiple members ofthe PR domain-containing gene family (PRDM) evolved un-der positive selection across cetaceans including the tumorsuppressor genes PRDM1 whose truncation leads to B-cellmalignancies and PRDM2 which regulates the expressionand degradation of TP53 (Shadat et al 2010) and whoseforced expression causes apoptosis and cell cycle arrest incancer cell lines (Fog et al 2012) In baleen whales ERCC5which is a DNA repair protein that partners with BRCA1 andBRCA2 to maintain genomic stability (Trego et al 2016) andsuppresses UV-induced apoptosis (Clement et al 2006)appeared to have been subjected to positive selection as

Table 5 Continued

Comparison Gene Symbol Gene Name Role in Cancer Function

TAL1 T-cell acute lymphocyticleukemia 1 (SCL)

Oncogene fusion Implicated in the genesis of hemopoietic malignan-cies and may play an important role in hemopoieticdifferentiationa

TNFRSF14 Tumor necrosis factor re-ceptor superfamilymember 14 (herpesvirusentry mediator)

TSG The encoded protein functions in signal transductionpathways that activate inflammatory and inhibi-tory T-cell immune responsea

TNFRSF17 Tumor necrosis factor re-ceptor superfamilymember 17

Oncogene fusion This receptor also binds to various TRAF familymembers and thus may transduce signals for cellsurvival and proliferationa

NOTEmdashTSG tumor suppressor geneaSource RefSeqbSource Cancer Hallmark from CGC

Table 6 GO Terms for Biological Processes Overrepresented by Genes with Pairwise dNdS gt 1 in the Orca Bottlenose Dolphin Comparison

GO Term Description Number of Genes Fold Enrichment P-Valuea

GO0031347 Regulation of defense response 36 270 291E-2GO0050776 Regulation of immune response 50 218 195E-3GO0002694 Regulation of leukocyte activation 31 249 189E-2GO0002275 Myeloid cell activation involved in immune response 29 253 237E-2GO0002699 Positive regulation of immune effector process 19 446 101-E3GO0042108 Positive regulation of cytokine biosynthetic process 9 687 223E-2

aAfter Bonferonni correction for multiple testing

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1755

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

well Across all the branch-site models positively selectedgenes represented multiple functional categories relevant tocancer and Petorsquos Paradox

Among the cancer-related genes subjected to positiveselection in cetaceans we identified two with identicalamino acid changes among disparate taxa united by thetraits of large body size andor extreme longevitySpecifically PRDM13 is a tumor suppressor gene that actsas a transcriptional repressor and we found identical DEamino acid substitutions in this gene in sperm whale dol-phin orca and humpback but also manatee (Trichechusmanatus) and African elephant (Loxodonta africana) whichare large-bodied afrotherian mammals that have been thefocus of cancer suppression research (Abegglen et al 2015Sulak et al 2016) Secondly POLE is a cancer-related genethat participates in DNA repair and replication and weobserved one IV substitution shared among orca dol-phin bowhead humpback and common minke whalebut also elephant as well as a second IV substitutionshared with these cetaceans and the little brown bat(Myotis lucifugus) Vesper bats such as M lucifugus areknown for their exceptional longevity relative to theirbody size and have been proposed as model organisms insenescence and cancer research (Foley et al 2018) Parallelchanges in cancer-related genes across these

phylogenetically distinct mammals suggest natural selectionhas acted on similar pathways that limit neoplastic progres-sion in large and long-lived species (Tollis et al 2017)

Petorsquos Paradox and Cancer in Whales and Other LargeMammalsLarge body size has evolved numerous times in mammalsand although it is exemplified in some extant cetaceans gi-gantism is also found in afrotherians perissodactyls and car-nivores (Baker et al 2015) Our results suggest that cancersuppression in large and long-lived mammals has also evolvednumerous times However none of these species iscompletely immune to cancer Elephants have at least a 5lifetime risk of cancer mortality (Abegglen et al 2015) whichis far less than humans but detecting cancer and estimatingcancer incidence and mortality rates in wild cetaceans is morechallenging Mathematical modeling predicting the lifetimerisk of colorectal cancer in mice and humans yielded a rate ofcolorectal cancer at 50 in blue whales by age 50 and 100by age 90 (Caulin et al 2015) This high rate of cancer mor-tality is an unlikely scenario and taken with our genomicresults presented here it suggests that cetaceans have evolvedmechanisms to limit their overall risk of cancer Among ba-leen whales benign neoplasms of the skin tongue and centralnervous system have been reported in humpback whales and

Orca

Minke whale

Cow

Bowhead whale

Bottlenose dolphin

Humpback whale

Sperm whale54

12

84

74

212A

ATRT Suppression of growth escaping programmedcell death change of cellular energetics

Humpback whale

BCORL1TF Invasion and metastasis

PICALMT Change of cellular energetics

PRDM2T Regulation of TP53 expression and degredation

TPRF Involved in the activation of oncogenic kinases

AMER1T Promotes suppression of growth

CCNB1IP1T Suppresses invasion and metastasis

POLET Participates in DNA repair and DNA replication

Cetacea ancestor

KMT2CT Methylates rsquoLys-4rsquo of histone H3

POU2AF1 Promotes proliferative signalingO

PRDM1T Regulation of TP53 expression anddegradation

ERCC5T Excision repair DNA double-strand breakrepair following UV damage

Mysticeti ancestor

PMLT Positively regulates TP53 by inhibiting MDM2-dependent degredation

Odontoceti ancestor

MUC1F Overexpressed in epithelial tumors

Minke and humpback whale ancestor

RETO Plays a role in cell differentiation and growth

C COSMIC GENES

anatomicalstructure

formation involvedin morphogenesis

autophagy

cellproliferation

cellminuscellsignaling

circulatorysystemprocess

growth

pigmentation

proteinmaturation

ribonucleoproteincomplexassembly

symbiosisencompassing

mutualismthrough

parasitismvacuolartransport

B

20

FIG 4 Positively selected genes during cetacean evolution (A) Species tree relationships of six modern cetaceans with complete genomeassemblies estimated from 152 single-copy orthologs Branch lengths are given in coalescent units Outgroup taxa are not shown The completespecies tree of 28 mammals is shown in supplementary figure 4 Supplementary Material online Boxes with numbers indicate the number ofpositively selected genes passing filters and a Bonferroni correction detected on each branch (B) TreeMap from REVIGO for GO biologicalprocesses terms represented by genes evolving under positive selection across all cetaceans Rectangle size reflects semantic uniqueness of GOterm which measures the degree to which the term is an outlier when compared semantically to the whole list of GO terms (C) Cancer gene namesand functions from COSMIC found to be evolving under positive selection in the cetacean branch-site models Superscripts for gene namesindicate as follows T tumor suppressor gene O oncogene F fusion gene Asterisks indicate P-value following FDR correction for multiple testingPlt 001 Plt 0001 Plt 00001

Tollis et al doi101093molbevmsz099 MBE

1756

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

ovarian carcinomas and lymphomas have been detected infin whales (Newman and Smith 2006) Among smaller ceta-ceans one unusually well-documented case study concludedthat 27 of beluga whales (Delphinapterus leucas) found deadin the St Lawrence estuary had cancer which may have con-tributed to 18 of the total mortality in that population(Martineau et al 2002) The authors suggested that thehigh degree of polycyclic aromatic hydrocarbons releasedinto the estuary by nearby industry may have contributedto this elevated cancer risk (Martineau et al 2002) Bycontrast the larger baleen whale species in the Gulf ofSt Lawrence appear to have lower contaminant burdenslikely due to ecological differences (Gauthier et al 1997)Interestingly unlike in human cells homologous recom-bination is uninhibited in North Atlantic right whale lungcells following prolonged exposure to the human lungcarcinogen particulate hexavalent chromium (Browninget al 2017) suggesting adaptations for high-fidelity DNArepair in whales

In this study we provide a de novo reference assemblyfor the humpback whalemdashone of the more well-studiedgiants living on Earth today The humpback whale genomeassembly is highly contiguous and contains a comparablenumber of orthologous genes to other mammalian ge-nome projects Our comparisons with other complete ce-tacean genomes confirm the results of other studies whichconcluded that rorqual whales likely began diversifyingduring the Miocene (Slater et al 2017 Arnason et al2018) We found indications of positive selection onmany protein-coding genes suggestive of adaptive changein pathways controlling the mammalian appendicular andcranial skeletal elements which are relevant to highly spe-cialized cetacean phenotypes as well as in many immunitygenes and pathways that are known to place checks onneoplastic progression LSDs in cetacean genomes containmany genes involved in the control of apoptosis includingknown tumor suppressor genes and skin transcriptomeresults from humpback whale suggest many gene duplica-tions whether through segmental duplication or retro-transposition are transcribed and hence likely functionalWe also use genome-wide evidence to show that germlinemutation rates may be slower in cetaceans than in othermammals which has been suggested in previous studies(Jackson et al 2009) and we suggest as a corollary thatcetacean somatic mutations rates may be lower as wellThese results are consistent with predictions stemmingfrom Petorsquos Paradox (Peto et al 1975 Caulin and Maley2011) which posited that gigantic animals have evolvedcompensatory adaptations to cope with the negativeeffects of orders of magnitude more cells and long lifespansthat increase the number of cell divisions and cancer riskover time Altogether the humpback whale genome as-sembly will aid comparative oncology research that seeksto improve therapeutic targets for human cancers as wellas provide a resource for developing useful genomicmarkers that will aid in the population management andconservation of whales

Materials and Methods

Tissue Collection and DNA ExtractionBiopsy tissue was collected from an adult female humpbackwhale (ldquoSaltrdquo NCBI BioSample SAMN1058501) in the Gulf ofMaine western North Atlantic Ocean using previously de-scribed techniques (Lambertsen 1987 Palsboslashll et al 1991) andflash frozen in liquid nitrogen We extracted DNA from skinusing the protocol for high-molecular-weight genomic DNAisolation with the DNeasy Blood and Tissue purification kit(Qiagen) Humpback whales can be individually identifiedand studied over time based on their unique ventral flukepigmentation (Katona and Whitehead 1981) Salt was specif-ically selected for this study because of her 35-year priorsighting history which is among the lengthiest and detailedfor an individual humpback whale (Center for Coastal Studiesunpublished data)

De Novo Assembly of the Humpback Whale GenomeUsing a combination of paired-end and mate-pair libraries denovo assembly was performed using Meraculous 204(Chapman et al 2011) with a kmer size of 47 Reads weretrimmed for quality sequencing adapters and mate-pairadapters using Trimmomatic (Bolger et al 2014) The genomesize of the humpback whale was estimated using the shortreads by counting the frequency of kmers of length 27 oc-curring in the 180-bp data set estimating the kmer coverageand using the following formula genome size frac14 total kmers kmer coverage

Chicago Library Preparation and SequencingFour Chicago libraries were prepared as described previously(Putnam et al 2016) Briefly for each library500 ng of high-molecular-weight genomic DNA (mean fragment lengthgt50kb) was reconstituted into chromatin in vitro and fixed withformaldehyde Fixed chromatin was digested with MboI orDpnII the 50-overhangs were repaired with biotinylatednucleotides and blunt ends were ligated After ligation cross-links were reversed and the DNA purified from protein Biotinthat was not internal to ligated fragments was removed fromthe purified DNA The DNA was then sheared to 350 bpmean fragment size and sequencing libraries were generatedusing NEBNext Ultra (New England BioLabs) enzymes andIllumina-compatible adapters Biotin-containing fragmentswere isolated using streptavidin beads before PCR enrichmentof each library The libraries were sequenced on an IlluminaHiSeq 2500 platform

Scaffolding the De Novo Assembly with HiRiseThe input de novo assembly shotgun reads and Chicagolibrary reads were used as input data for HiRise a softwarepipeline designed specifically for using Chicago data to scaf-fold genome assemblies (Putnam et al 2016) Shotgun andChicago library sequences were aligned to the draft inputassembly using a modified SNAP read mapper (httpsnapcsberkeleyedu) The separations of Chicago read pairsmapped within draft scaffolds were analyzed by HiRise toproduce a likelihood model for genomic distance between

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1757

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

read pairs and the model was used to identify putative mis-joins and to score prospective joins After scaffolding shotgunsequences were used to close gaps between contigs

Assessing the Gene Content of the Humpback WhaleAssemblyThe expected gene content of the assembly was evaluatedusing the Core Eukaryotic Genes Mapping Approach (Parraet al 2009) which searches the assembly for 458 highly con-served proteins and reports the proportion of 248 of the mosthighly conserved orthologs that are present in the assemblyWe also used the Benchmarking Universal Single CopyOrthologs (BUSCO v201 Sim~ao et al 2015) which analyzesgenome assemblies for the presence of 3023 genes conservedacross vertebrates as well as a set of 6253 genes conservedacross laurasiatherian mammals

Transcriptome Sequencing and AssemblyIn order to aid in our gene-finding efforts for the humpbackwhale genome assembly and to measure gene expression wegenerated transcripts from skin tissue by extracting total RNAusing the QIAzol Lysis Reagent (Qiagen) followed by purifi-cation on RNeasy spin columns (Qiagen) RNA integrity andquantity were determined on the Agilent 2100 Bioanalyzer(Agilent) using the manufacturerrsquos protocol The total RNAwas treated with DNase using DNase mix from the RecoverAllTotal Nucleic Acid Isolation kit (Applied BiosystemsAmbion) The RNA library was prepared and sequenced bythe Genome Technology Center at the University ofCalifornia Santa Cruz including cDNA synthesis with theOvation RNA-Seq system V2 (Nugen) and RNA amplificationas described previously (Tariq et al 2011) We used 05ndash1mgof double-stranded cDNA for library preparation shearedusing the Covaris S2 size-selected for 350ndash450 bp using au-tomated electrophoretic DNA fractionation system(LabChipXT Caliper Life Sciences) Paired-end sequencing li-braries were constructed using Illumina TruSeq DNA SamplePreparation Kit Following library construction samples werequantified using the Bioanalyzer and sequenced on theIllumina HiSeq 2000 platform to produce 2 100 bp se-quencing reads We then used Trinity (Grabherr et al 2011)to assemble the adapter-trimmed RNA-Seq reads intotranscripts

Genome AnnotationWe generated gene models for the humpback whale usingmultiple iterations of MAKER2 (Holt and Yandell 2011) whichincorporated 1) direct evidence from the Trinity-assembledtranscripts 2) homology to NCBI proteins from ten mammals(human mouse dog cow sperm whale bottlenose dolphinorca bowhead whale common minke whale and baiji) andUniProtKBSwiss-Prot (UniProt Consortium 2015) and 3) abinitio gene predictions using SNAP (11292013 release Korf2004) and Augustus v302 (Stanke et al 2008) A detaileddescription of the annotation pipeline is provided in the sup-plementary Methods Supplementary Material online Finalgene calls were annotated functionally by BlastP similarity to

UniProt proteins (UniProt Consortium 2015) with an e-valuecutoff of 1e-6

Repeat Annotation and Evolutionary AnalysisTo analyze the repetitive landscape of the humpback whalegenome we used both database and de novo modelingmethods For the database method we ran RepeatMaskerv405 (httpwwwrepeatmaskerorg accessed August 212017) (Smit et al 2015a) on the final assembly indicatingthe ldquomammaliardquo repeat library from RepBase (Jurka et al2005) For the de novo method we scanned the assemblyfor repeats using RepeatModeler v108 (httpwwwrepeat-maskerorg) (Smit et al 2015b) the results of which were thenclassified using RepeatMasker To estimate evolutionary di-vergence within repeat subfamilies in the humpback whalegenome we generated repeat-family-specific alignments andcalculated the average Kimura-2-parameter divergence fromconsensus within each family correcting for high mutationrates at CpG sites with the calcDivergenceFromAlignplRepeatMasker tool We compared the divergence profile ofhumpback whale and bowhead whale by completing parallelanalyses and the repetitive landscapes of orca and bottlenosedolphin are available from the RepeatMasker server (httpwwwrepeatmaskerorgspecies accessed August 21 2017)

Analysis of Gene Expression Using RNA-SeqSplice-wise mapping of RNA-Seq reads against the humpbackwhale genome assembly and annotation was carried out us-ing STAR v24 (Dobin et al 2013) and we counted the num-ber of reads mapping to gene annotations We also mappedthe skin RNA-Seq data to the database of annotated hump-back whale transcripts using local alignments with bowtiev225 (Langmead and Salzberg 2012) and used stringtiev134 (Pertea et al 2015) to calculate gene abundances bytranscripts per million

Analysis of Segmental Duplications in CetaceanGenomesIn order to detect LSDs in several cetacean genomes we ap-plied an approach based on depth of coverage (Alkan et al2009) To this end we used whole-genome shotgun sequencedata from the current study as well as from other cetaceangenomics projects All data were mapped against the hump-back whale reference assembly A detailed description of thesegmental duplication analysis is provided in the supplemen-tary Methods Supplementary Material online

Whole-Genome AlignmentsWe generated WGAs of 12 mammals (supplementary table 2Supplementary Material online) First we generated pairwisesyntenic alignments of each species as a query to the humangenome (hg19) as a target using LASTZ v102 (Harris 2007)followed by chaining to form gapless blocks and netting torank the highest scoring chains (Kent et al 2003) The pairwisealignments were used to construct a multiple sequence align-ment with MULTIZ v112 (Blanchette et al 2004) with humanas the reference species We filtered the MULTIZ alignment to

Tollis et al doi101093molbevmsz099 MBE

1758

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

only contain aligned blocks from at least 10 out of the 12species (81 complete)

Phylogenetic Reconstruction Using Single-CopyOrthologsWe downloaded the coding DNA sequences from 28 publiclyavailable mammalian genome assemblies (supplementary ta-ble 9 Supplementary Material online) and used VESPA(Webb et al 2017) to obtain high-confidence SGOs (supple-mentary Methods Supplementary Material online) For phy-logenetic analysis we filtered the SGO data set to include onlyloci that were represented by at least 24 out of the 28 mam-malian species (86 complete) and reconstructed each genetree using maximum likelihood in PhyML v30 (Guindon et al2010) with an HKY85 substitution model and 100 bootstrapreplicates to assess branch support The gene trees were thenbinned and used to reconstruct a species tree using the ac-curate species tree algorithm (ASTRAL-III v56 Zhang et al2018) ASTRAL utilizes the multispecies coalescent modelthat incorporates incomplete lineage sorting and finds thespecies tree stemming from bipartitions predefined by thegene trees Branch support for the species tree was assessedwith local posterior probabilities and branch lengths werepresented in coalescent units where shorter branch lengthsindicate greater gene tree discordance (Sayyari and Mirarab2016)

Rates of Molecular Evolution and Divergence TimeEstimationWe used multiple approaches on independent data sets toestimate rates of molecular evolution and the divergencetimes of the major mammalian lineages including six modernwhales with complete genome assemblies We first focusedon 4-fold degenerate (4D) sites which are positions withincodon alignments where substitutions result in no aminoacid change and can be used to approximate the neutralrate of evolution (Kumar and Subramanian 2002) We usedthe Ensembl human gene annotation to extract codingregions from the 12-mammal WGA using msa_view inPHAST v14 (Hubisz et al 2011) We reconstructed the phy-logeny with the 4D data as a single partition in RAxML v83(Stamatakis 2014) under the GTRGAMMA substitutionmodel and assessed branch support with 10000 bootstrapsRates of molecular evolution were estimated on the 4D dataset with the semiparametric PL method implemented in r8sv18 (Sanderson 2002 2003) A detailed description of the PLmethod is given in the supplementary MethodsSupplementary Material online

We also used the approximate likelihood calculation inMCMCtree (Yang and Rannala 2006) to estimate divergencetimes using independent data sets 1) the above-mentioned4D data set derived from the WGA as well as 2) a set of theSGOs that included 24 out of 28 sampled taxa (86) and waspartitioned into three codon positions We implemented theHKY85 substitution model multiple fossil-based priors (sup-plementary table 10 Supplementary Material online Mitchell1989 Benton et al 2015 Hedges et al 2015) and independentrates (ldquoclockfrac14 3rdquo) along branches All other parameters were

set as defaults For each MCMCtree analysis we ran the anal-ysis three times with different starting seeds and modified theMarkov chain Monte Carlo (MCMC) length and samplingfrequency in order to achieve proper chain convergencemonitored with Tracer v17 We achieved proper MCMCconvergence on the 4D data set after discarding the first500000 steps as burn-in and sampling every 2000 steps untilwe collected 20000 samples We achieved proper MCMCconvergence on the SGO data set after discarding the first500000 steps as burn-in and sampling every 10000 steps untilwe collected 10000 samples

Demographic AnalysisWe used the PSMC (Li and Durbin 2011) to reconstruct thepopulation history of North Atlantic humpback whales in-cluding the individual sequenced in the current study (down-sampled to 20 coverage) and a second individualsequenced at 17 coverage in Arnason et al (2018) Adetailed description of the PSMC analysis is provided in thesupplementary Methods Supplementary Material online

Nonneutral Substitution Rates in Cetacean GenomesIn order to identify genomic regions controlling cetacean-specific adaptations we used phyloP (Pollard et al 2010) todetect loci in the 12-mammal WGA that depart from neutralexpectations (see supplementary Methods SupplementaryMaterial online) We then collected accelerated regions thatoverlapped human whole gene annotations (hg19) using bed-tools intersect (Quinlan and Hall 2010) and tested for theenrichment of GO terms using the PANTHER analysis toolavailable at the Gene Ontology Consortium website (GOOntology database last accessed June 2017) (GeneOntology Consortium 2015)

Detection of Protein-Coding Genes Subjected toPositive SelectionIn order to measure selective pressures acting on protein-coding genes during cetacean evolution with an emphasison the evolution of cancer suppression we estimated theratio of nonsynonymous to synonymous substitutions (dNdS) To maximize statistical power in pairwise comparisonsgiven the number of available cetacean genomes (six lastaccessed September 2017) we implemented phylogenetictargeting (Arnold and Nunn 2010) assuming a phylogenyfrom a mammalian supertree (Fritz et al 2009) To selectgenome assemblies most suitable for assessing PetorsquosParadox we weighted scores for contrasts with a lot of changein the same direction for both body mass and maximumlongevity Trait values were taken from panTHERIA (Joneset al 2009) and we selected maximal pairings based on thestandardized summed scores We then generated pairwisegenome alignments as described above based on the phylo-genetic targeting results For each pairwise genome align-ment we stitched gene blocks in Galaxy (Blankenberg et al2011) according to the target genome annotations produc-ing alignments of one-to-one orthologs which were filteredto delete frameshift mutations and replace internal stopcodons with gaps We then estimated pairwise dNdS for every

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1759

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

orthologous gene pair with KaKs_Calculator v20 (Wang et al2010) To link genes with dNdS gt1 to potential phenotypeswe used orthologous human Ensembl gene IDs to collect GOterms in BioMart (Kinsella et al 2011) and tested for enrich-ment of overrepresented GO terms

We also used codon-based models to test for selectivepressure variation along branches of the cetacean phylog-eny in comparison to other mammal lineages also knownas the branch-site test (Yang 2007) First the known spe-cies phylogeny (Morgan et al 2013 Tarver et al 2016) waspruned to correspond to the species present in each SGOfamily SGO nucleotide alignments that contained morethan seven species were analyzed for selective pressurevariation This is to reduce the risk of detecting falsepositives (Anisimova et al 2001 2002) In general thebranch-site test is a powerful yet conservative approach(Gharib and Robinson-Rechavi 2013) although modelmisspecification and alignment errors can greatly increasethe number of false positives (Anisimova et al 20012002) Recent studies have concluded that many pub-lished inferences of adaptive evolution using the branchsite test may be affected by artifacts (Venkat et al 2018)Therefore extensive filtering is necessary in order to makereasonably sound conclusions from results of the branchsite test A detailed description of all tested models andthe filtering process are given in the supplementaryMethods Supplementary Material online In total 1152gene families were analyzed We carried out the branch-site test using PAML v44e (Yang 2007) The following fivebranches were assessed as foreground humpback whalethe most recent common ancestor (MRCA) of the com-mon minke and humpback whales MRCA of baleenwhales MRCA of toothed whales and the MRCA of allwhales (cetacean stem lineage) For each model we keptall genes that met a significance threshold of Plt 005 aftera Bonferonni correction for multiple hypothesis testingusing the total number of branch genes (five foregroundbranches1152 genes) We also corrected the raw P-val-ues from the likelihood ratio tests of every gene by theFDR correction where qfrac14 005 (Benjamini and Hochberg1995) The Bonferroni correction is more conservativethan the FDR but sufficient for multiple hypothesis test-ing of lineage-specific positive selection in these casesthe FDR results in higher probabilities of rejecting truenull hypotheses (Anisimova and Yang 2007) Thereforewe used Bonferroni-corrected results in downstreamanalyses but also report FDR-corrected P-values infigure 4C Genes were identified based on the humanortholog (Ensembl gene ID) and we performed gene an-notation enrichment analysis and functional annotationclustering with DAVID v68 (Huang et al 2009a 2009b) aswell as semantic clustering of GO terms using REVIGO(Supek et al 2011) We also searched for interactions ofpositively selected proteins using STRING v105(Szklarczyk et al 2017) with default parameters andtested for the enrichment of overrepresented GO termsas above and for associated mouse phenotypes usingmodPhEA (Weng and Liao 2017)

Data AvailabilityAll data that contributed to the results of the study are madepublicly available The genomic sequencing RNA sequencingas well as the genome assembly for humpback whale(GCA_0043293851) are available under NCBI BioProjectPRJNA509641 The gene annotation orthologous gene setspositive selection results segmental duplication annotationsand whole-genome alignments used in this study are availableat the Harvard Dataverse (httpsdoiorg107910DVNADHX1O)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsField data collection was supported by the Center for CoastalStudies and conducted under NOAA research permit 16325This work was initiated with funding from Dr Jeffrey Pearl atUC San Fransisco This work was supported in part by NIH(Grants U54 CA217376 U2C CA233254 P01 CA91955 R01CA149566 R01 CA170595 R01 CA185138 and R01CA140657) as well as CDMRP Breast Cancer ResearchProgram (Award BC132057) and the Arizona BiomedicalResearch Commission (Grant ADHS18-198847) to CCMThe findings opinions and recommendations expressedhere are those of the authors and not necessarily those ofthe universities where the research was performed or theNational Institutes of Health PJP would like to acknowledgefunding from Stockholm University and the University ofGroningen Startup funds were provided by the School ofInformatics Computing and Cyber Systems at NorthernArizona University (MT) MJOrsquoC would like to thank theUniversity of Leeds funding her 250 Great Minds UniversityAcademic Fellowship The authors thank Wensi Hao(University of Groningen) for technical support AndreaCabrera (University of Groningen) for helpful conversationsabout mutation rates and demography and Richard E Green(UC Santa Cruz) for consultations regarding genome se-quencing assembly and annotation The authors acknowl-edge Research Computing at Arizona State University (httpwwwresearchcomputingasuedu) the Advanced ResearchComputing resources at the University of Leeds and theMonsoon computing cluster at Northern ArizonaUniversity (httpsnaueduhigh-performance-computing)for providing high-performance computing and storageresources that have contributed to this study

ReferencesAbegglen LM Caulin AF Chan A Lee K Robinson R Campbell MS Kiso

WK Schmitt DL Waddell PJ Bhaskara S 2015 Potential mechanismsfor cancer resistance in elephants and comparative cellular responseto DNA damage in humans JAMA 314(17)1850ndash1860

Alkan C Kidd JM Marques-Bonet T Aksay G Antonacci F HormozdiariF Kitzman JO Baker C Malig M Mutlu O 2009 Personalized copynumber and segmental duplication maps using next-generation se-quencing Nat Genet 41(10)1061ndash1067

Tollis et al doi101093molbevmsz099 MBE

1760

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Anisimova M Bielawski JP Yang Z 2001 Accuracy and power of thelikelihood ratio test in detecting adaptive molecular evolution MolBiol Evol 18(8)1585ndash1592

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19(6)950ndash958

Anisimova M Yang Z 2007 Multiple hypothesis testing to detect line-ages under positive selection that affects only a few sites Mol BiolEvol 24(5)1219ndash1228

Arnason U Lammers F Kumar V Nilsson MA Janke A 2018 Whole-genome sequencing of the blue whale and other rorquals findssignatures for introgressive gene flow Sci Adv 4(4)eaap9873

Arnason U Widegren B 1989 Composition and chromosomal localiza-tion of cetacean highly repetitive DNA with special reference to theblue whale Balaenoptera musculus Chromosoma 98(5)323ndash329

Arnold C Nunn CL 2010 Phylogenetic targeting of research effort inevolutionary biology Am Nat 176(5)601ndash612

Baker CS Perry A Bannister JL Weinrich MT Abernethy RBCalambokidis J Lien J Lambertsen RH Ramirez JU Vasquez O1993 Abundant mitochondrial DNA variation and world-wide pop-ulation structure in humpback whales Proc Natl Acad Sci U S A90(17)8239ndash8243

Baker J Meade A Pagel M Venditti C 2015 Adaptive evolution towardlarger size in mammals Proc Natl Acad Sci U S A 112(16)5093ndash5098

Benjamini Y Hochberg Y 1995 Controlling the false discovery rate apractical and powerful approach to multiple testing J R Stat Soc57289ndash300

Benson RBJ Campione NE Carrano MT Mannion PD Sullivan CUpchurch P Evans DC 2014 Rates of dinosaur body mass evolutionindicate 170 million years of sustained ecological innovation on theavian stem lineage PLoS Biol 12(5)e1001853

Benton MJ Donoghue PCJ Asher RJ Friedman M Near TJ Vinther J2015 Constraints on the timescale of animal evolutionary historyPalaeontol Electron 181ndash106

Berta A Sumich JL Kovacs KM 2015 Marine mammals evolutionarybiology 3rd ed Cambridge (MA) Academic Press

Blanchette M Kent WJ Riemer C Elnitski L Smit AFA Roskin KMBaertsch R Rosenbloom K Clawson H Green ED 2004 Aligningmultiple genomic sequences with the threaded blockset alignerGenome Res 14(4)708ndash715

Blankenberg D Taylor J Nekrutenko A Galaxy T 2011 Making wholegenome multiple alignments usable for biologists Bioinformatics27(17)2426ndash2428

Boissinot S Sookdeo A 2016 The evolution of LINE-1 in vertebratesGenome Biol Evol 8(12)3485ndash3507

Bolger AM Lohse M Usadel B 2014 Trimmomatic a flexible trimmer forIllumina sequence data Bioinformatics 30(15)2114ndash2120

Browning CL Wise CF Wise JP 2017 Prolonged particulate chromateexposure does not inhibit homologous recombination repair inNorth Atlantic right whale (Eubalaena glacialis) lung cells ToxicolAppl Pharmacol 33118ndash23

Carbone L Harris RA Gnerre S Veeramah KR Lorente-Galdos BHuddleston J Meyer TJ Herrero J Roos C Aken B et al 2014Gibbon genome and the fast karyotype evolution of small apesNature 513(7517)195ndash201

Caulin AF Graham TA Wang L-S Maley CC 2015 Solutions to Petorsquosparadox revealed by mathematical modelling and cross-species can-cer gene analysis Philos Trans R Soc B 370(1673)20140222

Caulin AF Maley CC 2011 Petorsquos Paradox evolutionrsquos prescription forcancer prevention Trends Ecol Evol (Amst) 26(4)175ndash182

Chapman JA Ho I Sunkara S Luo S Schroth GP Rokhsar DS 2011Meraculous de novo genome assembly with short paired-end readsPLoS One 6(8)e23501

Chittleborough RG 1959 Determination of age in the humpback whaleMegaptera nodosa (Bonnaterre) Aust J Mar Freshw Res10(2)125ndash143

Clapham PJ Mead JG 1999 Megaptera novaeangliae Mamm Species6041ndash9

Clement V Dunand-Sauthier I Clarkson SG 2006 Suppression of UV-induced apoptosis by the human DNA repair protein XPG CellDeath Differ 13(3)478ndash488

Dobin A Davis CA Schlesinger F Drenkow J Zaleski C Jha S Batut PChaisson M Gingeras TR 2013 STAR ultrafast universal RNA-Seqaligner Bioinformatics 29(1)15ndash21

Fay JC Wu C-I 2003 Sequence divergence functional constraint andselection in protein evolution Annu Rev Genomics Hum Genet4213ndash235

Ferris E Abegglen LM Schiffman JD Gregg C 2018 Accelerated evolu-tion in distinctive species reveals candidate elements for clinicallyrelevant traits including mutation and cancer resistance Cell Rep22(10)2742ndash2755

Fog CK Galli GG Lund AH 2012 PRDM proteins important players indifferentiation and disease Bioessays 34(1)50ndash60

Foley NM Hughes GM Huang Z Clarke M Jebb D Whelan CV Petit EJTouzalin F Farcy O Jones G et al 2018 Growing old yet stayingyoung the role of telomeres in batsrsquo exceptional longevity Sci Adv4(2)eaao0926

Foote AD Liu Y Thomas GWC Vinar T Alfoldi J Deng J Dugan Svan Elk CE Hunter ME Joshi V et al 2015 Convergent evolu-tion of the genomes of marine mammals Nat Genet47(3)272ndash275

Forbes SA Beare D Gunasekaran P Leung K Bindal N Boutselakis HDing M Bamford S Cole C Ward S et al 2015 COSMIC exploringthe worldrsquos knowledge of somatic mutations in human cancerNucleic Acids Res 43(D1)D805ndashD811

Fritz SA Bininda-Emonds ORP Purvis A 2009 Geographical variation inpredictors of mammalian extinction risk big is bad but only in thetropics Ecol Lett 12(6)538ndash549

Futreal PA Coin L Marshall M Down T Hubbard T Wooster R RahmanN Stratton MR 2004 A census of human cancer genes Nat RevCancer 4(3)177ndash183

Gabriele C Lockyer C Straley JM Jurasz CM Kato H 2010 Sightinghistory of a naturally marked humpback whale (Megaptera novaean-gliae) suggests ear plug growth layer groups are deposited annuallyMar Mam Sci 26(2)443ndash450

Gauthier JM Metcalfe CD Sears R 1997 Chlorinated organic contam-inants in blubber biopsies from northwestern Atlantic balaenopteridwhales summering in the Gulf of St Lawrence Mar Environ Res44(2)201ndash223

Gene Ontology Consortium 2015 Gene Ontology Consortium goingforward Nucleic Acids Res 43D1049ndashD1056

George JC Bada J Zeh J Scott L Brown SE OrsquoHara T Suydam R 1999Age and growth estimates of bowhead whales (Balaena mysticetus)via aspartic acid racemization Can J Zool 77(4)571ndash580

Gharib WH Robinson-Rechavi M 2013 The branch-site test of positiveselection is surprisingly robust but lacks power under synonymoussubstitution saturation and variation in GC Mol Biol Evol30(7)1675ndash1686

Grabherr MG Haas BJ Yassour M Levin JZ Thompson DA Amit IAdiconis X Fan L Raychowdhury R Zeng Q et al 2011 Full-lengthtranscriptome assembly from RNA-Seq data without a referencegenome Nat Biotechnol 29(7)644ndash652

Guindon S Dufayard J-F Lefort V Anisimova M Hordijk W Gascuel O2010 New algorithms and methods to estimate maximum-likelihood phylogenies assessing the performance of PhyML 30Syst Biol 59(3)307ndash321

Harris RS 2007 Improved pairwise alignment of genomic DNA [PhDthesis] The Pennsylvania State University State College PA USA

Hedges SB Marin J Suleski M Paymer M Kumar S 2015 Tree of lifereveals clock-like speciation and diversification Mol Biol Evol32(4)835ndash845

Heim NA Knope ML Schaal EK Wang SC Payne JL 2015 Copersquos rule inthe evolution of marine animals Science 347(6224)867ndash870

Holt C Yandell M 2011 MAKER2 an annotation pipeline and genome-database management tool for second-generation genome projectsBMC Bioinformatics 12491

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1761

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Huang DW Sherman BT Lempicki RA 2009a Bioinformatics enrich-ment tools paths toward the comprehensive functional analysis oflarge gene lists Nucleic Acids Res 37(1)1ndash13

Huang DW Sherman BT Lempicki RA 2009b Systematic and integra-tive analysis of large gene lists using DAVID bioinformatics resourcesNat Protoc 4(1)44ndash57

Hubisz MJ Pollard KS Siepel A 2011 PHAST and RPHAST phylogeneticanalysis with spacetime models Brief Bioinform 12(1)41ndash51

Hudson RR 1990 Gene genealogies and the coalescent process InFutuyama DJ Antonovics J editors Oxfords surveys in evolutionarybiology Oxford Oxford University Press p 1ndash44

Ichishima H 2016 The ethmoid and presphenoid of cetaceans JMorphol 277(12)1661ndash1674

Jackson JA Baker CS Vant M Steel DJ Medrano-Gonzalez LPalumbi SR 2009 Big and slow phylogenetic estimates of mo-lecular evolution in baleen whales (suborder Mysticeti) MolBiol Evol 26(11)2427ndash2440

Jackson JA Steel DJ Beerli P Congdon BC Olavarria C Leslie MS PomillaC Rosenbaum H Baker CS 2014 Global diversity and oceanic di-vergence of humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 281(1786)20133222

Jones KE Bielby J Cardillo M Fritz SA OrsquoDell J Orme CDL Safi KSechrest W Boakes EH Carbone C et al 2009 PanTHERIA a spe-cies-level database of life history ecology and geography of extantand recently extinct mammals Ecology 90(9)2648ndash2648

Jurka J Kapitonov VV Pavlicek A Klonowski P Kohany O WalichiewiczJ 2005 Repbase Update a database of eukaryotic repetitive ele-ments Cytogenet Genome Res 110(1ndash4)462ndash467

Kaessmann H 2010 Origins evolution and phenotypic impact of newgenes Genome Res 20(10)1313ndash1326

Kastan MB Bartek J 2004 Cell-cycle checkpoints and cancer Nature432(7015)316ndash323

Katona SK Whitehead HP 1981 Identifying humpback whales usingtheir natural markings Polar Rec 20(128)439ndash444

Keane M Semeiks J Webb AE Li YI Quesada V Craig T Madsen LB vanDam S Brawand D Marques PI et al 2015 Insights into the evolu-tion of longevity from the bowhead whale genome Cell Rep10(1)112ndash122

Kent WJ Baertsch R Hinrichs A Miller W Haussler D 2003 Evolutionrsquoscauldron duplication deletion and rearrangement in the mouseand human genomes Proc Natl Acad Sci U S A100(20)11484ndash11489

Kingsolver JG Pfennig DW 2004 Individual-level selection as acause of Copersquos rule of phyletic size increase Evolution58(7)1608ndash1612

Kinsella RJ Keuroaheuroari A Haider S Zamora J Proctor G Spudich G Almeida-King J Staines D Derwent P Kerhornou A et al 2011 EnsemblBioMarts a hub for data retrieval across taxonomic spaceDatabase (Oxford) 2011bar030

Korf I 2004 Gene finding in novel genomes BMC Bioinformatics 559Kumar S Subramanian S 2002 Mutation rates in mammalian genomes

Proc Natl Acad Sci U S A 99(2)803ndash808Lambertsen RH 1987 A biopsy system for large whales and its use for

cytogenetics J Mammal 68(2)443ndash445Langmead B Salzberg SL 2012 Fast gapped-read alignment with Bowtie

2 Nat Methods 9(4)357ndash359Larsen AH Sigurjonsson J Oslashien N Vikingsson G Palsboslashll PJ 1996

Populations genetic analysis of nuclear and mitochondrial loci inskin biopsies collected from central and northeastern NorthAtlantic humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 2631611ndash1618

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357)493ndash496

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup 2009 The Sequence AlignmentMap format andSAMtools Bioinformatics 25(16)2078ndash2079

Martineau D Lemberger K Dallaire A Labelle P Lipscomb TP Michel PMikaelian I 2002 Cancer in wildlife a case study beluga from the St

Lawrence estuary Quebec Canada Environ Health Perspect110(3)285ndash292

Mazet O Rodrıguez W Grusea S Boitard S Chikhi L 2016 On theimportance of being structured instantaneous coalescence ratesand human evolution ndash lesson for ancestral population size infer-ence Heredity 116(4)362ndash371

Michaud EJ Yoder BK 2006 The primary cilium in cell signaling andcancer Cancer Res 66(13)6463ndash6467

Milholland B Dong X Zhang L Hao X Suh Y Vijg J 2017 Differencesbetween germline and somatic mutation rates in humans and miceNat Commun 815183

Mitchell ED 1989 A new cetacean from the Late Eocene La MesetaFormation Seymour Island Antarctic Peninsula Can J Fish Aquat Sci46(12)2219ndash2235

Morgan CC Foster PG Webb AE Pisani D McInerney JO OrsquoConnell MJ2013 Heterogeneous models place the root of the placental mam-mal phylogeny Mol Biol Evol 30(9)2145ndash2156

Nadachowska-Brzyska K Burri R Smeds L Ellegren H 2016 PSMC anal-ysis of effective population sizes in molecular ecology and its appli-cation to black-and-white Ficedula flycatchers Mol Ecol25(5)1058ndash1072

Newman SJ Smith SA 2006 Marine mammal neoplasia a review VetPathol 43(6)865ndash880

Nunney L 2018 Size matters height cell number and a personrsquos risk ofcancer Proc R Soc B 285(1889)20181743

Ohsumi S 1979 Interspecies relationships among some biologicalparameters in cetaceans and estimation of the natural mortalitycoefficient of the Southern Hemisphere minke whales Rep IntWhaling Comm 29397ndash406

OrsquoLeary MA Gatesy J 2008 Impact of increased character sampling onthe phylogeny of Cetartiodactyla (Mammalia) combined analysisincluding fossils Cladistics 24397ndash442

Pagel M Venditti C Meade A 2006 Large punctuational contribution ofspeciation to evolutionary divergence at the molecular level Science314(5796)119ndash121

Palsboslashll PJ Clapham PJ Mattila DK Larsen F Sears R Siegismund HRSigurjonsson J Vasquez O Arctander P 1995 Distribution ofmtDNA haplotypes in North Atlantic humpback whales the influ-ence of behavior on population structure Mar Ecol Prog Ser1161ndash10

Palsboslashll PJ Larsen F Hansen ES 1991 Sampling of skin biopsies fromfree-raging large cetaceans in West Greenland development of newbiopsy tips and bolt designs Int Whaling Comm Spec Issue Ser13311

Palsboslashll PJ Zachariah Peery M Olsen MT Beissinger SR Berube M 2013Inferring recent historic abundance from current genetic diversityMol Ecol 22(1)22ndash40

Parra G Bradnam K Ning Z Keane T Korf I 2009 Assessing the genespace in draft genomes Nucleic Acids Res 37(1)289ndash297

Pertea M Pertea GM Antonescu CM Chang T-C Mendell JT SalzbergSL 2015 StringTie enables improved reconstruction of a transcrip-tome from RNA-Seq reads Nat Biotechnol 33(3)290ndash295

Peto R Roe FJ Lee PN Levy L Clack J 1975 Cancer and ageing in miceand men Br J Cancer 32(4)411ndash426

Pio R Corrales L Lambris JD 2014 The role of complement in tumorgrowth Adv Exp Med Biol 772229ndash262

Pollard KS Hubisz MJ Rosenbloom KR Siepel A 2010 Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 20(1)110ndash121

Putnam NH OrsquoConnell BL Stites JC Rice BJ Blanchette M Calef R TrollCJ Fields A Hartley PD Sugnet CW et al 2016 Chromosome-scaleshotgun assembly using an in vitro method for long-range linkageGenome Res 26(3)342ndash350

Quinlan AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26(6)841ndash842

Ruegg K Rosenbaum HC Anderson EC Engel M Rothschild A Baker CSPalumbi SR 2013 Long-term population size of the North Atlantichumpback whale within the context of worldwide population struc-ture Conserv Genet 14(1)103ndash114

Tollis et al doi101093molbevmsz099 MBE

1762

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Sanderson MJ 2002 Estimating absolute rates of molecular evolutionand divergence times a penalized likelihood approach Mol Biol Evol19(1)101ndash109

Sanderson MJ 2003 r8s inferring absolute rates of molecular evolutionand divergence times in the absence of a molecular clockBioinformatics 19(2)301ndash302

Sayyari E Mirarab S 2016 Fast coalescent-based computation of localbranch support from quartet frequencies Mol Biol Evol33(7)1654ndash1668

Shadat NMA Koide N Khuda II-E Dagvadorj J Tumurkhuu G Naiki YKomatsu T Yoshida T Yokochi T 2010 Retinoblastoma protein-interacting zinc finger 1 (RIZ1) regulates the proliferation of mono-cytic leukemia cells via activation of p53 Cancer Invest28(8)806ndash812

Sim~ao FA Waterhouse RM Ioannidis P Kriventseva EV Zdobnov EM2015 BUSCO assessing genome assembly and annotation com-pleteness with single-copy orthologs Bioinformatics31btv351ndashbtv3212

Slater GJ Goldbogen JA Pyenson ND 2017 Independent evolution ofbaleen whale gigantism linked to Plio-Pleistocene ocean dynamicsProc R Soc B 284(1855)20170546

Smit AFA Hubley RM Green P 2015a RepeatMasker Open-40 2013-2015 Available from httpwwwrepeatmaskerorg

Smit AFA Hubley RM Green P 2015b RepeatModeler Open-10 2008-2015 Available from httpwwwrepeatmaskerorg

Smith TD Allen J Clapham PJ Hammond PS Katona S Larsen F Lien JMattila D Palsboslashll PJ Sigurjonsson J et al 1999 An ocean-basin-widemark-recapture study of the North Atlantic humpback whale(Megaptera novaeangliae) Mar Mamm Sci 51ndash32

Stamatakis A 2014 RAxML version 8 a tool for phylogenetic analysisand post-analysis of large phylogenies Bioinformatics30(9)1312ndash1313

Stanke M Diekhans M Baertsch R Haussler D 2008 Using native andsyntenically mapped cDNA alignments to improve de novo genefinding Bioinformatics 24(5)637ndash644

Steeman ME Hebsgaard MB Fordyce RE Ho SYW Rabosky DL NielsenR Rahbek C Glenner H Soslashrensen MV Willerslev E 2009 Radiationof extant cetaceans driven by restructuring of the oceans Syst Biol58(6)573ndash585

Sulak M Fong L Mika K Chigurupati S Yon L Mongan NP Emes RDLynch VJ 2016 TP53 copy number expansion is associated with theevolution of increased body size and an enhanced DNA damageresponse in elephants Elife 51850

Supek F Bosnjak M Skunca N Smuc T 2011 REVIGO summarizes andvisualizes long lists of gene ontology terms PLoS One 6(7)e21800

Szklarczyk D Morris JH Cook H Kuhn M Wyder S Simonovic M SantosA Doncheva NT Roth A Bork P et al 2017 The STRING database in2017 quality-controlled protein-protein association networks madebroadly accessible Nucleic Acids Res 45(D1)D362ndashD368

Tacutu R Craig T Budovsky A Wuttke D Lehmann G Taranukha DCosta J Fraifeld VE de Magalh~aes JP 2012 Human Ageing GenomicResources integrated databases and tools for the biology and genet-ics of ageing Nucleic Acids Res 41(D1)D1027ndashD1033

Tariq MA Kim HJ Jejelowo O Pourmand N 2011 Whole-transcriptomeRNAseq analysis from minute amount of total RNA Nucleic AcidsRes 39(18)e120

Tarver JE Dos Reis M Mirarab S Moran RJ Parker S OrsquoReilly JE King BLOrsquoConnell MJ Asher RJ Warnow T et al 2016 The interrelationshipsof placental mammals and the limits of phylogenetic inferenceGenome Biol Evol 8(2)330ndash344

Thewissen JGM Cohn MJ Stevens LS Bajpai S Heyning J Horton WE2006 Developmental basis for hind-limb loss in dolphins and origin

of the cetacean bodyplan Proc Natl Acad Sci U S A103(22)8414ndash8418

Thewissen JGM Cooper LN Clementz MT Bajpai S Tiwari BN 2007Whales originated from aquatic artiodactyls in the Eocene epoch ofIndia Nature 450(7173)1190ndash1194

Tollis M Schiffman JD Boddy AM 2017 Evolution of cancer suppressionas revealed by mammalian comparative genomics Curr Opin GenetDev 4240ndash47

Trego KS Groesser T Davalos AR Parplys AC Zhao W Nelson MRHlaing A Shih B Rydberg B Pluth JM et al 2016 Non-catalytic rolesfor XPG with BRCA1 and BRCA2 in homologous recombination andgenome stability Mol Cell 61(4)535ndash546

UniProt Consortium 2015 UniProt a hub for protein informationNucleic Acids Res 43D204ndashD212

Valsecchi E Palsboslashll PJ Hale P Glockner-Ferrari D Ferrari M Clapham PLarsen F Mattila DK Sears R Sigurjonsson J et al 1997 Microsatellitegenetic distances between oceanic opulations of the humpbackwhale (Megaptera novaeangliae) Mol Biol Evol 14355ndash362

Venkat A Hahn MW Thornton JW 2018 Multinucleotide mutationscause false inferences of lineage-specific positive selection Nat EcolEvol 2(8)1280ndash1288

Vermeij GJ 2016 Gigantism and its implications for the history of lifePLoS One 11(1)e0146092

Wang D Zhang Y Zhang Z Zhu J Yu J 2010 KaKs_Calculator 20 atoolkit incorporating gamma-series methods and sliding windowstrategies Genomics Proteomics Bioinformatics 8(1)77ndash80

Warren WC Kuderna L Alexander A Catchen J Perez-Silva JG Lopez-Otın C Quesada V Minx P Tomlinson C Montague MJ et al 2017The novel evolution of the sperm whale genome Genome Biol Evol9(12)3260ndash3264

Webb AE Walsh TA OrsquoConnell MJ 2017 VESPA very large-scale evo-lutionary and selective pressure analyses PeerJ Comput Sci 3e118

Weng M-P Liao B-Y 2017 modPhEA model organism PhenotypeEnrichment Analysis of eukaryotic gene sets Bioinformatics33(21)3505ndash3507

Wilson Sayres MA Venditti C Pagel M Makova KD 2011 Do variationsin substitution rates and male mutation bias correlate with life-history traits A study of 32 mammalian genomes Evolution65(10)2800ndash2815

Yang Z 1998 Likelihood ratio tests for detecting positive selection andapplication to primate lysozyme evolution Mol Biol Evol15(5)568ndash573

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 24(8)1586ndash1591

Yang Z Rannala B 2006 Bayesian estimation of species divergence timesunder a molecular clock using multiple fossil calibrations with softbounds Mol Biol Evol 23(1)212ndash226

Yim H-S Cho YS Guang X Kang SG Jeong J-Y Cha S-S Oh H-M Lee J-HYang EC Kwon KK et al 2014 Minke whale genome and aquaticadaptation in cetaceans Nat Genet 46(1)88ndash92

Yu J Tian S Metheny-Barlow L Chew L-J Hayes AJ Pan H Yu G-L LiL-Y 2001 Modulation of endothelial cell growth arrest andapoptosis by vascular endothelial growth inhibitor Circ Res89(12)1161ndash1167

Zhang C Rabiee M Sayyari E Mirarab S 2018 ASTRAL-III polynomialtime species tree reconstruction from partially resolved gene treesBMC Bioinformatics 19(Suppl 6)153

Zhang Q Edwards SV 2012 The evolution of intron size in amniotes arole for powered flight Genome Biol Evol 4(10)1033ndash1043

Zhao M Kim P Mitra R Zhao J Zhao Z 2015 TSGene 20 an updatedliterature-based knowledgebase for tumor suppressor genes NucleicAcids Res 44gkv1268ndashgkD1031

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1763

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

  • msz099-TF1
  • msz099-TF2
  • msz099-TF3
  • msz099-TF4
  • msz099-TF5
  • msz099-TF6
  • msz099-TF7
  • msz099-TF8
Page 6: University of Groningen Return to the sea, get huge, …...Return to the Sea, Get Huge, Beat Cancer: An Analysis of Cetacean Genomes Including an Assembly for the Humpback Whale (Megaptera

FIG 2 Timescale of humpback whale evolution (A) Species phylogeny of 28 mammals constructed from 152 orthologs and time-calibrated usingMCMCtree Branch lengths are in terms of millions of years Node bars indicate 95 highest posterior densities of divergence times Cetaceans arehighlighted in the gray box with mean estimates of divergence times included (B) The effective population size (Ne) changes over timeDemographic histories of two North Atlantic humpback whales estimated from the PSMC analysis including 100 bootstrap replicates per analysisMutation rate used was 154e-9 per year and generation time used was 215 years

Tollis et al doi101093molbevmsz099 MBE

1750

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

A Complex Demographic History of North AtlanticHumpback WhalesWe estimated the demographic history of the North Atlantichumpback whale population applying the Pairwise SequentialMarkovian Coalescent (PSMC) (Li and Durbin 2011) to theshort-insert libraries generated during this study as well assequence reads from a second North Atlantic humpbackwhale (Arnason et al 2018) (fig 2B supplementary figs 7and 8 Supplementary Material online) Consistent with thefindings of Arnason et al (2018) we estimated that the largesthumpback whale population sizes were 2 Ma during thePliocenendashPleistocene transition followed by a steady declineuntil 1 Ma The PSMC trajectories of the two humpbackwhales began to diverge 100000 years ago and the esti-mated confidence intervals from 100 bootstraps for eachPSMC analysis were nonoverlapping in the more recentbins Both humpback PSMC trajectories suggested sharp pop-ulation declines beginning 25000ndash45000 years agoHowever interpreting inferred PSMC plots of pastldquodemographicrdquo changes is nontrivial in a globally distributedspecies connected by repeated occasional gene flow such ashumpback whales (Baker et al 1993 Palsboslashll et al 1995Jackson et al 2014) The apparent changes in effective pop-ulation size may represent changes in abundance interoceanconnectivity or a combination of both (Hudson 1990 Palsboslashllet al 2013) Several genetic and genome-based studies ofcetaceans have demonstrated how past large-scale oceanicchanges have affected the evolution of cetaceans (Steemanet al 2009) including baleen whales (Arnason et al 2018)Although the population genetic structure of humpbackwhales in the North Atlantic is not fully resolved the levelof genetic divergence among areas is very low (Larsen et al1996 Valsecchi et al 1997) Therefore the difference betweenthe two humpback whale PSMC trajectories may be due torecent admixture (Baker et al 1993 Palsboslashll et al 1995 Ruegget al 2013 Jackson et al 2014) intraspecific variation andpopulation structure (Mazet et al 2016) as well as errorsdue to differences in sequence coverage (Nadachowska-Brzyska et al 2016)

Segmental Duplications in Cetacean GenomesContain Genes Involved in Apoptosis and TumorSuppressionMammalian genomes contain gene-rich segmental duplica-tions (Alkan et al 2009) which may represent a powerfulmechanism by which new biological functions can arise(Kaessmann 2010) We employed a read-mapping approachto annotate large segmental duplications (LSDs) 10 kb inthe humpback whale genome assembly and ten additionalcetaceans for which whole-genome shotgun data were avail-able (see Materials and Methods supplementary Methodsand supplementary table 6 Supplementary Material online)We found that cetacean genomes contained on average 318LSDs (656 SD) which comprised 99 Mb (618 Mb) andaveraged 31 kb in length (624 kb) We identified10128534 bp (04) of the humpback whale genome assem-bly that comprised 293 LSDs averaging 34568 bp in length

Fifty-one of the LSDs were shared across all 11 cetaceangenomes (supplementary fig 9 Supplementary Material on-line) In order to determine the potential role of segmentalduplications during the evolution of cetacean-specific pheno-types we identified 426 gene annotations that overlappedcetacean LSDs including several genes annotated for viralresponse Other genes on cetacean LSDs were involved inaging in particular DLD in the bowhead whale andKCNMB1 in the blue whale this may reflect relevant adapta-tions contributing to longevity in two of the largest andlongest-lived mammals (Ohsumi 1979 George et al 1999)Multiple tumor suppressor genes were located on cetaceanLSDs including 1) SALL4 in the sei whale 2) TGM3 andSEMA3B in the orca 3) UVRAG in the sperm whale NorthAtlantic right whale and bowhead whale and 4) PDCD5which is upregulated during apoptosis (Zhao et al 2015)and was found in LSDs of all 11 queried cetacean genomesPDCD5 pseudogenes have been identified in the human ge-nome and several Ensembl-hosted mammalian genomescontain one-to-many PDCD5 orthologs however we anno-tated only a single copy of PDCD5 in the humpback whaleassembly This suggests that in many cases gene duplicationsare collapsed during reference assembly but can be retrievedthrough shotgun read-mapping methods (Carbone et al2014) We annotated fully resolved SALL4 and UVRAG copynumber variants in the humpback whale genome assemblyand by mapping the RNA-Seq data from skin to the genomeassembly and annotation (see Materials and methods) wefound that three annotated copies of SALL4 were expressed inhumpback whale skin as were two copies of UVRAG

We also found that145 Mb (6923 kb) of each cetaceangenome consists of LSDs not found in other cetaceans mak-ing them species-specific which averaged 244 kb (6146kb) in length (supplementary table 7 and fig 9Supplementary Material online) The minke whale genomecontained the highest number of genes on its species-specificLSDs (32) After merging the LSD annotations for the twohumpback whales we identified 57 species-specific LSDs forthis species comprising 977 kb and containing nine dupli-cated genes Humpback whale-specific duplications includedthe genes PRMT2 which is involved in growth and regulationand promotes apoptosis SLC25A6 which may be responsiblefor the release of mitochondrial products that trigger apopto-sis and NOX5 which plays a role in cell growth and apoptosis(UniProt Consortium 2015) Another tumor suppressor geneTPM3 was duplicated in the humpback whale assemblybased on our gene annotation However these extranumer-ary copies of TPM3 were not annotated on any humpbackwhale LSDs lacked introns and contained mostly the sameexons suggesting retrotransposition rather than segmentalduplication as a mechanism for their copy number expansion(Kaessmann 2010) According to the RNA-Seq data all sevencopies of TPM3 are expressed in humpback whale skin

Duplications of the tumor suppressor gene TP53 havebeen inferred as evidence for cancer suppression in elephants(Abegglen et al 2015 Caulin et al 2015 Sulak et al 2016)During our initial scans for segmental duplications we no-ticed a large pileup of reads in the MAKER-annotated

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1751

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

humpback whale TP53 (data not shown) We PCR-amplifiedcloned and sequenced this region from a humpback whaleDNA sample inferring four haplotypes that differ at two bases(supplementary Methods Supplementary Material online)After manually annotating TP53 in the humpback whalewe determined that these nucleotide variants fell in noncod-ing regions of the gene one occurred upstream of the startcodon whereas the other occurred between the first andsecond coding exons Other genomic studies have concludedthat TP53 is not duplicated in cetaceans (Yim et al 2014Keane et al 2015 Sulak et al 2016) We consider the possi-bility of at least two TP53 homologs in the genome of thehumpback whale although more data are required to resolvethis Regardless cancer suppression likely arose in differentmammalian lineages via multiple molecular etiologiesOverall our results reveal several copy number expansionsin cetaceans related to immunity aging and cancer suggest-ing that cetaceans are among the large mammals that haveevolved specific adaptations related to cancer resistance

Accelerated Regions in Cetacean Genomes AreSignificantly Enriched with Pathways Relevant toCancerIn order to determine genomic loci underlying cetacean adap-tations we estimated regions in the 12-mammal WGA withelevated substitution rates that were specific to the cetaceanbranches of the mammalian phylogeny These genomicregions departed from neutral expectations in a manner con-sistent with either positive selection or relaxed purifying se-lection along the cetacean lineage (Pollard et al 2010) Wesuccessfully mapped 3260 protein-coding genes with func-tional annotations that overlap cetacean-specific acceleratedregions which were significantly enriched for Gene Ontology(GO) categories such as cell-cell signaling (GO0007267) andcell adhesion (GO0007155) (table 4) Adaptive change in cellsignaling pathways could have maintained the ability of ceta-ceans to prevent neoplastic progression as they evolved largerbody sizes Adhesion molecules are integral to the develop-ment of cancer invasion and metastasis and these resultssuggest that cetacean evolution was accompanied by selec-tion pressure changes on both intra- and extracellular inter-actions Cetacean-specific genomic regions with elevatedsubstitution rates were also significantly enriched in genesinvolved in B-cell-mediated immunity (GO0019724) likely

due to the important role of regulatory cells which modulateimmune response to not only pathogens but perhaps tumorsas well In addition cetacean-specific acceleration in regionscontrolling complement activation (GO0006956) may haveprovided better immunosurveillance against cancer and fur-ther protective measures against malignancies (Pio et al2014) We also found that accelerated regions in cetaceangenomes were significantly enriched for genes controllingsensory perception of smell (GO0007608) perhaps due tothe relaxation of purifying selection in olfactory regions whichwere found to be underrepresented in cetacean genomes(Yim et al 2014)

Selection Pressures on Protein-Coding Genes duringCetacean Evolution Point to Many CetaceanAdaptations Including Cancer SuppressionTo gain further insight into the genomic changes underlyingthe evolution of large body sizes in cetaceans we employedphylogenetic targeting to maximize statistical power in pair-wise evolutionary genomic analyses (Arnold and Nunn 2010)This resulted in maximal comparisons between 1) the orcaand the bottlenose dolphin and 2) the humpback whale andcommon minke whale Despite their relatively recent diver-gences (eg the orcabottlenose dolphin divergence is similarin age to that of the humanchimpanzee divergence seefig 2) the species pairs of common minkehumpback andorcadolphin have each undergone extremely divergent evo-lution in body size and longevity (fig 3) Humpback whalesare estimated to weigh up to four times as much as commonminke whales and are reported to have almost double thelongevity and orcas may weigh almost 20 times as much asbottlenose dolphins also with almost double the lifespan(Tacutu et al 2012) In order to offset the tradeoffs associatedwith the evolution of large body size with the addition ofmany more cells and longer lifespans since the divergence ofeach species pair we hypothesize that necessary adaptationsfor cancer suppression should be encoded in the genomes aspredicted by Petorsquos Paradox (Tollis et al 2017)

For each pairwise comparison we inferred pairwise ge-nome alignments with the common minke whale and orcagenome assemblies as targets respectively and extractedprotein-coding orthologous genes We then estimated theratio of nonsynonymous substitutions per synonymous siteto synonymous substitutions per synonymous site (dNdS) in

Table 4 GO Terms for Biological Processes Overrepresented by Genes Overlapping Genomic Regions with Elevated Substitution Rates That AreUnique to the Cetacean Lineage

Go Term Description Number of Genes Fold Enrichment P-Valuea

GO0007608 Sensory perception of smell 157 458 124E-40GO0006956 Complement activation 39 291 491E-05GO0019724 B-cell-mediated immunity 39 291 491E-05GO0032989 G-protein-coupled receptor signaling pathway 159 244 196E-17GO0042742 Defense response to bacterium 39 244 220E-03GO0009607 Response to biotic stimulus 43 209 185E-02GO0007155 Cell adhesion 101 199 174E-06GO0007267 Cellndashcell signaling 137 169 284E-05

aAfter Bonferonni correction for multiple testing

Tollis et al doi101093molbevmsz099 MBE

1752

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

order to measure selective pressures acting on each ortholo-gous gene pair during cetacean evolution A dNdSgt1 is usedto infer potentially functional amino acid changes in candi-date genes subjected to positive selection (Fay and Wu 2003)Among an estimated 435 genes with dNdS gt1 in the com-mon minkehumpback pairwise comparison we detectedeight genes belonging to the JAK-STAT signaling pathway(39-fold enrichment Pfrac14 11E-3 Fisherrsquos exact test) and seveninvolved in cytokinendashcytokine receptor interaction (41-foldenrichment Pfrac14 17E-2 Fisherrsquos exact test) suggesting positiveselection acting on pathways involved in cell proliferationThese genes included multiple members of the tumor necro-sis factor subfamily such as TNFSF15 which inhibits angiogen-esis and promotes the activation of caspases and apoptosis(Yu et al 2001) A dNdS gt1 was also detected in seven genesinvolved in the negative regulation of cell growth(GO0030308 31-fold enrichment Pfrac14 803E-3 Fisherrsquos exacttest) and five genes involved in double-strand break repair(GO0006302 40-fold enrichment Pfrac14 803E-3 Fisherrsquos exacttest) Although these results suggest the evolution of aminoacid differences since the split between common minke andhumpback whales in genes affecting cell growth proliferationand maintenance the GO category enrichment tests did notpass significance criteria after Bonferroni corrections for mul-tiple testing We found 18 genes that are mutated in cancersaccording to the COSMIC v85 database (Forbes et al 2015) inthe common minkehumpback comparison including a sub-set of five annotated as tumor suppressor genes oncogenesor fusion genes in the Cancer Gene Census (CGC Futreal et al2004) which are highlighted in table 5 The complete list ofCOSMIC genes with elevated dNdS in the pairwise compar-isons is given in supplementary table 8 SupplementaryMaterial online We detected 555 orthologous genes withdNdS gt1 in the orcadolphin comparison which are signifi-cantly enriched (after Bonferroni correction for multiple test-ing) for biological processes such as immune response cellactivation and regulation of cytokines (table 6) and 41 of

which are known cancer genes according to COSMIC andCGC (table 5 supplementary table 8 Supplementary Materialonline) These results are consistent with our accelerated re-gion analysis based on the WGA which showed acceleratedevolution in immunity pathways (above see table 4) Forinstance eight genes (CD58 CD84 KLF13 SAMSN1 CTSGGPC3 LTF and SPG21) annotated for immune system process(GO0002376) were found in cetacean-specific acceleratedgenomic regions and also had a pairwise dNdS gt1 in theorcadolphin comparison mirroring other recent genomicanalyses of immunity genes in orcas (Ferris et al 2018) Ourresults also suggest that the evolution of gigantism and longlifespans in cetaceans was accompanied by selection actingon many genes related to somatic maintenance and cellsignaling

As a more accurate assessment of selection pressure var-iation acting on protein-coding genes across cetacean evolu-tion we conducted an additional assessment of dNdS usingbranch-site codon models implemented in codeml (Yang1998) We employed extensive filtering of the branch siteresults including both false discovery rate (FDR) andBonferroni corrections for multiple testing (see Materialsand methods) and conservatively estimated that 450protein-coding genes were subjected to positive selection incetaceans These include 54 genes along the ancestral ceta-cean branch 12 along the ancestral toothed whale branch 84along the ancestral baleen whale branch 74 in the ancestor ofcommon minke and humpback whales and 212 unique tothe humpback whale branch (fig 4A) Cetacean positivelyselected genes were annotated for functions related to exten-sive changes in anatomy growth cell signaling and cell pro-liferation (fig 4B) For instance in the branch-site models forhumpback whale positively selected genes are enriched forseveral higher-level mouse limb phenotypes including thoseaffecting the limb long bones (MP0011504 15 genes FDR-corrected P-valuefrac14 0001) and more specifically the hindlimb stylopod (MP0003856 seven genes FDRfrac14 0024) or

FIG 3 Diversity in both body size and lifespan within rorqual baleen whales (Balaenopteridae) and dolphins (Delphinidae) Maximal pairings usingphylogenetic targeting (Arnold and Nunn 2010) of genome assembly-enabled cetaceans resulted in the most extreme divergence in both body sizeand lifespan between humpback whale (Megaptera novaeangliae) and common minke whale (Balaenoptera acutorostrata) within theBalaenopteridae facing right and orca (Orcinus orca) and bottlenose dolphin (Tursiops truncatus) within the Delphinidae facing left Traitdata were collected from the panTHERIA (Jones et al 2009) and AnAge (Tacutu et al 2012) databases

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1753

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Table 5 CGC Genes with dNdS gt 1 as Revealed by Pairwise Comparisons of Cetacean Genomes

Comparison Gene Symbol Gene Name Role in Cancer Function

Minkehumpback CD274 CD274 molecule TSG Plays a critical role in induction and maintenance ofimmune tolerance to selfa

ETNK1 Ethanolamine kinase 1 TSG Suppresses escaping of programmed cell deathb

IL21R Interleukin 21 receptor Fusion The ligand binding of this receptor leads to the acti-vation of multiple downstream signaling mole-cules including JAK1 JAK3 STAT1 and STAT32

MYOD1 Myogenic differentiation 1 Fusion Regulates muscle cell differentiation by inducing cellcycle arrest a prerequisite for myogenic initiationa

PHF6 PHD finger protein 6 TSG Encodes a protein with two PHD-type zinc fingerdomains indicating a potential role in transcrip-tional regulation that localizes to the nucleolusa

Orcadolphin BTG1 B-cell translocation gene 1antiproliferative

TSG fusion Member of an antiproliferative gene family that reg-ulates cell growth and differentiationa

CD274 CD274 molecule TSG fusion Plays a critical role in induction and maintenance ofimmune tolerance to selfa

FANCD2 Fanconi anemia comple-mentation group D2

TSG Suppresses genome instability and mutations pro-motes escaping programmed cell death suppressesproliferative signaling suppresses invasion andmetastasisb

FAS Fas cell surface deathreceptor

TSG Promotes cell replicative immortality promotesproliferative signaling promotes invasion and me-tastasis suppresses escaping programmed celldeathb

FGFR4 Fibroblast growth factor re-ceptor 4

Oncogene Promotes proliferative signaling promotes invasionand metastasisb

GPC3 Glypican 3 Oncogene TSG Promotes invasion and metastasis promotes sup-pression of growthb

HOXD11 Homeobox D11 Oncogene fusion The homeobox genes encode a highly conservedfamily of transcription factors that play an impor-tant role in morphogenesis in all multicellularorganismsa

HOXD13 Homeobox D13 Oncogene fusion

LASP1 LIM and SH3 protein 1 Fusion The encoded protein has been linked to metastaticbreast cancer hematopoetic tumors such as B-celllymphomas and colorectal cancera

MLF1 Myeloid leukemia factor 1 TSG fusion This gene encodes an oncoprotein which is thought toplay a role in the phenotypic determination ofhemopoetic cells Translocations between this geneand nucleophosmin have been associated withmyelodysplastic syndrome and acute myeloidleukemiaa

MYB v-myb myeloblastosis viraloncogene homolog

Oncogene fusion This gene may be aberrantly expressed or rearrangedor undergo translocation in leukemias and lym-phomas and is considered to be an oncogenea

MYD88 Myeloid differentiation pri-mary response gene (88)

Oncogene Promotes escaping programmed cell death promotesproliferative signaling promotes invasion and me-tastasis promotes tumor promotinginflammationb

NR4A3 Nuclear receptor subfamily4 group A member 3(NOR1)

Oncogene fusion Encodes a member of the steroidndashthyroid hormonendashretinoid receptor superfamily that may act as atranscriptional activatora

PALB2 Partner and localizer ofBRCA2

TSG This protein binds to and colocalizes with the breastcancer 2 early onset protein (BRCA2) in nuclear fociand likely permits the stable intranuclear localiza-tion and accumulation of BRCA2a

PML Promyelocytic leukemia TSG fusion Expression is cell-cycle related and it regulates the p53response to oncogenic signalsa

RAD21 RAD21 homolog(Schizosaccharomycespombe)

Oncogene TSG Promotes invasion and metastasis suppresses ge-nome instability and mutations suppresses escap-ing programmed cell deathb

STIL SCLTAL1 interruptinglocus

Oncogene fusion Encodes a cytoplasmic protein implicated in regula-tion of the mitotic spindle checkpoint a regulatorypathway that monitors chromosome segregationduring cell division to ensure the proper distribu-tion of chromosomes to daughter cellsa

(continued)

Tollis et al doi101093molbevmsz099 MBE

1754

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

femur (MP0000559 six genes FDRfrac14 0019) These pheno-types are reminiscent of the developmental basis of hind limbloss in cetaceans embryonic studies show that hind limbbuds are initially formed but disappear by the fifth gestationalweek (Thewissen et al 2006) Enriched mouse phenotypes arealso related to the unique cetacean axial skeleton(MP0002114 25 genes FDRfrac14 0016) most notably in theskull including craniofacial bones (MP0002116 17 genesFDRfrac14 0018) teeth (MP0002100 nine genes FDRfrac14 0004)and the presphenoid (MP0030383 three genesFDRfrac14 0003) Past analyses of the cetacean basicranial ele-ments revealed that the presphenoid was extensively modi-fied along the cetacean lineage (Ichishima 2016)

Positively selected genes unique to the humpback whalewere significantly enriched for a single biological process reg-ulation of cell cycle checkpoint (GO1901976 1857-fold en-richment Pfrac14 002 after Bonferonni correction for multipletesting) suggesting positive selection in pathways that con-trol responses to endogenous or exogenous sources of DNAdamage and limit cancer progression (Kastan and Bartek2004) We detected a significant number of proteinndashproteininteractions among humpback whale-specific positively se-lected genes (number of nodesfrac14 204 number of edgesfrac14 71expected number of edgesfrac14 51 Pfrac14 0004 supplementaryfig 10 Supplementary Material online) including genes thatare often coexpressed and involved in DNA repair DNA rep-lication and cell differentiation For instance we identifiedsignificant interactions between DNA2 which encodes a heli-case involved in the maintenance of DNA stability andWDHD1 which acts as a replication initiation factorAnother robust protein interaction network was detected

between a number of genes involved in the genesis and main-tenance of primary cilia The highest scoring functional an-notation clusters resulted in key words such as ciliopathy(seven genes) and cell projection (16 genes) and GO termssuch as cilium morphogenesis cilium assembly ciliary basalbody and centriole The primary cilia of multicellular eukar-yotes control cell proliferation by mediating cell-extrinsic sig-nals and regulating cell cycle entry and defects in ciliaryregulation are common in many cancers (Michaud andYoder 2006)

Our branch-site test results indicated that the evolution ofcetacean gigantism was accompanied by strong selection onmany pathways that are directly linked to cancer (fig 4C) Weidentified 33 genes that are mutated in human cancers(according to the COSMIC database) that were inferred assubjected to positive selection in the humpback whale line-age including the known tumor suppressor genes ATR whichis a protein kinase that senses DNA damage upon genotixicstress and activates cell cycle arrest and RECK which sup-presses metastasis (Forbes et al 2015) Multiple members ofthe PR domain-containing gene family (PRDM) evolved un-der positive selection across cetaceans including the tumorsuppressor genes PRDM1 whose truncation leads to B-cellmalignancies and PRDM2 which regulates the expressionand degradation of TP53 (Shadat et al 2010) and whoseforced expression causes apoptosis and cell cycle arrest incancer cell lines (Fog et al 2012) In baleen whales ERCC5which is a DNA repair protein that partners with BRCA1 andBRCA2 to maintain genomic stability (Trego et al 2016) andsuppresses UV-induced apoptosis (Clement et al 2006)appeared to have been subjected to positive selection as

Table 5 Continued

Comparison Gene Symbol Gene Name Role in Cancer Function

TAL1 T-cell acute lymphocyticleukemia 1 (SCL)

Oncogene fusion Implicated in the genesis of hemopoietic malignan-cies and may play an important role in hemopoieticdifferentiationa

TNFRSF14 Tumor necrosis factor re-ceptor superfamilymember 14 (herpesvirusentry mediator)

TSG The encoded protein functions in signal transductionpathways that activate inflammatory and inhibi-tory T-cell immune responsea

TNFRSF17 Tumor necrosis factor re-ceptor superfamilymember 17

Oncogene fusion This receptor also binds to various TRAF familymembers and thus may transduce signals for cellsurvival and proliferationa

NOTEmdashTSG tumor suppressor geneaSource RefSeqbSource Cancer Hallmark from CGC

Table 6 GO Terms for Biological Processes Overrepresented by Genes with Pairwise dNdS gt 1 in the Orca Bottlenose Dolphin Comparison

GO Term Description Number of Genes Fold Enrichment P-Valuea

GO0031347 Regulation of defense response 36 270 291E-2GO0050776 Regulation of immune response 50 218 195E-3GO0002694 Regulation of leukocyte activation 31 249 189E-2GO0002275 Myeloid cell activation involved in immune response 29 253 237E-2GO0002699 Positive regulation of immune effector process 19 446 101-E3GO0042108 Positive regulation of cytokine biosynthetic process 9 687 223E-2

aAfter Bonferonni correction for multiple testing

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1755

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

well Across all the branch-site models positively selectedgenes represented multiple functional categories relevant tocancer and Petorsquos Paradox

Among the cancer-related genes subjected to positiveselection in cetaceans we identified two with identicalamino acid changes among disparate taxa united by thetraits of large body size andor extreme longevitySpecifically PRDM13 is a tumor suppressor gene that actsas a transcriptional repressor and we found identical DEamino acid substitutions in this gene in sperm whale dol-phin orca and humpback but also manatee (Trichechusmanatus) and African elephant (Loxodonta africana) whichare large-bodied afrotherian mammals that have been thefocus of cancer suppression research (Abegglen et al 2015Sulak et al 2016) Secondly POLE is a cancer-related genethat participates in DNA repair and replication and weobserved one IV substitution shared among orca dol-phin bowhead humpback and common minke whalebut also elephant as well as a second IV substitutionshared with these cetaceans and the little brown bat(Myotis lucifugus) Vesper bats such as M lucifugus areknown for their exceptional longevity relative to theirbody size and have been proposed as model organisms insenescence and cancer research (Foley et al 2018) Parallelchanges in cancer-related genes across these

phylogenetically distinct mammals suggest natural selectionhas acted on similar pathways that limit neoplastic progres-sion in large and long-lived species (Tollis et al 2017)

Petorsquos Paradox and Cancer in Whales and Other LargeMammalsLarge body size has evolved numerous times in mammalsand although it is exemplified in some extant cetaceans gi-gantism is also found in afrotherians perissodactyls and car-nivores (Baker et al 2015) Our results suggest that cancersuppression in large and long-lived mammals has also evolvednumerous times However none of these species iscompletely immune to cancer Elephants have at least a 5lifetime risk of cancer mortality (Abegglen et al 2015) whichis far less than humans but detecting cancer and estimatingcancer incidence and mortality rates in wild cetaceans is morechallenging Mathematical modeling predicting the lifetimerisk of colorectal cancer in mice and humans yielded a rate ofcolorectal cancer at 50 in blue whales by age 50 and 100by age 90 (Caulin et al 2015) This high rate of cancer mor-tality is an unlikely scenario and taken with our genomicresults presented here it suggests that cetaceans have evolvedmechanisms to limit their overall risk of cancer Among ba-leen whales benign neoplasms of the skin tongue and centralnervous system have been reported in humpback whales and

Orca

Minke whale

Cow

Bowhead whale

Bottlenose dolphin

Humpback whale

Sperm whale54

12

84

74

212A

ATRT Suppression of growth escaping programmedcell death change of cellular energetics

Humpback whale

BCORL1TF Invasion and metastasis

PICALMT Change of cellular energetics

PRDM2T Regulation of TP53 expression and degredation

TPRF Involved in the activation of oncogenic kinases

AMER1T Promotes suppression of growth

CCNB1IP1T Suppresses invasion and metastasis

POLET Participates in DNA repair and DNA replication

Cetacea ancestor

KMT2CT Methylates rsquoLys-4rsquo of histone H3

POU2AF1 Promotes proliferative signalingO

PRDM1T Regulation of TP53 expression anddegradation

ERCC5T Excision repair DNA double-strand breakrepair following UV damage

Mysticeti ancestor

PMLT Positively regulates TP53 by inhibiting MDM2-dependent degredation

Odontoceti ancestor

MUC1F Overexpressed in epithelial tumors

Minke and humpback whale ancestor

RETO Plays a role in cell differentiation and growth

C COSMIC GENES

anatomicalstructure

formation involvedin morphogenesis

autophagy

cellproliferation

cellminuscellsignaling

circulatorysystemprocess

growth

pigmentation

proteinmaturation

ribonucleoproteincomplexassembly

symbiosisencompassing

mutualismthrough

parasitismvacuolartransport

B

20

FIG 4 Positively selected genes during cetacean evolution (A) Species tree relationships of six modern cetaceans with complete genomeassemblies estimated from 152 single-copy orthologs Branch lengths are given in coalescent units Outgroup taxa are not shown The completespecies tree of 28 mammals is shown in supplementary figure 4 Supplementary Material online Boxes with numbers indicate the number ofpositively selected genes passing filters and a Bonferroni correction detected on each branch (B) TreeMap from REVIGO for GO biologicalprocesses terms represented by genes evolving under positive selection across all cetaceans Rectangle size reflects semantic uniqueness of GOterm which measures the degree to which the term is an outlier when compared semantically to the whole list of GO terms (C) Cancer gene namesand functions from COSMIC found to be evolving under positive selection in the cetacean branch-site models Superscripts for gene namesindicate as follows T tumor suppressor gene O oncogene F fusion gene Asterisks indicate P-value following FDR correction for multiple testingPlt 001 Plt 0001 Plt 00001

Tollis et al doi101093molbevmsz099 MBE

1756

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

ovarian carcinomas and lymphomas have been detected infin whales (Newman and Smith 2006) Among smaller ceta-ceans one unusually well-documented case study concludedthat 27 of beluga whales (Delphinapterus leucas) found deadin the St Lawrence estuary had cancer which may have con-tributed to 18 of the total mortality in that population(Martineau et al 2002) The authors suggested that thehigh degree of polycyclic aromatic hydrocarbons releasedinto the estuary by nearby industry may have contributedto this elevated cancer risk (Martineau et al 2002) Bycontrast the larger baleen whale species in the Gulf ofSt Lawrence appear to have lower contaminant burdenslikely due to ecological differences (Gauthier et al 1997)Interestingly unlike in human cells homologous recom-bination is uninhibited in North Atlantic right whale lungcells following prolonged exposure to the human lungcarcinogen particulate hexavalent chromium (Browninget al 2017) suggesting adaptations for high-fidelity DNArepair in whales

In this study we provide a de novo reference assemblyfor the humpback whalemdashone of the more well-studiedgiants living on Earth today The humpback whale genomeassembly is highly contiguous and contains a comparablenumber of orthologous genes to other mammalian ge-nome projects Our comparisons with other complete ce-tacean genomes confirm the results of other studies whichconcluded that rorqual whales likely began diversifyingduring the Miocene (Slater et al 2017 Arnason et al2018) We found indications of positive selection onmany protein-coding genes suggestive of adaptive changein pathways controlling the mammalian appendicular andcranial skeletal elements which are relevant to highly spe-cialized cetacean phenotypes as well as in many immunitygenes and pathways that are known to place checks onneoplastic progression LSDs in cetacean genomes containmany genes involved in the control of apoptosis includingknown tumor suppressor genes and skin transcriptomeresults from humpback whale suggest many gene duplica-tions whether through segmental duplication or retro-transposition are transcribed and hence likely functionalWe also use genome-wide evidence to show that germlinemutation rates may be slower in cetaceans than in othermammals which has been suggested in previous studies(Jackson et al 2009) and we suggest as a corollary thatcetacean somatic mutations rates may be lower as wellThese results are consistent with predictions stemmingfrom Petorsquos Paradox (Peto et al 1975 Caulin and Maley2011) which posited that gigantic animals have evolvedcompensatory adaptations to cope with the negativeeffects of orders of magnitude more cells and long lifespansthat increase the number of cell divisions and cancer riskover time Altogether the humpback whale genome as-sembly will aid comparative oncology research that seeksto improve therapeutic targets for human cancers as wellas provide a resource for developing useful genomicmarkers that will aid in the population management andconservation of whales

Materials and Methods

Tissue Collection and DNA ExtractionBiopsy tissue was collected from an adult female humpbackwhale (ldquoSaltrdquo NCBI BioSample SAMN1058501) in the Gulf ofMaine western North Atlantic Ocean using previously de-scribed techniques (Lambertsen 1987 Palsboslashll et al 1991) andflash frozen in liquid nitrogen We extracted DNA from skinusing the protocol for high-molecular-weight genomic DNAisolation with the DNeasy Blood and Tissue purification kit(Qiagen) Humpback whales can be individually identifiedand studied over time based on their unique ventral flukepigmentation (Katona and Whitehead 1981) Salt was specif-ically selected for this study because of her 35-year priorsighting history which is among the lengthiest and detailedfor an individual humpback whale (Center for Coastal Studiesunpublished data)

De Novo Assembly of the Humpback Whale GenomeUsing a combination of paired-end and mate-pair libraries denovo assembly was performed using Meraculous 204(Chapman et al 2011) with a kmer size of 47 Reads weretrimmed for quality sequencing adapters and mate-pairadapters using Trimmomatic (Bolger et al 2014) The genomesize of the humpback whale was estimated using the shortreads by counting the frequency of kmers of length 27 oc-curring in the 180-bp data set estimating the kmer coverageand using the following formula genome size frac14 total kmers kmer coverage

Chicago Library Preparation and SequencingFour Chicago libraries were prepared as described previously(Putnam et al 2016) Briefly for each library500 ng of high-molecular-weight genomic DNA (mean fragment lengthgt50kb) was reconstituted into chromatin in vitro and fixed withformaldehyde Fixed chromatin was digested with MboI orDpnII the 50-overhangs were repaired with biotinylatednucleotides and blunt ends were ligated After ligation cross-links were reversed and the DNA purified from protein Biotinthat was not internal to ligated fragments was removed fromthe purified DNA The DNA was then sheared to 350 bpmean fragment size and sequencing libraries were generatedusing NEBNext Ultra (New England BioLabs) enzymes andIllumina-compatible adapters Biotin-containing fragmentswere isolated using streptavidin beads before PCR enrichmentof each library The libraries were sequenced on an IlluminaHiSeq 2500 platform

Scaffolding the De Novo Assembly with HiRiseThe input de novo assembly shotgun reads and Chicagolibrary reads were used as input data for HiRise a softwarepipeline designed specifically for using Chicago data to scaf-fold genome assemblies (Putnam et al 2016) Shotgun andChicago library sequences were aligned to the draft inputassembly using a modified SNAP read mapper (httpsnapcsberkeleyedu) The separations of Chicago read pairsmapped within draft scaffolds were analyzed by HiRise toproduce a likelihood model for genomic distance between

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1757

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

read pairs and the model was used to identify putative mis-joins and to score prospective joins After scaffolding shotgunsequences were used to close gaps between contigs

Assessing the Gene Content of the Humpback WhaleAssemblyThe expected gene content of the assembly was evaluatedusing the Core Eukaryotic Genes Mapping Approach (Parraet al 2009) which searches the assembly for 458 highly con-served proteins and reports the proportion of 248 of the mosthighly conserved orthologs that are present in the assemblyWe also used the Benchmarking Universal Single CopyOrthologs (BUSCO v201 Sim~ao et al 2015) which analyzesgenome assemblies for the presence of 3023 genes conservedacross vertebrates as well as a set of 6253 genes conservedacross laurasiatherian mammals

Transcriptome Sequencing and AssemblyIn order to aid in our gene-finding efforts for the humpbackwhale genome assembly and to measure gene expression wegenerated transcripts from skin tissue by extracting total RNAusing the QIAzol Lysis Reagent (Qiagen) followed by purifi-cation on RNeasy spin columns (Qiagen) RNA integrity andquantity were determined on the Agilent 2100 Bioanalyzer(Agilent) using the manufacturerrsquos protocol The total RNAwas treated with DNase using DNase mix from the RecoverAllTotal Nucleic Acid Isolation kit (Applied BiosystemsAmbion) The RNA library was prepared and sequenced bythe Genome Technology Center at the University ofCalifornia Santa Cruz including cDNA synthesis with theOvation RNA-Seq system V2 (Nugen) and RNA amplificationas described previously (Tariq et al 2011) We used 05ndash1mgof double-stranded cDNA for library preparation shearedusing the Covaris S2 size-selected for 350ndash450 bp using au-tomated electrophoretic DNA fractionation system(LabChipXT Caliper Life Sciences) Paired-end sequencing li-braries were constructed using Illumina TruSeq DNA SamplePreparation Kit Following library construction samples werequantified using the Bioanalyzer and sequenced on theIllumina HiSeq 2000 platform to produce 2 100 bp se-quencing reads We then used Trinity (Grabherr et al 2011)to assemble the adapter-trimmed RNA-Seq reads intotranscripts

Genome AnnotationWe generated gene models for the humpback whale usingmultiple iterations of MAKER2 (Holt and Yandell 2011) whichincorporated 1) direct evidence from the Trinity-assembledtranscripts 2) homology to NCBI proteins from ten mammals(human mouse dog cow sperm whale bottlenose dolphinorca bowhead whale common minke whale and baiji) andUniProtKBSwiss-Prot (UniProt Consortium 2015) and 3) abinitio gene predictions using SNAP (11292013 release Korf2004) and Augustus v302 (Stanke et al 2008) A detaileddescription of the annotation pipeline is provided in the sup-plementary Methods Supplementary Material online Finalgene calls were annotated functionally by BlastP similarity to

UniProt proteins (UniProt Consortium 2015) with an e-valuecutoff of 1e-6

Repeat Annotation and Evolutionary AnalysisTo analyze the repetitive landscape of the humpback whalegenome we used both database and de novo modelingmethods For the database method we ran RepeatMaskerv405 (httpwwwrepeatmaskerorg accessed August 212017) (Smit et al 2015a) on the final assembly indicatingthe ldquomammaliardquo repeat library from RepBase (Jurka et al2005) For the de novo method we scanned the assemblyfor repeats using RepeatModeler v108 (httpwwwrepeat-maskerorg) (Smit et al 2015b) the results of which were thenclassified using RepeatMasker To estimate evolutionary di-vergence within repeat subfamilies in the humpback whalegenome we generated repeat-family-specific alignments andcalculated the average Kimura-2-parameter divergence fromconsensus within each family correcting for high mutationrates at CpG sites with the calcDivergenceFromAlignplRepeatMasker tool We compared the divergence profile ofhumpback whale and bowhead whale by completing parallelanalyses and the repetitive landscapes of orca and bottlenosedolphin are available from the RepeatMasker server (httpwwwrepeatmaskerorgspecies accessed August 21 2017)

Analysis of Gene Expression Using RNA-SeqSplice-wise mapping of RNA-Seq reads against the humpbackwhale genome assembly and annotation was carried out us-ing STAR v24 (Dobin et al 2013) and we counted the num-ber of reads mapping to gene annotations We also mappedthe skin RNA-Seq data to the database of annotated hump-back whale transcripts using local alignments with bowtiev225 (Langmead and Salzberg 2012) and used stringtiev134 (Pertea et al 2015) to calculate gene abundances bytranscripts per million

Analysis of Segmental Duplications in CetaceanGenomesIn order to detect LSDs in several cetacean genomes we ap-plied an approach based on depth of coverage (Alkan et al2009) To this end we used whole-genome shotgun sequencedata from the current study as well as from other cetaceangenomics projects All data were mapped against the hump-back whale reference assembly A detailed description of thesegmental duplication analysis is provided in the supplemen-tary Methods Supplementary Material online

Whole-Genome AlignmentsWe generated WGAs of 12 mammals (supplementary table 2Supplementary Material online) First we generated pairwisesyntenic alignments of each species as a query to the humangenome (hg19) as a target using LASTZ v102 (Harris 2007)followed by chaining to form gapless blocks and netting torank the highest scoring chains (Kent et al 2003) The pairwisealignments were used to construct a multiple sequence align-ment with MULTIZ v112 (Blanchette et al 2004) with humanas the reference species We filtered the MULTIZ alignment to

Tollis et al doi101093molbevmsz099 MBE

1758

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

only contain aligned blocks from at least 10 out of the 12species (81 complete)

Phylogenetic Reconstruction Using Single-CopyOrthologsWe downloaded the coding DNA sequences from 28 publiclyavailable mammalian genome assemblies (supplementary ta-ble 9 Supplementary Material online) and used VESPA(Webb et al 2017) to obtain high-confidence SGOs (supple-mentary Methods Supplementary Material online) For phy-logenetic analysis we filtered the SGO data set to include onlyloci that were represented by at least 24 out of the 28 mam-malian species (86 complete) and reconstructed each genetree using maximum likelihood in PhyML v30 (Guindon et al2010) with an HKY85 substitution model and 100 bootstrapreplicates to assess branch support The gene trees were thenbinned and used to reconstruct a species tree using the ac-curate species tree algorithm (ASTRAL-III v56 Zhang et al2018) ASTRAL utilizes the multispecies coalescent modelthat incorporates incomplete lineage sorting and finds thespecies tree stemming from bipartitions predefined by thegene trees Branch support for the species tree was assessedwith local posterior probabilities and branch lengths werepresented in coalescent units where shorter branch lengthsindicate greater gene tree discordance (Sayyari and Mirarab2016)

Rates of Molecular Evolution and Divergence TimeEstimationWe used multiple approaches on independent data sets toestimate rates of molecular evolution and the divergencetimes of the major mammalian lineages including six modernwhales with complete genome assemblies We first focusedon 4-fold degenerate (4D) sites which are positions withincodon alignments where substitutions result in no aminoacid change and can be used to approximate the neutralrate of evolution (Kumar and Subramanian 2002) We usedthe Ensembl human gene annotation to extract codingregions from the 12-mammal WGA using msa_view inPHAST v14 (Hubisz et al 2011) We reconstructed the phy-logeny with the 4D data as a single partition in RAxML v83(Stamatakis 2014) under the GTRGAMMA substitutionmodel and assessed branch support with 10000 bootstrapsRates of molecular evolution were estimated on the 4D dataset with the semiparametric PL method implemented in r8sv18 (Sanderson 2002 2003) A detailed description of the PLmethod is given in the supplementary MethodsSupplementary Material online

We also used the approximate likelihood calculation inMCMCtree (Yang and Rannala 2006) to estimate divergencetimes using independent data sets 1) the above-mentioned4D data set derived from the WGA as well as 2) a set of theSGOs that included 24 out of 28 sampled taxa (86) and waspartitioned into three codon positions We implemented theHKY85 substitution model multiple fossil-based priors (sup-plementary table 10 Supplementary Material online Mitchell1989 Benton et al 2015 Hedges et al 2015) and independentrates (ldquoclockfrac14 3rdquo) along branches All other parameters were

set as defaults For each MCMCtree analysis we ran the anal-ysis three times with different starting seeds and modified theMarkov chain Monte Carlo (MCMC) length and samplingfrequency in order to achieve proper chain convergencemonitored with Tracer v17 We achieved proper MCMCconvergence on the 4D data set after discarding the first500000 steps as burn-in and sampling every 2000 steps untilwe collected 20000 samples We achieved proper MCMCconvergence on the SGO data set after discarding the first500000 steps as burn-in and sampling every 10000 steps untilwe collected 10000 samples

Demographic AnalysisWe used the PSMC (Li and Durbin 2011) to reconstruct thepopulation history of North Atlantic humpback whales in-cluding the individual sequenced in the current study (down-sampled to 20 coverage) and a second individualsequenced at 17 coverage in Arnason et al (2018) Adetailed description of the PSMC analysis is provided in thesupplementary Methods Supplementary Material online

Nonneutral Substitution Rates in Cetacean GenomesIn order to identify genomic regions controlling cetacean-specific adaptations we used phyloP (Pollard et al 2010) todetect loci in the 12-mammal WGA that depart from neutralexpectations (see supplementary Methods SupplementaryMaterial online) We then collected accelerated regions thatoverlapped human whole gene annotations (hg19) using bed-tools intersect (Quinlan and Hall 2010) and tested for theenrichment of GO terms using the PANTHER analysis toolavailable at the Gene Ontology Consortium website (GOOntology database last accessed June 2017) (GeneOntology Consortium 2015)

Detection of Protein-Coding Genes Subjected toPositive SelectionIn order to measure selective pressures acting on protein-coding genes during cetacean evolution with an emphasison the evolution of cancer suppression we estimated theratio of nonsynonymous to synonymous substitutions (dNdS) To maximize statistical power in pairwise comparisonsgiven the number of available cetacean genomes (six lastaccessed September 2017) we implemented phylogenetictargeting (Arnold and Nunn 2010) assuming a phylogenyfrom a mammalian supertree (Fritz et al 2009) To selectgenome assemblies most suitable for assessing PetorsquosParadox we weighted scores for contrasts with a lot of changein the same direction for both body mass and maximumlongevity Trait values were taken from panTHERIA (Joneset al 2009) and we selected maximal pairings based on thestandardized summed scores We then generated pairwisegenome alignments as described above based on the phylo-genetic targeting results For each pairwise genome align-ment we stitched gene blocks in Galaxy (Blankenberg et al2011) according to the target genome annotations produc-ing alignments of one-to-one orthologs which were filteredto delete frameshift mutations and replace internal stopcodons with gaps We then estimated pairwise dNdS for every

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1759

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

orthologous gene pair with KaKs_Calculator v20 (Wang et al2010) To link genes with dNdS gt1 to potential phenotypeswe used orthologous human Ensembl gene IDs to collect GOterms in BioMart (Kinsella et al 2011) and tested for enrich-ment of overrepresented GO terms

We also used codon-based models to test for selectivepressure variation along branches of the cetacean phylog-eny in comparison to other mammal lineages also knownas the branch-site test (Yang 2007) First the known spe-cies phylogeny (Morgan et al 2013 Tarver et al 2016) waspruned to correspond to the species present in each SGOfamily SGO nucleotide alignments that contained morethan seven species were analyzed for selective pressurevariation This is to reduce the risk of detecting falsepositives (Anisimova et al 2001 2002) In general thebranch-site test is a powerful yet conservative approach(Gharib and Robinson-Rechavi 2013) although modelmisspecification and alignment errors can greatly increasethe number of false positives (Anisimova et al 20012002) Recent studies have concluded that many pub-lished inferences of adaptive evolution using the branchsite test may be affected by artifacts (Venkat et al 2018)Therefore extensive filtering is necessary in order to makereasonably sound conclusions from results of the branchsite test A detailed description of all tested models andthe filtering process are given in the supplementaryMethods Supplementary Material online In total 1152gene families were analyzed We carried out the branch-site test using PAML v44e (Yang 2007) The following fivebranches were assessed as foreground humpback whalethe most recent common ancestor (MRCA) of the com-mon minke and humpback whales MRCA of baleenwhales MRCA of toothed whales and the MRCA of allwhales (cetacean stem lineage) For each model we keptall genes that met a significance threshold of Plt 005 aftera Bonferonni correction for multiple hypothesis testingusing the total number of branch genes (five foregroundbranches1152 genes) We also corrected the raw P-val-ues from the likelihood ratio tests of every gene by theFDR correction where qfrac14 005 (Benjamini and Hochberg1995) The Bonferroni correction is more conservativethan the FDR but sufficient for multiple hypothesis test-ing of lineage-specific positive selection in these casesthe FDR results in higher probabilities of rejecting truenull hypotheses (Anisimova and Yang 2007) Thereforewe used Bonferroni-corrected results in downstreamanalyses but also report FDR-corrected P-values infigure 4C Genes were identified based on the humanortholog (Ensembl gene ID) and we performed gene an-notation enrichment analysis and functional annotationclustering with DAVID v68 (Huang et al 2009a 2009b) aswell as semantic clustering of GO terms using REVIGO(Supek et al 2011) We also searched for interactions ofpositively selected proteins using STRING v105(Szklarczyk et al 2017) with default parameters andtested for the enrichment of overrepresented GO termsas above and for associated mouse phenotypes usingmodPhEA (Weng and Liao 2017)

Data AvailabilityAll data that contributed to the results of the study are madepublicly available The genomic sequencing RNA sequencingas well as the genome assembly for humpback whale(GCA_0043293851) are available under NCBI BioProjectPRJNA509641 The gene annotation orthologous gene setspositive selection results segmental duplication annotationsand whole-genome alignments used in this study are availableat the Harvard Dataverse (httpsdoiorg107910DVNADHX1O)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsField data collection was supported by the Center for CoastalStudies and conducted under NOAA research permit 16325This work was initiated with funding from Dr Jeffrey Pearl atUC San Fransisco This work was supported in part by NIH(Grants U54 CA217376 U2C CA233254 P01 CA91955 R01CA149566 R01 CA170595 R01 CA185138 and R01CA140657) as well as CDMRP Breast Cancer ResearchProgram (Award BC132057) and the Arizona BiomedicalResearch Commission (Grant ADHS18-198847) to CCMThe findings opinions and recommendations expressedhere are those of the authors and not necessarily those ofthe universities where the research was performed or theNational Institutes of Health PJP would like to acknowledgefunding from Stockholm University and the University ofGroningen Startup funds were provided by the School ofInformatics Computing and Cyber Systems at NorthernArizona University (MT) MJOrsquoC would like to thank theUniversity of Leeds funding her 250 Great Minds UniversityAcademic Fellowship The authors thank Wensi Hao(University of Groningen) for technical support AndreaCabrera (University of Groningen) for helpful conversationsabout mutation rates and demography and Richard E Green(UC Santa Cruz) for consultations regarding genome se-quencing assembly and annotation The authors acknowl-edge Research Computing at Arizona State University (httpwwwresearchcomputingasuedu) the Advanced ResearchComputing resources at the University of Leeds and theMonsoon computing cluster at Northern ArizonaUniversity (httpsnaueduhigh-performance-computing)for providing high-performance computing and storageresources that have contributed to this study

ReferencesAbegglen LM Caulin AF Chan A Lee K Robinson R Campbell MS Kiso

WK Schmitt DL Waddell PJ Bhaskara S 2015 Potential mechanismsfor cancer resistance in elephants and comparative cellular responseto DNA damage in humans JAMA 314(17)1850ndash1860

Alkan C Kidd JM Marques-Bonet T Aksay G Antonacci F HormozdiariF Kitzman JO Baker C Malig M Mutlu O 2009 Personalized copynumber and segmental duplication maps using next-generation se-quencing Nat Genet 41(10)1061ndash1067

Tollis et al doi101093molbevmsz099 MBE

1760

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Anisimova M Bielawski JP Yang Z 2001 Accuracy and power of thelikelihood ratio test in detecting adaptive molecular evolution MolBiol Evol 18(8)1585ndash1592

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19(6)950ndash958

Anisimova M Yang Z 2007 Multiple hypothesis testing to detect line-ages under positive selection that affects only a few sites Mol BiolEvol 24(5)1219ndash1228

Arnason U Lammers F Kumar V Nilsson MA Janke A 2018 Whole-genome sequencing of the blue whale and other rorquals findssignatures for introgressive gene flow Sci Adv 4(4)eaap9873

Arnason U Widegren B 1989 Composition and chromosomal localiza-tion of cetacean highly repetitive DNA with special reference to theblue whale Balaenoptera musculus Chromosoma 98(5)323ndash329

Arnold C Nunn CL 2010 Phylogenetic targeting of research effort inevolutionary biology Am Nat 176(5)601ndash612

Baker CS Perry A Bannister JL Weinrich MT Abernethy RBCalambokidis J Lien J Lambertsen RH Ramirez JU Vasquez O1993 Abundant mitochondrial DNA variation and world-wide pop-ulation structure in humpback whales Proc Natl Acad Sci U S A90(17)8239ndash8243

Baker J Meade A Pagel M Venditti C 2015 Adaptive evolution towardlarger size in mammals Proc Natl Acad Sci U S A 112(16)5093ndash5098

Benjamini Y Hochberg Y 1995 Controlling the false discovery rate apractical and powerful approach to multiple testing J R Stat Soc57289ndash300

Benson RBJ Campione NE Carrano MT Mannion PD Sullivan CUpchurch P Evans DC 2014 Rates of dinosaur body mass evolutionindicate 170 million years of sustained ecological innovation on theavian stem lineage PLoS Biol 12(5)e1001853

Benton MJ Donoghue PCJ Asher RJ Friedman M Near TJ Vinther J2015 Constraints on the timescale of animal evolutionary historyPalaeontol Electron 181ndash106

Berta A Sumich JL Kovacs KM 2015 Marine mammals evolutionarybiology 3rd ed Cambridge (MA) Academic Press

Blanchette M Kent WJ Riemer C Elnitski L Smit AFA Roskin KMBaertsch R Rosenbloom K Clawson H Green ED 2004 Aligningmultiple genomic sequences with the threaded blockset alignerGenome Res 14(4)708ndash715

Blankenberg D Taylor J Nekrutenko A Galaxy T 2011 Making wholegenome multiple alignments usable for biologists Bioinformatics27(17)2426ndash2428

Boissinot S Sookdeo A 2016 The evolution of LINE-1 in vertebratesGenome Biol Evol 8(12)3485ndash3507

Bolger AM Lohse M Usadel B 2014 Trimmomatic a flexible trimmer forIllumina sequence data Bioinformatics 30(15)2114ndash2120

Browning CL Wise CF Wise JP 2017 Prolonged particulate chromateexposure does not inhibit homologous recombination repair inNorth Atlantic right whale (Eubalaena glacialis) lung cells ToxicolAppl Pharmacol 33118ndash23

Carbone L Harris RA Gnerre S Veeramah KR Lorente-Galdos BHuddleston J Meyer TJ Herrero J Roos C Aken B et al 2014Gibbon genome and the fast karyotype evolution of small apesNature 513(7517)195ndash201

Caulin AF Graham TA Wang L-S Maley CC 2015 Solutions to Petorsquosparadox revealed by mathematical modelling and cross-species can-cer gene analysis Philos Trans R Soc B 370(1673)20140222

Caulin AF Maley CC 2011 Petorsquos Paradox evolutionrsquos prescription forcancer prevention Trends Ecol Evol (Amst) 26(4)175ndash182

Chapman JA Ho I Sunkara S Luo S Schroth GP Rokhsar DS 2011Meraculous de novo genome assembly with short paired-end readsPLoS One 6(8)e23501

Chittleborough RG 1959 Determination of age in the humpback whaleMegaptera nodosa (Bonnaterre) Aust J Mar Freshw Res10(2)125ndash143

Clapham PJ Mead JG 1999 Megaptera novaeangliae Mamm Species6041ndash9

Clement V Dunand-Sauthier I Clarkson SG 2006 Suppression of UV-induced apoptosis by the human DNA repair protein XPG CellDeath Differ 13(3)478ndash488

Dobin A Davis CA Schlesinger F Drenkow J Zaleski C Jha S Batut PChaisson M Gingeras TR 2013 STAR ultrafast universal RNA-Seqaligner Bioinformatics 29(1)15ndash21

Fay JC Wu C-I 2003 Sequence divergence functional constraint andselection in protein evolution Annu Rev Genomics Hum Genet4213ndash235

Ferris E Abegglen LM Schiffman JD Gregg C 2018 Accelerated evolu-tion in distinctive species reveals candidate elements for clinicallyrelevant traits including mutation and cancer resistance Cell Rep22(10)2742ndash2755

Fog CK Galli GG Lund AH 2012 PRDM proteins important players indifferentiation and disease Bioessays 34(1)50ndash60

Foley NM Hughes GM Huang Z Clarke M Jebb D Whelan CV Petit EJTouzalin F Farcy O Jones G et al 2018 Growing old yet stayingyoung the role of telomeres in batsrsquo exceptional longevity Sci Adv4(2)eaao0926

Foote AD Liu Y Thomas GWC Vinar T Alfoldi J Deng J Dugan Svan Elk CE Hunter ME Joshi V et al 2015 Convergent evolu-tion of the genomes of marine mammals Nat Genet47(3)272ndash275

Forbes SA Beare D Gunasekaran P Leung K Bindal N Boutselakis HDing M Bamford S Cole C Ward S et al 2015 COSMIC exploringthe worldrsquos knowledge of somatic mutations in human cancerNucleic Acids Res 43(D1)D805ndashD811

Fritz SA Bininda-Emonds ORP Purvis A 2009 Geographical variation inpredictors of mammalian extinction risk big is bad but only in thetropics Ecol Lett 12(6)538ndash549

Futreal PA Coin L Marshall M Down T Hubbard T Wooster R RahmanN Stratton MR 2004 A census of human cancer genes Nat RevCancer 4(3)177ndash183

Gabriele C Lockyer C Straley JM Jurasz CM Kato H 2010 Sightinghistory of a naturally marked humpback whale (Megaptera novaean-gliae) suggests ear plug growth layer groups are deposited annuallyMar Mam Sci 26(2)443ndash450

Gauthier JM Metcalfe CD Sears R 1997 Chlorinated organic contam-inants in blubber biopsies from northwestern Atlantic balaenopteridwhales summering in the Gulf of St Lawrence Mar Environ Res44(2)201ndash223

Gene Ontology Consortium 2015 Gene Ontology Consortium goingforward Nucleic Acids Res 43D1049ndashD1056

George JC Bada J Zeh J Scott L Brown SE OrsquoHara T Suydam R 1999Age and growth estimates of bowhead whales (Balaena mysticetus)via aspartic acid racemization Can J Zool 77(4)571ndash580

Gharib WH Robinson-Rechavi M 2013 The branch-site test of positiveselection is surprisingly robust but lacks power under synonymoussubstitution saturation and variation in GC Mol Biol Evol30(7)1675ndash1686

Grabherr MG Haas BJ Yassour M Levin JZ Thompson DA Amit IAdiconis X Fan L Raychowdhury R Zeng Q et al 2011 Full-lengthtranscriptome assembly from RNA-Seq data without a referencegenome Nat Biotechnol 29(7)644ndash652

Guindon S Dufayard J-F Lefort V Anisimova M Hordijk W Gascuel O2010 New algorithms and methods to estimate maximum-likelihood phylogenies assessing the performance of PhyML 30Syst Biol 59(3)307ndash321

Harris RS 2007 Improved pairwise alignment of genomic DNA [PhDthesis] The Pennsylvania State University State College PA USA

Hedges SB Marin J Suleski M Paymer M Kumar S 2015 Tree of lifereveals clock-like speciation and diversification Mol Biol Evol32(4)835ndash845

Heim NA Knope ML Schaal EK Wang SC Payne JL 2015 Copersquos rule inthe evolution of marine animals Science 347(6224)867ndash870

Holt C Yandell M 2011 MAKER2 an annotation pipeline and genome-database management tool for second-generation genome projectsBMC Bioinformatics 12491

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1761

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Huang DW Sherman BT Lempicki RA 2009a Bioinformatics enrich-ment tools paths toward the comprehensive functional analysis oflarge gene lists Nucleic Acids Res 37(1)1ndash13

Huang DW Sherman BT Lempicki RA 2009b Systematic and integra-tive analysis of large gene lists using DAVID bioinformatics resourcesNat Protoc 4(1)44ndash57

Hubisz MJ Pollard KS Siepel A 2011 PHAST and RPHAST phylogeneticanalysis with spacetime models Brief Bioinform 12(1)41ndash51

Hudson RR 1990 Gene genealogies and the coalescent process InFutuyama DJ Antonovics J editors Oxfords surveys in evolutionarybiology Oxford Oxford University Press p 1ndash44

Ichishima H 2016 The ethmoid and presphenoid of cetaceans JMorphol 277(12)1661ndash1674

Jackson JA Baker CS Vant M Steel DJ Medrano-Gonzalez LPalumbi SR 2009 Big and slow phylogenetic estimates of mo-lecular evolution in baleen whales (suborder Mysticeti) MolBiol Evol 26(11)2427ndash2440

Jackson JA Steel DJ Beerli P Congdon BC Olavarria C Leslie MS PomillaC Rosenbaum H Baker CS 2014 Global diversity and oceanic di-vergence of humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 281(1786)20133222

Jones KE Bielby J Cardillo M Fritz SA OrsquoDell J Orme CDL Safi KSechrest W Boakes EH Carbone C et al 2009 PanTHERIA a spe-cies-level database of life history ecology and geography of extantand recently extinct mammals Ecology 90(9)2648ndash2648

Jurka J Kapitonov VV Pavlicek A Klonowski P Kohany O WalichiewiczJ 2005 Repbase Update a database of eukaryotic repetitive ele-ments Cytogenet Genome Res 110(1ndash4)462ndash467

Kaessmann H 2010 Origins evolution and phenotypic impact of newgenes Genome Res 20(10)1313ndash1326

Kastan MB Bartek J 2004 Cell-cycle checkpoints and cancer Nature432(7015)316ndash323

Katona SK Whitehead HP 1981 Identifying humpback whales usingtheir natural markings Polar Rec 20(128)439ndash444

Keane M Semeiks J Webb AE Li YI Quesada V Craig T Madsen LB vanDam S Brawand D Marques PI et al 2015 Insights into the evolu-tion of longevity from the bowhead whale genome Cell Rep10(1)112ndash122

Kent WJ Baertsch R Hinrichs A Miller W Haussler D 2003 Evolutionrsquoscauldron duplication deletion and rearrangement in the mouseand human genomes Proc Natl Acad Sci U S A100(20)11484ndash11489

Kingsolver JG Pfennig DW 2004 Individual-level selection as acause of Copersquos rule of phyletic size increase Evolution58(7)1608ndash1612

Kinsella RJ Keuroaheuroari A Haider S Zamora J Proctor G Spudich G Almeida-King J Staines D Derwent P Kerhornou A et al 2011 EnsemblBioMarts a hub for data retrieval across taxonomic spaceDatabase (Oxford) 2011bar030

Korf I 2004 Gene finding in novel genomes BMC Bioinformatics 559Kumar S Subramanian S 2002 Mutation rates in mammalian genomes

Proc Natl Acad Sci U S A 99(2)803ndash808Lambertsen RH 1987 A biopsy system for large whales and its use for

cytogenetics J Mammal 68(2)443ndash445Langmead B Salzberg SL 2012 Fast gapped-read alignment with Bowtie

2 Nat Methods 9(4)357ndash359Larsen AH Sigurjonsson J Oslashien N Vikingsson G Palsboslashll PJ 1996

Populations genetic analysis of nuclear and mitochondrial loci inskin biopsies collected from central and northeastern NorthAtlantic humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 2631611ndash1618

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357)493ndash496

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup 2009 The Sequence AlignmentMap format andSAMtools Bioinformatics 25(16)2078ndash2079

Martineau D Lemberger K Dallaire A Labelle P Lipscomb TP Michel PMikaelian I 2002 Cancer in wildlife a case study beluga from the St

Lawrence estuary Quebec Canada Environ Health Perspect110(3)285ndash292

Mazet O Rodrıguez W Grusea S Boitard S Chikhi L 2016 On theimportance of being structured instantaneous coalescence ratesand human evolution ndash lesson for ancestral population size infer-ence Heredity 116(4)362ndash371

Michaud EJ Yoder BK 2006 The primary cilium in cell signaling andcancer Cancer Res 66(13)6463ndash6467

Milholland B Dong X Zhang L Hao X Suh Y Vijg J 2017 Differencesbetween germline and somatic mutation rates in humans and miceNat Commun 815183

Mitchell ED 1989 A new cetacean from the Late Eocene La MesetaFormation Seymour Island Antarctic Peninsula Can J Fish Aquat Sci46(12)2219ndash2235

Morgan CC Foster PG Webb AE Pisani D McInerney JO OrsquoConnell MJ2013 Heterogeneous models place the root of the placental mam-mal phylogeny Mol Biol Evol 30(9)2145ndash2156

Nadachowska-Brzyska K Burri R Smeds L Ellegren H 2016 PSMC anal-ysis of effective population sizes in molecular ecology and its appli-cation to black-and-white Ficedula flycatchers Mol Ecol25(5)1058ndash1072

Newman SJ Smith SA 2006 Marine mammal neoplasia a review VetPathol 43(6)865ndash880

Nunney L 2018 Size matters height cell number and a personrsquos risk ofcancer Proc R Soc B 285(1889)20181743

Ohsumi S 1979 Interspecies relationships among some biologicalparameters in cetaceans and estimation of the natural mortalitycoefficient of the Southern Hemisphere minke whales Rep IntWhaling Comm 29397ndash406

OrsquoLeary MA Gatesy J 2008 Impact of increased character sampling onthe phylogeny of Cetartiodactyla (Mammalia) combined analysisincluding fossils Cladistics 24397ndash442

Pagel M Venditti C Meade A 2006 Large punctuational contribution ofspeciation to evolutionary divergence at the molecular level Science314(5796)119ndash121

Palsboslashll PJ Clapham PJ Mattila DK Larsen F Sears R Siegismund HRSigurjonsson J Vasquez O Arctander P 1995 Distribution ofmtDNA haplotypes in North Atlantic humpback whales the influ-ence of behavior on population structure Mar Ecol Prog Ser1161ndash10

Palsboslashll PJ Larsen F Hansen ES 1991 Sampling of skin biopsies fromfree-raging large cetaceans in West Greenland development of newbiopsy tips and bolt designs Int Whaling Comm Spec Issue Ser13311

Palsboslashll PJ Zachariah Peery M Olsen MT Beissinger SR Berube M 2013Inferring recent historic abundance from current genetic diversityMol Ecol 22(1)22ndash40

Parra G Bradnam K Ning Z Keane T Korf I 2009 Assessing the genespace in draft genomes Nucleic Acids Res 37(1)289ndash297

Pertea M Pertea GM Antonescu CM Chang T-C Mendell JT SalzbergSL 2015 StringTie enables improved reconstruction of a transcrip-tome from RNA-Seq reads Nat Biotechnol 33(3)290ndash295

Peto R Roe FJ Lee PN Levy L Clack J 1975 Cancer and ageing in miceand men Br J Cancer 32(4)411ndash426

Pio R Corrales L Lambris JD 2014 The role of complement in tumorgrowth Adv Exp Med Biol 772229ndash262

Pollard KS Hubisz MJ Rosenbloom KR Siepel A 2010 Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 20(1)110ndash121

Putnam NH OrsquoConnell BL Stites JC Rice BJ Blanchette M Calef R TrollCJ Fields A Hartley PD Sugnet CW et al 2016 Chromosome-scaleshotgun assembly using an in vitro method for long-range linkageGenome Res 26(3)342ndash350

Quinlan AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26(6)841ndash842

Ruegg K Rosenbaum HC Anderson EC Engel M Rothschild A Baker CSPalumbi SR 2013 Long-term population size of the North Atlantichumpback whale within the context of worldwide population struc-ture Conserv Genet 14(1)103ndash114

Tollis et al doi101093molbevmsz099 MBE

1762

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Sanderson MJ 2002 Estimating absolute rates of molecular evolutionand divergence times a penalized likelihood approach Mol Biol Evol19(1)101ndash109

Sanderson MJ 2003 r8s inferring absolute rates of molecular evolutionand divergence times in the absence of a molecular clockBioinformatics 19(2)301ndash302

Sayyari E Mirarab S 2016 Fast coalescent-based computation of localbranch support from quartet frequencies Mol Biol Evol33(7)1654ndash1668

Shadat NMA Koide N Khuda II-E Dagvadorj J Tumurkhuu G Naiki YKomatsu T Yoshida T Yokochi T 2010 Retinoblastoma protein-interacting zinc finger 1 (RIZ1) regulates the proliferation of mono-cytic leukemia cells via activation of p53 Cancer Invest28(8)806ndash812

Sim~ao FA Waterhouse RM Ioannidis P Kriventseva EV Zdobnov EM2015 BUSCO assessing genome assembly and annotation com-pleteness with single-copy orthologs Bioinformatics31btv351ndashbtv3212

Slater GJ Goldbogen JA Pyenson ND 2017 Independent evolution ofbaleen whale gigantism linked to Plio-Pleistocene ocean dynamicsProc R Soc B 284(1855)20170546

Smit AFA Hubley RM Green P 2015a RepeatMasker Open-40 2013-2015 Available from httpwwwrepeatmaskerorg

Smit AFA Hubley RM Green P 2015b RepeatModeler Open-10 2008-2015 Available from httpwwwrepeatmaskerorg

Smith TD Allen J Clapham PJ Hammond PS Katona S Larsen F Lien JMattila D Palsboslashll PJ Sigurjonsson J et al 1999 An ocean-basin-widemark-recapture study of the North Atlantic humpback whale(Megaptera novaeangliae) Mar Mamm Sci 51ndash32

Stamatakis A 2014 RAxML version 8 a tool for phylogenetic analysisand post-analysis of large phylogenies Bioinformatics30(9)1312ndash1313

Stanke M Diekhans M Baertsch R Haussler D 2008 Using native andsyntenically mapped cDNA alignments to improve de novo genefinding Bioinformatics 24(5)637ndash644

Steeman ME Hebsgaard MB Fordyce RE Ho SYW Rabosky DL NielsenR Rahbek C Glenner H Soslashrensen MV Willerslev E 2009 Radiationof extant cetaceans driven by restructuring of the oceans Syst Biol58(6)573ndash585

Sulak M Fong L Mika K Chigurupati S Yon L Mongan NP Emes RDLynch VJ 2016 TP53 copy number expansion is associated with theevolution of increased body size and an enhanced DNA damageresponse in elephants Elife 51850

Supek F Bosnjak M Skunca N Smuc T 2011 REVIGO summarizes andvisualizes long lists of gene ontology terms PLoS One 6(7)e21800

Szklarczyk D Morris JH Cook H Kuhn M Wyder S Simonovic M SantosA Doncheva NT Roth A Bork P et al 2017 The STRING database in2017 quality-controlled protein-protein association networks madebroadly accessible Nucleic Acids Res 45(D1)D362ndashD368

Tacutu R Craig T Budovsky A Wuttke D Lehmann G Taranukha DCosta J Fraifeld VE de Magalh~aes JP 2012 Human Ageing GenomicResources integrated databases and tools for the biology and genet-ics of ageing Nucleic Acids Res 41(D1)D1027ndashD1033

Tariq MA Kim HJ Jejelowo O Pourmand N 2011 Whole-transcriptomeRNAseq analysis from minute amount of total RNA Nucleic AcidsRes 39(18)e120

Tarver JE Dos Reis M Mirarab S Moran RJ Parker S OrsquoReilly JE King BLOrsquoConnell MJ Asher RJ Warnow T et al 2016 The interrelationshipsof placental mammals and the limits of phylogenetic inferenceGenome Biol Evol 8(2)330ndash344

Thewissen JGM Cohn MJ Stevens LS Bajpai S Heyning J Horton WE2006 Developmental basis for hind-limb loss in dolphins and origin

of the cetacean bodyplan Proc Natl Acad Sci U S A103(22)8414ndash8418

Thewissen JGM Cooper LN Clementz MT Bajpai S Tiwari BN 2007Whales originated from aquatic artiodactyls in the Eocene epoch ofIndia Nature 450(7173)1190ndash1194

Tollis M Schiffman JD Boddy AM 2017 Evolution of cancer suppressionas revealed by mammalian comparative genomics Curr Opin GenetDev 4240ndash47

Trego KS Groesser T Davalos AR Parplys AC Zhao W Nelson MRHlaing A Shih B Rydberg B Pluth JM et al 2016 Non-catalytic rolesfor XPG with BRCA1 and BRCA2 in homologous recombination andgenome stability Mol Cell 61(4)535ndash546

UniProt Consortium 2015 UniProt a hub for protein informationNucleic Acids Res 43D204ndashD212

Valsecchi E Palsboslashll PJ Hale P Glockner-Ferrari D Ferrari M Clapham PLarsen F Mattila DK Sears R Sigurjonsson J et al 1997 Microsatellitegenetic distances between oceanic opulations of the humpbackwhale (Megaptera novaeangliae) Mol Biol Evol 14355ndash362

Venkat A Hahn MW Thornton JW 2018 Multinucleotide mutationscause false inferences of lineage-specific positive selection Nat EcolEvol 2(8)1280ndash1288

Vermeij GJ 2016 Gigantism and its implications for the history of lifePLoS One 11(1)e0146092

Wang D Zhang Y Zhang Z Zhu J Yu J 2010 KaKs_Calculator 20 atoolkit incorporating gamma-series methods and sliding windowstrategies Genomics Proteomics Bioinformatics 8(1)77ndash80

Warren WC Kuderna L Alexander A Catchen J Perez-Silva JG Lopez-Otın C Quesada V Minx P Tomlinson C Montague MJ et al 2017The novel evolution of the sperm whale genome Genome Biol Evol9(12)3260ndash3264

Webb AE Walsh TA OrsquoConnell MJ 2017 VESPA very large-scale evo-lutionary and selective pressure analyses PeerJ Comput Sci 3e118

Weng M-P Liao B-Y 2017 modPhEA model organism PhenotypeEnrichment Analysis of eukaryotic gene sets Bioinformatics33(21)3505ndash3507

Wilson Sayres MA Venditti C Pagel M Makova KD 2011 Do variationsin substitution rates and male mutation bias correlate with life-history traits A study of 32 mammalian genomes Evolution65(10)2800ndash2815

Yang Z 1998 Likelihood ratio tests for detecting positive selection andapplication to primate lysozyme evolution Mol Biol Evol15(5)568ndash573

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 24(8)1586ndash1591

Yang Z Rannala B 2006 Bayesian estimation of species divergence timesunder a molecular clock using multiple fossil calibrations with softbounds Mol Biol Evol 23(1)212ndash226

Yim H-S Cho YS Guang X Kang SG Jeong J-Y Cha S-S Oh H-M Lee J-HYang EC Kwon KK et al 2014 Minke whale genome and aquaticadaptation in cetaceans Nat Genet 46(1)88ndash92

Yu J Tian S Metheny-Barlow L Chew L-J Hayes AJ Pan H Yu G-L LiL-Y 2001 Modulation of endothelial cell growth arrest andapoptosis by vascular endothelial growth inhibitor Circ Res89(12)1161ndash1167

Zhang C Rabiee M Sayyari E Mirarab S 2018 ASTRAL-III polynomialtime species tree reconstruction from partially resolved gene treesBMC Bioinformatics 19(Suppl 6)153

Zhang Q Edwards SV 2012 The evolution of intron size in amniotes arole for powered flight Genome Biol Evol 4(10)1033ndash1043

Zhao M Kim P Mitra R Zhao J Zhao Z 2015 TSGene 20 an updatedliterature-based knowledgebase for tumor suppressor genes NucleicAcids Res 44gkv1268ndashgkD1031

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1763

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

  • msz099-TF1
  • msz099-TF2
  • msz099-TF3
  • msz099-TF4
  • msz099-TF5
  • msz099-TF6
  • msz099-TF7
  • msz099-TF8
Page 7: University of Groningen Return to the sea, get huge, …...Return to the Sea, Get Huge, Beat Cancer: An Analysis of Cetacean Genomes Including an Assembly for the Humpback Whale (Megaptera

A Complex Demographic History of North AtlanticHumpback WhalesWe estimated the demographic history of the North Atlantichumpback whale population applying the Pairwise SequentialMarkovian Coalescent (PSMC) (Li and Durbin 2011) to theshort-insert libraries generated during this study as well assequence reads from a second North Atlantic humpbackwhale (Arnason et al 2018) (fig 2B supplementary figs 7and 8 Supplementary Material online) Consistent with thefindings of Arnason et al (2018) we estimated that the largesthumpback whale population sizes were 2 Ma during thePliocenendashPleistocene transition followed by a steady declineuntil 1 Ma The PSMC trajectories of the two humpbackwhales began to diverge 100000 years ago and the esti-mated confidence intervals from 100 bootstraps for eachPSMC analysis were nonoverlapping in the more recentbins Both humpback PSMC trajectories suggested sharp pop-ulation declines beginning 25000ndash45000 years agoHowever interpreting inferred PSMC plots of pastldquodemographicrdquo changes is nontrivial in a globally distributedspecies connected by repeated occasional gene flow such ashumpback whales (Baker et al 1993 Palsboslashll et al 1995Jackson et al 2014) The apparent changes in effective pop-ulation size may represent changes in abundance interoceanconnectivity or a combination of both (Hudson 1990 Palsboslashllet al 2013) Several genetic and genome-based studies ofcetaceans have demonstrated how past large-scale oceanicchanges have affected the evolution of cetaceans (Steemanet al 2009) including baleen whales (Arnason et al 2018)Although the population genetic structure of humpbackwhales in the North Atlantic is not fully resolved the levelof genetic divergence among areas is very low (Larsen et al1996 Valsecchi et al 1997) Therefore the difference betweenthe two humpback whale PSMC trajectories may be due torecent admixture (Baker et al 1993 Palsboslashll et al 1995 Ruegget al 2013 Jackson et al 2014) intraspecific variation andpopulation structure (Mazet et al 2016) as well as errorsdue to differences in sequence coverage (Nadachowska-Brzyska et al 2016)

Segmental Duplications in Cetacean GenomesContain Genes Involved in Apoptosis and TumorSuppressionMammalian genomes contain gene-rich segmental duplica-tions (Alkan et al 2009) which may represent a powerfulmechanism by which new biological functions can arise(Kaessmann 2010) We employed a read-mapping approachto annotate large segmental duplications (LSDs) 10 kb inthe humpback whale genome assembly and ten additionalcetaceans for which whole-genome shotgun data were avail-able (see Materials and Methods supplementary Methodsand supplementary table 6 Supplementary Material online)We found that cetacean genomes contained on average 318LSDs (656 SD) which comprised 99 Mb (618 Mb) andaveraged 31 kb in length (624 kb) We identified10128534 bp (04) of the humpback whale genome assem-bly that comprised 293 LSDs averaging 34568 bp in length

Fifty-one of the LSDs were shared across all 11 cetaceangenomes (supplementary fig 9 Supplementary Material on-line) In order to determine the potential role of segmentalduplications during the evolution of cetacean-specific pheno-types we identified 426 gene annotations that overlappedcetacean LSDs including several genes annotated for viralresponse Other genes on cetacean LSDs were involved inaging in particular DLD in the bowhead whale andKCNMB1 in the blue whale this may reflect relevant adapta-tions contributing to longevity in two of the largest andlongest-lived mammals (Ohsumi 1979 George et al 1999)Multiple tumor suppressor genes were located on cetaceanLSDs including 1) SALL4 in the sei whale 2) TGM3 andSEMA3B in the orca 3) UVRAG in the sperm whale NorthAtlantic right whale and bowhead whale and 4) PDCD5which is upregulated during apoptosis (Zhao et al 2015)and was found in LSDs of all 11 queried cetacean genomesPDCD5 pseudogenes have been identified in the human ge-nome and several Ensembl-hosted mammalian genomescontain one-to-many PDCD5 orthologs however we anno-tated only a single copy of PDCD5 in the humpback whaleassembly This suggests that in many cases gene duplicationsare collapsed during reference assembly but can be retrievedthrough shotgun read-mapping methods (Carbone et al2014) We annotated fully resolved SALL4 and UVRAG copynumber variants in the humpback whale genome assemblyand by mapping the RNA-Seq data from skin to the genomeassembly and annotation (see Materials and methods) wefound that three annotated copies of SALL4 were expressed inhumpback whale skin as were two copies of UVRAG

We also found that145 Mb (6923 kb) of each cetaceangenome consists of LSDs not found in other cetaceans mak-ing them species-specific which averaged 244 kb (6146kb) in length (supplementary table 7 and fig 9Supplementary Material online) The minke whale genomecontained the highest number of genes on its species-specificLSDs (32) After merging the LSD annotations for the twohumpback whales we identified 57 species-specific LSDs forthis species comprising 977 kb and containing nine dupli-cated genes Humpback whale-specific duplications includedthe genes PRMT2 which is involved in growth and regulationand promotes apoptosis SLC25A6 which may be responsiblefor the release of mitochondrial products that trigger apopto-sis and NOX5 which plays a role in cell growth and apoptosis(UniProt Consortium 2015) Another tumor suppressor geneTPM3 was duplicated in the humpback whale assemblybased on our gene annotation However these extranumer-ary copies of TPM3 were not annotated on any humpbackwhale LSDs lacked introns and contained mostly the sameexons suggesting retrotransposition rather than segmentalduplication as a mechanism for their copy number expansion(Kaessmann 2010) According to the RNA-Seq data all sevencopies of TPM3 are expressed in humpback whale skin

Duplications of the tumor suppressor gene TP53 havebeen inferred as evidence for cancer suppression in elephants(Abegglen et al 2015 Caulin et al 2015 Sulak et al 2016)During our initial scans for segmental duplications we no-ticed a large pileup of reads in the MAKER-annotated

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1751

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

humpback whale TP53 (data not shown) We PCR-amplifiedcloned and sequenced this region from a humpback whaleDNA sample inferring four haplotypes that differ at two bases(supplementary Methods Supplementary Material online)After manually annotating TP53 in the humpback whalewe determined that these nucleotide variants fell in noncod-ing regions of the gene one occurred upstream of the startcodon whereas the other occurred between the first andsecond coding exons Other genomic studies have concludedthat TP53 is not duplicated in cetaceans (Yim et al 2014Keane et al 2015 Sulak et al 2016) We consider the possi-bility of at least two TP53 homologs in the genome of thehumpback whale although more data are required to resolvethis Regardless cancer suppression likely arose in differentmammalian lineages via multiple molecular etiologiesOverall our results reveal several copy number expansionsin cetaceans related to immunity aging and cancer suggest-ing that cetaceans are among the large mammals that haveevolved specific adaptations related to cancer resistance

Accelerated Regions in Cetacean Genomes AreSignificantly Enriched with Pathways Relevant toCancerIn order to determine genomic loci underlying cetacean adap-tations we estimated regions in the 12-mammal WGA withelevated substitution rates that were specific to the cetaceanbranches of the mammalian phylogeny These genomicregions departed from neutral expectations in a manner con-sistent with either positive selection or relaxed purifying se-lection along the cetacean lineage (Pollard et al 2010) Wesuccessfully mapped 3260 protein-coding genes with func-tional annotations that overlap cetacean-specific acceleratedregions which were significantly enriched for Gene Ontology(GO) categories such as cell-cell signaling (GO0007267) andcell adhesion (GO0007155) (table 4) Adaptive change in cellsignaling pathways could have maintained the ability of ceta-ceans to prevent neoplastic progression as they evolved largerbody sizes Adhesion molecules are integral to the develop-ment of cancer invasion and metastasis and these resultssuggest that cetacean evolution was accompanied by selec-tion pressure changes on both intra- and extracellular inter-actions Cetacean-specific genomic regions with elevatedsubstitution rates were also significantly enriched in genesinvolved in B-cell-mediated immunity (GO0019724) likely

due to the important role of regulatory cells which modulateimmune response to not only pathogens but perhaps tumorsas well In addition cetacean-specific acceleration in regionscontrolling complement activation (GO0006956) may haveprovided better immunosurveillance against cancer and fur-ther protective measures against malignancies (Pio et al2014) We also found that accelerated regions in cetaceangenomes were significantly enriched for genes controllingsensory perception of smell (GO0007608) perhaps due tothe relaxation of purifying selection in olfactory regions whichwere found to be underrepresented in cetacean genomes(Yim et al 2014)

Selection Pressures on Protein-Coding Genes duringCetacean Evolution Point to Many CetaceanAdaptations Including Cancer SuppressionTo gain further insight into the genomic changes underlyingthe evolution of large body sizes in cetaceans we employedphylogenetic targeting to maximize statistical power in pair-wise evolutionary genomic analyses (Arnold and Nunn 2010)This resulted in maximal comparisons between 1) the orcaand the bottlenose dolphin and 2) the humpback whale andcommon minke whale Despite their relatively recent diver-gences (eg the orcabottlenose dolphin divergence is similarin age to that of the humanchimpanzee divergence seefig 2) the species pairs of common minkehumpback andorcadolphin have each undergone extremely divergent evo-lution in body size and longevity (fig 3) Humpback whalesare estimated to weigh up to four times as much as commonminke whales and are reported to have almost double thelongevity and orcas may weigh almost 20 times as much asbottlenose dolphins also with almost double the lifespan(Tacutu et al 2012) In order to offset the tradeoffs associatedwith the evolution of large body size with the addition ofmany more cells and longer lifespans since the divergence ofeach species pair we hypothesize that necessary adaptationsfor cancer suppression should be encoded in the genomes aspredicted by Petorsquos Paradox (Tollis et al 2017)

For each pairwise comparison we inferred pairwise ge-nome alignments with the common minke whale and orcagenome assemblies as targets respectively and extractedprotein-coding orthologous genes We then estimated theratio of nonsynonymous substitutions per synonymous siteto synonymous substitutions per synonymous site (dNdS) in

Table 4 GO Terms for Biological Processes Overrepresented by Genes Overlapping Genomic Regions with Elevated Substitution Rates That AreUnique to the Cetacean Lineage

Go Term Description Number of Genes Fold Enrichment P-Valuea

GO0007608 Sensory perception of smell 157 458 124E-40GO0006956 Complement activation 39 291 491E-05GO0019724 B-cell-mediated immunity 39 291 491E-05GO0032989 G-protein-coupled receptor signaling pathway 159 244 196E-17GO0042742 Defense response to bacterium 39 244 220E-03GO0009607 Response to biotic stimulus 43 209 185E-02GO0007155 Cell adhesion 101 199 174E-06GO0007267 Cellndashcell signaling 137 169 284E-05

aAfter Bonferonni correction for multiple testing

Tollis et al doi101093molbevmsz099 MBE

1752

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

order to measure selective pressures acting on each ortholo-gous gene pair during cetacean evolution A dNdSgt1 is usedto infer potentially functional amino acid changes in candi-date genes subjected to positive selection (Fay and Wu 2003)Among an estimated 435 genes with dNdS gt1 in the com-mon minkehumpback pairwise comparison we detectedeight genes belonging to the JAK-STAT signaling pathway(39-fold enrichment Pfrac14 11E-3 Fisherrsquos exact test) and seveninvolved in cytokinendashcytokine receptor interaction (41-foldenrichment Pfrac14 17E-2 Fisherrsquos exact test) suggesting positiveselection acting on pathways involved in cell proliferationThese genes included multiple members of the tumor necro-sis factor subfamily such as TNFSF15 which inhibits angiogen-esis and promotes the activation of caspases and apoptosis(Yu et al 2001) A dNdS gt1 was also detected in seven genesinvolved in the negative regulation of cell growth(GO0030308 31-fold enrichment Pfrac14 803E-3 Fisherrsquos exacttest) and five genes involved in double-strand break repair(GO0006302 40-fold enrichment Pfrac14 803E-3 Fisherrsquos exacttest) Although these results suggest the evolution of aminoacid differences since the split between common minke andhumpback whales in genes affecting cell growth proliferationand maintenance the GO category enrichment tests did notpass significance criteria after Bonferroni corrections for mul-tiple testing We found 18 genes that are mutated in cancersaccording to the COSMIC v85 database (Forbes et al 2015) inthe common minkehumpback comparison including a sub-set of five annotated as tumor suppressor genes oncogenesor fusion genes in the Cancer Gene Census (CGC Futreal et al2004) which are highlighted in table 5 The complete list ofCOSMIC genes with elevated dNdS in the pairwise compar-isons is given in supplementary table 8 SupplementaryMaterial online We detected 555 orthologous genes withdNdS gt1 in the orcadolphin comparison which are signifi-cantly enriched (after Bonferroni correction for multiple test-ing) for biological processes such as immune response cellactivation and regulation of cytokines (table 6) and 41 of

which are known cancer genes according to COSMIC andCGC (table 5 supplementary table 8 Supplementary Materialonline) These results are consistent with our accelerated re-gion analysis based on the WGA which showed acceleratedevolution in immunity pathways (above see table 4) Forinstance eight genes (CD58 CD84 KLF13 SAMSN1 CTSGGPC3 LTF and SPG21) annotated for immune system process(GO0002376) were found in cetacean-specific acceleratedgenomic regions and also had a pairwise dNdS gt1 in theorcadolphin comparison mirroring other recent genomicanalyses of immunity genes in orcas (Ferris et al 2018) Ourresults also suggest that the evolution of gigantism and longlifespans in cetaceans was accompanied by selection actingon many genes related to somatic maintenance and cellsignaling

As a more accurate assessment of selection pressure var-iation acting on protein-coding genes across cetacean evolu-tion we conducted an additional assessment of dNdS usingbranch-site codon models implemented in codeml (Yang1998) We employed extensive filtering of the branch siteresults including both false discovery rate (FDR) andBonferroni corrections for multiple testing (see Materialsand methods) and conservatively estimated that 450protein-coding genes were subjected to positive selection incetaceans These include 54 genes along the ancestral ceta-cean branch 12 along the ancestral toothed whale branch 84along the ancestral baleen whale branch 74 in the ancestor ofcommon minke and humpback whales and 212 unique tothe humpback whale branch (fig 4A) Cetacean positivelyselected genes were annotated for functions related to exten-sive changes in anatomy growth cell signaling and cell pro-liferation (fig 4B) For instance in the branch-site models forhumpback whale positively selected genes are enriched forseveral higher-level mouse limb phenotypes including thoseaffecting the limb long bones (MP0011504 15 genes FDR-corrected P-valuefrac14 0001) and more specifically the hindlimb stylopod (MP0003856 seven genes FDRfrac14 0024) or

FIG 3 Diversity in both body size and lifespan within rorqual baleen whales (Balaenopteridae) and dolphins (Delphinidae) Maximal pairings usingphylogenetic targeting (Arnold and Nunn 2010) of genome assembly-enabled cetaceans resulted in the most extreme divergence in both body sizeand lifespan between humpback whale (Megaptera novaeangliae) and common minke whale (Balaenoptera acutorostrata) within theBalaenopteridae facing right and orca (Orcinus orca) and bottlenose dolphin (Tursiops truncatus) within the Delphinidae facing left Traitdata were collected from the panTHERIA (Jones et al 2009) and AnAge (Tacutu et al 2012) databases

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1753

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Table 5 CGC Genes with dNdS gt 1 as Revealed by Pairwise Comparisons of Cetacean Genomes

Comparison Gene Symbol Gene Name Role in Cancer Function

Minkehumpback CD274 CD274 molecule TSG Plays a critical role in induction and maintenance ofimmune tolerance to selfa

ETNK1 Ethanolamine kinase 1 TSG Suppresses escaping of programmed cell deathb

IL21R Interleukin 21 receptor Fusion The ligand binding of this receptor leads to the acti-vation of multiple downstream signaling mole-cules including JAK1 JAK3 STAT1 and STAT32

MYOD1 Myogenic differentiation 1 Fusion Regulates muscle cell differentiation by inducing cellcycle arrest a prerequisite for myogenic initiationa

PHF6 PHD finger protein 6 TSG Encodes a protein with two PHD-type zinc fingerdomains indicating a potential role in transcrip-tional regulation that localizes to the nucleolusa

Orcadolphin BTG1 B-cell translocation gene 1antiproliferative

TSG fusion Member of an antiproliferative gene family that reg-ulates cell growth and differentiationa

CD274 CD274 molecule TSG fusion Plays a critical role in induction and maintenance ofimmune tolerance to selfa

FANCD2 Fanconi anemia comple-mentation group D2

TSG Suppresses genome instability and mutations pro-motes escaping programmed cell death suppressesproliferative signaling suppresses invasion andmetastasisb

FAS Fas cell surface deathreceptor

TSG Promotes cell replicative immortality promotesproliferative signaling promotes invasion and me-tastasis suppresses escaping programmed celldeathb

FGFR4 Fibroblast growth factor re-ceptor 4

Oncogene Promotes proliferative signaling promotes invasionand metastasisb

GPC3 Glypican 3 Oncogene TSG Promotes invasion and metastasis promotes sup-pression of growthb

HOXD11 Homeobox D11 Oncogene fusion The homeobox genes encode a highly conservedfamily of transcription factors that play an impor-tant role in morphogenesis in all multicellularorganismsa

HOXD13 Homeobox D13 Oncogene fusion

LASP1 LIM and SH3 protein 1 Fusion The encoded protein has been linked to metastaticbreast cancer hematopoetic tumors such as B-celllymphomas and colorectal cancera

MLF1 Myeloid leukemia factor 1 TSG fusion This gene encodes an oncoprotein which is thought toplay a role in the phenotypic determination ofhemopoetic cells Translocations between this geneand nucleophosmin have been associated withmyelodysplastic syndrome and acute myeloidleukemiaa

MYB v-myb myeloblastosis viraloncogene homolog

Oncogene fusion This gene may be aberrantly expressed or rearrangedor undergo translocation in leukemias and lym-phomas and is considered to be an oncogenea

MYD88 Myeloid differentiation pri-mary response gene (88)

Oncogene Promotes escaping programmed cell death promotesproliferative signaling promotes invasion and me-tastasis promotes tumor promotinginflammationb

NR4A3 Nuclear receptor subfamily4 group A member 3(NOR1)

Oncogene fusion Encodes a member of the steroidndashthyroid hormonendashretinoid receptor superfamily that may act as atranscriptional activatora

PALB2 Partner and localizer ofBRCA2

TSG This protein binds to and colocalizes with the breastcancer 2 early onset protein (BRCA2) in nuclear fociand likely permits the stable intranuclear localiza-tion and accumulation of BRCA2a

PML Promyelocytic leukemia TSG fusion Expression is cell-cycle related and it regulates the p53response to oncogenic signalsa

RAD21 RAD21 homolog(Schizosaccharomycespombe)

Oncogene TSG Promotes invasion and metastasis suppresses ge-nome instability and mutations suppresses escap-ing programmed cell deathb

STIL SCLTAL1 interruptinglocus

Oncogene fusion Encodes a cytoplasmic protein implicated in regula-tion of the mitotic spindle checkpoint a regulatorypathway that monitors chromosome segregationduring cell division to ensure the proper distribu-tion of chromosomes to daughter cellsa

(continued)

Tollis et al doi101093molbevmsz099 MBE

1754

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

femur (MP0000559 six genes FDRfrac14 0019) These pheno-types are reminiscent of the developmental basis of hind limbloss in cetaceans embryonic studies show that hind limbbuds are initially formed but disappear by the fifth gestationalweek (Thewissen et al 2006) Enriched mouse phenotypes arealso related to the unique cetacean axial skeleton(MP0002114 25 genes FDRfrac14 0016) most notably in theskull including craniofacial bones (MP0002116 17 genesFDRfrac14 0018) teeth (MP0002100 nine genes FDRfrac14 0004)and the presphenoid (MP0030383 three genesFDRfrac14 0003) Past analyses of the cetacean basicranial ele-ments revealed that the presphenoid was extensively modi-fied along the cetacean lineage (Ichishima 2016)

Positively selected genes unique to the humpback whalewere significantly enriched for a single biological process reg-ulation of cell cycle checkpoint (GO1901976 1857-fold en-richment Pfrac14 002 after Bonferonni correction for multipletesting) suggesting positive selection in pathways that con-trol responses to endogenous or exogenous sources of DNAdamage and limit cancer progression (Kastan and Bartek2004) We detected a significant number of proteinndashproteininteractions among humpback whale-specific positively se-lected genes (number of nodesfrac14 204 number of edgesfrac14 71expected number of edgesfrac14 51 Pfrac14 0004 supplementaryfig 10 Supplementary Material online) including genes thatare often coexpressed and involved in DNA repair DNA rep-lication and cell differentiation For instance we identifiedsignificant interactions between DNA2 which encodes a heli-case involved in the maintenance of DNA stability andWDHD1 which acts as a replication initiation factorAnother robust protein interaction network was detected

between a number of genes involved in the genesis and main-tenance of primary cilia The highest scoring functional an-notation clusters resulted in key words such as ciliopathy(seven genes) and cell projection (16 genes) and GO termssuch as cilium morphogenesis cilium assembly ciliary basalbody and centriole The primary cilia of multicellular eukar-yotes control cell proliferation by mediating cell-extrinsic sig-nals and regulating cell cycle entry and defects in ciliaryregulation are common in many cancers (Michaud andYoder 2006)

Our branch-site test results indicated that the evolution ofcetacean gigantism was accompanied by strong selection onmany pathways that are directly linked to cancer (fig 4C) Weidentified 33 genes that are mutated in human cancers(according to the COSMIC database) that were inferred assubjected to positive selection in the humpback whale line-age including the known tumor suppressor genes ATR whichis a protein kinase that senses DNA damage upon genotixicstress and activates cell cycle arrest and RECK which sup-presses metastasis (Forbes et al 2015) Multiple members ofthe PR domain-containing gene family (PRDM) evolved un-der positive selection across cetaceans including the tumorsuppressor genes PRDM1 whose truncation leads to B-cellmalignancies and PRDM2 which regulates the expressionand degradation of TP53 (Shadat et al 2010) and whoseforced expression causes apoptosis and cell cycle arrest incancer cell lines (Fog et al 2012) In baleen whales ERCC5which is a DNA repair protein that partners with BRCA1 andBRCA2 to maintain genomic stability (Trego et al 2016) andsuppresses UV-induced apoptosis (Clement et al 2006)appeared to have been subjected to positive selection as

Table 5 Continued

Comparison Gene Symbol Gene Name Role in Cancer Function

TAL1 T-cell acute lymphocyticleukemia 1 (SCL)

Oncogene fusion Implicated in the genesis of hemopoietic malignan-cies and may play an important role in hemopoieticdifferentiationa

TNFRSF14 Tumor necrosis factor re-ceptor superfamilymember 14 (herpesvirusentry mediator)

TSG The encoded protein functions in signal transductionpathways that activate inflammatory and inhibi-tory T-cell immune responsea

TNFRSF17 Tumor necrosis factor re-ceptor superfamilymember 17

Oncogene fusion This receptor also binds to various TRAF familymembers and thus may transduce signals for cellsurvival and proliferationa

NOTEmdashTSG tumor suppressor geneaSource RefSeqbSource Cancer Hallmark from CGC

Table 6 GO Terms for Biological Processes Overrepresented by Genes with Pairwise dNdS gt 1 in the Orca Bottlenose Dolphin Comparison

GO Term Description Number of Genes Fold Enrichment P-Valuea

GO0031347 Regulation of defense response 36 270 291E-2GO0050776 Regulation of immune response 50 218 195E-3GO0002694 Regulation of leukocyte activation 31 249 189E-2GO0002275 Myeloid cell activation involved in immune response 29 253 237E-2GO0002699 Positive regulation of immune effector process 19 446 101-E3GO0042108 Positive regulation of cytokine biosynthetic process 9 687 223E-2

aAfter Bonferonni correction for multiple testing

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1755

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

well Across all the branch-site models positively selectedgenes represented multiple functional categories relevant tocancer and Petorsquos Paradox

Among the cancer-related genes subjected to positiveselection in cetaceans we identified two with identicalamino acid changes among disparate taxa united by thetraits of large body size andor extreme longevitySpecifically PRDM13 is a tumor suppressor gene that actsas a transcriptional repressor and we found identical DEamino acid substitutions in this gene in sperm whale dol-phin orca and humpback but also manatee (Trichechusmanatus) and African elephant (Loxodonta africana) whichare large-bodied afrotherian mammals that have been thefocus of cancer suppression research (Abegglen et al 2015Sulak et al 2016) Secondly POLE is a cancer-related genethat participates in DNA repair and replication and weobserved one IV substitution shared among orca dol-phin bowhead humpback and common minke whalebut also elephant as well as a second IV substitutionshared with these cetaceans and the little brown bat(Myotis lucifugus) Vesper bats such as M lucifugus areknown for their exceptional longevity relative to theirbody size and have been proposed as model organisms insenescence and cancer research (Foley et al 2018) Parallelchanges in cancer-related genes across these

phylogenetically distinct mammals suggest natural selectionhas acted on similar pathways that limit neoplastic progres-sion in large and long-lived species (Tollis et al 2017)

Petorsquos Paradox and Cancer in Whales and Other LargeMammalsLarge body size has evolved numerous times in mammalsand although it is exemplified in some extant cetaceans gi-gantism is also found in afrotherians perissodactyls and car-nivores (Baker et al 2015) Our results suggest that cancersuppression in large and long-lived mammals has also evolvednumerous times However none of these species iscompletely immune to cancer Elephants have at least a 5lifetime risk of cancer mortality (Abegglen et al 2015) whichis far less than humans but detecting cancer and estimatingcancer incidence and mortality rates in wild cetaceans is morechallenging Mathematical modeling predicting the lifetimerisk of colorectal cancer in mice and humans yielded a rate ofcolorectal cancer at 50 in blue whales by age 50 and 100by age 90 (Caulin et al 2015) This high rate of cancer mor-tality is an unlikely scenario and taken with our genomicresults presented here it suggests that cetaceans have evolvedmechanisms to limit their overall risk of cancer Among ba-leen whales benign neoplasms of the skin tongue and centralnervous system have been reported in humpback whales and

Orca

Minke whale

Cow

Bowhead whale

Bottlenose dolphin

Humpback whale

Sperm whale54

12

84

74

212A

ATRT Suppression of growth escaping programmedcell death change of cellular energetics

Humpback whale

BCORL1TF Invasion and metastasis

PICALMT Change of cellular energetics

PRDM2T Regulation of TP53 expression and degredation

TPRF Involved in the activation of oncogenic kinases

AMER1T Promotes suppression of growth

CCNB1IP1T Suppresses invasion and metastasis

POLET Participates in DNA repair and DNA replication

Cetacea ancestor

KMT2CT Methylates rsquoLys-4rsquo of histone H3

POU2AF1 Promotes proliferative signalingO

PRDM1T Regulation of TP53 expression anddegradation

ERCC5T Excision repair DNA double-strand breakrepair following UV damage

Mysticeti ancestor

PMLT Positively regulates TP53 by inhibiting MDM2-dependent degredation

Odontoceti ancestor

MUC1F Overexpressed in epithelial tumors

Minke and humpback whale ancestor

RETO Plays a role in cell differentiation and growth

C COSMIC GENES

anatomicalstructure

formation involvedin morphogenesis

autophagy

cellproliferation

cellminuscellsignaling

circulatorysystemprocess

growth

pigmentation

proteinmaturation

ribonucleoproteincomplexassembly

symbiosisencompassing

mutualismthrough

parasitismvacuolartransport

B

20

FIG 4 Positively selected genes during cetacean evolution (A) Species tree relationships of six modern cetaceans with complete genomeassemblies estimated from 152 single-copy orthologs Branch lengths are given in coalescent units Outgroup taxa are not shown The completespecies tree of 28 mammals is shown in supplementary figure 4 Supplementary Material online Boxes with numbers indicate the number ofpositively selected genes passing filters and a Bonferroni correction detected on each branch (B) TreeMap from REVIGO for GO biologicalprocesses terms represented by genes evolving under positive selection across all cetaceans Rectangle size reflects semantic uniqueness of GOterm which measures the degree to which the term is an outlier when compared semantically to the whole list of GO terms (C) Cancer gene namesand functions from COSMIC found to be evolving under positive selection in the cetacean branch-site models Superscripts for gene namesindicate as follows T tumor suppressor gene O oncogene F fusion gene Asterisks indicate P-value following FDR correction for multiple testingPlt 001 Plt 0001 Plt 00001

Tollis et al doi101093molbevmsz099 MBE

1756

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

ovarian carcinomas and lymphomas have been detected infin whales (Newman and Smith 2006) Among smaller ceta-ceans one unusually well-documented case study concludedthat 27 of beluga whales (Delphinapterus leucas) found deadin the St Lawrence estuary had cancer which may have con-tributed to 18 of the total mortality in that population(Martineau et al 2002) The authors suggested that thehigh degree of polycyclic aromatic hydrocarbons releasedinto the estuary by nearby industry may have contributedto this elevated cancer risk (Martineau et al 2002) Bycontrast the larger baleen whale species in the Gulf ofSt Lawrence appear to have lower contaminant burdenslikely due to ecological differences (Gauthier et al 1997)Interestingly unlike in human cells homologous recom-bination is uninhibited in North Atlantic right whale lungcells following prolonged exposure to the human lungcarcinogen particulate hexavalent chromium (Browninget al 2017) suggesting adaptations for high-fidelity DNArepair in whales

In this study we provide a de novo reference assemblyfor the humpback whalemdashone of the more well-studiedgiants living on Earth today The humpback whale genomeassembly is highly contiguous and contains a comparablenumber of orthologous genes to other mammalian ge-nome projects Our comparisons with other complete ce-tacean genomes confirm the results of other studies whichconcluded that rorqual whales likely began diversifyingduring the Miocene (Slater et al 2017 Arnason et al2018) We found indications of positive selection onmany protein-coding genes suggestive of adaptive changein pathways controlling the mammalian appendicular andcranial skeletal elements which are relevant to highly spe-cialized cetacean phenotypes as well as in many immunitygenes and pathways that are known to place checks onneoplastic progression LSDs in cetacean genomes containmany genes involved in the control of apoptosis includingknown tumor suppressor genes and skin transcriptomeresults from humpback whale suggest many gene duplica-tions whether through segmental duplication or retro-transposition are transcribed and hence likely functionalWe also use genome-wide evidence to show that germlinemutation rates may be slower in cetaceans than in othermammals which has been suggested in previous studies(Jackson et al 2009) and we suggest as a corollary thatcetacean somatic mutations rates may be lower as wellThese results are consistent with predictions stemmingfrom Petorsquos Paradox (Peto et al 1975 Caulin and Maley2011) which posited that gigantic animals have evolvedcompensatory adaptations to cope with the negativeeffects of orders of magnitude more cells and long lifespansthat increase the number of cell divisions and cancer riskover time Altogether the humpback whale genome as-sembly will aid comparative oncology research that seeksto improve therapeutic targets for human cancers as wellas provide a resource for developing useful genomicmarkers that will aid in the population management andconservation of whales

Materials and Methods

Tissue Collection and DNA ExtractionBiopsy tissue was collected from an adult female humpbackwhale (ldquoSaltrdquo NCBI BioSample SAMN1058501) in the Gulf ofMaine western North Atlantic Ocean using previously de-scribed techniques (Lambertsen 1987 Palsboslashll et al 1991) andflash frozen in liquid nitrogen We extracted DNA from skinusing the protocol for high-molecular-weight genomic DNAisolation with the DNeasy Blood and Tissue purification kit(Qiagen) Humpback whales can be individually identifiedand studied over time based on their unique ventral flukepigmentation (Katona and Whitehead 1981) Salt was specif-ically selected for this study because of her 35-year priorsighting history which is among the lengthiest and detailedfor an individual humpback whale (Center for Coastal Studiesunpublished data)

De Novo Assembly of the Humpback Whale GenomeUsing a combination of paired-end and mate-pair libraries denovo assembly was performed using Meraculous 204(Chapman et al 2011) with a kmer size of 47 Reads weretrimmed for quality sequencing adapters and mate-pairadapters using Trimmomatic (Bolger et al 2014) The genomesize of the humpback whale was estimated using the shortreads by counting the frequency of kmers of length 27 oc-curring in the 180-bp data set estimating the kmer coverageand using the following formula genome size frac14 total kmers kmer coverage

Chicago Library Preparation and SequencingFour Chicago libraries were prepared as described previously(Putnam et al 2016) Briefly for each library500 ng of high-molecular-weight genomic DNA (mean fragment lengthgt50kb) was reconstituted into chromatin in vitro and fixed withformaldehyde Fixed chromatin was digested with MboI orDpnII the 50-overhangs were repaired with biotinylatednucleotides and blunt ends were ligated After ligation cross-links were reversed and the DNA purified from protein Biotinthat was not internal to ligated fragments was removed fromthe purified DNA The DNA was then sheared to 350 bpmean fragment size and sequencing libraries were generatedusing NEBNext Ultra (New England BioLabs) enzymes andIllumina-compatible adapters Biotin-containing fragmentswere isolated using streptavidin beads before PCR enrichmentof each library The libraries were sequenced on an IlluminaHiSeq 2500 platform

Scaffolding the De Novo Assembly with HiRiseThe input de novo assembly shotgun reads and Chicagolibrary reads were used as input data for HiRise a softwarepipeline designed specifically for using Chicago data to scaf-fold genome assemblies (Putnam et al 2016) Shotgun andChicago library sequences were aligned to the draft inputassembly using a modified SNAP read mapper (httpsnapcsberkeleyedu) The separations of Chicago read pairsmapped within draft scaffolds were analyzed by HiRise toproduce a likelihood model for genomic distance between

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1757

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

read pairs and the model was used to identify putative mis-joins and to score prospective joins After scaffolding shotgunsequences were used to close gaps between contigs

Assessing the Gene Content of the Humpback WhaleAssemblyThe expected gene content of the assembly was evaluatedusing the Core Eukaryotic Genes Mapping Approach (Parraet al 2009) which searches the assembly for 458 highly con-served proteins and reports the proportion of 248 of the mosthighly conserved orthologs that are present in the assemblyWe also used the Benchmarking Universal Single CopyOrthologs (BUSCO v201 Sim~ao et al 2015) which analyzesgenome assemblies for the presence of 3023 genes conservedacross vertebrates as well as a set of 6253 genes conservedacross laurasiatherian mammals

Transcriptome Sequencing and AssemblyIn order to aid in our gene-finding efforts for the humpbackwhale genome assembly and to measure gene expression wegenerated transcripts from skin tissue by extracting total RNAusing the QIAzol Lysis Reagent (Qiagen) followed by purifi-cation on RNeasy spin columns (Qiagen) RNA integrity andquantity were determined on the Agilent 2100 Bioanalyzer(Agilent) using the manufacturerrsquos protocol The total RNAwas treated with DNase using DNase mix from the RecoverAllTotal Nucleic Acid Isolation kit (Applied BiosystemsAmbion) The RNA library was prepared and sequenced bythe Genome Technology Center at the University ofCalifornia Santa Cruz including cDNA synthesis with theOvation RNA-Seq system V2 (Nugen) and RNA amplificationas described previously (Tariq et al 2011) We used 05ndash1mgof double-stranded cDNA for library preparation shearedusing the Covaris S2 size-selected for 350ndash450 bp using au-tomated electrophoretic DNA fractionation system(LabChipXT Caliper Life Sciences) Paired-end sequencing li-braries were constructed using Illumina TruSeq DNA SamplePreparation Kit Following library construction samples werequantified using the Bioanalyzer and sequenced on theIllumina HiSeq 2000 platform to produce 2 100 bp se-quencing reads We then used Trinity (Grabherr et al 2011)to assemble the adapter-trimmed RNA-Seq reads intotranscripts

Genome AnnotationWe generated gene models for the humpback whale usingmultiple iterations of MAKER2 (Holt and Yandell 2011) whichincorporated 1) direct evidence from the Trinity-assembledtranscripts 2) homology to NCBI proteins from ten mammals(human mouse dog cow sperm whale bottlenose dolphinorca bowhead whale common minke whale and baiji) andUniProtKBSwiss-Prot (UniProt Consortium 2015) and 3) abinitio gene predictions using SNAP (11292013 release Korf2004) and Augustus v302 (Stanke et al 2008) A detaileddescription of the annotation pipeline is provided in the sup-plementary Methods Supplementary Material online Finalgene calls were annotated functionally by BlastP similarity to

UniProt proteins (UniProt Consortium 2015) with an e-valuecutoff of 1e-6

Repeat Annotation and Evolutionary AnalysisTo analyze the repetitive landscape of the humpback whalegenome we used both database and de novo modelingmethods For the database method we ran RepeatMaskerv405 (httpwwwrepeatmaskerorg accessed August 212017) (Smit et al 2015a) on the final assembly indicatingthe ldquomammaliardquo repeat library from RepBase (Jurka et al2005) For the de novo method we scanned the assemblyfor repeats using RepeatModeler v108 (httpwwwrepeat-maskerorg) (Smit et al 2015b) the results of which were thenclassified using RepeatMasker To estimate evolutionary di-vergence within repeat subfamilies in the humpback whalegenome we generated repeat-family-specific alignments andcalculated the average Kimura-2-parameter divergence fromconsensus within each family correcting for high mutationrates at CpG sites with the calcDivergenceFromAlignplRepeatMasker tool We compared the divergence profile ofhumpback whale and bowhead whale by completing parallelanalyses and the repetitive landscapes of orca and bottlenosedolphin are available from the RepeatMasker server (httpwwwrepeatmaskerorgspecies accessed August 21 2017)

Analysis of Gene Expression Using RNA-SeqSplice-wise mapping of RNA-Seq reads against the humpbackwhale genome assembly and annotation was carried out us-ing STAR v24 (Dobin et al 2013) and we counted the num-ber of reads mapping to gene annotations We also mappedthe skin RNA-Seq data to the database of annotated hump-back whale transcripts using local alignments with bowtiev225 (Langmead and Salzberg 2012) and used stringtiev134 (Pertea et al 2015) to calculate gene abundances bytranscripts per million

Analysis of Segmental Duplications in CetaceanGenomesIn order to detect LSDs in several cetacean genomes we ap-plied an approach based on depth of coverage (Alkan et al2009) To this end we used whole-genome shotgun sequencedata from the current study as well as from other cetaceangenomics projects All data were mapped against the hump-back whale reference assembly A detailed description of thesegmental duplication analysis is provided in the supplemen-tary Methods Supplementary Material online

Whole-Genome AlignmentsWe generated WGAs of 12 mammals (supplementary table 2Supplementary Material online) First we generated pairwisesyntenic alignments of each species as a query to the humangenome (hg19) as a target using LASTZ v102 (Harris 2007)followed by chaining to form gapless blocks and netting torank the highest scoring chains (Kent et al 2003) The pairwisealignments were used to construct a multiple sequence align-ment with MULTIZ v112 (Blanchette et al 2004) with humanas the reference species We filtered the MULTIZ alignment to

Tollis et al doi101093molbevmsz099 MBE

1758

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

only contain aligned blocks from at least 10 out of the 12species (81 complete)

Phylogenetic Reconstruction Using Single-CopyOrthologsWe downloaded the coding DNA sequences from 28 publiclyavailable mammalian genome assemblies (supplementary ta-ble 9 Supplementary Material online) and used VESPA(Webb et al 2017) to obtain high-confidence SGOs (supple-mentary Methods Supplementary Material online) For phy-logenetic analysis we filtered the SGO data set to include onlyloci that were represented by at least 24 out of the 28 mam-malian species (86 complete) and reconstructed each genetree using maximum likelihood in PhyML v30 (Guindon et al2010) with an HKY85 substitution model and 100 bootstrapreplicates to assess branch support The gene trees were thenbinned and used to reconstruct a species tree using the ac-curate species tree algorithm (ASTRAL-III v56 Zhang et al2018) ASTRAL utilizes the multispecies coalescent modelthat incorporates incomplete lineage sorting and finds thespecies tree stemming from bipartitions predefined by thegene trees Branch support for the species tree was assessedwith local posterior probabilities and branch lengths werepresented in coalescent units where shorter branch lengthsindicate greater gene tree discordance (Sayyari and Mirarab2016)

Rates of Molecular Evolution and Divergence TimeEstimationWe used multiple approaches on independent data sets toestimate rates of molecular evolution and the divergencetimes of the major mammalian lineages including six modernwhales with complete genome assemblies We first focusedon 4-fold degenerate (4D) sites which are positions withincodon alignments where substitutions result in no aminoacid change and can be used to approximate the neutralrate of evolution (Kumar and Subramanian 2002) We usedthe Ensembl human gene annotation to extract codingregions from the 12-mammal WGA using msa_view inPHAST v14 (Hubisz et al 2011) We reconstructed the phy-logeny with the 4D data as a single partition in RAxML v83(Stamatakis 2014) under the GTRGAMMA substitutionmodel and assessed branch support with 10000 bootstrapsRates of molecular evolution were estimated on the 4D dataset with the semiparametric PL method implemented in r8sv18 (Sanderson 2002 2003) A detailed description of the PLmethod is given in the supplementary MethodsSupplementary Material online

We also used the approximate likelihood calculation inMCMCtree (Yang and Rannala 2006) to estimate divergencetimes using independent data sets 1) the above-mentioned4D data set derived from the WGA as well as 2) a set of theSGOs that included 24 out of 28 sampled taxa (86) and waspartitioned into three codon positions We implemented theHKY85 substitution model multiple fossil-based priors (sup-plementary table 10 Supplementary Material online Mitchell1989 Benton et al 2015 Hedges et al 2015) and independentrates (ldquoclockfrac14 3rdquo) along branches All other parameters were

set as defaults For each MCMCtree analysis we ran the anal-ysis three times with different starting seeds and modified theMarkov chain Monte Carlo (MCMC) length and samplingfrequency in order to achieve proper chain convergencemonitored with Tracer v17 We achieved proper MCMCconvergence on the 4D data set after discarding the first500000 steps as burn-in and sampling every 2000 steps untilwe collected 20000 samples We achieved proper MCMCconvergence on the SGO data set after discarding the first500000 steps as burn-in and sampling every 10000 steps untilwe collected 10000 samples

Demographic AnalysisWe used the PSMC (Li and Durbin 2011) to reconstruct thepopulation history of North Atlantic humpback whales in-cluding the individual sequenced in the current study (down-sampled to 20 coverage) and a second individualsequenced at 17 coverage in Arnason et al (2018) Adetailed description of the PSMC analysis is provided in thesupplementary Methods Supplementary Material online

Nonneutral Substitution Rates in Cetacean GenomesIn order to identify genomic regions controlling cetacean-specific adaptations we used phyloP (Pollard et al 2010) todetect loci in the 12-mammal WGA that depart from neutralexpectations (see supplementary Methods SupplementaryMaterial online) We then collected accelerated regions thatoverlapped human whole gene annotations (hg19) using bed-tools intersect (Quinlan and Hall 2010) and tested for theenrichment of GO terms using the PANTHER analysis toolavailable at the Gene Ontology Consortium website (GOOntology database last accessed June 2017) (GeneOntology Consortium 2015)

Detection of Protein-Coding Genes Subjected toPositive SelectionIn order to measure selective pressures acting on protein-coding genes during cetacean evolution with an emphasison the evolution of cancer suppression we estimated theratio of nonsynonymous to synonymous substitutions (dNdS) To maximize statistical power in pairwise comparisonsgiven the number of available cetacean genomes (six lastaccessed September 2017) we implemented phylogenetictargeting (Arnold and Nunn 2010) assuming a phylogenyfrom a mammalian supertree (Fritz et al 2009) To selectgenome assemblies most suitable for assessing PetorsquosParadox we weighted scores for contrasts with a lot of changein the same direction for both body mass and maximumlongevity Trait values were taken from panTHERIA (Joneset al 2009) and we selected maximal pairings based on thestandardized summed scores We then generated pairwisegenome alignments as described above based on the phylo-genetic targeting results For each pairwise genome align-ment we stitched gene blocks in Galaxy (Blankenberg et al2011) according to the target genome annotations produc-ing alignments of one-to-one orthologs which were filteredto delete frameshift mutations and replace internal stopcodons with gaps We then estimated pairwise dNdS for every

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1759

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

orthologous gene pair with KaKs_Calculator v20 (Wang et al2010) To link genes with dNdS gt1 to potential phenotypeswe used orthologous human Ensembl gene IDs to collect GOterms in BioMart (Kinsella et al 2011) and tested for enrich-ment of overrepresented GO terms

We also used codon-based models to test for selectivepressure variation along branches of the cetacean phylog-eny in comparison to other mammal lineages also knownas the branch-site test (Yang 2007) First the known spe-cies phylogeny (Morgan et al 2013 Tarver et al 2016) waspruned to correspond to the species present in each SGOfamily SGO nucleotide alignments that contained morethan seven species were analyzed for selective pressurevariation This is to reduce the risk of detecting falsepositives (Anisimova et al 2001 2002) In general thebranch-site test is a powerful yet conservative approach(Gharib and Robinson-Rechavi 2013) although modelmisspecification and alignment errors can greatly increasethe number of false positives (Anisimova et al 20012002) Recent studies have concluded that many pub-lished inferences of adaptive evolution using the branchsite test may be affected by artifacts (Venkat et al 2018)Therefore extensive filtering is necessary in order to makereasonably sound conclusions from results of the branchsite test A detailed description of all tested models andthe filtering process are given in the supplementaryMethods Supplementary Material online In total 1152gene families were analyzed We carried out the branch-site test using PAML v44e (Yang 2007) The following fivebranches were assessed as foreground humpback whalethe most recent common ancestor (MRCA) of the com-mon minke and humpback whales MRCA of baleenwhales MRCA of toothed whales and the MRCA of allwhales (cetacean stem lineage) For each model we keptall genes that met a significance threshold of Plt 005 aftera Bonferonni correction for multiple hypothesis testingusing the total number of branch genes (five foregroundbranches1152 genes) We also corrected the raw P-val-ues from the likelihood ratio tests of every gene by theFDR correction where qfrac14 005 (Benjamini and Hochberg1995) The Bonferroni correction is more conservativethan the FDR but sufficient for multiple hypothesis test-ing of lineage-specific positive selection in these casesthe FDR results in higher probabilities of rejecting truenull hypotheses (Anisimova and Yang 2007) Thereforewe used Bonferroni-corrected results in downstreamanalyses but also report FDR-corrected P-values infigure 4C Genes were identified based on the humanortholog (Ensembl gene ID) and we performed gene an-notation enrichment analysis and functional annotationclustering with DAVID v68 (Huang et al 2009a 2009b) aswell as semantic clustering of GO terms using REVIGO(Supek et al 2011) We also searched for interactions ofpositively selected proteins using STRING v105(Szklarczyk et al 2017) with default parameters andtested for the enrichment of overrepresented GO termsas above and for associated mouse phenotypes usingmodPhEA (Weng and Liao 2017)

Data AvailabilityAll data that contributed to the results of the study are madepublicly available The genomic sequencing RNA sequencingas well as the genome assembly for humpback whale(GCA_0043293851) are available under NCBI BioProjectPRJNA509641 The gene annotation orthologous gene setspositive selection results segmental duplication annotationsand whole-genome alignments used in this study are availableat the Harvard Dataverse (httpsdoiorg107910DVNADHX1O)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsField data collection was supported by the Center for CoastalStudies and conducted under NOAA research permit 16325This work was initiated with funding from Dr Jeffrey Pearl atUC San Fransisco This work was supported in part by NIH(Grants U54 CA217376 U2C CA233254 P01 CA91955 R01CA149566 R01 CA170595 R01 CA185138 and R01CA140657) as well as CDMRP Breast Cancer ResearchProgram (Award BC132057) and the Arizona BiomedicalResearch Commission (Grant ADHS18-198847) to CCMThe findings opinions and recommendations expressedhere are those of the authors and not necessarily those ofthe universities where the research was performed or theNational Institutes of Health PJP would like to acknowledgefunding from Stockholm University and the University ofGroningen Startup funds were provided by the School ofInformatics Computing and Cyber Systems at NorthernArizona University (MT) MJOrsquoC would like to thank theUniversity of Leeds funding her 250 Great Minds UniversityAcademic Fellowship The authors thank Wensi Hao(University of Groningen) for technical support AndreaCabrera (University of Groningen) for helpful conversationsabout mutation rates and demography and Richard E Green(UC Santa Cruz) for consultations regarding genome se-quencing assembly and annotation The authors acknowl-edge Research Computing at Arizona State University (httpwwwresearchcomputingasuedu) the Advanced ResearchComputing resources at the University of Leeds and theMonsoon computing cluster at Northern ArizonaUniversity (httpsnaueduhigh-performance-computing)for providing high-performance computing and storageresources that have contributed to this study

ReferencesAbegglen LM Caulin AF Chan A Lee K Robinson R Campbell MS Kiso

WK Schmitt DL Waddell PJ Bhaskara S 2015 Potential mechanismsfor cancer resistance in elephants and comparative cellular responseto DNA damage in humans JAMA 314(17)1850ndash1860

Alkan C Kidd JM Marques-Bonet T Aksay G Antonacci F HormozdiariF Kitzman JO Baker C Malig M Mutlu O 2009 Personalized copynumber and segmental duplication maps using next-generation se-quencing Nat Genet 41(10)1061ndash1067

Tollis et al doi101093molbevmsz099 MBE

1760

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Anisimova M Bielawski JP Yang Z 2001 Accuracy and power of thelikelihood ratio test in detecting adaptive molecular evolution MolBiol Evol 18(8)1585ndash1592

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19(6)950ndash958

Anisimova M Yang Z 2007 Multiple hypothesis testing to detect line-ages under positive selection that affects only a few sites Mol BiolEvol 24(5)1219ndash1228

Arnason U Lammers F Kumar V Nilsson MA Janke A 2018 Whole-genome sequencing of the blue whale and other rorquals findssignatures for introgressive gene flow Sci Adv 4(4)eaap9873

Arnason U Widegren B 1989 Composition and chromosomal localiza-tion of cetacean highly repetitive DNA with special reference to theblue whale Balaenoptera musculus Chromosoma 98(5)323ndash329

Arnold C Nunn CL 2010 Phylogenetic targeting of research effort inevolutionary biology Am Nat 176(5)601ndash612

Baker CS Perry A Bannister JL Weinrich MT Abernethy RBCalambokidis J Lien J Lambertsen RH Ramirez JU Vasquez O1993 Abundant mitochondrial DNA variation and world-wide pop-ulation structure in humpback whales Proc Natl Acad Sci U S A90(17)8239ndash8243

Baker J Meade A Pagel M Venditti C 2015 Adaptive evolution towardlarger size in mammals Proc Natl Acad Sci U S A 112(16)5093ndash5098

Benjamini Y Hochberg Y 1995 Controlling the false discovery rate apractical and powerful approach to multiple testing J R Stat Soc57289ndash300

Benson RBJ Campione NE Carrano MT Mannion PD Sullivan CUpchurch P Evans DC 2014 Rates of dinosaur body mass evolutionindicate 170 million years of sustained ecological innovation on theavian stem lineage PLoS Biol 12(5)e1001853

Benton MJ Donoghue PCJ Asher RJ Friedman M Near TJ Vinther J2015 Constraints on the timescale of animal evolutionary historyPalaeontol Electron 181ndash106

Berta A Sumich JL Kovacs KM 2015 Marine mammals evolutionarybiology 3rd ed Cambridge (MA) Academic Press

Blanchette M Kent WJ Riemer C Elnitski L Smit AFA Roskin KMBaertsch R Rosenbloom K Clawson H Green ED 2004 Aligningmultiple genomic sequences with the threaded blockset alignerGenome Res 14(4)708ndash715

Blankenberg D Taylor J Nekrutenko A Galaxy T 2011 Making wholegenome multiple alignments usable for biologists Bioinformatics27(17)2426ndash2428

Boissinot S Sookdeo A 2016 The evolution of LINE-1 in vertebratesGenome Biol Evol 8(12)3485ndash3507

Bolger AM Lohse M Usadel B 2014 Trimmomatic a flexible trimmer forIllumina sequence data Bioinformatics 30(15)2114ndash2120

Browning CL Wise CF Wise JP 2017 Prolonged particulate chromateexposure does not inhibit homologous recombination repair inNorth Atlantic right whale (Eubalaena glacialis) lung cells ToxicolAppl Pharmacol 33118ndash23

Carbone L Harris RA Gnerre S Veeramah KR Lorente-Galdos BHuddleston J Meyer TJ Herrero J Roos C Aken B et al 2014Gibbon genome and the fast karyotype evolution of small apesNature 513(7517)195ndash201

Caulin AF Graham TA Wang L-S Maley CC 2015 Solutions to Petorsquosparadox revealed by mathematical modelling and cross-species can-cer gene analysis Philos Trans R Soc B 370(1673)20140222

Caulin AF Maley CC 2011 Petorsquos Paradox evolutionrsquos prescription forcancer prevention Trends Ecol Evol (Amst) 26(4)175ndash182

Chapman JA Ho I Sunkara S Luo S Schroth GP Rokhsar DS 2011Meraculous de novo genome assembly with short paired-end readsPLoS One 6(8)e23501

Chittleborough RG 1959 Determination of age in the humpback whaleMegaptera nodosa (Bonnaterre) Aust J Mar Freshw Res10(2)125ndash143

Clapham PJ Mead JG 1999 Megaptera novaeangliae Mamm Species6041ndash9

Clement V Dunand-Sauthier I Clarkson SG 2006 Suppression of UV-induced apoptosis by the human DNA repair protein XPG CellDeath Differ 13(3)478ndash488

Dobin A Davis CA Schlesinger F Drenkow J Zaleski C Jha S Batut PChaisson M Gingeras TR 2013 STAR ultrafast universal RNA-Seqaligner Bioinformatics 29(1)15ndash21

Fay JC Wu C-I 2003 Sequence divergence functional constraint andselection in protein evolution Annu Rev Genomics Hum Genet4213ndash235

Ferris E Abegglen LM Schiffman JD Gregg C 2018 Accelerated evolu-tion in distinctive species reveals candidate elements for clinicallyrelevant traits including mutation and cancer resistance Cell Rep22(10)2742ndash2755

Fog CK Galli GG Lund AH 2012 PRDM proteins important players indifferentiation and disease Bioessays 34(1)50ndash60

Foley NM Hughes GM Huang Z Clarke M Jebb D Whelan CV Petit EJTouzalin F Farcy O Jones G et al 2018 Growing old yet stayingyoung the role of telomeres in batsrsquo exceptional longevity Sci Adv4(2)eaao0926

Foote AD Liu Y Thomas GWC Vinar T Alfoldi J Deng J Dugan Svan Elk CE Hunter ME Joshi V et al 2015 Convergent evolu-tion of the genomes of marine mammals Nat Genet47(3)272ndash275

Forbes SA Beare D Gunasekaran P Leung K Bindal N Boutselakis HDing M Bamford S Cole C Ward S et al 2015 COSMIC exploringthe worldrsquos knowledge of somatic mutations in human cancerNucleic Acids Res 43(D1)D805ndashD811

Fritz SA Bininda-Emonds ORP Purvis A 2009 Geographical variation inpredictors of mammalian extinction risk big is bad but only in thetropics Ecol Lett 12(6)538ndash549

Futreal PA Coin L Marshall M Down T Hubbard T Wooster R RahmanN Stratton MR 2004 A census of human cancer genes Nat RevCancer 4(3)177ndash183

Gabriele C Lockyer C Straley JM Jurasz CM Kato H 2010 Sightinghistory of a naturally marked humpback whale (Megaptera novaean-gliae) suggests ear plug growth layer groups are deposited annuallyMar Mam Sci 26(2)443ndash450

Gauthier JM Metcalfe CD Sears R 1997 Chlorinated organic contam-inants in blubber biopsies from northwestern Atlantic balaenopteridwhales summering in the Gulf of St Lawrence Mar Environ Res44(2)201ndash223

Gene Ontology Consortium 2015 Gene Ontology Consortium goingforward Nucleic Acids Res 43D1049ndashD1056

George JC Bada J Zeh J Scott L Brown SE OrsquoHara T Suydam R 1999Age and growth estimates of bowhead whales (Balaena mysticetus)via aspartic acid racemization Can J Zool 77(4)571ndash580

Gharib WH Robinson-Rechavi M 2013 The branch-site test of positiveselection is surprisingly robust but lacks power under synonymoussubstitution saturation and variation in GC Mol Biol Evol30(7)1675ndash1686

Grabherr MG Haas BJ Yassour M Levin JZ Thompson DA Amit IAdiconis X Fan L Raychowdhury R Zeng Q et al 2011 Full-lengthtranscriptome assembly from RNA-Seq data without a referencegenome Nat Biotechnol 29(7)644ndash652

Guindon S Dufayard J-F Lefort V Anisimova M Hordijk W Gascuel O2010 New algorithms and methods to estimate maximum-likelihood phylogenies assessing the performance of PhyML 30Syst Biol 59(3)307ndash321

Harris RS 2007 Improved pairwise alignment of genomic DNA [PhDthesis] The Pennsylvania State University State College PA USA

Hedges SB Marin J Suleski M Paymer M Kumar S 2015 Tree of lifereveals clock-like speciation and diversification Mol Biol Evol32(4)835ndash845

Heim NA Knope ML Schaal EK Wang SC Payne JL 2015 Copersquos rule inthe evolution of marine animals Science 347(6224)867ndash870

Holt C Yandell M 2011 MAKER2 an annotation pipeline and genome-database management tool for second-generation genome projectsBMC Bioinformatics 12491

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1761

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Huang DW Sherman BT Lempicki RA 2009a Bioinformatics enrich-ment tools paths toward the comprehensive functional analysis oflarge gene lists Nucleic Acids Res 37(1)1ndash13

Huang DW Sherman BT Lempicki RA 2009b Systematic and integra-tive analysis of large gene lists using DAVID bioinformatics resourcesNat Protoc 4(1)44ndash57

Hubisz MJ Pollard KS Siepel A 2011 PHAST and RPHAST phylogeneticanalysis with spacetime models Brief Bioinform 12(1)41ndash51

Hudson RR 1990 Gene genealogies and the coalescent process InFutuyama DJ Antonovics J editors Oxfords surveys in evolutionarybiology Oxford Oxford University Press p 1ndash44

Ichishima H 2016 The ethmoid and presphenoid of cetaceans JMorphol 277(12)1661ndash1674

Jackson JA Baker CS Vant M Steel DJ Medrano-Gonzalez LPalumbi SR 2009 Big and slow phylogenetic estimates of mo-lecular evolution in baleen whales (suborder Mysticeti) MolBiol Evol 26(11)2427ndash2440

Jackson JA Steel DJ Beerli P Congdon BC Olavarria C Leslie MS PomillaC Rosenbaum H Baker CS 2014 Global diversity and oceanic di-vergence of humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 281(1786)20133222

Jones KE Bielby J Cardillo M Fritz SA OrsquoDell J Orme CDL Safi KSechrest W Boakes EH Carbone C et al 2009 PanTHERIA a spe-cies-level database of life history ecology and geography of extantand recently extinct mammals Ecology 90(9)2648ndash2648

Jurka J Kapitonov VV Pavlicek A Klonowski P Kohany O WalichiewiczJ 2005 Repbase Update a database of eukaryotic repetitive ele-ments Cytogenet Genome Res 110(1ndash4)462ndash467

Kaessmann H 2010 Origins evolution and phenotypic impact of newgenes Genome Res 20(10)1313ndash1326

Kastan MB Bartek J 2004 Cell-cycle checkpoints and cancer Nature432(7015)316ndash323

Katona SK Whitehead HP 1981 Identifying humpback whales usingtheir natural markings Polar Rec 20(128)439ndash444

Keane M Semeiks J Webb AE Li YI Quesada V Craig T Madsen LB vanDam S Brawand D Marques PI et al 2015 Insights into the evolu-tion of longevity from the bowhead whale genome Cell Rep10(1)112ndash122

Kent WJ Baertsch R Hinrichs A Miller W Haussler D 2003 Evolutionrsquoscauldron duplication deletion and rearrangement in the mouseand human genomes Proc Natl Acad Sci U S A100(20)11484ndash11489

Kingsolver JG Pfennig DW 2004 Individual-level selection as acause of Copersquos rule of phyletic size increase Evolution58(7)1608ndash1612

Kinsella RJ Keuroaheuroari A Haider S Zamora J Proctor G Spudich G Almeida-King J Staines D Derwent P Kerhornou A et al 2011 EnsemblBioMarts a hub for data retrieval across taxonomic spaceDatabase (Oxford) 2011bar030

Korf I 2004 Gene finding in novel genomes BMC Bioinformatics 559Kumar S Subramanian S 2002 Mutation rates in mammalian genomes

Proc Natl Acad Sci U S A 99(2)803ndash808Lambertsen RH 1987 A biopsy system for large whales and its use for

cytogenetics J Mammal 68(2)443ndash445Langmead B Salzberg SL 2012 Fast gapped-read alignment with Bowtie

2 Nat Methods 9(4)357ndash359Larsen AH Sigurjonsson J Oslashien N Vikingsson G Palsboslashll PJ 1996

Populations genetic analysis of nuclear and mitochondrial loci inskin biopsies collected from central and northeastern NorthAtlantic humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 2631611ndash1618

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357)493ndash496

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup 2009 The Sequence AlignmentMap format andSAMtools Bioinformatics 25(16)2078ndash2079

Martineau D Lemberger K Dallaire A Labelle P Lipscomb TP Michel PMikaelian I 2002 Cancer in wildlife a case study beluga from the St

Lawrence estuary Quebec Canada Environ Health Perspect110(3)285ndash292

Mazet O Rodrıguez W Grusea S Boitard S Chikhi L 2016 On theimportance of being structured instantaneous coalescence ratesand human evolution ndash lesson for ancestral population size infer-ence Heredity 116(4)362ndash371

Michaud EJ Yoder BK 2006 The primary cilium in cell signaling andcancer Cancer Res 66(13)6463ndash6467

Milholland B Dong X Zhang L Hao X Suh Y Vijg J 2017 Differencesbetween germline and somatic mutation rates in humans and miceNat Commun 815183

Mitchell ED 1989 A new cetacean from the Late Eocene La MesetaFormation Seymour Island Antarctic Peninsula Can J Fish Aquat Sci46(12)2219ndash2235

Morgan CC Foster PG Webb AE Pisani D McInerney JO OrsquoConnell MJ2013 Heterogeneous models place the root of the placental mam-mal phylogeny Mol Biol Evol 30(9)2145ndash2156

Nadachowska-Brzyska K Burri R Smeds L Ellegren H 2016 PSMC anal-ysis of effective population sizes in molecular ecology and its appli-cation to black-and-white Ficedula flycatchers Mol Ecol25(5)1058ndash1072

Newman SJ Smith SA 2006 Marine mammal neoplasia a review VetPathol 43(6)865ndash880

Nunney L 2018 Size matters height cell number and a personrsquos risk ofcancer Proc R Soc B 285(1889)20181743

Ohsumi S 1979 Interspecies relationships among some biologicalparameters in cetaceans and estimation of the natural mortalitycoefficient of the Southern Hemisphere minke whales Rep IntWhaling Comm 29397ndash406

OrsquoLeary MA Gatesy J 2008 Impact of increased character sampling onthe phylogeny of Cetartiodactyla (Mammalia) combined analysisincluding fossils Cladistics 24397ndash442

Pagel M Venditti C Meade A 2006 Large punctuational contribution ofspeciation to evolutionary divergence at the molecular level Science314(5796)119ndash121

Palsboslashll PJ Clapham PJ Mattila DK Larsen F Sears R Siegismund HRSigurjonsson J Vasquez O Arctander P 1995 Distribution ofmtDNA haplotypes in North Atlantic humpback whales the influ-ence of behavior on population structure Mar Ecol Prog Ser1161ndash10

Palsboslashll PJ Larsen F Hansen ES 1991 Sampling of skin biopsies fromfree-raging large cetaceans in West Greenland development of newbiopsy tips and bolt designs Int Whaling Comm Spec Issue Ser13311

Palsboslashll PJ Zachariah Peery M Olsen MT Beissinger SR Berube M 2013Inferring recent historic abundance from current genetic diversityMol Ecol 22(1)22ndash40

Parra G Bradnam K Ning Z Keane T Korf I 2009 Assessing the genespace in draft genomes Nucleic Acids Res 37(1)289ndash297

Pertea M Pertea GM Antonescu CM Chang T-C Mendell JT SalzbergSL 2015 StringTie enables improved reconstruction of a transcrip-tome from RNA-Seq reads Nat Biotechnol 33(3)290ndash295

Peto R Roe FJ Lee PN Levy L Clack J 1975 Cancer and ageing in miceand men Br J Cancer 32(4)411ndash426

Pio R Corrales L Lambris JD 2014 The role of complement in tumorgrowth Adv Exp Med Biol 772229ndash262

Pollard KS Hubisz MJ Rosenbloom KR Siepel A 2010 Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 20(1)110ndash121

Putnam NH OrsquoConnell BL Stites JC Rice BJ Blanchette M Calef R TrollCJ Fields A Hartley PD Sugnet CW et al 2016 Chromosome-scaleshotgun assembly using an in vitro method for long-range linkageGenome Res 26(3)342ndash350

Quinlan AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26(6)841ndash842

Ruegg K Rosenbaum HC Anderson EC Engel M Rothschild A Baker CSPalumbi SR 2013 Long-term population size of the North Atlantichumpback whale within the context of worldwide population struc-ture Conserv Genet 14(1)103ndash114

Tollis et al doi101093molbevmsz099 MBE

1762

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Sanderson MJ 2002 Estimating absolute rates of molecular evolutionand divergence times a penalized likelihood approach Mol Biol Evol19(1)101ndash109

Sanderson MJ 2003 r8s inferring absolute rates of molecular evolutionand divergence times in the absence of a molecular clockBioinformatics 19(2)301ndash302

Sayyari E Mirarab S 2016 Fast coalescent-based computation of localbranch support from quartet frequencies Mol Biol Evol33(7)1654ndash1668

Shadat NMA Koide N Khuda II-E Dagvadorj J Tumurkhuu G Naiki YKomatsu T Yoshida T Yokochi T 2010 Retinoblastoma protein-interacting zinc finger 1 (RIZ1) regulates the proliferation of mono-cytic leukemia cells via activation of p53 Cancer Invest28(8)806ndash812

Sim~ao FA Waterhouse RM Ioannidis P Kriventseva EV Zdobnov EM2015 BUSCO assessing genome assembly and annotation com-pleteness with single-copy orthologs Bioinformatics31btv351ndashbtv3212

Slater GJ Goldbogen JA Pyenson ND 2017 Independent evolution ofbaleen whale gigantism linked to Plio-Pleistocene ocean dynamicsProc R Soc B 284(1855)20170546

Smit AFA Hubley RM Green P 2015a RepeatMasker Open-40 2013-2015 Available from httpwwwrepeatmaskerorg

Smit AFA Hubley RM Green P 2015b RepeatModeler Open-10 2008-2015 Available from httpwwwrepeatmaskerorg

Smith TD Allen J Clapham PJ Hammond PS Katona S Larsen F Lien JMattila D Palsboslashll PJ Sigurjonsson J et al 1999 An ocean-basin-widemark-recapture study of the North Atlantic humpback whale(Megaptera novaeangliae) Mar Mamm Sci 51ndash32

Stamatakis A 2014 RAxML version 8 a tool for phylogenetic analysisand post-analysis of large phylogenies Bioinformatics30(9)1312ndash1313

Stanke M Diekhans M Baertsch R Haussler D 2008 Using native andsyntenically mapped cDNA alignments to improve de novo genefinding Bioinformatics 24(5)637ndash644

Steeman ME Hebsgaard MB Fordyce RE Ho SYW Rabosky DL NielsenR Rahbek C Glenner H Soslashrensen MV Willerslev E 2009 Radiationof extant cetaceans driven by restructuring of the oceans Syst Biol58(6)573ndash585

Sulak M Fong L Mika K Chigurupati S Yon L Mongan NP Emes RDLynch VJ 2016 TP53 copy number expansion is associated with theevolution of increased body size and an enhanced DNA damageresponse in elephants Elife 51850

Supek F Bosnjak M Skunca N Smuc T 2011 REVIGO summarizes andvisualizes long lists of gene ontology terms PLoS One 6(7)e21800

Szklarczyk D Morris JH Cook H Kuhn M Wyder S Simonovic M SantosA Doncheva NT Roth A Bork P et al 2017 The STRING database in2017 quality-controlled protein-protein association networks madebroadly accessible Nucleic Acids Res 45(D1)D362ndashD368

Tacutu R Craig T Budovsky A Wuttke D Lehmann G Taranukha DCosta J Fraifeld VE de Magalh~aes JP 2012 Human Ageing GenomicResources integrated databases and tools for the biology and genet-ics of ageing Nucleic Acids Res 41(D1)D1027ndashD1033

Tariq MA Kim HJ Jejelowo O Pourmand N 2011 Whole-transcriptomeRNAseq analysis from minute amount of total RNA Nucleic AcidsRes 39(18)e120

Tarver JE Dos Reis M Mirarab S Moran RJ Parker S OrsquoReilly JE King BLOrsquoConnell MJ Asher RJ Warnow T et al 2016 The interrelationshipsof placental mammals and the limits of phylogenetic inferenceGenome Biol Evol 8(2)330ndash344

Thewissen JGM Cohn MJ Stevens LS Bajpai S Heyning J Horton WE2006 Developmental basis for hind-limb loss in dolphins and origin

of the cetacean bodyplan Proc Natl Acad Sci U S A103(22)8414ndash8418

Thewissen JGM Cooper LN Clementz MT Bajpai S Tiwari BN 2007Whales originated from aquatic artiodactyls in the Eocene epoch ofIndia Nature 450(7173)1190ndash1194

Tollis M Schiffman JD Boddy AM 2017 Evolution of cancer suppressionas revealed by mammalian comparative genomics Curr Opin GenetDev 4240ndash47

Trego KS Groesser T Davalos AR Parplys AC Zhao W Nelson MRHlaing A Shih B Rydberg B Pluth JM et al 2016 Non-catalytic rolesfor XPG with BRCA1 and BRCA2 in homologous recombination andgenome stability Mol Cell 61(4)535ndash546

UniProt Consortium 2015 UniProt a hub for protein informationNucleic Acids Res 43D204ndashD212

Valsecchi E Palsboslashll PJ Hale P Glockner-Ferrari D Ferrari M Clapham PLarsen F Mattila DK Sears R Sigurjonsson J et al 1997 Microsatellitegenetic distances between oceanic opulations of the humpbackwhale (Megaptera novaeangliae) Mol Biol Evol 14355ndash362

Venkat A Hahn MW Thornton JW 2018 Multinucleotide mutationscause false inferences of lineage-specific positive selection Nat EcolEvol 2(8)1280ndash1288

Vermeij GJ 2016 Gigantism and its implications for the history of lifePLoS One 11(1)e0146092

Wang D Zhang Y Zhang Z Zhu J Yu J 2010 KaKs_Calculator 20 atoolkit incorporating gamma-series methods and sliding windowstrategies Genomics Proteomics Bioinformatics 8(1)77ndash80

Warren WC Kuderna L Alexander A Catchen J Perez-Silva JG Lopez-Otın C Quesada V Minx P Tomlinson C Montague MJ et al 2017The novel evolution of the sperm whale genome Genome Biol Evol9(12)3260ndash3264

Webb AE Walsh TA OrsquoConnell MJ 2017 VESPA very large-scale evo-lutionary and selective pressure analyses PeerJ Comput Sci 3e118

Weng M-P Liao B-Y 2017 modPhEA model organism PhenotypeEnrichment Analysis of eukaryotic gene sets Bioinformatics33(21)3505ndash3507

Wilson Sayres MA Venditti C Pagel M Makova KD 2011 Do variationsin substitution rates and male mutation bias correlate with life-history traits A study of 32 mammalian genomes Evolution65(10)2800ndash2815

Yang Z 1998 Likelihood ratio tests for detecting positive selection andapplication to primate lysozyme evolution Mol Biol Evol15(5)568ndash573

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 24(8)1586ndash1591

Yang Z Rannala B 2006 Bayesian estimation of species divergence timesunder a molecular clock using multiple fossil calibrations with softbounds Mol Biol Evol 23(1)212ndash226

Yim H-S Cho YS Guang X Kang SG Jeong J-Y Cha S-S Oh H-M Lee J-HYang EC Kwon KK et al 2014 Minke whale genome and aquaticadaptation in cetaceans Nat Genet 46(1)88ndash92

Yu J Tian S Metheny-Barlow L Chew L-J Hayes AJ Pan H Yu G-L LiL-Y 2001 Modulation of endothelial cell growth arrest andapoptosis by vascular endothelial growth inhibitor Circ Res89(12)1161ndash1167

Zhang C Rabiee M Sayyari E Mirarab S 2018 ASTRAL-III polynomialtime species tree reconstruction from partially resolved gene treesBMC Bioinformatics 19(Suppl 6)153

Zhang Q Edwards SV 2012 The evolution of intron size in amniotes arole for powered flight Genome Biol Evol 4(10)1033ndash1043

Zhao M Kim P Mitra R Zhao J Zhao Z 2015 TSGene 20 an updatedliterature-based knowledgebase for tumor suppressor genes NucleicAcids Res 44gkv1268ndashgkD1031

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1763

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

  • msz099-TF1
  • msz099-TF2
  • msz099-TF3
  • msz099-TF4
  • msz099-TF5
  • msz099-TF6
  • msz099-TF7
  • msz099-TF8
Page 8: University of Groningen Return to the sea, get huge, …...Return to the Sea, Get Huge, Beat Cancer: An Analysis of Cetacean Genomes Including an Assembly for the Humpback Whale (Megaptera

humpback whale TP53 (data not shown) We PCR-amplifiedcloned and sequenced this region from a humpback whaleDNA sample inferring four haplotypes that differ at two bases(supplementary Methods Supplementary Material online)After manually annotating TP53 in the humpback whalewe determined that these nucleotide variants fell in noncod-ing regions of the gene one occurred upstream of the startcodon whereas the other occurred between the first andsecond coding exons Other genomic studies have concludedthat TP53 is not duplicated in cetaceans (Yim et al 2014Keane et al 2015 Sulak et al 2016) We consider the possi-bility of at least two TP53 homologs in the genome of thehumpback whale although more data are required to resolvethis Regardless cancer suppression likely arose in differentmammalian lineages via multiple molecular etiologiesOverall our results reveal several copy number expansionsin cetaceans related to immunity aging and cancer suggest-ing that cetaceans are among the large mammals that haveevolved specific adaptations related to cancer resistance

Accelerated Regions in Cetacean Genomes AreSignificantly Enriched with Pathways Relevant toCancerIn order to determine genomic loci underlying cetacean adap-tations we estimated regions in the 12-mammal WGA withelevated substitution rates that were specific to the cetaceanbranches of the mammalian phylogeny These genomicregions departed from neutral expectations in a manner con-sistent with either positive selection or relaxed purifying se-lection along the cetacean lineage (Pollard et al 2010) Wesuccessfully mapped 3260 protein-coding genes with func-tional annotations that overlap cetacean-specific acceleratedregions which were significantly enriched for Gene Ontology(GO) categories such as cell-cell signaling (GO0007267) andcell adhesion (GO0007155) (table 4) Adaptive change in cellsignaling pathways could have maintained the ability of ceta-ceans to prevent neoplastic progression as they evolved largerbody sizes Adhesion molecules are integral to the develop-ment of cancer invasion and metastasis and these resultssuggest that cetacean evolution was accompanied by selec-tion pressure changes on both intra- and extracellular inter-actions Cetacean-specific genomic regions with elevatedsubstitution rates were also significantly enriched in genesinvolved in B-cell-mediated immunity (GO0019724) likely

due to the important role of regulatory cells which modulateimmune response to not only pathogens but perhaps tumorsas well In addition cetacean-specific acceleration in regionscontrolling complement activation (GO0006956) may haveprovided better immunosurveillance against cancer and fur-ther protective measures against malignancies (Pio et al2014) We also found that accelerated regions in cetaceangenomes were significantly enriched for genes controllingsensory perception of smell (GO0007608) perhaps due tothe relaxation of purifying selection in olfactory regions whichwere found to be underrepresented in cetacean genomes(Yim et al 2014)

Selection Pressures on Protein-Coding Genes duringCetacean Evolution Point to Many CetaceanAdaptations Including Cancer SuppressionTo gain further insight into the genomic changes underlyingthe evolution of large body sizes in cetaceans we employedphylogenetic targeting to maximize statistical power in pair-wise evolutionary genomic analyses (Arnold and Nunn 2010)This resulted in maximal comparisons between 1) the orcaand the bottlenose dolphin and 2) the humpback whale andcommon minke whale Despite their relatively recent diver-gences (eg the orcabottlenose dolphin divergence is similarin age to that of the humanchimpanzee divergence seefig 2) the species pairs of common minkehumpback andorcadolphin have each undergone extremely divergent evo-lution in body size and longevity (fig 3) Humpback whalesare estimated to weigh up to four times as much as commonminke whales and are reported to have almost double thelongevity and orcas may weigh almost 20 times as much asbottlenose dolphins also with almost double the lifespan(Tacutu et al 2012) In order to offset the tradeoffs associatedwith the evolution of large body size with the addition ofmany more cells and longer lifespans since the divergence ofeach species pair we hypothesize that necessary adaptationsfor cancer suppression should be encoded in the genomes aspredicted by Petorsquos Paradox (Tollis et al 2017)

For each pairwise comparison we inferred pairwise ge-nome alignments with the common minke whale and orcagenome assemblies as targets respectively and extractedprotein-coding orthologous genes We then estimated theratio of nonsynonymous substitutions per synonymous siteto synonymous substitutions per synonymous site (dNdS) in

Table 4 GO Terms for Biological Processes Overrepresented by Genes Overlapping Genomic Regions with Elevated Substitution Rates That AreUnique to the Cetacean Lineage

Go Term Description Number of Genes Fold Enrichment P-Valuea

GO0007608 Sensory perception of smell 157 458 124E-40GO0006956 Complement activation 39 291 491E-05GO0019724 B-cell-mediated immunity 39 291 491E-05GO0032989 G-protein-coupled receptor signaling pathway 159 244 196E-17GO0042742 Defense response to bacterium 39 244 220E-03GO0009607 Response to biotic stimulus 43 209 185E-02GO0007155 Cell adhesion 101 199 174E-06GO0007267 Cellndashcell signaling 137 169 284E-05

aAfter Bonferonni correction for multiple testing

Tollis et al doi101093molbevmsz099 MBE

1752

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

order to measure selective pressures acting on each ortholo-gous gene pair during cetacean evolution A dNdSgt1 is usedto infer potentially functional amino acid changes in candi-date genes subjected to positive selection (Fay and Wu 2003)Among an estimated 435 genes with dNdS gt1 in the com-mon minkehumpback pairwise comparison we detectedeight genes belonging to the JAK-STAT signaling pathway(39-fold enrichment Pfrac14 11E-3 Fisherrsquos exact test) and seveninvolved in cytokinendashcytokine receptor interaction (41-foldenrichment Pfrac14 17E-2 Fisherrsquos exact test) suggesting positiveselection acting on pathways involved in cell proliferationThese genes included multiple members of the tumor necro-sis factor subfamily such as TNFSF15 which inhibits angiogen-esis and promotes the activation of caspases and apoptosis(Yu et al 2001) A dNdS gt1 was also detected in seven genesinvolved in the negative regulation of cell growth(GO0030308 31-fold enrichment Pfrac14 803E-3 Fisherrsquos exacttest) and five genes involved in double-strand break repair(GO0006302 40-fold enrichment Pfrac14 803E-3 Fisherrsquos exacttest) Although these results suggest the evolution of aminoacid differences since the split between common minke andhumpback whales in genes affecting cell growth proliferationand maintenance the GO category enrichment tests did notpass significance criteria after Bonferroni corrections for mul-tiple testing We found 18 genes that are mutated in cancersaccording to the COSMIC v85 database (Forbes et al 2015) inthe common minkehumpback comparison including a sub-set of five annotated as tumor suppressor genes oncogenesor fusion genes in the Cancer Gene Census (CGC Futreal et al2004) which are highlighted in table 5 The complete list ofCOSMIC genes with elevated dNdS in the pairwise compar-isons is given in supplementary table 8 SupplementaryMaterial online We detected 555 orthologous genes withdNdS gt1 in the orcadolphin comparison which are signifi-cantly enriched (after Bonferroni correction for multiple test-ing) for biological processes such as immune response cellactivation and regulation of cytokines (table 6) and 41 of

which are known cancer genes according to COSMIC andCGC (table 5 supplementary table 8 Supplementary Materialonline) These results are consistent with our accelerated re-gion analysis based on the WGA which showed acceleratedevolution in immunity pathways (above see table 4) Forinstance eight genes (CD58 CD84 KLF13 SAMSN1 CTSGGPC3 LTF and SPG21) annotated for immune system process(GO0002376) were found in cetacean-specific acceleratedgenomic regions and also had a pairwise dNdS gt1 in theorcadolphin comparison mirroring other recent genomicanalyses of immunity genes in orcas (Ferris et al 2018) Ourresults also suggest that the evolution of gigantism and longlifespans in cetaceans was accompanied by selection actingon many genes related to somatic maintenance and cellsignaling

As a more accurate assessment of selection pressure var-iation acting on protein-coding genes across cetacean evolu-tion we conducted an additional assessment of dNdS usingbranch-site codon models implemented in codeml (Yang1998) We employed extensive filtering of the branch siteresults including both false discovery rate (FDR) andBonferroni corrections for multiple testing (see Materialsand methods) and conservatively estimated that 450protein-coding genes were subjected to positive selection incetaceans These include 54 genes along the ancestral ceta-cean branch 12 along the ancestral toothed whale branch 84along the ancestral baleen whale branch 74 in the ancestor ofcommon minke and humpback whales and 212 unique tothe humpback whale branch (fig 4A) Cetacean positivelyselected genes were annotated for functions related to exten-sive changes in anatomy growth cell signaling and cell pro-liferation (fig 4B) For instance in the branch-site models forhumpback whale positively selected genes are enriched forseveral higher-level mouse limb phenotypes including thoseaffecting the limb long bones (MP0011504 15 genes FDR-corrected P-valuefrac14 0001) and more specifically the hindlimb stylopod (MP0003856 seven genes FDRfrac14 0024) or

FIG 3 Diversity in both body size and lifespan within rorqual baleen whales (Balaenopteridae) and dolphins (Delphinidae) Maximal pairings usingphylogenetic targeting (Arnold and Nunn 2010) of genome assembly-enabled cetaceans resulted in the most extreme divergence in both body sizeand lifespan between humpback whale (Megaptera novaeangliae) and common minke whale (Balaenoptera acutorostrata) within theBalaenopteridae facing right and orca (Orcinus orca) and bottlenose dolphin (Tursiops truncatus) within the Delphinidae facing left Traitdata were collected from the panTHERIA (Jones et al 2009) and AnAge (Tacutu et al 2012) databases

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1753

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Table 5 CGC Genes with dNdS gt 1 as Revealed by Pairwise Comparisons of Cetacean Genomes

Comparison Gene Symbol Gene Name Role in Cancer Function

Minkehumpback CD274 CD274 molecule TSG Plays a critical role in induction and maintenance ofimmune tolerance to selfa

ETNK1 Ethanolamine kinase 1 TSG Suppresses escaping of programmed cell deathb

IL21R Interleukin 21 receptor Fusion The ligand binding of this receptor leads to the acti-vation of multiple downstream signaling mole-cules including JAK1 JAK3 STAT1 and STAT32

MYOD1 Myogenic differentiation 1 Fusion Regulates muscle cell differentiation by inducing cellcycle arrest a prerequisite for myogenic initiationa

PHF6 PHD finger protein 6 TSG Encodes a protein with two PHD-type zinc fingerdomains indicating a potential role in transcrip-tional regulation that localizes to the nucleolusa

Orcadolphin BTG1 B-cell translocation gene 1antiproliferative

TSG fusion Member of an antiproliferative gene family that reg-ulates cell growth and differentiationa

CD274 CD274 molecule TSG fusion Plays a critical role in induction and maintenance ofimmune tolerance to selfa

FANCD2 Fanconi anemia comple-mentation group D2

TSG Suppresses genome instability and mutations pro-motes escaping programmed cell death suppressesproliferative signaling suppresses invasion andmetastasisb

FAS Fas cell surface deathreceptor

TSG Promotes cell replicative immortality promotesproliferative signaling promotes invasion and me-tastasis suppresses escaping programmed celldeathb

FGFR4 Fibroblast growth factor re-ceptor 4

Oncogene Promotes proliferative signaling promotes invasionand metastasisb

GPC3 Glypican 3 Oncogene TSG Promotes invasion and metastasis promotes sup-pression of growthb

HOXD11 Homeobox D11 Oncogene fusion The homeobox genes encode a highly conservedfamily of transcription factors that play an impor-tant role in morphogenesis in all multicellularorganismsa

HOXD13 Homeobox D13 Oncogene fusion

LASP1 LIM and SH3 protein 1 Fusion The encoded protein has been linked to metastaticbreast cancer hematopoetic tumors such as B-celllymphomas and colorectal cancera

MLF1 Myeloid leukemia factor 1 TSG fusion This gene encodes an oncoprotein which is thought toplay a role in the phenotypic determination ofhemopoetic cells Translocations between this geneand nucleophosmin have been associated withmyelodysplastic syndrome and acute myeloidleukemiaa

MYB v-myb myeloblastosis viraloncogene homolog

Oncogene fusion This gene may be aberrantly expressed or rearrangedor undergo translocation in leukemias and lym-phomas and is considered to be an oncogenea

MYD88 Myeloid differentiation pri-mary response gene (88)

Oncogene Promotes escaping programmed cell death promotesproliferative signaling promotes invasion and me-tastasis promotes tumor promotinginflammationb

NR4A3 Nuclear receptor subfamily4 group A member 3(NOR1)

Oncogene fusion Encodes a member of the steroidndashthyroid hormonendashretinoid receptor superfamily that may act as atranscriptional activatora

PALB2 Partner and localizer ofBRCA2

TSG This protein binds to and colocalizes with the breastcancer 2 early onset protein (BRCA2) in nuclear fociand likely permits the stable intranuclear localiza-tion and accumulation of BRCA2a

PML Promyelocytic leukemia TSG fusion Expression is cell-cycle related and it regulates the p53response to oncogenic signalsa

RAD21 RAD21 homolog(Schizosaccharomycespombe)

Oncogene TSG Promotes invasion and metastasis suppresses ge-nome instability and mutations suppresses escap-ing programmed cell deathb

STIL SCLTAL1 interruptinglocus

Oncogene fusion Encodes a cytoplasmic protein implicated in regula-tion of the mitotic spindle checkpoint a regulatorypathway that monitors chromosome segregationduring cell division to ensure the proper distribu-tion of chromosomes to daughter cellsa

(continued)

Tollis et al doi101093molbevmsz099 MBE

1754

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

femur (MP0000559 six genes FDRfrac14 0019) These pheno-types are reminiscent of the developmental basis of hind limbloss in cetaceans embryonic studies show that hind limbbuds are initially formed but disappear by the fifth gestationalweek (Thewissen et al 2006) Enriched mouse phenotypes arealso related to the unique cetacean axial skeleton(MP0002114 25 genes FDRfrac14 0016) most notably in theskull including craniofacial bones (MP0002116 17 genesFDRfrac14 0018) teeth (MP0002100 nine genes FDRfrac14 0004)and the presphenoid (MP0030383 three genesFDRfrac14 0003) Past analyses of the cetacean basicranial ele-ments revealed that the presphenoid was extensively modi-fied along the cetacean lineage (Ichishima 2016)

Positively selected genes unique to the humpback whalewere significantly enriched for a single biological process reg-ulation of cell cycle checkpoint (GO1901976 1857-fold en-richment Pfrac14 002 after Bonferonni correction for multipletesting) suggesting positive selection in pathways that con-trol responses to endogenous or exogenous sources of DNAdamage and limit cancer progression (Kastan and Bartek2004) We detected a significant number of proteinndashproteininteractions among humpback whale-specific positively se-lected genes (number of nodesfrac14 204 number of edgesfrac14 71expected number of edgesfrac14 51 Pfrac14 0004 supplementaryfig 10 Supplementary Material online) including genes thatare often coexpressed and involved in DNA repair DNA rep-lication and cell differentiation For instance we identifiedsignificant interactions between DNA2 which encodes a heli-case involved in the maintenance of DNA stability andWDHD1 which acts as a replication initiation factorAnother robust protein interaction network was detected

between a number of genes involved in the genesis and main-tenance of primary cilia The highest scoring functional an-notation clusters resulted in key words such as ciliopathy(seven genes) and cell projection (16 genes) and GO termssuch as cilium morphogenesis cilium assembly ciliary basalbody and centriole The primary cilia of multicellular eukar-yotes control cell proliferation by mediating cell-extrinsic sig-nals and regulating cell cycle entry and defects in ciliaryregulation are common in many cancers (Michaud andYoder 2006)

Our branch-site test results indicated that the evolution ofcetacean gigantism was accompanied by strong selection onmany pathways that are directly linked to cancer (fig 4C) Weidentified 33 genes that are mutated in human cancers(according to the COSMIC database) that were inferred assubjected to positive selection in the humpback whale line-age including the known tumor suppressor genes ATR whichis a protein kinase that senses DNA damage upon genotixicstress and activates cell cycle arrest and RECK which sup-presses metastasis (Forbes et al 2015) Multiple members ofthe PR domain-containing gene family (PRDM) evolved un-der positive selection across cetaceans including the tumorsuppressor genes PRDM1 whose truncation leads to B-cellmalignancies and PRDM2 which regulates the expressionand degradation of TP53 (Shadat et al 2010) and whoseforced expression causes apoptosis and cell cycle arrest incancer cell lines (Fog et al 2012) In baleen whales ERCC5which is a DNA repair protein that partners with BRCA1 andBRCA2 to maintain genomic stability (Trego et al 2016) andsuppresses UV-induced apoptosis (Clement et al 2006)appeared to have been subjected to positive selection as

Table 5 Continued

Comparison Gene Symbol Gene Name Role in Cancer Function

TAL1 T-cell acute lymphocyticleukemia 1 (SCL)

Oncogene fusion Implicated in the genesis of hemopoietic malignan-cies and may play an important role in hemopoieticdifferentiationa

TNFRSF14 Tumor necrosis factor re-ceptor superfamilymember 14 (herpesvirusentry mediator)

TSG The encoded protein functions in signal transductionpathways that activate inflammatory and inhibi-tory T-cell immune responsea

TNFRSF17 Tumor necrosis factor re-ceptor superfamilymember 17

Oncogene fusion This receptor also binds to various TRAF familymembers and thus may transduce signals for cellsurvival and proliferationa

NOTEmdashTSG tumor suppressor geneaSource RefSeqbSource Cancer Hallmark from CGC

Table 6 GO Terms for Biological Processes Overrepresented by Genes with Pairwise dNdS gt 1 in the Orca Bottlenose Dolphin Comparison

GO Term Description Number of Genes Fold Enrichment P-Valuea

GO0031347 Regulation of defense response 36 270 291E-2GO0050776 Regulation of immune response 50 218 195E-3GO0002694 Regulation of leukocyte activation 31 249 189E-2GO0002275 Myeloid cell activation involved in immune response 29 253 237E-2GO0002699 Positive regulation of immune effector process 19 446 101-E3GO0042108 Positive regulation of cytokine biosynthetic process 9 687 223E-2

aAfter Bonferonni correction for multiple testing

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1755

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

well Across all the branch-site models positively selectedgenes represented multiple functional categories relevant tocancer and Petorsquos Paradox

Among the cancer-related genes subjected to positiveselection in cetaceans we identified two with identicalamino acid changes among disparate taxa united by thetraits of large body size andor extreme longevitySpecifically PRDM13 is a tumor suppressor gene that actsas a transcriptional repressor and we found identical DEamino acid substitutions in this gene in sperm whale dol-phin orca and humpback but also manatee (Trichechusmanatus) and African elephant (Loxodonta africana) whichare large-bodied afrotherian mammals that have been thefocus of cancer suppression research (Abegglen et al 2015Sulak et al 2016) Secondly POLE is a cancer-related genethat participates in DNA repair and replication and weobserved one IV substitution shared among orca dol-phin bowhead humpback and common minke whalebut also elephant as well as a second IV substitutionshared with these cetaceans and the little brown bat(Myotis lucifugus) Vesper bats such as M lucifugus areknown for their exceptional longevity relative to theirbody size and have been proposed as model organisms insenescence and cancer research (Foley et al 2018) Parallelchanges in cancer-related genes across these

phylogenetically distinct mammals suggest natural selectionhas acted on similar pathways that limit neoplastic progres-sion in large and long-lived species (Tollis et al 2017)

Petorsquos Paradox and Cancer in Whales and Other LargeMammalsLarge body size has evolved numerous times in mammalsand although it is exemplified in some extant cetaceans gi-gantism is also found in afrotherians perissodactyls and car-nivores (Baker et al 2015) Our results suggest that cancersuppression in large and long-lived mammals has also evolvednumerous times However none of these species iscompletely immune to cancer Elephants have at least a 5lifetime risk of cancer mortality (Abegglen et al 2015) whichis far less than humans but detecting cancer and estimatingcancer incidence and mortality rates in wild cetaceans is morechallenging Mathematical modeling predicting the lifetimerisk of colorectal cancer in mice and humans yielded a rate ofcolorectal cancer at 50 in blue whales by age 50 and 100by age 90 (Caulin et al 2015) This high rate of cancer mor-tality is an unlikely scenario and taken with our genomicresults presented here it suggests that cetaceans have evolvedmechanisms to limit their overall risk of cancer Among ba-leen whales benign neoplasms of the skin tongue and centralnervous system have been reported in humpback whales and

Orca

Minke whale

Cow

Bowhead whale

Bottlenose dolphin

Humpback whale

Sperm whale54

12

84

74

212A

ATRT Suppression of growth escaping programmedcell death change of cellular energetics

Humpback whale

BCORL1TF Invasion and metastasis

PICALMT Change of cellular energetics

PRDM2T Regulation of TP53 expression and degredation

TPRF Involved in the activation of oncogenic kinases

AMER1T Promotes suppression of growth

CCNB1IP1T Suppresses invasion and metastasis

POLET Participates in DNA repair and DNA replication

Cetacea ancestor

KMT2CT Methylates rsquoLys-4rsquo of histone H3

POU2AF1 Promotes proliferative signalingO

PRDM1T Regulation of TP53 expression anddegradation

ERCC5T Excision repair DNA double-strand breakrepair following UV damage

Mysticeti ancestor

PMLT Positively regulates TP53 by inhibiting MDM2-dependent degredation

Odontoceti ancestor

MUC1F Overexpressed in epithelial tumors

Minke and humpback whale ancestor

RETO Plays a role in cell differentiation and growth

C COSMIC GENES

anatomicalstructure

formation involvedin morphogenesis

autophagy

cellproliferation

cellminuscellsignaling

circulatorysystemprocess

growth

pigmentation

proteinmaturation

ribonucleoproteincomplexassembly

symbiosisencompassing

mutualismthrough

parasitismvacuolartransport

B

20

FIG 4 Positively selected genes during cetacean evolution (A) Species tree relationships of six modern cetaceans with complete genomeassemblies estimated from 152 single-copy orthologs Branch lengths are given in coalescent units Outgroup taxa are not shown The completespecies tree of 28 mammals is shown in supplementary figure 4 Supplementary Material online Boxes with numbers indicate the number ofpositively selected genes passing filters and a Bonferroni correction detected on each branch (B) TreeMap from REVIGO for GO biologicalprocesses terms represented by genes evolving under positive selection across all cetaceans Rectangle size reflects semantic uniqueness of GOterm which measures the degree to which the term is an outlier when compared semantically to the whole list of GO terms (C) Cancer gene namesand functions from COSMIC found to be evolving under positive selection in the cetacean branch-site models Superscripts for gene namesindicate as follows T tumor suppressor gene O oncogene F fusion gene Asterisks indicate P-value following FDR correction for multiple testingPlt 001 Plt 0001 Plt 00001

Tollis et al doi101093molbevmsz099 MBE

1756

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

ovarian carcinomas and lymphomas have been detected infin whales (Newman and Smith 2006) Among smaller ceta-ceans one unusually well-documented case study concludedthat 27 of beluga whales (Delphinapterus leucas) found deadin the St Lawrence estuary had cancer which may have con-tributed to 18 of the total mortality in that population(Martineau et al 2002) The authors suggested that thehigh degree of polycyclic aromatic hydrocarbons releasedinto the estuary by nearby industry may have contributedto this elevated cancer risk (Martineau et al 2002) Bycontrast the larger baleen whale species in the Gulf ofSt Lawrence appear to have lower contaminant burdenslikely due to ecological differences (Gauthier et al 1997)Interestingly unlike in human cells homologous recom-bination is uninhibited in North Atlantic right whale lungcells following prolonged exposure to the human lungcarcinogen particulate hexavalent chromium (Browninget al 2017) suggesting adaptations for high-fidelity DNArepair in whales

In this study we provide a de novo reference assemblyfor the humpback whalemdashone of the more well-studiedgiants living on Earth today The humpback whale genomeassembly is highly contiguous and contains a comparablenumber of orthologous genes to other mammalian ge-nome projects Our comparisons with other complete ce-tacean genomes confirm the results of other studies whichconcluded that rorqual whales likely began diversifyingduring the Miocene (Slater et al 2017 Arnason et al2018) We found indications of positive selection onmany protein-coding genes suggestive of adaptive changein pathways controlling the mammalian appendicular andcranial skeletal elements which are relevant to highly spe-cialized cetacean phenotypes as well as in many immunitygenes and pathways that are known to place checks onneoplastic progression LSDs in cetacean genomes containmany genes involved in the control of apoptosis includingknown tumor suppressor genes and skin transcriptomeresults from humpback whale suggest many gene duplica-tions whether through segmental duplication or retro-transposition are transcribed and hence likely functionalWe also use genome-wide evidence to show that germlinemutation rates may be slower in cetaceans than in othermammals which has been suggested in previous studies(Jackson et al 2009) and we suggest as a corollary thatcetacean somatic mutations rates may be lower as wellThese results are consistent with predictions stemmingfrom Petorsquos Paradox (Peto et al 1975 Caulin and Maley2011) which posited that gigantic animals have evolvedcompensatory adaptations to cope with the negativeeffects of orders of magnitude more cells and long lifespansthat increase the number of cell divisions and cancer riskover time Altogether the humpback whale genome as-sembly will aid comparative oncology research that seeksto improve therapeutic targets for human cancers as wellas provide a resource for developing useful genomicmarkers that will aid in the population management andconservation of whales

Materials and Methods

Tissue Collection and DNA ExtractionBiopsy tissue was collected from an adult female humpbackwhale (ldquoSaltrdquo NCBI BioSample SAMN1058501) in the Gulf ofMaine western North Atlantic Ocean using previously de-scribed techniques (Lambertsen 1987 Palsboslashll et al 1991) andflash frozen in liquid nitrogen We extracted DNA from skinusing the protocol for high-molecular-weight genomic DNAisolation with the DNeasy Blood and Tissue purification kit(Qiagen) Humpback whales can be individually identifiedand studied over time based on their unique ventral flukepigmentation (Katona and Whitehead 1981) Salt was specif-ically selected for this study because of her 35-year priorsighting history which is among the lengthiest and detailedfor an individual humpback whale (Center for Coastal Studiesunpublished data)

De Novo Assembly of the Humpback Whale GenomeUsing a combination of paired-end and mate-pair libraries denovo assembly was performed using Meraculous 204(Chapman et al 2011) with a kmer size of 47 Reads weretrimmed for quality sequencing adapters and mate-pairadapters using Trimmomatic (Bolger et al 2014) The genomesize of the humpback whale was estimated using the shortreads by counting the frequency of kmers of length 27 oc-curring in the 180-bp data set estimating the kmer coverageand using the following formula genome size frac14 total kmers kmer coverage

Chicago Library Preparation and SequencingFour Chicago libraries were prepared as described previously(Putnam et al 2016) Briefly for each library500 ng of high-molecular-weight genomic DNA (mean fragment lengthgt50kb) was reconstituted into chromatin in vitro and fixed withformaldehyde Fixed chromatin was digested with MboI orDpnII the 50-overhangs were repaired with biotinylatednucleotides and blunt ends were ligated After ligation cross-links were reversed and the DNA purified from protein Biotinthat was not internal to ligated fragments was removed fromthe purified DNA The DNA was then sheared to 350 bpmean fragment size and sequencing libraries were generatedusing NEBNext Ultra (New England BioLabs) enzymes andIllumina-compatible adapters Biotin-containing fragmentswere isolated using streptavidin beads before PCR enrichmentof each library The libraries were sequenced on an IlluminaHiSeq 2500 platform

Scaffolding the De Novo Assembly with HiRiseThe input de novo assembly shotgun reads and Chicagolibrary reads were used as input data for HiRise a softwarepipeline designed specifically for using Chicago data to scaf-fold genome assemblies (Putnam et al 2016) Shotgun andChicago library sequences were aligned to the draft inputassembly using a modified SNAP read mapper (httpsnapcsberkeleyedu) The separations of Chicago read pairsmapped within draft scaffolds were analyzed by HiRise toproduce a likelihood model for genomic distance between

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1757

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

read pairs and the model was used to identify putative mis-joins and to score prospective joins After scaffolding shotgunsequences were used to close gaps between contigs

Assessing the Gene Content of the Humpback WhaleAssemblyThe expected gene content of the assembly was evaluatedusing the Core Eukaryotic Genes Mapping Approach (Parraet al 2009) which searches the assembly for 458 highly con-served proteins and reports the proportion of 248 of the mosthighly conserved orthologs that are present in the assemblyWe also used the Benchmarking Universal Single CopyOrthologs (BUSCO v201 Sim~ao et al 2015) which analyzesgenome assemblies for the presence of 3023 genes conservedacross vertebrates as well as a set of 6253 genes conservedacross laurasiatherian mammals

Transcriptome Sequencing and AssemblyIn order to aid in our gene-finding efforts for the humpbackwhale genome assembly and to measure gene expression wegenerated transcripts from skin tissue by extracting total RNAusing the QIAzol Lysis Reagent (Qiagen) followed by purifi-cation on RNeasy spin columns (Qiagen) RNA integrity andquantity were determined on the Agilent 2100 Bioanalyzer(Agilent) using the manufacturerrsquos protocol The total RNAwas treated with DNase using DNase mix from the RecoverAllTotal Nucleic Acid Isolation kit (Applied BiosystemsAmbion) The RNA library was prepared and sequenced bythe Genome Technology Center at the University ofCalifornia Santa Cruz including cDNA synthesis with theOvation RNA-Seq system V2 (Nugen) and RNA amplificationas described previously (Tariq et al 2011) We used 05ndash1mgof double-stranded cDNA for library preparation shearedusing the Covaris S2 size-selected for 350ndash450 bp using au-tomated electrophoretic DNA fractionation system(LabChipXT Caliper Life Sciences) Paired-end sequencing li-braries were constructed using Illumina TruSeq DNA SamplePreparation Kit Following library construction samples werequantified using the Bioanalyzer and sequenced on theIllumina HiSeq 2000 platform to produce 2 100 bp se-quencing reads We then used Trinity (Grabherr et al 2011)to assemble the adapter-trimmed RNA-Seq reads intotranscripts

Genome AnnotationWe generated gene models for the humpback whale usingmultiple iterations of MAKER2 (Holt and Yandell 2011) whichincorporated 1) direct evidence from the Trinity-assembledtranscripts 2) homology to NCBI proteins from ten mammals(human mouse dog cow sperm whale bottlenose dolphinorca bowhead whale common minke whale and baiji) andUniProtKBSwiss-Prot (UniProt Consortium 2015) and 3) abinitio gene predictions using SNAP (11292013 release Korf2004) and Augustus v302 (Stanke et al 2008) A detaileddescription of the annotation pipeline is provided in the sup-plementary Methods Supplementary Material online Finalgene calls were annotated functionally by BlastP similarity to

UniProt proteins (UniProt Consortium 2015) with an e-valuecutoff of 1e-6

Repeat Annotation and Evolutionary AnalysisTo analyze the repetitive landscape of the humpback whalegenome we used both database and de novo modelingmethods For the database method we ran RepeatMaskerv405 (httpwwwrepeatmaskerorg accessed August 212017) (Smit et al 2015a) on the final assembly indicatingthe ldquomammaliardquo repeat library from RepBase (Jurka et al2005) For the de novo method we scanned the assemblyfor repeats using RepeatModeler v108 (httpwwwrepeat-maskerorg) (Smit et al 2015b) the results of which were thenclassified using RepeatMasker To estimate evolutionary di-vergence within repeat subfamilies in the humpback whalegenome we generated repeat-family-specific alignments andcalculated the average Kimura-2-parameter divergence fromconsensus within each family correcting for high mutationrates at CpG sites with the calcDivergenceFromAlignplRepeatMasker tool We compared the divergence profile ofhumpback whale and bowhead whale by completing parallelanalyses and the repetitive landscapes of orca and bottlenosedolphin are available from the RepeatMasker server (httpwwwrepeatmaskerorgspecies accessed August 21 2017)

Analysis of Gene Expression Using RNA-SeqSplice-wise mapping of RNA-Seq reads against the humpbackwhale genome assembly and annotation was carried out us-ing STAR v24 (Dobin et al 2013) and we counted the num-ber of reads mapping to gene annotations We also mappedthe skin RNA-Seq data to the database of annotated hump-back whale transcripts using local alignments with bowtiev225 (Langmead and Salzberg 2012) and used stringtiev134 (Pertea et al 2015) to calculate gene abundances bytranscripts per million

Analysis of Segmental Duplications in CetaceanGenomesIn order to detect LSDs in several cetacean genomes we ap-plied an approach based on depth of coverage (Alkan et al2009) To this end we used whole-genome shotgun sequencedata from the current study as well as from other cetaceangenomics projects All data were mapped against the hump-back whale reference assembly A detailed description of thesegmental duplication analysis is provided in the supplemen-tary Methods Supplementary Material online

Whole-Genome AlignmentsWe generated WGAs of 12 mammals (supplementary table 2Supplementary Material online) First we generated pairwisesyntenic alignments of each species as a query to the humangenome (hg19) as a target using LASTZ v102 (Harris 2007)followed by chaining to form gapless blocks and netting torank the highest scoring chains (Kent et al 2003) The pairwisealignments were used to construct a multiple sequence align-ment with MULTIZ v112 (Blanchette et al 2004) with humanas the reference species We filtered the MULTIZ alignment to

Tollis et al doi101093molbevmsz099 MBE

1758

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

only contain aligned blocks from at least 10 out of the 12species (81 complete)

Phylogenetic Reconstruction Using Single-CopyOrthologsWe downloaded the coding DNA sequences from 28 publiclyavailable mammalian genome assemblies (supplementary ta-ble 9 Supplementary Material online) and used VESPA(Webb et al 2017) to obtain high-confidence SGOs (supple-mentary Methods Supplementary Material online) For phy-logenetic analysis we filtered the SGO data set to include onlyloci that were represented by at least 24 out of the 28 mam-malian species (86 complete) and reconstructed each genetree using maximum likelihood in PhyML v30 (Guindon et al2010) with an HKY85 substitution model and 100 bootstrapreplicates to assess branch support The gene trees were thenbinned and used to reconstruct a species tree using the ac-curate species tree algorithm (ASTRAL-III v56 Zhang et al2018) ASTRAL utilizes the multispecies coalescent modelthat incorporates incomplete lineage sorting and finds thespecies tree stemming from bipartitions predefined by thegene trees Branch support for the species tree was assessedwith local posterior probabilities and branch lengths werepresented in coalescent units where shorter branch lengthsindicate greater gene tree discordance (Sayyari and Mirarab2016)

Rates of Molecular Evolution and Divergence TimeEstimationWe used multiple approaches on independent data sets toestimate rates of molecular evolution and the divergencetimes of the major mammalian lineages including six modernwhales with complete genome assemblies We first focusedon 4-fold degenerate (4D) sites which are positions withincodon alignments where substitutions result in no aminoacid change and can be used to approximate the neutralrate of evolution (Kumar and Subramanian 2002) We usedthe Ensembl human gene annotation to extract codingregions from the 12-mammal WGA using msa_view inPHAST v14 (Hubisz et al 2011) We reconstructed the phy-logeny with the 4D data as a single partition in RAxML v83(Stamatakis 2014) under the GTRGAMMA substitutionmodel and assessed branch support with 10000 bootstrapsRates of molecular evolution were estimated on the 4D dataset with the semiparametric PL method implemented in r8sv18 (Sanderson 2002 2003) A detailed description of the PLmethod is given in the supplementary MethodsSupplementary Material online

We also used the approximate likelihood calculation inMCMCtree (Yang and Rannala 2006) to estimate divergencetimes using independent data sets 1) the above-mentioned4D data set derived from the WGA as well as 2) a set of theSGOs that included 24 out of 28 sampled taxa (86) and waspartitioned into three codon positions We implemented theHKY85 substitution model multiple fossil-based priors (sup-plementary table 10 Supplementary Material online Mitchell1989 Benton et al 2015 Hedges et al 2015) and independentrates (ldquoclockfrac14 3rdquo) along branches All other parameters were

set as defaults For each MCMCtree analysis we ran the anal-ysis three times with different starting seeds and modified theMarkov chain Monte Carlo (MCMC) length and samplingfrequency in order to achieve proper chain convergencemonitored with Tracer v17 We achieved proper MCMCconvergence on the 4D data set after discarding the first500000 steps as burn-in and sampling every 2000 steps untilwe collected 20000 samples We achieved proper MCMCconvergence on the SGO data set after discarding the first500000 steps as burn-in and sampling every 10000 steps untilwe collected 10000 samples

Demographic AnalysisWe used the PSMC (Li and Durbin 2011) to reconstruct thepopulation history of North Atlantic humpback whales in-cluding the individual sequenced in the current study (down-sampled to 20 coverage) and a second individualsequenced at 17 coverage in Arnason et al (2018) Adetailed description of the PSMC analysis is provided in thesupplementary Methods Supplementary Material online

Nonneutral Substitution Rates in Cetacean GenomesIn order to identify genomic regions controlling cetacean-specific adaptations we used phyloP (Pollard et al 2010) todetect loci in the 12-mammal WGA that depart from neutralexpectations (see supplementary Methods SupplementaryMaterial online) We then collected accelerated regions thatoverlapped human whole gene annotations (hg19) using bed-tools intersect (Quinlan and Hall 2010) and tested for theenrichment of GO terms using the PANTHER analysis toolavailable at the Gene Ontology Consortium website (GOOntology database last accessed June 2017) (GeneOntology Consortium 2015)

Detection of Protein-Coding Genes Subjected toPositive SelectionIn order to measure selective pressures acting on protein-coding genes during cetacean evolution with an emphasison the evolution of cancer suppression we estimated theratio of nonsynonymous to synonymous substitutions (dNdS) To maximize statistical power in pairwise comparisonsgiven the number of available cetacean genomes (six lastaccessed September 2017) we implemented phylogenetictargeting (Arnold and Nunn 2010) assuming a phylogenyfrom a mammalian supertree (Fritz et al 2009) To selectgenome assemblies most suitable for assessing PetorsquosParadox we weighted scores for contrasts with a lot of changein the same direction for both body mass and maximumlongevity Trait values were taken from panTHERIA (Joneset al 2009) and we selected maximal pairings based on thestandardized summed scores We then generated pairwisegenome alignments as described above based on the phylo-genetic targeting results For each pairwise genome align-ment we stitched gene blocks in Galaxy (Blankenberg et al2011) according to the target genome annotations produc-ing alignments of one-to-one orthologs which were filteredto delete frameshift mutations and replace internal stopcodons with gaps We then estimated pairwise dNdS for every

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1759

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

orthologous gene pair with KaKs_Calculator v20 (Wang et al2010) To link genes with dNdS gt1 to potential phenotypeswe used orthologous human Ensembl gene IDs to collect GOterms in BioMart (Kinsella et al 2011) and tested for enrich-ment of overrepresented GO terms

We also used codon-based models to test for selectivepressure variation along branches of the cetacean phylog-eny in comparison to other mammal lineages also knownas the branch-site test (Yang 2007) First the known spe-cies phylogeny (Morgan et al 2013 Tarver et al 2016) waspruned to correspond to the species present in each SGOfamily SGO nucleotide alignments that contained morethan seven species were analyzed for selective pressurevariation This is to reduce the risk of detecting falsepositives (Anisimova et al 2001 2002) In general thebranch-site test is a powerful yet conservative approach(Gharib and Robinson-Rechavi 2013) although modelmisspecification and alignment errors can greatly increasethe number of false positives (Anisimova et al 20012002) Recent studies have concluded that many pub-lished inferences of adaptive evolution using the branchsite test may be affected by artifacts (Venkat et al 2018)Therefore extensive filtering is necessary in order to makereasonably sound conclusions from results of the branchsite test A detailed description of all tested models andthe filtering process are given in the supplementaryMethods Supplementary Material online In total 1152gene families were analyzed We carried out the branch-site test using PAML v44e (Yang 2007) The following fivebranches were assessed as foreground humpback whalethe most recent common ancestor (MRCA) of the com-mon minke and humpback whales MRCA of baleenwhales MRCA of toothed whales and the MRCA of allwhales (cetacean stem lineage) For each model we keptall genes that met a significance threshold of Plt 005 aftera Bonferonni correction for multiple hypothesis testingusing the total number of branch genes (five foregroundbranches1152 genes) We also corrected the raw P-val-ues from the likelihood ratio tests of every gene by theFDR correction where qfrac14 005 (Benjamini and Hochberg1995) The Bonferroni correction is more conservativethan the FDR but sufficient for multiple hypothesis test-ing of lineage-specific positive selection in these casesthe FDR results in higher probabilities of rejecting truenull hypotheses (Anisimova and Yang 2007) Thereforewe used Bonferroni-corrected results in downstreamanalyses but also report FDR-corrected P-values infigure 4C Genes were identified based on the humanortholog (Ensembl gene ID) and we performed gene an-notation enrichment analysis and functional annotationclustering with DAVID v68 (Huang et al 2009a 2009b) aswell as semantic clustering of GO terms using REVIGO(Supek et al 2011) We also searched for interactions ofpositively selected proteins using STRING v105(Szklarczyk et al 2017) with default parameters andtested for the enrichment of overrepresented GO termsas above and for associated mouse phenotypes usingmodPhEA (Weng and Liao 2017)

Data AvailabilityAll data that contributed to the results of the study are madepublicly available The genomic sequencing RNA sequencingas well as the genome assembly for humpback whale(GCA_0043293851) are available under NCBI BioProjectPRJNA509641 The gene annotation orthologous gene setspositive selection results segmental duplication annotationsand whole-genome alignments used in this study are availableat the Harvard Dataverse (httpsdoiorg107910DVNADHX1O)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsField data collection was supported by the Center for CoastalStudies and conducted under NOAA research permit 16325This work was initiated with funding from Dr Jeffrey Pearl atUC San Fransisco This work was supported in part by NIH(Grants U54 CA217376 U2C CA233254 P01 CA91955 R01CA149566 R01 CA170595 R01 CA185138 and R01CA140657) as well as CDMRP Breast Cancer ResearchProgram (Award BC132057) and the Arizona BiomedicalResearch Commission (Grant ADHS18-198847) to CCMThe findings opinions and recommendations expressedhere are those of the authors and not necessarily those ofthe universities where the research was performed or theNational Institutes of Health PJP would like to acknowledgefunding from Stockholm University and the University ofGroningen Startup funds were provided by the School ofInformatics Computing and Cyber Systems at NorthernArizona University (MT) MJOrsquoC would like to thank theUniversity of Leeds funding her 250 Great Minds UniversityAcademic Fellowship The authors thank Wensi Hao(University of Groningen) for technical support AndreaCabrera (University of Groningen) for helpful conversationsabout mutation rates and demography and Richard E Green(UC Santa Cruz) for consultations regarding genome se-quencing assembly and annotation The authors acknowl-edge Research Computing at Arizona State University (httpwwwresearchcomputingasuedu) the Advanced ResearchComputing resources at the University of Leeds and theMonsoon computing cluster at Northern ArizonaUniversity (httpsnaueduhigh-performance-computing)for providing high-performance computing and storageresources that have contributed to this study

ReferencesAbegglen LM Caulin AF Chan A Lee K Robinson R Campbell MS Kiso

WK Schmitt DL Waddell PJ Bhaskara S 2015 Potential mechanismsfor cancer resistance in elephants and comparative cellular responseto DNA damage in humans JAMA 314(17)1850ndash1860

Alkan C Kidd JM Marques-Bonet T Aksay G Antonacci F HormozdiariF Kitzman JO Baker C Malig M Mutlu O 2009 Personalized copynumber and segmental duplication maps using next-generation se-quencing Nat Genet 41(10)1061ndash1067

Tollis et al doi101093molbevmsz099 MBE

1760

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Anisimova M Bielawski JP Yang Z 2001 Accuracy and power of thelikelihood ratio test in detecting adaptive molecular evolution MolBiol Evol 18(8)1585ndash1592

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19(6)950ndash958

Anisimova M Yang Z 2007 Multiple hypothesis testing to detect line-ages under positive selection that affects only a few sites Mol BiolEvol 24(5)1219ndash1228

Arnason U Lammers F Kumar V Nilsson MA Janke A 2018 Whole-genome sequencing of the blue whale and other rorquals findssignatures for introgressive gene flow Sci Adv 4(4)eaap9873

Arnason U Widegren B 1989 Composition and chromosomal localiza-tion of cetacean highly repetitive DNA with special reference to theblue whale Balaenoptera musculus Chromosoma 98(5)323ndash329

Arnold C Nunn CL 2010 Phylogenetic targeting of research effort inevolutionary biology Am Nat 176(5)601ndash612

Baker CS Perry A Bannister JL Weinrich MT Abernethy RBCalambokidis J Lien J Lambertsen RH Ramirez JU Vasquez O1993 Abundant mitochondrial DNA variation and world-wide pop-ulation structure in humpback whales Proc Natl Acad Sci U S A90(17)8239ndash8243

Baker J Meade A Pagel M Venditti C 2015 Adaptive evolution towardlarger size in mammals Proc Natl Acad Sci U S A 112(16)5093ndash5098

Benjamini Y Hochberg Y 1995 Controlling the false discovery rate apractical and powerful approach to multiple testing J R Stat Soc57289ndash300

Benson RBJ Campione NE Carrano MT Mannion PD Sullivan CUpchurch P Evans DC 2014 Rates of dinosaur body mass evolutionindicate 170 million years of sustained ecological innovation on theavian stem lineage PLoS Biol 12(5)e1001853

Benton MJ Donoghue PCJ Asher RJ Friedman M Near TJ Vinther J2015 Constraints on the timescale of animal evolutionary historyPalaeontol Electron 181ndash106

Berta A Sumich JL Kovacs KM 2015 Marine mammals evolutionarybiology 3rd ed Cambridge (MA) Academic Press

Blanchette M Kent WJ Riemer C Elnitski L Smit AFA Roskin KMBaertsch R Rosenbloom K Clawson H Green ED 2004 Aligningmultiple genomic sequences with the threaded blockset alignerGenome Res 14(4)708ndash715

Blankenberg D Taylor J Nekrutenko A Galaxy T 2011 Making wholegenome multiple alignments usable for biologists Bioinformatics27(17)2426ndash2428

Boissinot S Sookdeo A 2016 The evolution of LINE-1 in vertebratesGenome Biol Evol 8(12)3485ndash3507

Bolger AM Lohse M Usadel B 2014 Trimmomatic a flexible trimmer forIllumina sequence data Bioinformatics 30(15)2114ndash2120

Browning CL Wise CF Wise JP 2017 Prolonged particulate chromateexposure does not inhibit homologous recombination repair inNorth Atlantic right whale (Eubalaena glacialis) lung cells ToxicolAppl Pharmacol 33118ndash23

Carbone L Harris RA Gnerre S Veeramah KR Lorente-Galdos BHuddleston J Meyer TJ Herrero J Roos C Aken B et al 2014Gibbon genome and the fast karyotype evolution of small apesNature 513(7517)195ndash201

Caulin AF Graham TA Wang L-S Maley CC 2015 Solutions to Petorsquosparadox revealed by mathematical modelling and cross-species can-cer gene analysis Philos Trans R Soc B 370(1673)20140222

Caulin AF Maley CC 2011 Petorsquos Paradox evolutionrsquos prescription forcancer prevention Trends Ecol Evol (Amst) 26(4)175ndash182

Chapman JA Ho I Sunkara S Luo S Schroth GP Rokhsar DS 2011Meraculous de novo genome assembly with short paired-end readsPLoS One 6(8)e23501

Chittleborough RG 1959 Determination of age in the humpback whaleMegaptera nodosa (Bonnaterre) Aust J Mar Freshw Res10(2)125ndash143

Clapham PJ Mead JG 1999 Megaptera novaeangliae Mamm Species6041ndash9

Clement V Dunand-Sauthier I Clarkson SG 2006 Suppression of UV-induced apoptosis by the human DNA repair protein XPG CellDeath Differ 13(3)478ndash488

Dobin A Davis CA Schlesinger F Drenkow J Zaleski C Jha S Batut PChaisson M Gingeras TR 2013 STAR ultrafast universal RNA-Seqaligner Bioinformatics 29(1)15ndash21

Fay JC Wu C-I 2003 Sequence divergence functional constraint andselection in protein evolution Annu Rev Genomics Hum Genet4213ndash235

Ferris E Abegglen LM Schiffman JD Gregg C 2018 Accelerated evolu-tion in distinctive species reveals candidate elements for clinicallyrelevant traits including mutation and cancer resistance Cell Rep22(10)2742ndash2755

Fog CK Galli GG Lund AH 2012 PRDM proteins important players indifferentiation and disease Bioessays 34(1)50ndash60

Foley NM Hughes GM Huang Z Clarke M Jebb D Whelan CV Petit EJTouzalin F Farcy O Jones G et al 2018 Growing old yet stayingyoung the role of telomeres in batsrsquo exceptional longevity Sci Adv4(2)eaao0926

Foote AD Liu Y Thomas GWC Vinar T Alfoldi J Deng J Dugan Svan Elk CE Hunter ME Joshi V et al 2015 Convergent evolu-tion of the genomes of marine mammals Nat Genet47(3)272ndash275

Forbes SA Beare D Gunasekaran P Leung K Bindal N Boutselakis HDing M Bamford S Cole C Ward S et al 2015 COSMIC exploringthe worldrsquos knowledge of somatic mutations in human cancerNucleic Acids Res 43(D1)D805ndashD811

Fritz SA Bininda-Emonds ORP Purvis A 2009 Geographical variation inpredictors of mammalian extinction risk big is bad but only in thetropics Ecol Lett 12(6)538ndash549

Futreal PA Coin L Marshall M Down T Hubbard T Wooster R RahmanN Stratton MR 2004 A census of human cancer genes Nat RevCancer 4(3)177ndash183

Gabriele C Lockyer C Straley JM Jurasz CM Kato H 2010 Sightinghistory of a naturally marked humpback whale (Megaptera novaean-gliae) suggests ear plug growth layer groups are deposited annuallyMar Mam Sci 26(2)443ndash450

Gauthier JM Metcalfe CD Sears R 1997 Chlorinated organic contam-inants in blubber biopsies from northwestern Atlantic balaenopteridwhales summering in the Gulf of St Lawrence Mar Environ Res44(2)201ndash223

Gene Ontology Consortium 2015 Gene Ontology Consortium goingforward Nucleic Acids Res 43D1049ndashD1056

George JC Bada J Zeh J Scott L Brown SE OrsquoHara T Suydam R 1999Age and growth estimates of bowhead whales (Balaena mysticetus)via aspartic acid racemization Can J Zool 77(4)571ndash580

Gharib WH Robinson-Rechavi M 2013 The branch-site test of positiveselection is surprisingly robust but lacks power under synonymoussubstitution saturation and variation in GC Mol Biol Evol30(7)1675ndash1686

Grabherr MG Haas BJ Yassour M Levin JZ Thompson DA Amit IAdiconis X Fan L Raychowdhury R Zeng Q et al 2011 Full-lengthtranscriptome assembly from RNA-Seq data without a referencegenome Nat Biotechnol 29(7)644ndash652

Guindon S Dufayard J-F Lefort V Anisimova M Hordijk W Gascuel O2010 New algorithms and methods to estimate maximum-likelihood phylogenies assessing the performance of PhyML 30Syst Biol 59(3)307ndash321

Harris RS 2007 Improved pairwise alignment of genomic DNA [PhDthesis] The Pennsylvania State University State College PA USA

Hedges SB Marin J Suleski M Paymer M Kumar S 2015 Tree of lifereveals clock-like speciation and diversification Mol Biol Evol32(4)835ndash845

Heim NA Knope ML Schaal EK Wang SC Payne JL 2015 Copersquos rule inthe evolution of marine animals Science 347(6224)867ndash870

Holt C Yandell M 2011 MAKER2 an annotation pipeline and genome-database management tool for second-generation genome projectsBMC Bioinformatics 12491

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1761

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Huang DW Sherman BT Lempicki RA 2009a Bioinformatics enrich-ment tools paths toward the comprehensive functional analysis oflarge gene lists Nucleic Acids Res 37(1)1ndash13

Huang DW Sherman BT Lempicki RA 2009b Systematic and integra-tive analysis of large gene lists using DAVID bioinformatics resourcesNat Protoc 4(1)44ndash57

Hubisz MJ Pollard KS Siepel A 2011 PHAST and RPHAST phylogeneticanalysis with spacetime models Brief Bioinform 12(1)41ndash51

Hudson RR 1990 Gene genealogies and the coalescent process InFutuyama DJ Antonovics J editors Oxfords surveys in evolutionarybiology Oxford Oxford University Press p 1ndash44

Ichishima H 2016 The ethmoid and presphenoid of cetaceans JMorphol 277(12)1661ndash1674

Jackson JA Baker CS Vant M Steel DJ Medrano-Gonzalez LPalumbi SR 2009 Big and slow phylogenetic estimates of mo-lecular evolution in baleen whales (suborder Mysticeti) MolBiol Evol 26(11)2427ndash2440

Jackson JA Steel DJ Beerli P Congdon BC Olavarria C Leslie MS PomillaC Rosenbaum H Baker CS 2014 Global diversity and oceanic di-vergence of humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 281(1786)20133222

Jones KE Bielby J Cardillo M Fritz SA OrsquoDell J Orme CDL Safi KSechrest W Boakes EH Carbone C et al 2009 PanTHERIA a spe-cies-level database of life history ecology and geography of extantand recently extinct mammals Ecology 90(9)2648ndash2648

Jurka J Kapitonov VV Pavlicek A Klonowski P Kohany O WalichiewiczJ 2005 Repbase Update a database of eukaryotic repetitive ele-ments Cytogenet Genome Res 110(1ndash4)462ndash467

Kaessmann H 2010 Origins evolution and phenotypic impact of newgenes Genome Res 20(10)1313ndash1326

Kastan MB Bartek J 2004 Cell-cycle checkpoints and cancer Nature432(7015)316ndash323

Katona SK Whitehead HP 1981 Identifying humpback whales usingtheir natural markings Polar Rec 20(128)439ndash444

Keane M Semeiks J Webb AE Li YI Quesada V Craig T Madsen LB vanDam S Brawand D Marques PI et al 2015 Insights into the evolu-tion of longevity from the bowhead whale genome Cell Rep10(1)112ndash122

Kent WJ Baertsch R Hinrichs A Miller W Haussler D 2003 Evolutionrsquoscauldron duplication deletion and rearrangement in the mouseand human genomes Proc Natl Acad Sci U S A100(20)11484ndash11489

Kingsolver JG Pfennig DW 2004 Individual-level selection as acause of Copersquos rule of phyletic size increase Evolution58(7)1608ndash1612

Kinsella RJ Keuroaheuroari A Haider S Zamora J Proctor G Spudich G Almeida-King J Staines D Derwent P Kerhornou A et al 2011 EnsemblBioMarts a hub for data retrieval across taxonomic spaceDatabase (Oxford) 2011bar030

Korf I 2004 Gene finding in novel genomes BMC Bioinformatics 559Kumar S Subramanian S 2002 Mutation rates in mammalian genomes

Proc Natl Acad Sci U S A 99(2)803ndash808Lambertsen RH 1987 A biopsy system for large whales and its use for

cytogenetics J Mammal 68(2)443ndash445Langmead B Salzberg SL 2012 Fast gapped-read alignment with Bowtie

2 Nat Methods 9(4)357ndash359Larsen AH Sigurjonsson J Oslashien N Vikingsson G Palsboslashll PJ 1996

Populations genetic analysis of nuclear and mitochondrial loci inskin biopsies collected from central and northeastern NorthAtlantic humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 2631611ndash1618

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357)493ndash496

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup 2009 The Sequence AlignmentMap format andSAMtools Bioinformatics 25(16)2078ndash2079

Martineau D Lemberger K Dallaire A Labelle P Lipscomb TP Michel PMikaelian I 2002 Cancer in wildlife a case study beluga from the St

Lawrence estuary Quebec Canada Environ Health Perspect110(3)285ndash292

Mazet O Rodrıguez W Grusea S Boitard S Chikhi L 2016 On theimportance of being structured instantaneous coalescence ratesand human evolution ndash lesson for ancestral population size infer-ence Heredity 116(4)362ndash371

Michaud EJ Yoder BK 2006 The primary cilium in cell signaling andcancer Cancer Res 66(13)6463ndash6467

Milholland B Dong X Zhang L Hao X Suh Y Vijg J 2017 Differencesbetween germline and somatic mutation rates in humans and miceNat Commun 815183

Mitchell ED 1989 A new cetacean from the Late Eocene La MesetaFormation Seymour Island Antarctic Peninsula Can J Fish Aquat Sci46(12)2219ndash2235

Morgan CC Foster PG Webb AE Pisani D McInerney JO OrsquoConnell MJ2013 Heterogeneous models place the root of the placental mam-mal phylogeny Mol Biol Evol 30(9)2145ndash2156

Nadachowska-Brzyska K Burri R Smeds L Ellegren H 2016 PSMC anal-ysis of effective population sizes in molecular ecology and its appli-cation to black-and-white Ficedula flycatchers Mol Ecol25(5)1058ndash1072

Newman SJ Smith SA 2006 Marine mammal neoplasia a review VetPathol 43(6)865ndash880

Nunney L 2018 Size matters height cell number and a personrsquos risk ofcancer Proc R Soc B 285(1889)20181743

Ohsumi S 1979 Interspecies relationships among some biologicalparameters in cetaceans and estimation of the natural mortalitycoefficient of the Southern Hemisphere minke whales Rep IntWhaling Comm 29397ndash406

OrsquoLeary MA Gatesy J 2008 Impact of increased character sampling onthe phylogeny of Cetartiodactyla (Mammalia) combined analysisincluding fossils Cladistics 24397ndash442

Pagel M Venditti C Meade A 2006 Large punctuational contribution ofspeciation to evolutionary divergence at the molecular level Science314(5796)119ndash121

Palsboslashll PJ Clapham PJ Mattila DK Larsen F Sears R Siegismund HRSigurjonsson J Vasquez O Arctander P 1995 Distribution ofmtDNA haplotypes in North Atlantic humpback whales the influ-ence of behavior on population structure Mar Ecol Prog Ser1161ndash10

Palsboslashll PJ Larsen F Hansen ES 1991 Sampling of skin biopsies fromfree-raging large cetaceans in West Greenland development of newbiopsy tips and bolt designs Int Whaling Comm Spec Issue Ser13311

Palsboslashll PJ Zachariah Peery M Olsen MT Beissinger SR Berube M 2013Inferring recent historic abundance from current genetic diversityMol Ecol 22(1)22ndash40

Parra G Bradnam K Ning Z Keane T Korf I 2009 Assessing the genespace in draft genomes Nucleic Acids Res 37(1)289ndash297

Pertea M Pertea GM Antonescu CM Chang T-C Mendell JT SalzbergSL 2015 StringTie enables improved reconstruction of a transcrip-tome from RNA-Seq reads Nat Biotechnol 33(3)290ndash295

Peto R Roe FJ Lee PN Levy L Clack J 1975 Cancer and ageing in miceand men Br J Cancer 32(4)411ndash426

Pio R Corrales L Lambris JD 2014 The role of complement in tumorgrowth Adv Exp Med Biol 772229ndash262

Pollard KS Hubisz MJ Rosenbloom KR Siepel A 2010 Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 20(1)110ndash121

Putnam NH OrsquoConnell BL Stites JC Rice BJ Blanchette M Calef R TrollCJ Fields A Hartley PD Sugnet CW et al 2016 Chromosome-scaleshotgun assembly using an in vitro method for long-range linkageGenome Res 26(3)342ndash350

Quinlan AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26(6)841ndash842

Ruegg K Rosenbaum HC Anderson EC Engel M Rothschild A Baker CSPalumbi SR 2013 Long-term population size of the North Atlantichumpback whale within the context of worldwide population struc-ture Conserv Genet 14(1)103ndash114

Tollis et al doi101093molbevmsz099 MBE

1762

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Sanderson MJ 2002 Estimating absolute rates of molecular evolutionand divergence times a penalized likelihood approach Mol Biol Evol19(1)101ndash109

Sanderson MJ 2003 r8s inferring absolute rates of molecular evolutionand divergence times in the absence of a molecular clockBioinformatics 19(2)301ndash302

Sayyari E Mirarab S 2016 Fast coalescent-based computation of localbranch support from quartet frequencies Mol Biol Evol33(7)1654ndash1668

Shadat NMA Koide N Khuda II-E Dagvadorj J Tumurkhuu G Naiki YKomatsu T Yoshida T Yokochi T 2010 Retinoblastoma protein-interacting zinc finger 1 (RIZ1) regulates the proliferation of mono-cytic leukemia cells via activation of p53 Cancer Invest28(8)806ndash812

Sim~ao FA Waterhouse RM Ioannidis P Kriventseva EV Zdobnov EM2015 BUSCO assessing genome assembly and annotation com-pleteness with single-copy orthologs Bioinformatics31btv351ndashbtv3212

Slater GJ Goldbogen JA Pyenson ND 2017 Independent evolution ofbaleen whale gigantism linked to Plio-Pleistocene ocean dynamicsProc R Soc B 284(1855)20170546

Smit AFA Hubley RM Green P 2015a RepeatMasker Open-40 2013-2015 Available from httpwwwrepeatmaskerorg

Smit AFA Hubley RM Green P 2015b RepeatModeler Open-10 2008-2015 Available from httpwwwrepeatmaskerorg

Smith TD Allen J Clapham PJ Hammond PS Katona S Larsen F Lien JMattila D Palsboslashll PJ Sigurjonsson J et al 1999 An ocean-basin-widemark-recapture study of the North Atlantic humpback whale(Megaptera novaeangliae) Mar Mamm Sci 51ndash32

Stamatakis A 2014 RAxML version 8 a tool for phylogenetic analysisand post-analysis of large phylogenies Bioinformatics30(9)1312ndash1313

Stanke M Diekhans M Baertsch R Haussler D 2008 Using native andsyntenically mapped cDNA alignments to improve de novo genefinding Bioinformatics 24(5)637ndash644

Steeman ME Hebsgaard MB Fordyce RE Ho SYW Rabosky DL NielsenR Rahbek C Glenner H Soslashrensen MV Willerslev E 2009 Radiationof extant cetaceans driven by restructuring of the oceans Syst Biol58(6)573ndash585

Sulak M Fong L Mika K Chigurupati S Yon L Mongan NP Emes RDLynch VJ 2016 TP53 copy number expansion is associated with theevolution of increased body size and an enhanced DNA damageresponse in elephants Elife 51850

Supek F Bosnjak M Skunca N Smuc T 2011 REVIGO summarizes andvisualizes long lists of gene ontology terms PLoS One 6(7)e21800

Szklarczyk D Morris JH Cook H Kuhn M Wyder S Simonovic M SantosA Doncheva NT Roth A Bork P et al 2017 The STRING database in2017 quality-controlled protein-protein association networks madebroadly accessible Nucleic Acids Res 45(D1)D362ndashD368

Tacutu R Craig T Budovsky A Wuttke D Lehmann G Taranukha DCosta J Fraifeld VE de Magalh~aes JP 2012 Human Ageing GenomicResources integrated databases and tools for the biology and genet-ics of ageing Nucleic Acids Res 41(D1)D1027ndashD1033

Tariq MA Kim HJ Jejelowo O Pourmand N 2011 Whole-transcriptomeRNAseq analysis from minute amount of total RNA Nucleic AcidsRes 39(18)e120

Tarver JE Dos Reis M Mirarab S Moran RJ Parker S OrsquoReilly JE King BLOrsquoConnell MJ Asher RJ Warnow T et al 2016 The interrelationshipsof placental mammals and the limits of phylogenetic inferenceGenome Biol Evol 8(2)330ndash344

Thewissen JGM Cohn MJ Stevens LS Bajpai S Heyning J Horton WE2006 Developmental basis for hind-limb loss in dolphins and origin

of the cetacean bodyplan Proc Natl Acad Sci U S A103(22)8414ndash8418

Thewissen JGM Cooper LN Clementz MT Bajpai S Tiwari BN 2007Whales originated from aquatic artiodactyls in the Eocene epoch ofIndia Nature 450(7173)1190ndash1194

Tollis M Schiffman JD Boddy AM 2017 Evolution of cancer suppressionas revealed by mammalian comparative genomics Curr Opin GenetDev 4240ndash47

Trego KS Groesser T Davalos AR Parplys AC Zhao W Nelson MRHlaing A Shih B Rydberg B Pluth JM et al 2016 Non-catalytic rolesfor XPG with BRCA1 and BRCA2 in homologous recombination andgenome stability Mol Cell 61(4)535ndash546

UniProt Consortium 2015 UniProt a hub for protein informationNucleic Acids Res 43D204ndashD212

Valsecchi E Palsboslashll PJ Hale P Glockner-Ferrari D Ferrari M Clapham PLarsen F Mattila DK Sears R Sigurjonsson J et al 1997 Microsatellitegenetic distances between oceanic opulations of the humpbackwhale (Megaptera novaeangliae) Mol Biol Evol 14355ndash362

Venkat A Hahn MW Thornton JW 2018 Multinucleotide mutationscause false inferences of lineage-specific positive selection Nat EcolEvol 2(8)1280ndash1288

Vermeij GJ 2016 Gigantism and its implications for the history of lifePLoS One 11(1)e0146092

Wang D Zhang Y Zhang Z Zhu J Yu J 2010 KaKs_Calculator 20 atoolkit incorporating gamma-series methods and sliding windowstrategies Genomics Proteomics Bioinformatics 8(1)77ndash80

Warren WC Kuderna L Alexander A Catchen J Perez-Silva JG Lopez-Otın C Quesada V Minx P Tomlinson C Montague MJ et al 2017The novel evolution of the sperm whale genome Genome Biol Evol9(12)3260ndash3264

Webb AE Walsh TA OrsquoConnell MJ 2017 VESPA very large-scale evo-lutionary and selective pressure analyses PeerJ Comput Sci 3e118

Weng M-P Liao B-Y 2017 modPhEA model organism PhenotypeEnrichment Analysis of eukaryotic gene sets Bioinformatics33(21)3505ndash3507

Wilson Sayres MA Venditti C Pagel M Makova KD 2011 Do variationsin substitution rates and male mutation bias correlate with life-history traits A study of 32 mammalian genomes Evolution65(10)2800ndash2815

Yang Z 1998 Likelihood ratio tests for detecting positive selection andapplication to primate lysozyme evolution Mol Biol Evol15(5)568ndash573

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 24(8)1586ndash1591

Yang Z Rannala B 2006 Bayesian estimation of species divergence timesunder a molecular clock using multiple fossil calibrations with softbounds Mol Biol Evol 23(1)212ndash226

Yim H-S Cho YS Guang X Kang SG Jeong J-Y Cha S-S Oh H-M Lee J-HYang EC Kwon KK et al 2014 Minke whale genome and aquaticadaptation in cetaceans Nat Genet 46(1)88ndash92

Yu J Tian S Metheny-Barlow L Chew L-J Hayes AJ Pan H Yu G-L LiL-Y 2001 Modulation of endothelial cell growth arrest andapoptosis by vascular endothelial growth inhibitor Circ Res89(12)1161ndash1167

Zhang C Rabiee M Sayyari E Mirarab S 2018 ASTRAL-III polynomialtime species tree reconstruction from partially resolved gene treesBMC Bioinformatics 19(Suppl 6)153

Zhang Q Edwards SV 2012 The evolution of intron size in amniotes arole for powered flight Genome Biol Evol 4(10)1033ndash1043

Zhao M Kim P Mitra R Zhao J Zhao Z 2015 TSGene 20 an updatedliterature-based knowledgebase for tumor suppressor genes NucleicAcids Res 44gkv1268ndashgkD1031

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1763

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

  • msz099-TF1
  • msz099-TF2
  • msz099-TF3
  • msz099-TF4
  • msz099-TF5
  • msz099-TF6
  • msz099-TF7
  • msz099-TF8
Page 9: University of Groningen Return to the sea, get huge, …...Return to the Sea, Get Huge, Beat Cancer: An Analysis of Cetacean Genomes Including an Assembly for the Humpback Whale (Megaptera

order to measure selective pressures acting on each ortholo-gous gene pair during cetacean evolution A dNdSgt1 is usedto infer potentially functional amino acid changes in candi-date genes subjected to positive selection (Fay and Wu 2003)Among an estimated 435 genes with dNdS gt1 in the com-mon minkehumpback pairwise comparison we detectedeight genes belonging to the JAK-STAT signaling pathway(39-fold enrichment Pfrac14 11E-3 Fisherrsquos exact test) and seveninvolved in cytokinendashcytokine receptor interaction (41-foldenrichment Pfrac14 17E-2 Fisherrsquos exact test) suggesting positiveselection acting on pathways involved in cell proliferationThese genes included multiple members of the tumor necro-sis factor subfamily such as TNFSF15 which inhibits angiogen-esis and promotes the activation of caspases and apoptosis(Yu et al 2001) A dNdS gt1 was also detected in seven genesinvolved in the negative regulation of cell growth(GO0030308 31-fold enrichment Pfrac14 803E-3 Fisherrsquos exacttest) and five genes involved in double-strand break repair(GO0006302 40-fold enrichment Pfrac14 803E-3 Fisherrsquos exacttest) Although these results suggest the evolution of aminoacid differences since the split between common minke andhumpback whales in genes affecting cell growth proliferationand maintenance the GO category enrichment tests did notpass significance criteria after Bonferroni corrections for mul-tiple testing We found 18 genes that are mutated in cancersaccording to the COSMIC v85 database (Forbes et al 2015) inthe common minkehumpback comparison including a sub-set of five annotated as tumor suppressor genes oncogenesor fusion genes in the Cancer Gene Census (CGC Futreal et al2004) which are highlighted in table 5 The complete list ofCOSMIC genes with elevated dNdS in the pairwise compar-isons is given in supplementary table 8 SupplementaryMaterial online We detected 555 orthologous genes withdNdS gt1 in the orcadolphin comparison which are signifi-cantly enriched (after Bonferroni correction for multiple test-ing) for biological processes such as immune response cellactivation and regulation of cytokines (table 6) and 41 of

which are known cancer genes according to COSMIC andCGC (table 5 supplementary table 8 Supplementary Materialonline) These results are consistent with our accelerated re-gion analysis based on the WGA which showed acceleratedevolution in immunity pathways (above see table 4) Forinstance eight genes (CD58 CD84 KLF13 SAMSN1 CTSGGPC3 LTF and SPG21) annotated for immune system process(GO0002376) were found in cetacean-specific acceleratedgenomic regions and also had a pairwise dNdS gt1 in theorcadolphin comparison mirroring other recent genomicanalyses of immunity genes in orcas (Ferris et al 2018) Ourresults also suggest that the evolution of gigantism and longlifespans in cetaceans was accompanied by selection actingon many genes related to somatic maintenance and cellsignaling

As a more accurate assessment of selection pressure var-iation acting on protein-coding genes across cetacean evolu-tion we conducted an additional assessment of dNdS usingbranch-site codon models implemented in codeml (Yang1998) We employed extensive filtering of the branch siteresults including both false discovery rate (FDR) andBonferroni corrections for multiple testing (see Materialsand methods) and conservatively estimated that 450protein-coding genes were subjected to positive selection incetaceans These include 54 genes along the ancestral ceta-cean branch 12 along the ancestral toothed whale branch 84along the ancestral baleen whale branch 74 in the ancestor ofcommon minke and humpback whales and 212 unique tothe humpback whale branch (fig 4A) Cetacean positivelyselected genes were annotated for functions related to exten-sive changes in anatomy growth cell signaling and cell pro-liferation (fig 4B) For instance in the branch-site models forhumpback whale positively selected genes are enriched forseveral higher-level mouse limb phenotypes including thoseaffecting the limb long bones (MP0011504 15 genes FDR-corrected P-valuefrac14 0001) and more specifically the hindlimb stylopod (MP0003856 seven genes FDRfrac14 0024) or

FIG 3 Diversity in both body size and lifespan within rorqual baleen whales (Balaenopteridae) and dolphins (Delphinidae) Maximal pairings usingphylogenetic targeting (Arnold and Nunn 2010) of genome assembly-enabled cetaceans resulted in the most extreme divergence in both body sizeand lifespan between humpback whale (Megaptera novaeangliae) and common minke whale (Balaenoptera acutorostrata) within theBalaenopteridae facing right and orca (Orcinus orca) and bottlenose dolphin (Tursiops truncatus) within the Delphinidae facing left Traitdata were collected from the panTHERIA (Jones et al 2009) and AnAge (Tacutu et al 2012) databases

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1753

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Table 5 CGC Genes with dNdS gt 1 as Revealed by Pairwise Comparisons of Cetacean Genomes

Comparison Gene Symbol Gene Name Role in Cancer Function

Minkehumpback CD274 CD274 molecule TSG Plays a critical role in induction and maintenance ofimmune tolerance to selfa

ETNK1 Ethanolamine kinase 1 TSG Suppresses escaping of programmed cell deathb

IL21R Interleukin 21 receptor Fusion The ligand binding of this receptor leads to the acti-vation of multiple downstream signaling mole-cules including JAK1 JAK3 STAT1 and STAT32

MYOD1 Myogenic differentiation 1 Fusion Regulates muscle cell differentiation by inducing cellcycle arrest a prerequisite for myogenic initiationa

PHF6 PHD finger protein 6 TSG Encodes a protein with two PHD-type zinc fingerdomains indicating a potential role in transcrip-tional regulation that localizes to the nucleolusa

Orcadolphin BTG1 B-cell translocation gene 1antiproliferative

TSG fusion Member of an antiproliferative gene family that reg-ulates cell growth and differentiationa

CD274 CD274 molecule TSG fusion Plays a critical role in induction and maintenance ofimmune tolerance to selfa

FANCD2 Fanconi anemia comple-mentation group D2

TSG Suppresses genome instability and mutations pro-motes escaping programmed cell death suppressesproliferative signaling suppresses invasion andmetastasisb

FAS Fas cell surface deathreceptor

TSG Promotes cell replicative immortality promotesproliferative signaling promotes invasion and me-tastasis suppresses escaping programmed celldeathb

FGFR4 Fibroblast growth factor re-ceptor 4

Oncogene Promotes proliferative signaling promotes invasionand metastasisb

GPC3 Glypican 3 Oncogene TSG Promotes invasion and metastasis promotes sup-pression of growthb

HOXD11 Homeobox D11 Oncogene fusion The homeobox genes encode a highly conservedfamily of transcription factors that play an impor-tant role in morphogenesis in all multicellularorganismsa

HOXD13 Homeobox D13 Oncogene fusion

LASP1 LIM and SH3 protein 1 Fusion The encoded protein has been linked to metastaticbreast cancer hematopoetic tumors such as B-celllymphomas and colorectal cancera

MLF1 Myeloid leukemia factor 1 TSG fusion This gene encodes an oncoprotein which is thought toplay a role in the phenotypic determination ofhemopoetic cells Translocations between this geneand nucleophosmin have been associated withmyelodysplastic syndrome and acute myeloidleukemiaa

MYB v-myb myeloblastosis viraloncogene homolog

Oncogene fusion This gene may be aberrantly expressed or rearrangedor undergo translocation in leukemias and lym-phomas and is considered to be an oncogenea

MYD88 Myeloid differentiation pri-mary response gene (88)

Oncogene Promotes escaping programmed cell death promotesproliferative signaling promotes invasion and me-tastasis promotes tumor promotinginflammationb

NR4A3 Nuclear receptor subfamily4 group A member 3(NOR1)

Oncogene fusion Encodes a member of the steroidndashthyroid hormonendashretinoid receptor superfamily that may act as atranscriptional activatora

PALB2 Partner and localizer ofBRCA2

TSG This protein binds to and colocalizes with the breastcancer 2 early onset protein (BRCA2) in nuclear fociand likely permits the stable intranuclear localiza-tion and accumulation of BRCA2a

PML Promyelocytic leukemia TSG fusion Expression is cell-cycle related and it regulates the p53response to oncogenic signalsa

RAD21 RAD21 homolog(Schizosaccharomycespombe)

Oncogene TSG Promotes invasion and metastasis suppresses ge-nome instability and mutations suppresses escap-ing programmed cell deathb

STIL SCLTAL1 interruptinglocus

Oncogene fusion Encodes a cytoplasmic protein implicated in regula-tion of the mitotic spindle checkpoint a regulatorypathway that monitors chromosome segregationduring cell division to ensure the proper distribu-tion of chromosomes to daughter cellsa

(continued)

Tollis et al doi101093molbevmsz099 MBE

1754

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

femur (MP0000559 six genes FDRfrac14 0019) These pheno-types are reminiscent of the developmental basis of hind limbloss in cetaceans embryonic studies show that hind limbbuds are initially formed but disappear by the fifth gestationalweek (Thewissen et al 2006) Enriched mouse phenotypes arealso related to the unique cetacean axial skeleton(MP0002114 25 genes FDRfrac14 0016) most notably in theskull including craniofacial bones (MP0002116 17 genesFDRfrac14 0018) teeth (MP0002100 nine genes FDRfrac14 0004)and the presphenoid (MP0030383 three genesFDRfrac14 0003) Past analyses of the cetacean basicranial ele-ments revealed that the presphenoid was extensively modi-fied along the cetacean lineage (Ichishima 2016)

Positively selected genes unique to the humpback whalewere significantly enriched for a single biological process reg-ulation of cell cycle checkpoint (GO1901976 1857-fold en-richment Pfrac14 002 after Bonferonni correction for multipletesting) suggesting positive selection in pathways that con-trol responses to endogenous or exogenous sources of DNAdamage and limit cancer progression (Kastan and Bartek2004) We detected a significant number of proteinndashproteininteractions among humpback whale-specific positively se-lected genes (number of nodesfrac14 204 number of edgesfrac14 71expected number of edgesfrac14 51 Pfrac14 0004 supplementaryfig 10 Supplementary Material online) including genes thatare often coexpressed and involved in DNA repair DNA rep-lication and cell differentiation For instance we identifiedsignificant interactions between DNA2 which encodes a heli-case involved in the maintenance of DNA stability andWDHD1 which acts as a replication initiation factorAnother robust protein interaction network was detected

between a number of genes involved in the genesis and main-tenance of primary cilia The highest scoring functional an-notation clusters resulted in key words such as ciliopathy(seven genes) and cell projection (16 genes) and GO termssuch as cilium morphogenesis cilium assembly ciliary basalbody and centriole The primary cilia of multicellular eukar-yotes control cell proliferation by mediating cell-extrinsic sig-nals and regulating cell cycle entry and defects in ciliaryregulation are common in many cancers (Michaud andYoder 2006)

Our branch-site test results indicated that the evolution ofcetacean gigantism was accompanied by strong selection onmany pathways that are directly linked to cancer (fig 4C) Weidentified 33 genes that are mutated in human cancers(according to the COSMIC database) that were inferred assubjected to positive selection in the humpback whale line-age including the known tumor suppressor genes ATR whichis a protein kinase that senses DNA damage upon genotixicstress and activates cell cycle arrest and RECK which sup-presses metastasis (Forbes et al 2015) Multiple members ofthe PR domain-containing gene family (PRDM) evolved un-der positive selection across cetaceans including the tumorsuppressor genes PRDM1 whose truncation leads to B-cellmalignancies and PRDM2 which regulates the expressionand degradation of TP53 (Shadat et al 2010) and whoseforced expression causes apoptosis and cell cycle arrest incancer cell lines (Fog et al 2012) In baleen whales ERCC5which is a DNA repair protein that partners with BRCA1 andBRCA2 to maintain genomic stability (Trego et al 2016) andsuppresses UV-induced apoptosis (Clement et al 2006)appeared to have been subjected to positive selection as

Table 5 Continued

Comparison Gene Symbol Gene Name Role in Cancer Function

TAL1 T-cell acute lymphocyticleukemia 1 (SCL)

Oncogene fusion Implicated in the genesis of hemopoietic malignan-cies and may play an important role in hemopoieticdifferentiationa

TNFRSF14 Tumor necrosis factor re-ceptor superfamilymember 14 (herpesvirusentry mediator)

TSG The encoded protein functions in signal transductionpathways that activate inflammatory and inhibi-tory T-cell immune responsea

TNFRSF17 Tumor necrosis factor re-ceptor superfamilymember 17

Oncogene fusion This receptor also binds to various TRAF familymembers and thus may transduce signals for cellsurvival and proliferationa

NOTEmdashTSG tumor suppressor geneaSource RefSeqbSource Cancer Hallmark from CGC

Table 6 GO Terms for Biological Processes Overrepresented by Genes with Pairwise dNdS gt 1 in the Orca Bottlenose Dolphin Comparison

GO Term Description Number of Genes Fold Enrichment P-Valuea

GO0031347 Regulation of defense response 36 270 291E-2GO0050776 Regulation of immune response 50 218 195E-3GO0002694 Regulation of leukocyte activation 31 249 189E-2GO0002275 Myeloid cell activation involved in immune response 29 253 237E-2GO0002699 Positive regulation of immune effector process 19 446 101-E3GO0042108 Positive regulation of cytokine biosynthetic process 9 687 223E-2

aAfter Bonferonni correction for multiple testing

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1755

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

well Across all the branch-site models positively selectedgenes represented multiple functional categories relevant tocancer and Petorsquos Paradox

Among the cancer-related genes subjected to positiveselection in cetaceans we identified two with identicalamino acid changes among disparate taxa united by thetraits of large body size andor extreme longevitySpecifically PRDM13 is a tumor suppressor gene that actsas a transcriptional repressor and we found identical DEamino acid substitutions in this gene in sperm whale dol-phin orca and humpback but also manatee (Trichechusmanatus) and African elephant (Loxodonta africana) whichare large-bodied afrotherian mammals that have been thefocus of cancer suppression research (Abegglen et al 2015Sulak et al 2016) Secondly POLE is a cancer-related genethat participates in DNA repair and replication and weobserved one IV substitution shared among orca dol-phin bowhead humpback and common minke whalebut also elephant as well as a second IV substitutionshared with these cetaceans and the little brown bat(Myotis lucifugus) Vesper bats such as M lucifugus areknown for their exceptional longevity relative to theirbody size and have been proposed as model organisms insenescence and cancer research (Foley et al 2018) Parallelchanges in cancer-related genes across these

phylogenetically distinct mammals suggest natural selectionhas acted on similar pathways that limit neoplastic progres-sion in large and long-lived species (Tollis et al 2017)

Petorsquos Paradox and Cancer in Whales and Other LargeMammalsLarge body size has evolved numerous times in mammalsand although it is exemplified in some extant cetaceans gi-gantism is also found in afrotherians perissodactyls and car-nivores (Baker et al 2015) Our results suggest that cancersuppression in large and long-lived mammals has also evolvednumerous times However none of these species iscompletely immune to cancer Elephants have at least a 5lifetime risk of cancer mortality (Abegglen et al 2015) whichis far less than humans but detecting cancer and estimatingcancer incidence and mortality rates in wild cetaceans is morechallenging Mathematical modeling predicting the lifetimerisk of colorectal cancer in mice and humans yielded a rate ofcolorectal cancer at 50 in blue whales by age 50 and 100by age 90 (Caulin et al 2015) This high rate of cancer mor-tality is an unlikely scenario and taken with our genomicresults presented here it suggests that cetaceans have evolvedmechanisms to limit their overall risk of cancer Among ba-leen whales benign neoplasms of the skin tongue and centralnervous system have been reported in humpback whales and

Orca

Minke whale

Cow

Bowhead whale

Bottlenose dolphin

Humpback whale

Sperm whale54

12

84

74

212A

ATRT Suppression of growth escaping programmedcell death change of cellular energetics

Humpback whale

BCORL1TF Invasion and metastasis

PICALMT Change of cellular energetics

PRDM2T Regulation of TP53 expression and degredation

TPRF Involved in the activation of oncogenic kinases

AMER1T Promotes suppression of growth

CCNB1IP1T Suppresses invasion and metastasis

POLET Participates in DNA repair and DNA replication

Cetacea ancestor

KMT2CT Methylates rsquoLys-4rsquo of histone H3

POU2AF1 Promotes proliferative signalingO

PRDM1T Regulation of TP53 expression anddegradation

ERCC5T Excision repair DNA double-strand breakrepair following UV damage

Mysticeti ancestor

PMLT Positively regulates TP53 by inhibiting MDM2-dependent degredation

Odontoceti ancestor

MUC1F Overexpressed in epithelial tumors

Minke and humpback whale ancestor

RETO Plays a role in cell differentiation and growth

C COSMIC GENES

anatomicalstructure

formation involvedin morphogenesis

autophagy

cellproliferation

cellminuscellsignaling

circulatorysystemprocess

growth

pigmentation

proteinmaturation

ribonucleoproteincomplexassembly

symbiosisencompassing

mutualismthrough

parasitismvacuolartransport

B

20

FIG 4 Positively selected genes during cetacean evolution (A) Species tree relationships of six modern cetaceans with complete genomeassemblies estimated from 152 single-copy orthologs Branch lengths are given in coalescent units Outgroup taxa are not shown The completespecies tree of 28 mammals is shown in supplementary figure 4 Supplementary Material online Boxes with numbers indicate the number ofpositively selected genes passing filters and a Bonferroni correction detected on each branch (B) TreeMap from REVIGO for GO biologicalprocesses terms represented by genes evolving under positive selection across all cetaceans Rectangle size reflects semantic uniqueness of GOterm which measures the degree to which the term is an outlier when compared semantically to the whole list of GO terms (C) Cancer gene namesand functions from COSMIC found to be evolving under positive selection in the cetacean branch-site models Superscripts for gene namesindicate as follows T tumor suppressor gene O oncogene F fusion gene Asterisks indicate P-value following FDR correction for multiple testingPlt 001 Plt 0001 Plt 00001

Tollis et al doi101093molbevmsz099 MBE

1756

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

ovarian carcinomas and lymphomas have been detected infin whales (Newman and Smith 2006) Among smaller ceta-ceans one unusually well-documented case study concludedthat 27 of beluga whales (Delphinapterus leucas) found deadin the St Lawrence estuary had cancer which may have con-tributed to 18 of the total mortality in that population(Martineau et al 2002) The authors suggested that thehigh degree of polycyclic aromatic hydrocarbons releasedinto the estuary by nearby industry may have contributedto this elevated cancer risk (Martineau et al 2002) Bycontrast the larger baleen whale species in the Gulf ofSt Lawrence appear to have lower contaminant burdenslikely due to ecological differences (Gauthier et al 1997)Interestingly unlike in human cells homologous recom-bination is uninhibited in North Atlantic right whale lungcells following prolonged exposure to the human lungcarcinogen particulate hexavalent chromium (Browninget al 2017) suggesting adaptations for high-fidelity DNArepair in whales

In this study we provide a de novo reference assemblyfor the humpback whalemdashone of the more well-studiedgiants living on Earth today The humpback whale genomeassembly is highly contiguous and contains a comparablenumber of orthologous genes to other mammalian ge-nome projects Our comparisons with other complete ce-tacean genomes confirm the results of other studies whichconcluded that rorqual whales likely began diversifyingduring the Miocene (Slater et al 2017 Arnason et al2018) We found indications of positive selection onmany protein-coding genes suggestive of adaptive changein pathways controlling the mammalian appendicular andcranial skeletal elements which are relevant to highly spe-cialized cetacean phenotypes as well as in many immunitygenes and pathways that are known to place checks onneoplastic progression LSDs in cetacean genomes containmany genes involved in the control of apoptosis includingknown tumor suppressor genes and skin transcriptomeresults from humpback whale suggest many gene duplica-tions whether through segmental duplication or retro-transposition are transcribed and hence likely functionalWe also use genome-wide evidence to show that germlinemutation rates may be slower in cetaceans than in othermammals which has been suggested in previous studies(Jackson et al 2009) and we suggest as a corollary thatcetacean somatic mutations rates may be lower as wellThese results are consistent with predictions stemmingfrom Petorsquos Paradox (Peto et al 1975 Caulin and Maley2011) which posited that gigantic animals have evolvedcompensatory adaptations to cope with the negativeeffects of orders of magnitude more cells and long lifespansthat increase the number of cell divisions and cancer riskover time Altogether the humpback whale genome as-sembly will aid comparative oncology research that seeksto improve therapeutic targets for human cancers as wellas provide a resource for developing useful genomicmarkers that will aid in the population management andconservation of whales

Materials and Methods

Tissue Collection and DNA ExtractionBiopsy tissue was collected from an adult female humpbackwhale (ldquoSaltrdquo NCBI BioSample SAMN1058501) in the Gulf ofMaine western North Atlantic Ocean using previously de-scribed techniques (Lambertsen 1987 Palsboslashll et al 1991) andflash frozen in liquid nitrogen We extracted DNA from skinusing the protocol for high-molecular-weight genomic DNAisolation with the DNeasy Blood and Tissue purification kit(Qiagen) Humpback whales can be individually identifiedand studied over time based on their unique ventral flukepigmentation (Katona and Whitehead 1981) Salt was specif-ically selected for this study because of her 35-year priorsighting history which is among the lengthiest and detailedfor an individual humpback whale (Center for Coastal Studiesunpublished data)

De Novo Assembly of the Humpback Whale GenomeUsing a combination of paired-end and mate-pair libraries denovo assembly was performed using Meraculous 204(Chapman et al 2011) with a kmer size of 47 Reads weretrimmed for quality sequencing adapters and mate-pairadapters using Trimmomatic (Bolger et al 2014) The genomesize of the humpback whale was estimated using the shortreads by counting the frequency of kmers of length 27 oc-curring in the 180-bp data set estimating the kmer coverageand using the following formula genome size frac14 total kmers kmer coverage

Chicago Library Preparation and SequencingFour Chicago libraries were prepared as described previously(Putnam et al 2016) Briefly for each library500 ng of high-molecular-weight genomic DNA (mean fragment lengthgt50kb) was reconstituted into chromatin in vitro and fixed withformaldehyde Fixed chromatin was digested with MboI orDpnII the 50-overhangs were repaired with biotinylatednucleotides and blunt ends were ligated After ligation cross-links were reversed and the DNA purified from protein Biotinthat was not internal to ligated fragments was removed fromthe purified DNA The DNA was then sheared to 350 bpmean fragment size and sequencing libraries were generatedusing NEBNext Ultra (New England BioLabs) enzymes andIllumina-compatible adapters Biotin-containing fragmentswere isolated using streptavidin beads before PCR enrichmentof each library The libraries were sequenced on an IlluminaHiSeq 2500 platform

Scaffolding the De Novo Assembly with HiRiseThe input de novo assembly shotgun reads and Chicagolibrary reads were used as input data for HiRise a softwarepipeline designed specifically for using Chicago data to scaf-fold genome assemblies (Putnam et al 2016) Shotgun andChicago library sequences were aligned to the draft inputassembly using a modified SNAP read mapper (httpsnapcsberkeleyedu) The separations of Chicago read pairsmapped within draft scaffolds were analyzed by HiRise toproduce a likelihood model for genomic distance between

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1757

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

read pairs and the model was used to identify putative mis-joins and to score prospective joins After scaffolding shotgunsequences were used to close gaps between contigs

Assessing the Gene Content of the Humpback WhaleAssemblyThe expected gene content of the assembly was evaluatedusing the Core Eukaryotic Genes Mapping Approach (Parraet al 2009) which searches the assembly for 458 highly con-served proteins and reports the proportion of 248 of the mosthighly conserved orthologs that are present in the assemblyWe also used the Benchmarking Universal Single CopyOrthologs (BUSCO v201 Sim~ao et al 2015) which analyzesgenome assemblies for the presence of 3023 genes conservedacross vertebrates as well as a set of 6253 genes conservedacross laurasiatherian mammals

Transcriptome Sequencing and AssemblyIn order to aid in our gene-finding efforts for the humpbackwhale genome assembly and to measure gene expression wegenerated transcripts from skin tissue by extracting total RNAusing the QIAzol Lysis Reagent (Qiagen) followed by purifi-cation on RNeasy spin columns (Qiagen) RNA integrity andquantity were determined on the Agilent 2100 Bioanalyzer(Agilent) using the manufacturerrsquos protocol The total RNAwas treated with DNase using DNase mix from the RecoverAllTotal Nucleic Acid Isolation kit (Applied BiosystemsAmbion) The RNA library was prepared and sequenced bythe Genome Technology Center at the University ofCalifornia Santa Cruz including cDNA synthesis with theOvation RNA-Seq system V2 (Nugen) and RNA amplificationas described previously (Tariq et al 2011) We used 05ndash1mgof double-stranded cDNA for library preparation shearedusing the Covaris S2 size-selected for 350ndash450 bp using au-tomated electrophoretic DNA fractionation system(LabChipXT Caliper Life Sciences) Paired-end sequencing li-braries were constructed using Illumina TruSeq DNA SamplePreparation Kit Following library construction samples werequantified using the Bioanalyzer and sequenced on theIllumina HiSeq 2000 platform to produce 2 100 bp se-quencing reads We then used Trinity (Grabherr et al 2011)to assemble the adapter-trimmed RNA-Seq reads intotranscripts

Genome AnnotationWe generated gene models for the humpback whale usingmultiple iterations of MAKER2 (Holt and Yandell 2011) whichincorporated 1) direct evidence from the Trinity-assembledtranscripts 2) homology to NCBI proteins from ten mammals(human mouse dog cow sperm whale bottlenose dolphinorca bowhead whale common minke whale and baiji) andUniProtKBSwiss-Prot (UniProt Consortium 2015) and 3) abinitio gene predictions using SNAP (11292013 release Korf2004) and Augustus v302 (Stanke et al 2008) A detaileddescription of the annotation pipeline is provided in the sup-plementary Methods Supplementary Material online Finalgene calls were annotated functionally by BlastP similarity to

UniProt proteins (UniProt Consortium 2015) with an e-valuecutoff of 1e-6

Repeat Annotation and Evolutionary AnalysisTo analyze the repetitive landscape of the humpback whalegenome we used both database and de novo modelingmethods For the database method we ran RepeatMaskerv405 (httpwwwrepeatmaskerorg accessed August 212017) (Smit et al 2015a) on the final assembly indicatingthe ldquomammaliardquo repeat library from RepBase (Jurka et al2005) For the de novo method we scanned the assemblyfor repeats using RepeatModeler v108 (httpwwwrepeat-maskerorg) (Smit et al 2015b) the results of which were thenclassified using RepeatMasker To estimate evolutionary di-vergence within repeat subfamilies in the humpback whalegenome we generated repeat-family-specific alignments andcalculated the average Kimura-2-parameter divergence fromconsensus within each family correcting for high mutationrates at CpG sites with the calcDivergenceFromAlignplRepeatMasker tool We compared the divergence profile ofhumpback whale and bowhead whale by completing parallelanalyses and the repetitive landscapes of orca and bottlenosedolphin are available from the RepeatMasker server (httpwwwrepeatmaskerorgspecies accessed August 21 2017)

Analysis of Gene Expression Using RNA-SeqSplice-wise mapping of RNA-Seq reads against the humpbackwhale genome assembly and annotation was carried out us-ing STAR v24 (Dobin et al 2013) and we counted the num-ber of reads mapping to gene annotations We also mappedthe skin RNA-Seq data to the database of annotated hump-back whale transcripts using local alignments with bowtiev225 (Langmead and Salzberg 2012) and used stringtiev134 (Pertea et al 2015) to calculate gene abundances bytranscripts per million

Analysis of Segmental Duplications in CetaceanGenomesIn order to detect LSDs in several cetacean genomes we ap-plied an approach based on depth of coverage (Alkan et al2009) To this end we used whole-genome shotgun sequencedata from the current study as well as from other cetaceangenomics projects All data were mapped against the hump-back whale reference assembly A detailed description of thesegmental duplication analysis is provided in the supplemen-tary Methods Supplementary Material online

Whole-Genome AlignmentsWe generated WGAs of 12 mammals (supplementary table 2Supplementary Material online) First we generated pairwisesyntenic alignments of each species as a query to the humangenome (hg19) as a target using LASTZ v102 (Harris 2007)followed by chaining to form gapless blocks and netting torank the highest scoring chains (Kent et al 2003) The pairwisealignments were used to construct a multiple sequence align-ment with MULTIZ v112 (Blanchette et al 2004) with humanas the reference species We filtered the MULTIZ alignment to

Tollis et al doi101093molbevmsz099 MBE

1758

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

only contain aligned blocks from at least 10 out of the 12species (81 complete)

Phylogenetic Reconstruction Using Single-CopyOrthologsWe downloaded the coding DNA sequences from 28 publiclyavailable mammalian genome assemblies (supplementary ta-ble 9 Supplementary Material online) and used VESPA(Webb et al 2017) to obtain high-confidence SGOs (supple-mentary Methods Supplementary Material online) For phy-logenetic analysis we filtered the SGO data set to include onlyloci that were represented by at least 24 out of the 28 mam-malian species (86 complete) and reconstructed each genetree using maximum likelihood in PhyML v30 (Guindon et al2010) with an HKY85 substitution model and 100 bootstrapreplicates to assess branch support The gene trees were thenbinned and used to reconstruct a species tree using the ac-curate species tree algorithm (ASTRAL-III v56 Zhang et al2018) ASTRAL utilizes the multispecies coalescent modelthat incorporates incomplete lineage sorting and finds thespecies tree stemming from bipartitions predefined by thegene trees Branch support for the species tree was assessedwith local posterior probabilities and branch lengths werepresented in coalescent units where shorter branch lengthsindicate greater gene tree discordance (Sayyari and Mirarab2016)

Rates of Molecular Evolution and Divergence TimeEstimationWe used multiple approaches on independent data sets toestimate rates of molecular evolution and the divergencetimes of the major mammalian lineages including six modernwhales with complete genome assemblies We first focusedon 4-fold degenerate (4D) sites which are positions withincodon alignments where substitutions result in no aminoacid change and can be used to approximate the neutralrate of evolution (Kumar and Subramanian 2002) We usedthe Ensembl human gene annotation to extract codingregions from the 12-mammal WGA using msa_view inPHAST v14 (Hubisz et al 2011) We reconstructed the phy-logeny with the 4D data as a single partition in RAxML v83(Stamatakis 2014) under the GTRGAMMA substitutionmodel and assessed branch support with 10000 bootstrapsRates of molecular evolution were estimated on the 4D dataset with the semiparametric PL method implemented in r8sv18 (Sanderson 2002 2003) A detailed description of the PLmethod is given in the supplementary MethodsSupplementary Material online

We also used the approximate likelihood calculation inMCMCtree (Yang and Rannala 2006) to estimate divergencetimes using independent data sets 1) the above-mentioned4D data set derived from the WGA as well as 2) a set of theSGOs that included 24 out of 28 sampled taxa (86) and waspartitioned into three codon positions We implemented theHKY85 substitution model multiple fossil-based priors (sup-plementary table 10 Supplementary Material online Mitchell1989 Benton et al 2015 Hedges et al 2015) and independentrates (ldquoclockfrac14 3rdquo) along branches All other parameters were

set as defaults For each MCMCtree analysis we ran the anal-ysis three times with different starting seeds and modified theMarkov chain Monte Carlo (MCMC) length and samplingfrequency in order to achieve proper chain convergencemonitored with Tracer v17 We achieved proper MCMCconvergence on the 4D data set after discarding the first500000 steps as burn-in and sampling every 2000 steps untilwe collected 20000 samples We achieved proper MCMCconvergence on the SGO data set after discarding the first500000 steps as burn-in and sampling every 10000 steps untilwe collected 10000 samples

Demographic AnalysisWe used the PSMC (Li and Durbin 2011) to reconstruct thepopulation history of North Atlantic humpback whales in-cluding the individual sequenced in the current study (down-sampled to 20 coverage) and a second individualsequenced at 17 coverage in Arnason et al (2018) Adetailed description of the PSMC analysis is provided in thesupplementary Methods Supplementary Material online

Nonneutral Substitution Rates in Cetacean GenomesIn order to identify genomic regions controlling cetacean-specific adaptations we used phyloP (Pollard et al 2010) todetect loci in the 12-mammal WGA that depart from neutralexpectations (see supplementary Methods SupplementaryMaterial online) We then collected accelerated regions thatoverlapped human whole gene annotations (hg19) using bed-tools intersect (Quinlan and Hall 2010) and tested for theenrichment of GO terms using the PANTHER analysis toolavailable at the Gene Ontology Consortium website (GOOntology database last accessed June 2017) (GeneOntology Consortium 2015)

Detection of Protein-Coding Genes Subjected toPositive SelectionIn order to measure selective pressures acting on protein-coding genes during cetacean evolution with an emphasison the evolution of cancer suppression we estimated theratio of nonsynonymous to synonymous substitutions (dNdS) To maximize statistical power in pairwise comparisonsgiven the number of available cetacean genomes (six lastaccessed September 2017) we implemented phylogenetictargeting (Arnold and Nunn 2010) assuming a phylogenyfrom a mammalian supertree (Fritz et al 2009) To selectgenome assemblies most suitable for assessing PetorsquosParadox we weighted scores for contrasts with a lot of changein the same direction for both body mass and maximumlongevity Trait values were taken from panTHERIA (Joneset al 2009) and we selected maximal pairings based on thestandardized summed scores We then generated pairwisegenome alignments as described above based on the phylo-genetic targeting results For each pairwise genome align-ment we stitched gene blocks in Galaxy (Blankenberg et al2011) according to the target genome annotations produc-ing alignments of one-to-one orthologs which were filteredto delete frameshift mutations and replace internal stopcodons with gaps We then estimated pairwise dNdS for every

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1759

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

orthologous gene pair with KaKs_Calculator v20 (Wang et al2010) To link genes with dNdS gt1 to potential phenotypeswe used orthologous human Ensembl gene IDs to collect GOterms in BioMart (Kinsella et al 2011) and tested for enrich-ment of overrepresented GO terms

We also used codon-based models to test for selectivepressure variation along branches of the cetacean phylog-eny in comparison to other mammal lineages also knownas the branch-site test (Yang 2007) First the known spe-cies phylogeny (Morgan et al 2013 Tarver et al 2016) waspruned to correspond to the species present in each SGOfamily SGO nucleotide alignments that contained morethan seven species were analyzed for selective pressurevariation This is to reduce the risk of detecting falsepositives (Anisimova et al 2001 2002) In general thebranch-site test is a powerful yet conservative approach(Gharib and Robinson-Rechavi 2013) although modelmisspecification and alignment errors can greatly increasethe number of false positives (Anisimova et al 20012002) Recent studies have concluded that many pub-lished inferences of adaptive evolution using the branchsite test may be affected by artifacts (Venkat et al 2018)Therefore extensive filtering is necessary in order to makereasonably sound conclusions from results of the branchsite test A detailed description of all tested models andthe filtering process are given in the supplementaryMethods Supplementary Material online In total 1152gene families were analyzed We carried out the branch-site test using PAML v44e (Yang 2007) The following fivebranches were assessed as foreground humpback whalethe most recent common ancestor (MRCA) of the com-mon minke and humpback whales MRCA of baleenwhales MRCA of toothed whales and the MRCA of allwhales (cetacean stem lineage) For each model we keptall genes that met a significance threshold of Plt 005 aftera Bonferonni correction for multiple hypothesis testingusing the total number of branch genes (five foregroundbranches1152 genes) We also corrected the raw P-val-ues from the likelihood ratio tests of every gene by theFDR correction where qfrac14 005 (Benjamini and Hochberg1995) The Bonferroni correction is more conservativethan the FDR but sufficient for multiple hypothesis test-ing of lineage-specific positive selection in these casesthe FDR results in higher probabilities of rejecting truenull hypotheses (Anisimova and Yang 2007) Thereforewe used Bonferroni-corrected results in downstreamanalyses but also report FDR-corrected P-values infigure 4C Genes were identified based on the humanortholog (Ensembl gene ID) and we performed gene an-notation enrichment analysis and functional annotationclustering with DAVID v68 (Huang et al 2009a 2009b) aswell as semantic clustering of GO terms using REVIGO(Supek et al 2011) We also searched for interactions ofpositively selected proteins using STRING v105(Szklarczyk et al 2017) with default parameters andtested for the enrichment of overrepresented GO termsas above and for associated mouse phenotypes usingmodPhEA (Weng and Liao 2017)

Data AvailabilityAll data that contributed to the results of the study are madepublicly available The genomic sequencing RNA sequencingas well as the genome assembly for humpback whale(GCA_0043293851) are available under NCBI BioProjectPRJNA509641 The gene annotation orthologous gene setspositive selection results segmental duplication annotationsand whole-genome alignments used in this study are availableat the Harvard Dataverse (httpsdoiorg107910DVNADHX1O)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsField data collection was supported by the Center for CoastalStudies and conducted under NOAA research permit 16325This work was initiated with funding from Dr Jeffrey Pearl atUC San Fransisco This work was supported in part by NIH(Grants U54 CA217376 U2C CA233254 P01 CA91955 R01CA149566 R01 CA170595 R01 CA185138 and R01CA140657) as well as CDMRP Breast Cancer ResearchProgram (Award BC132057) and the Arizona BiomedicalResearch Commission (Grant ADHS18-198847) to CCMThe findings opinions and recommendations expressedhere are those of the authors and not necessarily those ofthe universities where the research was performed or theNational Institutes of Health PJP would like to acknowledgefunding from Stockholm University and the University ofGroningen Startup funds were provided by the School ofInformatics Computing and Cyber Systems at NorthernArizona University (MT) MJOrsquoC would like to thank theUniversity of Leeds funding her 250 Great Minds UniversityAcademic Fellowship The authors thank Wensi Hao(University of Groningen) for technical support AndreaCabrera (University of Groningen) for helpful conversationsabout mutation rates and demography and Richard E Green(UC Santa Cruz) for consultations regarding genome se-quencing assembly and annotation The authors acknowl-edge Research Computing at Arizona State University (httpwwwresearchcomputingasuedu) the Advanced ResearchComputing resources at the University of Leeds and theMonsoon computing cluster at Northern ArizonaUniversity (httpsnaueduhigh-performance-computing)for providing high-performance computing and storageresources that have contributed to this study

ReferencesAbegglen LM Caulin AF Chan A Lee K Robinson R Campbell MS Kiso

WK Schmitt DL Waddell PJ Bhaskara S 2015 Potential mechanismsfor cancer resistance in elephants and comparative cellular responseto DNA damage in humans JAMA 314(17)1850ndash1860

Alkan C Kidd JM Marques-Bonet T Aksay G Antonacci F HormozdiariF Kitzman JO Baker C Malig M Mutlu O 2009 Personalized copynumber and segmental duplication maps using next-generation se-quencing Nat Genet 41(10)1061ndash1067

Tollis et al doi101093molbevmsz099 MBE

1760

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Anisimova M Bielawski JP Yang Z 2001 Accuracy and power of thelikelihood ratio test in detecting adaptive molecular evolution MolBiol Evol 18(8)1585ndash1592

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19(6)950ndash958

Anisimova M Yang Z 2007 Multiple hypothesis testing to detect line-ages under positive selection that affects only a few sites Mol BiolEvol 24(5)1219ndash1228

Arnason U Lammers F Kumar V Nilsson MA Janke A 2018 Whole-genome sequencing of the blue whale and other rorquals findssignatures for introgressive gene flow Sci Adv 4(4)eaap9873

Arnason U Widegren B 1989 Composition and chromosomal localiza-tion of cetacean highly repetitive DNA with special reference to theblue whale Balaenoptera musculus Chromosoma 98(5)323ndash329

Arnold C Nunn CL 2010 Phylogenetic targeting of research effort inevolutionary biology Am Nat 176(5)601ndash612

Baker CS Perry A Bannister JL Weinrich MT Abernethy RBCalambokidis J Lien J Lambertsen RH Ramirez JU Vasquez O1993 Abundant mitochondrial DNA variation and world-wide pop-ulation structure in humpback whales Proc Natl Acad Sci U S A90(17)8239ndash8243

Baker J Meade A Pagel M Venditti C 2015 Adaptive evolution towardlarger size in mammals Proc Natl Acad Sci U S A 112(16)5093ndash5098

Benjamini Y Hochberg Y 1995 Controlling the false discovery rate apractical and powerful approach to multiple testing J R Stat Soc57289ndash300

Benson RBJ Campione NE Carrano MT Mannion PD Sullivan CUpchurch P Evans DC 2014 Rates of dinosaur body mass evolutionindicate 170 million years of sustained ecological innovation on theavian stem lineage PLoS Biol 12(5)e1001853

Benton MJ Donoghue PCJ Asher RJ Friedman M Near TJ Vinther J2015 Constraints on the timescale of animal evolutionary historyPalaeontol Electron 181ndash106

Berta A Sumich JL Kovacs KM 2015 Marine mammals evolutionarybiology 3rd ed Cambridge (MA) Academic Press

Blanchette M Kent WJ Riemer C Elnitski L Smit AFA Roskin KMBaertsch R Rosenbloom K Clawson H Green ED 2004 Aligningmultiple genomic sequences with the threaded blockset alignerGenome Res 14(4)708ndash715

Blankenberg D Taylor J Nekrutenko A Galaxy T 2011 Making wholegenome multiple alignments usable for biologists Bioinformatics27(17)2426ndash2428

Boissinot S Sookdeo A 2016 The evolution of LINE-1 in vertebratesGenome Biol Evol 8(12)3485ndash3507

Bolger AM Lohse M Usadel B 2014 Trimmomatic a flexible trimmer forIllumina sequence data Bioinformatics 30(15)2114ndash2120

Browning CL Wise CF Wise JP 2017 Prolonged particulate chromateexposure does not inhibit homologous recombination repair inNorth Atlantic right whale (Eubalaena glacialis) lung cells ToxicolAppl Pharmacol 33118ndash23

Carbone L Harris RA Gnerre S Veeramah KR Lorente-Galdos BHuddleston J Meyer TJ Herrero J Roos C Aken B et al 2014Gibbon genome and the fast karyotype evolution of small apesNature 513(7517)195ndash201

Caulin AF Graham TA Wang L-S Maley CC 2015 Solutions to Petorsquosparadox revealed by mathematical modelling and cross-species can-cer gene analysis Philos Trans R Soc B 370(1673)20140222

Caulin AF Maley CC 2011 Petorsquos Paradox evolutionrsquos prescription forcancer prevention Trends Ecol Evol (Amst) 26(4)175ndash182

Chapman JA Ho I Sunkara S Luo S Schroth GP Rokhsar DS 2011Meraculous de novo genome assembly with short paired-end readsPLoS One 6(8)e23501

Chittleborough RG 1959 Determination of age in the humpback whaleMegaptera nodosa (Bonnaterre) Aust J Mar Freshw Res10(2)125ndash143

Clapham PJ Mead JG 1999 Megaptera novaeangliae Mamm Species6041ndash9

Clement V Dunand-Sauthier I Clarkson SG 2006 Suppression of UV-induced apoptosis by the human DNA repair protein XPG CellDeath Differ 13(3)478ndash488

Dobin A Davis CA Schlesinger F Drenkow J Zaleski C Jha S Batut PChaisson M Gingeras TR 2013 STAR ultrafast universal RNA-Seqaligner Bioinformatics 29(1)15ndash21

Fay JC Wu C-I 2003 Sequence divergence functional constraint andselection in protein evolution Annu Rev Genomics Hum Genet4213ndash235

Ferris E Abegglen LM Schiffman JD Gregg C 2018 Accelerated evolu-tion in distinctive species reveals candidate elements for clinicallyrelevant traits including mutation and cancer resistance Cell Rep22(10)2742ndash2755

Fog CK Galli GG Lund AH 2012 PRDM proteins important players indifferentiation and disease Bioessays 34(1)50ndash60

Foley NM Hughes GM Huang Z Clarke M Jebb D Whelan CV Petit EJTouzalin F Farcy O Jones G et al 2018 Growing old yet stayingyoung the role of telomeres in batsrsquo exceptional longevity Sci Adv4(2)eaao0926

Foote AD Liu Y Thomas GWC Vinar T Alfoldi J Deng J Dugan Svan Elk CE Hunter ME Joshi V et al 2015 Convergent evolu-tion of the genomes of marine mammals Nat Genet47(3)272ndash275

Forbes SA Beare D Gunasekaran P Leung K Bindal N Boutselakis HDing M Bamford S Cole C Ward S et al 2015 COSMIC exploringthe worldrsquos knowledge of somatic mutations in human cancerNucleic Acids Res 43(D1)D805ndashD811

Fritz SA Bininda-Emonds ORP Purvis A 2009 Geographical variation inpredictors of mammalian extinction risk big is bad but only in thetropics Ecol Lett 12(6)538ndash549

Futreal PA Coin L Marshall M Down T Hubbard T Wooster R RahmanN Stratton MR 2004 A census of human cancer genes Nat RevCancer 4(3)177ndash183

Gabriele C Lockyer C Straley JM Jurasz CM Kato H 2010 Sightinghistory of a naturally marked humpback whale (Megaptera novaean-gliae) suggests ear plug growth layer groups are deposited annuallyMar Mam Sci 26(2)443ndash450

Gauthier JM Metcalfe CD Sears R 1997 Chlorinated organic contam-inants in blubber biopsies from northwestern Atlantic balaenopteridwhales summering in the Gulf of St Lawrence Mar Environ Res44(2)201ndash223

Gene Ontology Consortium 2015 Gene Ontology Consortium goingforward Nucleic Acids Res 43D1049ndashD1056

George JC Bada J Zeh J Scott L Brown SE OrsquoHara T Suydam R 1999Age and growth estimates of bowhead whales (Balaena mysticetus)via aspartic acid racemization Can J Zool 77(4)571ndash580

Gharib WH Robinson-Rechavi M 2013 The branch-site test of positiveselection is surprisingly robust but lacks power under synonymoussubstitution saturation and variation in GC Mol Biol Evol30(7)1675ndash1686

Grabherr MG Haas BJ Yassour M Levin JZ Thompson DA Amit IAdiconis X Fan L Raychowdhury R Zeng Q et al 2011 Full-lengthtranscriptome assembly from RNA-Seq data without a referencegenome Nat Biotechnol 29(7)644ndash652

Guindon S Dufayard J-F Lefort V Anisimova M Hordijk W Gascuel O2010 New algorithms and methods to estimate maximum-likelihood phylogenies assessing the performance of PhyML 30Syst Biol 59(3)307ndash321

Harris RS 2007 Improved pairwise alignment of genomic DNA [PhDthesis] The Pennsylvania State University State College PA USA

Hedges SB Marin J Suleski M Paymer M Kumar S 2015 Tree of lifereveals clock-like speciation and diversification Mol Biol Evol32(4)835ndash845

Heim NA Knope ML Schaal EK Wang SC Payne JL 2015 Copersquos rule inthe evolution of marine animals Science 347(6224)867ndash870

Holt C Yandell M 2011 MAKER2 an annotation pipeline and genome-database management tool for second-generation genome projectsBMC Bioinformatics 12491

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1761

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Huang DW Sherman BT Lempicki RA 2009a Bioinformatics enrich-ment tools paths toward the comprehensive functional analysis oflarge gene lists Nucleic Acids Res 37(1)1ndash13

Huang DW Sherman BT Lempicki RA 2009b Systematic and integra-tive analysis of large gene lists using DAVID bioinformatics resourcesNat Protoc 4(1)44ndash57

Hubisz MJ Pollard KS Siepel A 2011 PHAST and RPHAST phylogeneticanalysis with spacetime models Brief Bioinform 12(1)41ndash51

Hudson RR 1990 Gene genealogies and the coalescent process InFutuyama DJ Antonovics J editors Oxfords surveys in evolutionarybiology Oxford Oxford University Press p 1ndash44

Ichishima H 2016 The ethmoid and presphenoid of cetaceans JMorphol 277(12)1661ndash1674

Jackson JA Baker CS Vant M Steel DJ Medrano-Gonzalez LPalumbi SR 2009 Big and slow phylogenetic estimates of mo-lecular evolution in baleen whales (suborder Mysticeti) MolBiol Evol 26(11)2427ndash2440

Jackson JA Steel DJ Beerli P Congdon BC Olavarria C Leslie MS PomillaC Rosenbaum H Baker CS 2014 Global diversity and oceanic di-vergence of humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 281(1786)20133222

Jones KE Bielby J Cardillo M Fritz SA OrsquoDell J Orme CDL Safi KSechrest W Boakes EH Carbone C et al 2009 PanTHERIA a spe-cies-level database of life history ecology and geography of extantand recently extinct mammals Ecology 90(9)2648ndash2648

Jurka J Kapitonov VV Pavlicek A Klonowski P Kohany O WalichiewiczJ 2005 Repbase Update a database of eukaryotic repetitive ele-ments Cytogenet Genome Res 110(1ndash4)462ndash467

Kaessmann H 2010 Origins evolution and phenotypic impact of newgenes Genome Res 20(10)1313ndash1326

Kastan MB Bartek J 2004 Cell-cycle checkpoints and cancer Nature432(7015)316ndash323

Katona SK Whitehead HP 1981 Identifying humpback whales usingtheir natural markings Polar Rec 20(128)439ndash444

Keane M Semeiks J Webb AE Li YI Quesada V Craig T Madsen LB vanDam S Brawand D Marques PI et al 2015 Insights into the evolu-tion of longevity from the bowhead whale genome Cell Rep10(1)112ndash122

Kent WJ Baertsch R Hinrichs A Miller W Haussler D 2003 Evolutionrsquoscauldron duplication deletion and rearrangement in the mouseand human genomes Proc Natl Acad Sci U S A100(20)11484ndash11489

Kingsolver JG Pfennig DW 2004 Individual-level selection as acause of Copersquos rule of phyletic size increase Evolution58(7)1608ndash1612

Kinsella RJ Keuroaheuroari A Haider S Zamora J Proctor G Spudich G Almeida-King J Staines D Derwent P Kerhornou A et al 2011 EnsemblBioMarts a hub for data retrieval across taxonomic spaceDatabase (Oxford) 2011bar030

Korf I 2004 Gene finding in novel genomes BMC Bioinformatics 559Kumar S Subramanian S 2002 Mutation rates in mammalian genomes

Proc Natl Acad Sci U S A 99(2)803ndash808Lambertsen RH 1987 A biopsy system for large whales and its use for

cytogenetics J Mammal 68(2)443ndash445Langmead B Salzberg SL 2012 Fast gapped-read alignment with Bowtie

2 Nat Methods 9(4)357ndash359Larsen AH Sigurjonsson J Oslashien N Vikingsson G Palsboslashll PJ 1996

Populations genetic analysis of nuclear and mitochondrial loci inskin biopsies collected from central and northeastern NorthAtlantic humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 2631611ndash1618

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357)493ndash496

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup 2009 The Sequence AlignmentMap format andSAMtools Bioinformatics 25(16)2078ndash2079

Martineau D Lemberger K Dallaire A Labelle P Lipscomb TP Michel PMikaelian I 2002 Cancer in wildlife a case study beluga from the St

Lawrence estuary Quebec Canada Environ Health Perspect110(3)285ndash292

Mazet O Rodrıguez W Grusea S Boitard S Chikhi L 2016 On theimportance of being structured instantaneous coalescence ratesand human evolution ndash lesson for ancestral population size infer-ence Heredity 116(4)362ndash371

Michaud EJ Yoder BK 2006 The primary cilium in cell signaling andcancer Cancer Res 66(13)6463ndash6467

Milholland B Dong X Zhang L Hao X Suh Y Vijg J 2017 Differencesbetween germline and somatic mutation rates in humans and miceNat Commun 815183

Mitchell ED 1989 A new cetacean from the Late Eocene La MesetaFormation Seymour Island Antarctic Peninsula Can J Fish Aquat Sci46(12)2219ndash2235

Morgan CC Foster PG Webb AE Pisani D McInerney JO OrsquoConnell MJ2013 Heterogeneous models place the root of the placental mam-mal phylogeny Mol Biol Evol 30(9)2145ndash2156

Nadachowska-Brzyska K Burri R Smeds L Ellegren H 2016 PSMC anal-ysis of effective population sizes in molecular ecology and its appli-cation to black-and-white Ficedula flycatchers Mol Ecol25(5)1058ndash1072

Newman SJ Smith SA 2006 Marine mammal neoplasia a review VetPathol 43(6)865ndash880

Nunney L 2018 Size matters height cell number and a personrsquos risk ofcancer Proc R Soc B 285(1889)20181743

Ohsumi S 1979 Interspecies relationships among some biologicalparameters in cetaceans and estimation of the natural mortalitycoefficient of the Southern Hemisphere minke whales Rep IntWhaling Comm 29397ndash406

OrsquoLeary MA Gatesy J 2008 Impact of increased character sampling onthe phylogeny of Cetartiodactyla (Mammalia) combined analysisincluding fossils Cladistics 24397ndash442

Pagel M Venditti C Meade A 2006 Large punctuational contribution ofspeciation to evolutionary divergence at the molecular level Science314(5796)119ndash121

Palsboslashll PJ Clapham PJ Mattila DK Larsen F Sears R Siegismund HRSigurjonsson J Vasquez O Arctander P 1995 Distribution ofmtDNA haplotypes in North Atlantic humpback whales the influ-ence of behavior on population structure Mar Ecol Prog Ser1161ndash10

Palsboslashll PJ Larsen F Hansen ES 1991 Sampling of skin biopsies fromfree-raging large cetaceans in West Greenland development of newbiopsy tips and bolt designs Int Whaling Comm Spec Issue Ser13311

Palsboslashll PJ Zachariah Peery M Olsen MT Beissinger SR Berube M 2013Inferring recent historic abundance from current genetic diversityMol Ecol 22(1)22ndash40

Parra G Bradnam K Ning Z Keane T Korf I 2009 Assessing the genespace in draft genomes Nucleic Acids Res 37(1)289ndash297

Pertea M Pertea GM Antonescu CM Chang T-C Mendell JT SalzbergSL 2015 StringTie enables improved reconstruction of a transcrip-tome from RNA-Seq reads Nat Biotechnol 33(3)290ndash295

Peto R Roe FJ Lee PN Levy L Clack J 1975 Cancer and ageing in miceand men Br J Cancer 32(4)411ndash426

Pio R Corrales L Lambris JD 2014 The role of complement in tumorgrowth Adv Exp Med Biol 772229ndash262

Pollard KS Hubisz MJ Rosenbloom KR Siepel A 2010 Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 20(1)110ndash121

Putnam NH OrsquoConnell BL Stites JC Rice BJ Blanchette M Calef R TrollCJ Fields A Hartley PD Sugnet CW et al 2016 Chromosome-scaleshotgun assembly using an in vitro method for long-range linkageGenome Res 26(3)342ndash350

Quinlan AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26(6)841ndash842

Ruegg K Rosenbaum HC Anderson EC Engel M Rothschild A Baker CSPalumbi SR 2013 Long-term population size of the North Atlantichumpback whale within the context of worldwide population struc-ture Conserv Genet 14(1)103ndash114

Tollis et al doi101093molbevmsz099 MBE

1762

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Sanderson MJ 2002 Estimating absolute rates of molecular evolutionand divergence times a penalized likelihood approach Mol Biol Evol19(1)101ndash109

Sanderson MJ 2003 r8s inferring absolute rates of molecular evolutionand divergence times in the absence of a molecular clockBioinformatics 19(2)301ndash302

Sayyari E Mirarab S 2016 Fast coalescent-based computation of localbranch support from quartet frequencies Mol Biol Evol33(7)1654ndash1668

Shadat NMA Koide N Khuda II-E Dagvadorj J Tumurkhuu G Naiki YKomatsu T Yoshida T Yokochi T 2010 Retinoblastoma protein-interacting zinc finger 1 (RIZ1) regulates the proliferation of mono-cytic leukemia cells via activation of p53 Cancer Invest28(8)806ndash812

Sim~ao FA Waterhouse RM Ioannidis P Kriventseva EV Zdobnov EM2015 BUSCO assessing genome assembly and annotation com-pleteness with single-copy orthologs Bioinformatics31btv351ndashbtv3212

Slater GJ Goldbogen JA Pyenson ND 2017 Independent evolution ofbaleen whale gigantism linked to Plio-Pleistocene ocean dynamicsProc R Soc B 284(1855)20170546

Smit AFA Hubley RM Green P 2015a RepeatMasker Open-40 2013-2015 Available from httpwwwrepeatmaskerorg

Smit AFA Hubley RM Green P 2015b RepeatModeler Open-10 2008-2015 Available from httpwwwrepeatmaskerorg

Smith TD Allen J Clapham PJ Hammond PS Katona S Larsen F Lien JMattila D Palsboslashll PJ Sigurjonsson J et al 1999 An ocean-basin-widemark-recapture study of the North Atlantic humpback whale(Megaptera novaeangliae) Mar Mamm Sci 51ndash32

Stamatakis A 2014 RAxML version 8 a tool for phylogenetic analysisand post-analysis of large phylogenies Bioinformatics30(9)1312ndash1313

Stanke M Diekhans M Baertsch R Haussler D 2008 Using native andsyntenically mapped cDNA alignments to improve de novo genefinding Bioinformatics 24(5)637ndash644

Steeman ME Hebsgaard MB Fordyce RE Ho SYW Rabosky DL NielsenR Rahbek C Glenner H Soslashrensen MV Willerslev E 2009 Radiationof extant cetaceans driven by restructuring of the oceans Syst Biol58(6)573ndash585

Sulak M Fong L Mika K Chigurupati S Yon L Mongan NP Emes RDLynch VJ 2016 TP53 copy number expansion is associated with theevolution of increased body size and an enhanced DNA damageresponse in elephants Elife 51850

Supek F Bosnjak M Skunca N Smuc T 2011 REVIGO summarizes andvisualizes long lists of gene ontology terms PLoS One 6(7)e21800

Szklarczyk D Morris JH Cook H Kuhn M Wyder S Simonovic M SantosA Doncheva NT Roth A Bork P et al 2017 The STRING database in2017 quality-controlled protein-protein association networks madebroadly accessible Nucleic Acids Res 45(D1)D362ndashD368

Tacutu R Craig T Budovsky A Wuttke D Lehmann G Taranukha DCosta J Fraifeld VE de Magalh~aes JP 2012 Human Ageing GenomicResources integrated databases and tools for the biology and genet-ics of ageing Nucleic Acids Res 41(D1)D1027ndashD1033

Tariq MA Kim HJ Jejelowo O Pourmand N 2011 Whole-transcriptomeRNAseq analysis from minute amount of total RNA Nucleic AcidsRes 39(18)e120

Tarver JE Dos Reis M Mirarab S Moran RJ Parker S OrsquoReilly JE King BLOrsquoConnell MJ Asher RJ Warnow T et al 2016 The interrelationshipsof placental mammals and the limits of phylogenetic inferenceGenome Biol Evol 8(2)330ndash344

Thewissen JGM Cohn MJ Stevens LS Bajpai S Heyning J Horton WE2006 Developmental basis for hind-limb loss in dolphins and origin

of the cetacean bodyplan Proc Natl Acad Sci U S A103(22)8414ndash8418

Thewissen JGM Cooper LN Clementz MT Bajpai S Tiwari BN 2007Whales originated from aquatic artiodactyls in the Eocene epoch ofIndia Nature 450(7173)1190ndash1194

Tollis M Schiffman JD Boddy AM 2017 Evolution of cancer suppressionas revealed by mammalian comparative genomics Curr Opin GenetDev 4240ndash47

Trego KS Groesser T Davalos AR Parplys AC Zhao W Nelson MRHlaing A Shih B Rydberg B Pluth JM et al 2016 Non-catalytic rolesfor XPG with BRCA1 and BRCA2 in homologous recombination andgenome stability Mol Cell 61(4)535ndash546

UniProt Consortium 2015 UniProt a hub for protein informationNucleic Acids Res 43D204ndashD212

Valsecchi E Palsboslashll PJ Hale P Glockner-Ferrari D Ferrari M Clapham PLarsen F Mattila DK Sears R Sigurjonsson J et al 1997 Microsatellitegenetic distances between oceanic opulations of the humpbackwhale (Megaptera novaeangliae) Mol Biol Evol 14355ndash362

Venkat A Hahn MW Thornton JW 2018 Multinucleotide mutationscause false inferences of lineage-specific positive selection Nat EcolEvol 2(8)1280ndash1288

Vermeij GJ 2016 Gigantism and its implications for the history of lifePLoS One 11(1)e0146092

Wang D Zhang Y Zhang Z Zhu J Yu J 2010 KaKs_Calculator 20 atoolkit incorporating gamma-series methods and sliding windowstrategies Genomics Proteomics Bioinformatics 8(1)77ndash80

Warren WC Kuderna L Alexander A Catchen J Perez-Silva JG Lopez-Otın C Quesada V Minx P Tomlinson C Montague MJ et al 2017The novel evolution of the sperm whale genome Genome Biol Evol9(12)3260ndash3264

Webb AE Walsh TA OrsquoConnell MJ 2017 VESPA very large-scale evo-lutionary and selective pressure analyses PeerJ Comput Sci 3e118

Weng M-P Liao B-Y 2017 modPhEA model organism PhenotypeEnrichment Analysis of eukaryotic gene sets Bioinformatics33(21)3505ndash3507

Wilson Sayres MA Venditti C Pagel M Makova KD 2011 Do variationsin substitution rates and male mutation bias correlate with life-history traits A study of 32 mammalian genomes Evolution65(10)2800ndash2815

Yang Z 1998 Likelihood ratio tests for detecting positive selection andapplication to primate lysozyme evolution Mol Biol Evol15(5)568ndash573

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 24(8)1586ndash1591

Yang Z Rannala B 2006 Bayesian estimation of species divergence timesunder a molecular clock using multiple fossil calibrations with softbounds Mol Biol Evol 23(1)212ndash226

Yim H-S Cho YS Guang X Kang SG Jeong J-Y Cha S-S Oh H-M Lee J-HYang EC Kwon KK et al 2014 Minke whale genome and aquaticadaptation in cetaceans Nat Genet 46(1)88ndash92

Yu J Tian S Metheny-Barlow L Chew L-J Hayes AJ Pan H Yu G-L LiL-Y 2001 Modulation of endothelial cell growth arrest andapoptosis by vascular endothelial growth inhibitor Circ Res89(12)1161ndash1167

Zhang C Rabiee M Sayyari E Mirarab S 2018 ASTRAL-III polynomialtime species tree reconstruction from partially resolved gene treesBMC Bioinformatics 19(Suppl 6)153

Zhang Q Edwards SV 2012 The evolution of intron size in amniotes arole for powered flight Genome Biol Evol 4(10)1033ndash1043

Zhao M Kim P Mitra R Zhao J Zhao Z 2015 TSGene 20 an updatedliterature-based knowledgebase for tumor suppressor genes NucleicAcids Res 44gkv1268ndashgkD1031

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1763

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

  • msz099-TF1
  • msz099-TF2
  • msz099-TF3
  • msz099-TF4
  • msz099-TF5
  • msz099-TF6
  • msz099-TF7
  • msz099-TF8
Page 10: University of Groningen Return to the sea, get huge, …...Return to the Sea, Get Huge, Beat Cancer: An Analysis of Cetacean Genomes Including an Assembly for the Humpback Whale (Megaptera

Table 5 CGC Genes with dNdS gt 1 as Revealed by Pairwise Comparisons of Cetacean Genomes

Comparison Gene Symbol Gene Name Role in Cancer Function

Minkehumpback CD274 CD274 molecule TSG Plays a critical role in induction and maintenance ofimmune tolerance to selfa

ETNK1 Ethanolamine kinase 1 TSG Suppresses escaping of programmed cell deathb

IL21R Interleukin 21 receptor Fusion The ligand binding of this receptor leads to the acti-vation of multiple downstream signaling mole-cules including JAK1 JAK3 STAT1 and STAT32

MYOD1 Myogenic differentiation 1 Fusion Regulates muscle cell differentiation by inducing cellcycle arrest a prerequisite for myogenic initiationa

PHF6 PHD finger protein 6 TSG Encodes a protein with two PHD-type zinc fingerdomains indicating a potential role in transcrip-tional regulation that localizes to the nucleolusa

Orcadolphin BTG1 B-cell translocation gene 1antiproliferative

TSG fusion Member of an antiproliferative gene family that reg-ulates cell growth and differentiationa

CD274 CD274 molecule TSG fusion Plays a critical role in induction and maintenance ofimmune tolerance to selfa

FANCD2 Fanconi anemia comple-mentation group D2

TSG Suppresses genome instability and mutations pro-motes escaping programmed cell death suppressesproliferative signaling suppresses invasion andmetastasisb

FAS Fas cell surface deathreceptor

TSG Promotes cell replicative immortality promotesproliferative signaling promotes invasion and me-tastasis suppresses escaping programmed celldeathb

FGFR4 Fibroblast growth factor re-ceptor 4

Oncogene Promotes proliferative signaling promotes invasionand metastasisb

GPC3 Glypican 3 Oncogene TSG Promotes invasion and metastasis promotes sup-pression of growthb

HOXD11 Homeobox D11 Oncogene fusion The homeobox genes encode a highly conservedfamily of transcription factors that play an impor-tant role in morphogenesis in all multicellularorganismsa

HOXD13 Homeobox D13 Oncogene fusion

LASP1 LIM and SH3 protein 1 Fusion The encoded protein has been linked to metastaticbreast cancer hematopoetic tumors such as B-celllymphomas and colorectal cancera

MLF1 Myeloid leukemia factor 1 TSG fusion This gene encodes an oncoprotein which is thought toplay a role in the phenotypic determination ofhemopoetic cells Translocations between this geneand nucleophosmin have been associated withmyelodysplastic syndrome and acute myeloidleukemiaa

MYB v-myb myeloblastosis viraloncogene homolog

Oncogene fusion This gene may be aberrantly expressed or rearrangedor undergo translocation in leukemias and lym-phomas and is considered to be an oncogenea

MYD88 Myeloid differentiation pri-mary response gene (88)

Oncogene Promotes escaping programmed cell death promotesproliferative signaling promotes invasion and me-tastasis promotes tumor promotinginflammationb

NR4A3 Nuclear receptor subfamily4 group A member 3(NOR1)

Oncogene fusion Encodes a member of the steroidndashthyroid hormonendashretinoid receptor superfamily that may act as atranscriptional activatora

PALB2 Partner and localizer ofBRCA2

TSG This protein binds to and colocalizes with the breastcancer 2 early onset protein (BRCA2) in nuclear fociand likely permits the stable intranuclear localiza-tion and accumulation of BRCA2a

PML Promyelocytic leukemia TSG fusion Expression is cell-cycle related and it regulates the p53response to oncogenic signalsa

RAD21 RAD21 homolog(Schizosaccharomycespombe)

Oncogene TSG Promotes invasion and metastasis suppresses ge-nome instability and mutations suppresses escap-ing programmed cell deathb

STIL SCLTAL1 interruptinglocus

Oncogene fusion Encodes a cytoplasmic protein implicated in regula-tion of the mitotic spindle checkpoint a regulatorypathway that monitors chromosome segregationduring cell division to ensure the proper distribu-tion of chromosomes to daughter cellsa

(continued)

Tollis et al doi101093molbevmsz099 MBE

1754

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

femur (MP0000559 six genes FDRfrac14 0019) These pheno-types are reminiscent of the developmental basis of hind limbloss in cetaceans embryonic studies show that hind limbbuds are initially formed but disappear by the fifth gestationalweek (Thewissen et al 2006) Enriched mouse phenotypes arealso related to the unique cetacean axial skeleton(MP0002114 25 genes FDRfrac14 0016) most notably in theskull including craniofacial bones (MP0002116 17 genesFDRfrac14 0018) teeth (MP0002100 nine genes FDRfrac14 0004)and the presphenoid (MP0030383 three genesFDRfrac14 0003) Past analyses of the cetacean basicranial ele-ments revealed that the presphenoid was extensively modi-fied along the cetacean lineage (Ichishima 2016)

Positively selected genes unique to the humpback whalewere significantly enriched for a single biological process reg-ulation of cell cycle checkpoint (GO1901976 1857-fold en-richment Pfrac14 002 after Bonferonni correction for multipletesting) suggesting positive selection in pathways that con-trol responses to endogenous or exogenous sources of DNAdamage and limit cancer progression (Kastan and Bartek2004) We detected a significant number of proteinndashproteininteractions among humpback whale-specific positively se-lected genes (number of nodesfrac14 204 number of edgesfrac14 71expected number of edgesfrac14 51 Pfrac14 0004 supplementaryfig 10 Supplementary Material online) including genes thatare often coexpressed and involved in DNA repair DNA rep-lication and cell differentiation For instance we identifiedsignificant interactions between DNA2 which encodes a heli-case involved in the maintenance of DNA stability andWDHD1 which acts as a replication initiation factorAnother robust protein interaction network was detected

between a number of genes involved in the genesis and main-tenance of primary cilia The highest scoring functional an-notation clusters resulted in key words such as ciliopathy(seven genes) and cell projection (16 genes) and GO termssuch as cilium morphogenesis cilium assembly ciliary basalbody and centriole The primary cilia of multicellular eukar-yotes control cell proliferation by mediating cell-extrinsic sig-nals and regulating cell cycle entry and defects in ciliaryregulation are common in many cancers (Michaud andYoder 2006)

Our branch-site test results indicated that the evolution ofcetacean gigantism was accompanied by strong selection onmany pathways that are directly linked to cancer (fig 4C) Weidentified 33 genes that are mutated in human cancers(according to the COSMIC database) that were inferred assubjected to positive selection in the humpback whale line-age including the known tumor suppressor genes ATR whichis a protein kinase that senses DNA damage upon genotixicstress and activates cell cycle arrest and RECK which sup-presses metastasis (Forbes et al 2015) Multiple members ofthe PR domain-containing gene family (PRDM) evolved un-der positive selection across cetaceans including the tumorsuppressor genes PRDM1 whose truncation leads to B-cellmalignancies and PRDM2 which regulates the expressionand degradation of TP53 (Shadat et al 2010) and whoseforced expression causes apoptosis and cell cycle arrest incancer cell lines (Fog et al 2012) In baleen whales ERCC5which is a DNA repair protein that partners with BRCA1 andBRCA2 to maintain genomic stability (Trego et al 2016) andsuppresses UV-induced apoptosis (Clement et al 2006)appeared to have been subjected to positive selection as

Table 5 Continued

Comparison Gene Symbol Gene Name Role in Cancer Function

TAL1 T-cell acute lymphocyticleukemia 1 (SCL)

Oncogene fusion Implicated in the genesis of hemopoietic malignan-cies and may play an important role in hemopoieticdifferentiationa

TNFRSF14 Tumor necrosis factor re-ceptor superfamilymember 14 (herpesvirusentry mediator)

TSG The encoded protein functions in signal transductionpathways that activate inflammatory and inhibi-tory T-cell immune responsea

TNFRSF17 Tumor necrosis factor re-ceptor superfamilymember 17

Oncogene fusion This receptor also binds to various TRAF familymembers and thus may transduce signals for cellsurvival and proliferationa

NOTEmdashTSG tumor suppressor geneaSource RefSeqbSource Cancer Hallmark from CGC

Table 6 GO Terms for Biological Processes Overrepresented by Genes with Pairwise dNdS gt 1 in the Orca Bottlenose Dolphin Comparison

GO Term Description Number of Genes Fold Enrichment P-Valuea

GO0031347 Regulation of defense response 36 270 291E-2GO0050776 Regulation of immune response 50 218 195E-3GO0002694 Regulation of leukocyte activation 31 249 189E-2GO0002275 Myeloid cell activation involved in immune response 29 253 237E-2GO0002699 Positive regulation of immune effector process 19 446 101-E3GO0042108 Positive regulation of cytokine biosynthetic process 9 687 223E-2

aAfter Bonferonni correction for multiple testing

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1755

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

well Across all the branch-site models positively selectedgenes represented multiple functional categories relevant tocancer and Petorsquos Paradox

Among the cancer-related genes subjected to positiveselection in cetaceans we identified two with identicalamino acid changes among disparate taxa united by thetraits of large body size andor extreme longevitySpecifically PRDM13 is a tumor suppressor gene that actsas a transcriptional repressor and we found identical DEamino acid substitutions in this gene in sperm whale dol-phin orca and humpback but also manatee (Trichechusmanatus) and African elephant (Loxodonta africana) whichare large-bodied afrotherian mammals that have been thefocus of cancer suppression research (Abegglen et al 2015Sulak et al 2016) Secondly POLE is a cancer-related genethat participates in DNA repair and replication and weobserved one IV substitution shared among orca dol-phin bowhead humpback and common minke whalebut also elephant as well as a second IV substitutionshared with these cetaceans and the little brown bat(Myotis lucifugus) Vesper bats such as M lucifugus areknown for their exceptional longevity relative to theirbody size and have been proposed as model organisms insenescence and cancer research (Foley et al 2018) Parallelchanges in cancer-related genes across these

phylogenetically distinct mammals suggest natural selectionhas acted on similar pathways that limit neoplastic progres-sion in large and long-lived species (Tollis et al 2017)

Petorsquos Paradox and Cancer in Whales and Other LargeMammalsLarge body size has evolved numerous times in mammalsand although it is exemplified in some extant cetaceans gi-gantism is also found in afrotherians perissodactyls and car-nivores (Baker et al 2015) Our results suggest that cancersuppression in large and long-lived mammals has also evolvednumerous times However none of these species iscompletely immune to cancer Elephants have at least a 5lifetime risk of cancer mortality (Abegglen et al 2015) whichis far less than humans but detecting cancer and estimatingcancer incidence and mortality rates in wild cetaceans is morechallenging Mathematical modeling predicting the lifetimerisk of colorectal cancer in mice and humans yielded a rate ofcolorectal cancer at 50 in blue whales by age 50 and 100by age 90 (Caulin et al 2015) This high rate of cancer mor-tality is an unlikely scenario and taken with our genomicresults presented here it suggests that cetaceans have evolvedmechanisms to limit their overall risk of cancer Among ba-leen whales benign neoplasms of the skin tongue and centralnervous system have been reported in humpback whales and

Orca

Minke whale

Cow

Bowhead whale

Bottlenose dolphin

Humpback whale

Sperm whale54

12

84

74

212A

ATRT Suppression of growth escaping programmedcell death change of cellular energetics

Humpback whale

BCORL1TF Invasion and metastasis

PICALMT Change of cellular energetics

PRDM2T Regulation of TP53 expression and degredation

TPRF Involved in the activation of oncogenic kinases

AMER1T Promotes suppression of growth

CCNB1IP1T Suppresses invasion and metastasis

POLET Participates in DNA repair and DNA replication

Cetacea ancestor

KMT2CT Methylates rsquoLys-4rsquo of histone H3

POU2AF1 Promotes proliferative signalingO

PRDM1T Regulation of TP53 expression anddegradation

ERCC5T Excision repair DNA double-strand breakrepair following UV damage

Mysticeti ancestor

PMLT Positively regulates TP53 by inhibiting MDM2-dependent degredation

Odontoceti ancestor

MUC1F Overexpressed in epithelial tumors

Minke and humpback whale ancestor

RETO Plays a role in cell differentiation and growth

C COSMIC GENES

anatomicalstructure

formation involvedin morphogenesis

autophagy

cellproliferation

cellminuscellsignaling

circulatorysystemprocess

growth

pigmentation

proteinmaturation

ribonucleoproteincomplexassembly

symbiosisencompassing

mutualismthrough

parasitismvacuolartransport

B

20

FIG 4 Positively selected genes during cetacean evolution (A) Species tree relationships of six modern cetaceans with complete genomeassemblies estimated from 152 single-copy orthologs Branch lengths are given in coalescent units Outgroup taxa are not shown The completespecies tree of 28 mammals is shown in supplementary figure 4 Supplementary Material online Boxes with numbers indicate the number ofpositively selected genes passing filters and a Bonferroni correction detected on each branch (B) TreeMap from REVIGO for GO biologicalprocesses terms represented by genes evolving under positive selection across all cetaceans Rectangle size reflects semantic uniqueness of GOterm which measures the degree to which the term is an outlier when compared semantically to the whole list of GO terms (C) Cancer gene namesand functions from COSMIC found to be evolving under positive selection in the cetacean branch-site models Superscripts for gene namesindicate as follows T tumor suppressor gene O oncogene F fusion gene Asterisks indicate P-value following FDR correction for multiple testingPlt 001 Plt 0001 Plt 00001

Tollis et al doi101093molbevmsz099 MBE

1756

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

ovarian carcinomas and lymphomas have been detected infin whales (Newman and Smith 2006) Among smaller ceta-ceans one unusually well-documented case study concludedthat 27 of beluga whales (Delphinapterus leucas) found deadin the St Lawrence estuary had cancer which may have con-tributed to 18 of the total mortality in that population(Martineau et al 2002) The authors suggested that thehigh degree of polycyclic aromatic hydrocarbons releasedinto the estuary by nearby industry may have contributedto this elevated cancer risk (Martineau et al 2002) Bycontrast the larger baleen whale species in the Gulf ofSt Lawrence appear to have lower contaminant burdenslikely due to ecological differences (Gauthier et al 1997)Interestingly unlike in human cells homologous recom-bination is uninhibited in North Atlantic right whale lungcells following prolonged exposure to the human lungcarcinogen particulate hexavalent chromium (Browninget al 2017) suggesting adaptations for high-fidelity DNArepair in whales

In this study we provide a de novo reference assemblyfor the humpback whalemdashone of the more well-studiedgiants living on Earth today The humpback whale genomeassembly is highly contiguous and contains a comparablenumber of orthologous genes to other mammalian ge-nome projects Our comparisons with other complete ce-tacean genomes confirm the results of other studies whichconcluded that rorqual whales likely began diversifyingduring the Miocene (Slater et al 2017 Arnason et al2018) We found indications of positive selection onmany protein-coding genes suggestive of adaptive changein pathways controlling the mammalian appendicular andcranial skeletal elements which are relevant to highly spe-cialized cetacean phenotypes as well as in many immunitygenes and pathways that are known to place checks onneoplastic progression LSDs in cetacean genomes containmany genes involved in the control of apoptosis includingknown tumor suppressor genes and skin transcriptomeresults from humpback whale suggest many gene duplica-tions whether through segmental duplication or retro-transposition are transcribed and hence likely functionalWe also use genome-wide evidence to show that germlinemutation rates may be slower in cetaceans than in othermammals which has been suggested in previous studies(Jackson et al 2009) and we suggest as a corollary thatcetacean somatic mutations rates may be lower as wellThese results are consistent with predictions stemmingfrom Petorsquos Paradox (Peto et al 1975 Caulin and Maley2011) which posited that gigantic animals have evolvedcompensatory adaptations to cope with the negativeeffects of orders of magnitude more cells and long lifespansthat increase the number of cell divisions and cancer riskover time Altogether the humpback whale genome as-sembly will aid comparative oncology research that seeksto improve therapeutic targets for human cancers as wellas provide a resource for developing useful genomicmarkers that will aid in the population management andconservation of whales

Materials and Methods

Tissue Collection and DNA ExtractionBiopsy tissue was collected from an adult female humpbackwhale (ldquoSaltrdquo NCBI BioSample SAMN1058501) in the Gulf ofMaine western North Atlantic Ocean using previously de-scribed techniques (Lambertsen 1987 Palsboslashll et al 1991) andflash frozen in liquid nitrogen We extracted DNA from skinusing the protocol for high-molecular-weight genomic DNAisolation with the DNeasy Blood and Tissue purification kit(Qiagen) Humpback whales can be individually identifiedand studied over time based on their unique ventral flukepigmentation (Katona and Whitehead 1981) Salt was specif-ically selected for this study because of her 35-year priorsighting history which is among the lengthiest and detailedfor an individual humpback whale (Center for Coastal Studiesunpublished data)

De Novo Assembly of the Humpback Whale GenomeUsing a combination of paired-end and mate-pair libraries denovo assembly was performed using Meraculous 204(Chapman et al 2011) with a kmer size of 47 Reads weretrimmed for quality sequencing adapters and mate-pairadapters using Trimmomatic (Bolger et al 2014) The genomesize of the humpback whale was estimated using the shortreads by counting the frequency of kmers of length 27 oc-curring in the 180-bp data set estimating the kmer coverageand using the following formula genome size frac14 total kmers kmer coverage

Chicago Library Preparation and SequencingFour Chicago libraries were prepared as described previously(Putnam et al 2016) Briefly for each library500 ng of high-molecular-weight genomic DNA (mean fragment lengthgt50kb) was reconstituted into chromatin in vitro and fixed withformaldehyde Fixed chromatin was digested with MboI orDpnII the 50-overhangs were repaired with biotinylatednucleotides and blunt ends were ligated After ligation cross-links were reversed and the DNA purified from protein Biotinthat was not internal to ligated fragments was removed fromthe purified DNA The DNA was then sheared to 350 bpmean fragment size and sequencing libraries were generatedusing NEBNext Ultra (New England BioLabs) enzymes andIllumina-compatible adapters Biotin-containing fragmentswere isolated using streptavidin beads before PCR enrichmentof each library The libraries were sequenced on an IlluminaHiSeq 2500 platform

Scaffolding the De Novo Assembly with HiRiseThe input de novo assembly shotgun reads and Chicagolibrary reads were used as input data for HiRise a softwarepipeline designed specifically for using Chicago data to scaf-fold genome assemblies (Putnam et al 2016) Shotgun andChicago library sequences were aligned to the draft inputassembly using a modified SNAP read mapper (httpsnapcsberkeleyedu) The separations of Chicago read pairsmapped within draft scaffolds were analyzed by HiRise toproduce a likelihood model for genomic distance between

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1757

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

read pairs and the model was used to identify putative mis-joins and to score prospective joins After scaffolding shotgunsequences were used to close gaps between contigs

Assessing the Gene Content of the Humpback WhaleAssemblyThe expected gene content of the assembly was evaluatedusing the Core Eukaryotic Genes Mapping Approach (Parraet al 2009) which searches the assembly for 458 highly con-served proteins and reports the proportion of 248 of the mosthighly conserved orthologs that are present in the assemblyWe also used the Benchmarking Universal Single CopyOrthologs (BUSCO v201 Sim~ao et al 2015) which analyzesgenome assemblies for the presence of 3023 genes conservedacross vertebrates as well as a set of 6253 genes conservedacross laurasiatherian mammals

Transcriptome Sequencing and AssemblyIn order to aid in our gene-finding efforts for the humpbackwhale genome assembly and to measure gene expression wegenerated transcripts from skin tissue by extracting total RNAusing the QIAzol Lysis Reagent (Qiagen) followed by purifi-cation on RNeasy spin columns (Qiagen) RNA integrity andquantity were determined on the Agilent 2100 Bioanalyzer(Agilent) using the manufacturerrsquos protocol The total RNAwas treated with DNase using DNase mix from the RecoverAllTotal Nucleic Acid Isolation kit (Applied BiosystemsAmbion) The RNA library was prepared and sequenced bythe Genome Technology Center at the University ofCalifornia Santa Cruz including cDNA synthesis with theOvation RNA-Seq system V2 (Nugen) and RNA amplificationas described previously (Tariq et al 2011) We used 05ndash1mgof double-stranded cDNA for library preparation shearedusing the Covaris S2 size-selected for 350ndash450 bp using au-tomated electrophoretic DNA fractionation system(LabChipXT Caliper Life Sciences) Paired-end sequencing li-braries were constructed using Illumina TruSeq DNA SamplePreparation Kit Following library construction samples werequantified using the Bioanalyzer and sequenced on theIllumina HiSeq 2000 platform to produce 2 100 bp se-quencing reads We then used Trinity (Grabherr et al 2011)to assemble the adapter-trimmed RNA-Seq reads intotranscripts

Genome AnnotationWe generated gene models for the humpback whale usingmultiple iterations of MAKER2 (Holt and Yandell 2011) whichincorporated 1) direct evidence from the Trinity-assembledtranscripts 2) homology to NCBI proteins from ten mammals(human mouse dog cow sperm whale bottlenose dolphinorca bowhead whale common minke whale and baiji) andUniProtKBSwiss-Prot (UniProt Consortium 2015) and 3) abinitio gene predictions using SNAP (11292013 release Korf2004) and Augustus v302 (Stanke et al 2008) A detaileddescription of the annotation pipeline is provided in the sup-plementary Methods Supplementary Material online Finalgene calls were annotated functionally by BlastP similarity to

UniProt proteins (UniProt Consortium 2015) with an e-valuecutoff of 1e-6

Repeat Annotation and Evolutionary AnalysisTo analyze the repetitive landscape of the humpback whalegenome we used both database and de novo modelingmethods For the database method we ran RepeatMaskerv405 (httpwwwrepeatmaskerorg accessed August 212017) (Smit et al 2015a) on the final assembly indicatingthe ldquomammaliardquo repeat library from RepBase (Jurka et al2005) For the de novo method we scanned the assemblyfor repeats using RepeatModeler v108 (httpwwwrepeat-maskerorg) (Smit et al 2015b) the results of which were thenclassified using RepeatMasker To estimate evolutionary di-vergence within repeat subfamilies in the humpback whalegenome we generated repeat-family-specific alignments andcalculated the average Kimura-2-parameter divergence fromconsensus within each family correcting for high mutationrates at CpG sites with the calcDivergenceFromAlignplRepeatMasker tool We compared the divergence profile ofhumpback whale and bowhead whale by completing parallelanalyses and the repetitive landscapes of orca and bottlenosedolphin are available from the RepeatMasker server (httpwwwrepeatmaskerorgspecies accessed August 21 2017)

Analysis of Gene Expression Using RNA-SeqSplice-wise mapping of RNA-Seq reads against the humpbackwhale genome assembly and annotation was carried out us-ing STAR v24 (Dobin et al 2013) and we counted the num-ber of reads mapping to gene annotations We also mappedthe skin RNA-Seq data to the database of annotated hump-back whale transcripts using local alignments with bowtiev225 (Langmead and Salzberg 2012) and used stringtiev134 (Pertea et al 2015) to calculate gene abundances bytranscripts per million

Analysis of Segmental Duplications in CetaceanGenomesIn order to detect LSDs in several cetacean genomes we ap-plied an approach based on depth of coverage (Alkan et al2009) To this end we used whole-genome shotgun sequencedata from the current study as well as from other cetaceangenomics projects All data were mapped against the hump-back whale reference assembly A detailed description of thesegmental duplication analysis is provided in the supplemen-tary Methods Supplementary Material online

Whole-Genome AlignmentsWe generated WGAs of 12 mammals (supplementary table 2Supplementary Material online) First we generated pairwisesyntenic alignments of each species as a query to the humangenome (hg19) as a target using LASTZ v102 (Harris 2007)followed by chaining to form gapless blocks and netting torank the highest scoring chains (Kent et al 2003) The pairwisealignments were used to construct a multiple sequence align-ment with MULTIZ v112 (Blanchette et al 2004) with humanas the reference species We filtered the MULTIZ alignment to

Tollis et al doi101093molbevmsz099 MBE

1758

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

only contain aligned blocks from at least 10 out of the 12species (81 complete)

Phylogenetic Reconstruction Using Single-CopyOrthologsWe downloaded the coding DNA sequences from 28 publiclyavailable mammalian genome assemblies (supplementary ta-ble 9 Supplementary Material online) and used VESPA(Webb et al 2017) to obtain high-confidence SGOs (supple-mentary Methods Supplementary Material online) For phy-logenetic analysis we filtered the SGO data set to include onlyloci that were represented by at least 24 out of the 28 mam-malian species (86 complete) and reconstructed each genetree using maximum likelihood in PhyML v30 (Guindon et al2010) with an HKY85 substitution model and 100 bootstrapreplicates to assess branch support The gene trees were thenbinned and used to reconstruct a species tree using the ac-curate species tree algorithm (ASTRAL-III v56 Zhang et al2018) ASTRAL utilizes the multispecies coalescent modelthat incorporates incomplete lineage sorting and finds thespecies tree stemming from bipartitions predefined by thegene trees Branch support for the species tree was assessedwith local posterior probabilities and branch lengths werepresented in coalescent units where shorter branch lengthsindicate greater gene tree discordance (Sayyari and Mirarab2016)

Rates of Molecular Evolution and Divergence TimeEstimationWe used multiple approaches on independent data sets toestimate rates of molecular evolution and the divergencetimes of the major mammalian lineages including six modernwhales with complete genome assemblies We first focusedon 4-fold degenerate (4D) sites which are positions withincodon alignments where substitutions result in no aminoacid change and can be used to approximate the neutralrate of evolution (Kumar and Subramanian 2002) We usedthe Ensembl human gene annotation to extract codingregions from the 12-mammal WGA using msa_view inPHAST v14 (Hubisz et al 2011) We reconstructed the phy-logeny with the 4D data as a single partition in RAxML v83(Stamatakis 2014) under the GTRGAMMA substitutionmodel and assessed branch support with 10000 bootstrapsRates of molecular evolution were estimated on the 4D dataset with the semiparametric PL method implemented in r8sv18 (Sanderson 2002 2003) A detailed description of the PLmethod is given in the supplementary MethodsSupplementary Material online

We also used the approximate likelihood calculation inMCMCtree (Yang and Rannala 2006) to estimate divergencetimes using independent data sets 1) the above-mentioned4D data set derived from the WGA as well as 2) a set of theSGOs that included 24 out of 28 sampled taxa (86) and waspartitioned into three codon positions We implemented theHKY85 substitution model multiple fossil-based priors (sup-plementary table 10 Supplementary Material online Mitchell1989 Benton et al 2015 Hedges et al 2015) and independentrates (ldquoclockfrac14 3rdquo) along branches All other parameters were

set as defaults For each MCMCtree analysis we ran the anal-ysis three times with different starting seeds and modified theMarkov chain Monte Carlo (MCMC) length and samplingfrequency in order to achieve proper chain convergencemonitored with Tracer v17 We achieved proper MCMCconvergence on the 4D data set after discarding the first500000 steps as burn-in and sampling every 2000 steps untilwe collected 20000 samples We achieved proper MCMCconvergence on the SGO data set after discarding the first500000 steps as burn-in and sampling every 10000 steps untilwe collected 10000 samples

Demographic AnalysisWe used the PSMC (Li and Durbin 2011) to reconstruct thepopulation history of North Atlantic humpback whales in-cluding the individual sequenced in the current study (down-sampled to 20 coverage) and a second individualsequenced at 17 coverage in Arnason et al (2018) Adetailed description of the PSMC analysis is provided in thesupplementary Methods Supplementary Material online

Nonneutral Substitution Rates in Cetacean GenomesIn order to identify genomic regions controlling cetacean-specific adaptations we used phyloP (Pollard et al 2010) todetect loci in the 12-mammal WGA that depart from neutralexpectations (see supplementary Methods SupplementaryMaterial online) We then collected accelerated regions thatoverlapped human whole gene annotations (hg19) using bed-tools intersect (Quinlan and Hall 2010) and tested for theenrichment of GO terms using the PANTHER analysis toolavailable at the Gene Ontology Consortium website (GOOntology database last accessed June 2017) (GeneOntology Consortium 2015)

Detection of Protein-Coding Genes Subjected toPositive SelectionIn order to measure selective pressures acting on protein-coding genes during cetacean evolution with an emphasison the evolution of cancer suppression we estimated theratio of nonsynonymous to synonymous substitutions (dNdS) To maximize statistical power in pairwise comparisonsgiven the number of available cetacean genomes (six lastaccessed September 2017) we implemented phylogenetictargeting (Arnold and Nunn 2010) assuming a phylogenyfrom a mammalian supertree (Fritz et al 2009) To selectgenome assemblies most suitable for assessing PetorsquosParadox we weighted scores for contrasts with a lot of changein the same direction for both body mass and maximumlongevity Trait values were taken from panTHERIA (Joneset al 2009) and we selected maximal pairings based on thestandardized summed scores We then generated pairwisegenome alignments as described above based on the phylo-genetic targeting results For each pairwise genome align-ment we stitched gene blocks in Galaxy (Blankenberg et al2011) according to the target genome annotations produc-ing alignments of one-to-one orthologs which were filteredto delete frameshift mutations and replace internal stopcodons with gaps We then estimated pairwise dNdS for every

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1759

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

orthologous gene pair with KaKs_Calculator v20 (Wang et al2010) To link genes with dNdS gt1 to potential phenotypeswe used orthologous human Ensembl gene IDs to collect GOterms in BioMart (Kinsella et al 2011) and tested for enrich-ment of overrepresented GO terms

We also used codon-based models to test for selectivepressure variation along branches of the cetacean phylog-eny in comparison to other mammal lineages also knownas the branch-site test (Yang 2007) First the known spe-cies phylogeny (Morgan et al 2013 Tarver et al 2016) waspruned to correspond to the species present in each SGOfamily SGO nucleotide alignments that contained morethan seven species were analyzed for selective pressurevariation This is to reduce the risk of detecting falsepositives (Anisimova et al 2001 2002) In general thebranch-site test is a powerful yet conservative approach(Gharib and Robinson-Rechavi 2013) although modelmisspecification and alignment errors can greatly increasethe number of false positives (Anisimova et al 20012002) Recent studies have concluded that many pub-lished inferences of adaptive evolution using the branchsite test may be affected by artifacts (Venkat et al 2018)Therefore extensive filtering is necessary in order to makereasonably sound conclusions from results of the branchsite test A detailed description of all tested models andthe filtering process are given in the supplementaryMethods Supplementary Material online In total 1152gene families were analyzed We carried out the branch-site test using PAML v44e (Yang 2007) The following fivebranches were assessed as foreground humpback whalethe most recent common ancestor (MRCA) of the com-mon minke and humpback whales MRCA of baleenwhales MRCA of toothed whales and the MRCA of allwhales (cetacean stem lineage) For each model we keptall genes that met a significance threshold of Plt 005 aftera Bonferonni correction for multiple hypothesis testingusing the total number of branch genes (five foregroundbranches1152 genes) We also corrected the raw P-val-ues from the likelihood ratio tests of every gene by theFDR correction where qfrac14 005 (Benjamini and Hochberg1995) The Bonferroni correction is more conservativethan the FDR but sufficient for multiple hypothesis test-ing of lineage-specific positive selection in these casesthe FDR results in higher probabilities of rejecting truenull hypotheses (Anisimova and Yang 2007) Thereforewe used Bonferroni-corrected results in downstreamanalyses but also report FDR-corrected P-values infigure 4C Genes were identified based on the humanortholog (Ensembl gene ID) and we performed gene an-notation enrichment analysis and functional annotationclustering with DAVID v68 (Huang et al 2009a 2009b) aswell as semantic clustering of GO terms using REVIGO(Supek et al 2011) We also searched for interactions ofpositively selected proteins using STRING v105(Szklarczyk et al 2017) with default parameters andtested for the enrichment of overrepresented GO termsas above and for associated mouse phenotypes usingmodPhEA (Weng and Liao 2017)

Data AvailabilityAll data that contributed to the results of the study are madepublicly available The genomic sequencing RNA sequencingas well as the genome assembly for humpback whale(GCA_0043293851) are available under NCBI BioProjectPRJNA509641 The gene annotation orthologous gene setspositive selection results segmental duplication annotationsand whole-genome alignments used in this study are availableat the Harvard Dataverse (httpsdoiorg107910DVNADHX1O)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsField data collection was supported by the Center for CoastalStudies and conducted under NOAA research permit 16325This work was initiated with funding from Dr Jeffrey Pearl atUC San Fransisco This work was supported in part by NIH(Grants U54 CA217376 U2C CA233254 P01 CA91955 R01CA149566 R01 CA170595 R01 CA185138 and R01CA140657) as well as CDMRP Breast Cancer ResearchProgram (Award BC132057) and the Arizona BiomedicalResearch Commission (Grant ADHS18-198847) to CCMThe findings opinions and recommendations expressedhere are those of the authors and not necessarily those ofthe universities where the research was performed or theNational Institutes of Health PJP would like to acknowledgefunding from Stockholm University and the University ofGroningen Startup funds were provided by the School ofInformatics Computing and Cyber Systems at NorthernArizona University (MT) MJOrsquoC would like to thank theUniversity of Leeds funding her 250 Great Minds UniversityAcademic Fellowship The authors thank Wensi Hao(University of Groningen) for technical support AndreaCabrera (University of Groningen) for helpful conversationsabout mutation rates and demography and Richard E Green(UC Santa Cruz) for consultations regarding genome se-quencing assembly and annotation The authors acknowl-edge Research Computing at Arizona State University (httpwwwresearchcomputingasuedu) the Advanced ResearchComputing resources at the University of Leeds and theMonsoon computing cluster at Northern ArizonaUniversity (httpsnaueduhigh-performance-computing)for providing high-performance computing and storageresources that have contributed to this study

ReferencesAbegglen LM Caulin AF Chan A Lee K Robinson R Campbell MS Kiso

WK Schmitt DL Waddell PJ Bhaskara S 2015 Potential mechanismsfor cancer resistance in elephants and comparative cellular responseto DNA damage in humans JAMA 314(17)1850ndash1860

Alkan C Kidd JM Marques-Bonet T Aksay G Antonacci F HormozdiariF Kitzman JO Baker C Malig M Mutlu O 2009 Personalized copynumber and segmental duplication maps using next-generation se-quencing Nat Genet 41(10)1061ndash1067

Tollis et al doi101093molbevmsz099 MBE

1760

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Anisimova M Bielawski JP Yang Z 2001 Accuracy and power of thelikelihood ratio test in detecting adaptive molecular evolution MolBiol Evol 18(8)1585ndash1592

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19(6)950ndash958

Anisimova M Yang Z 2007 Multiple hypothesis testing to detect line-ages under positive selection that affects only a few sites Mol BiolEvol 24(5)1219ndash1228

Arnason U Lammers F Kumar V Nilsson MA Janke A 2018 Whole-genome sequencing of the blue whale and other rorquals findssignatures for introgressive gene flow Sci Adv 4(4)eaap9873

Arnason U Widegren B 1989 Composition and chromosomal localiza-tion of cetacean highly repetitive DNA with special reference to theblue whale Balaenoptera musculus Chromosoma 98(5)323ndash329

Arnold C Nunn CL 2010 Phylogenetic targeting of research effort inevolutionary biology Am Nat 176(5)601ndash612

Baker CS Perry A Bannister JL Weinrich MT Abernethy RBCalambokidis J Lien J Lambertsen RH Ramirez JU Vasquez O1993 Abundant mitochondrial DNA variation and world-wide pop-ulation structure in humpback whales Proc Natl Acad Sci U S A90(17)8239ndash8243

Baker J Meade A Pagel M Venditti C 2015 Adaptive evolution towardlarger size in mammals Proc Natl Acad Sci U S A 112(16)5093ndash5098

Benjamini Y Hochberg Y 1995 Controlling the false discovery rate apractical and powerful approach to multiple testing J R Stat Soc57289ndash300

Benson RBJ Campione NE Carrano MT Mannion PD Sullivan CUpchurch P Evans DC 2014 Rates of dinosaur body mass evolutionindicate 170 million years of sustained ecological innovation on theavian stem lineage PLoS Biol 12(5)e1001853

Benton MJ Donoghue PCJ Asher RJ Friedman M Near TJ Vinther J2015 Constraints on the timescale of animal evolutionary historyPalaeontol Electron 181ndash106

Berta A Sumich JL Kovacs KM 2015 Marine mammals evolutionarybiology 3rd ed Cambridge (MA) Academic Press

Blanchette M Kent WJ Riemer C Elnitski L Smit AFA Roskin KMBaertsch R Rosenbloom K Clawson H Green ED 2004 Aligningmultiple genomic sequences with the threaded blockset alignerGenome Res 14(4)708ndash715

Blankenberg D Taylor J Nekrutenko A Galaxy T 2011 Making wholegenome multiple alignments usable for biologists Bioinformatics27(17)2426ndash2428

Boissinot S Sookdeo A 2016 The evolution of LINE-1 in vertebratesGenome Biol Evol 8(12)3485ndash3507

Bolger AM Lohse M Usadel B 2014 Trimmomatic a flexible trimmer forIllumina sequence data Bioinformatics 30(15)2114ndash2120

Browning CL Wise CF Wise JP 2017 Prolonged particulate chromateexposure does not inhibit homologous recombination repair inNorth Atlantic right whale (Eubalaena glacialis) lung cells ToxicolAppl Pharmacol 33118ndash23

Carbone L Harris RA Gnerre S Veeramah KR Lorente-Galdos BHuddleston J Meyer TJ Herrero J Roos C Aken B et al 2014Gibbon genome and the fast karyotype evolution of small apesNature 513(7517)195ndash201

Caulin AF Graham TA Wang L-S Maley CC 2015 Solutions to Petorsquosparadox revealed by mathematical modelling and cross-species can-cer gene analysis Philos Trans R Soc B 370(1673)20140222

Caulin AF Maley CC 2011 Petorsquos Paradox evolutionrsquos prescription forcancer prevention Trends Ecol Evol (Amst) 26(4)175ndash182

Chapman JA Ho I Sunkara S Luo S Schroth GP Rokhsar DS 2011Meraculous de novo genome assembly with short paired-end readsPLoS One 6(8)e23501

Chittleborough RG 1959 Determination of age in the humpback whaleMegaptera nodosa (Bonnaterre) Aust J Mar Freshw Res10(2)125ndash143

Clapham PJ Mead JG 1999 Megaptera novaeangliae Mamm Species6041ndash9

Clement V Dunand-Sauthier I Clarkson SG 2006 Suppression of UV-induced apoptosis by the human DNA repair protein XPG CellDeath Differ 13(3)478ndash488

Dobin A Davis CA Schlesinger F Drenkow J Zaleski C Jha S Batut PChaisson M Gingeras TR 2013 STAR ultrafast universal RNA-Seqaligner Bioinformatics 29(1)15ndash21

Fay JC Wu C-I 2003 Sequence divergence functional constraint andselection in protein evolution Annu Rev Genomics Hum Genet4213ndash235

Ferris E Abegglen LM Schiffman JD Gregg C 2018 Accelerated evolu-tion in distinctive species reveals candidate elements for clinicallyrelevant traits including mutation and cancer resistance Cell Rep22(10)2742ndash2755

Fog CK Galli GG Lund AH 2012 PRDM proteins important players indifferentiation and disease Bioessays 34(1)50ndash60

Foley NM Hughes GM Huang Z Clarke M Jebb D Whelan CV Petit EJTouzalin F Farcy O Jones G et al 2018 Growing old yet stayingyoung the role of telomeres in batsrsquo exceptional longevity Sci Adv4(2)eaao0926

Foote AD Liu Y Thomas GWC Vinar T Alfoldi J Deng J Dugan Svan Elk CE Hunter ME Joshi V et al 2015 Convergent evolu-tion of the genomes of marine mammals Nat Genet47(3)272ndash275

Forbes SA Beare D Gunasekaran P Leung K Bindal N Boutselakis HDing M Bamford S Cole C Ward S et al 2015 COSMIC exploringthe worldrsquos knowledge of somatic mutations in human cancerNucleic Acids Res 43(D1)D805ndashD811

Fritz SA Bininda-Emonds ORP Purvis A 2009 Geographical variation inpredictors of mammalian extinction risk big is bad but only in thetropics Ecol Lett 12(6)538ndash549

Futreal PA Coin L Marshall M Down T Hubbard T Wooster R RahmanN Stratton MR 2004 A census of human cancer genes Nat RevCancer 4(3)177ndash183

Gabriele C Lockyer C Straley JM Jurasz CM Kato H 2010 Sightinghistory of a naturally marked humpback whale (Megaptera novaean-gliae) suggests ear plug growth layer groups are deposited annuallyMar Mam Sci 26(2)443ndash450

Gauthier JM Metcalfe CD Sears R 1997 Chlorinated organic contam-inants in blubber biopsies from northwestern Atlantic balaenopteridwhales summering in the Gulf of St Lawrence Mar Environ Res44(2)201ndash223

Gene Ontology Consortium 2015 Gene Ontology Consortium goingforward Nucleic Acids Res 43D1049ndashD1056

George JC Bada J Zeh J Scott L Brown SE OrsquoHara T Suydam R 1999Age and growth estimates of bowhead whales (Balaena mysticetus)via aspartic acid racemization Can J Zool 77(4)571ndash580

Gharib WH Robinson-Rechavi M 2013 The branch-site test of positiveselection is surprisingly robust but lacks power under synonymoussubstitution saturation and variation in GC Mol Biol Evol30(7)1675ndash1686

Grabherr MG Haas BJ Yassour M Levin JZ Thompson DA Amit IAdiconis X Fan L Raychowdhury R Zeng Q et al 2011 Full-lengthtranscriptome assembly from RNA-Seq data without a referencegenome Nat Biotechnol 29(7)644ndash652

Guindon S Dufayard J-F Lefort V Anisimova M Hordijk W Gascuel O2010 New algorithms and methods to estimate maximum-likelihood phylogenies assessing the performance of PhyML 30Syst Biol 59(3)307ndash321

Harris RS 2007 Improved pairwise alignment of genomic DNA [PhDthesis] The Pennsylvania State University State College PA USA

Hedges SB Marin J Suleski M Paymer M Kumar S 2015 Tree of lifereveals clock-like speciation and diversification Mol Biol Evol32(4)835ndash845

Heim NA Knope ML Schaal EK Wang SC Payne JL 2015 Copersquos rule inthe evolution of marine animals Science 347(6224)867ndash870

Holt C Yandell M 2011 MAKER2 an annotation pipeline and genome-database management tool for second-generation genome projectsBMC Bioinformatics 12491

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1761

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Huang DW Sherman BT Lempicki RA 2009a Bioinformatics enrich-ment tools paths toward the comprehensive functional analysis oflarge gene lists Nucleic Acids Res 37(1)1ndash13

Huang DW Sherman BT Lempicki RA 2009b Systematic and integra-tive analysis of large gene lists using DAVID bioinformatics resourcesNat Protoc 4(1)44ndash57

Hubisz MJ Pollard KS Siepel A 2011 PHAST and RPHAST phylogeneticanalysis with spacetime models Brief Bioinform 12(1)41ndash51

Hudson RR 1990 Gene genealogies and the coalescent process InFutuyama DJ Antonovics J editors Oxfords surveys in evolutionarybiology Oxford Oxford University Press p 1ndash44

Ichishima H 2016 The ethmoid and presphenoid of cetaceans JMorphol 277(12)1661ndash1674

Jackson JA Baker CS Vant M Steel DJ Medrano-Gonzalez LPalumbi SR 2009 Big and slow phylogenetic estimates of mo-lecular evolution in baleen whales (suborder Mysticeti) MolBiol Evol 26(11)2427ndash2440

Jackson JA Steel DJ Beerli P Congdon BC Olavarria C Leslie MS PomillaC Rosenbaum H Baker CS 2014 Global diversity and oceanic di-vergence of humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 281(1786)20133222

Jones KE Bielby J Cardillo M Fritz SA OrsquoDell J Orme CDL Safi KSechrest W Boakes EH Carbone C et al 2009 PanTHERIA a spe-cies-level database of life history ecology and geography of extantand recently extinct mammals Ecology 90(9)2648ndash2648

Jurka J Kapitonov VV Pavlicek A Klonowski P Kohany O WalichiewiczJ 2005 Repbase Update a database of eukaryotic repetitive ele-ments Cytogenet Genome Res 110(1ndash4)462ndash467

Kaessmann H 2010 Origins evolution and phenotypic impact of newgenes Genome Res 20(10)1313ndash1326

Kastan MB Bartek J 2004 Cell-cycle checkpoints and cancer Nature432(7015)316ndash323

Katona SK Whitehead HP 1981 Identifying humpback whales usingtheir natural markings Polar Rec 20(128)439ndash444

Keane M Semeiks J Webb AE Li YI Quesada V Craig T Madsen LB vanDam S Brawand D Marques PI et al 2015 Insights into the evolu-tion of longevity from the bowhead whale genome Cell Rep10(1)112ndash122

Kent WJ Baertsch R Hinrichs A Miller W Haussler D 2003 Evolutionrsquoscauldron duplication deletion and rearrangement in the mouseand human genomes Proc Natl Acad Sci U S A100(20)11484ndash11489

Kingsolver JG Pfennig DW 2004 Individual-level selection as acause of Copersquos rule of phyletic size increase Evolution58(7)1608ndash1612

Kinsella RJ Keuroaheuroari A Haider S Zamora J Proctor G Spudich G Almeida-King J Staines D Derwent P Kerhornou A et al 2011 EnsemblBioMarts a hub for data retrieval across taxonomic spaceDatabase (Oxford) 2011bar030

Korf I 2004 Gene finding in novel genomes BMC Bioinformatics 559Kumar S Subramanian S 2002 Mutation rates in mammalian genomes

Proc Natl Acad Sci U S A 99(2)803ndash808Lambertsen RH 1987 A biopsy system for large whales and its use for

cytogenetics J Mammal 68(2)443ndash445Langmead B Salzberg SL 2012 Fast gapped-read alignment with Bowtie

2 Nat Methods 9(4)357ndash359Larsen AH Sigurjonsson J Oslashien N Vikingsson G Palsboslashll PJ 1996

Populations genetic analysis of nuclear and mitochondrial loci inskin biopsies collected from central and northeastern NorthAtlantic humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 2631611ndash1618

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357)493ndash496

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup 2009 The Sequence AlignmentMap format andSAMtools Bioinformatics 25(16)2078ndash2079

Martineau D Lemberger K Dallaire A Labelle P Lipscomb TP Michel PMikaelian I 2002 Cancer in wildlife a case study beluga from the St

Lawrence estuary Quebec Canada Environ Health Perspect110(3)285ndash292

Mazet O Rodrıguez W Grusea S Boitard S Chikhi L 2016 On theimportance of being structured instantaneous coalescence ratesand human evolution ndash lesson for ancestral population size infer-ence Heredity 116(4)362ndash371

Michaud EJ Yoder BK 2006 The primary cilium in cell signaling andcancer Cancer Res 66(13)6463ndash6467

Milholland B Dong X Zhang L Hao X Suh Y Vijg J 2017 Differencesbetween germline and somatic mutation rates in humans and miceNat Commun 815183

Mitchell ED 1989 A new cetacean from the Late Eocene La MesetaFormation Seymour Island Antarctic Peninsula Can J Fish Aquat Sci46(12)2219ndash2235

Morgan CC Foster PG Webb AE Pisani D McInerney JO OrsquoConnell MJ2013 Heterogeneous models place the root of the placental mam-mal phylogeny Mol Biol Evol 30(9)2145ndash2156

Nadachowska-Brzyska K Burri R Smeds L Ellegren H 2016 PSMC anal-ysis of effective population sizes in molecular ecology and its appli-cation to black-and-white Ficedula flycatchers Mol Ecol25(5)1058ndash1072

Newman SJ Smith SA 2006 Marine mammal neoplasia a review VetPathol 43(6)865ndash880

Nunney L 2018 Size matters height cell number and a personrsquos risk ofcancer Proc R Soc B 285(1889)20181743

Ohsumi S 1979 Interspecies relationships among some biologicalparameters in cetaceans and estimation of the natural mortalitycoefficient of the Southern Hemisphere minke whales Rep IntWhaling Comm 29397ndash406

OrsquoLeary MA Gatesy J 2008 Impact of increased character sampling onthe phylogeny of Cetartiodactyla (Mammalia) combined analysisincluding fossils Cladistics 24397ndash442

Pagel M Venditti C Meade A 2006 Large punctuational contribution ofspeciation to evolutionary divergence at the molecular level Science314(5796)119ndash121

Palsboslashll PJ Clapham PJ Mattila DK Larsen F Sears R Siegismund HRSigurjonsson J Vasquez O Arctander P 1995 Distribution ofmtDNA haplotypes in North Atlantic humpback whales the influ-ence of behavior on population structure Mar Ecol Prog Ser1161ndash10

Palsboslashll PJ Larsen F Hansen ES 1991 Sampling of skin biopsies fromfree-raging large cetaceans in West Greenland development of newbiopsy tips and bolt designs Int Whaling Comm Spec Issue Ser13311

Palsboslashll PJ Zachariah Peery M Olsen MT Beissinger SR Berube M 2013Inferring recent historic abundance from current genetic diversityMol Ecol 22(1)22ndash40

Parra G Bradnam K Ning Z Keane T Korf I 2009 Assessing the genespace in draft genomes Nucleic Acids Res 37(1)289ndash297

Pertea M Pertea GM Antonescu CM Chang T-C Mendell JT SalzbergSL 2015 StringTie enables improved reconstruction of a transcrip-tome from RNA-Seq reads Nat Biotechnol 33(3)290ndash295

Peto R Roe FJ Lee PN Levy L Clack J 1975 Cancer and ageing in miceand men Br J Cancer 32(4)411ndash426

Pio R Corrales L Lambris JD 2014 The role of complement in tumorgrowth Adv Exp Med Biol 772229ndash262

Pollard KS Hubisz MJ Rosenbloom KR Siepel A 2010 Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 20(1)110ndash121

Putnam NH OrsquoConnell BL Stites JC Rice BJ Blanchette M Calef R TrollCJ Fields A Hartley PD Sugnet CW et al 2016 Chromosome-scaleshotgun assembly using an in vitro method for long-range linkageGenome Res 26(3)342ndash350

Quinlan AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26(6)841ndash842

Ruegg K Rosenbaum HC Anderson EC Engel M Rothschild A Baker CSPalumbi SR 2013 Long-term population size of the North Atlantichumpback whale within the context of worldwide population struc-ture Conserv Genet 14(1)103ndash114

Tollis et al doi101093molbevmsz099 MBE

1762

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Sanderson MJ 2002 Estimating absolute rates of molecular evolutionand divergence times a penalized likelihood approach Mol Biol Evol19(1)101ndash109

Sanderson MJ 2003 r8s inferring absolute rates of molecular evolutionand divergence times in the absence of a molecular clockBioinformatics 19(2)301ndash302

Sayyari E Mirarab S 2016 Fast coalescent-based computation of localbranch support from quartet frequencies Mol Biol Evol33(7)1654ndash1668

Shadat NMA Koide N Khuda II-E Dagvadorj J Tumurkhuu G Naiki YKomatsu T Yoshida T Yokochi T 2010 Retinoblastoma protein-interacting zinc finger 1 (RIZ1) regulates the proliferation of mono-cytic leukemia cells via activation of p53 Cancer Invest28(8)806ndash812

Sim~ao FA Waterhouse RM Ioannidis P Kriventseva EV Zdobnov EM2015 BUSCO assessing genome assembly and annotation com-pleteness with single-copy orthologs Bioinformatics31btv351ndashbtv3212

Slater GJ Goldbogen JA Pyenson ND 2017 Independent evolution ofbaleen whale gigantism linked to Plio-Pleistocene ocean dynamicsProc R Soc B 284(1855)20170546

Smit AFA Hubley RM Green P 2015a RepeatMasker Open-40 2013-2015 Available from httpwwwrepeatmaskerorg

Smit AFA Hubley RM Green P 2015b RepeatModeler Open-10 2008-2015 Available from httpwwwrepeatmaskerorg

Smith TD Allen J Clapham PJ Hammond PS Katona S Larsen F Lien JMattila D Palsboslashll PJ Sigurjonsson J et al 1999 An ocean-basin-widemark-recapture study of the North Atlantic humpback whale(Megaptera novaeangliae) Mar Mamm Sci 51ndash32

Stamatakis A 2014 RAxML version 8 a tool for phylogenetic analysisand post-analysis of large phylogenies Bioinformatics30(9)1312ndash1313

Stanke M Diekhans M Baertsch R Haussler D 2008 Using native andsyntenically mapped cDNA alignments to improve de novo genefinding Bioinformatics 24(5)637ndash644

Steeman ME Hebsgaard MB Fordyce RE Ho SYW Rabosky DL NielsenR Rahbek C Glenner H Soslashrensen MV Willerslev E 2009 Radiationof extant cetaceans driven by restructuring of the oceans Syst Biol58(6)573ndash585

Sulak M Fong L Mika K Chigurupati S Yon L Mongan NP Emes RDLynch VJ 2016 TP53 copy number expansion is associated with theevolution of increased body size and an enhanced DNA damageresponse in elephants Elife 51850

Supek F Bosnjak M Skunca N Smuc T 2011 REVIGO summarizes andvisualizes long lists of gene ontology terms PLoS One 6(7)e21800

Szklarczyk D Morris JH Cook H Kuhn M Wyder S Simonovic M SantosA Doncheva NT Roth A Bork P et al 2017 The STRING database in2017 quality-controlled protein-protein association networks madebroadly accessible Nucleic Acids Res 45(D1)D362ndashD368

Tacutu R Craig T Budovsky A Wuttke D Lehmann G Taranukha DCosta J Fraifeld VE de Magalh~aes JP 2012 Human Ageing GenomicResources integrated databases and tools for the biology and genet-ics of ageing Nucleic Acids Res 41(D1)D1027ndashD1033

Tariq MA Kim HJ Jejelowo O Pourmand N 2011 Whole-transcriptomeRNAseq analysis from minute amount of total RNA Nucleic AcidsRes 39(18)e120

Tarver JE Dos Reis M Mirarab S Moran RJ Parker S OrsquoReilly JE King BLOrsquoConnell MJ Asher RJ Warnow T et al 2016 The interrelationshipsof placental mammals and the limits of phylogenetic inferenceGenome Biol Evol 8(2)330ndash344

Thewissen JGM Cohn MJ Stevens LS Bajpai S Heyning J Horton WE2006 Developmental basis for hind-limb loss in dolphins and origin

of the cetacean bodyplan Proc Natl Acad Sci U S A103(22)8414ndash8418

Thewissen JGM Cooper LN Clementz MT Bajpai S Tiwari BN 2007Whales originated from aquatic artiodactyls in the Eocene epoch ofIndia Nature 450(7173)1190ndash1194

Tollis M Schiffman JD Boddy AM 2017 Evolution of cancer suppressionas revealed by mammalian comparative genomics Curr Opin GenetDev 4240ndash47

Trego KS Groesser T Davalos AR Parplys AC Zhao W Nelson MRHlaing A Shih B Rydberg B Pluth JM et al 2016 Non-catalytic rolesfor XPG with BRCA1 and BRCA2 in homologous recombination andgenome stability Mol Cell 61(4)535ndash546

UniProt Consortium 2015 UniProt a hub for protein informationNucleic Acids Res 43D204ndashD212

Valsecchi E Palsboslashll PJ Hale P Glockner-Ferrari D Ferrari M Clapham PLarsen F Mattila DK Sears R Sigurjonsson J et al 1997 Microsatellitegenetic distances between oceanic opulations of the humpbackwhale (Megaptera novaeangliae) Mol Biol Evol 14355ndash362

Venkat A Hahn MW Thornton JW 2018 Multinucleotide mutationscause false inferences of lineage-specific positive selection Nat EcolEvol 2(8)1280ndash1288

Vermeij GJ 2016 Gigantism and its implications for the history of lifePLoS One 11(1)e0146092

Wang D Zhang Y Zhang Z Zhu J Yu J 2010 KaKs_Calculator 20 atoolkit incorporating gamma-series methods and sliding windowstrategies Genomics Proteomics Bioinformatics 8(1)77ndash80

Warren WC Kuderna L Alexander A Catchen J Perez-Silva JG Lopez-Otın C Quesada V Minx P Tomlinson C Montague MJ et al 2017The novel evolution of the sperm whale genome Genome Biol Evol9(12)3260ndash3264

Webb AE Walsh TA OrsquoConnell MJ 2017 VESPA very large-scale evo-lutionary and selective pressure analyses PeerJ Comput Sci 3e118

Weng M-P Liao B-Y 2017 modPhEA model organism PhenotypeEnrichment Analysis of eukaryotic gene sets Bioinformatics33(21)3505ndash3507

Wilson Sayres MA Venditti C Pagel M Makova KD 2011 Do variationsin substitution rates and male mutation bias correlate with life-history traits A study of 32 mammalian genomes Evolution65(10)2800ndash2815

Yang Z 1998 Likelihood ratio tests for detecting positive selection andapplication to primate lysozyme evolution Mol Biol Evol15(5)568ndash573

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 24(8)1586ndash1591

Yang Z Rannala B 2006 Bayesian estimation of species divergence timesunder a molecular clock using multiple fossil calibrations with softbounds Mol Biol Evol 23(1)212ndash226

Yim H-S Cho YS Guang X Kang SG Jeong J-Y Cha S-S Oh H-M Lee J-HYang EC Kwon KK et al 2014 Minke whale genome and aquaticadaptation in cetaceans Nat Genet 46(1)88ndash92

Yu J Tian S Metheny-Barlow L Chew L-J Hayes AJ Pan H Yu G-L LiL-Y 2001 Modulation of endothelial cell growth arrest andapoptosis by vascular endothelial growth inhibitor Circ Res89(12)1161ndash1167

Zhang C Rabiee M Sayyari E Mirarab S 2018 ASTRAL-III polynomialtime species tree reconstruction from partially resolved gene treesBMC Bioinformatics 19(Suppl 6)153

Zhang Q Edwards SV 2012 The evolution of intron size in amniotes arole for powered flight Genome Biol Evol 4(10)1033ndash1043

Zhao M Kim P Mitra R Zhao J Zhao Z 2015 TSGene 20 an updatedliterature-based knowledgebase for tumor suppressor genes NucleicAcids Res 44gkv1268ndashgkD1031

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1763

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

  • msz099-TF1
  • msz099-TF2
  • msz099-TF3
  • msz099-TF4
  • msz099-TF5
  • msz099-TF6
  • msz099-TF7
  • msz099-TF8
Page 11: University of Groningen Return to the sea, get huge, …...Return to the Sea, Get Huge, Beat Cancer: An Analysis of Cetacean Genomes Including an Assembly for the Humpback Whale (Megaptera

femur (MP0000559 six genes FDRfrac14 0019) These pheno-types are reminiscent of the developmental basis of hind limbloss in cetaceans embryonic studies show that hind limbbuds are initially formed but disappear by the fifth gestationalweek (Thewissen et al 2006) Enriched mouse phenotypes arealso related to the unique cetacean axial skeleton(MP0002114 25 genes FDRfrac14 0016) most notably in theskull including craniofacial bones (MP0002116 17 genesFDRfrac14 0018) teeth (MP0002100 nine genes FDRfrac14 0004)and the presphenoid (MP0030383 three genesFDRfrac14 0003) Past analyses of the cetacean basicranial ele-ments revealed that the presphenoid was extensively modi-fied along the cetacean lineage (Ichishima 2016)

Positively selected genes unique to the humpback whalewere significantly enriched for a single biological process reg-ulation of cell cycle checkpoint (GO1901976 1857-fold en-richment Pfrac14 002 after Bonferonni correction for multipletesting) suggesting positive selection in pathways that con-trol responses to endogenous or exogenous sources of DNAdamage and limit cancer progression (Kastan and Bartek2004) We detected a significant number of proteinndashproteininteractions among humpback whale-specific positively se-lected genes (number of nodesfrac14 204 number of edgesfrac14 71expected number of edgesfrac14 51 Pfrac14 0004 supplementaryfig 10 Supplementary Material online) including genes thatare often coexpressed and involved in DNA repair DNA rep-lication and cell differentiation For instance we identifiedsignificant interactions between DNA2 which encodes a heli-case involved in the maintenance of DNA stability andWDHD1 which acts as a replication initiation factorAnother robust protein interaction network was detected

between a number of genes involved in the genesis and main-tenance of primary cilia The highest scoring functional an-notation clusters resulted in key words such as ciliopathy(seven genes) and cell projection (16 genes) and GO termssuch as cilium morphogenesis cilium assembly ciliary basalbody and centriole The primary cilia of multicellular eukar-yotes control cell proliferation by mediating cell-extrinsic sig-nals and regulating cell cycle entry and defects in ciliaryregulation are common in many cancers (Michaud andYoder 2006)

Our branch-site test results indicated that the evolution ofcetacean gigantism was accompanied by strong selection onmany pathways that are directly linked to cancer (fig 4C) Weidentified 33 genes that are mutated in human cancers(according to the COSMIC database) that were inferred assubjected to positive selection in the humpback whale line-age including the known tumor suppressor genes ATR whichis a protein kinase that senses DNA damage upon genotixicstress and activates cell cycle arrest and RECK which sup-presses metastasis (Forbes et al 2015) Multiple members ofthe PR domain-containing gene family (PRDM) evolved un-der positive selection across cetaceans including the tumorsuppressor genes PRDM1 whose truncation leads to B-cellmalignancies and PRDM2 which regulates the expressionand degradation of TP53 (Shadat et al 2010) and whoseforced expression causes apoptosis and cell cycle arrest incancer cell lines (Fog et al 2012) In baleen whales ERCC5which is a DNA repair protein that partners with BRCA1 andBRCA2 to maintain genomic stability (Trego et al 2016) andsuppresses UV-induced apoptosis (Clement et al 2006)appeared to have been subjected to positive selection as

Table 5 Continued

Comparison Gene Symbol Gene Name Role in Cancer Function

TAL1 T-cell acute lymphocyticleukemia 1 (SCL)

Oncogene fusion Implicated in the genesis of hemopoietic malignan-cies and may play an important role in hemopoieticdifferentiationa

TNFRSF14 Tumor necrosis factor re-ceptor superfamilymember 14 (herpesvirusentry mediator)

TSG The encoded protein functions in signal transductionpathways that activate inflammatory and inhibi-tory T-cell immune responsea

TNFRSF17 Tumor necrosis factor re-ceptor superfamilymember 17

Oncogene fusion This receptor also binds to various TRAF familymembers and thus may transduce signals for cellsurvival and proliferationa

NOTEmdashTSG tumor suppressor geneaSource RefSeqbSource Cancer Hallmark from CGC

Table 6 GO Terms for Biological Processes Overrepresented by Genes with Pairwise dNdS gt 1 in the Orca Bottlenose Dolphin Comparison

GO Term Description Number of Genes Fold Enrichment P-Valuea

GO0031347 Regulation of defense response 36 270 291E-2GO0050776 Regulation of immune response 50 218 195E-3GO0002694 Regulation of leukocyte activation 31 249 189E-2GO0002275 Myeloid cell activation involved in immune response 29 253 237E-2GO0002699 Positive regulation of immune effector process 19 446 101-E3GO0042108 Positive regulation of cytokine biosynthetic process 9 687 223E-2

aAfter Bonferonni correction for multiple testing

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1755

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

well Across all the branch-site models positively selectedgenes represented multiple functional categories relevant tocancer and Petorsquos Paradox

Among the cancer-related genes subjected to positiveselection in cetaceans we identified two with identicalamino acid changes among disparate taxa united by thetraits of large body size andor extreme longevitySpecifically PRDM13 is a tumor suppressor gene that actsas a transcriptional repressor and we found identical DEamino acid substitutions in this gene in sperm whale dol-phin orca and humpback but also manatee (Trichechusmanatus) and African elephant (Loxodonta africana) whichare large-bodied afrotherian mammals that have been thefocus of cancer suppression research (Abegglen et al 2015Sulak et al 2016) Secondly POLE is a cancer-related genethat participates in DNA repair and replication and weobserved one IV substitution shared among orca dol-phin bowhead humpback and common minke whalebut also elephant as well as a second IV substitutionshared with these cetaceans and the little brown bat(Myotis lucifugus) Vesper bats such as M lucifugus areknown for their exceptional longevity relative to theirbody size and have been proposed as model organisms insenescence and cancer research (Foley et al 2018) Parallelchanges in cancer-related genes across these

phylogenetically distinct mammals suggest natural selectionhas acted on similar pathways that limit neoplastic progres-sion in large and long-lived species (Tollis et al 2017)

Petorsquos Paradox and Cancer in Whales and Other LargeMammalsLarge body size has evolved numerous times in mammalsand although it is exemplified in some extant cetaceans gi-gantism is also found in afrotherians perissodactyls and car-nivores (Baker et al 2015) Our results suggest that cancersuppression in large and long-lived mammals has also evolvednumerous times However none of these species iscompletely immune to cancer Elephants have at least a 5lifetime risk of cancer mortality (Abegglen et al 2015) whichis far less than humans but detecting cancer and estimatingcancer incidence and mortality rates in wild cetaceans is morechallenging Mathematical modeling predicting the lifetimerisk of colorectal cancer in mice and humans yielded a rate ofcolorectal cancer at 50 in blue whales by age 50 and 100by age 90 (Caulin et al 2015) This high rate of cancer mor-tality is an unlikely scenario and taken with our genomicresults presented here it suggests that cetaceans have evolvedmechanisms to limit their overall risk of cancer Among ba-leen whales benign neoplasms of the skin tongue and centralnervous system have been reported in humpback whales and

Orca

Minke whale

Cow

Bowhead whale

Bottlenose dolphin

Humpback whale

Sperm whale54

12

84

74

212A

ATRT Suppression of growth escaping programmedcell death change of cellular energetics

Humpback whale

BCORL1TF Invasion and metastasis

PICALMT Change of cellular energetics

PRDM2T Regulation of TP53 expression and degredation

TPRF Involved in the activation of oncogenic kinases

AMER1T Promotes suppression of growth

CCNB1IP1T Suppresses invasion and metastasis

POLET Participates in DNA repair and DNA replication

Cetacea ancestor

KMT2CT Methylates rsquoLys-4rsquo of histone H3

POU2AF1 Promotes proliferative signalingO

PRDM1T Regulation of TP53 expression anddegradation

ERCC5T Excision repair DNA double-strand breakrepair following UV damage

Mysticeti ancestor

PMLT Positively regulates TP53 by inhibiting MDM2-dependent degredation

Odontoceti ancestor

MUC1F Overexpressed in epithelial tumors

Minke and humpback whale ancestor

RETO Plays a role in cell differentiation and growth

C COSMIC GENES

anatomicalstructure

formation involvedin morphogenesis

autophagy

cellproliferation

cellminuscellsignaling

circulatorysystemprocess

growth

pigmentation

proteinmaturation

ribonucleoproteincomplexassembly

symbiosisencompassing

mutualismthrough

parasitismvacuolartransport

B

20

FIG 4 Positively selected genes during cetacean evolution (A) Species tree relationships of six modern cetaceans with complete genomeassemblies estimated from 152 single-copy orthologs Branch lengths are given in coalescent units Outgroup taxa are not shown The completespecies tree of 28 mammals is shown in supplementary figure 4 Supplementary Material online Boxes with numbers indicate the number ofpositively selected genes passing filters and a Bonferroni correction detected on each branch (B) TreeMap from REVIGO for GO biologicalprocesses terms represented by genes evolving under positive selection across all cetaceans Rectangle size reflects semantic uniqueness of GOterm which measures the degree to which the term is an outlier when compared semantically to the whole list of GO terms (C) Cancer gene namesand functions from COSMIC found to be evolving under positive selection in the cetacean branch-site models Superscripts for gene namesindicate as follows T tumor suppressor gene O oncogene F fusion gene Asterisks indicate P-value following FDR correction for multiple testingPlt 001 Plt 0001 Plt 00001

Tollis et al doi101093molbevmsz099 MBE

1756

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

ovarian carcinomas and lymphomas have been detected infin whales (Newman and Smith 2006) Among smaller ceta-ceans one unusually well-documented case study concludedthat 27 of beluga whales (Delphinapterus leucas) found deadin the St Lawrence estuary had cancer which may have con-tributed to 18 of the total mortality in that population(Martineau et al 2002) The authors suggested that thehigh degree of polycyclic aromatic hydrocarbons releasedinto the estuary by nearby industry may have contributedto this elevated cancer risk (Martineau et al 2002) Bycontrast the larger baleen whale species in the Gulf ofSt Lawrence appear to have lower contaminant burdenslikely due to ecological differences (Gauthier et al 1997)Interestingly unlike in human cells homologous recom-bination is uninhibited in North Atlantic right whale lungcells following prolonged exposure to the human lungcarcinogen particulate hexavalent chromium (Browninget al 2017) suggesting adaptations for high-fidelity DNArepair in whales

In this study we provide a de novo reference assemblyfor the humpback whalemdashone of the more well-studiedgiants living on Earth today The humpback whale genomeassembly is highly contiguous and contains a comparablenumber of orthologous genes to other mammalian ge-nome projects Our comparisons with other complete ce-tacean genomes confirm the results of other studies whichconcluded that rorqual whales likely began diversifyingduring the Miocene (Slater et al 2017 Arnason et al2018) We found indications of positive selection onmany protein-coding genes suggestive of adaptive changein pathways controlling the mammalian appendicular andcranial skeletal elements which are relevant to highly spe-cialized cetacean phenotypes as well as in many immunitygenes and pathways that are known to place checks onneoplastic progression LSDs in cetacean genomes containmany genes involved in the control of apoptosis includingknown tumor suppressor genes and skin transcriptomeresults from humpback whale suggest many gene duplica-tions whether through segmental duplication or retro-transposition are transcribed and hence likely functionalWe also use genome-wide evidence to show that germlinemutation rates may be slower in cetaceans than in othermammals which has been suggested in previous studies(Jackson et al 2009) and we suggest as a corollary thatcetacean somatic mutations rates may be lower as wellThese results are consistent with predictions stemmingfrom Petorsquos Paradox (Peto et al 1975 Caulin and Maley2011) which posited that gigantic animals have evolvedcompensatory adaptations to cope with the negativeeffects of orders of magnitude more cells and long lifespansthat increase the number of cell divisions and cancer riskover time Altogether the humpback whale genome as-sembly will aid comparative oncology research that seeksto improve therapeutic targets for human cancers as wellas provide a resource for developing useful genomicmarkers that will aid in the population management andconservation of whales

Materials and Methods

Tissue Collection and DNA ExtractionBiopsy tissue was collected from an adult female humpbackwhale (ldquoSaltrdquo NCBI BioSample SAMN1058501) in the Gulf ofMaine western North Atlantic Ocean using previously de-scribed techniques (Lambertsen 1987 Palsboslashll et al 1991) andflash frozen in liquid nitrogen We extracted DNA from skinusing the protocol for high-molecular-weight genomic DNAisolation with the DNeasy Blood and Tissue purification kit(Qiagen) Humpback whales can be individually identifiedand studied over time based on their unique ventral flukepigmentation (Katona and Whitehead 1981) Salt was specif-ically selected for this study because of her 35-year priorsighting history which is among the lengthiest and detailedfor an individual humpback whale (Center for Coastal Studiesunpublished data)

De Novo Assembly of the Humpback Whale GenomeUsing a combination of paired-end and mate-pair libraries denovo assembly was performed using Meraculous 204(Chapman et al 2011) with a kmer size of 47 Reads weretrimmed for quality sequencing adapters and mate-pairadapters using Trimmomatic (Bolger et al 2014) The genomesize of the humpback whale was estimated using the shortreads by counting the frequency of kmers of length 27 oc-curring in the 180-bp data set estimating the kmer coverageand using the following formula genome size frac14 total kmers kmer coverage

Chicago Library Preparation and SequencingFour Chicago libraries were prepared as described previously(Putnam et al 2016) Briefly for each library500 ng of high-molecular-weight genomic DNA (mean fragment lengthgt50kb) was reconstituted into chromatin in vitro and fixed withformaldehyde Fixed chromatin was digested with MboI orDpnII the 50-overhangs were repaired with biotinylatednucleotides and blunt ends were ligated After ligation cross-links were reversed and the DNA purified from protein Biotinthat was not internal to ligated fragments was removed fromthe purified DNA The DNA was then sheared to 350 bpmean fragment size and sequencing libraries were generatedusing NEBNext Ultra (New England BioLabs) enzymes andIllumina-compatible adapters Biotin-containing fragmentswere isolated using streptavidin beads before PCR enrichmentof each library The libraries were sequenced on an IlluminaHiSeq 2500 platform

Scaffolding the De Novo Assembly with HiRiseThe input de novo assembly shotgun reads and Chicagolibrary reads were used as input data for HiRise a softwarepipeline designed specifically for using Chicago data to scaf-fold genome assemblies (Putnam et al 2016) Shotgun andChicago library sequences were aligned to the draft inputassembly using a modified SNAP read mapper (httpsnapcsberkeleyedu) The separations of Chicago read pairsmapped within draft scaffolds were analyzed by HiRise toproduce a likelihood model for genomic distance between

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1757

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

read pairs and the model was used to identify putative mis-joins and to score prospective joins After scaffolding shotgunsequences were used to close gaps between contigs

Assessing the Gene Content of the Humpback WhaleAssemblyThe expected gene content of the assembly was evaluatedusing the Core Eukaryotic Genes Mapping Approach (Parraet al 2009) which searches the assembly for 458 highly con-served proteins and reports the proportion of 248 of the mosthighly conserved orthologs that are present in the assemblyWe also used the Benchmarking Universal Single CopyOrthologs (BUSCO v201 Sim~ao et al 2015) which analyzesgenome assemblies for the presence of 3023 genes conservedacross vertebrates as well as a set of 6253 genes conservedacross laurasiatherian mammals

Transcriptome Sequencing and AssemblyIn order to aid in our gene-finding efforts for the humpbackwhale genome assembly and to measure gene expression wegenerated transcripts from skin tissue by extracting total RNAusing the QIAzol Lysis Reagent (Qiagen) followed by purifi-cation on RNeasy spin columns (Qiagen) RNA integrity andquantity were determined on the Agilent 2100 Bioanalyzer(Agilent) using the manufacturerrsquos protocol The total RNAwas treated with DNase using DNase mix from the RecoverAllTotal Nucleic Acid Isolation kit (Applied BiosystemsAmbion) The RNA library was prepared and sequenced bythe Genome Technology Center at the University ofCalifornia Santa Cruz including cDNA synthesis with theOvation RNA-Seq system V2 (Nugen) and RNA amplificationas described previously (Tariq et al 2011) We used 05ndash1mgof double-stranded cDNA for library preparation shearedusing the Covaris S2 size-selected for 350ndash450 bp using au-tomated electrophoretic DNA fractionation system(LabChipXT Caliper Life Sciences) Paired-end sequencing li-braries were constructed using Illumina TruSeq DNA SamplePreparation Kit Following library construction samples werequantified using the Bioanalyzer and sequenced on theIllumina HiSeq 2000 platform to produce 2 100 bp se-quencing reads We then used Trinity (Grabherr et al 2011)to assemble the adapter-trimmed RNA-Seq reads intotranscripts

Genome AnnotationWe generated gene models for the humpback whale usingmultiple iterations of MAKER2 (Holt and Yandell 2011) whichincorporated 1) direct evidence from the Trinity-assembledtranscripts 2) homology to NCBI proteins from ten mammals(human mouse dog cow sperm whale bottlenose dolphinorca bowhead whale common minke whale and baiji) andUniProtKBSwiss-Prot (UniProt Consortium 2015) and 3) abinitio gene predictions using SNAP (11292013 release Korf2004) and Augustus v302 (Stanke et al 2008) A detaileddescription of the annotation pipeline is provided in the sup-plementary Methods Supplementary Material online Finalgene calls were annotated functionally by BlastP similarity to

UniProt proteins (UniProt Consortium 2015) with an e-valuecutoff of 1e-6

Repeat Annotation and Evolutionary AnalysisTo analyze the repetitive landscape of the humpback whalegenome we used both database and de novo modelingmethods For the database method we ran RepeatMaskerv405 (httpwwwrepeatmaskerorg accessed August 212017) (Smit et al 2015a) on the final assembly indicatingthe ldquomammaliardquo repeat library from RepBase (Jurka et al2005) For the de novo method we scanned the assemblyfor repeats using RepeatModeler v108 (httpwwwrepeat-maskerorg) (Smit et al 2015b) the results of which were thenclassified using RepeatMasker To estimate evolutionary di-vergence within repeat subfamilies in the humpback whalegenome we generated repeat-family-specific alignments andcalculated the average Kimura-2-parameter divergence fromconsensus within each family correcting for high mutationrates at CpG sites with the calcDivergenceFromAlignplRepeatMasker tool We compared the divergence profile ofhumpback whale and bowhead whale by completing parallelanalyses and the repetitive landscapes of orca and bottlenosedolphin are available from the RepeatMasker server (httpwwwrepeatmaskerorgspecies accessed August 21 2017)

Analysis of Gene Expression Using RNA-SeqSplice-wise mapping of RNA-Seq reads against the humpbackwhale genome assembly and annotation was carried out us-ing STAR v24 (Dobin et al 2013) and we counted the num-ber of reads mapping to gene annotations We also mappedthe skin RNA-Seq data to the database of annotated hump-back whale transcripts using local alignments with bowtiev225 (Langmead and Salzberg 2012) and used stringtiev134 (Pertea et al 2015) to calculate gene abundances bytranscripts per million

Analysis of Segmental Duplications in CetaceanGenomesIn order to detect LSDs in several cetacean genomes we ap-plied an approach based on depth of coverage (Alkan et al2009) To this end we used whole-genome shotgun sequencedata from the current study as well as from other cetaceangenomics projects All data were mapped against the hump-back whale reference assembly A detailed description of thesegmental duplication analysis is provided in the supplemen-tary Methods Supplementary Material online

Whole-Genome AlignmentsWe generated WGAs of 12 mammals (supplementary table 2Supplementary Material online) First we generated pairwisesyntenic alignments of each species as a query to the humangenome (hg19) as a target using LASTZ v102 (Harris 2007)followed by chaining to form gapless blocks and netting torank the highest scoring chains (Kent et al 2003) The pairwisealignments were used to construct a multiple sequence align-ment with MULTIZ v112 (Blanchette et al 2004) with humanas the reference species We filtered the MULTIZ alignment to

Tollis et al doi101093molbevmsz099 MBE

1758

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

only contain aligned blocks from at least 10 out of the 12species (81 complete)

Phylogenetic Reconstruction Using Single-CopyOrthologsWe downloaded the coding DNA sequences from 28 publiclyavailable mammalian genome assemblies (supplementary ta-ble 9 Supplementary Material online) and used VESPA(Webb et al 2017) to obtain high-confidence SGOs (supple-mentary Methods Supplementary Material online) For phy-logenetic analysis we filtered the SGO data set to include onlyloci that were represented by at least 24 out of the 28 mam-malian species (86 complete) and reconstructed each genetree using maximum likelihood in PhyML v30 (Guindon et al2010) with an HKY85 substitution model and 100 bootstrapreplicates to assess branch support The gene trees were thenbinned and used to reconstruct a species tree using the ac-curate species tree algorithm (ASTRAL-III v56 Zhang et al2018) ASTRAL utilizes the multispecies coalescent modelthat incorporates incomplete lineage sorting and finds thespecies tree stemming from bipartitions predefined by thegene trees Branch support for the species tree was assessedwith local posterior probabilities and branch lengths werepresented in coalescent units where shorter branch lengthsindicate greater gene tree discordance (Sayyari and Mirarab2016)

Rates of Molecular Evolution and Divergence TimeEstimationWe used multiple approaches on independent data sets toestimate rates of molecular evolution and the divergencetimes of the major mammalian lineages including six modernwhales with complete genome assemblies We first focusedon 4-fold degenerate (4D) sites which are positions withincodon alignments where substitutions result in no aminoacid change and can be used to approximate the neutralrate of evolution (Kumar and Subramanian 2002) We usedthe Ensembl human gene annotation to extract codingregions from the 12-mammal WGA using msa_view inPHAST v14 (Hubisz et al 2011) We reconstructed the phy-logeny with the 4D data as a single partition in RAxML v83(Stamatakis 2014) under the GTRGAMMA substitutionmodel and assessed branch support with 10000 bootstrapsRates of molecular evolution were estimated on the 4D dataset with the semiparametric PL method implemented in r8sv18 (Sanderson 2002 2003) A detailed description of the PLmethod is given in the supplementary MethodsSupplementary Material online

We also used the approximate likelihood calculation inMCMCtree (Yang and Rannala 2006) to estimate divergencetimes using independent data sets 1) the above-mentioned4D data set derived from the WGA as well as 2) a set of theSGOs that included 24 out of 28 sampled taxa (86) and waspartitioned into three codon positions We implemented theHKY85 substitution model multiple fossil-based priors (sup-plementary table 10 Supplementary Material online Mitchell1989 Benton et al 2015 Hedges et al 2015) and independentrates (ldquoclockfrac14 3rdquo) along branches All other parameters were

set as defaults For each MCMCtree analysis we ran the anal-ysis three times with different starting seeds and modified theMarkov chain Monte Carlo (MCMC) length and samplingfrequency in order to achieve proper chain convergencemonitored with Tracer v17 We achieved proper MCMCconvergence on the 4D data set after discarding the first500000 steps as burn-in and sampling every 2000 steps untilwe collected 20000 samples We achieved proper MCMCconvergence on the SGO data set after discarding the first500000 steps as burn-in and sampling every 10000 steps untilwe collected 10000 samples

Demographic AnalysisWe used the PSMC (Li and Durbin 2011) to reconstruct thepopulation history of North Atlantic humpback whales in-cluding the individual sequenced in the current study (down-sampled to 20 coverage) and a second individualsequenced at 17 coverage in Arnason et al (2018) Adetailed description of the PSMC analysis is provided in thesupplementary Methods Supplementary Material online

Nonneutral Substitution Rates in Cetacean GenomesIn order to identify genomic regions controlling cetacean-specific adaptations we used phyloP (Pollard et al 2010) todetect loci in the 12-mammal WGA that depart from neutralexpectations (see supplementary Methods SupplementaryMaterial online) We then collected accelerated regions thatoverlapped human whole gene annotations (hg19) using bed-tools intersect (Quinlan and Hall 2010) and tested for theenrichment of GO terms using the PANTHER analysis toolavailable at the Gene Ontology Consortium website (GOOntology database last accessed June 2017) (GeneOntology Consortium 2015)

Detection of Protein-Coding Genes Subjected toPositive SelectionIn order to measure selective pressures acting on protein-coding genes during cetacean evolution with an emphasison the evolution of cancer suppression we estimated theratio of nonsynonymous to synonymous substitutions (dNdS) To maximize statistical power in pairwise comparisonsgiven the number of available cetacean genomes (six lastaccessed September 2017) we implemented phylogenetictargeting (Arnold and Nunn 2010) assuming a phylogenyfrom a mammalian supertree (Fritz et al 2009) To selectgenome assemblies most suitable for assessing PetorsquosParadox we weighted scores for contrasts with a lot of changein the same direction for both body mass and maximumlongevity Trait values were taken from panTHERIA (Joneset al 2009) and we selected maximal pairings based on thestandardized summed scores We then generated pairwisegenome alignments as described above based on the phylo-genetic targeting results For each pairwise genome align-ment we stitched gene blocks in Galaxy (Blankenberg et al2011) according to the target genome annotations produc-ing alignments of one-to-one orthologs which were filteredto delete frameshift mutations and replace internal stopcodons with gaps We then estimated pairwise dNdS for every

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1759

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

orthologous gene pair with KaKs_Calculator v20 (Wang et al2010) To link genes with dNdS gt1 to potential phenotypeswe used orthologous human Ensembl gene IDs to collect GOterms in BioMart (Kinsella et al 2011) and tested for enrich-ment of overrepresented GO terms

We also used codon-based models to test for selectivepressure variation along branches of the cetacean phylog-eny in comparison to other mammal lineages also knownas the branch-site test (Yang 2007) First the known spe-cies phylogeny (Morgan et al 2013 Tarver et al 2016) waspruned to correspond to the species present in each SGOfamily SGO nucleotide alignments that contained morethan seven species were analyzed for selective pressurevariation This is to reduce the risk of detecting falsepositives (Anisimova et al 2001 2002) In general thebranch-site test is a powerful yet conservative approach(Gharib and Robinson-Rechavi 2013) although modelmisspecification and alignment errors can greatly increasethe number of false positives (Anisimova et al 20012002) Recent studies have concluded that many pub-lished inferences of adaptive evolution using the branchsite test may be affected by artifacts (Venkat et al 2018)Therefore extensive filtering is necessary in order to makereasonably sound conclusions from results of the branchsite test A detailed description of all tested models andthe filtering process are given in the supplementaryMethods Supplementary Material online In total 1152gene families were analyzed We carried out the branch-site test using PAML v44e (Yang 2007) The following fivebranches were assessed as foreground humpback whalethe most recent common ancestor (MRCA) of the com-mon minke and humpback whales MRCA of baleenwhales MRCA of toothed whales and the MRCA of allwhales (cetacean stem lineage) For each model we keptall genes that met a significance threshold of Plt 005 aftera Bonferonni correction for multiple hypothesis testingusing the total number of branch genes (five foregroundbranches1152 genes) We also corrected the raw P-val-ues from the likelihood ratio tests of every gene by theFDR correction where qfrac14 005 (Benjamini and Hochberg1995) The Bonferroni correction is more conservativethan the FDR but sufficient for multiple hypothesis test-ing of lineage-specific positive selection in these casesthe FDR results in higher probabilities of rejecting truenull hypotheses (Anisimova and Yang 2007) Thereforewe used Bonferroni-corrected results in downstreamanalyses but also report FDR-corrected P-values infigure 4C Genes were identified based on the humanortholog (Ensembl gene ID) and we performed gene an-notation enrichment analysis and functional annotationclustering with DAVID v68 (Huang et al 2009a 2009b) aswell as semantic clustering of GO terms using REVIGO(Supek et al 2011) We also searched for interactions ofpositively selected proteins using STRING v105(Szklarczyk et al 2017) with default parameters andtested for the enrichment of overrepresented GO termsas above and for associated mouse phenotypes usingmodPhEA (Weng and Liao 2017)

Data AvailabilityAll data that contributed to the results of the study are madepublicly available The genomic sequencing RNA sequencingas well as the genome assembly for humpback whale(GCA_0043293851) are available under NCBI BioProjectPRJNA509641 The gene annotation orthologous gene setspositive selection results segmental duplication annotationsand whole-genome alignments used in this study are availableat the Harvard Dataverse (httpsdoiorg107910DVNADHX1O)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsField data collection was supported by the Center for CoastalStudies and conducted under NOAA research permit 16325This work was initiated with funding from Dr Jeffrey Pearl atUC San Fransisco This work was supported in part by NIH(Grants U54 CA217376 U2C CA233254 P01 CA91955 R01CA149566 R01 CA170595 R01 CA185138 and R01CA140657) as well as CDMRP Breast Cancer ResearchProgram (Award BC132057) and the Arizona BiomedicalResearch Commission (Grant ADHS18-198847) to CCMThe findings opinions and recommendations expressedhere are those of the authors and not necessarily those ofthe universities where the research was performed or theNational Institutes of Health PJP would like to acknowledgefunding from Stockholm University and the University ofGroningen Startup funds were provided by the School ofInformatics Computing and Cyber Systems at NorthernArizona University (MT) MJOrsquoC would like to thank theUniversity of Leeds funding her 250 Great Minds UniversityAcademic Fellowship The authors thank Wensi Hao(University of Groningen) for technical support AndreaCabrera (University of Groningen) for helpful conversationsabout mutation rates and demography and Richard E Green(UC Santa Cruz) for consultations regarding genome se-quencing assembly and annotation The authors acknowl-edge Research Computing at Arizona State University (httpwwwresearchcomputingasuedu) the Advanced ResearchComputing resources at the University of Leeds and theMonsoon computing cluster at Northern ArizonaUniversity (httpsnaueduhigh-performance-computing)for providing high-performance computing and storageresources that have contributed to this study

ReferencesAbegglen LM Caulin AF Chan A Lee K Robinson R Campbell MS Kiso

WK Schmitt DL Waddell PJ Bhaskara S 2015 Potential mechanismsfor cancer resistance in elephants and comparative cellular responseto DNA damage in humans JAMA 314(17)1850ndash1860

Alkan C Kidd JM Marques-Bonet T Aksay G Antonacci F HormozdiariF Kitzman JO Baker C Malig M Mutlu O 2009 Personalized copynumber and segmental duplication maps using next-generation se-quencing Nat Genet 41(10)1061ndash1067

Tollis et al doi101093molbevmsz099 MBE

1760

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Anisimova M Bielawski JP Yang Z 2001 Accuracy and power of thelikelihood ratio test in detecting adaptive molecular evolution MolBiol Evol 18(8)1585ndash1592

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19(6)950ndash958

Anisimova M Yang Z 2007 Multiple hypothesis testing to detect line-ages under positive selection that affects only a few sites Mol BiolEvol 24(5)1219ndash1228

Arnason U Lammers F Kumar V Nilsson MA Janke A 2018 Whole-genome sequencing of the blue whale and other rorquals findssignatures for introgressive gene flow Sci Adv 4(4)eaap9873

Arnason U Widegren B 1989 Composition and chromosomal localiza-tion of cetacean highly repetitive DNA with special reference to theblue whale Balaenoptera musculus Chromosoma 98(5)323ndash329

Arnold C Nunn CL 2010 Phylogenetic targeting of research effort inevolutionary biology Am Nat 176(5)601ndash612

Baker CS Perry A Bannister JL Weinrich MT Abernethy RBCalambokidis J Lien J Lambertsen RH Ramirez JU Vasquez O1993 Abundant mitochondrial DNA variation and world-wide pop-ulation structure in humpback whales Proc Natl Acad Sci U S A90(17)8239ndash8243

Baker J Meade A Pagel M Venditti C 2015 Adaptive evolution towardlarger size in mammals Proc Natl Acad Sci U S A 112(16)5093ndash5098

Benjamini Y Hochberg Y 1995 Controlling the false discovery rate apractical and powerful approach to multiple testing J R Stat Soc57289ndash300

Benson RBJ Campione NE Carrano MT Mannion PD Sullivan CUpchurch P Evans DC 2014 Rates of dinosaur body mass evolutionindicate 170 million years of sustained ecological innovation on theavian stem lineage PLoS Biol 12(5)e1001853

Benton MJ Donoghue PCJ Asher RJ Friedman M Near TJ Vinther J2015 Constraints on the timescale of animal evolutionary historyPalaeontol Electron 181ndash106

Berta A Sumich JL Kovacs KM 2015 Marine mammals evolutionarybiology 3rd ed Cambridge (MA) Academic Press

Blanchette M Kent WJ Riemer C Elnitski L Smit AFA Roskin KMBaertsch R Rosenbloom K Clawson H Green ED 2004 Aligningmultiple genomic sequences with the threaded blockset alignerGenome Res 14(4)708ndash715

Blankenberg D Taylor J Nekrutenko A Galaxy T 2011 Making wholegenome multiple alignments usable for biologists Bioinformatics27(17)2426ndash2428

Boissinot S Sookdeo A 2016 The evolution of LINE-1 in vertebratesGenome Biol Evol 8(12)3485ndash3507

Bolger AM Lohse M Usadel B 2014 Trimmomatic a flexible trimmer forIllumina sequence data Bioinformatics 30(15)2114ndash2120

Browning CL Wise CF Wise JP 2017 Prolonged particulate chromateexposure does not inhibit homologous recombination repair inNorth Atlantic right whale (Eubalaena glacialis) lung cells ToxicolAppl Pharmacol 33118ndash23

Carbone L Harris RA Gnerre S Veeramah KR Lorente-Galdos BHuddleston J Meyer TJ Herrero J Roos C Aken B et al 2014Gibbon genome and the fast karyotype evolution of small apesNature 513(7517)195ndash201

Caulin AF Graham TA Wang L-S Maley CC 2015 Solutions to Petorsquosparadox revealed by mathematical modelling and cross-species can-cer gene analysis Philos Trans R Soc B 370(1673)20140222

Caulin AF Maley CC 2011 Petorsquos Paradox evolutionrsquos prescription forcancer prevention Trends Ecol Evol (Amst) 26(4)175ndash182

Chapman JA Ho I Sunkara S Luo S Schroth GP Rokhsar DS 2011Meraculous de novo genome assembly with short paired-end readsPLoS One 6(8)e23501

Chittleborough RG 1959 Determination of age in the humpback whaleMegaptera nodosa (Bonnaterre) Aust J Mar Freshw Res10(2)125ndash143

Clapham PJ Mead JG 1999 Megaptera novaeangliae Mamm Species6041ndash9

Clement V Dunand-Sauthier I Clarkson SG 2006 Suppression of UV-induced apoptosis by the human DNA repair protein XPG CellDeath Differ 13(3)478ndash488

Dobin A Davis CA Schlesinger F Drenkow J Zaleski C Jha S Batut PChaisson M Gingeras TR 2013 STAR ultrafast universal RNA-Seqaligner Bioinformatics 29(1)15ndash21

Fay JC Wu C-I 2003 Sequence divergence functional constraint andselection in protein evolution Annu Rev Genomics Hum Genet4213ndash235

Ferris E Abegglen LM Schiffman JD Gregg C 2018 Accelerated evolu-tion in distinctive species reveals candidate elements for clinicallyrelevant traits including mutation and cancer resistance Cell Rep22(10)2742ndash2755

Fog CK Galli GG Lund AH 2012 PRDM proteins important players indifferentiation and disease Bioessays 34(1)50ndash60

Foley NM Hughes GM Huang Z Clarke M Jebb D Whelan CV Petit EJTouzalin F Farcy O Jones G et al 2018 Growing old yet stayingyoung the role of telomeres in batsrsquo exceptional longevity Sci Adv4(2)eaao0926

Foote AD Liu Y Thomas GWC Vinar T Alfoldi J Deng J Dugan Svan Elk CE Hunter ME Joshi V et al 2015 Convergent evolu-tion of the genomes of marine mammals Nat Genet47(3)272ndash275

Forbes SA Beare D Gunasekaran P Leung K Bindal N Boutselakis HDing M Bamford S Cole C Ward S et al 2015 COSMIC exploringthe worldrsquos knowledge of somatic mutations in human cancerNucleic Acids Res 43(D1)D805ndashD811

Fritz SA Bininda-Emonds ORP Purvis A 2009 Geographical variation inpredictors of mammalian extinction risk big is bad but only in thetropics Ecol Lett 12(6)538ndash549

Futreal PA Coin L Marshall M Down T Hubbard T Wooster R RahmanN Stratton MR 2004 A census of human cancer genes Nat RevCancer 4(3)177ndash183

Gabriele C Lockyer C Straley JM Jurasz CM Kato H 2010 Sightinghistory of a naturally marked humpback whale (Megaptera novaean-gliae) suggests ear plug growth layer groups are deposited annuallyMar Mam Sci 26(2)443ndash450

Gauthier JM Metcalfe CD Sears R 1997 Chlorinated organic contam-inants in blubber biopsies from northwestern Atlantic balaenopteridwhales summering in the Gulf of St Lawrence Mar Environ Res44(2)201ndash223

Gene Ontology Consortium 2015 Gene Ontology Consortium goingforward Nucleic Acids Res 43D1049ndashD1056

George JC Bada J Zeh J Scott L Brown SE OrsquoHara T Suydam R 1999Age and growth estimates of bowhead whales (Balaena mysticetus)via aspartic acid racemization Can J Zool 77(4)571ndash580

Gharib WH Robinson-Rechavi M 2013 The branch-site test of positiveselection is surprisingly robust but lacks power under synonymoussubstitution saturation and variation in GC Mol Biol Evol30(7)1675ndash1686

Grabherr MG Haas BJ Yassour M Levin JZ Thompson DA Amit IAdiconis X Fan L Raychowdhury R Zeng Q et al 2011 Full-lengthtranscriptome assembly from RNA-Seq data without a referencegenome Nat Biotechnol 29(7)644ndash652

Guindon S Dufayard J-F Lefort V Anisimova M Hordijk W Gascuel O2010 New algorithms and methods to estimate maximum-likelihood phylogenies assessing the performance of PhyML 30Syst Biol 59(3)307ndash321

Harris RS 2007 Improved pairwise alignment of genomic DNA [PhDthesis] The Pennsylvania State University State College PA USA

Hedges SB Marin J Suleski M Paymer M Kumar S 2015 Tree of lifereveals clock-like speciation and diversification Mol Biol Evol32(4)835ndash845

Heim NA Knope ML Schaal EK Wang SC Payne JL 2015 Copersquos rule inthe evolution of marine animals Science 347(6224)867ndash870

Holt C Yandell M 2011 MAKER2 an annotation pipeline and genome-database management tool for second-generation genome projectsBMC Bioinformatics 12491

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1761

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Huang DW Sherman BT Lempicki RA 2009a Bioinformatics enrich-ment tools paths toward the comprehensive functional analysis oflarge gene lists Nucleic Acids Res 37(1)1ndash13

Huang DW Sherman BT Lempicki RA 2009b Systematic and integra-tive analysis of large gene lists using DAVID bioinformatics resourcesNat Protoc 4(1)44ndash57

Hubisz MJ Pollard KS Siepel A 2011 PHAST and RPHAST phylogeneticanalysis with spacetime models Brief Bioinform 12(1)41ndash51

Hudson RR 1990 Gene genealogies and the coalescent process InFutuyama DJ Antonovics J editors Oxfords surveys in evolutionarybiology Oxford Oxford University Press p 1ndash44

Ichishima H 2016 The ethmoid and presphenoid of cetaceans JMorphol 277(12)1661ndash1674

Jackson JA Baker CS Vant M Steel DJ Medrano-Gonzalez LPalumbi SR 2009 Big and slow phylogenetic estimates of mo-lecular evolution in baleen whales (suborder Mysticeti) MolBiol Evol 26(11)2427ndash2440

Jackson JA Steel DJ Beerli P Congdon BC Olavarria C Leslie MS PomillaC Rosenbaum H Baker CS 2014 Global diversity and oceanic di-vergence of humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 281(1786)20133222

Jones KE Bielby J Cardillo M Fritz SA OrsquoDell J Orme CDL Safi KSechrest W Boakes EH Carbone C et al 2009 PanTHERIA a spe-cies-level database of life history ecology and geography of extantand recently extinct mammals Ecology 90(9)2648ndash2648

Jurka J Kapitonov VV Pavlicek A Klonowski P Kohany O WalichiewiczJ 2005 Repbase Update a database of eukaryotic repetitive ele-ments Cytogenet Genome Res 110(1ndash4)462ndash467

Kaessmann H 2010 Origins evolution and phenotypic impact of newgenes Genome Res 20(10)1313ndash1326

Kastan MB Bartek J 2004 Cell-cycle checkpoints and cancer Nature432(7015)316ndash323

Katona SK Whitehead HP 1981 Identifying humpback whales usingtheir natural markings Polar Rec 20(128)439ndash444

Keane M Semeiks J Webb AE Li YI Quesada V Craig T Madsen LB vanDam S Brawand D Marques PI et al 2015 Insights into the evolu-tion of longevity from the bowhead whale genome Cell Rep10(1)112ndash122

Kent WJ Baertsch R Hinrichs A Miller W Haussler D 2003 Evolutionrsquoscauldron duplication deletion and rearrangement in the mouseand human genomes Proc Natl Acad Sci U S A100(20)11484ndash11489

Kingsolver JG Pfennig DW 2004 Individual-level selection as acause of Copersquos rule of phyletic size increase Evolution58(7)1608ndash1612

Kinsella RJ Keuroaheuroari A Haider S Zamora J Proctor G Spudich G Almeida-King J Staines D Derwent P Kerhornou A et al 2011 EnsemblBioMarts a hub for data retrieval across taxonomic spaceDatabase (Oxford) 2011bar030

Korf I 2004 Gene finding in novel genomes BMC Bioinformatics 559Kumar S Subramanian S 2002 Mutation rates in mammalian genomes

Proc Natl Acad Sci U S A 99(2)803ndash808Lambertsen RH 1987 A biopsy system for large whales and its use for

cytogenetics J Mammal 68(2)443ndash445Langmead B Salzberg SL 2012 Fast gapped-read alignment with Bowtie

2 Nat Methods 9(4)357ndash359Larsen AH Sigurjonsson J Oslashien N Vikingsson G Palsboslashll PJ 1996

Populations genetic analysis of nuclear and mitochondrial loci inskin biopsies collected from central and northeastern NorthAtlantic humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 2631611ndash1618

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357)493ndash496

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup 2009 The Sequence AlignmentMap format andSAMtools Bioinformatics 25(16)2078ndash2079

Martineau D Lemberger K Dallaire A Labelle P Lipscomb TP Michel PMikaelian I 2002 Cancer in wildlife a case study beluga from the St

Lawrence estuary Quebec Canada Environ Health Perspect110(3)285ndash292

Mazet O Rodrıguez W Grusea S Boitard S Chikhi L 2016 On theimportance of being structured instantaneous coalescence ratesand human evolution ndash lesson for ancestral population size infer-ence Heredity 116(4)362ndash371

Michaud EJ Yoder BK 2006 The primary cilium in cell signaling andcancer Cancer Res 66(13)6463ndash6467

Milholland B Dong X Zhang L Hao X Suh Y Vijg J 2017 Differencesbetween germline and somatic mutation rates in humans and miceNat Commun 815183

Mitchell ED 1989 A new cetacean from the Late Eocene La MesetaFormation Seymour Island Antarctic Peninsula Can J Fish Aquat Sci46(12)2219ndash2235

Morgan CC Foster PG Webb AE Pisani D McInerney JO OrsquoConnell MJ2013 Heterogeneous models place the root of the placental mam-mal phylogeny Mol Biol Evol 30(9)2145ndash2156

Nadachowska-Brzyska K Burri R Smeds L Ellegren H 2016 PSMC anal-ysis of effective population sizes in molecular ecology and its appli-cation to black-and-white Ficedula flycatchers Mol Ecol25(5)1058ndash1072

Newman SJ Smith SA 2006 Marine mammal neoplasia a review VetPathol 43(6)865ndash880

Nunney L 2018 Size matters height cell number and a personrsquos risk ofcancer Proc R Soc B 285(1889)20181743

Ohsumi S 1979 Interspecies relationships among some biologicalparameters in cetaceans and estimation of the natural mortalitycoefficient of the Southern Hemisphere minke whales Rep IntWhaling Comm 29397ndash406

OrsquoLeary MA Gatesy J 2008 Impact of increased character sampling onthe phylogeny of Cetartiodactyla (Mammalia) combined analysisincluding fossils Cladistics 24397ndash442

Pagel M Venditti C Meade A 2006 Large punctuational contribution ofspeciation to evolutionary divergence at the molecular level Science314(5796)119ndash121

Palsboslashll PJ Clapham PJ Mattila DK Larsen F Sears R Siegismund HRSigurjonsson J Vasquez O Arctander P 1995 Distribution ofmtDNA haplotypes in North Atlantic humpback whales the influ-ence of behavior on population structure Mar Ecol Prog Ser1161ndash10

Palsboslashll PJ Larsen F Hansen ES 1991 Sampling of skin biopsies fromfree-raging large cetaceans in West Greenland development of newbiopsy tips and bolt designs Int Whaling Comm Spec Issue Ser13311

Palsboslashll PJ Zachariah Peery M Olsen MT Beissinger SR Berube M 2013Inferring recent historic abundance from current genetic diversityMol Ecol 22(1)22ndash40

Parra G Bradnam K Ning Z Keane T Korf I 2009 Assessing the genespace in draft genomes Nucleic Acids Res 37(1)289ndash297

Pertea M Pertea GM Antonescu CM Chang T-C Mendell JT SalzbergSL 2015 StringTie enables improved reconstruction of a transcrip-tome from RNA-Seq reads Nat Biotechnol 33(3)290ndash295

Peto R Roe FJ Lee PN Levy L Clack J 1975 Cancer and ageing in miceand men Br J Cancer 32(4)411ndash426

Pio R Corrales L Lambris JD 2014 The role of complement in tumorgrowth Adv Exp Med Biol 772229ndash262

Pollard KS Hubisz MJ Rosenbloom KR Siepel A 2010 Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 20(1)110ndash121

Putnam NH OrsquoConnell BL Stites JC Rice BJ Blanchette M Calef R TrollCJ Fields A Hartley PD Sugnet CW et al 2016 Chromosome-scaleshotgun assembly using an in vitro method for long-range linkageGenome Res 26(3)342ndash350

Quinlan AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26(6)841ndash842

Ruegg K Rosenbaum HC Anderson EC Engel M Rothschild A Baker CSPalumbi SR 2013 Long-term population size of the North Atlantichumpback whale within the context of worldwide population struc-ture Conserv Genet 14(1)103ndash114

Tollis et al doi101093molbevmsz099 MBE

1762

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Sanderson MJ 2002 Estimating absolute rates of molecular evolutionand divergence times a penalized likelihood approach Mol Biol Evol19(1)101ndash109

Sanderson MJ 2003 r8s inferring absolute rates of molecular evolutionand divergence times in the absence of a molecular clockBioinformatics 19(2)301ndash302

Sayyari E Mirarab S 2016 Fast coalescent-based computation of localbranch support from quartet frequencies Mol Biol Evol33(7)1654ndash1668

Shadat NMA Koide N Khuda II-E Dagvadorj J Tumurkhuu G Naiki YKomatsu T Yoshida T Yokochi T 2010 Retinoblastoma protein-interacting zinc finger 1 (RIZ1) regulates the proliferation of mono-cytic leukemia cells via activation of p53 Cancer Invest28(8)806ndash812

Sim~ao FA Waterhouse RM Ioannidis P Kriventseva EV Zdobnov EM2015 BUSCO assessing genome assembly and annotation com-pleteness with single-copy orthologs Bioinformatics31btv351ndashbtv3212

Slater GJ Goldbogen JA Pyenson ND 2017 Independent evolution ofbaleen whale gigantism linked to Plio-Pleistocene ocean dynamicsProc R Soc B 284(1855)20170546

Smit AFA Hubley RM Green P 2015a RepeatMasker Open-40 2013-2015 Available from httpwwwrepeatmaskerorg

Smit AFA Hubley RM Green P 2015b RepeatModeler Open-10 2008-2015 Available from httpwwwrepeatmaskerorg

Smith TD Allen J Clapham PJ Hammond PS Katona S Larsen F Lien JMattila D Palsboslashll PJ Sigurjonsson J et al 1999 An ocean-basin-widemark-recapture study of the North Atlantic humpback whale(Megaptera novaeangliae) Mar Mamm Sci 51ndash32

Stamatakis A 2014 RAxML version 8 a tool for phylogenetic analysisand post-analysis of large phylogenies Bioinformatics30(9)1312ndash1313

Stanke M Diekhans M Baertsch R Haussler D 2008 Using native andsyntenically mapped cDNA alignments to improve de novo genefinding Bioinformatics 24(5)637ndash644

Steeman ME Hebsgaard MB Fordyce RE Ho SYW Rabosky DL NielsenR Rahbek C Glenner H Soslashrensen MV Willerslev E 2009 Radiationof extant cetaceans driven by restructuring of the oceans Syst Biol58(6)573ndash585

Sulak M Fong L Mika K Chigurupati S Yon L Mongan NP Emes RDLynch VJ 2016 TP53 copy number expansion is associated with theevolution of increased body size and an enhanced DNA damageresponse in elephants Elife 51850

Supek F Bosnjak M Skunca N Smuc T 2011 REVIGO summarizes andvisualizes long lists of gene ontology terms PLoS One 6(7)e21800

Szklarczyk D Morris JH Cook H Kuhn M Wyder S Simonovic M SantosA Doncheva NT Roth A Bork P et al 2017 The STRING database in2017 quality-controlled protein-protein association networks madebroadly accessible Nucleic Acids Res 45(D1)D362ndashD368

Tacutu R Craig T Budovsky A Wuttke D Lehmann G Taranukha DCosta J Fraifeld VE de Magalh~aes JP 2012 Human Ageing GenomicResources integrated databases and tools for the biology and genet-ics of ageing Nucleic Acids Res 41(D1)D1027ndashD1033

Tariq MA Kim HJ Jejelowo O Pourmand N 2011 Whole-transcriptomeRNAseq analysis from minute amount of total RNA Nucleic AcidsRes 39(18)e120

Tarver JE Dos Reis M Mirarab S Moran RJ Parker S OrsquoReilly JE King BLOrsquoConnell MJ Asher RJ Warnow T et al 2016 The interrelationshipsof placental mammals and the limits of phylogenetic inferenceGenome Biol Evol 8(2)330ndash344

Thewissen JGM Cohn MJ Stevens LS Bajpai S Heyning J Horton WE2006 Developmental basis for hind-limb loss in dolphins and origin

of the cetacean bodyplan Proc Natl Acad Sci U S A103(22)8414ndash8418

Thewissen JGM Cooper LN Clementz MT Bajpai S Tiwari BN 2007Whales originated from aquatic artiodactyls in the Eocene epoch ofIndia Nature 450(7173)1190ndash1194

Tollis M Schiffman JD Boddy AM 2017 Evolution of cancer suppressionas revealed by mammalian comparative genomics Curr Opin GenetDev 4240ndash47

Trego KS Groesser T Davalos AR Parplys AC Zhao W Nelson MRHlaing A Shih B Rydberg B Pluth JM et al 2016 Non-catalytic rolesfor XPG with BRCA1 and BRCA2 in homologous recombination andgenome stability Mol Cell 61(4)535ndash546

UniProt Consortium 2015 UniProt a hub for protein informationNucleic Acids Res 43D204ndashD212

Valsecchi E Palsboslashll PJ Hale P Glockner-Ferrari D Ferrari M Clapham PLarsen F Mattila DK Sears R Sigurjonsson J et al 1997 Microsatellitegenetic distances between oceanic opulations of the humpbackwhale (Megaptera novaeangliae) Mol Biol Evol 14355ndash362

Venkat A Hahn MW Thornton JW 2018 Multinucleotide mutationscause false inferences of lineage-specific positive selection Nat EcolEvol 2(8)1280ndash1288

Vermeij GJ 2016 Gigantism and its implications for the history of lifePLoS One 11(1)e0146092

Wang D Zhang Y Zhang Z Zhu J Yu J 2010 KaKs_Calculator 20 atoolkit incorporating gamma-series methods and sliding windowstrategies Genomics Proteomics Bioinformatics 8(1)77ndash80

Warren WC Kuderna L Alexander A Catchen J Perez-Silva JG Lopez-Otın C Quesada V Minx P Tomlinson C Montague MJ et al 2017The novel evolution of the sperm whale genome Genome Biol Evol9(12)3260ndash3264

Webb AE Walsh TA OrsquoConnell MJ 2017 VESPA very large-scale evo-lutionary and selective pressure analyses PeerJ Comput Sci 3e118

Weng M-P Liao B-Y 2017 modPhEA model organism PhenotypeEnrichment Analysis of eukaryotic gene sets Bioinformatics33(21)3505ndash3507

Wilson Sayres MA Venditti C Pagel M Makova KD 2011 Do variationsin substitution rates and male mutation bias correlate with life-history traits A study of 32 mammalian genomes Evolution65(10)2800ndash2815

Yang Z 1998 Likelihood ratio tests for detecting positive selection andapplication to primate lysozyme evolution Mol Biol Evol15(5)568ndash573

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 24(8)1586ndash1591

Yang Z Rannala B 2006 Bayesian estimation of species divergence timesunder a molecular clock using multiple fossil calibrations with softbounds Mol Biol Evol 23(1)212ndash226

Yim H-S Cho YS Guang X Kang SG Jeong J-Y Cha S-S Oh H-M Lee J-HYang EC Kwon KK et al 2014 Minke whale genome and aquaticadaptation in cetaceans Nat Genet 46(1)88ndash92

Yu J Tian S Metheny-Barlow L Chew L-J Hayes AJ Pan H Yu G-L LiL-Y 2001 Modulation of endothelial cell growth arrest andapoptosis by vascular endothelial growth inhibitor Circ Res89(12)1161ndash1167

Zhang C Rabiee M Sayyari E Mirarab S 2018 ASTRAL-III polynomialtime species tree reconstruction from partially resolved gene treesBMC Bioinformatics 19(Suppl 6)153

Zhang Q Edwards SV 2012 The evolution of intron size in amniotes arole for powered flight Genome Biol Evol 4(10)1033ndash1043

Zhao M Kim P Mitra R Zhao J Zhao Z 2015 TSGene 20 an updatedliterature-based knowledgebase for tumor suppressor genes NucleicAcids Res 44gkv1268ndashgkD1031

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1763

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

  • msz099-TF1
  • msz099-TF2
  • msz099-TF3
  • msz099-TF4
  • msz099-TF5
  • msz099-TF6
  • msz099-TF7
  • msz099-TF8
Page 12: University of Groningen Return to the sea, get huge, …...Return to the Sea, Get Huge, Beat Cancer: An Analysis of Cetacean Genomes Including an Assembly for the Humpback Whale (Megaptera

well Across all the branch-site models positively selectedgenes represented multiple functional categories relevant tocancer and Petorsquos Paradox

Among the cancer-related genes subjected to positiveselection in cetaceans we identified two with identicalamino acid changes among disparate taxa united by thetraits of large body size andor extreme longevitySpecifically PRDM13 is a tumor suppressor gene that actsas a transcriptional repressor and we found identical DEamino acid substitutions in this gene in sperm whale dol-phin orca and humpback but also manatee (Trichechusmanatus) and African elephant (Loxodonta africana) whichare large-bodied afrotherian mammals that have been thefocus of cancer suppression research (Abegglen et al 2015Sulak et al 2016) Secondly POLE is a cancer-related genethat participates in DNA repair and replication and weobserved one IV substitution shared among orca dol-phin bowhead humpback and common minke whalebut also elephant as well as a second IV substitutionshared with these cetaceans and the little brown bat(Myotis lucifugus) Vesper bats such as M lucifugus areknown for their exceptional longevity relative to theirbody size and have been proposed as model organisms insenescence and cancer research (Foley et al 2018) Parallelchanges in cancer-related genes across these

phylogenetically distinct mammals suggest natural selectionhas acted on similar pathways that limit neoplastic progres-sion in large and long-lived species (Tollis et al 2017)

Petorsquos Paradox and Cancer in Whales and Other LargeMammalsLarge body size has evolved numerous times in mammalsand although it is exemplified in some extant cetaceans gi-gantism is also found in afrotherians perissodactyls and car-nivores (Baker et al 2015) Our results suggest that cancersuppression in large and long-lived mammals has also evolvednumerous times However none of these species iscompletely immune to cancer Elephants have at least a 5lifetime risk of cancer mortality (Abegglen et al 2015) whichis far less than humans but detecting cancer and estimatingcancer incidence and mortality rates in wild cetaceans is morechallenging Mathematical modeling predicting the lifetimerisk of colorectal cancer in mice and humans yielded a rate ofcolorectal cancer at 50 in blue whales by age 50 and 100by age 90 (Caulin et al 2015) This high rate of cancer mor-tality is an unlikely scenario and taken with our genomicresults presented here it suggests that cetaceans have evolvedmechanisms to limit their overall risk of cancer Among ba-leen whales benign neoplasms of the skin tongue and centralnervous system have been reported in humpback whales and

Orca

Minke whale

Cow

Bowhead whale

Bottlenose dolphin

Humpback whale

Sperm whale54

12

84

74

212A

ATRT Suppression of growth escaping programmedcell death change of cellular energetics

Humpback whale

BCORL1TF Invasion and metastasis

PICALMT Change of cellular energetics

PRDM2T Regulation of TP53 expression and degredation

TPRF Involved in the activation of oncogenic kinases

AMER1T Promotes suppression of growth

CCNB1IP1T Suppresses invasion and metastasis

POLET Participates in DNA repair and DNA replication

Cetacea ancestor

KMT2CT Methylates rsquoLys-4rsquo of histone H3

POU2AF1 Promotes proliferative signalingO

PRDM1T Regulation of TP53 expression anddegradation

ERCC5T Excision repair DNA double-strand breakrepair following UV damage

Mysticeti ancestor

PMLT Positively regulates TP53 by inhibiting MDM2-dependent degredation

Odontoceti ancestor

MUC1F Overexpressed in epithelial tumors

Minke and humpback whale ancestor

RETO Plays a role in cell differentiation and growth

C COSMIC GENES

anatomicalstructure

formation involvedin morphogenesis

autophagy

cellproliferation

cellminuscellsignaling

circulatorysystemprocess

growth

pigmentation

proteinmaturation

ribonucleoproteincomplexassembly

symbiosisencompassing

mutualismthrough

parasitismvacuolartransport

B

20

FIG 4 Positively selected genes during cetacean evolution (A) Species tree relationships of six modern cetaceans with complete genomeassemblies estimated from 152 single-copy orthologs Branch lengths are given in coalescent units Outgroup taxa are not shown The completespecies tree of 28 mammals is shown in supplementary figure 4 Supplementary Material online Boxes with numbers indicate the number ofpositively selected genes passing filters and a Bonferroni correction detected on each branch (B) TreeMap from REVIGO for GO biologicalprocesses terms represented by genes evolving under positive selection across all cetaceans Rectangle size reflects semantic uniqueness of GOterm which measures the degree to which the term is an outlier when compared semantically to the whole list of GO terms (C) Cancer gene namesand functions from COSMIC found to be evolving under positive selection in the cetacean branch-site models Superscripts for gene namesindicate as follows T tumor suppressor gene O oncogene F fusion gene Asterisks indicate P-value following FDR correction for multiple testingPlt 001 Plt 0001 Plt 00001

Tollis et al doi101093molbevmsz099 MBE

1756

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

ovarian carcinomas and lymphomas have been detected infin whales (Newman and Smith 2006) Among smaller ceta-ceans one unusually well-documented case study concludedthat 27 of beluga whales (Delphinapterus leucas) found deadin the St Lawrence estuary had cancer which may have con-tributed to 18 of the total mortality in that population(Martineau et al 2002) The authors suggested that thehigh degree of polycyclic aromatic hydrocarbons releasedinto the estuary by nearby industry may have contributedto this elevated cancer risk (Martineau et al 2002) Bycontrast the larger baleen whale species in the Gulf ofSt Lawrence appear to have lower contaminant burdenslikely due to ecological differences (Gauthier et al 1997)Interestingly unlike in human cells homologous recom-bination is uninhibited in North Atlantic right whale lungcells following prolonged exposure to the human lungcarcinogen particulate hexavalent chromium (Browninget al 2017) suggesting adaptations for high-fidelity DNArepair in whales

In this study we provide a de novo reference assemblyfor the humpback whalemdashone of the more well-studiedgiants living on Earth today The humpback whale genomeassembly is highly contiguous and contains a comparablenumber of orthologous genes to other mammalian ge-nome projects Our comparisons with other complete ce-tacean genomes confirm the results of other studies whichconcluded that rorqual whales likely began diversifyingduring the Miocene (Slater et al 2017 Arnason et al2018) We found indications of positive selection onmany protein-coding genes suggestive of adaptive changein pathways controlling the mammalian appendicular andcranial skeletal elements which are relevant to highly spe-cialized cetacean phenotypes as well as in many immunitygenes and pathways that are known to place checks onneoplastic progression LSDs in cetacean genomes containmany genes involved in the control of apoptosis includingknown tumor suppressor genes and skin transcriptomeresults from humpback whale suggest many gene duplica-tions whether through segmental duplication or retro-transposition are transcribed and hence likely functionalWe also use genome-wide evidence to show that germlinemutation rates may be slower in cetaceans than in othermammals which has been suggested in previous studies(Jackson et al 2009) and we suggest as a corollary thatcetacean somatic mutations rates may be lower as wellThese results are consistent with predictions stemmingfrom Petorsquos Paradox (Peto et al 1975 Caulin and Maley2011) which posited that gigantic animals have evolvedcompensatory adaptations to cope with the negativeeffects of orders of magnitude more cells and long lifespansthat increase the number of cell divisions and cancer riskover time Altogether the humpback whale genome as-sembly will aid comparative oncology research that seeksto improve therapeutic targets for human cancers as wellas provide a resource for developing useful genomicmarkers that will aid in the population management andconservation of whales

Materials and Methods

Tissue Collection and DNA ExtractionBiopsy tissue was collected from an adult female humpbackwhale (ldquoSaltrdquo NCBI BioSample SAMN1058501) in the Gulf ofMaine western North Atlantic Ocean using previously de-scribed techniques (Lambertsen 1987 Palsboslashll et al 1991) andflash frozen in liquid nitrogen We extracted DNA from skinusing the protocol for high-molecular-weight genomic DNAisolation with the DNeasy Blood and Tissue purification kit(Qiagen) Humpback whales can be individually identifiedand studied over time based on their unique ventral flukepigmentation (Katona and Whitehead 1981) Salt was specif-ically selected for this study because of her 35-year priorsighting history which is among the lengthiest and detailedfor an individual humpback whale (Center for Coastal Studiesunpublished data)

De Novo Assembly of the Humpback Whale GenomeUsing a combination of paired-end and mate-pair libraries denovo assembly was performed using Meraculous 204(Chapman et al 2011) with a kmer size of 47 Reads weretrimmed for quality sequencing adapters and mate-pairadapters using Trimmomatic (Bolger et al 2014) The genomesize of the humpback whale was estimated using the shortreads by counting the frequency of kmers of length 27 oc-curring in the 180-bp data set estimating the kmer coverageand using the following formula genome size frac14 total kmers kmer coverage

Chicago Library Preparation and SequencingFour Chicago libraries were prepared as described previously(Putnam et al 2016) Briefly for each library500 ng of high-molecular-weight genomic DNA (mean fragment lengthgt50kb) was reconstituted into chromatin in vitro and fixed withformaldehyde Fixed chromatin was digested with MboI orDpnII the 50-overhangs were repaired with biotinylatednucleotides and blunt ends were ligated After ligation cross-links were reversed and the DNA purified from protein Biotinthat was not internal to ligated fragments was removed fromthe purified DNA The DNA was then sheared to 350 bpmean fragment size and sequencing libraries were generatedusing NEBNext Ultra (New England BioLabs) enzymes andIllumina-compatible adapters Biotin-containing fragmentswere isolated using streptavidin beads before PCR enrichmentof each library The libraries were sequenced on an IlluminaHiSeq 2500 platform

Scaffolding the De Novo Assembly with HiRiseThe input de novo assembly shotgun reads and Chicagolibrary reads were used as input data for HiRise a softwarepipeline designed specifically for using Chicago data to scaf-fold genome assemblies (Putnam et al 2016) Shotgun andChicago library sequences were aligned to the draft inputassembly using a modified SNAP read mapper (httpsnapcsberkeleyedu) The separations of Chicago read pairsmapped within draft scaffolds were analyzed by HiRise toproduce a likelihood model for genomic distance between

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1757

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

read pairs and the model was used to identify putative mis-joins and to score prospective joins After scaffolding shotgunsequences were used to close gaps between contigs

Assessing the Gene Content of the Humpback WhaleAssemblyThe expected gene content of the assembly was evaluatedusing the Core Eukaryotic Genes Mapping Approach (Parraet al 2009) which searches the assembly for 458 highly con-served proteins and reports the proportion of 248 of the mosthighly conserved orthologs that are present in the assemblyWe also used the Benchmarking Universal Single CopyOrthologs (BUSCO v201 Sim~ao et al 2015) which analyzesgenome assemblies for the presence of 3023 genes conservedacross vertebrates as well as a set of 6253 genes conservedacross laurasiatherian mammals

Transcriptome Sequencing and AssemblyIn order to aid in our gene-finding efforts for the humpbackwhale genome assembly and to measure gene expression wegenerated transcripts from skin tissue by extracting total RNAusing the QIAzol Lysis Reagent (Qiagen) followed by purifi-cation on RNeasy spin columns (Qiagen) RNA integrity andquantity were determined on the Agilent 2100 Bioanalyzer(Agilent) using the manufacturerrsquos protocol The total RNAwas treated with DNase using DNase mix from the RecoverAllTotal Nucleic Acid Isolation kit (Applied BiosystemsAmbion) The RNA library was prepared and sequenced bythe Genome Technology Center at the University ofCalifornia Santa Cruz including cDNA synthesis with theOvation RNA-Seq system V2 (Nugen) and RNA amplificationas described previously (Tariq et al 2011) We used 05ndash1mgof double-stranded cDNA for library preparation shearedusing the Covaris S2 size-selected for 350ndash450 bp using au-tomated electrophoretic DNA fractionation system(LabChipXT Caliper Life Sciences) Paired-end sequencing li-braries were constructed using Illumina TruSeq DNA SamplePreparation Kit Following library construction samples werequantified using the Bioanalyzer and sequenced on theIllumina HiSeq 2000 platform to produce 2 100 bp se-quencing reads We then used Trinity (Grabherr et al 2011)to assemble the adapter-trimmed RNA-Seq reads intotranscripts

Genome AnnotationWe generated gene models for the humpback whale usingmultiple iterations of MAKER2 (Holt and Yandell 2011) whichincorporated 1) direct evidence from the Trinity-assembledtranscripts 2) homology to NCBI proteins from ten mammals(human mouse dog cow sperm whale bottlenose dolphinorca bowhead whale common minke whale and baiji) andUniProtKBSwiss-Prot (UniProt Consortium 2015) and 3) abinitio gene predictions using SNAP (11292013 release Korf2004) and Augustus v302 (Stanke et al 2008) A detaileddescription of the annotation pipeline is provided in the sup-plementary Methods Supplementary Material online Finalgene calls were annotated functionally by BlastP similarity to

UniProt proteins (UniProt Consortium 2015) with an e-valuecutoff of 1e-6

Repeat Annotation and Evolutionary AnalysisTo analyze the repetitive landscape of the humpback whalegenome we used both database and de novo modelingmethods For the database method we ran RepeatMaskerv405 (httpwwwrepeatmaskerorg accessed August 212017) (Smit et al 2015a) on the final assembly indicatingthe ldquomammaliardquo repeat library from RepBase (Jurka et al2005) For the de novo method we scanned the assemblyfor repeats using RepeatModeler v108 (httpwwwrepeat-maskerorg) (Smit et al 2015b) the results of which were thenclassified using RepeatMasker To estimate evolutionary di-vergence within repeat subfamilies in the humpback whalegenome we generated repeat-family-specific alignments andcalculated the average Kimura-2-parameter divergence fromconsensus within each family correcting for high mutationrates at CpG sites with the calcDivergenceFromAlignplRepeatMasker tool We compared the divergence profile ofhumpback whale and bowhead whale by completing parallelanalyses and the repetitive landscapes of orca and bottlenosedolphin are available from the RepeatMasker server (httpwwwrepeatmaskerorgspecies accessed August 21 2017)

Analysis of Gene Expression Using RNA-SeqSplice-wise mapping of RNA-Seq reads against the humpbackwhale genome assembly and annotation was carried out us-ing STAR v24 (Dobin et al 2013) and we counted the num-ber of reads mapping to gene annotations We also mappedthe skin RNA-Seq data to the database of annotated hump-back whale transcripts using local alignments with bowtiev225 (Langmead and Salzberg 2012) and used stringtiev134 (Pertea et al 2015) to calculate gene abundances bytranscripts per million

Analysis of Segmental Duplications in CetaceanGenomesIn order to detect LSDs in several cetacean genomes we ap-plied an approach based on depth of coverage (Alkan et al2009) To this end we used whole-genome shotgun sequencedata from the current study as well as from other cetaceangenomics projects All data were mapped against the hump-back whale reference assembly A detailed description of thesegmental duplication analysis is provided in the supplemen-tary Methods Supplementary Material online

Whole-Genome AlignmentsWe generated WGAs of 12 mammals (supplementary table 2Supplementary Material online) First we generated pairwisesyntenic alignments of each species as a query to the humangenome (hg19) as a target using LASTZ v102 (Harris 2007)followed by chaining to form gapless blocks and netting torank the highest scoring chains (Kent et al 2003) The pairwisealignments were used to construct a multiple sequence align-ment with MULTIZ v112 (Blanchette et al 2004) with humanas the reference species We filtered the MULTIZ alignment to

Tollis et al doi101093molbevmsz099 MBE

1758

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

only contain aligned blocks from at least 10 out of the 12species (81 complete)

Phylogenetic Reconstruction Using Single-CopyOrthologsWe downloaded the coding DNA sequences from 28 publiclyavailable mammalian genome assemblies (supplementary ta-ble 9 Supplementary Material online) and used VESPA(Webb et al 2017) to obtain high-confidence SGOs (supple-mentary Methods Supplementary Material online) For phy-logenetic analysis we filtered the SGO data set to include onlyloci that were represented by at least 24 out of the 28 mam-malian species (86 complete) and reconstructed each genetree using maximum likelihood in PhyML v30 (Guindon et al2010) with an HKY85 substitution model and 100 bootstrapreplicates to assess branch support The gene trees were thenbinned and used to reconstruct a species tree using the ac-curate species tree algorithm (ASTRAL-III v56 Zhang et al2018) ASTRAL utilizes the multispecies coalescent modelthat incorporates incomplete lineage sorting and finds thespecies tree stemming from bipartitions predefined by thegene trees Branch support for the species tree was assessedwith local posterior probabilities and branch lengths werepresented in coalescent units where shorter branch lengthsindicate greater gene tree discordance (Sayyari and Mirarab2016)

Rates of Molecular Evolution and Divergence TimeEstimationWe used multiple approaches on independent data sets toestimate rates of molecular evolution and the divergencetimes of the major mammalian lineages including six modernwhales with complete genome assemblies We first focusedon 4-fold degenerate (4D) sites which are positions withincodon alignments where substitutions result in no aminoacid change and can be used to approximate the neutralrate of evolution (Kumar and Subramanian 2002) We usedthe Ensembl human gene annotation to extract codingregions from the 12-mammal WGA using msa_view inPHAST v14 (Hubisz et al 2011) We reconstructed the phy-logeny with the 4D data as a single partition in RAxML v83(Stamatakis 2014) under the GTRGAMMA substitutionmodel and assessed branch support with 10000 bootstrapsRates of molecular evolution were estimated on the 4D dataset with the semiparametric PL method implemented in r8sv18 (Sanderson 2002 2003) A detailed description of the PLmethod is given in the supplementary MethodsSupplementary Material online

We also used the approximate likelihood calculation inMCMCtree (Yang and Rannala 2006) to estimate divergencetimes using independent data sets 1) the above-mentioned4D data set derived from the WGA as well as 2) a set of theSGOs that included 24 out of 28 sampled taxa (86) and waspartitioned into three codon positions We implemented theHKY85 substitution model multiple fossil-based priors (sup-plementary table 10 Supplementary Material online Mitchell1989 Benton et al 2015 Hedges et al 2015) and independentrates (ldquoclockfrac14 3rdquo) along branches All other parameters were

set as defaults For each MCMCtree analysis we ran the anal-ysis three times with different starting seeds and modified theMarkov chain Monte Carlo (MCMC) length and samplingfrequency in order to achieve proper chain convergencemonitored with Tracer v17 We achieved proper MCMCconvergence on the 4D data set after discarding the first500000 steps as burn-in and sampling every 2000 steps untilwe collected 20000 samples We achieved proper MCMCconvergence on the SGO data set after discarding the first500000 steps as burn-in and sampling every 10000 steps untilwe collected 10000 samples

Demographic AnalysisWe used the PSMC (Li and Durbin 2011) to reconstruct thepopulation history of North Atlantic humpback whales in-cluding the individual sequenced in the current study (down-sampled to 20 coverage) and a second individualsequenced at 17 coverage in Arnason et al (2018) Adetailed description of the PSMC analysis is provided in thesupplementary Methods Supplementary Material online

Nonneutral Substitution Rates in Cetacean GenomesIn order to identify genomic regions controlling cetacean-specific adaptations we used phyloP (Pollard et al 2010) todetect loci in the 12-mammal WGA that depart from neutralexpectations (see supplementary Methods SupplementaryMaterial online) We then collected accelerated regions thatoverlapped human whole gene annotations (hg19) using bed-tools intersect (Quinlan and Hall 2010) and tested for theenrichment of GO terms using the PANTHER analysis toolavailable at the Gene Ontology Consortium website (GOOntology database last accessed June 2017) (GeneOntology Consortium 2015)

Detection of Protein-Coding Genes Subjected toPositive SelectionIn order to measure selective pressures acting on protein-coding genes during cetacean evolution with an emphasison the evolution of cancer suppression we estimated theratio of nonsynonymous to synonymous substitutions (dNdS) To maximize statistical power in pairwise comparisonsgiven the number of available cetacean genomes (six lastaccessed September 2017) we implemented phylogenetictargeting (Arnold and Nunn 2010) assuming a phylogenyfrom a mammalian supertree (Fritz et al 2009) To selectgenome assemblies most suitable for assessing PetorsquosParadox we weighted scores for contrasts with a lot of changein the same direction for both body mass and maximumlongevity Trait values were taken from panTHERIA (Joneset al 2009) and we selected maximal pairings based on thestandardized summed scores We then generated pairwisegenome alignments as described above based on the phylo-genetic targeting results For each pairwise genome align-ment we stitched gene blocks in Galaxy (Blankenberg et al2011) according to the target genome annotations produc-ing alignments of one-to-one orthologs which were filteredto delete frameshift mutations and replace internal stopcodons with gaps We then estimated pairwise dNdS for every

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1759

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

orthologous gene pair with KaKs_Calculator v20 (Wang et al2010) To link genes with dNdS gt1 to potential phenotypeswe used orthologous human Ensembl gene IDs to collect GOterms in BioMart (Kinsella et al 2011) and tested for enrich-ment of overrepresented GO terms

We also used codon-based models to test for selectivepressure variation along branches of the cetacean phylog-eny in comparison to other mammal lineages also knownas the branch-site test (Yang 2007) First the known spe-cies phylogeny (Morgan et al 2013 Tarver et al 2016) waspruned to correspond to the species present in each SGOfamily SGO nucleotide alignments that contained morethan seven species were analyzed for selective pressurevariation This is to reduce the risk of detecting falsepositives (Anisimova et al 2001 2002) In general thebranch-site test is a powerful yet conservative approach(Gharib and Robinson-Rechavi 2013) although modelmisspecification and alignment errors can greatly increasethe number of false positives (Anisimova et al 20012002) Recent studies have concluded that many pub-lished inferences of adaptive evolution using the branchsite test may be affected by artifacts (Venkat et al 2018)Therefore extensive filtering is necessary in order to makereasonably sound conclusions from results of the branchsite test A detailed description of all tested models andthe filtering process are given in the supplementaryMethods Supplementary Material online In total 1152gene families were analyzed We carried out the branch-site test using PAML v44e (Yang 2007) The following fivebranches were assessed as foreground humpback whalethe most recent common ancestor (MRCA) of the com-mon minke and humpback whales MRCA of baleenwhales MRCA of toothed whales and the MRCA of allwhales (cetacean stem lineage) For each model we keptall genes that met a significance threshold of Plt 005 aftera Bonferonni correction for multiple hypothesis testingusing the total number of branch genes (five foregroundbranches1152 genes) We also corrected the raw P-val-ues from the likelihood ratio tests of every gene by theFDR correction where qfrac14 005 (Benjamini and Hochberg1995) The Bonferroni correction is more conservativethan the FDR but sufficient for multiple hypothesis test-ing of lineage-specific positive selection in these casesthe FDR results in higher probabilities of rejecting truenull hypotheses (Anisimova and Yang 2007) Thereforewe used Bonferroni-corrected results in downstreamanalyses but also report FDR-corrected P-values infigure 4C Genes were identified based on the humanortholog (Ensembl gene ID) and we performed gene an-notation enrichment analysis and functional annotationclustering with DAVID v68 (Huang et al 2009a 2009b) aswell as semantic clustering of GO terms using REVIGO(Supek et al 2011) We also searched for interactions ofpositively selected proteins using STRING v105(Szklarczyk et al 2017) with default parameters andtested for the enrichment of overrepresented GO termsas above and for associated mouse phenotypes usingmodPhEA (Weng and Liao 2017)

Data AvailabilityAll data that contributed to the results of the study are madepublicly available The genomic sequencing RNA sequencingas well as the genome assembly for humpback whale(GCA_0043293851) are available under NCBI BioProjectPRJNA509641 The gene annotation orthologous gene setspositive selection results segmental duplication annotationsand whole-genome alignments used in this study are availableat the Harvard Dataverse (httpsdoiorg107910DVNADHX1O)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsField data collection was supported by the Center for CoastalStudies and conducted under NOAA research permit 16325This work was initiated with funding from Dr Jeffrey Pearl atUC San Fransisco This work was supported in part by NIH(Grants U54 CA217376 U2C CA233254 P01 CA91955 R01CA149566 R01 CA170595 R01 CA185138 and R01CA140657) as well as CDMRP Breast Cancer ResearchProgram (Award BC132057) and the Arizona BiomedicalResearch Commission (Grant ADHS18-198847) to CCMThe findings opinions and recommendations expressedhere are those of the authors and not necessarily those ofthe universities where the research was performed or theNational Institutes of Health PJP would like to acknowledgefunding from Stockholm University and the University ofGroningen Startup funds were provided by the School ofInformatics Computing and Cyber Systems at NorthernArizona University (MT) MJOrsquoC would like to thank theUniversity of Leeds funding her 250 Great Minds UniversityAcademic Fellowship The authors thank Wensi Hao(University of Groningen) for technical support AndreaCabrera (University of Groningen) for helpful conversationsabout mutation rates and demography and Richard E Green(UC Santa Cruz) for consultations regarding genome se-quencing assembly and annotation The authors acknowl-edge Research Computing at Arizona State University (httpwwwresearchcomputingasuedu) the Advanced ResearchComputing resources at the University of Leeds and theMonsoon computing cluster at Northern ArizonaUniversity (httpsnaueduhigh-performance-computing)for providing high-performance computing and storageresources that have contributed to this study

ReferencesAbegglen LM Caulin AF Chan A Lee K Robinson R Campbell MS Kiso

WK Schmitt DL Waddell PJ Bhaskara S 2015 Potential mechanismsfor cancer resistance in elephants and comparative cellular responseto DNA damage in humans JAMA 314(17)1850ndash1860

Alkan C Kidd JM Marques-Bonet T Aksay G Antonacci F HormozdiariF Kitzman JO Baker C Malig M Mutlu O 2009 Personalized copynumber and segmental duplication maps using next-generation se-quencing Nat Genet 41(10)1061ndash1067

Tollis et al doi101093molbevmsz099 MBE

1760

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Anisimova M Bielawski JP Yang Z 2001 Accuracy and power of thelikelihood ratio test in detecting adaptive molecular evolution MolBiol Evol 18(8)1585ndash1592

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19(6)950ndash958

Anisimova M Yang Z 2007 Multiple hypothesis testing to detect line-ages under positive selection that affects only a few sites Mol BiolEvol 24(5)1219ndash1228

Arnason U Lammers F Kumar V Nilsson MA Janke A 2018 Whole-genome sequencing of the blue whale and other rorquals findssignatures for introgressive gene flow Sci Adv 4(4)eaap9873

Arnason U Widegren B 1989 Composition and chromosomal localiza-tion of cetacean highly repetitive DNA with special reference to theblue whale Balaenoptera musculus Chromosoma 98(5)323ndash329

Arnold C Nunn CL 2010 Phylogenetic targeting of research effort inevolutionary biology Am Nat 176(5)601ndash612

Baker CS Perry A Bannister JL Weinrich MT Abernethy RBCalambokidis J Lien J Lambertsen RH Ramirez JU Vasquez O1993 Abundant mitochondrial DNA variation and world-wide pop-ulation structure in humpback whales Proc Natl Acad Sci U S A90(17)8239ndash8243

Baker J Meade A Pagel M Venditti C 2015 Adaptive evolution towardlarger size in mammals Proc Natl Acad Sci U S A 112(16)5093ndash5098

Benjamini Y Hochberg Y 1995 Controlling the false discovery rate apractical and powerful approach to multiple testing J R Stat Soc57289ndash300

Benson RBJ Campione NE Carrano MT Mannion PD Sullivan CUpchurch P Evans DC 2014 Rates of dinosaur body mass evolutionindicate 170 million years of sustained ecological innovation on theavian stem lineage PLoS Biol 12(5)e1001853

Benton MJ Donoghue PCJ Asher RJ Friedman M Near TJ Vinther J2015 Constraints on the timescale of animal evolutionary historyPalaeontol Electron 181ndash106

Berta A Sumich JL Kovacs KM 2015 Marine mammals evolutionarybiology 3rd ed Cambridge (MA) Academic Press

Blanchette M Kent WJ Riemer C Elnitski L Smit AFA Roskin KMBaertsch R Rosenbloom K Clawson H Green ED 2004 Aligningmultiple genomic sequences with the threaded blockset alignerGenome Res 14(4)708ndash715

Blankenberg D Taylor J Nekrutenko A Galaxy T 2011 Making wholegenome multiple alignments usable for biologists Bioinformatics27(17)2426ndash2428

Boissinot S Sookdeo A 2016 The evolution of LINE-1 in vertebratesGenome Biol Evol 8(12)3485ndash3507

Bolger AM Lohse M Usadel B 2014 Trimmomatic a flexible trimmer forIllumina sequence data Bioinformatics 30(15)2114ndash2120

Browning CL Wise CF Wise JP 2017 Prolonged particulate chromateexposure does not inhibit homologous recombination repair inNorth Atlantic right whale (Eubalaena glacialis) lung cells ToxicolAppl Pharmacol 33118ndash23

Carbone L Harris RA Gnerre S Veeramah KR Lorente-Galdos BHuddleston J Meyer TJ Herrero J Roos C Aken B et al 2014Gibbon genome and the fast karyotype evolution of small apesNature 513(7517)195ndash201

Caulin AF Graham TA Wang L-S Maley CC 2015 Solutions to Petorsquosparadox revealed by mathematical modelling and cross-species can-cer gene analysis Philos Trans R Soc B 370(1673)20140222

Caulin AF Maley CC 2011 Petorsquos Paradox evolutionrsquos prescription forcancer prevention Trends Ecol Evol (Amst) 26(4)175ndash182

Chapman JA Ho I Sunkara S Luo S Schroth GP Rokhsar DS 2011Meraculous de novo genome assembly with short paired-end readsPLoS One 6(8)e23501

Chittleborough RG 1959 Determination of age in the humpback whaleMegaptera nodosa (Bonnaterre) Aust J Mar Freshw Res10(2)125ndash143

Clapham PJ Mead JG 1999 Megaptera novaeangliae Mamm Species6041ndash9

Clement V Dunand-Sauthier I Clarkson SG 2006 Suppression of UV-induced apoptosis by the human DNA repair protein XPG CellDeath Differ 13(3)478ndash488

Dobin A Davis CA Schlesinger F Drenkow J Zaleski C Jha S Batut PChaisson M Gingeras TR 2013 STAR ultrafast universal RNA-Seqaligner Bioinformatics 29(1)15ndash21

Fay JC Wu C-I 2003 Sequence divergence functional constraint andselection in protein evolution Annu Rev Genomics Hum Genet4213ndash235

Ferris E Abegglen LM Schiffman JD Gregg C 2018 Accelerated evolu-tion in distinctive species reveals candidate elements for clinicallyrelevant traits including mutation and cancer resistance Cell Rep22(10)2742ndash2755

Fog CK Galli GG Lund AH 2012 PRDM proteins important players indifferentiation and disease Bioessays 34(1)50ndash60

Foley NM Hughes GM Huang Z Clarke M Jebb D Whelan CV Petit EJTouzalin F Farcy O Jones G et al 2018 Growing old yet stayingyoung the role of telomeres in batsrsquo exceptional longevity Sci Adv4(2)eaao0926

Foote AD Liu Y Thomas GWC Vinar T Alfoldi J Deng J Dugan Svan Elk CE Hunter ME Joshi V et al 2015 Convergent evolu-tion of the genomes of marine mammals Nat Genet47(3)272ndash275

Forbes SA Beare D Gunasekaran P Leung K Bindal N Boutselakis HDing M Bamford S Cole C Ward S et al 2015 COSMIC exploringthe worldrsquos knowledge of somatic mutations in human cancerNucleic Acids Res 43(D1)D805ndashD811

Fritz SA Bininda-Emonds ORP Purvis A 2009 Geographical variation inpredictors of mammalian extinction risk big is bad but only in thetropics Ecol Lett 12(6)538ndash549

Futreal PA Coin L Marshall M Down T Hubbard T Wooster R RahmanN Stratton MR 2004 A census of human cancer genes Nat RevCancer 4(3)177ndash183

Gabriele C Lockyer C Straley JM Jurasz CM Kato H 2010 Sightinghistory of a naturally marked humpback whale (Megaptera novaean-gliae) suggests ear plug growth layer groups are deposited annuallyMar Mam Sci 26(2)443ndash450

Gauthier JM Metcalfe CD Sears R 1997 Chlorinated organic contam-inants in blubber biopsies from northwestern Atlantic balaenopteridwhales summering in the Gulf of St Lawrence Mar Environ Res44(2)201ndash223

Gene Ontology Consortium 2015 Gene Ontology Consortium goingforward Nucleic Acids Res 43D1049ndashD1056

George JC Bada J Zeh J Scott L Brown SE OrsquoHara T Suydam R 1999Age and growth estimates of bowhead whales (Balaena mysticetus)via aspartic acid racemization Can J Zool 77(4)571ndash580

Gharib WH Robinson-Rechavi M 2013 The branch-site test of positiveselection is surprisingly robust but lacks power under synonymoussubstitution saturation and variation in GC Mol Biol Evol30(7)1675ndash1686

Grabherr MG Haas BJ Yassour M Levin JZ Thompson DA Amit IAdiconis X Fan L Raychowdhury R Zeng Q et al 2011 Full-lengthtranscriptome assembly from RNA-Seq data without a referencegenome Nat Biotechnol 29(7)644ndash652

Guindon S Dufayard J-F Lefort V Anisimova M Hordijk W Gascuel O2010 New algorithms and methods to estimate maximum-likelihood phylogenies assessing the performance of PhyML 30Syst Biol 59(3)307ndash321

Harris RS 2007 Improved pairwise alignment of genomic DNA [PhDthesis] The Pennsylvania State University State College PA USA

Hedges SB Marin J Suleski M Paymer M Kumar S 2015 Tree of lifereveals clock-like speciation and diversification Mol Biol Evol32(4)835ndash845

Heim NA Knope ML Schaal EK Wang SC Payne JL 2015 Copersquos rule inthe evolution of marine animals Science 347(6224)867ndash870

Holt C Yandell M 2011 MAKER2 an annotation pipeline and genome-database management tool for second-generation genome projectsBMC Bioinformatics 12491

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1761

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Huang DW Sherman BT Lempicki RA 2009a Bioinformatics enrich-ment tools paths toward the comprehensive functional analysis oflarge gene lists Nucleic Acids Res 37(1)1ndash13

Huang DW Sherman BT Lempicki RA 2009b Systematic and integra-tive analysis of large gene lists using DAVID bioinformatics resourcesNat Protoc 4(1)44ndash57

Hubisz MJ Pollard KS Siepel A 2011 PHAST and RPHAST phylogeneticanalysis with spacetime models Brief Bioinform 12(1)41ndash51

Hudson RR 1990 Gene genealogies and the coalescent process InFutuyama DJ Antonovics J editors Oxfords surveys in evolutionarybiology Oxford Oxford University Press p 1ndash44

Ichishima H 2016 The ethmoid and presphenoid of cetaceans JMorphol 277(12)1661ndash1674

Jackson JA Baker CS Vant M Steel DJ Medrano-Gonzalez LPalumbi SR 2009 Big and slow phylogenetic estimates of mo-lecular evolution in baleen whales (suborder Mysticeti) MolBiol Evol 26(11)2427ndash2440

Jackson JA Steel DJ Beerli P Congdon BC Olavarria C Leslie MS PomillaC Rosenbaum H Baker CS 2014 Global diversity and oceanic di-vergence of humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 281(1786)20133222

Jones KE Bielby J Cardillo M Fritz SA OrsquoDell J Orme CDL Safi KSechrest W Boakes EH Carbone C et al 2009 PanTHERIA a spe-cies-level database of life history ecology and geography of extantand recently extinct mammals Ecology 90(9)2648ndash2648

Jurka J Kapitonov VV Pavlicek A Klonowski P Kohany O WalichiewiczJ 2005 Repbase Update a database of eukaryotic repetitive ele-ments Cytogenet Genome Res 110(1ndash4)462ndash467

Kaessmann H 2010 Origins evolution and phenotypic impact of newgenes Genome Res 20(10)1313ndash1326

Kastan MB Bartek J 2004 Cell-cycle checkpoints and cancer Nature432(7015)316ndash323

Katona SK Whitehead HP 1981 Identifying humpback whales usingtheir natural markings Polar Rec 20(128)439ndash444

Keane M Semeiks J Webb AE Li YI Quesada V Craig T Madsen LB vanDam S Brawand D Marques PI et al 2015 Insights into the evolu-tion of longevity from the bowhead whale genome Cell Rep10(1)112ndash122

Kent WJ Baertsch R Hinrichs A Miller W Haussler D 2003 Evolutionrsquoscauldron duplication deletion and rearrangement in the mouseand human genomes Proc Natl Acad Sci U S A100(20)11484ndash11489

Kingsolver JG Pfennig DW 2004 Individual-level selection as acause of Copersquos rule of phyletic size increase Evolution58(7)1608ndash1612

Kinsella RJ Keuroaheuroari A Haider S Zamora J Proctor G Spudich G Almeida-King J Staines D Derwent P Kerhornou A et al 2011 EnsemblBioMarts a hub for data retrieval across taxonomic spaceDatabase (Oxford) 2011bar030

Korf I 2004 Gene finding in novel genomes BMC Bioinformatics 559Kumar S Subramanian S 2002 Mutation rates in mammalian genomes

Proc Natl Acad Sci U S A 99(2)803ndash808Lambertsen RH 1987 A biopsy system for large whales and its use for

cytogenetics J Mammal 68(2)443ndash445Langmead B Salzberg SL 2012 Fast gapped-read alignment with Bowtie

2 Nat Methods 9(4)357ndash359Larsen AH Sigurjonsson J Oslashien N Vikingsson G Palsboslashll PJ 1996

Populations genetic analysis of nuclear and mitochondrial loci inskin biopsies collected from central and northeastern NorthAtlantic humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 2631611ndash1618

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357)493ndash496

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup 2009 The Sequence AlignmentMap format andSAMtools Bioinformatics 25(16)2078ndash2079

Martineau D Lemberger K Dallaire A Labelle P Lipscomb TP Michel PMikaelian I 2002 Cancer in wildlife a case study beluga from the St

Lawrence estuary Quebec Canada Environ Health Perspect110(3)285ndash292

Mazet O Rodrıguez W Grusea S Boitard S Chikhi L 2016 On theimportance of being structured instantaneous coalescence ratesand human evolution ndash lesson for ancestral population size infer-ence Heredity 116(4)362ndash371

Michaud EJ Yoder BK 2006 The primary cilium in cell signaling andcancer Cancer Res 66(13)6463ndash6467

Milholland B Dong X Zhang L Hao X Suh Y Vijg J 2017 Differencesbetween germline and somatic mutation rates in humans and miceNat Commun 815183

Mitchell ED 1989 A new cetacean from the Late Eocene La MesetaFormation Seymour Island Antarctic Peninsula Can J Fish Aquat Sci46(12)2219ndash2235

Morgan CC Foster PG Webb AE Pisani D McInerney JO OrsquoConnell MJ2013 Heterogeneous models place the root of the placental mam-mal phylogeny Mol Biol Evol 30(9)2145ndash2156

Nadachowska-Brzyska K Burri R Smeds L Ellegren H 2016 PSMC anal-ysis of effective population sizes in molecular ecology and its appli-cation to black-and-white Ficedula flycatchers Mol Ecol25(5)1058ndash1072

Newman SJ Smith SA 2006 Marine mammal neoplasia a review VetPathol 43(6)865ndash880

Nunney L 2018 Size matters height cell number and a personrsquos risk ofcancer Proc R Soc B 285(1889)20181743

Ohsumi S 1979 Interspecies relationships among some biologicalparameters in cetaceans and estimation of the natural mortalitycoefficient of the Southern Hemisphere minke whales Rep IntWhaling Comm 29397ndash406

OrsquoLeary MA Gatesy J 2008 Impact of increased character sampling onthe phylogeny of Cetartiodactyla (Mammalia) combined analysisincluding fossils Cladistics 24397ndash442

Pagel M Venditti C Meade A 2006 Large punctuational contribution ofspeciation to evolutionary divergence at the molecular level Science314(5796)119ndash121

Palsboslashll PJ Clapham PJ Mattila DK Larsen F Sears R Siegismund HRSigurjonsson J Vasquez O Arctander P 1995 Distribution ofmtDNA haplotypes in North Atlantic humpback whales the influ-ence of behavior on population structure Mar Ecol Prog Ser1161ndash10

Palsboslashll PJ Larsen F Hansen ES 1991 Sampling of skin biopsies fromfree-raging large cetaceans in West Greenland development of newbiopsy tips and bolt designs Int Whaling Comm Spec Issue Ser13311

Palsboslashll PJ Zachariah Peery M Olsen MT Beissinger SR Berube M 2013Inferring recent historic abundance from current genetic diversityMol Ecol 22(1)22ndash40

Parra G Bradnam K Ning Z Keane T Korf I 2009 Assessing the genespace in draft genomes Nucleic Acids Res 37(1)289ndash297

Pertea M Pertea GM Antonescu CM Chang T-C Mendell JT SalzbergSL 2015 StringTie enables improved reconstruction of a transcrip-tome from RNA-Seq reads Nat Biotechnol 33(3)290ndash295

Peto R Roe FJ Lee PN Levy L Clack J 1975 Cancer and ageing in miceand men Br J Cancer 32(4)411ndash426

Pio R Corrales L Lambris JD 2014 The role of complement in tumorgrowth Adv Exp Med Biol 772229ndash262

Pollard KS Hubisz MJ Rosenbloom KR Siepel A 2010 Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 20(1)110ndash121

Putnam NH OrsquoConnell BL Stites JC Rice BJ Blanchette M Calef R TrollCJ Fields A Hartley PD Sugnet CW et al 2016 Chromosome-scaleshotgun assembly using an in vitro method for long-range linkageGenome Res 26(3)342ndash350

Quinlan AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26(6)841ndash842

Ruegg K Rosenbaum HC Anderson EC Engel M Rothschild A Baker CSPalumbi SR 2013 Long-term population size of the North Atlantichumpback whale within the context of worldwide population struc-ture Conserv Genet 14(1)103ndash114

Tollis et al doi101093molbevmsz099 MBE

1762

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Sanderson MJ 2002 Estimating absolute rates of molecular evolutionand divergence times a penalized likelihood approach Mol Biol Evol19(1)101ndash109

Sanderson MJ 2003 r8s inferring absolute rates of molecular evolutionand divergence times in the absence of a molecular clockBioinformatics 19(2)301ndash302

Sayyari E Mirarab S 2016 Fast coalescent-based computation of localbranch support from quartet frequencies Mol Biol Evol33(7)1654ndash1668

Shadat NMA Koide N Khuda II-E Dagvadorj J Tumurkhuu G Naiki YKomatsu T Yoshida T Yokochi T 2010 Retinoblastoma protein-interacting zinc finger 1 (RIZ1) regulates the proliferation of mono-cytic leukemia cells via activation of p53 Cancer Invest28(8)806ndash812

Sim~ao FA Waterhouse RM Ioannidis P Kriventseva EV Zdobnov EM2015 BUSCO assessing genome assembly and annotation com-pleteness with single-copy orthologs Bioinformatics31btv351ndashbtv3212

Slater GJ Goldbogen JA Pyenson ND 2017 Independent evolution ofbaleen whale gigantism linked to Plio-Pleistocene ocean dynamicsProc R Soc B 284(1855)20170546

Smit AFA Hubley RM Green P 2015a RepeatMasker Open-40 2013-2015 Available from httpwwwrepeatmaskerorg

Smit AFA Hubley RM Green P 2015b RepeatModeler Open-10 2008-2015 Available from httpwwwrepeatmaskerorg

Smith TD Allen J Clapham PJ Hammond PS Katona S Larsen F Lien JMattila D Palsboslashll PJ Sigurjonsson J et al 1999 An ocean-basin-widemark-recapture study of the North Atlantic humpback whale(Megaptera novaeangliae) Mar Mamm Sci 51ndash32

Stamatakis A 2014 RAxML version 8 a tool for phylogenetic analysisand post-analysis of large phylogenies Bioinformatics30(9)1312ndash1313

Stanke M Diekhans M Baertsch R Haussler D 2008 Using native andsyntenically mapped cDNA alignments to improve de novo genefinding Bioinformatics 24(5)637ndash644

Steeman ME Hebsgaard MB Fordyce RE Ho SYW Rabosky DL NielsenR Rahbek C Glenner H Soslashrensen MV Willerslev E 2009 Radiationof extant cetaceans driven by restructuring of the oceans Syst Biol58(6)573ndash585

Sulak M Fong L Mika K Chigurupati S Yon L Mongan NP Emes RDLynch VJ 2016 TP53 copy number expansion is associated with theevolution of increased body size and an enhanced DNA damageresponse in elephants Elife 51850

Supek F Bosnjak M Skunca N Smuc T 2011 REVIGO summarizes andvisualizes long lists of gene ontology terms PLoS One 6(7)e21800

Szklarczyk D Morris JH Cook H Kuhn M Wyder S Simonovic M SantosA Doncheva NT Roth A Bork P et al 2017 The STRING database in2017 quality-controlled protein-protein association networks madebroadly accessible Nucleic Acids Res 45(D1)D362ndashD368

Tacutu R Craig T Budovsky A Wuttke D Lehmann G Taranukha DCosta J Fraifeld VE de Magalh~aes JP 2012 Human Ageing GenomicResources integrated databases and tools for the biology and genet-ics of ageing Nucleic Acids Res 41(D1)D1027ndashD1033

Tariq MA Kim HJ Jejelowo O Pourmand N 2011 Whole-transcriptomeRNAseq analysis from minute amount of total RNA Nucleic AcidsRes 39(18)e120

Tarver JE Dos Reis M Mirarab S Moran RJ Parker S OrsquoReilly JE King BLOrsquoConnell MJ Asher RJ Warnow T et al 2016 The interrelationshipsof placental mammals and the limits of phylogenetic inferenceGenome Biol Evol 8(2)330ndash344

Thewissen JGM Cohn MJ Stevens LS Bajpai S Heyning J Horton WE2006 Developmental basis for hind-limb loss in dolphins and origin

of the cetacean bodyplan Proc Natl Acad Sci U S A103(22)8414ndash8418

Thewissen JGM Cooper LN Clementz MT Bajpai S Tiwari BN 2007Whales originated from aquatic artiodactyls in the Eocene epoch ofIndia Nature 450(7173)1190ndash1194

Tollis M Schiffman JD Boddy AM 2017 Evolution of cancer suppressionas revealed by mammalian comparative genomics Curr Opin GenetDev 4240ndash47

Trego KS Groesser T Davalos AR Parplys AC Zhao W Nelson MRHlaing A Shih B Rydberg B Pluth JM et al 2016 Non-catalytic rolesfor XPG with BRCA1 and BRCA2 in homologous recombination andgenome stability Mol Cell 61(4)535ndash546

UniProt Consortium 2015 UniProt a hub for protein informationNucleic Acids Res 43D204ndashD212

Valsecchi E Palsboslashll PJ Hale P Glockner-Ferrari D Ferrari M Clapham PLarsen F Mattila DK Sears R Sigurjonsson J et al 1997 Microsatellitegenetic distances between oceanic opulations of the humpbackwhale (Megaptera novaeangliae) Mol Biol Evol 14355ndash362

Venkat A Hahn MW Thornton JW 2018 Multinucleotide mutationscause false inferences of lineage-specific positive selection Nat EcolEvol 2(8)1280ndash1288

Vermeij GJ 2016 Gigantism and its implications for the history of lifePLoS One 11(1)e0146092

Wang D Zhang Y Zhang Z Zhu J Yu J 2010 KaKs_Calculator 20 atoolkit incorporating gamma-series methods and sliding windowstrategies Genomics Proteomics Bioinformatics 8(1)77ndash80

Warren WC Kuderna L Alexander A Catchen J Perez-Silva JG Lopez-Otın C Quesada V Minx P Tomlinson C Montague MJ et al 2017The novel evolution of the sperm whale genome Genome Biol Evol9(12)3260ndash3264

Webb AE Walsh TA OrsquoConnell MJ 2017 VESPA very large-scale evo-lutionary and selective pressure analyses PeerJ Comput Sci 3e118

Weng M-P Liao B-Y 2017 modPhEA model organism PhenotypeEnrichment Analysis of eukaryotic gene sets Bioinformatics33(21)3505ndash3507

Wilson Sayres MA Venditti C Pagel M Makova KD 2011 Do variationsin substitution rates and male mutation bias correlate with life-history traits A study of 32 mammalian genomes Evolution65(10)2800ndash2815

Yang Z 1998 Likelihood ratio tests for detecting positive selection andapplication to primate lysozyme evolution Mol Biol Evol15(5)568ndash573

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 24(8)1586ndash1591

Yang Z Rannala B 2006 Bayesian estimation of species divergence timesunder a molecular clock using multiple fossil calibrations with softbounds Mol Biol Evol 23(1)212ndash226

Yim H-S Cho YS Guang X Kang SG Jeong J-Y Cha S-S Oh H-M Lee J-HYang EC Kwon KK et al 2014 Minke whale genome and aquaticadaptation in cetaceans Nat Genet 46(1)88ndash92

Yu J Tian S Metheny-Barlow L Chew L-J Hayes AJ Pan H Yu G-L LiL-Y 2001 Modulation of endothelial cell growth arrest andapoptosis by vascular endothelial growth inhibitor Circ Res89(12)1161ndash1167

Zhang C Rabiee M Sayyari E Mirarab S 2018 ASTRAL-III polynomialtime species tree reconstruction from partially resolved gene treesBMC Bioinformatics 19(Suppl 6)153

Zhang Q Edwards SV 2012 The evolution of intron size in amniotes arole for powered flight Genome Biol Evol 4(10)1033ndash1043

Zhao M Kim P Mitra R Zhao J Zhao Z 2015 TSGene 20 an updatedliterature-based knowledgebase for tumor suppressor genes NucleicAcids Res 44gkv1268ndashgkD1031

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1763

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

  • msz099-TF1
  • msz099-TF2
  • msz099-TF3
  • msz099-TF4
  • msz099-TF5
  • msz099-TF6
  • msz099-TF7
  • msz099-TF8
Page 13: University of Groningen Return to the sea, get huge, …...Return to the Sea, Get Huge, Beat Cancer: An Analysis of Cetacean Genomes Including an Assembly for the Humpback Whale (Megaptera

ovarian carcinomas and lymphomas have been detected infin whales (Newman and Smith 2006) Among smaller ceta-ceans one unusually well-documented case study concludedthat 27 of beluga whales (Delphinapterus leucas) found deadin the St Lawrence estuary had cancer which may have con-tributed to 18 of the total mortality in that population(Martineau et al 2002) The authors suggested that thehigh degree of polycyclic aromatic hydrocarbons releasedinto the estuary by nearby industry may have contributedto this elevated cancer risk (Martineau et al 2002) Bycontrast the larger baleen whale species in the Gulf ofSt Lawrence appear to have lower contaminant burdenslikely due to ecological differences (Gauthier et al 1997)Interestingly unlike in human cells homologous recom-bination is uninhibited in North Atlantic right whale lungcells following prolonged exposure to the human lungcarcinogen particulate hexavalent chromium (Browninget al 2017) suggesting adaptations for high-fidelity DNArepair in whales

In this study we provide a de novo reference assemblyfor the humpback whalemdashone of the more well-studiedgiants living on Earth today The humpback whale genomeassembly is highly contiguous and contains a comparablenumber of orthologous genes to other mammalian ge-nome projects Our comparisons with other complete ce-tacean genomes confirm the results of other studies whichconcluded that rorqual whales likely began diversifyingduring the Miocene (Slater et al 2017 Arnason et al2018) We found indications of positive selection onmany protein-coding genes suggestive of adaptive changein pathways controlling the mammalian appendicular andcranial skeletal elements which are relevant to highly spe-cialized cetacean phenotypes as well as in many immunitygenes and pathways that are known to place checks onneoplastic progression LSDs in cetacean genomes containmany genes involved in the control of apoptosis includingknown tumor suppressor genes and skin transcriptomeresults from humpback whale suggest many gene duplica-tions whether through segmental duplication or retro-transposition are transcribed and hence likely functionalWe also use genome-wide evidence to show that germlinemutation rates may be slower in cetaceans than in othermammals which has been suggested in previous studies(Jackson et al 2009) and we suggest as a corollary thatcetacean somatic mutations rates may be lower as wellThese results are consistent with predictions stemmingfrom Petorsquos Paradox (Peto et al 1975 Caulin and Maley2011) which posited that gigantic animals have evolvedcompensatory adaptations to cope with the negativeeffects of orders of magnitude more cells and long lifespansthat increase the number of cell divisions and cancer riskover time Altogether the humpback whale genome as-sembly will aid comparative oncology research that seeksto improve therapeutic targets for human cancers as wellas provide a resource for developing useful genomicmarkers that will aid in the population management andconservation of whales

Materials and Methods

Tissue Collection and DNA ExtractionBiopsy tissue was collected from an adult female humpbackwhale (ldquoSaltrdquo NCBI BioSample SAMN1058501) in the Gulf ofMaine western North Atlantic Ocean using previously de-scribed techniques (Lambertsen 1987 Palsboslashll et al 1991) andflash frozen in liquid nitrogen We extracted DNA from skinusing the protocol for high-molecular-weight genomic DNAisolation with the DNeasy Blood and Tissue purification kit(Qiagen) Humpback whales can be individually identifiedand studied over time based on their unique ventral flukepigmentation (Katona and Whitehead 1981) Salt was specif-ically selected for this study because of her 35-year priorsighting history which is among the lengthiest and detailedfor an individual humpback whale (Center for Coastal Studiesunpublished data)

De Novo Assembly of the Humpback Whale GenomeUsing a combination of paired-end and mate-pair libraries denovo assembly was performed using Meraculous 204(Chapman et al 2011) with a kmer size of 47 Reads weretrimmed for quality sequencing adapters and mate-pairadapters using Trimmomatic (Bolger et al 2014) The genomesize of the humpback whale was estimated using the shortreads by counting the frequency of kmers of length 27 oc-curring in the 180-bp data set estimating the kmer coverageand using the following formula genome size frac14 total kmers kmer coverage

Chicago Library Preparation and SequencingFour Chicago libraries were prepared as described previously(Putnam et al 2016) Briefly for each library500 ng of high-molecular-weight genomic DNA (mean fragment lengthgt50kb) was reconstituted into chromatin in vitro and fixed withformaldehyde Fixed chromatin was digested with MboI orDpnII the 50-overhangs were repaired with biotinylatednucleotides and blunt ends were ligated After ligation cross-links were reversed and the DNA purified from protein Biotinthat was not internal to ligated fragments was removed fromthe purified DNA The DNA was then sheared to 350 bpmean fragment size and sequencing libraries were generatedusing NEBNext Ultra (New England BioLabs) enzymes andIllumina-compatible adapters Biotin-containing fragmentswere isolated using streptavidin beads before PCR enrichmentof each library The libraries were sequenced on an IlluminaHiSeq 2500 platform

Scaffolding the De Novo Assembly with HiRiseThe input de novo assembly shotgun reads and Chicagolibrary reads were used as input data for HiRise a softwarepipeline designed specifically for using Chicago data to scaf-fold genome assemblies (Putnam et al 2016) Shotgun andChicago library sequences were aligned to the draft inputassembly using a modified SNAP read mapper (httpsnapcsberkeleyedu) The separations of Chicago read pairsmapped within draft scaffolds were analyzed by HiRise toproduce a likelihood model for genomic distance between

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1757

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

read pairs and the model was used to identify putative mis-joins and to score prospective joins After scaffolding shotgunsequences were used to close gaps between contigs

Assessing the Gene Content of the Humpback WhaleAssemblyThe expected gene content of the assembly was evaluatedusing the Core Eukaryotic Genes Mapping Approach (Parraet al 2009) which searches the assembly for 458 highly con-served proteins and reports the proportion of 248 of the mosthighly conserved orthologs that are present in the assemblyWe also used the Benchmarking Universal Single CopyOrthologs (BUSCO v201 Sim~ao et al 2015) which analyzesgenome assemblies for the presence of 3023 genes conservedacross vertebrates as well as a set of 6253 genes conservedacross laurasiatherian mammals

Transcriptome Sequencing and AssemblyIn order to aid in our gene-finding efforts for the humpbackwhale genome assembly and to measure gene expression wegenerated transcripts from skin tissue by extracting total RNAusing the QIAzol Lysis Reagent (Qiagen) followed by purifi-cation on RNeasy spin columns (Qiagen) RNA integrity andquantity were determined on the Agilent 2100 Bioanalyzer(Agilent) using the manufacturerrsquos protocol The total RNAwas treated with DNase using DNase mix from the RecoverAllTotal Nucleic Acid Isolation kit (Applied BiosystemsAmbion) The RNA library was prepared and sequenced bythe Genome Technology Center at the University ofCalifornia Santa Cruz including cDNA synthesis with theOvation RNA-Seq system V2 (Nugen) and RNA amplificationas described previously (Tariq et al 2011) We used 05ndash1mgof double-stranded cDNA for library preparation shearedusing the Covaris S2 size-selected for 350ndash450 bp using au-tomated electrophoretic DNA fractionation system(LabChipXT Caliper Life Sciences) Paired-end sequencing li-braries were constructed using Illumina TruSeq DNA SamplePreparation Kit Following library construction samples werequantified using the Bioanalyzer and sequenced on theIllumina HiSeq 2000 platform to produce 2 100 bp se-quencing reads We then used Trinity (Grabherr et al 2011)to assemble the adapter-trimmed RNA-Seq reads intotranscripts

Genome AnnotationWe generated gene models for the humpback whale usingmultiple iterations of MAKER2 (Holt and Yandell 2011) whichincorporated 1) direct evidence from the Trinity-assembledtranscripts 2) homology to NCBI proteins from ten mammals(human mouse dog cow sperm whale bottlenose dolphinorca bowhead whale common minke whale and baiji) andUniProtKBSwiss-Prot (UniProt Consortium 2015) and 3) abinitio gene predictions using SNAP (11292013 release Korf2004) and Augustus v302 (Stanke et al 2008) A detaileddescription of the annotation pipeline is provided in the sup-plementary Methods Supplementary Material online Finalgene calls were annotated functionally by BlastP similarity to

UniProt proteins (UniProt Consortium 2015) with an e-valuecutoff of 1e-6

Repeat Annotation and Evolutionary AnalysisTo analyze the repetitive landscape of the humpback whalegenome we used both database and de novo modelingmethods For the database method we ran RepeatMaskerv405 (httpwwwrepeatmaskerorg accessed August 212017) (Smit et al 2015a) on the final assembly indicatingthe ldquomammaliardquo repeat library from RepBase (Jurka et al2005) For the de novo method we scanned the assemblyfor repeats using RepeatModeler v108 (httpwwwrepeat-maskerorg) (Smit et al 2015b) the results of which were thenclassified using RepeatMasker To estimate evolutionary di-vergence within repeat subfamilies in the humpback whalegenome we generated repeat-family-specific alignments andcalculated the average Kimura-2-parameter divergence fromconsensus within each family correcting for high mutationrates at CpG sites with the calcDivergenceFromAlignplRepeatMasker tool We compared the divergence profile ofhumpback whale and bowhead whale by completing parallelanalyses and the repetitive landscapes of orca and bottlenosedolphin are available from the RepeatMasker server (httpwwwrepeatmaskerorgspecies accessed August 21 2017)

Analysis of Gene Expression Using RNA-SeqSplice-wise mapping of RNA-Seq reads against the humpbackwhale genome assembly and annotation was carried out us-ing STAR v24 (Dobin et al 2013) and we counted the num-ber of reads mapping to gene annotations We also mappedthe skin RNA-Seq data to the database of annotated hump-back whale transcripts using local alignments with bowtiev225 (Langmead and Salzberg 2012) and used stringtiev134 (Pertea et al 2015) to calculate gene abundances bytranscripts per million

Analysis of Segmental Duplications in CetaceanGenomesIn order to detect LSDs in several cetacean genomes we ap-plied an approach based on depth of coverage (Alkan et al2009) To this end we used whole-genome shotgun sequencedata from the current study as well as from other cetaceangenomics projects All data were mapped against the hump-back whale reference assembly A detailed description of thesegmental duplication analysis is provided in the supplemen-tary Methods Supplementary Material online

Whole-Genome AlignmentsWe generated WGAs of 12 mammals (supplementary table 2Supplementary Material online) First we generated pairwisesyntenic alignments of each species as a query to the humangenome (hg19) as a target using LASTZ v102 (Harris 2007)followed by chaining to form gapless blocks and netting torank the highest scoring chains (Kent et al 2003) The pairwisealignments were used to construct a multiple sequence align-ment with MULTIZ v112 (Blanchette et al 2004) with humanas the reference species We filtered the MULTIZ alignment to

Tollis et al doi101093molbevmsz099 MBE

1758

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

only contain aligned blocks from at least 10 out of the 12species (81 complete)

Phylogenetic Reconstruction Using Single-CopyOrthologsWe downloaded the coding DNA sequences from 28 publiclyavailable mammalian genome assemblies (supplementary ta-ble 9 Supplementary Material online) and used VESPA(Webb et al 2017) to obtain high-confidence SGOs (supple-mentary Methods Supplementary Material online) For phy-logenetic analysis we filtered the SGO data set to include onlyloci that were represented by at least 24 out of the 28 mam-malian species (86 complete) and reconstructed each genetree using maximum likelihood in PhyML v30 (Guindon et al2010) with an HKY85 substitution model and 100 bootstrapreplicates to assess branch support The gene trees were thenbinned and used to reconstruct a species tree using the ac-curate species tree algorithm (ASTRAL-III v56 Zhang et al2018) ASTRAL utilizes the multispecies coalescent modelthat incorporates incomplete lineage sorting and finds thespecies tree stemming from bipartitions predefined by thegene trees Branch support for the species tree was assessedwith local posterior probabilities and branch lengths werepresented in coalescent units where shorter branch lengthsindicate greater gene tree discordance (Sayyari and Mirarab2016)

Rates of Molecular Evolution and Divergence TimeEstimationWe used multiple approaches on independent data sets toestimate rates of molecular evolution and the divergencetimes of the major mammalian lineages including six modernwhales with complete genome assemblies We first focusedon 4-fold degenerate (4D) sites which are positions withincodon alignments where substitutions result in no aminoacid change and can be used to approximate the neutralrate of evolution (Kumar and Subramanian 2002) We usedthe Ensembl human gene annotation to extract codingregions from the 12-mammal WGA using msa_view inPHAST v14 (Hubisz et al 2011) We reconstructed the phy-logeny with the 4D data as a single partition in RAxML v83(Stamatakis 2014) under the GTRGAMMA substitutionmodel and assessed branch support with 10000 bootstrapsRates of molecular evolution were estimated on the 4D dataset with the semiparametric PL method implemented in r8sv18 (Sanderson 2002 2003) A detailed description of the PLmethod is given in the supplementary MethodsSupplementary Material online

We also used the approximate likelihood calculation inMCMCtree (Yang and Rannala 2006) to estimate divergencetimes using independent data sets 1) the above-mentioned4D data set derived from the WGA as well as 2) a set of theSGOs that included 24 out of 28 sampled taxa (86) and waspartitioned into three codon positions We implemented theHKY85 substitution model multiple fossil-based priors (sup-plementary table 10 Supplementary Material online Mitchell1989 Benton et al 2015 Hedges et al 2015) and independentrates (ldquoclockfrac14 3rdquo) along branches All other parameters were

set as defaults For each MCMCtree analysis we ran the anal-ysis three times with different starting seeds and modified theMarkov chain Monte Carlo (MCMC) length and samplingfrequency in order to achieve proper chain convergencemonitored with Tracer v17 We achieved proper MCMCconvergence on the 4D data set after discarding the first500000 steps as burn-in and sampling every 2000 steps untilwe collected 20000 samples We achieved proper MCMCconvergence on the SGO data set after discarding the first500000 steps as burn-in and sampling every 10000 steps untilwe collected 10000 samples

Demographic AnalysisWe used the PSMC (Li and Durbin 2011) to reconstruct thepopulation history of North Atlantic humpback whales in-cluding the individual sequenced in the current study (down-sampled to 20 coverage) and a second individualsequenced at 17 coverage in Arnason et al (2018) Adetailed description of the PSMC analysis is provided in thesupplementary Methods Supplementary Material online

Nonneutral Substitution Rates in Cetacean GenomesIn order to identify genomic regions controlling cetacean-specific adaptations we used phyloP (Pollard et al 2010) todetect loci in the 12-mammal WGA that depart from neutralexpectations (see supplementary Methods SupplementaryMaterial online) We then collected accelerated regions thatoverlapped human whole gene annotations (hg19) using bed-tools intersect (Quinlan and Hall 2010) and tested for theenrichment of GO terms using the PANTHER analysis toolavailable at the Gene Ontology Consortium website (GOOntology database last accessed June 2017) (GeneOntology Consortium 2015)

Detection of Protein-Coding Genes Subjected toPositive SelectionIn order to measure selective pressures acting on protein-coding genes during cetacean evolution with an emphasison the evolution of cancer suppression we estimated theratio of nonsynonymous to synonymous substitutions (dNdS) To maximize statistical power in pairwise comparisonsgiven the number of available cetacean genomes (six lastaccessed September 2017) we implemented phylogenetictargeting (Arnold and Nunn 2010) assuming a phylogenyfrom a mammalian supertree (Fritz et al 2009) To selectgenome assemblies most suitable for assessing PetorsquosParadox we weighted scores for contrasts with a lot of changein the same direction for both body mass and maximumlongevity Trait values were taken from panTHERIA (Joneset al 2009) and we selected maximal pairings based on thestandardized summed scores We then generated pairwisegenome alignments as described above based on the phylo-genetic targeting results For each pairwise genome align-ment we stitched gene blocks in Galaxy (Blankenberg et al2011) according to the target genome annotations produc-ing alignments of one-to-one orthologs which were filteredto delete frameshift mutations and replace internal stopcodons with gaps We then estimated pairwise dNdS for every

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1759

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

orthologous gene pair with KaKs_Calculator v20 (Wang et al2010) To link genes with dNdS gt1 to potential phenotypeswe used orthologous human Ensembl gene IDs to collect GOterms in BioMart (Kinsella et al 2011) and tested for enrich-ment of overrepresented GO terms

We also used codon-based models to test for selectivepressure variation along branches of the cetacean phylog-eny in comparison to other mammal lineages also knownas the branch-site test (Yang 2007) First the known spe-cies phylogeny (Morgan et al 2013 Tarver et al 2016) waspruned to correspond to the species present in each SGOfamily SGO nucleotide alignments that contained morethan seven species were analyzed for selective pressurevariation This is to reduce the risk of detecting falsepositives (Anisimova et al 2001 2002) In general thebranch-site test is a powerful yet conservative approach(Gharib and Robinson-Rechavi 2013) although modelmisspecification and alignment errors can greatly increasethe number of false positives (Anisimova et al 20012002) Recent studies have concluded that many pub-lished inferences of adaptive evolution using the branchsite test may be affected by artifacts (Venkat et al 2018)Therefore extensive filtering is necessary in order to makereasonably sound conclusions from results of the branchsite test A detailed description of all tested models andthe filtering process are given in the supplementaryMethods Supplementary Material online In total 1152gene families were analyzed We carried out the branch-site test using PAML v44e (Yang 2007) The following fivebranches were assessed as foreground humpback whalethe most recent common ancestor (MRCA) of the com-mon minke and humpback whales MRCA of baleenwhales MRCA of toothed whales and the MRCA of allwhales (cetacean stem lineage) For each model we keptall genes that met a significance threshold of Plt 005 aftera Bonferonni correction for multiple hypothesis testingusing the total number of branch genes (five foregroundbranches1152 genes) We also corrected the raw P-val-ues from the likelihood ratio tests of every gene by theFDR correction where qfrac14 005 (Benjamini and Hochberg1995) The Bonferroni correction is more conservativethan the FDR but sufficient for multiple hypothesis test-ing of lineage-specific positive selection in these casesthe FDR results in higher probabilities of rejecting truenull hypotheses (Anisimova and Yang 2007) Thereforewe used Bonferroni-corrected results in downstreamanalyses but also report FDR-corrected P-values infigure 4C Genes were identified based on the humanortholog (Ensembl gene ID) and we performed gene an-notation enrichment analysis and functional annotationclustering with DAVID v68 (Huang et al 2009a 2009b) aswell as semantic clustering of GO terms using REVIGO(Supek et al 2011) We also searched for interactions ofpositively selected proteins using STRING v105(Szklarczyk et al 2017) with default parameters andtested for the enrichment of overrepresented GO termsas above and for associated mouse phenotypes usingmodPhEA (Weng and Liao 2017)

Data AvailabilityAll data that contributed to the results of the study are madepublicly available The genomic sequencing RNA sequencingas well as the genome assembly for humpback whale(GCA_0043293851) are available under NCBI BioProjectPRJNA509641 The gene annotation orthologous gene setspositive selection results segmental duplication annotationsand whole-genome alignments used in this study are availableat the Harvard Dataverse (httpsdoiorg107910DVNADHX1O)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsField data collection was supported by the Center for CoastalStudies and conducted under NOAA research permit 16325This work was initiated with funding from Dr Jeffrey Pearl atUC San Fransisco This work was supported in part by NIH(Grants U54 CA217376 U2C CA233254 P01 CA91955 R01CA149566 R01 CA170595 R01 CA185138 and R01CA140657) as well as CDMRP Breast Cancer ResearchProgram (Award BC132057) and the Arizona BiomedicalResearch Commission (Grant ADHS18-198847) to CCMThe findings opinions and recommendations expressedhere are those of the authors and not necessarily those ofthe universities where the research was performed or theNational Institutes of Health PJP would like to acknowledgefunding from Stockholm University and the University ofGroningen Startup funds were provided by the School ofInformatics Computing and Cyber Systems at NorthernArizona University (MT) MJOrsquoC would like to thank theUniversity of Leeds funding her 250 Great Minds UniversityAcademic Fellowship The authors thank Wensi Hao(University of Groningen) for technical support AndreaCabrera (University of Groningen) for helpful conversationsabout mutation rates and demography and Richard E Green(UC Santa Cruz) for consultations regarding genome se-quencing assembly and annotation The authors acknowl-edge Research Computing at Arizona State University (httpwwwresearchcomputingasuedu) the Advanced ResearchComputing resources at the University of Leeds and theMonsoon computing cluster at Northern ArizonaUniversity (httpsnaueduhigh-performance-computing)for providing high-performance computing and storageresources that have contributed to this study

ReferencesAbegglen LM Caulin AF Chan A Lee K Robinson R Campbell MS Kiso

WK Schmitt DL Waddell PJ Bhaskara S 2015 Potential mechanismsfor cancer resistance in elephants and comparative cellular responseto DNA damage in humans JAMA 314(17)1850ndash1860

Alkan C Kidd JM Marques-Bonet T Aksay G Antonacci F HormozdiariF Kitzman JO Baker C Malig M Mutlu O 2009 Personalized copynumber and segmental duplication maps using next-generation se-quencing Nat Genet 41(10)1061ndash1067

Tollis et al doi101093molbevmsz099 MBE

1760

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Anisimova M Bielawski JP Yang Z 2001 Accuracy and power of thelikelihood ratio test in detecting adaptive molecular evolution MolBiol Evol 18(8)1585ndash1592

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19(6)950ndash958

Anisimova M Yang Z 2007 Multiple hypothesis testing to detect line-ages under positive selection that affects only a few sites Mol BiolEvol 24(5)1219ndash1228

Arnason U Lammers F Kumar V Nilsson MA Janke A 2018 Whole-genome sequencing of the blue whale and other rorquals findssignatures for introgressive gene flow Sci Adv 4(4)eaap9873

Arnason U Widegren B 1989 Composition and chromosomal localiza-tion of cetacean highly repetitive DNA with special reference to theblue whale Balaenoptera musculus Chromosoma 98(5)323ndash329

Arnold C Nunn CL 2010 Phylogenetic targeting of research effort inevolutionary biology Am Nat 176(5)601ndash612

Baker CS Perry A Bannister JL Weinrich MT Abernethy RBCalambokidis J Lien J Lambertsen RH Ramirez JU Vasquez O1993 Abundant mitochondrial DNA variation and world-wide pop-ulation structure in humpback whales Proc Natl Acad Sci U S A90(17)8239ndash8243

Baker J Meade A Pagel M Venditti C 2015 Adaptive evolution towardlarger size in mammals Proc Natl Acad Sci U S A 112(16)5093ndash5098

Benjamini Y Hochberg Y 1995 Controlling the false discovery rate apractical and powerful approach to multiple testing J R Stat Soc57289ndash300

Benson RBJ Campione NE Carrano MT Mannion PD Sullivan CUpchurch P Evans DC 2014 Rates of dinosaur body mass evolutionindicate 170 million years of sustained ecological innovation on theavian stem lineage PLoS Biol 12(5)e1001853

Benton MJ Donoghue PCJ Asher RJ Friedman M Near TJ Vinther J2015 Constraints on the timescale of animal evolutionary historyPalaeontol Electron 181ndash106

Berta A Sumich JL Kovacs KM 2015 Marine mammals evolutionarybiology 3rd ed Cambridge (MA) Academic Press

Blanchette M Kent WJ Riemer C Elnitski L Smit AFA Roskin KMBaertsch R Rosenbloom K Clawson H Green ED 2004 Aligningmultiple genomic sequences with the threaded blockset alignerGenome Res 14(4)708ndash715

Blankenberg D Taylor J Nekrutenko A Galaxy T 2011 Making wholegenome multiple alignments usable for biologists Bioinformatics27(17)2426ndash2428

Boissinot S Sookdeo A 2016 The evolution of LINE-1 in vertebratesGenome Biol Evol 8(12)3485ndash3507

Bolger AM Lohse M Usadel B 2014 Trimmomatic a flexible trimmer forIllumina sequence data Bioinformatics 30(15)2114ndash2120

Browning CL Wise CF Wise JP 2017 Prolonged particulate chromateexposure does not inhibit homologous recombination repair inNorth Atlantic right whale (Eubalaena glacialis) lung cells ToxicolAppl Pharmacol 33118ndash23

Carbone L Harris RA Gnerre S Veeramah KR Lorente-Galdos BHuddleston J Meyer TJ Herrero J Roos C Aken B et al 2014Gibbon genome and the fast karyotype evolution of small apesNature 513(7517)195ndash201

Caulin AF Graham TA Wang L-S Maley CC 2015 Solutions to Petorsquosparadox revealed by mathematical modelling and cross-species can-cer gene analysis Philos Trans R Soc B 370(1673)20140222

Caulin AF Maley CC 2011 Petorsquos Paradox evolutionrsquos prescription forcancer prevention Trends Ecol Evol (Amst) 26(4)175ndash182

Chapman JA Ho I Sunkara S Luo S Schroth GP Rokhsar DS 2011Meraculous de novo genome assembly with short paired-end readsPLoS One 6(8)e23501

Chittleborough RG 1959 Determination of age in the humpback whaleMegaptera nodosa (Bonnaterre) Aust J Mar Freshw Res10(2)125ndash143

Clapham PJ Mead JG 1999 Megaptera novaeangliae Mamm Species6041ndash9

Clement V Dunand-Sauthier I Clarkson SG 2006 Suppression of UV-induced apoptosis by the human DNA repair protein XPG CellDeath Differ 13(3)478ndash488

Dobin A Davis CA Schlesinger F Drenkow J Zaleski C Jha S Batut PChaisson M Gingeras TR 2013 STAR ultrafast universal RNA-Seqaligner Bioinformatics 29(1)15ndash21

Fay JC Wu C-I 2003 Sequence divergence functional constraint andselection in protein evolution Annu Rev Genomics Hum Genet4213ndash235

Ferris E Abegglen LM Schiffman JD Gregg C 2018 Accelerated evolu-tion in distinctive species reveals candidate elements for clinicallyrelevant traits including mutation and cancer resistance Cell Rep22(10)2742ndash2755

Fog CK Galli GG Lund AH 2012 PRDM proteins important players indifferentiation and disease Bioessays 34(1)50ndash60

Foley NM Hughes GM Huang Z Clarke M Jebb D Whelan CV Petit EJTouzalin F Farcy O Jones G et al 2018 Growing old yet stayingyoung the role of telomeres in batsrsquo exceptional longevity Sci Adv4(2)eaao0926

Foote AD Liu Y Thomas GWC Vinar T Alfoldi J Deng J Dugan Svan Elk CE Hunter ME Joshi V et al 2015 Convergent evolu-tion of the genomes of marine mammals Nat Genet47(3)272ndash275

Forbes SA Beare D Gunasekaran P Leung K Bindal N Boutselakis HDing M Bamford S Cole C Ward S et al 2015 COSMIC exploringthe worldrsquos knowledge of somatic mutations in human cancerNucleic Acids Res 43(D1)D805ndashD811

Fritz SA Bininda-Emonds ORP Purvis A 2009 Geographical variation inpredictors of mammalian extinction risk big is bad but only in thetropics Ecol Lett 12(6)538ndash549

Futreal PA Coin L Marshall M Down T Hubbard T Wooster R RahmanN Stratton MR 2004 A census of human cancer genes Nat RevCancer 4(3)177ndash183

Gabriele C Lockyer C Straley JM Jurasz CM Kato H 2010 Sightinghistory of a naturally marked humpback whale (Megaptera novaean-gliae) suggests ear plug growth layer groups are deposited annuallyMar Mam Sci 26(2)443ndash450

Gauthier JM Metcalfe CD Sears R 1997 Chlorinated organic contam-inants in blubber biopsies from northwestern Atlantic balaenopteridwhales summering in the Gulf of St Lawrence Mar Environ Res44(2)201ndash223

Gene Ontology Consortium 2015 Gene Ontology Consortium goingforward Nucleic Acids Res 43D1049ndashD1056

George JC Bada J Zeh J Scott L Brown SE OrsquoHara T Suydam R 1999Age and growth estimates of bowhead whales (Balaena mysticetus)via aspartic acid racemization Can J Zool 77(4)571ndash580

Gharib WH Robinson-Rechavi M 2013 The branch-site test of positiveselection is surprisingly robust but lacks power under synonymoussubstitution saturation and variation in GC Mol Biol Evol30(7)1675ndash1686

Grabherr MG Haas BJ Yassour M Levin JZ Thompson DA Amit IAdiconis X Fan L Raychowdhury R Zeng Q et al 2011 Full-lengthtranscriptome assembly from RNA-Seq data without a referencegenome Nat Biotechnol 29(7)644ndash652

Guindon S Dufayard J-F Lefort V Anisimova M Hordijk W Gascuel O2010 New algorithms and methods to estimate maximum-likelihood phylogenies assessing the performance of PhyML 30Syst Biol 59(3)307ndash321

Harris RS 2007 Improved pairwise alignment of genomic DNA [PhDthesis] The Pennsylvania State University State College PA USA

Hedges SB Marin J Suleski M Paymer M Kumar S 2015 Tree of lifereveals clock-like speciation and diversification Mol Biol Evol32(4)835ndash845

Heim NA Knope ML Schaal EK Wang SC Payne JL 2015 Copersquos rule inthe evolution of marine animals Science 347(6224)867ndash870

Holt C Yandell M 2011 MAKER2 an annotation pipeline and genome-database management tool for second-generation genome projectsBMC Bioinformatics 12491

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1761

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Huang DW Sherman BT Lempicki RA 2009a Bioinformatics enrich-ment tools paths toward the comprehensive functional analysis oflarge gene lists Nucleic Acids Res 37(1)1ndash13

Huang DW Sherman BT Lempicki RA 2009b Systematic and integra-tive analysis of large gene lists using DAVID bioinformatics resourcesNat Protoc 4(1)44ndash57

Hubisz MJ Pollard KS Siepel A 2011 PHAST and RPHAST phylogeneticanalysis with spacetime models Brief Bioinform 12(1)41ndash51

Hudson RR 1990 Gene genealogies and the coalescent process InFutuyama DJ Antonovics J editors Oxfords surveys in evolutionarybiology Oxford Oxford University Press p 1ndash44

Ichishima H 2016 The ethmoid and presphenoid of cetaceans JMorphol 277(12)1661ndash1674

Jackson JA Baker CS Vant M Steel DJ Medrano-Gonzalez LPalumbi SR 2009 Big and slow phylogenetic estimates of mo-lecular evolution in baleen whales (suborder Mysticeti) MolBiol Evol 26(11)2427ndash2440

Jackson JA Steel DJ Beerli P Congdon BC Olavarria C Leslie MS PomillaC Rosenbaum H Baker CS 2014 Global diversity and oceanic di-vergence of humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 281(1786)20133222

Jones KE Bielby J Cardillo M Fritz SA OrsquoDell J Orme CDL Safi KSechrest W Boakes EH Carbone C et al 2009 PanTHERIA a spe-cies-level database of life history ecology and geography of extantand recently extinct mammals Ecology 90(9)2648ndash2648

Jurka J Kapitonov VV Pavlicek A Klonowski P Kohany O WalichiewiczJ 2005 Repbase Update a database of eukaryotic repetitive ele-ments Cytogenet Genome Res 110(1ndash4)462ndash467

Kaessmann H 2010 Origins evolution and phenotypic impact of newgenes Genome Res 20(10)1313ndash1326

Kastan MB Bartek J 2004 Cell-cycle checkpoints and cancer Nature432(7015)316ndash323

Katona SK Whitehead HP 1981 Identifying humpback whales usingtheir natural markings Polar Rec 20(128)439ndash444

Keane M Semeiks J Webb AE Li YI Quesada V Craig T Madsen LB vanDam S Brawand D Marques PI et al 2015 Insights into the evolu-tion of longevity from the bowhead whale genome Cell Rep10(1)112ndash122

Kent WJ Baertsch R Hinrichs A Miller W Haussler D 2003 Evolutionrsquoscauldron duplication deletion and rearrangement in the mouseand human genomes Proc Natl Acad Sci U S A100(20)11484ndash11489

Kingsolver JG Pfennig DW 2004 Individual-level selection as acause of Copersquos rule of phyletic size increase Evolution58(7)1608ndash1612

Kinsella RJ Keuroaheuroari A Haider S Zamora J Proctor G Spudich G Almeida-King J Staines D Derwent P Kerhornou A et al 2011 EnsemblBioMarts a hub for data retrieval across taxonomic spaceDatabase (Oxford) 2011bar030

Korf I 2004 Gene finding in novel genomes BMC Bioinformatics 559Kumar S Subramanian S 2002 Mutation rates in mammalian genomes

Proc Natl Acad Sci U S A 99(2)803ndash808Lambertsen RH 1987 A biopsy system for large whales and its use for

cytogenetics J Mammal 68(2)443ndash445Langmead B Salzberg SL 2012 Fast gapped-read alignment with Bowtie

2 Nat Methods 9(4)357ndash359Larsen AH Sigurjonsson J Oslashien N Vikingsson G Palsboslashll PJ 1996

Populations genetic analysis of nuclear and mitochondrial loci inskin biopsies collected from central and northeastern NorthAtlantic humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 2631611ndash1618

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357)493ndash496

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup 2009 The Sequence AlignmentMap format andSAMtools Bioinformatics 25(16)2078ndash2079

Martineau D Lemberger K Dallaire A Labelle P Lipscomb TP Michel PMikaelian I 2002 Cancer in wildlife a case study beluga from the St

Lawrence estuary Quebec Canada Environ Health Perspect110(3)285ndash292

Mazet O Rodrıguez W Grusea S Boitard S Chikhi L 2016 On theimportance of being structured instantaneous coalescence ratesand human evolution ndash lesson for ancestral population size infer-ence Heredity 116(4)362ndash371

Michaud EJ Yoder BK 2006 The primary cilium in cell signaling andcancer Cancer Res 66(13)6463ndash6467

Milholland B Dong X Zhang L Hao X Suh Y Vijg J 2017 Differencesbetween germline and somatic mutation rates in humans and miceNat Commun 815183

Mitchell ED 1989 A new cetacean from the Late Eocene La MesetaFormation Seymour Island Antarctic Peninsula Can J Fish Aquat Sci46(12)2219ndash2235

Morgan CC Foster PG Webb AE Pisani D McInerney JO OrsquoConnell MJ2013 Heterogeneous models place the root of the placental mam-mal phylogeny Mol Biol Evol 30(9)2145ndash2156

Nadachowska-Brzyska K Burri R Smeds L Ellegren H 2016 PSMC anal-ysis of effective population sizes in molecular ecology and its appli-cation to black-and-white Ficedula flycatchers Mol Ecol25(5)1058ndash1072

Newman SJ Smith SA 2006 Marine mammal neoplasia a review VetPathol 43(6)865ndash880

Nunney L 2018 Size matters height cell number and a personrsquos risk ofcancer Proc R Soc B 285(1889)20181743

Ohsumi S 1979 Interspecies relationships among some biologicalparameters in cetaceans and estimation of the natural mortalitycoefficient of the Southern Hemisphere minke whales Rep IntWhaling Comm 29397ndash406

OrsquoLeary MA Gatesy J 2008 Impact of increased character sampling onthe phylogeny of Cetartiodactyla (Mammalia) combined analysisincluding fossils Cladistics 24397ndash442

Pagel M Venditti C Meade A 2006 Large punctuational contribution ofspeciation to evolutionary divergence at the molecular level Science314(5796)119ndash121

Palsboslashll PJ Clapham PJ Mattila DK Larsen F Sears R Siegismund HRSigurjonsson J Vasquez O Arctander P 1995 Distribution ofmtDNA haplotypes in North Atlantic humpback whales the influ-ence of behavior on population structure Mar Ecol Prog Ser1161ndash10

Palsboslashll PJ Larsen F Hansen ES 1991 Sampling of skin biopsies fromfree-raging large cetaceans in West Greenland development of newbiopsy tips and bolt designs Int Whaling Comm Spec Issue Ser13311

Palsboslashll PJ Zachariah Peery M Olsen MT Beissinger SR Berube M 2013Inferring recent historic abundance from current genetic diversityMol Ecol 22(1)22ndash40

Parra G Bradnam K Ning Z Keane T Korf I 2009 Assessing the genespace in draft genomes Nucleic Acids Res 37(1)289ndash297

Pertea M Pertea GM Antonescu CM Chang T-C Mendell JT SalzbergSL 2015 StringTie enables improved reconstruction of a transcrip-tome from RNA-Seq reads Nat Biotechnol 33(3)290ndash295

Peto R Roe FJ Lee PN Levy L Clack J 1975 Cancer and ageing in miceand men Br J Cancer 32(4)411ndash426

Pio R Corrales L Lambris JD 2014 The role of complement in tumorgrowth Adv Exp Med Biol 772229ndash262

Pollard KS Hubisz MJ Rosenbloom KR Siepel A 2010 Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 20(1)110ndash121

Putnam NH OrsquoConnell BL Stites JC Rice BJ Blanchette M Calef R TrollCJ Fields A Hartley PD Sugnet CW et al 2016 Chromosome-scaleshotgun assembly using an in vitro method for long-range linkageGenome Res 26(3)342ndash350

Quinlan AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26(6)841ndash842

Ruegg K Rosenbaum HC Anderson EC Engel M Rothschild A Baker CSPalumbi SR 2013 Long-term population size of the North Atlantichumpback whale within the context of worldwide population struc-ture Conserv Genet 14(1)103ndash114

Tollis et al doi101093molbevmsz099 MBE

1762

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Sanderson MJ 2002 Estimating absolute rates of molecular evolutionand divergence times a penalized likelihood approach Mol Biol Evol19(1)101ndash109

Sanderson MJ 2003 r8s inferring absolute rates of molecular evolutionand divergence times in the absence of a molecular clockBioinformatics 19(2)301ndash302

Sayyari E Mirarab S 2016 Fast coalescent-based computation of localbranch support from quartet frequencies Mol Biol Evol33(7)1654ndash1668

Shadat NMA Koide N Khuda II-E Dagvadorj J Tumurkhuu G Naiki YKomatsu T Yoshida T Yokochi T 2010 Retinoblastoma protein-interacting zinc finger 1 (RIZ1) regulates the proliferation of mono-cytic leukemia cells via activation of p53 Cancer Invest28(8)806ndash812

Sim~ao FA Waterhouse RM Ioannidis P Kriventseva EV Zdobnov EM2015 BUSCO assessing genome assembly and annotation com-pleteness with single-copy orthologs Bioinformatics31btv351ndashbtv3212

Slater GJ Goldbogen JA Pyenson ND 2017 Independent evolution ofbaleen whale gigantism linked to Plio-Pleistocene ocean dynamicsProc R Soc B 284(1855)20170546

Smit AFA Hubley RM Green P 2015a RepeatMasker Open-40 2013-2015 Available from httpwwwrepeatmaskerorg

Smit AFA Hubley RM Green P 2015b RepeatModeler Open-10 2008-2015 Available from httpwwwrepeatmaskerorg

Smith TD Allen J Clapham PJ Hammond PS Katona S Larsen F Lien JMattila D Palsboslashll PJ Sigurjonsson J et al 1999 An ocean-basin-widemark-recapture study of the North Atlantic humpback whale(Megaptera novaeangliae) Mar Mamm Sci 51ndash32

Stamatakis A 2014 RAxML version 8 a tool for phylogenetic analysisand post-analysis of large phylogenies Bioinformatics30(9)1312ndash1313

Stanke M Diekhans M Baertsch R Haussler D 2008 Using native andsyntenically mapped cDNA alignments to improve de novo genefinding Bioinformatics 24(5)637ndash644

Steeman ME Hebsgaard MB Fordyce RE Ho SYW Rabosky DL NielsenR Rahbek C Glenner H Soslashrensen MV Willerslev E 2009 Radiationof extant cetaceans driven by restructuring of the oceans Syst Biol58(6)573ndash585

Sulak M Fong L Mika K Chigurupati S Yon L Mongan NP Emes RDLynch VJ 2016 TP53 copy number expansion is associated with theevolution of increased body size and an enhanced DNA damageresponse in elephants Elife 51850

Supek F Bosnjak M Skunca N Smuc T 2011 REVIGO summarizes andvisualizes long lists of gene ontology terms PLoS One 6(7)e21800

Szklarczyk D Morris JH Cook H Kuhn M Wyder S Simonovic M SantosA Doncheva NT Roth A Bork P et al 2017 The STRING database in2017 quality-controlled protein-protein association networks madebroadly accessible Nucleic Acids Res 45(D1)D362ndashD368

Tacutu R Craig T Budovsky A Wuttke D Lehmann G Taranukha DCosta J Fraifeld VE de Magalh~aes JP 2012 Human Ageing GenomicResources integrated databases and tools for the biology and genet-ics of ageing Nucleic Acids Res 41(D1)D1027ndashD1033

Tariq MA Kim HJ Jejelowo O Pourmand N 2011 Whole-transcriptomeRNAseq analysis from minute amount of total RNA Nucleic AcidsRes 39(18)e120

Tarver JE Dos Reis M Mirarab S Moran RJ Parker S OrsquoReilly JE King BLOrsquoConnell MJ Asher RJ Warnow T et al 2016 The interrelationshipsof placental mammals and the limits of phylogenetic inferenceGenome Biol Evol 8(2)330ndash344

Thewissen JGM Cohn MJ Stevens LS Bajpai S Heyning J Horton WE2006 Developmental basis for hind-limb loss in dolphins and origin

of the cetacean bodyplan Proc Natl Acad Sci U S A103(22)8414ndash8418

Thewissen JGM Cooper LN Clementz MT Bajpai S Tiwari BN 2007Whales originated from aquatic artiodactyls in the Eocene epoch ofIndia Nature 450(7173)1190ndash1194

Tollis M Schiffman JD Boddy AM 2017 Evolution of cancer suppressionas revealed by mammalian comparative genomics Curr Opin GenetDev 4240ndash47

Trego KS Groesser T Davalos AR Parplys AC Zhao W Nelson MRHlaing A Shih B Rydberg B Pluth JM et al 2016 Non-catalytic rolesfor XPG with BRCA1 and BRCA2 in homologous recombination andgenome stability Mol Cell 61(4)535ndash546

UniProt Consortium 2015 UniProt a hub for protein informationNucleic Acids Res 43D204ndashD212

Valsecchi E Palsboslashll PJ Hale P Glockner-Ferrari D Ferrari M Clapham PLarsen F Mattila DK Sears R Sigurjonsson J et al 1997 Microsatellitegenetic distances between oceanic opulations of the humpbackwhale (Megaptera novaeangliae) Mol Biol Evol 14355ndash362

Venkat A Hahn MW Thornton JW 2018 Multinucleotide mutationscause false inferences of lineage-specific positive selection Nat EcolEvol 2(8)1280ndash1288

Vermeij GJ 2016 Gigantism and its implications for the history of lifePLoS One 11(1)e0146092

Wang D Zhang Y Zhang Z Zhu J Yu J 2010 KaKs_Calculator 20 atoolkit incorporating gamma-series methods and sliding windowstrategies Genomics Proteomics Bioinformatics 8(1)77ndash80

Warren WC Kuderna L Alexander A Catchen J Perez-Silva JG Lopez-Otın C Quesada V Minx P Tomlinson C Montague MJ et al 2017The novel evolution of the sperm whale genome Genome Biol Evol9(12)3260ndash3264

Webb AE Walsh TA OrsquoConnell MJ 2017 VESPA very large-scale evo-lutionary and selective pressure analyses PeerJ Comput Sci 3e118

Weng M-P Liao B-Y 2017 modPhEA model organism PhenotypeEnrichment Analysis of eukaryotic gene sets Bioinformatics33(21)3505ndash3507

Wilson Sayres MA Venditti C Pagel M Makova KD 2011 Do variationsin substitution rates and male mutation bias correlate with life-history traits A study of 32 mammalian genomes Evolution65(10)2800ndash2815

Yang Z 1998 Likelihood ratio tests for detecting positive selection andapplication to primate lysozyme evolution Mol Biol Evol15(5)568ndash573

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 24(8)1586ndash1591

Yang Z Rannala B 2006 Bayesian estimation of species divergence timesunder a molecular clock using multiple fossil calibrations with softbounds Mol Biol Evol 23(1)212ndash226

Yim H-S Cho YS Guang X Kang SG Jeong J-Y Cha S-S Oh H-M Lee J-HYang EC Kwon KK et al 2014 Minke whale genome and aquaticadaptation in cetaceans Nat Genet 46(1)88ndash92

Yu J Tian S Metheny-Barlow L Chew L-J Hayes AJ Pan H Yu G-L LiL-Y 2001 Modulation of endothelial cell growth arrest andapoptosis by vascular endothelial growth inhibitor Circ Res89(12)1161ndash1167

Zhang C Rabiee M Sayyari E Mirarab S 2018 ASTRAL-III polynomialtime species tree reconstruction from partially resolved gene treesBMC Bioinformatics 19(Suppl 6)153

Zhang Q Edwards SV 2012 The evolution of intron size in amniotes arole for powered flight Genome Biol Evol 4(10)1033ndash1043

Zhao M Kim P Mitra R Zhao J Zhao Z 2015 TSGene 20 an updatedliterature-based knowledgebase for tumor suppressor genes NucleicAcids Res 44gkv1268ndashgkD1031

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1763

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

  • msz099-TF1
  • msz099-TF2
  • msz099-TF3
  • msz099-TF4
  • msz099-TF5
  • msz099-TF6
  • msz099-TF7
  • msz099-TF8
Page 14: University of Groningen Return to the sea, get huge, …...Return to the Sea, Get Huge, Beat Cancer: An Analysis of Cetacean Genomes Including an Assembly for the Humpback Whale (Megaptera

read pairs and the model was used to identify putative mis-joins and to score prospective joins After scaffolding shotgunsequences were used to close gaps between contigs

Assessing the Gene Content of the Humpback WhaleAssemblyThe expected gene content of the assembly was evaluatedusing the Core Eukaryotic Genes Mapping Approach (Parraet al 2009) which searches the assembly for 458 highly con-served proteins and reports the proportion of 248 of the mosthighly conserved orthologs that are present in the assemblyWe also used the Benchmarking Universal Single CopyOrthologs (BUSCO v201 Sim~ao et al 2015) which analyzesgenome assemblies for the presence of 3023 genes conservedacross vertebrates as well as a set of 6253 genes conservedacross laurasiatherian mammals

Transcriptome Sequencing and AssemblyIn order to aid in our gene-finding efforts for the humpbackwhale genome assembly and to measure gene expression wegenerated transcripts from skin tissue by extracting total RNAusing the QIAzol Lysis Reagent (Qiagen) followed by purifi-cation on RNeasy spin columns (Qiagen) RNA integrity andquantity were determined on the Agilent 2100 Bioanalyzer(Agilent) using the manufacturerrsquos protocol The total RNAwas treated with DNase using DNase mix from the RecoverAllTotal Nucleic Acid Isolation kit (Applied BiosystemsAmbion) The RNA library was prepared and sequenced bythe Genome Technology Center at the University ofCalifornia Santa Cruz including cDNA synthesis with theOvation RNA-Seq system V2 (Nugen) and RNA amplificationas described previously (Tariq et al 2011) We used 05ndash1mgof double-stranded cDNA for library preparation shearedusing the Covaris S2 size-selected for 350ndash450 bp using au-tomated electrophoretic DNA fractionation system(LabChipXT Caliper Life Sciences) Paired-end sequencing li-braries were constructed using Illumina TruSeq DNA SamplePreparation Kit Following library construction samples werequantified using the Bioanalyzer and sequenced on theIllumina HiSeq 2000 platform to produce 2 100 bp se-quencing reads We then used Trinity (Grabherr et al 2011)to assemble the adapter-trimmed RNA-Seq reads intotranscripts

Genome AnnotationWe generated gene models for the humpback whale usingmultiple iterations of MAKER2 (Holt and Yandell 2011) whichincorporated 1) direct evidence from the Trinity-assembledtranscripts 2) homology to NCBI proteins from ten mammals(human mouse dog cow sperm whale bottlenose dolphinorca bowhead whale common minke whale and baiji) andUniProtKBSwiss-Prot (UniProt Consortium 2015) and 3) abinitio gene predictions using SNAP (11292013 release Korf2004) and Augustus v302 (Stanke et al 2008) A detaileddescription of the annotation pipeline is provided in the sup-plementary Methods Supplementary Material online Finalgene calls were annotated functionally by BlastP similarity to

UniProt proteins (UniProt Consortium 2015) with an e-valuecutoff of 1e-6

Repeat Annotation and Evolutionary AnalysisTo analyze the repetitive landscape of the humpback whalegenome we used both database and de novo modelingmethods For the database method we ran RepeatMaskerv405 (httpwwwrepeatmaskerorg accessed August 212017) (Smit et al 2015a) on the final assembly indicatingthe ldquomammaliardquo repeat library from RepBase (Jurka et al2005) For the de novo method we scanned the assemblyfor repeats using RepeatModeler v108 (httpwwwrepeat-maskerorg) (Smit et al 2015b) the results of which were thenclassified using RepeatMasker To estimate evolutionary di-vergence within repeat subfamilies in the humpback whalegenome we generated repeat-family-specific alignments andcalculated the average Kimura-2-parameter divergence fromconsensus within each family correcting for high mutationrates at CpG sites with the calcDivergenceFromAlignplRepeatMasker tool We compared the divergence profile ofhumpback whale and bowhead whale by completing parallelanalyses and the repetitive landscapes of orca and bottlenosedolphin are available from the RepeatMasker server (httpwwwrepeatmaskerorgspecies accessed August 21 2017)

Analysis of Gene Expression Using RNA-SeqSplice-wise mapping of RNA-Seq reads against the humpbackwhale genome assembly and annotation was carried out us-ing STAR v24 (Dobin et al 2013) and we counted the num-ber of reads mapping to gene annotations We also mappedthe skin RNA-Seq data to the database of annotated hump-back whale transcripts using local alignments with bowtiev225 (Langmead and Salzberg 2012) and used stringtiev134 (Pertea et al 2015) to calculate gene abundances bytranscripts per million

Analysis of Segmental Duplications in CetaceanGenomesIn order to detect LSDs in several cetacean genomes we ap-plied an approach based on depth of coverage (Alkan et al2009) To this end we used whole-genome shotgun sequencedata from the current study as well as from other cetaceangenomics projects All data were mapped against the hump-back whale reference assembly A detailed description of thesegmental duplication analysis is provided in the supplemen-tary Methods Supplementary Material online

Whole-Genome AlignmentsWe generated WGAs of 12 mammals (supplementary table 2Supplementary Material online) First we generated pairwisesyntenic alignments of each species as a query to the humangenome (hg19) as a target using LASTZ v102 (Harris 2007)followed by chaining to form gapless blocks and netting torank the highest scoring chains (Kent et al 2003) The pairwisealignments were used to construct a multiple sequence align-ment with MULTIZ v112 (Blanchette et al 2004) with humanas the reference species We filtered the MULTIZ alignment to

Tollis et al doi101093molbevmsz099 MBE

1758

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

only contain aligned blocks from at least 10 out of the 12species (81 complete)

Phylogenetic Reconstruction Using Single-CopyOrthologsWe downloaded the coding DNA sequences from 28 publiclyavailable mammalian genome assemblies (supplementary ta-ble 9 Supplementary Material online) and used VESPA(Webb et al 2017) to obtain high-confidence SGOs (supple-mentary Methods Supplementary Material online) For phy-logenetic analysis we filtered the SGO data set to include onlyloci that were represented by at least 24 out of the 28 mam-malian species (86 complete) and reconstructed each genetree using maximum likelihood in PhyML v30 (Guindon et al2010) with an HKY85 substitution model and 100 bootstrapreplicates to assess branch support The gene trees were thenbinned and used to reconstruct a species tree using the ac-curate species tree algorithm (ASTRAL-III v56 Zhang et al2018) ASTRAL utilizes the multispecies coalescent modelthat incorporates incomplete lineage sorting and finds thespecies tree stemming from bipartitions predefined by thegene trees Branch support for the species tree was assessedwith local posterior probabilities and branch lengths werepresented in coalescent units where shorter branch lengthsindicate greater gene tree discordance (Sayyari and Mirarab2016)

Rates of Molecular Evolution and Divergence TimeEstimationWe used multiple approaches on independent data sets toestimate rates of molecular evolution and the divergencetimes of the major mammalian lineages including six modernwhales with complete genome assemblies We first focusedon 4-fold degenerate (4D) sites which are positions withincodon alignments where substitutions result in no aminoacid change and can be used to approximate the neutralrate of evolution (Kumar and Subramanian 2002) We usedthe Ensembl human gene annotation to extract codingregions from the 12-mammal WGA using msa_view inPHAST v14 (Hubisz et al 2011) We reconstructed the phy-logeny with the 4D data as a single partition in RAxML v83(Stamatakis 2014) under the GTRGAMMA substitutionmodel and assessed branch support with 10000 bootstrapsRates of molecular evolution were estimated on the 4D dataset with the semiparametric PL method implemented in r8sv18 (Sanderson 2002 2003) A detailed description of the PLmethod is given in the supplementary MethodsSupplementary Material online

We also used the approximate likelihood calculation inMCMCtree (Yang and Rannala 2006) to estimate divergencetimes using independent data sets 1) the above-mentioned4D data set derived from the WGA as well as 2) a set of theSGOs that included 24 out of 28 sampled taxa (86) and waspartitioned into three codon positions We implemented theHKY85 substitution model multiple fossil-based priors (sup-plementary table 10 Supplementary Material online Mitchell1989 Benton et al 2015 Hedges et al 2015) and independentrates (ldquoclockfrac14 3rdquo) along branches All other parameters were

set as defaults For each MCMCtree analysis we ran the anal-ysis three times with different starting seeds and modified theMarkov chain Monte Carlo (MCMC) length and samplingfrequency in order to achieve proper chain convergencemonitored with Tracer v17 We achieved proper MCMCconvergence on the 4D data set after discarding the first500000 steps as burn-in and sampling every 2000 steps untilwe collected 20000 samples We achieved proper MCMCconvergence on the SGO data set after discarding the first500000 steps as burn-in and sampling every 10000 steps untilwe collected 10000 samples

Demographic AnalysisWe used the PSMC (Li and Durbin 2011) to reconstruct thepopulation history of North Atlantic humpback whales in-cluding the individual sequenced in the current study (down-sampled to 20 coverage) and a second individualsequenced at 17 coverage in Arnason et al (2018) Adetailed description of the PSMC analysis is provided in thesupplementary Methods Supplementary Material online

Nonneutral Substitution Rates in Cetacean GenomesIn order to identify genomic regions controlling cetacean-specific adaptations we used phyloP (Pollard et al 2010) todetect loci in the 12-mammal WGA that depart from neutralexpectations (see supplementary Methods SupplementaryMaterial online) We then collected accelerated regions thatoverlapped human whole gene annotations (hg19) using bed-tools intersect (Quinlan and Hall 2010) and tested for theenrichment of GO terms using the PANTHER analysis toolavailable at the Gene Ontology Consortium website (GOOntology database last accessed June 2017) (GeneOntology Consortium 2015)

Detection of Protein-Coding Genes Subjected toPositive SelectionIn order to measure selective pressures acting on protein-coding genes during cetacean evolution with an emphasison the evolution of cancer suppression we estimated theratio of nonsynonymous to synonymous substitutions (dNdS) To maximize statistical power in pairwise comparisonsgiven the number of available cetacean genomes (six lastaccessed September 2017) we implemented phylogenetictargeting (Arnold and Nunn 2010) assuming a phylogenyfrom a mammalian supertree (Fritz et al 2009) To selectgenome assemblies most suitable for assessing PetorsquosParadox we weighted scores for contrasts with a lot of changein the same direction for both body mass and maximumlongevity Trait values were taken from panTHERIA (Joneset al 2009) and we selected maximal pairings based on thestandardized summed scores We then generated pairwisegenome alignments as described above based on the phylo-genetic targeting results For each pairwise genome align-ment we stitched gene blocks in Galaxy (Blankenberg et al2011) according to the target genome annotations produc-ing alignments of one-to-one orthologs which were filteredto delete frameshift mutations and replace internal stopcodons with gaps We then estimated pairwise dNdS for every

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1759

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

orthologous gene pair with KaKs_Calculator v20 (Wang et al2010) To link genes with dNdS gt1 to potential phenotypeswe used orthologous human Ensembl gene IDs to collect GOterms in BioMart (Kinsella et al 2011) and tested for enrich-ment of overrepresented GO terms

We also used codon-based models to test for selectivepressure variation along branches of the cetacean phylog-eny in comparison to other mammal lineages also knownas the branch-site test (Yang 2007) First the known spe-cies phylogeny (Morgan et al 2013 Tarver et al 2016) waspruned to correspond to the species present in each SGOfamily SGO nucleotide alignments that contained morethan seven species were analyzed for selective pressurevariation This is to reduce the risk of detecting falsepositives (Anisimova et al 2001 2002) In general thebranch-site test is a powerful yet conservative approach(Gharib and Robinson-Rechavi 2013) although modelmisspecification and alignment errors can greatly increasethe number of false positives (Anisimova et al 20012002) Recent studies have concluded that many pub-lished inferences of adaptive evolution using the branchsite test may be affected by artifacts (Venkat et al 2018)Therefore extensive filtering is necessary in order to makereasonably sound conclusions from results of the branchsite test A detailed description of all tested models andthe filtering process are given in the supplementaryMethods Supplementary Material online In total 1152gene families were analyzed We carried out the branch-site test using PAML v44e (Yang 2007) The following fivebranches were assessed as foreground humpback whalethe most recent common ancestor (MRCA) of the com-mon minke and humpback whales MRCA of baleenwhales MRCA of toothed whales and the MRCA of allwhales (cetacean stem lineage) For each model we keptall genes that met a significance threshold of Plt 005 aftera Bonferonni correction for multiple hypothesis testingusing the total number of branch genes (five foregroundbranches1152 genes) We also corrected the raw P-val-ues from the likelihood ratio tests of every gene by theFDR correction where qfrac14 005 (Benjamini and Hochberg1995) The Bonferroni correction is more conservativethan the FDR but sufficient for multiple hypothesis test-ing of lineage-specific positive selection in these casesthe FDR results in higher probabilities of rejecting truenull hypotheses (Anisimova and Yang 2007) Thereforewe used Bonferroni-corrected results in downstreamanalyses but also report FDR-corrected P-values infigure 4C Genes were identified based on the humanortholog (Ensembl gene ID) and we performed gene an-notation enrichment analysis and functional annotationclustering with DAVID v68 (Huang et al 2009a 2009b) aswell as semantic clustering of GO terms using REVIGO(Supek et al 2011) We also searched for interactions ofpositively selected proteins using STRING v105(Szklarczyk et al 2017) with default parameters andtested for the enrichment of overrepresented GO termsas above and for associated mouse phenotypes usingmodPhEA (Weng and Liao 2017)

Data AvailabilityAll data that contributed to the results of the study are madepublicly available The genomic sequencing RNA sequencingas well as the genome assembly for humpback whale(GCA_0043293851) are available under NCBI BioProjectPRJNA509641 The gene annotation orthologous gene setspositive selection results segmental duplication annotationsand whole-genome alignments used in this study are availableat the Harvard Dataverse (httpsdoiorg107910DVNADHX1O)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsField data collection was supported by the Center for CoastalStudies and conducted under NOAA research permit 16325This work was initiated with funding from Dr Jeffrey Pearl atUC San Fransisco This work was supported in part by NIH(Grants U54 CA217376 U2C CA233254 P01 CA91955 R01CA149566 R01 CA170595 R01 CA185138 and R01CA140657) as well as CDMRP Breast Cancer ResearchProgram (Award BC132057) and the Arizona BiomedicalResearch Commission (Grant ADHS18-198847) to CCMThe findings opinions and recommendations expressedhere are those of the authors and not necessarily those ofthe universities where the research was performed or theNational Institutes of Health PJP would like to acknowledgefunding from Stockholm University and the University ofGroningen Startup funds were provided by the School ofInformatics Computing and Cyber Systems at NorthernArizona University (MT) MJOrsquoC would like to thank theUniversity of Leeds funding her 250 Great Minds UniversityAcademic Fellowship The authors thank Wensi Hao(University of Groningen) for technical support AndreaCabrera (University of Groningen) for helpful conversationsabout mutation rates and demography and Richard E Green(UC Santa Cruz) for consultations regarding genome se-quencing assembly and annotation The authors acknowl-edge Research Computing at Arizona State University (httpwwwresearchcomputingasuedu) the Advanced ResearchComputing resources at the University of Leeds and theMonsoon computing cluster at Northern ArizonaUniversity (httpsnaueduhigh-performance-computing)for providing high-performance computing and storageresources that have contributed to this study

ReferencesAbegglen LM Caulin AF Chan A Lee K Robinson R Campbell MS Kiso

WK Schmitt DL Waddell PJ Bhaskara S 2015 Potential mechanismsfor cancer resistance in elephants and comparative cellular responseto DNA damage in humans JAMA 314(17)1850ndash1860

Alkan C Kidd JM Marques-Bonet T Aksay G Antonacci F HormozdiariF Kitzman JO Baker C Malig M Mutlu O 2009 Personalized copynumber and segmental duplication maps using next-generation se-quencing Nat Genet 41(10)1061ndash1067

Tollis et al doi101093molbevmsz099 MBE

1760

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Anisimova M Bielawski JP Yang Z 2001 Accuracy and power of thelikelihood ratio test in detecting adaptive molecular evolution MolBiol Evol 18(8)1585ndash1592

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19(6)950ndash958

Anisimova M Yang Z 2007 Multiple hypothesis testing to detect line-ages under positive selection that affects only a few sites Mol BiolEvol 24(5)1219ndash1228

Arnason U Lammers F Kumar V Nilsson MA Janke A 2018 Whole-genome sequencing of the blue whale and other rorquals findssignatures for introgressive gene flow Sci Adv 4(4)eaap9873

Arnason U Widegren B 1989 Composition and chromosomal localiza-tion of cetacean highly repetitive DNA with special reference to theblue whale Balaenoptera musculus Chromosoma 98(5)323ndash329

Arnold C Nunn CL 2010 Phylogenetic targeting of research effort inevolutionary biology Am Nat 176(5)601ndash612

Baker CS Perry A Bannister JL Weinrich MT Abernethy RBCalambokidis J Lien J Lambertsen RH Ramirez JU Vasquez O1993 Abundant mitochondrial DNA variation and world-wide pop-ulation structure in humpback whales Proc Natl Acad Sci U S A90(17)8239ndash8243

Baker J Meade A Pagel M Venditti C 2015 Adaptive evolution towardlarger size in mammals Proc Natl Acad Sci U S A 112(16)5093ndash5098

Benjamini Y Hochberg Y 1995 Controlling the false discovery rate apractical and powerful approach to multiple testing J R Stat Soc57289ndash300

Benson RBJ Campione NE Carrano MT Mannion PD Sullivan CUpchurch P Evans DC 2014 Rates of dinosaur body mass evolutionindicate 170 million years of sustained ecological innovation on theavian stem lineage PLoS Biol 12(5)e1001853

Benton MJ Donoghue PCJ Asher RJ Friedman M Near TJ Vinther J2015 Constraints on the timescale of animal evolutionary historyPalaeontol Electron 181ndash106

Berta A Sumich JL Kovacs KM 2015 Marine mammals evolutionarybiology 3rd ed Cambridge (MA) Academic Press

Blanchette M Kent WJ Riemer C Elnitski L Smit AFA Roskin KMBaertsch R Rosenbloom K Clawson H Green ED 2004 Aligningmultiple genomic sequences with the threaded blockset alignerGenome Res 14(4)708ndash715

Blankenberg D Taylor J Nekrutenko A Galaxy T 2011 Making wholegenome multiple alignments usable for biologists Bioinformatics27(17)2426ndash2428

Boissinot S Sookdeo A 2016 The evolution of LINE-1 in vertebratesGenome Biol Evol 8(12)3485ndash3507

Bolger AM Lohse M Usadel B 2014 Trimmomatic a flexible trimmer forIllumina sequence data Bioinformatics 30(15)2114ndash2120

Browning CL Wise CF Wise JP 2017 Prolonged particulate chromateexposure does not inhibit homologous recombination repair inNorth Atlantic right whale (Eubalaena glacialis) lung cells ToxicolAppl Pharmacol 33118ndash23

Carbone L Harris RA Gnerre S Veeramah KR Lorente-Galdos BHuddleston J Meyer TJ Herrero J Roos C Aken B et al 2014Gibbon genome and the fast karyotype evolution of small apesNature 513(7517)195ndash201

Caulin AF Graham TA Wang L-S Maley CC 2015 Solutions to Petorsquosparadox revealed by mathematical modelling and cross-species can-cer gene analysis Philos Trans R Soc B 370(1673)20140222

Caulin AF Maley CC 2011 Petorsquos Paradox evolutionrsquos prescription forcancer prevention Trends Ecol Evol (Amst) 26(4)175ndash182

Chapman JA Ho I Sunkara S Luo S Schroth GP Rokhsar DS 2011Meraculous de novo genome assembly with short paired-end readsPLoS One 6(8)e23501

Chittleborough RG 1959 Determination of age in the humpback whaleMegaptera nodosa (Bonnaterre) Aust J Mar Freshw Res10(2)125ndash143

Clapham PJ Mead JG 1999 Megaptera novaeangliae Mamm Species6041ndash9

Clement V Dunand-Sauthier I Clarkson SG 2006 Suppression of UV-induced apoptosis by the human DNA repair protein XPG CellDeath Differ 13(3)478ndash488

Dobin A Davis CA Schlesinger F Drenkow J Zaleski C Jha S Batut PChaisson M Gingeras TR 2013 STAR ultrafast universal RNA-Seqaligner Bioinformatics 29(1)15ndash21

Fay JC Wu C-I 2003 Sequence divergence functional constraint andselection in protein evolution Annu Rev Genomics Hum Genet4213ndash235

Ferris E Abegglen LM Schiffman JD Gregg C 2018 Accelerated evolu-tion in distinctive species reveals candidate elements for clinicallyrelevant traits including mutation and cancer resistance Cell Rep22(10)2742ndash2755

Fog CK Galli GG Lund AH 2012 PRDM proteins important players indifferentiation and disease Bioessays 34(1)50ndash60

Foley NM Hughes GM Huang Z Clarke M Jebb D Whelan CV Petit EJTouzalin F Farcy O Jones G et al 2018 Growing old yet stayingyoung the role of telomeres in batsrsquo exceptional longevity Sci Adv4(2)eaao0926

Foote AD Liu Y Thomas GWC Vinar T Alfoldi J Deng J Dugan Svan Elk CE Hunter ME Joshi V et al 2015 Convergent evolu-tion of the genomes of marine mammals Nat Genet47(3)272ndash275

Forbes SA Beare D Gunasekaran P Leung K Bindal N Boutselakis HDing M Bamford S Cole C Ward S et al 2015 COSMIC exploringthe worldrsquos knowledge of somatic mutations in human cancerNucleic Acids Res 43(D1)D805ndashD811

Fritz SA Bininda-Emonds ORP Purvis A 2009 Geographical variation inpredictors of mammalian extinction risk big is bad but only in thetropics Ecol Lett 12(6)538ndash549

Futreal PA Coin L Marshall M Down T Hubbard T Wooster R RahmanN Stratton MR 2004 A census of human cancer genes Nat RevCancer 4(3)177ndash183

Gabriele C Lockyer C Straley JM Jurasz CM Kato H 2010 Sightinghistory of a naturally marked humpback whale (Megaptera novaean-gliae) suggests ear plug growth layer groups are deposited annuallyMar Mam Sci 26(2)443ndash450

Gauthier JM Metcalfe CD Sears R 1997 Chlorinated organic contam-inants in blubber biopsies from northwestern Atlantic balaenopteridwhales summering in the Gulf of St Lawrence Mar Environ Res44(2)201ndash223

Gene Ontology Consortium 2015 Gene Ontology Consortium goingforward Nucleic Acids Res 43D1049ndashD1056

George JC Bada J Zeh J Scott L Brown SE OrsquoHara T Suydam R 1999Age and growth estimates of bowhead whales (Balaena mysticetus)via aspartic acid racemization Can J Zool 77(4)571ndash580

Gharib WH Robinson-Rechavi M 2013 The branch-site test of positiveselection is surprisingly robust but lacks power under synonymoussubstitution saturation and variation in GC Mol Biol Evol30(7)1675ndash1686

Grabherr MG Haas BJ Yassour M Levin JZ Thompson DA Amit IAdiconis X Fan L Raychowdhury R Zeng Q et al 2011 Full-lengthtranscriptome assembly from RNA-Seq data without a referencegenome Nat Biotechnol 29(7)644ndash652

Guindon S Dufayard J-F Lefort V Anisimova M Hordijk W Gascuel O2010 New algorithms and methods to estimate maximum-likelihood phylogenies assessing the performance of PhyML 30Syst Biol 59(3)307ndash321

Harris RS 2007 Improved pairwise alignment of genomic DNA [PhDthesis] The Pennsylvania State University State College PA USA

Hedges SB Marin J Suleski M Paymer M Kumar S 2015 Tree of lifereveals clock-like speciation and diversification Mol Biol Evol32(4)835ndash845

Heim NA Knope ML Schaal EK Wang SC Payne JL 2015 Copersquos rule inthe evolution of marine animals Science 347(6224)867ndash870

Holt C Yandell M 2011 MAKER2 an annotation pipeline and genome-database management tool for second-generation genome projectsBMC Bioinformatics 12491

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1761

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Huang DW Sherman BT Lempicki RA 2009a Bioinformatics enrich-ment tools paths toward the comprehensive functional analysis oflarge gene lists Nucleic Acids Res 37(1)1ndash13

Huang DW Sherman BT Lempicki RA 2009b Systematic and integra-tive analysis of large gene lists using DAVID bioinformatics resourcesNat Protoc 4(1)44ndash57

Hubisz MJ Pollard KS Siepel A 2011 PHAST and RPHAST phylogeneticanalysis with spacetime models Brief Bioinform 12(1)41ndash51

Hudson RR 1990 Gene genealogies and the coalescent process InFutuyama DJ Antonovics J editors Oxfords surveys in evolutionarybiology Oxford Oxford University Press p 1ndash44

Ichishima H 2016 The ethmoid and presphenoid of cetaceans JMorphol 277(12)1661ndash1674

Jackson JA Baker CS Vant M Steel DJ Medrano-Gonzalez LPalumbi SR 2009 Big and slow phylogenetic estimates of mo-lecular evolution in baleen whales (suborder Mysticeti) MolBiol Evol 26(11)2427ndash2440

Jackson JA Steel DJ Beerli P Congdon BC Olavarria C Leslie MS PomillaC Rosenbaum H Baker CS 2014 Global diversity and oceanic di-vergence of humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 281(1786)20133222

Jones KE Bielby J Cardillo M Fritz SA OrsquoDell J Orme CDL Safi KSechrest W Boakes EH Carbone C et al 2009 PanTHERIA a spe-cies-level database of life history ecology and geography of extantand recently extinct mammals Ecology 90(9)2648ndash2648

Jurka J Kapitonov VV Pavlicek A Klonowski P Kohany O WalichiewiczJ 2005 Repbase Update a database of eukaryotic repetitive ele-ments Cytogenet Genome Res 110(1ndash4)462ndash467

Kaessmann H 2010 Origins evolution and phenotypic impact of newgenes Genome Res 20(10)1313ndash1326

Kastan MB Bartek J 2004 Cell-cycle checkpoints and cancer Nature432(7015)316ndash323

Katona SK Whitehead HP 1981 Identifying humpback whales usingtheir natural markings Polar Rec 20(128)439ndash444

Keane M Semeiks J Webb AE Li YI Quesada V Craig T Madsen LB vanDam S Brawand D Marques PI et al 2015 Insights into the evolu-tion of longevity from the bowhead whale genome Cell Rep10(1)112ndash122

Kent WJ Baertsch R Hinrichs A Miller W Haussler D 2003 Evolutionrsquoscauldron duplication deletion and rearrangement in the mouseand human genomes Proc Natl Acad Sci U S A100(20)11484ndash11489

Kingsolver JG Pfennig DW 2004 Individual-level selection as acause of Copersquos rule of phyletic size increase Evolution58(7)1608ndash1612

Kinsella RJ Keuroaheuroari A Haider S Zamora J Proctor G Spudich G Almeida-King J Staines D Derwent P Kerhornou A et al 2011 EnsemblBioMarts a hub for data retrieval across taxonomic spaceDatabase (Oxford) 2011bar030

Korf I 2004 Gene finding in novel genomes BMC Bioinformatics 559Kumar S Subramanian S 2002 Mutation rates in mammalian genomes

Proc Natl Acad Sci U S A 99(2)803ndash808Lambertsen RH 1987 A biopsy system for large whales and its use for

cytogenetics J Mammal 68(2)443ndash445Langmead B Salzberg SL 2012 Fast gapped-read alignment with Bowtie

2 Nat Methods 9(4)357ndash359Larsen AH Sigurjonsson J Oslashien N Vikingsson G Palsboslashll PJ 1996

Populations genetic analysis of nuclear and mitochondrial loci inskin biopsies collected from central and northeastern NorthAtlantic humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 2631611ndash1618

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357)493ndash496

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup 2009 The Sequence AlignmentMap format andSAMtools Bioinformatics 25(16)2078ndash2079

Martineau D Lemberger K Dallaire A Labelle P Lipscomb TP Michel PMikaelian I 2002 Cancer in wildlife a case study beluga from the St

Lawrence estuary Quebec Canada Environ Health Perspect110(3)285ndash292

Mazet O Rodrıguez W Grusea S Boitard S Chikhi L 2016 On theimportance of being structured instantaneous coalescence ratesand human evolution ndash lesson for ancestral population size infer-ence Heredity 116(4)362ndash371

Michaud EJ Yoder BK 2006 The primary cilium in cell signaling andcancer Cancer Res 66(13)6463ndash6467

Milholland B Dong X Zhang L Hao X Suh Y Vijg J 2017 Differencesbetween germline and somatic mutation rates in humans and miceNat Commun 815183

Mitchell ED 1989 A new cetacean from the Late Eocene La MesetaFormation Seymour Island Antarctic Peninsula Can J Fish Aquat Sci46(12)2219ndash2235

Morgan CC Foster PG Webb AE Pisani D McInerney JO OrsquoConnell MJ2013 Heterogeneous models place the root of the placental mam-mal phylogeny Mol Biol Evol 30(9)2145ndash2156

Nadachowska-Brzyska K Burri R Smeds L Ellegren H 2016 PSMC anal-ysis of effective population sizes in molecular ecology and its appli-cation to black-and-white Ficedula flycatchers Mol Ecol25(5)1058ndash1072

Newman SJ Smith SA 2006 Marine mammal neoplasia a review VetPathol 43(6)865ndash880

Nunney L 2018 Size matters height cell number and a personrsquos risk ofcancer Proc R Soc B 285(1889)20181743

Ohsumi S 1979 Interspecies relationships among some biologicalparameters in cetaceans and estimation of the natural mortalitycoefficient of the Southern Hemisphere minke whales Rep IntWhaling Comm 29397ndash406

OrsquoLeary MA Gatesy J 2008 Impact of increased character sampling onthe phylogeny of Cetartiodactyla (Mammalia) combined analysisincluding fossils Cladistics 24397ndash442

Pagel M Venditti C Meade A 2006 Large punctuational contribution ofspeciation to evolutionary divergence at the molecular level Science314(5796)119ndash121

Palsboslashll PJ Clapham PJ Mattila DK Larsen F Sears R Siegismund HRSigurjonsson J Vasquez O Arctander P 1995 Distribution ofmtDNA haplotypes in North Atlantic humpback whales the influ-ence of behavior on population structure Mar Ecol Prog Ser1161ndash10

Palsboslashll PJ Larsen F Hansen ES 1991 Sampling of skin biopsies fromfree-raging large cetaceans in West Greenland development of newbiopsy tips and bolt designs Int Whaling Comm Spec Issue Ser13311

Palsboslashll PJ Zachariah Peery M Olsen MT Beissinger SR Berube M 2013Inferring recent historic abundance from current genetic diversityMol Ecol 22(1)22ndash40

Parra G Bradnam K Ning Z Keane T Korf I 2009 Assessing the genespace in draft genomes Nucleic Acids Res 37(1)289ndash297

Pertea M Pertea GM Antonescu CM Chang T-C Mendell JT SalzbergSL 2015 StringTie enables improved reconstruction of a transcrip-tome from RNA-Seq reads Nat Biotechnol 33(3)290ndash295

Peto R Roe FJ Lee PN Levy L Clack J 1975 Cancer and ageing in miceand men Br J Cancer 32(4)411ndash426

Pio R Corrales L Lambris JD 2014 The role of complement in tumorgrowth Adv Exp Med Biol 772229ndash262

Pollard KS Hubisz MJ Rosenbloom KR Siepel A 2010 Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 20(1)110ndash121

Putnam NH OrsquoConnell BL Stites JC Rice BJ Blanchette M Calef R TrollCJ Fields A Hartley PD Sugnet CW et al 2016 Chromosome-scaleshotgun assembly using an in vitro method for long-range linkageGenome Res 26(3)342ndash350

Quinlan AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26(6)841ndash842

Ruegg K Rosenbaum HC Anderson EC Engel M Rothschild A Baker CSPalumbi SR 2013 Long-term population size of the North Atlantichumpback whale within the context of worldwide population struc-ture Conserv Genet 14(1)103ndash114

Tollis et al doi101093molbevmsz099 MBE

1762

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Sanderson MJ 2002 Estimating absolute rates of molecular evolutionand divergence times a penalized likelihood approach Mol Biol Evol19(1)101ndash109

Sanderson MJ 2003 r8s inferring absolute rates of molecular evolutionand divergence times in the absence of a molecular clockBioinformatics 19(2)301ndash302

Sayyari E Mirarab S 2016 Fast coalescent-based computation of localbranch support from quartet frequencies Mol Biol Evol33(7)1654ndash1668

Shadat NMA Koide N Khuda II-E Dagvadorj J Tumurkhuu G Naiki YKomatsu T Yoshida T Yokochi T 2010 Retinoblastoma protein-interacting zinc finger 1 (RIZ1) regulates the proliferation of mono-cytic leukemia cells via activation of p53 Cancer Invest28(8)806ndash812

Sim~ao FA Waterhouse RM Ioannidis P Kriventseva EV Zdobnov EM2015 BUSCO assessing genome assembly and annotation com-pleteness with single-copy orthologs Bioinformatics31btv351ndashbtv3212

Slater GJ Goldbogen JA Pyenson ND 2017 Independent evolution ofbaleen whale gigantism linked to Plio-Pleistocene ocean dynamicsProc R Soc B 284(1855)20170546

Smit AFA Hubley RM Green P 2015a RepeatMasker Open-40 2013-2015 Available from httpwwwrepeatmaskerorg

Smit AFA Hubley RM Green P 2015b RepeatModeler Open-10 2008-2015 Available from httpwwwrepeatmaskerorg

Smith TD Allen J Clapham PJ Hammond PS Katona S Larsen F Lien JMattila D Palsboslashll PJ Sigurjonsson J et al 1999 An ocean-basin-widemark-recapture study of the North Atlantic humpback whale(Megaptera novaeangliae) Mar Mamm Sci 51ndash32

Stamatakis A 2014 RAxML version 8 a tool for phylogenetic analysisand post-analysis of large phylogenies Bioinformatics30(9)1312ndash1313

Stanke M Diekhans M Baertsch R Haussler D 2008 Using native andsyntenically mapped cDNA alignments to improve de novo genefinding Bioinformatics 24(5)637ndash644

Steeman ME Hebsgaard MB Fordyce RE Ho SYW Rabosky DL NielsenR Rahbek C Glenner H Soslashrensen MV Willerslev E 2009 Radiationof extant cetaceans driven by restructuring of the oceans Syst Biol58(6)573ndash585

Sulak M Fong L Mika K Chigurupati S Yon L Mongan NP Emes RDLynch VJ 2016 TP53 copy number expansion is associated with theevolution of increased body size and an enhanced DNA damageresponse in elephants Elife 51850

Supek F Bosnjak M Skunca N Smuc T 2011 REVIGO summarizes andvisualizes long lists of gene ontology terms PLoS One 6(7)e21800

Szklarczyk D Morris JH Cook H Kuhn M Wyder S Simonovic M SantosA Doncheva NT Roth A Bork P et al 2017 The STRING database in2017 quality-controlled protein-protein association networks madebroadly accessible Nucleic Acids Res 45(D1)D362ndashD368

Tacutu R Craig T Budovsky A Wuttke D Lehmann G Taranukha DCosta J Fraifeld VE de Magalh~aes JP 2012 Human Ageing GenomicResources integrated databases and tools for the biology and genet-ics of ageing Nucleic Acids Res 41(D1)D1027ndashD1033

Tariq MA Kim HJ Jejelowo O Pourmand N 2011 Whole-transcriptomeRNAseq analysis from minute amount of total RNA Nucleic AcidsRes 39(18)e120

Tarver JE Dos Reis M Mirarab S Moran RJ Parker S OrsquoReilly JE King BLOrsquoConnell MJ Asher RJ Warnow T et al 2016 The interrelationshipsof placental mammals and the limits of phylogenetic inferenceGenome Biol Evol 8(2)330ndash344

Thewissen JGM Cohn MJ Stevens LS Bajpai S Heyning J Horton WE2006 Developmental basis for hind-limb loss in dolphins and origin

of the cetacean bodyplan Proc Natl Acad Sci U S A103(22)8414ndash8418

Thewissen JGM Cooper LN Clementz MT Bajpai S Tiwari BN 2007Whales originated from aquatic artiodactyls in the Eocene epoch ofIndia Nature 450(7173)1190ndash1194

Tollis M Schiffman JD Boddy AM 2017 Evolution of cancer suppressionas revealed by mammalian comparative genomics Curr Opin GenetDev 4240ndash47

Trego KS Groesser T Davalos AR Parplys AC Zhao W Nelson MRHlaing A Shih B Rydberg B Pluth JM et al 2016 Non-catalytic rolesfor XPG with BRCA1 and BRCA2 in homologous recombination andgenome stability Mol Cell 61(4)535ndash546

UniProt Consortium 2015 UniProt a hub for protein informationNucleic Acids Res 43D204ndashD212

Valsecchi E Palsboslashll PJ Hale P Glockner-Ferrari D Ferrari M Clapham PLarsen F Mattila DK Sears R Sigurjonsson J et al 1997 Microsatellitegenetic distances between oceanic opulations of the humpbackwhale (Megaptera novaeangliae) Mol Biol Evol 14355ndash362

Venkat A Hahn MW Thornton JW 2018 Multinucleotide mutationscause false inferences of lineage-specific positive selection Nat EcolEvol 2(8)1280ndash1288

Vermeij GJ 2016 Gigantism and its implications for the history of lifePLoS One 11(1)e0146092

Wang D Zhang Y Zhang Z Zhu J Yu J 2010 KaKs_Calculator 20 atoolkit incorporating gamma-series methods and sliding windowstrategies Genomics Proteomics Bioinformatics 8(1)77ndash80

Warren WC Kuderna L Alexander A Catchen J Perez-Silva JG Lopez-Otın C Quesada V Minx P Tomlinson C Montague MJ et al 2017The novel evolution of the sperm whale genome Genome Biol Evol9(12)3260ndash3264

Webb AE Walsh TA OrsquoConnell MJ 2017 VESPA very large-scale evo-lutionary and selective pressure analyses PeerJ Comput Sci 3e118

Weng M-P Liao B-Y 2017 modPhEA model organism PhenotypeEnrichment Analysis of eukaryotic gene sets Bioinformatics33(21)3505ndash3507

Wilson Sayres MA Venditti C Pagel M Makova KD 2011 Do variationsin substitution rates and male mutation bias correlate with life-history traits A study of 32 mammalian genomes Evolution65(10)2800ndash2815

Yang Z 1998 Likelihood ratio tests for detecting positive selection andapplication to primate lysozyme evolution Mol Biol Evol15(5)568ndash573

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 24(8)1586ndash1591

Yang Z Rannala B 2006 Bayesian estimation of species divergence timesunder a molecular clock using multiple fossil calibrations with softbounds Mol Biol Evol 23(1)212ndash226

Yim H-S Cho YS Guang X Kang SG Jeong J-Y Cha S-S Oh H-M Lee J-HYang EC Kwon KK et al 2014 Minke whale genome and aquaticadaptation in cetaceans Nat Genet 46(1)88ndash92

Yu J Tian S Metheny-Barlow L Chew L-J Hayes AJ Pan H Yu G-L LiL-Y 2001 Modulation of endothelial cell growth arrest andapoptosis by vascular endothelial growth inhibitor Circ Res89(12)1161ndash1167

Zhang C Rabiee M Sayyari E Mirarab S 2018 ASTRAL-III polynomialtime species tree reconstruction from partially resolved gene treesBMC Bioinformatics 19(Suppl 6)153

Zhang Q Edwards SV 2012 The evolution of intron size in amniotes arole for powered flight Genome Biol Evol 4(10)1033ndash1043

Zhao M Kim P Mitra R Zhao J Zhao Z 2015 TSGene 20 an updatedliterature-based knowledgebase for tumor suppressor genes NucleicAcids Res 44gkv1268ndashgkD1031

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1763

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

  • msz099-TF1
  • msz099-TF2
  • msz099-TF3
  • msz099-TF4
  • msz099-TF5
  • msz099-TF6
  • msz099-TF7
  • msz099-TF8
Page 15: University of Groningen Return to the sea, get huge, …...Return to the Sea, Get Huge, Beat Cancer: An Analysis of Cetacean Genomes Including an Assembly for the Humpback Whale (Megaptera

only contain aligned blocks from at least 10 out of the 12species (81 complete)

Phylogenetic Reconstruction Using Single-CopyOrthologsWe downloaded the coding DNA sequences from 28 publiclyavailable mammalian genome assemblies (supplementary ta-ble 9 Supplementary Material online) and used VESPA(Webb et al 2017) to obtain high-confidence SGOs (supple-mentary Methods Supplementary Material online) For phy-logenetic analysis we filtered the SGO data set to include onlyloci that were represented by at least 24 out of the 28 mam-malian species (86 complete) and reconstructed each genetree using maximum likelihood in PhyML v30 (Guindon et al2010) with an HKY85 substitution model and 100 bootstrapreplicates to assess branch support The gene trees were thenbinned and used to reconstruct a species tree using the ac-curate species tree algorithm (ASTRAL-III v56 Zhang et al2018) ASTRAL utilizes the multispecies coalescent modelthat incorporates incomplete lineage sorting and finds thespecies tree stemming from bipartitions predefined by thegene trees Branch support for the species tree was assessedwith local posterior probabilities and branch lengths werepresented in coalescent units where shorter branch lengthsindicate greater gene tree discordance (Sayyari and Mirarab2016)

Rates of Molecular Evolution and Divergence TimeEstimationWe used multiple approaches on independent data sets toestimate rates of molecular evolution and the divergencetimes of the major mammalian lineages including six modernwhales with complete genome assemblies We first focusedon 4-fold degenerate (4D) sites which are positions withincodon alignments where substitutions result in no aminoacid change and can be used to approximate the neutralrate of evolution (Kumar and Subramanian 2002) We usedthe Ensembl human gene annotation to extract codingregions from the 12-mammal WGA using msa_view inPHAST v14 (Hubisz et al 2011) We reconstructed the phy-logeny with the 4D data as a single partition in RAxML v83(Stamatakis 2014) under the GTRGAMMA substitutionmodel and assessed branch support with 10000 bootstrapsRates of molecular evolution were estimated on the 4D dataset with the semiparametric PL method implemented in r8sv18 (Sanderson 2002 2003) A detailed description of the PLmethod is given in the supplementary MethodsSupplementary Material online

We also used the approximate likelihood calculation inMCMCtree (Yang and Rannala 2006) to estimate divergencetimes using independent data sets 1) the above-mentioned4D data set derived from the WGA as well as 2) a set of theSGOs that included 24 out of 28 sampled taxa (86) and waspartitioned into three codon positions We implemented theHKY85 substitution model multiple fossil-based priors (sup-plementary table 10 Supplementary Material online Mitchell1989 Benton et al 2015 Hedges et al 2015) and independentrates (ldquoclockfrac14 3rdquo) along branches All other parameters were

set as defaults For each MCMCtree analysis we ran the anal-ysis three times with different starting seeds and modified theMarkov chain Monte Carlo (MCMC) length and samplingfrequency in order to achieve proper chain convergencemonitored with Tracer v17 We achieved proper MCMCconvergence on the 4D data set after discarding the first500000 steps as burn-in and sampling every 2000 steps untilwe collected 20000 samples We achieved proper MCMCconvergence on the SGO data set after discarding the first500000 steps as burn-in and sampling every 10000 steps untilwe collected 10000 samples

Demographic AnalysisWe used the PSMC (Li and Durbin 2011) to reconstruct thepopulation history of North Atlantic humpback whales in-cluding the individual sequenced in the current study (down-sampled to 20 coverage) and a second individualsequenced at 17 coverage in Arnason et al (2018) Adetailed description of the PSMC analysis is provided in thesupplementary Methods Supplementary Material online

Nonneutral Substitution Rates in Cetacean GenomesIn order to identify genomic regions controlling cetacean-specific adaptations we used phyloP (Pollard et al 2010) todetect loci in the 12-mammal WGA that depart from neutralexpectations (see supplementary Methods SupplementaryMaterial online) We then collected accelerated regions thatoverlapped human whole gene annotations (hg19) using bed-tools intersect (Quinlan and Hall 2010) and tested for theenrichment of GO terms using the PANTHER analysis toolavailable at the Gene Ontology Consortium website (GOOntology database last accessed June 2017) (GeneOntology Consortium 2015)

Detection of Protein-Coding Genes Subjected toPositive SelectionIn order to measure selective pressures acting on protein-coding genes during cetacean evolution with an emphasison the evolution of cancer suppression we estimated theratio of nonsynonymous to synonymous substitutions (dNdS) To maximize statistical power in pairwise comparisonsgiven the number of available cetacean genomes (six lastaccessed September 2017) we implemented phylogenetictargeting (Arnold and Nunn 2010) assuming a phylogenyfrom a mammalian supertree (Fritz et al 2009) To selectgenome assemblies most suitable for assessing PetorsquosParadox we weighted scores for contrasts with a lot of changein the same direction for both body mass and maximumlongevity Trait values were taken from panTHERIA (Joneset al 2009) and we selected maximal pairings based on thestandardized summed scores We then generated pairwisegenome alignments as described above based on the phylo-genetic targeting results For each pairwise genome align-ment we stitched gene blocks in Galaxy (Blankenberg et al2011) according to the target genome annotations produc-ing alignments of one-to-one orthologs which were filteredto delete frameshift mutations and replace internal stopcodons with gaps We then estimated pairwise dNdS for every

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1759

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

orthologous gene pair with KaKs_Calculator v20 (Wang et al2010) To link genes with dNdS gt1 to potential phenotypeswe used orthologous human Ensembl gene IDs to collect GOterms in BioMart (Kinsella et al 2011) and tested for enrich-ment of overrepresented GO terms

We also used codon-based models to test for selectivepressure variation along branches of the cetacean phylog-eny in comparison to other mammal lineages also knownas the branch-site test (Yang 2007) First the known spe-cies phylogeny (Morgan et al 2013 Tarver et al 2016) waspruned to correspond to the species present in each SGOfamily SGO nucleotide alignments that contained morethan seven species were analyzed for selective pressurevariation This is to reduce the risk of detecting falsepositives (Anisimova et al 2001 2002) In general thebranch-site test is a powerful yet conservative approach(Gharib and Robinson-Rechavi 2013) although modelmisspecification and alignment errors can greatly increasethe number of false positives (Anisimova et al 20012002) Recent studies have concluded that many pub-lished inferences of adaptive evolution using the branchsite test may be affected by artifacts (Venkat et al 2018)Therefore extensive filtering is necessary in order to makereasonably sound conclusions from results of the branchsite test A detailed description of all tested models andthe filtering process are given in the supplementaryMethods Supplementary Material online In total 1152gene families were analyzed We carried out the branch-site test using PAML v44e (Yang 2007) The following fivebranches were assessed as foreground humpback whalethe most recent common ancestor (MRCA) of the com-mon minke and humpback whales MRCA of baleenwhales MRCA of toothed whales and the MRCA of allwhales (cetacean stem lineage) For each model we keptall genes that met a significance threshold of Plt 005 aftera Bonferonni correction for multiple hypothesis testingusing the total number of branch genes (five foregroundbranches1152 genes) We also corrected the raw P-val-ues from the likelihood ratio tests of every gene by theFDR correction where qfrac14 005 (Benjamini and Hochberg1995) The Bonferroni correction is more conservativethan the FDR but sufficient for multiple hypothesis test-ing of lineage-specific positive selection in these casesthe FDR results in higher probabilities of rejecting truenull hypotheses (Anisimova and Yang 2007) Thereforewe used Bonferroni-corrected results in downstreamanalyses but also report FDR-corrected P-values infigure 4C Genes were identified based on the humanortholog (Ensembl gene ID) and we performed gene an-notation enrichment analysis and functional annotationclustering with DAVID v68 (Huang et al 2009a 2009b) aswell as semantic clustering of GO terms using REVIGO(Supek et al 2011) We also searched for interactions ofpositively selected proteins using STRING v105(Szklarczyk et al 2017) with default parameters andtested for the enrichment of overrepresented GO termsas above and for associated mouse phenotypes usingmodPhEA (Weng and Liao 2017)

Data AvailabilityAll data that contributed to the results of the study are madepublicly available The genomic sequencing RNA sequencingas well as the genome assembly for humpback whale(GCA_0043293851) are available under NCBI BioProjectPRJNA509641 The gene annotation orthologous gene setspositive selection results segmental duplication annotationsand whole-genome alignments used in this study are availableat the Harvard Dataverse (httpsdoiorg107910DVNADHX1O)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsField data collection was supported by the Center for CoastalStudies and conducted under NOAA research permit 16325This work was initiated with funding from Dr Jeffrey Pearl atUC San Fransisco This work was supported in part by NIH(Grants U54 CA217376 U2C CA233254 P01 CA91955 R01CA149566 R01 CA170595 R01 CA185138 and R01CA140657) as well as CDMRP Breast Cancer ResearchProgram (Award BC132057) and the Arizona BiomedicalResearch Commission (Grant ADHS18-198847) to CCMThe findings opinions and recommendations expressedhere are those of the authors and not necessarily those ofthe universities where the research was performed or theNational Institutes of Health PJP would like to acknowledgefunding from Stockholm University and the University ofGroningen Startup funds were provided by the School ofInformatics Computing and Cyber Systems at NorthernArizona University (MT) MJOrsquoC would like to thank theUniversity of Leeds funding her 250 Great Minds UniversityAcademic Fellowship The authors thank Wensi Hao(University of Groningen) for technical support AndreaCabrera (University of Groningen) for helpful conversationsabout mutation rates and demography and Richard E Green(UC Santa Cruz) for consultations regarding genome se-quencing assembly and annotation The authors acknowl-edge Research Computing at Arizona State University (httpwwwresearchcomputingasuedu) the Advanced ResearchComputing resources at the University of Leeds and theMonsoon computing cluster at Northern ArizonaUniversity (httpsnaueduhigh-performance-computing)for providing high-performance computing and storageresources that have contributed to this study

ReferencesAbegglen LM Caulin AF Chan A Lee K Robinson R Campbell MS Kiso

WK Schmitt DL Waddell PJ Bhaskara S 2015 Potential mechanismsfor cancer resistance in elephants and comparative cellular responseto DNA damage in humans JAMA 314(17)1850ndash1860

Alkan C Kidd JM Marques-Bonet T Aksay G Antonacci F HormozdiariF Kitzman JO Baker C Malig M Mutlu O 2009 Personalized copynumber and segmental duplication maps using next-generation se-quencing Nat Genet 41(10)1061ndash1067

Tollis et al doi101093molbevmsz099 MBE

1760

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Anisimova M Bielawski JP Yang Z 2001 Accuracy and power of thelikelihood ratio test in detecting adaptive molecular evolution MolBiol Evol 18(8)1585ndash1592

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19(6)950ndash958

Anisimova M Yang Z 2007 Multiple hypothesis testing to detect line-ages under positive selection that affects only a few sites Mol BiolEvol 24(5)1219ndash1228

Arnason U Lammers F Kumar V Nilsson MA Janke A 2018 Whole-genome sequencing of the blue whale and other rorquals findssignatures for introgressive gene flow Sci Adv 4(4)eaap9873

Arnason U Widegren B 1989 Composition and chromosomal localiza-tion of cetacean highly repetitive DNA with special reference to theblue whale Balaenoptera musculus Chromosoma 98(5)323ndash329

Arnold C Nunn CL 2010 Phylogenetic targeting of research effort inevolutionary biology Am Nat 176(5)601ndash612

Baker CS Perry A Bannister JL Weinrich MT Abernethy RBCalambokidis J Lien J Lambertsen RH Ramirez JU Vasquez O1993 Abundant mitochondrial DNA variation and world-wide pop-ulation structure in humpback whales Proc Natl Acad Sci U S A90(17)8239ndash8243

Baker J Meade A Pagel M Venditti C 2015 Adaptive evolution towardlarger size in mammals Proc Natl Acad Sci U S A 112(16)5093ndash5098

Benjamini Y Hochberg Y 1995 Controlling the false discovery rate apractical and powerful approach to multiple testing J R Stat Soc57289ndash300

Benson RBJ Campione NE Carrano MT Mannion PD Sullivan CUpchurch P Evans DC 2014 Rates of dinosaur body mass evolutionindicate 170 million years of sustained ecological innovation on theavian stem lineage PLoS Biol 12(5)e1001853

Benton MJ Donoghue PCJ Asher RJ Friedman M Near TJ Vinther J2015 Constraints on the timescale of animal evolutionary historyPalaeontol Electron 181ndash106

Berta A Sumich JL Kovacs KM 2015 Marine mammals evolutionarybiology 3rd ed Cambridge (MA) Academic Press

Blanchette M Kent WJ Riemer C Elnitski L Smit AFA Roskin KMBaertsch R Rosenbloom K Clawson H Green ED 2004 Aligningmultiple genomic sequences with the threaded blockset alignerGenome Res 14(4)708ndash715

Blankenberg D Taylor J Nekrutenko A Galaxy T 2011 Making wholegenome multiple alignments usable for biologists Bioinformatics27(17)2426ndash2428

Boissinot S Sookdeo A 2016 The evolution of LINE-1 in vertebratesGenome Biol Evol 8(12)3485ndash3507

Bolger AM Lohse M Usadel B 2014 Trimmomatic a flexible trimmer forIllumina sequence data Bioinformatics 30(15)2114ndash2120

Browning CL Wise CF Wise JP 2017 Prolonged particulate chromateexposure does not inhibit homologous recombination repair inNorth Atlantic right whale (Eubalaena glacialis) lung cells ToxicolAppl Pharmacol 33118ndash23

Carbone L Harris RA Gnerre S Veeramah KR Lorente-Galdos BHuddleston J Meyer TJ Herrero J Roos C Aken B et al 2014Gibbon genome and the fast karyotype evolution of small apesNature 513(7517)195ndash201

Caulin AF Graham TA Wang L-S Maley CC 2015 Solutions to Petorsquosparadox revealed by mathematical modelling and cross-species can-cer gene analysis Philos Trans R Soc B 370(1673)20140222

Caulin AF Maley CC 2011 Petorsquos Paradox evolutionrsquos prescription forcancer prevention Trends Ecol Evol (Amst) 26(4)175ndash182

Chapman JA Ho I Sunkara S Luo S Schroth GP Rokhsar DS 2011Meraculous de novo genome assembly with short paired-end readsPLoS One 6(8)e23501

Chittleborough RG 1959 Determination of age in the humpback whaleMegaptera nodosa (Bonnaterre) Aust J Mar Freshw Res10(2)125ndash143

Clapham PJ Mead JG 1999 Megaptera novaeangliae Mamm Species6041ndash9

Clement V Dunand-Sauthier I Clarkson SG 2006 Suppression of UV-induced apoptosis by the human DNA repair protein XPG CellDeath Differ 13(3)478ndash488

Dobin A Davis CA Schlesinger F Drenkow J Zaleski C Jha S Batut PChaisson M Gingeras TR 2013 STAR ultrafast universal RNA-Seqaligner Bioinformatics 29(1)15ndash21

Fay JC Wu C-I 2003 Sequence divergence functional constraint andselection in protein evolution Annu Rev Genomics Hum Genet4213ndash235

Ferris E Abegglen LM Schiffman JD Gregg C 2018 Accelerated evolu-tion in distinctive species reveals candidate elements for clinicallyrelevant traits including mutation and cancer resistance Cell Rep22(10)2742ndash2755

Fog CK Galli GG Lund AH 2012 PRDM proteins important players indifferentiation and disease Bioessays 34(1)50ndash60

Foley NM Hughes GM Huang Z Clarke M Jebb D Whelan CV Petit EJTouzalin F Farcy O Jones G et al 2018 Growing old yet stayingyoung the role of telomeres in batsrsquo exceptional longevity Sci Adv4(2)eaao0926

Foote AD Liu Y Thomas GWC Vinar T Alfoldi J Deng J Dugan Svan Elk CE Hunter ME Joshi V et al 2015 Convergent evolu-tion of the genomes of marine mammals Nat Genet47(3)272ndash275

Forbes SA Beare D Gunasekaran P Leung K Bindal N Boutselakis HDing M Bamford S Cole C Ward S et al 2015 COSMIC exploringthe worldrsquos knowledge of somatic mutations in human cancerNucleic Acids Res 43(D1)D805ndashD811

Fritz SA Bininda-Emonds ORP Purvis A 2009 Geographical variation inpredictors of mammalian extinction risk big is bad but only in thetropics Ecol Lett 12(6)538ndash549

Futreal PA Coin L Marshall M Down T Hubbard T Wooster R RahmanN Stratton MR 2004 A census of human cancer genes Nat RevCancer 4(3)177ndash183

Gabriele C Lockyer C Straley JM Jurasz CM Kato H 2010 Sightinghistory of a naturally marked humpback whale (Megaptera novaean-gliae) suggests ear plug growth layer groups are deposited annuallyMar Mam Sci 26(2)443ndash450

Gauthier JM Metcalfe CD Sears R 1997 Chlorinated organic contam-inants in blubber biopsies from northwestern Atlantic balaenopteridwhales summering in the Gulf of St Lawrence Mar Environ Res44(2)201ndash223

Gene Ontology Consortium 2015 Gene Ontology Consortium goingforward Nucleic Acids Res 43D1049ndashD1056

George JC Bada J Zeh J Scott L Brown SE OrsquoHara T Suydam R 1999Age and growth estimates of bowhead whales (Balaena mysticetus)via aspartic acid racemization Can J Zool 77(4)571ndash580

Gharib WH Robinson-Rechavi M 2013 The branch-site test of positiveselection is surprisingly robust but lacks power under synonymoussubstitution saturation and variation in GC Mol Biol Evol30(7)1675ndash1686

Grabherr MG Haas BJ Yassour M Levin JZ Thompson DA Amit IAdiconis X Fan L Raychowdhury R Zeng Q et al 2011 Full-lengthtranscriptome assembly from RNA-Seq data without a referencegenome Nat Biotechnol 29(7)644ndash652

Guindon S Dufayard J-F Lefort V Anisimova M Hordijk W Gascuel O2010 New algorithms and methods to estimate maximum-likelihood phylogenies assessing the performance of PhyML 30Syst Biol 59(3)307ndash321

Harris RS 2007 Improved pairwise alignment of genomic DNA [PhDthesis] The Pennsylvania State University State College PA USA

Hedges SB Marin J Suleski M Paymer M Kumar S 2015 Tree of lifereveals clock-like speciation and diversification Mol Biol Evol32(4)835ndash845

Heim NA Knope ML Schaal EK Wang SC Payne JL 2015 Copersquos rule inthe evolution of marine animals Science 347(6224)867ndash870

Holt C Yandell M 2011 MAKER2 an annotation pipeline and genome-database management tool for second-generation genome projectsBMC Bioinformatics 12491

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1761

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Huang DW Sherman BT Lempicki RA 2009a Bioinformatics enrich-ment tools paths toward the comprehensive functional analysis oflarge gene lists Nucleic Acids Res 37(1)1ndash13

Huang DW Sherman BT Lempicki RA 2009b Systematic and integra-tive analysis of large gene lists using DAVID bioinformatics resourcesNat Protoc 4(1)44ndash57

Hubisz MJ Pollard KS Siepel A 2011 PHAST and RPHAST phylogeneticanalysis with spacetime models Brief Bioinform 12(1)41ndash51

Hudson RR 1990 Gene genealogies and the coalescent process InFutuyama DJ Antonovics J editors Oxfords surveys in evolutionarybiology Oxford Oxford University Press p 1ndash44

Ichishima H 2016 The ethmoid and presphenoid of cetaceans JMorphol 277(12)1661ndash1674

Jackson JA Baker CS Vant M Steel DJ Medrano-Gonzalez LPalumbi SR 2009 Big and slow phylogenetic estimates of mo-lecular evolution in baleen whales (suborder Mysticeti) MolBiol Evol 26(11)2427ndash2440

Jackson JA Steel DJ Beerli P Congdon BC Olavarria C Leslie MS PomillaC Rosenbaum H Baker CS 2014 Global diversity and oceanic di-vergence of humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 281(1786)20133222

Jones KE Bielby J Cardillo M Fritz SA OrsquoDell J Orme CDL Safi KSechrest W Boakes EH Carbone C et al 2009 PanTHERIA a spe-cies-level database of life history ecology and geography of extantand recently extinct mammals Ecology 90(9)2648ndash2648

Jurka J Kapitonov VV Pavlicek A Klonowski P Kohany O WalichiewiczJ 2005 Repbase Update a database of eukaryotic repetitive ele-ments Cytogenet Genome Res 110(1ndash4)462ndash467

Kaessmann H 2010 Origins evolution and phenotypic impact of newgenes Genome Res 20(10)1313ndash1326

Kastan MB Bartek J 2004 Cell-cycle checkpoints and cancer Nature432(7015)316ndash323

Katona SK Whitehead HP 1981 Identifying humpback whales usingtheir natural markings Polar Rec 20(128)439ndash444

Keane M Semeiks J Webb AE Li YI Quesada V Craig T Madsen LB vanDam S Brawand D Marques PI et al 2015 Insights into the evolu-tion of longevity from the bowhead whale genome Cell Rep10(1)112ndash122

Kent WJ Baertsch R Hinrichs A Miller W Haussler D 2003 Evolutionrsquoscauldron duplication deletion and rearrangement in the mouseand human genomes Proc Natl Acad Sci U S A100(20)11484ndash11489

Kingsolver JG Pfennig DW 2004 Individual-level selection as acause of Copersquos rule of phyletic size increase Evolution58(7)1608ndash1612

Kinsella RJ Keuroaheuroari A Haider S Zamora J Proctor G Spudich G Almeida-King J Staines D Derwent P Kerhornou A et al 2011 EnsemblBioMarts a hub for data retrieval across taxonomic spaceDatabase (Oxford) 2011bar030

Korf I 2004 Gene finding in novel genomes BMC Bioinformatics 559Kumar S Subramanian S 2002 Mutation rates in mammalian genomes

Proc Natl Acad Sci U S A 99(2)803ndash808Lambertsen RH 1987 A biopsy system for large whales and its use for

cytogenetics J Mammal 68(2)443ndash445Langmead B Salzberg SL 2012 Fast gapped-read alignment with Bowtie

2 Nat Methods 9(4)357ndash359Larsen AH Sigurjonsson J Oslashien N Vikingsson G Palsboslashll PJ 1996

Populations genetic analysis of nuclear and mitochondrial loci inskin biopsies collected from central and northeastern NorthAtlantic humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 2631611ndash1618

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357)493ndash496

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup 2009 The Sequence AlignmentMap format andSAMtools Bioinformatics 25(16)2078ndash2079

Martineau D Lemberger K Dallaire A Labelle P Lipscomb TP Michel PMikaelian I 2002 Cancer in wildlife a case study beluga from the St

Lawrence estuary Quebec Canada Environ Health Perspect110(3)285ndash292

Mazet O Rodrıguez W Grusea S Boitard S Chikhi L 2016 On theimportance of being structured instantaneous coalescence ratesand human evolution ndash lesson for ancestral population size infer-ence Heredity 116(4)362ndash371

Michaud EJ Yoder BK 2006 The primary cilium in cell signaling andcancer Cancer Res 66(13)6463ndash6467

Milholland B Dong X Zhang L Hao X Suh Y Vijg J 2017 Differencesbetween germline and somatic mutation rates in humans and miceNat Commun 815183

Mitchell ED 1989 A new cetacean from the Late Eocene La MesetaFormation Seymour Island Antarctic Peninsula Can J Fish Aquat Sci46(12)2219ndash2235

Morgan CC Foster PG Webb AE Pisani D McInerney JO OrsquoConnell MJ2013 Heterogeneous models place the root of the placental mam-mal phylogeny Mol Biol Evol 30(9)2145ndash2156

Nadachowska-Brzyska K Burri R Smeds L Ellegren H 2016 PSMC anal-ysis of effective population sizes in molecular ecology and its appli-cation to black-and-white Ficedula flycatchers Mol Ecol25(5)1058ndash1072

Newman SJ Smith SA 2006 Marine mammal neoplasia a review VetPathol 43(6)865ndash880

Nunney L 2018 Size matters height cell number and a personrsquos risk ofcancer Proc R Soc B 285(1889)20181743

Ohsumi S 1979 Interspecies relationships among some biologicalparameters in cetaceans and estimation of the natural mortalitycoefficient of the Southern Hemisphere minke whales Rep IntWhaling Comm 29397ndash406

OrsquoLeary MA Gatesy J 2008 Impact of increased character sampling onthe phylogeny of Cetartiodactyla (Mammalia) combined analysisincluding fossils Cladistics 24397ndash442

Pagel M Venditti C Meade A 2006 Large punctuational contribution ofspeciation to evolutionary divergence at the molecular level Science314(5796)119ndash121

Palsboslashll PJ Clapham PJ Mattila DK Larsen F Sears R Siegismund HRSigurjonsson J Vasquez O Arctander P 1995 Distribution ofmtDNA haplotypes in North Atlantic humpback whales the influ-ence of behavior on population structure Mar Ecol Prog Ser1161ndash10

Palsboslashll PJ Larsen F Hansen ES 1991 Sampling of skin biopsies fromfree-raging large cetaceans in West Greenland development of newbiopsy tips and bolt designs Int Whaling Comm Spec Issue Ser13311

Palsboslashll PJ Zachariah Peery M Olsen MT Beissinger SR Berube M 2013Inferring recent historic abundance from current genetic diversityMol Ecol 22(1)22ndash40

Parra G Bradnam K Ning Z Keane T Korf I 2009 Assessing the genespace in draft genomes Nucleic Acids Res 37(1)289ndash297

Pertea M Pertea GM Antonescu CM Chang T-C Mendell JT SalzbergSL 2015 StringTie enables improved reconstruction of a transcrip-tome from RNA-Seq reads Nat Biotechnol 33(3)290ndash295

Peto R Roe FJ Lee PN Levy L Clack J 1975 Cancer and ageing in miceand men Br J Cancer 32(4)411ndash426

Pio R Corrales L Lambris JD 2014 The role of complement in tumorgrowth Adv Exp Med Biol 772229ndash262

Pollard KS Hubisz MJ Rosenbloom KR Siepel A 2010 Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 20(1)110ndash121

Putnam NH OrsquoConnell BL Stites JC Rice BJ Blanchette M Calef R TrollCJ Fields A Hartley PD Sugnet CW et al 2016 Chromosome-scaleshotgun assembly using an in vitro method for long-range linkageGenome Res 26(3)342ndash350

Quinlan AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26(6)841ndash842

Ruegg K Rosenbaum HC Anderson EC Engel M Rothschild A Baker CSPalumbi SR 2013 Long-term population size of the North Atlantichumpback whale within the context of worldwide population struc-ture Conserv Genet 14(1)103ndash114

Tollis et al doi101093molbevmsz099 MBE

1762

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Sanderson MJ 2002 Estimating absolute rates of molecular evolutionand divergence times a penalized likelihood approach Mol Biol Evol19(1)101ndash109

Sanderson MJ 2003 r8s inferring absolute rates of molecular evolutionand divergence times in the absence of a molecular clockBioinformatics 19(2)301ndash302

Sayyari E Mirarab S 2016 Fast coalescent-based computation of localbranch support from quartet frequencies Mol Biol Evol33(7)1654ndash1668

Shadat NMA Koide N Khuda II-E Dagvadorj J Tumurkhuu G Naiki YKomatsu T Yoshida T Yokochi T 2010 Retinoblastoma protein-interacting zinc finger 1 (RIZ1) regulates the proliferation of mono-cytic leukemia cells via activation of p53 Cancer Invest28(8)806ndash812

Sim~ao FA Waterhouse RM Ioannidis P Kriventseva EV Zdobnov EM2015 BUSCO assessing genome assembly and annotation com-pleteness with single-copy orthologs Bioinformatics31btv351ndashbtv3212

Slater GJ Goldbogen JA Pyenson ND 2017 Independent evolution ofbaleen whale gigantism linked to Plio-Pleistocene ocean dynamicsProc R Soc B 284(1855)20170546

Smit AFA Hubley RM Green P 2015a RepeatMasker Open-40 2013-2015 Available from httpwwwrepeatmaskerorg

Smit AFA Hubley RM Green P 2015b RepeatModeler Open-10 2008-2015 Available from httpwwwrepeatmaskerorg

Smith TD Allen J Clapham PJ Hammond PS Katona S Larsen F Lien JMattila D Palsboslashll PJ Sigurjonsson J et al 1999 An ocean-basin-widemark-recapture study of the North Atlantic humpback whale(Megaptera novaeangliae) Mar Mamm Sci 51ndash32

Stamatakis A 2014 RAxML version 8 a tool for phylogenetic analysisand post-analysis of large phylogenies Bioinformatics30(9)1312ndash1313

Stanke M Diekhans M Baertsch R Haussler D 2008 Using native andsyntenically mapped cDNA alignments to improve de novo genefinding Bioinformatics 24(5)637ndash644

Steeman ME Hebsgaard MB Fordyce RE Ho SYW Rabosky DL NielsenR Rahbek C Glenner H Soslashrensen MV Willerslev E 2009 Radiationof extant cetaceans driven by restructuring of the oceans Syst Biol58(6)573ndash585

Sulak M Fong L Mika K Chigurupati S Yon L Mongan NP Emes RDLynch VJ 2016 TP53 copy number expansion is associated with theevolution of increased body size and an enhanced DNA damageresponse in elephants Elife 51850

Supek F Bosnjak M Skunca N Smuc T 2011 REVIGO summarizes andvisualizes long lists of gene ontology terms PLoS One 6(7)e21800

Szklarczyk D Morris JH Cook H Kuhn M Wyder S Simonovic M SantosA Doncheva NT Roth A Bork P et al 2017 The STRING database in2017 quality-controlled protein-protein association networks madebroadly accessible Nucleic Acids Res 45(D1)D362ndashD368

Tacutu R Craig T Budovsky A Wuttke D Lehmann G Taranukha DCosta J Fraifeld VE de Magalh~aes JP 2012 Human Ageing GenomicResources integrated databases and tools for the biology and genet-ics of ageing Nucleic Acids Res 41(D1)D1027ndashD1033

Tariq MA Kim HJ Jejelowo O Pourmand N 2011 Whole-transcriptomeRNAseq analysis from minute amount of total RNA Nucleic AcidsRes 39(18)e120

Tarver JE Dos Reis M Mirarab S Moran RJ Parker S OrsquoReilly JE King BLOrsquoConnell MJ Asher RJ Warnow T et al 2016 The interrelationshipsof placental mammals and the limits of phylogenetic inferenceGenome Biol Evol 8(2)330ndash344

Thewissen JGM Cohn MJ Stevens LS Bajpai S Heyning J Horton WE2006 Developmental basis for hind-limb loss in dolphins and origin

of the cetacean bodyplan Proc Natl Acad Sci U S A103(22)8414ndash8418

Thewissen JGM Cooper LN Clementz MT Bajpai S Tiwari BN 2007Whales originated from aquatic artiodactyls in the Eocene epoch ofIndia Nature 450(7173)1190ndash1194

Tollis M Schiffman JD Boddy AM 2017 Evolution of cancer suppressionas revealed by mammalian comparative genomics Curr Opin GenetDev 4240ndash47

Trego KS Groesser T Davalos AR Parplys AC Zhao W Nelson MRHlaing A Shih B Rydberg B Pluth JM et al 2016 Non-catalytic rolesfor XPG with BRCA1 and BRCA2 in homologous recombination andgenome stability Mol Cell 61(4)535ndash546

UniProt Consortium 2015 UniProt a hub for protein informationNucleic Acids Res 43D204ndashD212

Valsecchi E Palsboslashll PJ Hale P Glockner-Ferrari D Ferrari M Clapham PLarsen F Mattila DK Sears R Sigurjonsson J et al 1997 Microsatellitegenetic distances between oceanic opulations of the humpbackwhale (Megaptera novaeangliae) Mol Biol Evol 14355ndash362

Venkat A Hahn MW Thornton JW 2018 Multinucleotide mutationscause false inferences of lineage-specific positive selection Nat EcolEvol 2(8)1280ndash1288

Vermeij GJ 2016 Gigantism and its implications for the history of lifePLoS One 11(1)e0146092

Wang D Zhang Y Zhang Z Zhu J Yu J 2010 KaKs_Calculator 20 atoolkit incorporating gamma-series methods and sliding windowstrategies Genomics Proteomics Bioinformatics 8(1)77ndash80

Warren WC Kuderna L Alexander A Catchen J Perez-Silva JG Lopez-Otın C Quesada V Minx P Tomlinson C Montague MJ et al 2017The novel evolution of the sperm whale genome Genome Biol Evol9(12)3260ndash3264

Webb AE Walsh TA OrsquoConnell MJ 2017 VESPA very large-scale evo-lutionary and selective pressure analyses PeerJ Comput Sci 3e118

Weng M-P Liao B-Y 2017 modPhEA model organism PhenotypeEnrichment Analysis of eukaryotic gene sets Bioinformatics33(21)3505ndash3507

Wilson Sayres MA Venditti C Pagel M Makova KD 2011 Do variationsin substitution rates and male mutation bias correlate with life-history traits A study of 32 mammalian genomes Evolution65(10)2800ndash2815

Yang Z 1998 Likelihood ratio tests for detecting positive selection andapplication to primate lysozyme evolution Mol Biol Evol15(5)568ndash573

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 24(8)1586ndash1591

Yang Z Rannala B 2006 Bayesian estimation of species divergence timesunder a molecular clock using multiple fossil calibrations with softbounds Mol Biol Evol 23(1)212ndash226

Yim H-S Cho YS Guang X Kang SG Jeong J-Y Cha S-S Oh H-M Lee J-HYang EC Kwon KK et al 2014 Minke whale genome and aquaticadaptation in cetaceans Nat Genet 46(1)88ndash92

Yu J Tian S Metheny-Barlow L Chew L-J Hayes AJ Pan H Yu G-L LiL-Y 2001 Modulation of endothelial cell growth arrest andapoptosis by vascular endothelial growth inhibitor Circ Res89(12)1161ndash1167

Zhang C Rabiee M Sayyari E Mirarab S 2018 ASTRAL-III polynomialtime species tree reconstruction from partially resolved gene treesBMC Bioinformatics 19(Suppl 6)153

Zhang Q Edwards SV 2012 The evolution of intron size in amniotes arole for powered flight Genome Biol Evol 4(10)1033ndash1043

Zhao M Kim P Mitra R Zhao J Zhao Z 2015 TSGene 20 an updatedliterature-based knowledgebase for tumor suppressor genes NucleicAcids Res 44gkv1268ndashgkD1031

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1763

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

  • msz099-TF1
  • msz099-TF2
  • msz099-TF3
  • msz099-TF4
  • msz099-TF5
  • msz099-TF6
  • msz099-TF7
  • msz099-TF8
Page 16: University of Groningen Return to the sea, get huge, …...Return to the Sea, Get Huge, Beat Cancer: An Analysis of Cetacean Genomes Including an Assembly for the Humpback Whale (Megaptera

orthologous gene pair with KaKs_Calculator v20 (Wang et al2010) To link genes with dNdS gt1 to potential phenotypeswe used orthologous human Ensembl gene IDs to collect GOterms in BioMart (Kinsella et al 2011) and tested for enrich-ment of overrepresented GO terms

We also used codon-based models to test for selectivepressure variation along branches of the cetacean phylog-eny in comparison to other mammal lineages also knownas the branch-site test (Yang 2007) First the known spe-cies phylogeny (Morgan et al 2013 Tarver et al 2016) waspruned to correspond to the species present in each SGOfamily SGO nucleotide alignments that contained morethan seven species were analyzed for selective pressurevariation This is to reduce the risk of detecting falsepositives (Anisimova et al 2001 2002) In general thebranch-site test is a powerful yet conservative approach(Gharib and Robinson-Rechavi 2013) although modelmisspecification and alignment errors can greatly increasethe number of false positives (Anisimova et al 20012002) Recent studies have concluded that many pub-lished inferences of adaptive evolution using the branchsite test may be affected by artifacts (Venkat et al 2018)Therefore extensive filtering is necessary in order to makereasonably sound conclusions from results of the branchsite test A detailed description of all tested models andthe filtering process are given in the supplementaryMethods Supplementary Material online In total 1152gene families were analyzed We carried out the branch-site test using PAML v44e (Yang 2007) The following fivebranches were assessed as foreground humpback whalethe most recent common ancestor (MRCA) of the com-mon minke and humpback whales MRCA of baleenwhales MRCA of toothed whales and the MRCA of allwhales (cetacean stem lineage) For each model we keptall genes that met a significance threshold of Plt 005 aftera Bonferonni correction for multiple hypothesis testingusing the total number of branch genes (five foregroundbranches1152 genes) We also corrected the raw P-val-ues from the likelihood ratio tests of every gene by theFDR correction where qfrac14 005 (Benjamini and Hochberg1995) The Bonferroni correction is more conservativethan the FDR but sufficient for multiple hypothesis test-ing of lineage-specific positive selection in these casesthe FDR results in higher probabilities of rejecting truenull hypotheses (Anisimova and Yang 2007) Thereforewe used Bonferroni-corrected results in downstreamanalyses but also report FDR-corrected P-values infigure 4C Genes were identified based on the humanortholog (Ensembl gene ID) and we performed gene an-notation enrichment analysis and functional annotationclustering with DAVID v68 (Huang et al 2009a 2009b) aswell as semantic clustering of GO terms using REVIGO(Supek et al 2011) We also searched for interactions ofpositively selected proteins using STRING v105(Szklarczyk et al 2017) with default parameters andtested for the enrichment of overrepresented GO termsas above and for associated mouse phenotypes usingmodPhEA (Weng and Liao 2017)

Data AvailabilityAll data that contributed to the results of the study are madepublicly available The genomic sequencing RNA sequencingas well as the genome assembly for humpback whale(GCA_0043293851) are available under NCBI BioProjectPRJNA509641 The gene annotation orthologous gene setspositive selection results segmental duplication annotationsand whole-genome alignments used in this study are availableat the Harvard Dataverse (httpsdoiorg107910DVNADHX1O)

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsField data collection was supported by the Center for CoastalStudies and conducted under NOAA research permit 16325This work was initiated with funding from Dr Jeffrey Pearl atUC San Fransisco This work was supported in part by NIH(Grants U54 CA217376 U2C CA233254 P01 CA91955 R01CA149566 R01 CA170595 R01 CA185138 and R01CA140657) as well as CDMRP Breast Cancer ResearchProgram (Award BC132057) and the Arizona BiomedicalResearch Commission (Grant ADHS18-198847) to CCMThe findings opinions and recommendations expressedhere are those of the authors and not necessarily those ofthe universities where the research was performed or theNational Institutes of Health PJP would like to acknowledgefunding from Stockholm University and the University ofGroningen Startup funds were provided by the School ofInformatics Computing and Cyber Systems at NorthernArizona University (MT) MJOrsquoC would like to thank theUniversity of Leeds funding her 250 Great Minds UniversityAcademic Fellowship The authors thank Wensi Hao(University of Groningen) for technical support AndreaCabrera (University of Groningen) for helpful conversationsabout mutation rates and demography and Richard E Green(UC Santa Cruz) for consultations regarding genome se-quencing assembly and annotation The authors acknowl-edge Research Computing at Arizona State University (httpwwwresearchcomputingasuedu) the Advanced ResearchComputing resources at the University of Leeds and theMonsoon computing cluster at Northern ArizonaUniversity (httpsnaueduhigh-performance-computing)for providing high-performance computing and storageresources that have contributed to this study

ReferencesAbegglen LM Caulin AF Chan A Lee K Robinson R Campbell MS Kiso

WK Schmitt DL Waddell PJ Bhaskara S 2015 Potential mechanismsfor cancer resistance in elephants and comparative cellular responseto DNA damage in humans JAMA 314(17)1850ndash1860

Alkan C Kidd JM Marques-Bonet T Aksay G Antonacci F HormozdiariF Kitzman JO Baker C Malig M Mutlu O 2009 Personalized copynumber and segmental duplication maps using next-generation se-quencing Nat Genet 41(10)1061ndash1067

Tollis et al doi101093molbevmsz099 MBE

1760

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Anisimova M Bielawski JP Yang Z 2001 Accuracy and power of thelikelihood ratio test in detecting adaptive molecular evolution MolBiol Evol 18(8)1585ndash1592

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19(6)950ndash958

Anisimova M Yang Z 2007 Multiple hypothesis testing to detect line-ages under positive selection that affects only a few sites Mol BiolEvol 24(5)1219ndash1228

Arnason U Lammers F Kumar V Nilsson MA Janke A 2018 Whole-genome sequencing of the blue whale and other rorquals findssignatures for introgressive gene flow Sci Adv 4(4)eaap9873

Arnason U Widegren B 1989 Composition and chromosomal localiza-tion of cetacean highly repetitive DNA with special reference to theblue whale Balaenoptera musculus Chromosoma 98(5)323ndash329

Arnold C Nunn CL 2010 Phylogenetic targeting of research effort inevolutionary biology Am Nat 176(5)601ndash612

Baker CS Perry A Bannister JL Weinrich MT Abernethy RBCalambokidis J Lien J Lambertsen RH Ramirez JU Vasquez O1993 Abundant mitochondrial DNA variation and world-wide pop-ulation structure in humpback whales Proc Natl Acad Sci U S A90(17)8239ndash8243

Baker J Meade A Pagel M Venditti C 2015 Adaptive evolution towardlarger size in mammals Proc Natl Acad Sci U S A 112(16)5093ndash5098

Benjamini Y Hochberg Y 1995 Controlling the false discovery rate apractical and powerful approach to multiple testing J R Stat Soc57289ndash300

Benson RBJ Campione NE Carrano MT Mannion PD Sullivan CUpchurch P Evans DC 2014 Rates of dinosaur body mass evolutionindicate 170 million years of sustained ecological innovation on theavian stem lineage PLoS Biol 12(5)e1001853

Benton MJ Donoghue PCJ Asher RJ Friedman M Near TJ Vinther J2015 Constraints on the timescale of animal evolutionary historyPalaeontol Electron 181ndash106

Berta A Sumich JL Kovacs KM 2015 Marine mammals evolutionarybiology 3rd ed Cambridge (MA) Academic Press

Blanchette M Kent WJ Riemer C Elnitski L Smit AFA Roskin KMBaertsch R Rosenbloom K Clawson H Green ED 2004 Aligningmultiple genomic sequences with the threaded blockset alignerGenome Res 14(4)708ndash715

Blankenberg D Taylor J Nekrutenko A Galaxy T 2011 Making wholegenome multiple alignments usable for biologists Bioinformatics27(17)2426ndash2428

Boissinot S Sookdeo A 2016 The evolution of LINE-1 in vertebratesGenome Biol Evol 8(12)3485ndash3507

Bolger AM Lohse M Usadel B 2014 Trimmomatic a flexible trimmer forIllumina sequence data Bioinformatics 30(15)2114ndash2120

Browning CL Wise CF Wise JP 2017 Prolonged particulate chromateexposure does not inhibit homologous recombination repair inNorth Atlantic right whale (Eubalaena glacialis) lung cells ToxicolAppl Pharmacol 33118ndash23

Carbone L Harris RA Gnerre S Veeramah KR Lorente-Galdos BHuddleston J Meyer TJ Herrero J Roos C Aken B et al 2014Gibbon genome and the fast karyotype evolution of small apesNature 513(7517)195ndash201

Caulin AF Graham TA Wang L-S Maley CC 2015 Solutions to Petorsquosparadox revealed by mathematical modelling and cross-species can-cer gene analysis Philos Trans R Soc B 370(1673)20140222

Caulin AF Maley CC 2011 Petorsquos Paradox evolutionrsquos prescription forcancer prevention Trends Ecol Evol (Amst) 26(4)175ndash182

Chapman JA Ho I Sunkara S Luo S Schroth GP Rokhsar DS 2011Meraculous de novo genome assembly with short paired-end readsPLoS One 6(8)e23501

Chittleborough RG 1959 Determination of age in the humpback whaleMegaptera nodosa (Bonnaterre) Aust J Mar Freshw Res10(2)125ndash143

Clapham PJ Mead JG 1999 Megaptera novaeangliae Mamm Species6041ndash9

Clement V Dunand-Sauthier I Clarkson SG 2006 Suppression of UV-induced apoptosis by the human DNA repair protein XPG CellDeath Differ 13(3)478ndash488

Dobin A Davis CA Schlesinger F Drenkow J Zaleski C Jha S Batut PChaisson M Gingeras TR 2013 STAR ultrafast universal RNA-Seqaligner Bioinformatics 29(1)15ndash21

Fay JC Wu C-I 2003 Sequence divergence functional constraint andselection in protein evolution Annu Rev Genomics Hum Genet4213ndash235

Ferris E Abegglen LM Schiffman JD Gregg C 2018 Accelerated evolu-tion in distinctive species reveals candidate elements for clinicallyrelevant traits including mutation and cancer resistance Cell Rep22(10)2742ndash2755

Fog CK Galli GG Lund AH 2012 PRDM proteins important players indifferentiation and disease Bioessays 34(1)50ndash60

Foley NM Hughes GM Huang Z Clarke M Jebb D Whelan CV Petit EJTouzalin F Farcy O Jones G et al 2018 Growing old yet stayingyoung the role of telomeres in batsrsquo exceptional longevity Sci Adv4(2)eaao0926

Foote AD Liu Y Thomas GWC Vinar T Alfoldi J Deng J Dugan Svan Elk CE Hunter ME Joshi V et al 2015 Convergent evolu-tion of the genomes of marine mammals Nat Genet47(3)272ndash275

Forbes SA Beare D Gunasekaran P Leung K Bindal N Boutselakis HDing M Bamford S Cole C Ward S et al 2015 COSMIC exploringthe worldrsquos knowledge of somatic mutations in human cancerNucleic Acids Res 43(D1)D805ndashD811

Fritz SA Bininda-Emonds ORP Purvis A 2009 Geographical variation inpredictors of mammalian extinction risk big is bad but only in thetropics Ecol Lett 12(6)538ndash549

Futreal PA Coin L Marshall M Down T Hubbard T Wooster R RahmanN Stratton MR 2004 A census of human cancer genes Nat RevCancer 4(3)177ndash183

Gabriele C Lockyer C Straley JM Jurasz CM Kato H 2010 Sightinghistory of a naturally marked humpback whale (Megaptera novaean-gliae) suggests ear plug growth layer groups are deposited annuallyMar Mam Sci 26(2)443ndash450

Gauthier JM Metcalfe CD Sears R 1997 Chlorinated organic contam-inants in blubber biopsies from northwestern Atlantic balaenopteridwhales summering in the Gulf of St Lawrence Mar Environ Res44(2)201ndash223

Gene Ontology Consortium 2015 Gene Ontology Consortium goingforward Nucleic Acids Res 43D1049ndashD1056

George JC Bada J Zeh J Scott L Brown SE OrsquoHara T Suydam R 1999Age and growth estimates of bowhead whales (Balaena mysticetus)via aspartic acid racemization Can J Zool 77(4)571ndash580

Gharib WH Robinson-Rechavi M 2013 The branch-site test of positiveselection is surprisingly robust but lacks power under synonymoussubstitution saturation and variation in GC Mol Biol Evol30(7)1675ndash1686

Grabherr MG Haas BJ Yassour M Levin JZ Thompson DA Amit IAdiconis X Fan L Raychowdhury R Zeng Q et al 2011 Full-lengthtranscriptome assembly from RNA-Seq data without a referencegenome Nat Biotechnol 29(7)644ndash652

Guindon S Dufayard J-F Lefort V Anisimova M Hordijk W Gascuel O2010 New algorithms and methods to estimate maximum-likelihood phylogenies assessing the performance of PhyML 30Syst Biol 59(3)307ndash321

Harris RS 2007 Improved pairwise alignment of genomic DNA [PhDthesis] The Pennsylvania State University State College PA USA

Hedges SB Marin J Suleski M Paymer M Kumar S 2015 Tree of lifereveals clock-like speciation and diversification Mol Biol Evol32(4)835ndash845

Heim NA Knope ML Schaal EK Wang SC Payne JL 2015 Copersquos rule inthe evolution of marine animals Science 347(6224)867ndash870

Holt C Yandell M 2011 MAKER2 an annotation pipeline and genome-database management tool for second-generation genome projectsBMC Bioinformatics 12491

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1761

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Huang DW Sherman BT Lempicki RA 2009a Bioinformatics enrich-ment tools paths toward the comprehensive functional analysis oflarge gene lists Nucleic Acids Res 37(1)1ndash13

Huang DW Sherman BT Lempicki RA 2009b Systematic and integra-tive analysis of large gene lists using DAVID bioinformatics resourcesNat Protoc 4(1)44ndash57

Hubisz MJ Pollard KS Siepel A 2011 PHAST and RPHAST phylogeneticanalysis with spacetime models Brief Bioinform 12(1)41ndash51

Hudson RR 1990 Gene genealogies and the coalescent process InFutuyama DJ Antonovics J editors Oxfords surveys in evolutionarybiology Oxford Oxford University Press p 1ndash44

Ichishima H 2016 The ethmoid and presphenoid of cetaceans JMorphol 277(12)1661ndash1674

Jackson JA Baker CS Vant M Steel DJ Medrano-Gonzalez LPalumbi SR 2009 Big and slow phylogenetic estimates of mo-lecular evolution in baleen whales (suborder Mysticeti) MolBiol Evol 26(11)2427ndash2440

Jackson JA Steel DJ Beerli P Congdon BC Olavarria C Leslie MS PomillaC Rosenbaum H Baker CS 2014 Global diversity and oceanic di-vergence of humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 281(1786)20133222

Jones KE Bielby J Cardillo M Fritz SA OrsquoDell J Orme CDL Safi KSechrest W Boakes EH Carbone C et al 2009 PanTHERIA a spe-cies-level database of life history ecology and geography of extantand recently extinct mammals Ecology 90(9)2648ndash2648

Jurka J Kapitonov VV Pavlicek A Klonowski P Kohany O WalichiewiczJ 2005 Repbase Update a database of eukaryotic repetitive ele-ments Cytogenet Genome Res 110(1ndash4)462ndash467

Kaessmann H 2010 Origins evolution and phenotypic impact of newgenes Genome Res 20(10)1313ndash1326

Kastan MB Bartek J 2004 Cell-cycle checkpoints and cancer Nature432(7015)316ndash323

Katona SK Whitehead HP 1981 Identifying humpback whales usingtheir natural markings Polar Rec 20(128)439ndash444

Keane M Semeiks J Webb AE Li YI Quesada V Craig T Madsen LB vanDam S Brawand D Marques PI et al 2015 Insights into the evolu-tion of longevity from the bowhead whale genome Cell Rep10(1)112ndash122

Kent WJ Baertsch R Hinrichs A Miller W Haussler D 2003 Evolutionrsquoscauldron duplication deletion and rearrangement in the mouseand human genomes Proc Natl Acad Sci U S A100(20)11484ndash11489

Kingsolver JG Pfennig DW 2004 Individual-level selection as acause of Copersquos rule of phyletic size increase Evolution58(7)1608ndash1612

Kinsella RJ Keuroaheuroari A Haider S Zamora J Proctor G Spudich G Almeida-King J Staines D Derwent P Kerhornou A et al 2011 EnsemblBioMarts a hub for data retrieval across taxonomic spaceDatabase (Oxford) 2011bar030

Korf I 2004 Gene finding in novel genomes BMC Bioinformatics 559Kumar S Subramanian S 2002 Mutation rates in mammalian genomes

Proc Natl Acad Sci U S A 99(2)803ndash808Lambertsen RH 1987 A biopsy system for large whales and its use for

cytogenetics J Mammal 68(2)443ndash445Langmead B Salzberg SL 2012 Fast gapped-read alignment with Bowtie

2 Nat Methods 9(4)357ndash359Larsen AH Sigurjonsson J Oslashien N Vikingsson G Palsboslashll PJ 1996

Populations genetic analysis of nuclear and mitochondrial loci inskin biopsies collected from central and northeastern NorthAtlantic humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 2631611ndash1618

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357)493ndash496

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup 2009 The Sequence AlignmentMap format andSAMtools Bioinformatics 25(16)2078ndash2079

Martineau D Lemberger K Dallaire A Labelle P Lipscomb TP Michel PMikaelian I 2002 Cancer in wildlife a case study beluga from the St

Lawrence estuary Quebec Canada Environ Health Perspect110(3)285ndash292

Mazet O Rodrıguez W Grusea S Boitard S Chikhi L 2016 On theimportance of being structured instantaneous coalescence ratesand human evolution ndash lesson for ancestral population size infer-ence Heredity 116(4)362ndash371

Michaud EJ Yoder BK 2006 The primary cilium in cell signaling andcancer Cancer Res 66(13)6463ndash6467

Milholland B Dong X Zhang L Hao X Suh Y Vijg J 2017 Differencesbetween germline and somatic mutation rates in humans and miceNat Commun 815183

Mitchell ED 1989 A new cetacean from the Late Eocene La MesetaFormation Seymour Island Antarctic Peninsula Can J Fish Aquat Sci46(12)2219ndash2235

Morgan CC Foster PG Webb AE Pisani D McInerney JO OrsquoConnell MJ2013 Heterogeneous models place the root of the placental mam-mal phylogeny Mol Biol Evol 30(9)2145ndash2156

Nadachowska-Brzyska K Burri R Smeds L Ellegren H 2016 PSMC anal-ysis of effective population sizes in molecular ecology and its appli-cation to black-and-white Ficedula flycatchers Mol Ecol25(5)1058ndash1072

Newman SJ Smith SA 2006 Marine mammal neoplasia a review VetPathol 43(6)865ndash880

Nunney L 2018 Size matters height cell number and a personrsquos risk ofcancer Proc R Soc B 285(1889)20181743

Ohsumi S 1979 Interspecies relationships among some biologicalparameters in cetaceans and estimation of the natural mortalitycoefficient of the Southern Hemisphere minke whales Rep IntWhaling Comm 29397ndash406

OrsquoLeary MA Gatesy J 2008 Impact of increased character sampling onthe phylogeny of Cetartiodactyla (Mammalia) combined analysisincluding fossils Cladistics 24397ndash442

Pagel M Venditti C Meade A 2006 Large punctuational contribution ofspeciation to evolutionary divergence at the molecular level Science314(5796)119ndash121

Palsboslashll PJ Clapham PJ Mattila DK Larsen F Sears R Siegismund HRSigurjonsson J Vasquez O Arctander P 1995 Distribution ofmtDNA haplotypes in North Atlantic humpback whales the influ-ence of behavior on population structure Mar Ecol Prog Ser1161ndash10

Palsboslashll PJ Larsen F Hansen ES 1991 Sampling of skin biopsies fromfree-raging large cetaceans in West Greenland development of newbiopsy tips and bolt designs Int Whaling Comm Spec Issue Ser13311

Palsboslashll PJ Zachariah Peery M Olsen MT Beissinger SR Berube M 2013Inferring recent historic abundance from current genetic diversityMol Ecol 22(1)22ndash40

Parra G Bradnam K Ning Z Keane T Korf I 2009 Assessing the genespace in draft genomes Nucleic Acids Res 37(1)289ndash297

Pertea M Pertea GM Antonescu CM Chang T-C Mendell JT SalzbergSL 2015 StringTie enables improved reconstruction of a transcrip-tome from RNA-Seq reads Nat Biotechnol 33(3)290ndash295

Peto R Roe FJ Lee PN Levy L Clack J 1975 Cancer and ageing in miceand men Br J Cancer 32(4)411ndash426

Pio R Corrales L Lambris JD 2014 The role of complement in tumorgrowth Adv Exp Med Biol 772229ndash262

Pollard KS Hubisz MJ Rosenbloom KR Siepel A 2010 Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 20(1)110ndash121

Putnam NH OrsquoConnell BL Stites JC Rice BJ Blanchette M Calef R TrollCJ Fields A Hartley PD Sugnet CW et al 2016 Chromosome-scaleshotgun assembly using an in vitro method for long-range linkageGenome Res 26(3)342ndash350

Quinlan AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26(6)841ndash842

Ruegg K Rosenbaum HC Anderson EC Engel M Rothschild A Baker CSPalumbi SR 2013 Long-term population size of the North Atlantichumpback whale within the context of worldwide population struc-ture Conserv Genet 14(1)103ndash114

Tollis et al doi101093molbevmsz099 MBE

1762

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Sanderson MJ 2002 Estimating absolute rates of molecular evolutionand divergence times a penalized likelihood approach Mol Biol Evol19(1)101ndash109

Sanderson MJ 2003 r8s inferring absolute rates of molecular evolutionand divergence times in the absence of a molecular clockBioinformatics 19(2)301ndash302

Sayyari E Mirarab S 2016 Fast coalescent-based computation of localbranch support from quartet frequencies Mol Biol Evol33(7)1654ndash1668

Shadat NMA Koide N Khuda II-E Dagvadorj J Tumurkhuu G Naiki YKomatsu T Yoshida T Yokochi T 2010 Retinoblastoma protein-interacting zinc finger 1 (RIZ1) regulates the proliferation of mono-cytic leukemia cells via activation of p53 Cancer Invest28(8)806ndash812

Sim~ao FA Waterhouse RM Ioannidis P Kriventseva EV Zdobnov EM2015 BUSCO assessing genome assembly and annotation com-pleteness with single-copy orthologs Bioinformatics31btv351ndashbtv3212

Slater GJ Goldbogen JA Pyenson ND 2017 Independent evolution ofbaleen whale gigantism linked to Plio-Pleistocene ocean dynamicsProc R Soc B 284(1855)20170546

Smit AFA Hubley RM Green P 2015a RepeatMasker Open-40 2013-2015 Available from httpwwwrepeatmaskerorg

Smit AFA Hubley RM Green P 2015b RepeatModeler Open-10 2008-2015 Available from httpwwwrepeatmaskerorg

Smith TD Allen J Clapham PJ Hammond PS Katona S Larsen F Lien JMattila D Palsboslashll PJ Sigurjonsson J et al 1999 An ocean-basin-widemark-recapture study of the North Atlantic humpback whale(Megaptera novaeangliae) Mar Mamm Sci 51ndash32

Stamatakis A 2014 RAxML version 8 a tool for phylogenetic analysisand post-analysis of large phylogenies Bioinformatics30(9)1312ndash1313

Stanke M Diekhans M Baertsch R Haussler D 2008 Using native andsyntenically mapped cDNA alignments to improve de novo genefinding Bioinformatics 24(5)637ndash644

Steeman ME Hebsgaard MB Fordyce RE Ho SYW Rabosky DL NielsenR Rahbek C Glenner H Soslashrensen MV Willerslev E 2009 Radiationof extant cetaceans driven by restructuring of the oceans Syst Biol58(6)573ndash585

Sulak M Fong L Mika K Chigurupati S Yon L Mongan NP Emes RDLynch VJ 2016 TP53 copy number expansion is associated with theevolution of increased body size and an enhanced DNA damageresponse in elephants Elife 51850

Supek F Bosnjak M Skunca N Smuc T 2011 REVIGO summarizes andvisualizes long lists of gene ontology terms PLoS One 6(7)e21800

Szklarczyk D Morris JH Cook H Kuhn M Wyder S Simonovic M SantosA Doncheva NT Roth A Bork P et al 2017 The STRING database in2017 quality-controlled protein-protein association networks madebroadly accessible Nucleic Acids Res 45(D1)D362ndashD368

Tacutu R Craig T Budovsky A Wuttke D Lehmann G Taranukha DCosta J Fraifeld VE de Magalh~aes JP 2012 Human Ageing GenomicResources integrated databases and tools for the biology and genet-ics of ageing Nucleic Acids Res 41(D1)D1027ndashD1033

Tariq MA Kim HJ Jejelowo O Pourmand N 2011 Whole-transcriptomeRNAseq analysis from minute amount of total RNA Nucleic AcidsRes 39(18)e120

Tarver JE Dos Reis M Mirarab S Moran RJ Parker S OrsquoReilly JE King BLOrsquoConnell MJ Asher RJ Warnow T et al 2016 The interrelationshipsof placental mammals and the limits of phylogenetic inferenceGenome Biol Evol 8(2)330ndash344

Thewissen JGM Cohn MJ Stevens LS Bajpai S Heyning J Horton WE2006 Developmental basis for hind-limb loss in dolphins and origin

of the cetacean bodyplan Proc Natl Acad Sci U S A103(22)8414ndash8418

Thewissen JGM Cooper LN Clementz MT Bajpai S Tiwari BN 2007Whales originated from aquatic artiodactyls in the Eocene epoch ofIndia Nature 450(7173)1190ndash1194

Tollis M Schiffman JD Boddy AM 2017 Evolution of cancer suppressionas revealed by mammalian comparative genomics Curr Opin GenetDev 4240ndash47

Trego KS Groesser T Davalos AR Parplys AC Zhao W Nelson MRHlaing A Shih B Rydberg B Pluth JM et al 2016 Non-catalytic rolesfor XPG with BRCA1 and BRCA2 in homologous recombination andgenome stability Mol Cell 61(4)535ndash546

UniProt Consortium 2015 UniProt a hub for protein informationNucleic Acids Res 43D204ndashD212

Valsecchi E Palsboslashll PJ Hale P Glockner-Ferrari D Ferrari M Clapham PLarsen F Mattila DK Sears R Sigurjonsson J et al 1997 Microsatellitegenetic distances between oceanic opulations of the humpbackwhale (Megaptera novaeangliae) Mol Biol Evol 14355ndash362

Venkat A Hahn MW Thornton JW 2018 Multinucleotide mutationscause false inferences of lineage-specific positive selection Nat EcolEvol 2(8)1280ndash1288

Vermeij GJ 2016 Gigantism and its implications for the history of lifePLoS One 11(1)e0146092

Wang D Zhang Y Zhang Z Zhu J Yu J 2010 KaKs_Calculator 20 atoolkit incorporating gamma-series methods and sliding windowstrategies Genomics Proteomics Bioinformatics 8(1)77ndash80

Warren WC Kuderna L Alexander A Catchen J Perez-Silva JG Lopez-Otın C Quesada V Minx P Tomlinson C Montague MJ et al 2017The novel evolution of the sperm whale genome Genome Biol Evol9(12)3260ndash3264

Webb AE Walsh TA OrsquoConnell MJ 2017 VESPA very large-scale evo-lutionary and selective pressure analyses PeerJ Comput Sci 3e118

Weng M-P Liao B-Y 2017 modPhEA model organism PhenotypeEnrichment Analysis of eukaryotic gene sets Bioinformatics33(21)3505ndash3507

Wilson Sayres MA Venditti C Pagel M Makova KD 2011 Do variationsin substitution rates and male mutation bias correlate with life-history traits A study of 32 mammalian genomes Evolution65(10)2800ndash2815

Yang Z 1998 Likelihood ratio tests for detecting positive selection andapplication to primate lysozyme evolution Mol Biol Evol15(5)568ndash573

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 24(8)1586ndash1591

Yang Z Rannala B 2006 Bayesian estimation of species divergence timesunder a molecular clock using multiple fossil calibrations with softbounds Mol Biol Evol 23(1)212ndash226

Yim H-S Cho YS Guang X Kang SG Jeong J-Y Cha S-S Oh H-M Lee J-HYang EC Kwon KK et al 2014 Minke whale genome and aquaticadaptation in cetaceans Nat Genet 46(1)88ndash92

Yu J Tian S Metheny-Barlow L Chew L-J Hayes AJ Pan H Yu G-L LiL-Y 2001 Modulation of endothelial cell growth arrest andapoptosis by vascular endothelial growth inhibitor Circ Res89(12)1161ndash1167

Zhang C Rabiee M Sayyari E Mirarab S 2018 ASTRAL-III polynomialtime species tree reconstruction from partially resolved gene treesBMC Bioinformatics 19(Suppl 6)153

Zhang Q Edwards SV 2012 The evolution of intron size in amniotes arole for powered flight Genome Biol Evol 4(10)1033ndash1043

Zhao M Kim P Mitra R Zhao J Zhao Z 2015 TSGene 20 an updatedliterature-based knowledgebase for tumor suppressor genes NucleicAcids Res 44gkv1268ndashgkD1031

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1763

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

  • msz099-TF1
  • msz099-TF2
  • msz099-TF3
  • msz099-TF4
  • msz099-TF5
  • msz099-TF6
  • msz099-TF7
  • msz099-TF8
Page 17: University of Groningen Return to the sea, get huge, …...Return to the Sea, Get Huge, Beat Cancer: An Analysis of Cetacean Genomes Including an Assembly for the Humpback Whale (Megaptera

Anisimova M Bielawski JP Yang Z 2001 Accuracy and power of thelikelihood ratio test in detecting adaptive molecular evolution MolBiol Evol 18(8)1585ndash1592

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19(6)950ndash958

Anisimova M Yang Z 2007 Multiple hypothesis testing to detect line-ages under positive selection that affects only a few sites Mol BiolEvol 24(5)1219ndash1228

Arnason U Lammers F Kumar V Nilsson MA Janke A 2018 Whole-genome sequencing of the blue whale and other rorquals findssignatures for introgressive gene flow Sci Adv 4(4)eaap9873

Arnason U Widegren B 1989 Composition and chromosomal localiza-tion of cetacean highly repetitive DNA with special reference to theblue whale Balaenoptera musculus Chromosoma 98(5)323ndash329

Arnold C Nunn CL 2010 Phylogenetic targeting of research effort inevolutionary biology Am Nat 176(5)601ndash612

Baker CS Perry A Bannister JL Weinrich MT Abernethy RBCalambokidis J Lien J Lambertsen RH Ramirez JU Vasquez O1993 Abundant mitochondrial DNA variation and world-wide pop-ulation structure in humpback whales Proc Natl Acad Sci U S A90(17)8239ndash8243

Baker J Meade A Pagel M Venditti C 2015 Adaptive evolution towardlarger size in mammals Proc Natl Acad Sci U S A 112(16)5093ndash5098

Benjamini Y Hochberg Y 1995 Controlling the false discovery rate apractical and powerful approach to multiple testing J R Stat Soc57289ndash300

Benson RBJ Campione NE Carrano MT Mannion PD Sullivan CUpchurch P Evans DC 2014 Rates of dinosaur body mass evolutionindicate 170 million years of sustained ecological innovation on theavian stem lineage PLoS Biol 12(5)e1001853

Benton MJ Donoghue PCJ Asher RJ Friedman M Near TJ Vinther J2015 Constraints on the timescale of animal evolutionary historyPalaeontol Electron 181ndash106

Berta A Sumich JL Kovacs KM 2015 Marine mammals evolutionarybiology 3rd ed Cambridge (MA) Academic Press

Blanchette M Kent WJ Riemer C Elnitski L Smit AFA Roskin KMBaertsch R Rosenbloom K Clawson H Green ED 2004 Aligningmultiple genomic sequences with the threaded blockset alignerGenome Res 14(4)708ndash715

Blankenberg D Taylor J Nekrutenko A Galaxy T 2011 Making wholegenome multiple alignments usable for biologists Bioinformatics27(17)2426ndash2428

Boissinot S Sookdeo A 2016 The evolution of LINE-1 in vertebratesGenome Biol Evol 8(12)3485ndash3507

Bolger AM Lohse M Usadel B 2014 Trimmomatic a flexible trimmer forIllumina sequence data Bioinformatics 30(15)2114ndash2120

Browning CL Wise CF Wise JP 2017 Prolonged particulate chromateexposure does not inhibit homologous recombination repair inNorth Atlantic right whale (Eubalaena glacialis) lung cells ToxicolAppl Pharmacol 33118ndash23

Carbone L Harris RA Gnerre S Veeramah KR Lorente-Galdos BHuddleston J Meyer TJ Herrero J Roos C Aken B et al 2014Gibbon genome and the fast karyotype evolution of small apesNature 513(7517)195ndash201

Caulin AF Graham TA Wang L-S Maley CC 2015 Solutions to Petorsquosparadox revealed by mathematical modelling and cross-species can-cer gene analysis Philos Trans R Soc B 370(1673)20140222

Caulin AF Maley CC 2011 Petorsquos Paradox evolutionrsquos prescription forcancer prevention Trends Ecol Evol (Amst) 26(4)175ndash182

Chapman JA Ho I Sunkara S Luo S Schroth GP Rokhsar DS 2011Meraculous de novo genome assembly with short paired-end readsPLoS One 6(8)e23501

Chittleborough RG 1959 Determination of age in the humpback whaleMegaptera nodosa (Bonnaterre) Aust J Mar Freshw Res10(2)125ndash143

Clapham PJ Mead JG 1999 Megaptera novaeangliae Mamm Species6041ndash9

Clement V Dunand-Sauthier I Clarkson SG 2006 Suppression of UV-induced apoptosis by the human DNA repair protein XPG CellDeath Differ 13(3)478ndash488

Dobin A Davis CA Schlesinger F Drenkow J Zaleski C Jha S Batut PChaisson M Gingeras TR 2013 STAR ultrafast universal RNA-Seqaligner Bioinformatics 29(1)15ndash21

Fay JC Wu C-I 2003 Sequence divergence functional constraint andselection in protein evolution Annu Rev Genomics Hum Genet4213ndash235

Ferris E Abegglen LM Schiffman JD Gregg C 2018 Accelerated evolu-tion in distinctive species reveals candidate elements for clinicallyrelevant traits including mutation and cancer resistance Cell Rep22(10)2742ndash2755

Fog CK Galli GG Lund AH 2012 PRDM proteins important players indifferentiation and disease Bioessays 34(1)50ndash60

Foley NM Hughes GM Huang Z Clarke M Jebb D Whelan CV Petit EJTouzalin F Farcy O Jones G et al 2018 Growing old yet stayingyoung the role of telomeres in batsrsquo exceptional longevity Sci Adv4(2)eaao0926

Foote AD Liu Y Thomas GWC Vinar T Alfoldi J Deng J Dugan Svan Elk CE Hunter ME Joshi V et al 2015 Convergent evolu-tion of the genomes of marine mammals Nat Genet47(3)272ndash275

Forbes SA Beare D Gunasekaran P Leung K Bindal N Boutselakis HDing M Bamford S Cole C Ward S et al 2015 COSMIC exploringthe worldrsquos knowledge of somatic mutations in human cancerNucleic Acids Res 43(D1)D805ndashD811

Fritz SA Bininda-Emonds ORP Purvis A 2009 Geographical variation inpredictors of mammalian extinction risk big is bad but only in thetropics Ecol Lett 12(6)538ndash549

Futreal PA Coin L Marshall M Down T Hubbard T Wooster R RahmanN Stratton MR 2004 A census of human cancer genes Nat RevCancer 4(3)177ndash183

Gabriele C Lockyer C Straley JM Jurasz CM Kato H 2010 Sightinghistory of a naturally marked humpback whale (Megaptera novaean-gliae) suggests ear plug growth layer groups are deposited annuallyMar Mam Sci 26(2)443ndash450

Gauthier JM Metcalfe CD Sears R 1997 Chlorinated organic contam-inants in blubber biopsies from northwestern Atlantic balaenopteridwhales summering in the Gulf of St Lawrence Mar Environ Res44(2)201ndash223

Gene Ontology Consortium 2015 Gene Ontology Consortium goingforward Nucleic Acids Res 43D1049ndashD1056

George JC Bada J Zeh J Scott L Brown SE OrsquoHara T Suydam R 1999Age and growth estimates of bowhead whales (Balaena mysticetus)via aspartic acid racemization Can J Zool 77(4)571ndash580

Gharib WH Robinson-Rechavi M 2013 The branch-site test of positiveselection is surprisingly robust but lacks power under synonymoussubstitution saturation and variation in GC Mol Biol Evol30(7)1675ndash1686

Grabherr MG Haas BJ Yassour M Levin JZ Thompson DA Amit IAdiconis X Fan L Raychowdhury R Zeng Q et al 2011 Full-lengthtranscriptome assembly from RNA-Seq data without a referencegenome Nat Biotechnol 29(7)644ndash652

Guindon S Dufayard J-F Lefort V Anisimova M Hordijk W Gascuel O2010 New algorithms and methods to estimate maximum-likelihood phylogenies assessing the performance of PhyML 30Syst Biol 59(3)307ndash321

Harris RS 2007 Improved pairwise alignment of genomic DNA [PhDthesis] The Pennsylvania State University State College PA USA

Hedges SB Marin J Suleski M Paymer M Kumar S 2015 Tree of lifereveals clock-like speciation and diversification Mol Biol Evol32(4)835ndash845

Heim NA Knope ML Schaal EK Wang SC Payne JL 2015 Copersquos rule inthe evolution of marine animals Science 347(6224)867ndash870

Holt C Yandell M 2011 MAKER2 an annotation pipeline and genome-database management tool for second-generation genome projectsBMC Bioinformatics 12491

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1761

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Huang DW Sherman BT Lempicki RA 2009a Bioinformatics enrich-ment tools paths toward the comprehensive functional analysis oflarge gene lists Nucleic Acids Res 37(1)1ndash13

Huang DW Sherman BT Lempicki RA 2009b Systematic and integra-tive analysis of large gene lists using DAVID bioinformatics resourcesNat Protoc 4(1)44ndash57

Hubisz MJ Pollard KS Siepel A 2011 PHAST and RPHAST phylogeneticanalysis with spacetime models Brief Bioinform 12(1)41ndash51

Hudson RR 1990 Gene genealogies and the coalescent process InFutuyama DJ Antonovics J editors Oxfords surveys in evolutionarybiology Oxford Oxford University Press p 1ndash44

Ichishima H 2016 The ethmoid and presphenoid of cetaceans JMorphol 277(12)1661ndash1674

Jackson JA Baker CS Vant M Steel DJ Medrano-Gonzalez LPalumbi SR 2009 Big and slow phylogenetic estimates of mo-lecular evolution in baleen whales (suborder Mysticeti) MolBiol Evol 26(11)2427ndash2440

Jackson JA Steel DJ Beerli P Congdon BC Olavarria C Leslie MS PomillaC Rosenbaum H Baker CS 2014 Global diversity and oceanic di-vergence of humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 281(1786)20133222

Jones KE Bielby J Cardillo M Fritz SA OrsquoDell J Orme CDL Safi KSechrest W Boakes EH Carbone C et al 2009 PanTHERIA a spe-cies-level database of life history ecology and geography of extantand recently extinct mammals Ecology 90(9)2648ndash2648

Jurka J Kapitonov VV Pavlicek A Klonowski P Kohany O WalichiewiczJ 2005 Repbase Update a database of eukaryotic repetitive ele-ments Cytogenet Genome Res 110(1ndash4)462ndash467

Kaessmann H 2010 Origins evolution and phenotypic impact of newgenes Genome Res 20(10)1313ndash1326

Kastan MB Bartek J 2004 Cell-cycle checkpoints and cancer Nature432(7015)316ndash323

Katona SK Whitehead HP 1981 Identifying humpback whales usingtheir natural markings Polar Rec 20(128)439ndash444

Keane M Semeiks J Webb AE Li YI Quesada V Craig T Madsen LB vanDam S Brawand D Marques PI et al 2015 Insights into the evolu-tion of longevity from the bowhead whale genome Cell Rep10(1)112ndash122

Kent WJ Baertsch R Hinrichs A Miller W Haussler D 2003 Evolutionrsquoscauldron duplication deletion and rearrangement in the mouseand human genomes Proc Natl Acad Sci U S A100(20)11484ndash11489

Kingsolver JG Pfennig DW 2004 Individual-level selection as acause of Copersquos rule of phyletic size increase Evolution58(7)1608ndash1612

Kinsella RJ Keuroaheuroari A Haider S Zamora J Proctor G Spudich G Almeida-King J Staines D Derwent P Kerhornou A et al 2011 EnsemblBioMarts a hub for data retrieval across taxonomic spaceDatabase (Oxford) 2011bar030

Korf I 2004 Gene finding in novel genomes BMC Bioinformatics 559Kumar S Subramanian S 2002 Mutation rates in mammalian genomes

Proc Natl Acad Sci U S A 99(2)803ndash808Lambertsen RH 1987 A biopsy system for large whales and its use for

cytogenetics J Mammal 68(2)443ndash445Langmead B Salzberg SL 2012 Fast gapped-read alignment with Bowtie

2 Nat Methods 9(4)357ndash359Larsen AH Sigurjonsson J Oslashien N Vikingsson G Palsboslashll PJ 1996

Populations genetic analysis of nuclear and mitochondrial loci inskin biopsies collected from central and northeastern NorthAtlantic humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 2631611ndash1618

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357)493ndash496

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup 2009 The Sequence AlignmentMap format andSAMtools Bioinformatics 25(16)2078ndash2079

Martineau D Lemberger K Dallaire A Labelle P Lipscomb TP Michel PMikaelian I 2002 Cancer in wildlife a case study beluga from the St

Lawrence estuary Quebec Canada Environ Health Perspect110(3)285ndash292

Mazet O Rodrıguez W Grusea S Boitard S Chikhi L 2016 On theimportance of being structured instantaneous coalescence ratesand human evolution ndash lesson for ancestral population size infer-ence Heredity 116(4)362ndash371

Michaud EJ Yoder BK 2006 The primary cilium in cell signaling andcancer Cancer Res 66(13)6463ndash6467

Milholland B Dong X Zhang L Hao X Suh Y Vijg J 2017 Differencesbetween germline and somatic mutation rates in humans and miceNat Commun 815183

Mitchell ED 1989 A new cetacean from the Late Eocene La MesetaFormation Seymour Island Antarctic Peninsula Can J Fish Aquat Sci46(12)2219ndash2235

Morgan CC Foster PG Webb AE Pisani D McInerney JO OrsquoConnell MJ2013 Heterogeneous models place the root of the placental mam-mal phylogeny Mol Biol Evol 30(9)2145ndash2156

Nadachowska-Brzyska K Burri R Smeds L Ellegren H 2016 PSMC anal-ysis of effective population sizes in molecular ecology and its appli-cation to black-and-white Ficedula flycatchers Mol Ecol25(5)1058ndash1072

Newman SJ Smith SA 2006 Marine mammal neoplasia a review VetPathol 43(6)865ndash880

Nunney L 2018 Size matters height cell number and a personrsquos risk ofcancer Proc R Soc B 285(1889)20181743

Ohsumi S 1979 Interspecies relationships among some biologicalparameters in cetaceans and estimation of the natural mortalitycoefficient of the Southern Hemisphere minke whales Rep IntWhaling Comm 29397ndash406

OrsquoLeary MA Gatesy J 2008 Impact of increased character sampling onthe phylogeny of Cetartiodactyla (Mammalia) combined analysisincluding fossils Cladistics 24397ndash442

Pagel M Venditti C Meade A 2006 Large punctuational contribution ofspeciation to evolutionary divergence at the molecular level Science314(5796)119ndash121

Palsboslashll PJ Clapham PJ Mattila DK Larsen F Sears R Siegismund HRSigurjonsson J Vasquez O Arctander P 1995 Distribution ofmtDNA haplotypes in North Atlantic humpback whales the influ-ence of behavior on population structure Mar Ecol Prog Ser1161ndash10

Palsboslashll PJ Larsen F Hansen ES 1991 Sampling of skin biopsies fromfree-raging large cetaceans in West Greenland development of newbiopsy tips and bolt designs Int Whaling Comm Spec Issue Ser13311

Palsboslashll PJ Zachariah Peery M Olsen MT Beissinger SR Berube M 2013Inferring recent historic abundance from current genetic diversityMol Ecol 22(1)22ndash40

Parra G Bradnam K Ning Z Keane T Korf I 2009 Assessing the genespace in draft genomes Nucleic Acids Res 37(1)289ndash297

Pertea M Pertea GM Antonescu CM Chang T-C Mendell JT SalzbergSL 2015 StringTie enables improved reconstruction of a transcrip-tome from RNA-Seq reads Nat Biotechnol 33(3)290ndash295

Peto R Roe FJ Lee PN Levy L Clack J 1975 Cancer and ageing in miceand men Br J Cancer 32(4)411ndash426

Pio R Corrales L Lambris JD 2014 The role of complement in tumorgrowth Adv Exp Med Biol 772229ndash262

Pollard KS Hubisz MJ Rosenbloom KR Siepel A 2010 Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 20(1)110ndash121

Putnam NH OrsquoConnell BL Stites JC Rice BJ Blanchette M Calef R TrollCJ Fields A Hartley PD Sugnet CW et al 2016 Chromosome-scaleshotgun assembly using an in vitro method for long-range linkageGenome Res 26(3)342ndash350

Quinlan AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26(6)841ndash842

Ruegg K Rosenbaum HC Anderson EC Engel M Rothschild A Baker CSPalumbi SR 2013 Long-term population size of the North Atlantichumpback whale within the context of worldwide population struc-ture Conserv Genet 14(1)103ndash114

Tollis et al doi101093molbevmsz099 MBE

1762

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Sanderson MJ 2002 Estimating absolute rates of molecular evolutionand divergence times a penalized likelihood approach Mol Biol Evol19(1)101ndash109

Sanderson MJ 2003 r8s inferring absolute rates of molecular evolutionand divergence times in the absence of a molecular clockBioinformatics 19(2)301ndash302

Sayyari E Mirarab S 2016 Fast coalescent-based computation of localbranch support from quartet frequencies Mol Biol Evol33(7)1654ndash1668

Shadat NMA Koide N Khuda II-E Dagvadorj J Tumurkhuu G Naiki YKomatsu T Yoshida T Yokochi T 2010 Retinoblastoma protein-interacting zinc finger 1 (RIZ1) regulates the proliferation of mono-cytic leukemia cells via activation of p53 Cancer Invest28(8)806ndash812

Sim~ao FA Waterhouse RM Ioannidis P Kriventseva EV Zdobnov EM2015 BUSCO assessing genome assembly and annotation com-pleteness with single-copy orthologs Bioinformatics31btv351ndashbtv3212

Slater GJ Goldbogen JA Pyenson ND 2017 Independent evolution ofbaleen whale gigantism linked to Plio-Pleistocene ocean dynamicsProc R Soc B 284(1855)20170546

Smit AFA Hubley RM Green P 2015a RepeatMasker Open-40 2013-2015 Available from httpwwwrepeatmaskerorg

Smit AFA Hubley RM Green P 2015b RepeatModeler Open-10 2008-2015 Available from httpwwwrepeatmaskerorg

Smith TD Allen J Clapham PJ Hammond PS Katona S Larsen F Lien JMattila D Palsboslashll PJ Sigurjonsson J et al 1999 An ocean-basin-widemark-recapture study of the North Atlantic humpback whale(Megaptera novaeangliae) Mar Mamm Sci 51ndash32

Stamatakis A 2014 RAxML version 8 a tool for phylogenetic analysisand post-analysis of large phylogenies Bioinformatics30(9)1312ndash1313

Stanke M Diekhans M Baertsch R Haussler D 2008 Using native andsyntenically mapped cDNA alignments to improve de novo genefinding Bioinformatics 24(5)637ndash644

Steeman ME Hebsgaard MB Fordyce RE Ho SYW Rabosky DL NielsenR Rahbek C Glenner H Soslashrensen MV Willerslev E 2009 Radiationof extant cetaceans driven by restructuring of the oceans Syst Biol58(6)573ndash585

Sulak M Fong L Mika K Chigurupati S Yon L Mongan NP Emes RDLynch VJ 2016 TP53 copy number expansion is associated with theevolution of increased body size and an enhanced DNA damageresponse in elephants Elife 51850

Supek F Bosnjak M Skunca N Smuc T 2011 REVIGO summarizes andvisualizes long lists of gene ontology terms PLoS One 6(7)e21800

Szklarczyk D Morris JH Cook H Kuhn M Wyder S Simonovic M SantosA Doncheva NT Roth A Bork P et al 2017 The STRING database in2017 quality-controlled protein-protein association networks madebroadly accessible Nucleic Acids Res 45(D1)D362ndashD368

Tacutu R Craig T Budovsky A Wuttke D Lehmann G Taranukha DCosta J Fraifeld VE de Magalh~aes JP 2012 Human Ageing GenomicResources integrated databases and tools for the biology and genet-ics of ageing Nucleic Acids Res 41(D1)D1027ndashD1033

Tariq MA Kim HJ Jejelowo O Pourmand N 2011 Whole-transcriptomeRNAseq analysis from minute amount of total RNA Nucleic AcidsRes 39(18)e120

Tarver JE Dos Reis M Mirarab S Moran RJ Parker S OrsquoReilly JE King BLOrsquoConnell MJ Asher RJ Warnow T et al 2016 The interrelationshipsof placental mammals and the limits of phylogenetic inferenceGenome Biol Evol 8(2)330ndash344

Thewissen JGM Cohn MJ Stevens LS Bajpai S Heyning J Horton WE2006 Developmental basis for hind-limb loss in dolphins and origin

of the cetacean bodyplan Proc Natl Acad Sci U S A103(22)8414ndash8418

Thewissen JGM Cooper LN Clementz MT Bajpai S Tiwari BN 2007Whales originated from aquatic artiodactyls in the Eocene epoch ofIndia Nature 450(7173)1190ndash1194

Tollis M Schiffman JD Boddy AM 2017 Evolution of cancer suppressionas revealed by mammalian comparative genomics Curr Opin GenetDev 4240ndash47

Trego KS Groesser T Davalos AR Parplys AC Zhao W Nelson MRHlaing A Shih B Rydberg B Pluth JM et al 2016 Non-catalytic rolesfor XPG with BRCA1 and BRCA2 in homologous recombination andgenome stability Mol Cell 61(4)535ndash546

UniProt Consortium 2015 UniProt a hub for protein informationNucleic Acids Res 43D204ndashD212

Valsecchi E Palsboslashll PJ Hale P Glockner-Ferrari D Ferrari M Clapham PLarsen F Mattila DK Sears R Sigurjonsson J et al 1997 Microsatellitegenetic distances between oceanic opulations of the humpbackwhale (Megaptera novaeangliae) Mol Biol Evol 14355ndash362

Venkat A Hahn MW Thornton JW 2018 Multinucleotide mutationscause false inferences of lineage-specific positive selection Nat EcolEvol 2(8)1280ndash1288

Vermeij GJ 2016 Gigantism and its implications for the history of lifePLoS One 11(1)e0146092

Wang D Zhang Y Zhang Z Zhu J Yu J 2010 KaKs_Calculator 20 atoolkit incorporating gamma-series methods and sliding windowstrategies Genomics Proteomics Bioinformatics 8(1)77ndash80

Warren WC Kuderna L Alexander A Catchen J Perez-Silva JG Lopez-Otın C Quesada V Minx P Tomlinson C Montague MJ et al 2017The novel evolution of the sperm whale genome Genome Biol Evol9(12)3260ndash3264

Webb AE Walsh TA OrsquoConnell MJ 2017 VESPA very large-scale evo-lutionary and selective pressure analyses PeerJ Comput Sci 3e118

Weng M-P Liao B-Y 2017 modPhEA model organism PhenotypeEnrichment Analysis of eukaryotic gene sets Bioinformatics33(21)3505ndash3507

Wilson Sayres MA Venditti C Pagel M Makova KD 2011 Do variationsin substitution rates and male mutation bias correlate with life-history traits A study of 32 mammalian genomes Evolution65(10)2800ndash2815

Yang Z 1998 Likelihood ratio tests for detecting positive selection andapplication to primate lysozyme evolution Mol Biol Evol15(5)568ndash573

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 24(8)1586ndash1591

Yang Z Rannala B 2006 Bayesian estimation of species divergence timesunder a molecular clock using multiple fossil calibrations with softbounds Mol Biol Evol 23(1)212ndash226

Yim H-S Cho YS Guang X Kang SG Jeong J-Y Cha S-S Oh H-M Lee J-HYang EC Kwon KK et al 2014 Minke whale genome and aquaticadaptation in cetaceans Nat Genet 46(1)88ndash92

Yu J Tian S Metheny-Barlow L Chew L-J Hayes AJ Pan H Yu G-L LiL-Y 2001 Modulation of endothelial cell growth arrest andapoptosis by vascular endothelial growth inhibitor Circ Res89(12)1161ndash1167

Zhang C Rabiee M Sayyari E Mirarab S 2018 ASTRAL-III polynomialtime species tree reconstruction from partially resolved gene treesBMC Bioinformatics 19(Suppl 6)153

Zhang Q Edwards SV 2012 The evolution of intron size in amniotes arole for powered flight Genome Biol Evol 4(10)1033ndash1043

Zhao M Kim P Mitra R Zhao J Zhao Z 2015 TSGene 20 an updatedliterature-based knowledgebase for tumor suppressor genes NucleicAcids Res 44gkv1268ndashgkD1031

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1763

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

  • msz099-TF1
  • msz099-TF2
  • msz099-TF3
  • msz099-TF4
  • msz099-TF5
  • msz099-TF6
  • msz099-TF7
  • msz099-TF8
Page 18: University of Groningen Return to the sea, get huge, …...Return to the Sea, Get Huge, Beat Cancer: An Analysis of Cetacean Genomes Including an Assembly for the Humpback Whale (Megaptera

Huang DW Sherman BT Lempicki RA 2009a Bioinformatics enrich-ment tools paths toward the comprehensive functional analysis oflarge gene lists Nucleic Acids Res 37(1)1ndash13

Huang DW Sherman BT Lempicki RA 2009b Systematic and integra-tive analysis of large gene lists using DAVID bioinformatics resourcesNat Protoc 4(1)44ndash57

Hubisz MJ Pollard KS Siepel A 2011 PHAST and RPHAST phylogeneticanalysis with spacetime models Brief Bioinform 12(1)41ndash51

Hudson RR 1990 Gene genealogies and the coalescent process InFutuyama DJ Antonovics J editors Oxfords surveys in evolutionarybiology Oxford Oxford University Press p 1ndash44

Ichishima H 2016 The ethmoid and presphenoid of cetaceans JMorphol 277(12)1661ndash1674

Jackson JA Baker CS Vant M Steel DJ Medrano-Gonzalez LPalumbi SR 2009 Big and slow phylogenetic estimates of mo-lecular evolution in baleen whales (suborder Mysticeti) MolBiol Evol 26(11)2427ndash2440

Jackson JA Steel DJ Beerli P Congdon BC Olavarria C Leslie MS PomillaC Rosenbaum H Baker CS 2014 Global diversity and oceanic di-vergence of humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 281(1786)20133222

Jones KE Bielby J Cardillo M Fritz SA OrsquoDell J Orme CDL Safi KSechrest W Boakes EH Carbone C et al 2009 PanTHERIA a spe-cies-level database of life history ecology and geography of extantand recently extinct mammals Ecology 90(9)2648ndash2648

Jurka J Kapitonov VV Pavlicek A Klonowski P Kohany O WalichiewiczJ 2005 Repbase Update a database of eukaryotic repetitive ele-ments Cytogenet Genome Res 110(1ndash4)462ndash467

Kaessmann H 2010 Origins evolution and phenotypic impact of newgenes Genome Res 20(10)1313ndash1326

Kastan MB Bartek J 2004 Cell-cycle checkpoints and cancer Nature432(7015)316ndash323

Katona SK Whitehead HP 1981 Identifying humpback whales usingtheir natural markings Polar Rec 20(128)439ndash444

Keane M Semeiks J Webb AE Li YI Quesada V Craig T Madsen LB vanDam S Brawand D Marques PI et al 2015 Insights into the evolu-tion of longevity from the bowhead whale genome Cell Rep10(1)112ndash122

Kent WJ Baertsch R Hinrichs A Miller W Haussler D 2003 Evolutionrsquoscauldron duplication deletion and rearrangement in the mouseand human genomes Proc Natl Acad Sci U S A100(20)11484ndash11489

Kingsolver JG Pfennig DW 2004 Individual-level selection as acause of Copersquos rule of phyletic size increase Evolution58(7)1608ndash1612

Kinsella RJ Keuroaheuroari A Haider S Zamora J Proctor G Spudich G Almeida-King J Staines D Derwent P Kerhornou A et al 2011 EnsemblBioMarts a hub for data retrieval across taxonomic spaceDatabase (Oxford) 2011bar030

Korf I 2004 Gene finding in novel genomes BMC Bioinformatics 559Kumar S Subramanian S 2002 Mutation rates in mammalian genomes

Proc Natl Acad Sci U S A 99(2)803ndash808Lambertsen RH 1987 A biopsy system for large whales and its use for

cytogenetics J Mammal 68(2)443ndash445Langmead B Salzberg SL 2012 Fast gapped-read alignment with Bowtie

2 Nat Methods 9(4)357ndash359Larsen AH Sigurjonsson J Oslashien N Vikingsson G Palsboslashll PJ 1996

Populations genetic analysis of nuclear and mitochondrial loci inskin biopsies collected from central and northeastern NorthAtlantic humpback whales (Megaptera novaeangliae) Proc R SocLond B Biol Sci 2631611ndash1618

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357)493ndash496

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup 2009 The Sequence AlignmentMap format andSAMtools Bioinformatics 25(16)2078ndash2079

Martineau D Lemberger K Dallaire A Labelle P Lipscomb TP Michel PMikaelian I 2002 Cancer in wildlife a case study beluga from the St

Lawrence estuary Quebec Canada Environ Health Perspect110(3)285ndash292

Mazet O Rodrıguez W Grusea S Boitard S Chikhi L 2016 On theimportance of being structured instantaneous coalescence ratesand human evolution ndash lesson for ancestral population size infer-ence Heredity 116(4)362ndash371

Michaud EJ Yoder BK 2006 The primary cilium in cell signaling andcancer Cancer Res 66(13)6463ndash6467

Milholland B Dong X Zhang L Hao X Suh Y Vijg J 2017 Differencesbetween germline and somatic mutation rates in humans and miceNat Commun 815183

Mitchell ED 1989 A new cetacean from the Late Eocene La MesetaFormation Seymour Island Antarctic Peninsula Can J Fish Aquat Sci46(12)2219ndash2235

Morgan CC Foster PG Webb AE Pisani D McInerney JO OrsquoConnell MJ2013 Heterogeneous models place the root of the placental mam-mal phylogeny Mol Biol Evol 30(9)2145ndash2156

Nadachowska-Brzyska K Burri R Smeds L Ellegren H 2016 PSMC anal-ysis of effective population sizes in molecular ecology and its appli-cation to black-and-white Ficedula flycatchers Mol Ecol25(5)1058ndash1072

Newman SJ Smith SA 2006 Marine mammal neoplasia a review VetPathol 43(6)865ndash880

Nunney L 2018 Size matters height cell number and a personrsquos risk ofcancer Proc R Soc B 285(1889)20181743

Ohsumi S 1979 Interspecies relationships among some biologicalparameters in cetaceans and estimation of the natural mortalitycoefficient of the Southern Hemisphere minke whales Rep IntWhaling Comm 29397ndash406

OrsquoLeary MA Gatesy J 2008 Impact of increased character sampling onthe phylogeny of Cetartiodactyla (Mammalia) combined analysisincluding fossils Cladistics 24397ndash442

Pagel M Venditti C Meade A 2006 Large punctuational contribution ofspeciation to evolutionary divergence at the molecular level Science314(5796)119ndash121

Palsboslashll PJ Clapham PJ Mattila DK Larsen F Sears R Siegismund HRSigurjonsson J Vasquez O Arctander P 1995 Distribution ofmtDNA haplotypes in North Atlantic humpback whales the influ-ence of behavior on population structure Mar Ecol Prog Ser1161ndash10

Palsboslashll PJ Larsen F Hansen ES 1991 Sampling of skin biopsies fromfree-raging large cetaceans in West Greenland development of newbiopsy tips and bolt designs Int Whaling Comm Spec Issue Ser13311

Palsboslashll PJ Zachariah Peery M Olsen MT Beissinger SR Berube M 2013Inferring recent historic abundance from current genetic diversityMol Ecol 22(1)22ndash40

Parra G Bradnam K Ning Z Keane T Korf I 2009 Assessing the genespace in draft genomes Nucleic Acids Res 37(1)289ndash297

Pertea M Pertea GM Antonescu CM Chang T-C Mendell JT SalzbergSL 2015 StringTie enables improved reconstruction of a transcrip-tome from RNA-Seq reads Nat Biotechnol 33(3)290ndash295

Peto R Roe FJ Lee PN Levy L Clack J 1975 Cancer and ageing in miceand men Br J Cancer 32(4)411ndash426

Pio R Corrales L Lambris JD 2014 The role of complement in tumorgrowth Adv Exp Med Biol 772229ndash262

Pollard KS Hubisz MJ Rosenbloom KR Siepel A 2010 Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 20(1)110ndash121

Putnam NH OrsquoConnell BL Stites JC Rice BJ Blanchette M Calef R TrollCJ Fields A Hartley PD Sugnet CW et al 2016 Chromosome-scaleshotgun assembly using an in vitro method for long-range linkageGenome Res 26(3)342ndash350

Quinlan AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26(6)841ndash842

Ruegg K Rosenbaum HC Anderson EC Engel M Rothschild A Baker CSPalumbi SR 2013 Long-term population size of the North Atlantichumpback whale within the context of worldwide population struc-ture Conserv Genet 14(1)103ndash114

Tollis et al doi101093molbevmsz099 MBE

1762

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

Sanderson MJ 2002 Estimating absolute rates of molecular evolutionand divergence times a penalized likelihood approach Mol Biol Evol19(1)101ndash109

Sanderson MJ 2003 r8s inferring absolute rates of molecular evolutionand divergence times in the absence of a molecular clockBioinformatics 19(2)301ndash302

Sayyari E Mirarab S 2016 Fast coalescent-based computation of localbranch support from quartet frequencies Mol Biol Evol33(7)1654ndash1668

Shadat NMA Koide N Khuda II-E Dagvadorj J Tumurkhuu G Naiki YKomatsu T Yoshida T Yokochi T 2010 Retinoblastoma protein-interacting zinc finger 1 (RIZ1) regulates the proliferation of mono-cytic leukemia cells via activation of p53 Cancer Invest28(8)806ndash812

Sim~ao FA Waterhouse RM Ioannidis P Kriventseva EV Zdobnov EM2015 BUSCO assessing genome assembly and annotation com-pleteness with single-copy orthologs Bioinformatics31btv351ndashbtv3212

Slater GJ Goldbogen JA Pyenson ND 2017 Independent evolution ofbaleen whale gigantism linked to Plio-Pleistocene ocean dynamicsProc R Soc B 284(1855)20170546

Smit AFA Hubley RM Green P 2015a RepeatMasker Open-40 2013-2015 Available from httpwwwrepeatmaskerorg

Smit AFA Hubley RM Green P 2015b RepeatModeler Open-10 2008-2015 Available from httpwwwrepeatmaskerorg

Smith TD Allen J Clapham PJ Hammond PS Katona S Larsen F Lien JMattila D Palsboslashll PJ Sigurjonsson J et al 1999 An ocean-basin-widemark-recapture study of the North Atlantic humpback whale(Megaptera novaeangliae) Mar Mamm Sci 51ndash32

Stamatakis A 2014 RAxML version 8 a tool for phylogenetic analysisand post-analysis of large phylogenies Bioinformatics30(9)1312ndash1313

Stanke M Diekhans M Baertsch R Haussler D 2008 Using native andsyntenically mapped cDNA alignments to improve de novo genefinding Bioinformatics 24(5)637ndash644

Steeman ME Hebsgaard MB Fordyce RE Ho SYW Rabosky DL NielsenR Rahbek C Glenner H Soslashrensen MV Willerslev E 2009 Radiationof extant cetaceans driven by restructuring of the oceans Syst Biol58(6)573ndash585

Sulak M Fong L Mika K Chigurupati S Yon L Mongan NP Emes RDLynch VJ 2016 TP53 copy number expansion is associated with theevolution of increased body size and an enhanced DNA damageresponse in elephants Elife 51850

Supek F Bosnjak M Skunca N Smuc T 2011 REVIGO summarizes andvisualizes long lists of gene ontology terms PLoS One 6(7)e21800

Szklarczyk D Morris JH Cook H Kuhn M Wyder S Simonovic M SantosA Doncheva NT Roth A Bork P et al 2017 The STRING database in2017 quality-controlled protein-protein association networks madebroadly accessible Nucleic Acids Res 45(D1)D362ndashD368

Tacutu R Craig T Budovsky A Wuttke D Lehmann G Taranukha DCosta J Fraifeld VE de Magalh~aes JP 2012 Human Ageing GenomicResources integrated databases and tools for the biology and genet-ics of ageing Nucleic Acids Res 41(D1)D1027ndashD1033

Tariq MA Kim HJ Jejelowo O Pourmand N 2011 Whole-transcriptomeRNAseq analysis from minute amount of total RNA Nucleic AcidsRes 39(18)e120

Tarver JE Dos Reis M Mirarab S Moran RJ Parker S OrsquoReilly JE King BLOrsquoConnell MJ Asher RJ Warnow T et al 2016 The interrelationshipsof placental mammals and the limits of phylogenetic inferenceGenome Biol Evol 8(2)330ndash344

Thewissen JGM Cohn MJ Stevens LS Bajpai S Heyning J Horton WE2006 Developmental basis for hind-limb loss in dolphins and origin

of the cetacean bodyplan Proc Natl Acad Sci U S A103(22)8414ndash8418

Thewissen JGM Cooper LN Clementz MT Bajpai S Tiwari BN 2007Whales originated from aquatic artiodactyls in the Eocene epoch ofIndia Nature 450(7173)1190ndash1194

Tollis M Schiffman JD Boddy AM 2017 Evolution of cancer suppressionas revealed by mammalian comparative genomics Curr Opin GenetDev 4240ndash47

Trego KS Groesser T Davalos AR Parplys AC Zhao W Nelson MRHlaing A Shih B Rydberg B Pluth JM et al 2016 Non-catalytic rolesfor XPG with BRCA1 and BRCA2 in homologous recombination andgenome stability Mol Cell 61(4)535ndash546

UniProt Consortium 2015 UniProt a hub for protein informationNucleic Acids Res 43D204ndashD212

Valsecchi E Palsboslashll PJ Hale P Glockner-Ferrari D Ferrari M Clapham PLarsen F Mattila DK Sears R Sigurjonsson J et al 1997 Microsatellitegenetic distances between oceanic opulations of the humpbackwhale (Megaptera novaeangliae) Mol Biol Evol 14355ndash362

Venkat A Hahn MW Thornton JW 2018 Multinucleotide mutationscause false inferences of lineage-specific positive selection Nat EcolEvol 2(8)1280ndash1288

Vermeij GJ 2016 Gigantism and its implications for the history of lifePLoS One 11(1)e0146092

Wang D Zhang Y Zhang Z Zhu J Yu J 2010 KaKs_Calculator 20 atoolkit incorporating gamma-series methods and sliding windowstrategies Genomics Proteomics Bioinformatics 8(1)77ndash80

Warren WC Kuderna L Alexander A Catchen J Perez-Silva JG Lopez-Otın C Quesada V Minx P Tomlinson C Montague MJ et al 2017The novel evolution of the sperm whale genome Genome Biol Evol9(12)3260ndash3264

Webb AE Walsh TA OrsquoConnell MJ 2017 VESPA very large-scale evo-lutionary and selective pressure analyses PeerJ Comput Sci 3e118

Weng M-P Liao B-Y 2017 modPhEA model organism PhenotypeEnrichment Analysis of eukaryotic gene sets Bioinformatics33(21)3505ndash3507

Wilson Sayres MA Venditti C Pagel M Makova KD 2011 Do variationsin substitution rates and male mutation bias correlate with life-history traits A study of 32 mammalian genomes Evolution65(10)2800ndash2815

Yang Z 1998 Likelihood ratio tests for detecting positive selection andapplication to primate lysozyme evolution Mol Biol Evol15(5)568ndash573

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 24(8)1586ndash1591

Yang Z Rannala B 2006 Bayesian estimation of species divergence timesunder a molecular clock using multiple fossil calibrations with softbounds Mol Biol Evol 23(1)212ndash226

Yim H-S Cho YS Guang X Kang SG Jeong J-Y Cha S-S Oh H-M Lee J-HYang EC Kwon KK et al 2014 Minke whale genome and aquaticadaptation in cetaceans Nat Genet 46(1)88ndash92

Yu J Tian S Metheny-Barlow L Chew L-J Hayes AJ Pan H Yu G-L LiL-Y 2001 Modulation of endothelial cell growth arrest andapoptosis by vascular endothelial growth inhibitor Circ Res89(12)1161ndash1167

Zhang C Rabiee M Sayyari E Mirarab S 2018 ASTRAL-III polynomialtime species tree reconstruction from partially resolved gene treesBMC Bioinformatics 19(Suppl 6)153

Zhang Q Edwards SV 2012 The evolution of intron size in amniotes arole for powered flight Genome Biol Evol 4(10)1033ndash1043

Zhao M Kim P Mitra R Zhao J Zhao Z 2015 TSGene 20 an updatedliterature-based knowledgebase for tumor suppressor genes NucleicAcids Res 44gkv1268ndashgkD1031

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1763

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

  • msz099-TF1
  • msz099-TF2
  • msz099-TF3
  • msz099-TF4
  • msz099-TF5
  • msz099-TF6
  • msz099-TF7
  • msz099-TF8
Page 19: University of Groningen Return to the sea, get huge, …...Return to the Sea, Get Huge, Beat Cancer: An Analysis of Cetacean Genomes Including an Assembly for the Humpback Whale (Megaptera

Sanderson MJ 2002 Estimating absolute rates of molecular evolutionand divergence times a penalized likelihood approach Mol Biol Evol19(1)101ndash109

Sanderson MJ 2003 r8s inferring absolute rates of molecular evolutionand divergence times in the absence of a molecular clockBioinformatics 19(2)301ndash302

Sayyari E Mirarab S 2016 Fast coalescent-based computation of localbranch support from quartet frequencies Mol Biol Evol33(7)1654ndash1668

Shadat NMA Koide N Khuda II-E Dagvadorj J Tumurkhuu G Naiki YKomatsu T Yoshida T Yokochi T 2010 Retinoblastoma protein-interacting zinc finger 1 (RIZ1) regulates the proliferation of mono-cytic leukemia cells via activation of p53 Cancer Invest28(8)806ndash812

Sim~ao FA Waterhouse RM Ioannidis P Kriventseva EV Zdobnov EM2015 BUSCO assessing genome assembly and annotation com-pleteness with single-copy orthologs Bioinformatics31btv351ndashbtv3212

Slater GJ Goldbogen JA Pyenson ND 2017 Independent evolution ofbaleen whale gigantism linked to Plio-Pleistocene ocean dynamicsProc R Soc B 284(1855)20170546

Smit AFA Hubley RM Green P 2015a RepeatMasker Open-40 2013-2015 Available from httpwwwrepeatmaskerorg

Smit AFA Hubley RM Green P 2015b RepeatModeler Open-10 2008-2015 Available from httpwwwrepeatmaskerorg

Smith TD Allen J Clapham PJ Hammond PS Katona S Larsen F Lien JMattila D Palsboslashll PJ Sigurjonsson J et al 1999 An ocean-basin-widemark-recapture study of the North Atlantic humpback whale(Megaptera novaeangliae) Mar Mamm Sci 51ndash32

Stamatakis A 2014 RAxML version 8 a tool for phylogenetic analysisand post-analysis of large phylogenies Bioinformatics30(9)1312ndash1313

Stanke M Diekhans M Baertsch R Haussler D 2008 Using native andsyntenically mapped cDNA alignments to improve de novo genefinding Bioinformatics 24(5)637ndash644

Steeman ME Hebsgaard MB Fordyce RE Ho SYW Rabosky DL NielsenR Rahbek C Glenner H Soslashrensen MV Willerslev E 2009 Radiationof extant cetaceans driven by restructuring of the oceans Syst Biol58(6)573ndash585

Sulak M Fong L Mika K Chigurupati S Yon L Mongan NP Emes RDLynch VJ 2016 TP53 copy number expansion is associated with theevolution of increased body size and an enhanced DNA damageresponse in elephants Elife 51850

Supek F Bosnjak M Skunca N Smuc T 2011 REVIGO summarizes andvisualizes long lists of gene ontology terms PLoS One 6(7)e21800

Szklarczyk D Morris JH Cook H Kuhn M Wyder S Simonovic M SantosA Doncheva NT Roth A Bork P et al 2017 The STRING database in2017 quality-controlled protein-protein association networks madebroadly accessible Nucleic Acids Res 45(D1)D362ndashD368

Tacutu R Craig T Budovsky A Wuttke D Lehmann G Taranukha DCosta J Fraifeld VE de Magalh~aes JP 2012 Human Ageing GenomicResources integrated databases and tools for the biology and genet-ics of ageing Nucleic Acids Res 41(D1)D1027ndashD1033

Tariq MA Kim HJ Jejelowo O Pourmand N 2011 Whole-transcriptomeRNAseq analysis from minute amount of total RNA Nucleic AcidsRes 39(18)e120

Tarver JE Dos Reis M Mirarab S Moran RJ Parker S OrsquoReilly JE King BLOrsquoConnell MJ Asher RJ Warnow T et al 2016 The interrelationshipsof placental mammals and the limits of phylogenetic inferenceGenome Biol Evol 8(2)330ndash344

Thewissen JGM Cohn MJ Stevens LS Bajpai S Heyning J Horton WE2006 Developmental basis for hind-limb loss in dolphins and origin

of the cetacean bodyplan Proc Natl Acad Sci U S A103(22)8414ndash8418

Thewissen JGM Cooper LN Clementz MT Bajpai S Tiwari BN 2007Whales originated from aquatic artiodactyls in the Eocene epoch ofIndia Nature 450(7173)1190ndash1194

Tollis M Schiffman JD Boddy AM 2017 Evolution of cancer suppressionas revealed by mammalian comparative genomics Curr Opin GenetDev 4240ndash47

Trego KS Groesser T Davalos AR Parplys AC Zhao W Nelson MRHlaing A Shih B Rydberg B Pluth JM et al 2016 Non-catalytic rolesfor XPG with BRCA1 and BRCA2 in homologous recombination andgenome stability Mol Cell 61(4)535ndash546

UniProt Consortium 2015 UniProt a hub for protein informationNucleic Acids Res 43D204ndashD212

Valsecchi E Palsboslashll PJ Hale P Glockner-Ferrari D Ferrari M Clapham PLarsen F Mattila DK Sears R Sigurjonsson J et al 1997 Microsatellitegenetic distances between oceanic opulations of the humpbackwhale (Megaptera novaeangliae) Mol Biol Evol 14355ndash362

Venkat A Hahn MW Thornton JW 2018 Multinucleotide mutationscause false inferences of lineage-specific positive selection Nat EcolEvol 2(8)1280ndash1288

Vermeij GJ 2016 Gigantism and its implications for the history of lifePLoS One 11(1)e0146092

Wang D Zhang Y Zhang Z Zhu J Yu J 2010 KaKs_Calculator 20 atoolkit incorporating gamma-series methods and sliding windowstrategies Genomics Proteomics Bioinformatics 8(1)77ndash80

Warren WC Kuderna L Alexander A Catchen J Perez-Silva JG Lopez-Otın C Quesada V Minx P Tomlinson C Montague MJ et al 2017The novel evolution of the sperm whale genome Genome Biol Evol9(12)3260ndash3264

Webb AE Walsh TA OrsquoConnell MJ 2017 VESPA very large-scale evo-lutionary and selective pressure analyses PeerJ Comput Sci 3e118

Weng M-P Liao B-Y 2017 modPhEA model organism PhenotypeEnrichment Analysis of eukaryotic gene sets Bioinformatics33(21)3505ndash3507

Wilson Sayres MA Venditti C Pagel M Makova KD 2011 Do variationsin substitution rates and male mutation bias correlate with life-history traits A study of 32 mammalian genomes Evolution65(10)2800ndash2815

Yang Z 1998 Likelihood ratio tests for detecting positive selection andapplication to primate lysozyme evolution Mol Biol Evol15(5)568ndash573

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 24(8)1586ndash1591

Yang Z Rannala B 2006 Bayesian estimation of species divergence timesunder a molecular clock using multiple fossil calibrations with softbounds Mol Biol Evol 23(1)212ndash226

Yim H-S Cho YS Guang X Kang SG Jeong J-Y Cha S-S Oh H-M Lee J-HYang EC Kwon KK et al 2014 Minke whale genome and aquaticadaptation in cetaceans Nat Genet 46(1)88ndash92

Yu J Tian S Metheny-Barlow L Chew L-J Hayes AJ Pan H Yu G-L LiL-Y 2001 Modulation of endothelial cell growth arrest andapoptosis by vascular endothelial growth inhibitor Circ Res89(12)1161ndash1167

Zhang C Rabiee M Sayyari E Mirarab S 2018 ASTRAL-III polynomialtime species tree reconstruction from partially resolved gene treesBMC Bioinformatics 19(Suppl 6)153

Zhang Q Edwards SV 2012 The evolution of intron size in amniotes arole for powered flight Genome Biol Evol 4(10)1033ndash1043

Zhao M Kim P Mitra R Zhao J Zhao Z 2015 TSGene 20 an updatedliterature-based knowledgebase for tumor suppressor genes NucleicAcids Res 44gkv1268ndashgkD1031

Return to the Sea Get Huge Beat Cancer doi101093molbevmsz099 MBE

1763

Dow

nloaded from httpsacadem

icoupcomm

bearticle-abstract36817465485251 by University of G

roningen user on 17 October 2019

  • msz099-TF1
  • msz099-TF2
  • msz099-TF3
  • msz099-TF4
  • msz099-TF5
  • msz099-TF6
  • msz099-TF7
  • msz099-TF8