International Journal of Genomics The Promise of Genomic Studies on Human Diseases: From Basic Science to Clinical Application Guest Editors: Lam C. Tsoi, Bethany Wolf, and Y. Ann Chen

  • Copyright © 2017 Hindawi Publishing Corporation. All rights reserved.

    This is a special issue published in “International Journal of Genomics.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

  • Editorial Board

    Jacques Camonis, France
    Prabhakara V. Choudary, USA
    Martine A. Collart, Switzerland
    Marco Gerdol, Italy
    Soraya E. Gutierrez, Chile
    M. Hadzopoulou-Cladaras, Greece
    Sylvia Hagemann, Austria
    Henry Heng, USA
    Eivind Hovig, Norway
    Giuliana Napolitano, Italy
    Ferenc Olasz, Hungary
    Elena Pasyukova, Russia
    Graziano Pesole, Italy
    Giulia Piaggio, Italy
    Mohamed Salem, USA
    Brian Wigdahl, USA
    Jinfa Zhang, USA

  • Contents

    The Promise of Genomic Studies on Human Diseases: From Basic Science to Clinical Application
    Lam C. Tsoi, Bethany Wolf, and Y. Ann Chen
    Volume 2017, Article ID 5093167, 2 pages

    A Review of Recent Advancement in Integrating Omics Data with Literature Mining towards Biomedical Discoveries
    Kalpana Raja, Matthew Patrick, Yilin Gao, Desmond Madu, Yuyang Yang, and Lam C. Tsoi
    Volume 2017, Article ID 6213474, 10 pages

    Integrating Biological Covariates into Gene Expression-Based Predictors of Radiation Sensitivity
    Vidya P. Kamath, Javier F. Torres-Roca, and Steven A. Eschrich
    Volume 2017, Article ID 6576840, 9 pages

    Characteristics and Validation Techniques for PCA-Based Gene-Expression Signatures
    Anders E. Berglund, Eric A. Welsh, and Steven A. Eschrich
    Volume 2017, Article ID 2354564, 13 pages

    Module Anchored Network Inference: A Sequential Module-Based Approach to Novel Gene Network Construction from Genomic Expression Data on Human Disease Mechanism
    Annamalai Muthiah, Susanna R. Keller, and Jae K. Lee
    Volume 2017, Article ID 8514071, 9 pages

    A Survey of Computational Tools to Analyze and Interpret Whole Exome Sequencing Data
    Jennifer D. Hintzsche, William A. Robinson, and Aik Choon Tan
    Volume 2016, Article ID 7983236, 16 pages

    GPA-MDS: A Visualization Approach to Investigate Genetic Architecture among Phenotypes Using GWAS Results
    Wei Wei, Paula S. Ramos, Kelly J. Hunt, Bethany J. Wolf, Gary Hardiman, and Dongjun Chung
    Volume 2016, Article ID 6589843, 6 pages

    Embracing Integrative Multiomics Approaches
    Daniel M. Rotroff and Alison A. Motsinger-Reif
    Volume 2016, Article ID 1715985, 5 pages

    Clinical Application of a Modular Genomics Technique in Systemic Lupus Erythematosus: Progress towards Precision Medicine
    Eric Zollars, Sean M. Courtney, Bethany J. Wolf, Norm Allaire, Ann Ranger, Gary Hardiman, and Michelle Petri
    Volume 2016, Article ID 7862962, 7 pages

  • Editorial
    The Promise of Genomic Studies on Human Diseases: From Basic Science to Clinical Application

    Lam C. Tsoi,1,2,3 Bethany Wolf,4 and Y. Ann Chen5

    1Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
    2Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, USA
    3Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
    4Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, USA
    5Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA

    Correspondence should be addressed to Lam C. Tsoi; [email protected]

    Received 12 March 2017; Accepted 12 March 2017; Published 29 March 2017

    Copyright © 2017 Lam C. Tsoi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

    1. Introduction

    Advances in biotechnologies and the efficiency of computational resources have provided unprecedented opportunities to study and analyze the genomics of human diseases. Over the last decade, high-throughput experiments studying -omics (e.g., genetics, epigenetics, or transcriptomics) have been used to generate informative data that researchers can use to test different data-driven hypotheses. A big promise of such high-dimensional -omics data is the advancement of biomedicine by effectively translating findings from basic science research into clinical application. Genomic experiments in biomedical research are designed and conducted to enhance the diagnosis, treatment, and prevention of human diseases. Translating -omics findings into clinical practice requires a flexible framework to incorporate different -omics data types to predict clinical outcomes in an integrated fashion.

    2. Data Analysis

    Developing rigorous statistical approaches and implementing innovative computational tools play essential roles in translating the findings based on high-dimensional -omics data into accurate and informative medical decisions. To equip readers with updated analytical approaches, this special issue covers a wide range of analytical approaches and pipelines. J. D. Hintzsche et al. provided a comprehensive review of computational tools to analyze and interpret whole exome sequencing (WES) data, including alignment, variant calling, and annotation approaches developed for “pre-VCF (variant call format)” analyses, as well as major approaches to downstream analysis after the VCF file has been generated: pathway analysis, somatic prediction, copy number estimation, and so forth. Robustness and the ability to replicate findings in independent datasets are also critical in analyzing high-dimensional data. For analyzing transcriptomic data, A. E. Berglund et al. proposed a principal component analysis- (PCA-) based technique to reveal gene expression signatures that are robust in replicated datasets. The method can also identify complex signatures from independent biological components. Beyond traditional data analysis, W. Wei et al. demonstrated that visualization is a key component in translating -omics data into useful information. The study utilized the GPA (genetic analysis incorporating pleiotropy and annotation) and MDS (multidimensional scaling) techniques to illustrate genetic relationships between different human traits/diseases, revealing the underlying shared genetic architecture.

    3. Data Integration

    Data integration is essential for robust modeling of complex or heterogeneous conditions. By integrating gene expression data with prior biological knowledge such as tissue of origin or mutation status, V. P. Kamath et al. illustrated enhanced performance in predicting radiation sensitivity. Their results provide a proof of concept on how accounting for biological heterogeneity can lead to robust modeling of clinical response.

    Hindawi, International Journal of Genomics, Volume 2017, Article ID 5093167, 2 pages, http://dx.doi.org/10.1155/2017/5093167

    D. M. Rotroff and A. A. Motsinger-Reif reviewed current data integration techniques for joint analysis of multiple -omics data and discussed future directions and challenges for applying these integrative approaches in personalized medicine. K. Raja et al. then discussed how researchers can utilize the large volume of data from the literature to develop biological inference for -omics analysis. They provided an in-depth review of text-mining approaches that can be used to synthesize biomedical or clinical information and highlighted the applications of text mining in genomic, proteomic, and transcriptomic studies.

    4. Biological and Clinical Inference

    A. Muthiah et al. proposed a novel inference technique, called Module Anchored Network Inference (MANI), to reveal gene-gene relationships and provide inference on disease mechanism by using time-series gene expression data on adipocyte differentiation. Instead of utilizing all candidate genes from the data, the MANI approach constructs small network modules, which is shown to outperform other in silico network inference techniques. Many human diseases are heterogeneous in nature, which makes accurate diagnosis and monitoring challenging. Using systemic lupus erythematosus as a disease model, E. Zollars et al. applied a genomic technique to develop a robust biomarker signature to better monitor disease activity. The approach utilized gene expression profiles to guide the classification of patients with different disease activities. In addition to genomic information, T. Nishihori and K. Shain reviewed how integrating molecular information can advance the treatment of multiple myeloma in the clinical setting using a risk-adapted strategy.

    5. Conclusion

    This special issue presents and discusses technological and methodological developments in biomedical research leading to advances in biomedicine through analysis and evaluation of -omics data. The research and review articles provide a comprehensive collection of approaches and studies for translating biological information from high-dimensional data to clinical applications. With the explosion of big data, we believe that innovative techniques, rigorous analytical approaches, and pipelines are key to providing robust findings that can advance clinical applications.

    Acknowledgments

    The guest editors would like to thank all the authors who contributed to this special issue; their scientific findings and insights have made this special issue possible.

    Lam C. Tsoi
    Bethany Wolf
    Y. Ann Chen


  • Review Article
    A Review of Recent Advancement in Integrating Omics Data with Literature Mining towards Biomedical Discoveries

    Kalpana Raja,1 Matthew Patrick,1 Yilin Gao,1 Desmond Madu,1 Yuyang Yang,1 and Lam C. Tsoi1,2,3

    1Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
    2Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, USA
    3Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA

    Correspondence should be addressed to Lam C. Tsoi; [email protected]

    Received 7 October 2016; Accepted 9 February 2017; Published 26 February 2017

    Academic Editor: Margarita Hadzopoulou-Cladaras

    Copyright © 2017 Kalpana Raja et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

    In the past decade, the volume of “omics” data generated by the different high-throughput technologies has expanded exponentially. Managing, storing, and analyzing this big data has been a great challenge for researchers, especially when moving towards the goal of generating testable data-driven hypotheses, which has been the promise of high-throughput experimental techniques. Different bioinformatics approaches have been developed to streamline the downstream analyses by providing independent information to interpret the results and provide biological inference. Text mining (also known as literature mining) is one of the commonly used approaches for automated generation of biological knowledge from the huge number of published articles. In this review paper, we discuss the recent advancement in approaches that integrate results from omics data and information generated from text mining approaches to uncover novel biomedical information.

    1. Introduction

    The advances in biotechnology have allowed biomedical research to efficiently answer important biological questions at the different omics scales: genetics, genomics, transcriptomics, epigenomics, proteomics, and metabolomics [1–4]. Omics data can characterize the behaviors of cells, tissues, and organs at the molecular level and allow a comprehensive understanding of the etiology of human diseases. Among the various omics studies, genetic and genomic studies are widely adopted in biomedical research to discover new genes or susceptibility loci associated with different human traits or diseases [5, 6]. Proteomic study is concerned with the structure, function, and modification of proteins expressed in a biological system, specifically the posttranslational modifications such as phosphorylation, methylation, and acetylation, which lead to transcription and translation of the same genome into various types of proteomes [7, 8]. Epigenomic study has attracted great attention in the last 5 years. It characterizes the epigenetic modifications of the genome and aims to understand the regulation of gene expression. Transcriptomic study, in turn, enables the genome-wide assessment of gene expression patterns in cells and tissues by studying the complete set of RNA transcripts [9]. Finally, metabolomic study characterizes the metabolites present in cells, tissues, and body fluids and identifies the fluctuation of these metabolites in various disease conditions [10]. The different types of omics studies accumulate a huge volume of data through high-throughput sequencing experiments and provide insights into the cellular and metabolic processes related to disease diagnosis, treatment, and prevention.

    Hindawi, International Journal of Genomics, Volume 2017, Article ID 6213474, 10 pages, http://dx.doi.org/10.1155/2017/6213474

    According to PubMed, over 36,000 research articles have been published in the past ten years and annotated with at least one of the above “omics” terms (using the following search phrase: “(genomics [MeSH] OR proteomics [MeSH] OR metabolomics [MeSH] OR transcriptomics [MeSH]) AND humans [MeSH]”). The interest in omics studies has not declined, and their applications are evident from the publications in recent years, when compared to just over 10,000 research articles published prior to 2006 under the same search phrase. However, the acquired data raise various significant challenges: (i) the interpretation of high-throughput results; (ii) the translation of biological data to clinical application; (iii) data handling, storage, and sharing issues; and (iv) reproducibility when comparing different experiments [11, 12]. Among these, the last challenge has been a long-lasting issue, most likely due to potential discrepancies in processing and interpreting the high-throughput data or to a “cherry-picking” approach that subjectively focuses on components that are in fact false positives. The traditional strategies to overcome these challenges are to conduct an extensive literature search and seek professional opinions from domain experts to decipher the mechanism and then conduct downstream experiments to verify the findings. However, this has proven to be time consuming and subjective and has not been a common practice when researchers publish their results from high-throughput experiments. On the other hand, automated approaches have gained much interest in recent years to annotate gene functions [13], to identify biomarkers [14], and to explore genetic mutations [15]. Text mining (also known as literature mining) is a technique that has been used to retrieve and process research articles from the PubMed database and can summarize biomedical information present across articles. In molecular biology, text mining is typically used to retrieve relevant documents, prioritize the documents, extract the biomedical concepts (e.g., genes, proteins, cell, tissue, and cell-type), and extract the causal relationships between concepts [16, 17]. Text mining can significantly decrease the time and effort required, compared with traditional labor-intensive approaches.
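    The search phrase quoted above can be submitted to PubMed programmatically. The following is a minimal sketch that only builds the request URL for the NCBI E-utilities ESearch endpoint and does not contact the server. The function name `build_esearch_url` and the 2007 to 2016 date window are assumptions made for illustration; the review does not state the exact dates or tooling used for its counts.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base URL of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Search phrase quoted in the review.
OMICS_QUERY = (
    "(genomics [MeSH] OR proteomics [MeSH] OR metabolomics [MeSH] "
    "OR transcriptomics [MeSH]) AND humans [MeSH]"
)


def build_esearch_url(term, min_year=None, max_year=None):
    """Return an ESearch URL that asks PubMed only for the number of hits."""
    params = {"db": "pubmed", "term": term, "rettype": "count"}
    if min_year is not None and max_year is not None:
        # Restrict by publication date; the window itself is an assumption.
        params.update(
            {"datetype": "pdat", "mindate": str(min_year), "maxdate": str(max_year)}
        )
    return ESEARCH_URL + "?" + urlencode(params)


# Hypothetical ten-year window chosen for illustration only.
url = build_esearch_url(OMICS_QUERY, 2007, 2016)
print(url)
```

    Fetching the printed URL returns an XML document with the hit count; the number will differ from the figures in the text because PubMed keeps growing.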

    In this review, we first discuss the various omics techniques used in healthcare and summarize the recent advances in utilizing text mining approaches to facilitate the interpretation and translation of these omics data. We then focus on biomedical literature mining and clinical text mining and further describe the challenges involved in integrating the knowledge from different resources to enhance biomedical research. Finally, we explain the recent methods to integrate omics and biomedical literature mining data in order to uncover novel biomedical information.

    2. The Study of “Omics”

    Traditionally, “omics” corresponds to the study of four major biomolecules: genes, proteins, transcripts, and metabolites [4]. Since the discovery of DNA [31], there has been much interest in understanding the roles of genes and proteins in cellular functions and transduction. Healthcare is considered to vary from one individual to another based on the individual's genome, proteome, transcriptome, and metabolome. The digital revolution has paved the way for integrating patient omics data with the findings in the literature for the discovery of novel biomarkers and drug targets [32–34]. The study of omics has therefore expanded beyond these four major omics studies, and Table 1 summarizes the various types of omics data applied to biomedical discoveries. The study of omics has introduced the realm of big data to biomedicine [35, 36]. While the first human genome project took more than a decade to complete and cost about $3 billion, an entire genome can now be sequenced and analyzed within hours for ∼$1000. Thus, biomedical projects can now generate information at the petabyte (i.e., 10^15 bytes) scale. Nevertheless, the greatest challenge is large-scale data analysis and its integration with the clinical data available in patient electronic health records (EHR) [37].

    Cloud [38] and parallel computing [39] are currently used in omics research to handle the huge volume of data. Cloud computing is described as a network of computers connected together through the Internet for effective processing. It is available remotely through cloud computing providers (e.g., Microsoft, Google, and Amazon), and researchers have the option to make use of it at an affordable cost. Parallel computing shortens the processing time by running parts of a task simultaneously on the same hardware and Internet setup. The combined approach of using cloud computing and parallel computing together is capable of processing omics data in a feasible time [40, 41]. Other high performance computing platforms include clusters [42], grid computing [43], and graphical processing units [44]. Processing omics data and applying bioinformatics models to the data require expertise to integrate computational, biological, mathematical, and statistical knowledge.

    3. Text Mining

    The PubMed database is a main repository for biomedical literature and contains over 26 million articles. The number of articles being published and indexed by PubMed is increasing exponentially, and therefore text mining has become an attractive (and standard) approach to mining literature data when compared with the traditional labor-intensive strategies. Researchers use the text mining approach to tackle information overload, both in biomedical and in general areas of big data collection, because it automates data retrieval and information extraction from unstructured biomedical texts to reveal novel information [45, 46]. While information extraction examines the relationships between specific kinds of information contained within or between documents, information retrieval focuses on summarizing data from the larger units of documents [47]. Another automated approach to deal with unstructured data is Natural Language Processing (NLP). While text mining concentrates on solving a specific problem in a particular domain, NLP attempts to understand the text as a whole [48]. Recently, text mining and NLP have been used to address different biological questions in omics research [49].

    Table 1: Omics and biomedical applications.

    | Field | Omics | Study topic | Biomedical applications† |
    |---|---|---|---|
    | Genetics/molecular genetics | Genomics | Genes | Gencode, Entrez Gene |
    | Genetics/molecular genetics | Epigenomics | Epigenetic modifications | Gene Expression Omnibus |
    | Genetics/molecular genetics | Exposomics | Disease-causing environmental factors | Comparative Toxicogenomics Database |
    | Genetics/molecular genetics | Exomics | Exons in a genome | ICE, a human splice sites database |
    | Genetics/molecular genetics | ORFeomics | Open Reading Frame (ORF) | - |
    | Genetics/molecular genetics | Phenomics | Phenotypes | Human Phenotype Ontology |
    | Genetics/molecular genetics | Pharmacogenomics | Impact of genes on individual's response to drugs | PharmGKB |
    | Genetics/molecular genetics | Pharmacogenetics | SNPs and their impact on pharmacodynamics and pharmacokinetics | PharmGKB |
    | Genetics/molecular genetics | Toxicogenomics | Genes' response to toxic substances | Comparative Toxicogenomics Database |
    | Molecular biology | Proteomics | Proteins and amino acids | Proteomics Identifications Database (PRIDE) |
    | Molecular biology | Metabolomics | Metabolites | HMDB: Human Metabolome Database |
    | Molecular biology | Transcriptomics | Transcripts (i.e., rRNA, mRNA, tRNA, and microRNA) | Human Transcriptome Map |
    | Molecular biology | Ionomics | Inorganic biomolecules | - |
    | Molecular biology | Kinomics | Protein kinases | KinBase database and KinWeb database |
    | Molecular biology | Metagenomics | Genetic material from multiple organisms | MG-RAST |
    | Molecular biology | Regulomics | Transcription factors and other biomolecules involved in the regulation of gene expression | miRegulome |
    | Molecular biology | Toponomics | Cell and tissue structure | |
    | Medicine | Trialomics | Human interventional trials from clinical trials | - |
    | Medicine | Connectomics | Structural and functional connectivity in brain | - |
    | Medicine | Interactomics | Interferons | CREDO |

    †The list shows example applications.

    3.1. Biomedical Literature Mining. The application of text mining approaches to the biological and biomedical fields began in 1999. Text mining was first applied to the biomedical domain for gene expression profiling [50], as well as the extraction and visualization of protein-protein interactions [51]. It emerged as a hybrid discipline from the edges of three major fields, namely, bioinformatics, information science, and computational linguistics. Biomedical literature mining is concerned with the identification and extraction of biomedical concepts (e.g., genes, proteins, DNA/RNA, cells, and cell types) and their functional relationships [17]. The major tasks include (i) document retrieval and prioritization (gathering and prioritizing the relevant documents); (ii) information extraction (extracting information of interest from the retrieved documents); (iii) knowledge discovery (discovering new biological events or relationships among the biomedical concepts); and (iv) knowledge summarization (summarizing the knowledge available across the documents). A brief description of the biomedical literature mining tasks follows.

    Biomedical Text Mining Tasks

    Document Retrieval. The process of extracting relevant documents from a large collection is called document retrieval or information retrieval [52]. The two basic strategies applied are query-based and document-based retrieval. In query-based retrieval, documents matching the user-specified query are retrieved. In document-based retrieval, a ranked list of documents similar to a document of interest is retrieved.

    Document Prioritization. The retrieved documents are usually prioritized to find the most relevant document. Many biomedical document retrieval systems achieve prioritization based on certain parameters including journal-related metrics (e.g., impact factor, citation count) [53] and MeSH index [54, 55] for biomedical articles. The similarity between the documents is estimated with various similarity measurements (e.g., Jaccard similarity, cosine similarity) [56].

    Information Extraction. This task aims to extract and present the information in a structured format. Concept extraction and relation/event extraction are the two major components of information extraction [57, 58]. While concept extraction automatically identifies the biomedical concepts present in the articles, relation/event extraction is used to predict the relationship or biological event (e.g., phosphorylation) between the concepts [59, 60].

    Knowledge Discovery. It is a nontrivial process to discover novel and potentially useful biological information from the structured text obtained from information extraction. Knowledge discovery uses techniques from a wide range of disciplines such as artificial intelligence, machine learning, pattern recognition, data mining, and statistics [61]. Both information extraction and knowledge discovery find their application in database curation [62, 63] and pathway construction [64, 65].

    Knowledge Summarization. The purpose of knowledge summarization is to generate information for a given topic from one or multiple documents. The approach aims to reduce the source text to its most important key points through content reduction, selection, and/or generalization [66]. Although knowledge summarization helps to manage information overload, the state of the art is still open to research on more sophisticated approaches that increase the likelihood of identifying the information.

    Hypothesis Generation. An important task of text mining is hypothesis generation to predict unknown biomedical facts from biomedical articles. These hypotheses are useful in designing experiments or explaining existing experimental results [67].
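    The similarity measurements named under Document Prioritization, and their use for query-based retrieval, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any system cited in this review: the whitespace tokenizer, the function names, and the three example documents are all invented for the sketch, and production systems typically add stemming, stop-word removal, and term weighting.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def jaccard_similarity(a, b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    vec_a, vec_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a)
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query, documents):
    """Return (score, document) pairs sorted from most to least similar to the query."""
    scored = [(cosine_similarity(query, doc), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical mini-collection, invented for illustration only.
docs = [
    "brca1 mutation in breast cancer",
    "protein kinase signaling in yeast",
    "brca1 and brca2 mutation screening",
]
ranking = rank_documents("brca1 mutation", docs)
for score, doc in ranking:
    print(f"{score:.3f}  {doc}")
```

    Passing a whole document instead of a short query to `rank_documents` gives the document-based variant of retrieval described above.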

    Conventional text mining approaches process PubMed abstracts rather than full-text articles and therefore fail to mine information that is not in the abstracts. Recently, text mining from full-text articles has been gaining more interest [59]. However, it involves many challenges: (1) the availability of full-text articles is limited (4 million full-text articles in PubMed Central versus 26 million abstracts in PubMed); (2) text mining within tables, figures, and equations is complicated; and (3) information is redundant within the articles. An automated text mining system is generally evaluated using a standard corpus (Table 2). However, the availability of standard corpora in the biomedical domain is limited because their generation is expensive, is time consuming, and requires domain experts. In general, when standard corpora are not available, a gold standard is developed within the research group, but it is mostly not made available to other researchers. Text mining systems are commonly evaluated using precision, recall, and f-score. Precision is the fraction of the items returned by the system that are correct (relevance accuracy), recall is the fraction of the correct items that the system returns (retrieval accuracy), and f-score is the harmonic mean of precision and recall [56].
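    The three evaluation measures can be computed directly from a system's output and a gold standard. The sketch below assumes that both are given as sets of items (for example, gene mentions); the function name and the example mentions are hypothetical and are not drawn from any corpus in Table 2.

```python
def precision_recall_f1(predicted, gold):
    """Compare a set of predicted items with a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of the predicted items that are in the gold standard.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of the gold-standard items that were predicted.
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-score: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gene mentions, invented for illustration only.
gold_mentions = {"BRCA1", "TP53", "EGFR", "KRAS"}
system_mentions = {"BRCA1", "TP53", "MYC"}
p, r, f = precision_recall_f1(system_mentions, gold_mentions)
print(f"precision={p:.3f} recall={r:.3f} f-score={f:.3f}")
```

    Here two of the three system mentions are correct and two of the four gold mentions are found, so precision is higher than recall and the f-score lies between the two.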

    3.2. Clinical Text Mining. Electronic health records, discharge summaries, and clinical narratives of patients are rich in information that could be useful for improving healthcare. In addition, the information is also available from the transcription of dictations, direct entry by clinicians/physicians, or speech recognition software. The encoding of structural information from the clinical resources is useful to clinicians and researchers. For example, automated high-throughput clinical applications can be developed to support clinicians’ information needs [68]. However, manual encoding is expensive and limited to primary and secondary diagnoses. Clinical text mining, also known as clinical NLP or Medical Language Processing (or simply MLP), is suggested as a potential technology by the Institute of Medicine for mining clinical resources. The tasks described above in biomedical literature mining are applicable to clinical text mining and include additional subtasks [69]: (i) negation recognition (e.g., “patient denies on developing rashes”), (ii) temporal extraction (e.g., “small bumps noticed last year”), and (iii) patient-event relationship (e.g., “patient mother had arthritis”).
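    Negation recognition, the first of the additional subtasks, is often approached with trigger-word rules. The following is a toy sketch of that idea and not the algorithm of any specific clinical NLP system: the trigger list, the five-token window, and the function name `is_negated` are assumptions made for illustration, and real systems rely on much larger lexicons and on scope rules.

```python
import re

# Hypothetical trigger list, invented for illustration; real lexicons are far larger.
NEGATION_TRIGGERS = ("denies", "no", "not", "without", "negative for")


def is_negated(sentence, concept, window=5):
    """Return True if a negation trigger occurs within `window` tokens before
    the first occurrence of `concept` in `sentence`."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    concept_tokens = re.findall(r"[a-z]+", concept.lower())
    n = len(concept_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == concept_tokens:
            preceding = " ".join(tokens[max(0, i - window):i])
            # Pad with spaces so that triggers match whole words only.
            return any(f" {trigger} " in f" {preceding} " for trigger in NEGATION_TRIGGERS)
    return False  # concept not found in the sentence


print(is_negated("patient denies on developing rashes", "rashes"))
print(is_negated("small bumps noticed last year", "bumps"))
```

    The whole-word padding matters: without it, the trigger "no" would wrongly fire on the word "noticed" in the second example sentence.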

    The modern healthcare relies on big data analytics forintegrating, organizing, and utilizing different pharmacolog-ical or clinical information. A hybrid approach to combinepatient genomic data and electronic health record infor-mation is expanding as the future vision of healthcare.The omics data has become an emerging tool for diagno-sis/clinical investigations of common and rare diseases andhelps in clinical decision making (i.e., selecting the bestpossible treatments for patients). Genome-Wide Association

  • International Journal of Genomics 5

    Table2:Standard

    corporafor

    omicsd

    omain.

    Corpu

    sText

    miningevaluatio

    ntask

    Briefintrodu

    ction

    JNLP

    BA(Jo

    intW

    orksho

    pon

    NLP

    inBiom

    edicinea

    ndIts

    Applications)[18]

    Gene/protein concept extraction | The corpus consists of 2,000 PubMed abstracts as training data and 404 PubMed abstracts as test data.
    BioCreAtIvE 2004 Task 1A dataset [19] | Gene/protein concept extraction | The corpus consists of 15,000 PubMed sentences as training data and 5,000 PubMed sentences as test data.
    BioCreAtIvE 2 Gene Mention (GM) dataset [20] | Gene/protein concept extraction | The corpus consists of 15,000 PubMed sentences as training data and 5,000 PubMed sentences as test data.
    AIMED [21] | Protein-protein interaction | The corpus consists of 225 PubMed abstracts that contain 1,987 sentences with 4,075 protein mentions.
    HPRD50 (Human Protein Reference Database) [22] | Protein-protein interaction | The corpus consists of sentences with protein-protein interaction from 50 PubMed abstracts.
    BioInfer (Bio Information Extraction Resource) [23] | Protein, gene, and RNA relationships | The corpus consists of 1,100 sentences annotated with concept names, relationships, and syntactic dependencies.
    IEPA (Interaction Extraction Performance Assessment) [24] | Protein-protein interaction | The corpus consists of more than 200 PubMed sentences annotated with protein-protein interaction.
    BioCreAtIvE 2.5 Elsevier Corpus [25] | Protein-protein interaction | The corpus consists of 61 PubMed articles as training data and 62 PubMed articles as test data.
    BC4GO Corpus [26] | Gene ontology | The corpus consists of 1,356 distinct GO terms from 200 PubMed articles.
    GREC Corpus [27] | Gene regulation and gene expression events | The corpus consists of 240 PubMed abstracts with annotations on gene regulation and gene expression events.
    GETM [28] | Gene expression events | The corpus consists of 150 PubMed abstracts with annotation for gene expression events.
    AnEM [29] | Tissue, cell, developing anatomical structure, cellular component | The corpus consists of 500 PubMed sentences with annotations on a variety of biomedical concepts.
    CellFinder Corpus [30] | Anatomical parts, cell lines, cell types, species, and cell components | The corpus consists of annotations from 10 full-text PubMed articles.


    Study (GWAS), also known as Whole Genome Association Study (WGAS), is a relatively new approach for identifying genes (i.e., loci associated with human traits) through rapid scanning of markers across the whole genome [70]. GWAS has also been applied to cancer research for drug repositioning [71], to prioritizing susceptibility genes in Crohn’s disease [72], and to analyzing human variants in the area of precision medicine [73]. As an example, the Michigan Genomics Initiative (MGI) at the University of Michigan has developed an institution-based DNA and genetics repository combined with patient phenotypes. The project aims to make each patient/participant aware of disease development and response to treatments for better health and wellness. The current studies at MGI include the analgesics outcome study (AOS), understanding opioid use in chronic pain patients, a pivotal study on high-frequency nerve block for postamputation pain, the Michigan body map (MBM), and positive piggy bag (https://www.michigangenomics.org/).

    Clinical text mining faces the following specific challenges: (1) access to patient EHRs requires permission from an Institutional Review Board (IRB); (2) personal details of the patients must be deidentified; (3) mining approaches depend on the type of clinical document (e.g., EHR, discharge summary, medical billing, and clinical narratives); (4) dosage information, different types of formulations, and temporal information must be mined; and (5) spelling mistakes and grammatical errors are common in clinical text [69]. Both biomedical literature mining and clinical text mining remain open problems with many challenges and require more sophisticated and robust approaches.

    4. Role of Text Mining in Omics Study

    A relationship between concepts of the same kind (e.g., gene-gene) or of different kinds (e.g., gene-disease) is commonly known as an “event” [74]. Events are useful for identifying many clinical facts such as disease onset and response to drug treatment. The overwhelming number of biomedical articles from omics research has accumulated an abundance of information, which requires advanced event extraction systems that can handle the complexity of the available information and cover a variety of biomedical subdomains [16]. Text mining approaches do not replace the manual curation of biomedical information but speed up the process severalfold [75, 76]. In this section we describe the various text mining approaches developed for mining omics-related information.
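    As an illustration of the “event” notion defined above, the following sketch shows one possible way to represent an event as a data structure: a trigger word plus two typed concept mentions. It is not taken from any of the cited systems; the class names, field names, and the example mention are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Concept:
    """A biomedical concept mention (all field names here are illustrative)."""
    text: str  # surface form as it appears in the sentence
    kind: str  # concept type, e.g. "gene", "disease", "drug"


@dataclass(frozen=True)
class Event:
    """A relationship between two concepts, signalled by a trigger word."""
    trigger: str
    participants: Tuple[Concept, Concept]

    @property
    def event_type(self) -> str:
        # Join the participant kinds in order, e.g. "gene-disease".
        return "-".join(c.kind for c in self.participants)

    @property
    def same_kind(self) -> bool:
        # True for events such as gene-gene, False for gene-disease.
        first, second = self.participants
        return first.kind == second.kind


if __name__ == "__main__":
    event = Event(
        "associated",
        (Concept("BRCA1", "gene"), Concept("breast cancer", "disease")),
    )
    print(event.event_type, event.same_kind)
```

    Keeping the concept type on each participant lets the same structure cover both same-kind and different-kind events without separate classes.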

    4.1. Genomics and Text Mining. In the current era of genomics, text mining plays an important role in mining gene-gene interactions [77, 78] and other gene-involved interactions (e.g., gene-chemical, gene-disease) [79, 80] to support integrative analysis of gene expression [81, 82], pathway construction [83, 84], ontology development [85], and database annotation [62, 86, 87].

    Genes encode proteins, and proteins take part in various biological functions by interacting with other proteins. This encoding process proceeds in two steps: transcription (DNA to RNA) and translation (RNA to protein). Many cellular processes are regulated by microRNA through mRNA degradation and suppression of gene expression such that protein synthesis is interrupted. These are the fundamentals of genomics. In genomics, gene function is assessed from the involvement of genes/proteins in biochemical pathways. Functional genomics is a revolutionary area for text mining, in which the gene/protein mentions in biomedical articles and the relationships among them are considered to be important. Furthermore, gene and protein names are highly complex, and text mining has contributed to their recognition in unstructured text [57, 58].

    Different text mining implementations for exploring the findings of genome research have been developed in the past decade. miRTex is a text mining system developed for mining experimentally validated microRNA gene targets from PubMed articles. The system has been successfully used to identify triple-negative breast cancer-related genes that are regulated by microRNAs [81]. More sophisticated approaches integrate gene expression from microarray experiments, biomedical data extracted by text mining, and gene interaction data to predict gene-based drug indications [82]. A similar approach [87] attempts to support manual curation of links between biological databases such as Gene Expression Omnibus (GEO) and the PubMed database. Another approach [88] combines text mining data with microarray data for discovering disease-gene associations by using unsupervised clustering. Gene-drug interaction information extracted by text mining has been used to predict drug-drug interactions [89]. In addition, researchers have attempted to use text mining for annotating genome function with Gene Ontology [90]. Thus, text mining and genomics together uncover much biomedical information that was previously unknown.

    4.2. Proteomics and Text Mining. Protein-protein interaction is important for exploring the mechanisms involved in biological processes and the onset of diseases [91]. IntAct [92], BIND [93], MINT [94], and DIP [95] are the major databases available for protein-protein interaction. These databases are manually curated by domain experts, but a larger portion of the information is still available only in the biomedical literature. Text mining provides a bridge across the gap between manual curation and the information hidden in the literature. The approaches to extracting protein-protein interactions range from simple rule-based systems and cooccurrence systems to more sophisticated NLP methods [60] and machine learning systems [96]. Apart from protein-protein interaction extraction systems, text mining also provides automated approaches for extracting posttranslational modifications of proteins such as protein phosphorylation [59].
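    To make the simplest end of this range concrete, the sketch below shows a cooccurrence approach: two proteins are treated as a candidate interacting pair whenever they are mentioned in the same sentence. It is a minimal illustration under stated assumptions rather than any of the cited systems: the protein lexicon, the naive sentence splitter, and the function names are hypothetical, and the example sentences are written only for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical protein lexicon; a real system would rely on a named entity tagger.
PROTEIN_LEXICON = {"BRCA1", "BARD1", "TP53", "MDM2"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def cooccurring_pairs(text, lexicon=PROTEIN_LEXICON):
    """Count unordered protein pairs that are mentioned in the same sentence."""
    counts = Counter()
    for sentence in split_sentences(text):
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        mentioned = sorted(tokens & lexicon)
        # Every pair of distinct proteins in the sentence is a candidate interaction.
        for pair in combinations(mentioned, 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    abstract = (
        "BRCA1 forms a heterodimer with BARD1. "
        "MDM2 binds TP53 and promotes its degradation. "
        "Loss of BRCA1 was also studied alongside BARD1 and TP53."
    )
    for pair, n in cooccurring_pairs(abstract).most_common():
        print(pair, n)
```

    Cooccurrence of this kind tends to favor recall over precision, since a shared sentence does not guarantee an interaction; this is one motivation for the NLP and machine learning systems mentioned above.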

    4.3. Transcriptomics, Metabolomics, and Text Mining. Text mining approaches for transcriptomics and metabolomics are limited. One major reason is that these two areas of omics are comparatively new when compared to genomics and proteomics. A recent study compares the metagenome characteristics of healthy individuals with those of autism patients to analyze the enzymes involved [97]. The computational approach uses text mining for genomics and metabolomics information extraction. A web-based tool called 3Omics is available for integrating, comparing, analyzing, and visualizing data from transcriptomics, metabolomics, and proteomics [98]. Another tool called Babelomics integrates transcriptomics, proteomics, and genomics data to uncover the underlying functional profiles [99]. Thus, a wide variety of biomedical information hidden within omics data can be extracted and predicted through text mining.

    5. Conclusion

    In this review, we summarized the current state of the art in omics research and the contribution of text mining approaches to uncovering the omics-related biomedical information hidden within published articles. We discussed the core concepts of omics and the challenges involved in storing and analyzing the huge volume of omics data generated from high-throughput experiments. We also highlighted the use of computing techniques such as parallel processing and cloud computing to manage omics data and elaborated on text mining approaches for biomedical literature and clinical text with emphasis on omics. While omics approaches are becoming common practice in basic science and clinical diagnosis, it is important to note that data interpretation and translation remain the bottleneck. Advances in text mining can help resolve the challenges with omics data and further support novel biomedical discoveries.

    Competing Interests

    The authors declare that there is no conflict of interests regarding the publication of this paper.

    Acknowledgments

    The authors acknowledge the support from the Undergraduate Research Opportunity Program (UROP) from the University of Michigan, the Dermatology Foundation, the Arthritis National Research Foundation, and the National Psoriasis Foundation.

    References

    [1] E. Morra, M. Lazzarino, E. P. Alessandrino et al., “Central nervous system (CNS) leukemia: the role of high dose cytarabine (HDAra-C),” Bone Marrow Transplantation, vol. 4, supplement 1, pp. 101–103, 1989.

    [2] D. B. Kell, “The virtual human: towards a global systems biology of multiscale, distributed biochemical network models,” IUBMB Life, vol. 59, no. 11, pp. 689–695, 2007.

    [3] H. V. Westerhoff and B. O. Palsson, “The evolution of molecular biology into systems biology,” Nature Biotechnology, vol. 22, no. 10, pp. 1249–1252, 2004.

    [4] R. P. Horgan and L. C. Kenny, “‘Omic’ technologies: genomics, transcriptomics, proteomics and metabolomics,” The Obstetrician & Gynaecologist, vol. 13, no. 3, pp. 189–195, 2011.

    [5] L. C. Tsoi, S. L. Spain, E. Ellinghaus et al., “Enhanced meta-analysis and replication studies identify five new psoriasis susceptibility loci,” Nature Communications, vol. 6, Article ID 7001, 2015.

    [6] D. Bertrand, K. R. E. Chng, F. G. H. Sherbaf et al., “Patient-specific driver gene prediction and risk assessment through integrated network analysis of cancer omics profiles,” Nucleic Acids Research, vol. 43, no. 7, p. e44, 2015.

    [7] P. James, “Protein identification in the post-genome era: the rapid rise of proteomics,” Quarterly Reviews of Biophysics, vol. 30, no. 4, pp. 279–331, 1997.

    [8] G. A. Khoury, R. C. Baliban, and C. A. Floudas, “Proteome-wide post-translational modification statistics: frequency analysis and curation of the swiss-prot database,” Scientific Reports, vol. 1, article 90, 2011.

    [9] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, “Mapping and quantifying mammalian transcriptomes by RNA-Seq,” Nature Methods, vol. 5, no. 7, pp. 621–628, 2008.

    [10] J. K. Nicholson, J. C. Lindon, and E. Holmes, “‘Metabonomics’: understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data,” Xenobiotica, vol. 29, no. 11, pp. 1181–1189, 1999.

    [11] K. Shoenbill, N. Fost, U. Tachinardi, and E. A. Mendonca, “Genetic data and electronic health records: a discussion of ethical, logistical and technological considerations,” Journal of the American Medical Informatics Association, vol. 21, no. 1, pp. 171–180, 2014.

    [12] G. Poste, “Bring on the biomarkers,” Nature, vol. 469, no. 7329, pp. 156–157, 2011.

    [13] Q. Lu, R. L. Powles, Q. Wang, B. J. He, and H. Zhao, “Integrative tissue-specific functional annotations in the human genome provide novel insights on many complex traits and improve signal prioritization in genome wide association studies,” PLoS Genetics, vol. 12, no. 4, Article ID e1005947, 2016.

    [14] L. Puchades-Carrasco, M. Palomino-Schätzlein, C. Pérez-Rambla, and A. Pineda-Lucena, “Bioinformatics tools for the analysis of NMR metabolomics studies focused on the identification of clinically relevant biomarkers,” Briefings in Bioinformatics, vol. 17, no. 3, pp. 541–552, 2016.

    [15] S. A. Forbes, D. Beare, P. Gunasekaran et al., “COSMIC: exploring the world’s knowledge of somatic mutations in human cancer,” Nucleic Acids Research, vol. 43, pp. D805–D811, 2015.

    [16] S. Ananiadou, P. Thompson, R. Nawaz, J. McNaught, and D. B. Kell, “Event-based text mining for biology and functional genomics,” Briefings in Functional Genomics, vol. 14, no. 3, pp. 213–230, 2015.

    [17] M. Krallinger and A. Valencia, “Text-mining and information-retrieval services for molecular biology,” Genome Biology, vol. 6, no. 7, article no. 224, 2005.

    [18] H. K. Lee, A. K. Hsu, J. Sajdak, J. Qin, and P. Pavlidis, “Coexpression analysis of human genes across many microarray data sets,” Genome Research, vol. 14, no. 6, pp. 1085–1094, 2004.

    [19] A. Yeh, A. Morgan, M. Colosimo, and L. Hirschman, “BioCreAtIvE task 1A: gene mention finding evaluation,” BMC Bioinformatics, vol. 6, no. 1, article no. S2, 2005.

    [20] A. Vlachos, “Tackling the BioCreative2 gene mention task with conditional random fields and syntactic parsing,” in Proceedings of the 2nd BioCreative Challenge Evaluation Workshop, Madrid, Spain, April 2007.

    [21] R. Bunescu, R. Ge, R. J. Kate et al., “Comparative experiments on learning information extractors for proteins and their interactions,” Artificial Intelligence in Medicine, vol. 33, no. 2, pp. 139–155, 2005.


    [22] K. Fundel, R. Küffner, and R. Zimmer, “RelEx—relation extraction using dependency parse trees,” Bioinformatics, vol. 23, no. 3, pp. 365–371, 2007.

    [23] S. Pyysalo, F. Ginter, J. Heimonen et al., “BioInfer: a corpus for information extraction in the biomedical domain,” BMC Bioinformatics, vol. 8, article 50, 2007.

    [24] J. Ding, D. Berleant, D. Nettleton, and E. Wurtele, “Mining MEDLINE: abstracts, sentences, or phrases?” Pacific Symposium on Biocomputing, vol. 7, pp. 326–337, 2002.

    [25] F. Leitner, M. Krallinger, G. Cesareni, and A. Valencia, “The FEBS letters SDA corpus: a collection of protein interaction articles with high quality annotations for the BioCreative II.5 online challenge and the text mining community,” FEBS Letters, vol. 584, no. 19, pp. 4129–4130, 2010.

    [26] K. Van Auken, M. L. Schaeffer, P. McQuilton et al., “BC4GO: a full-text corpus for the BioCreative IV GO task,” Database, vol. 2014, Article ID bau074, 2014.

    [27] P. Thompson, S. A. Iqbal, J. McNaught, and S. Ananiadou, “Construction of an annotated corpus to support biomedical information extraction,” BMC Bioinformatics, vol. 10, article 349, 2009.

    [28] M. Gerner, G. Nenadic, and C. M. Bergman, “An exploration of mining gene expression mentions and their anatomical locations from biomedical text,” in Proceedings of the Workshop on Biomedical Natural Language Processing, pp. 72–80, Association for Computational Linguistics, Uppsala, Sweden, July 2010.

    [29] T. Ohta, S. Pyysalo, J. Tsujii, and S. Ananiadou, “Open-domain anatomical entity mention detection,” in Proceedings of the Workshop on Detecting Structure in Scholarly Discourse, Association for Computational Linguistics, Jeju, Korea, July 2012.

    [30] M. Neves, E. Damaschun, A. Kurtz, and U. Leser, “Annotating and evaluating text for stem cell research,” in Proceedings of the 3rd Workshop on Building and Evaluation Resources for Biomedical Text Mining (BioTxtM ’12) at Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

    [31] L. A. Pray, “Discovery of DNA structure and function: Watson and Crick,” Nature Education, vol. 1, no. 1, article 100, 2008.

    [32] N. T. Issa, S. W. Byers, and S. Dakshanamurthy, “Big data: the next frontier for innovation in therapeutics and healthcare,” Expert Review of Clinical Pharmacology, vol. 7, no. 3, pp. 293–298, 2014.

    [33] S. Jiang, T. E. Hinchliffe, and T. Wu, “Biomarkers of an autoimmune skin disease-psoriasis,” Genomics, Proteomics and Bioinformatics, vol. 13, no. 4, pp. 224–233, 2015.

    [34] A. Tebani, C. Afonso, S. Marret, and S. Bekri, “Omics-based strategies in precision medicine: toward a paradigm shift in inborn errors of metabolism investigations,” International Journal of Molecular Sciences, vol. 17, no. 9, p. 1555, 2016.

    [35] J. M. Rothberg, W. Hinz, T. M. Rearick et al., “An integrated semiconductor device enabling non-optical genome sequencing,” Nature, vol. 475, no. 7356, pp. 348–352, 2011.

    [36] J. Clarke, H.-C. Wu, L. Jayasinghe, A. Patel, S. Reid, and H. Bayley, “Continuous base identification for single-molecule nanopore DNA sequencing,” Nature Nanotechnology, vol. 4, no. 4, pp. 265–270, 2009.

    [37] V. Canuel, B. Rance, P. Avillach, P. Degoulet, and A. Burgun, “Translational research platforms integrating clinical and omics data: a review of publicly available solutions,” Briefings in Bioinformatics, vol. 16, no. 2, pp. 280–290, 2015.

    [38] L. Griebel, H. Prokosch, F. Köpcke et al., “A scoping review of cloud computing in healthcare,” BMC Medical Informatics and Decision Making, vol. 15, article 17, 2015.

    [39] K. Ocaña and D. De Oliveira, “Parallel computing in genomic research: advances and applications,” Advances and Applications in Bioinformatics and Chemistry, vol. 8, pp. 23–35, 2015.

    [40] D. P. Wall, P. Kudtarkar, V. A. Fusaro, R. Pivovarov, P. Patil, and P. J. Tonellato, “Cloud computing for comparative genomics,” BMC Bioinformatics, vol. 11, article no. 259, 2010.

    [41] M. Armbrust, A. Fox, R. Griffith et al., “A view of cloud computing,” Communications of the ACM, vol. 53, no. 4, pp. 50–58, 2010.

    [42] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: cluster computing with working sets,” in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, p. 10, Boston, Mass, USA, June 2010.

    [43] M. Baker, R. Buyya, and D. Laforenza, “Grids and grid technologies for wide-area distributed computing,” Software—Practice & Experience, vol. 32, no. 15, pp. 1437–1466, 2002.

    [44] I. S. Ufimtsev and T. J. Martinez, “Quantum chemistry on graphical processing units. 2. Direct self-consistent-field implementation,” Journal of Chemical Theory and Computation, vol. 5, no. 4, pp. 1004–1015, 2009.

    [45] M. A. Hearst, “Untangling text data mining,” in Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL ’99), pp. 3–10, College Park, Maryland, June 1999.

    [46] K. B. Cohen and L. Hunter, “Natural language processing and systems biology,” in Artificial Intelligence Methods and Tools for Systems Biology, vol. 5 of Computational Biology, pp. 147–173, Springer, Dordrecht, The Netherlands, 2004.

    [47] M. Weeber, H. Klein, A. R. Aronson, J. G. Mork, L. T. de Jong-van den Berg, and R. Vos, “Text-based discovery in biomedicine: the architecture of the DAD-system,” Proceedings of the AMIA Symposium, pp. 903–907, 2000.

    [48] A. S. Yeh, L. Hirschman, and A. A. Morgan, “Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup,” Bioinformatics, vol. 19, supplement 1, pp. i331–i339, 2003.

    [49] Y. Liu, Y. Liang, and D. Wishart, “PolySearch2: a significantly improved text-mining system for discovering associations between human diseases, genes, drugs, metabolites, toxins and more,” Nucleic Acids Research, vol. 43, no. 1, pp. W535–W542, 2015.

    [50] L. Tanabe, U. Scherf, L. H. Smith, J. K. Lee, L. Hunter, and J. N. Weinstein, “MedMiner: an Internet text-mining tool for biomedical information, with application to gene expression profiling,” BioTechniques, vol. 27, no. 6, pp. 1210–1217, 1999.

    [51] C. Blaschke, M. A. Andrade, C. Ouzounis, and A. Valencia, “Automatic extraction of biological information from scientific text: protein-protein interactions,” in Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology, pp. 60–67, AAAI, Heidelberg, Germany, August 1999.

    [52] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology behind Search, ACM Press, 2nd edition, 2011.

    [53] Y. Lin, W. Li, K. Chen, and Y. Liu, “A document clustering and ranking system for exploring MEDLINE citations,” Journal of the American Medical Informatics Association, vol. 14, no. 5, pp. 651–661, 2007.


    [54] S. J. Darmoni, L. F. Soualmia, C. Letord et al., “Improving information retrieval using medical subject headings concepts: a test case on rare and chronic diseases,” Journal of the Medical Library Association, vol. 100, no. 3, pp. 176–183, 2012.

    [55] M. Petrova, P. Sutcliffe, K. W. M. Fulford, and J. Dale, “Search terms and a validated brief search filter to retrieve publications on health-related values in Medline: a word frequency analysis study,” Journal of the American Medical Informatics Association, vol. 19, no. 3, pp. 479–488, 2012.

    [56] C. D. Manning, P. Raghavan, and H. Schuetze, Introduction to Information Retrieval, Cambridge University Press, 2008.

    [57] R. Leaman and G. Gonzalez, “BANNER: an executable survey of advances in biomedical named entity recognition,” in Proceedings of the 13th Pacific Symposium on Biocomputing (PSB ’08), pp. 652–663, Kohala Coast, Hawaii, USA, January 2008.

    [58] K. Raja, S. Subramani, and J. Natarajan, “A hybrid named entity tagger for tagging human proteins/genes,” International Journal of Data Mining and Bioinformatics, vol. 10, no. 3, pp. 315–328, 2014.

    [59] M. Torii, C. N. Arighi, G. Li, Q. Wang, C. H. Wu, and K. Vijay-Shanker, “RLIMS-P 2.0: a generalizable rule-based information extraction system for literature mining of protein phosphorylation information,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 12, no. 1, pp. 17–29, 2015.

    [60] K. Raja, S. Subramani, and J. Natarajan, “PPInterFinder—a mining tool for extracting causal relations on human proteins from literature,” Database (Oxford), vol. 2013, Article ID bas052, 2013.

    [61] J. Natarajan, D. Berrar, C. J. Hack, and W. Dubitzky, “Knowledge discovery in biology and biotechnology texts: a review of techniques, evaluation strategies, and applications,” Critical Reviews in Biotechnology, vol. 25, no. 1-2, pp. 31–52, 2005.

    [62] K. E. Ravikumar, K. B. Wagholikar, D. Li, J.-P. Kocher, and H. Liu, “Text mining facilitates database curation—extraction of mutation-disease associations from Bio-medical literature,” BMC Bioinformatics, vol. 16, no. 1, article 185, 2015.

    [63] S. Matos, D. Campos, R. Pinho et al., “Mining clinical attributes of genomic variants through assisted literature curation in Egas,” Database (Oxford), vol. 2016, Article ID baw096, 2016.

    [64] S. Subramani, R. Kalpana, P. M. Monickaraj, and J. Natarajan, “HPIminer: a text mining system for building and visualizing human protein interaction networks and pathways,” Journal of Biomedical Informatics, vol. 54, pp. 121–131, 2015.

    [65] J. Czarnecki, I. Nobeli, A. M. Smith, and A. J. Shepherd, “A text-mining system for extracting metabolic reactions from full-text articles,” BMC Bioinformatics, vol. 13, no. 1, article 172, 2012.

    [66] R. Mishra, J. Bian, M. Fiszman et al., “Text summarization in the biomedical domain: a systematic review of recent research,” Journal of Biomedical Informatics, vol. 52, pp. 457–467, 2014.

    [67] F. Zhu, P. Patumcharoenpol, C. Zhang et al., “Biomedical text mining and its applications in cancer research,” Journal of Biomedical Informatics, vol. 46, no. 2, pp. 200–211, 2013.

    [68] S. M. Meystre, G. K. Savova, K. C. Kipper-Schuler, and J. F. Hurdle, “Extracting information from textual documents in the electronic health record: a review of recent research,” Yearbook of Medical Informatics, pp. 128–144, 2008.

    [69] K. Raja and S. R. Jonnalagadda, “Natural language processing and data mining for clinical text,” in Healthcare Data Analytics, C. K. Reddy and C. C. Aggarwal, Eds., pp. 219–250, CRC Press, 2015.

    [70] D. Welter, J. MacArthur, J. Morales et al., “The NHGRI GWAS Catalog, a curated resource of SNP-trait associations,” Nucleic Acids Research, vol. 42, no. 1, pp. D1001–D1006, 2014.

    [71] J. Zhang, K. Jiang, L. Lv et al., “Use of genome-wide association studies for cancer research and drug repositioning,” PLoS ONE, vol. 10, no. 3, Article ID e0116477, 2015.

    [72] D. Muraro, D. A. Lauffenburger, and A. Simmons, “Prioritisation and network analysis of Crohn’s disease susceptibility genes,” PLoS ONE, vol. 9, no. 9, Article ID e108624, 2014.

    [73] T. A. Peterson, E. Doughty, and M. G. Kann, “Towards precision medicine: advances in computational approaches for the analysis of human variants,” Journal of Molecular Biology, vol. 425, no. 21, pp. 4047–4063, 2013.

    [74] J.-D. Kim, N. Nguyen, Y. Wang, J. Tsujii, T. Takagi, and A. Yonezawa, “The genia event and protein coreference tasks of the BioNLP shared task 2011,” BMC Bioinformatics, vol. 13, supplement 11, p. S1, 2012.

    [75] T. C. Wiegers, A. P. Davis, K. B. Cohen, L. Hirschman, and C. J. Mattingly, “Text mining and manual curation of chemical-gene-disease networks for the Comparative Toxicogenomics Database (CTD),” BMC Bioinformatics, vol. 10, article 1471, p. 326, 2009.

    [76] L. Hirschman, G. A. P. C. Burns, M. Krallinger et al., “Text mining for the biocuration workflow,” Database, vol. 2012, Article ID bas020, 2012.

    [77] E. K. Mallory, C. Zhang, C. Ré, and R. B. Altman, “Large-scale extraction of gene interactions from full-text literature using DeepDive,” Bioinformatics, vol. 32, no. 1, pp. 106–113, 2015.

    [78] J. Hur, A. Özgür, Z. Xiang, and Y. He, “Development and application of an interaction network ontology for literature mining of vaccine-associated gene-gene interactions,” Journal of Biomedical Semantics, vol. 6, no. 1, article no. 2, 2015.

    [79] A. P. Davis, C. J. Grondin, R. J. Johnson et al., “The comparative toxicogenomics database: update 2017,” Nucleic Acids Research, vol. 45, 2017.

    [80] S. Pletscher-Frankild, A. Pallejà, K. Tsafou, J. X. Binder, and L. J. Jensen, “DISEASES: text mining and data integration of disease-gene associations,” Methods, vol. 74, pp. 83–89, 2015.

    [81] G. Li, K. E. Ross, C. N. Arighi, Y. Peng, C. H. Wu, and K. Vijay-Shanker, “miRTex: a text mining system for miRNA-gene relation extraction,” PLoS Computational Biology, vol. 11, no. 9, Article ID e1004391, 2015.

    [82] A. Qabaja, T. Jarada, A. Elsheikh, and R. Alhajj, “Prediction of gene-based drug indications using compendia of public gene expression data and PubMed abstracts,” Journal of Bioinformatics and Computational Biology, vol. 12, no. 3, Article ID 14500073, 2014.

    [83] E. Donnard, A. Barbosa-Silva, R. L. M. Guedes et al., “Preimplantation development regulatory pathway construction through a text-mining approach,” BMC Genomics, vol. 12, no. 4, article S3, 2011.

    [84] R. Lehmann, L. Childs, P. Thomas et al., “Assembly of a comprehensive regulatory network for the mammalian circadian clock: a bioinformatics approach,” PLoS ONE, vol. 10, no. 5, Article ID e0126283, 2015.

    [85] H. Chen, D. Han, Y. Dai, and L. Zhao, “Design of automatic extraction algorithm of knowledge points for MOOCs,” Computational Intelligence and Neuroscience, vol. 2015, Article ID 123028, 10 pages, 2015.

    [86] R. Weikard, F. Hadlich, and C. Kuehn, “Identification of novel transcripts and noncoding RNAs in bovine skin by deep next generation sequencing,” BMC Genomics, vol. 14, no. 1, article no. 789, 2013.

    [87] A. Neveol, W. J. Wilbur, and Z. Lu, “Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE,” Database, vol. 2012, Article ID bas026, 2012.

    [88] A. Faro, D. Giordano, and C. Spampinato, “Combining literature text mining with microarray data: advances for system biology modeling,” Briefings in Bioinformatics, vol. 13, no. 1, Article ID bbr018, pp. 61–82, 2012.

    [89] B. Percha, Y. Garten, and R. B. Altman, “Discovery and explanation of drug-drug interactions via text mining,” in Proceedings of the 17th Pacific Symposium on Biocomputing (PSB ’12), pp. 410–421, Kohala Coast, Hawaii, USA, January 2012.

    [90] J. M. Daley, H. Niu, A. S. Miller, and P. Sung, “Biochemical mechanism of DSB end resection and its regulation,” DNA Repair, vol. 32, pp. 66–74, 2015.

    [91] M. G. Kann, “Protein interactions and disease: computational approaches to uncover the etiology of diseases,” Briefings in Bioinformatics, vol. 8, no. 5, pp. 333–346, 2007.

    [92] S. Kerrien, Y. Alam-Faruque, B. Aranda et al., “IntAct—open source resource for molecular interaction data,” Nucleic Acids Research, vol. 35, no. 1, pp. D561–D565, 2007.

    [93] G. D. Bader, I. Donaldson, C. Wolting, B. F. F. Ouellette, T. Pawson, and C. W. V. Hogue, “BIND—The Biomolecular Interaction Network Database,” Nucleic Acids Research, vol. 29, no. 1, pp. 242–245, 2001.

    [94] A. Zanzoni, L. Montecchi-Palazzi, M. Quondam, G. Ausiello, M. Helmer-Citterich, and G. Cesareni, “MINT: a molecular INTeraction database,” FEBS Letters, vol. 513, no. 1, pp. 135–140, 2002.

    [95] L. Salwinski, C. S. Miller, A. J. Smith, F. K. Pettit, J. U. Bowie, and D. Eisenberg, “The database of interacting proteins: 2004 update,” Nucleic Acids Research, vol. 32, pp. D449–D451, 2004.

    [96] Q.-C. Bui, S. Katrenko, and P. M. A. Sloot, “A hybrid approach to extract protein-protein interactions,” Bioinformatics, vol. 27, no. 2, pp. 259–265, 2011.

    [97] C. Heberling and P. Dhurjati, “Novel systems modeling methodology in comparative microbial metabolomics: identifying key enzymes and metabolites implicated in autism spectrum disorders,” International Journal of Molecular Sciences, vol. 16, no. 4, pp. 8949–8967, 2015.

    [98] T.-C. Kuo, T.-F. Tian, and Y. J. Tseng, “3Omics: a web-based systems biology tool for analysis, integration and visualization of human transcriptomic, proteomic and metabolomic data,” BMC Systems Biology, vol. 7, article 64, 2013.

    [99] I. Medina, J. Carbonell, L. Pulido et al., “Babelomics: an integrative platform for the analysis of transcriptomics, proteomics and genomic data with advanced functional profiling,” Nucleic Acids Research, vol. 38, no. 2, pp. W210–W213, 2010.

  • Research Article

    Integrating Biological Covariates into Gene Expression-Based Predictors of Radiation Sensitivity

    Vidya P. Kamath,1 Javier F. Torres-Roca,2 and Steven A. Eschrich1

    1Department of Biostatistics & Bioinformatics, H. Lee Moffitt Cancer Center & Research Institute, Tampa, FL, USA

    2Department of Radiation Oncology, H. Lee Moffitt Cancer Center & Research Institute, Tampa, FL, USA

    Correspondence should be addressed to Steven A. Eschrich; [email protected]

    Received 30 September 2016; Revised 4 January 2017; Accepted 11 January 2017; Published 8 February 2017

    Academic Editor: Bethany Wolf

    Copyright © 2017 Vidya P. Kamath et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

    The use of gene expression-based classifiers has resulted in a number of promising potential signatures of patient diagnosis, prognosis, and response to therapy. However, these approaches have also created difficulties in trying to use gene expression alone to predict a complex trait. A practical approach to this problem is to integrate existing biological knowledge with gene expression to build a composite predictor. We studied the problem of predicting radiation sensitivity within human cancer cell lines from gene expression. First, we present evidence for the need to integrate known biological conditions (tissue of origin, RAS, and p53 mutational status) into a gene expression prediction problem involving radiation sensitivity. Next, we demonstrate, using linear regression, a technique for incorporating this knowledge. The resulting correlations between gene expression and radiation sensitivity improved through the use of this technique (best-fit adjusted R² increased from 0.3 to 0.84). Overfitting of data was examined through the use of simulation. The results reinforce the concept that radiation sensitivity is not driven solely by gene expression, but rather by a combination of distinct parameters. We show that accounting for biological heterogeneity significantly improves the ability of the model to identify genes that are associated with radiosensitivity.

    1. Introduction

    One of the goals of developing biomarkers is for use in patient selection, diagnosis, and management of cancer treatment [1–3]. An important aspect in management of cancer treatment is to understand how a patient will respond to a specific treatment such as radiation therapy. Designing the radiation therapy to maximize cancer cell death is beneficial, and predicting such a response of the cells to radiation therapy is important for effective patient management. Genes such as RAS [4, 5] and p53 [6] have been known to influence the response of tumor cells to radiation treatment. For example, RAS has been implicated as a central regulator of radioresistance. Similarly, presence of a mutant p53 gene is used as an indicator for uncontrolled proliferation of cells, while a wild-type p53 gene is known to be a tumor suppressor. In addition, tissue of origin has been associated with radiosensitivity. For example, the SF2 (survival fraction of cells after 2 Gy of radiation) of melanoma and glioma cell lines has been shown to be higher (radioresistant) than lymphoma and myeloma cell lines [7–9].

    Hindawi, International Journal of Genomics, Volume 2017, Article ID 6576840, 9 pages, http://dx.doi.org/10.1155/2017/6576840

    The process of developing the systems-based model of radiosensitivity followed a stepwise strategy. The first step was to develop a radiosensitivity classifier to predict cellular radiosensitivity based on gene expression profiles [10]. We developed a multivariable linear regression model that correlated gene expression to radiosensitivity as determined by SF2, in a 35-cell line database. We used a leave-one-out cross-validation approach, where the classifier was developed using 34 of the 35 cell lines as a training set, leaving one cell line as a test set. The basal gene expression profiles and the radiation sensitivity of all 34 cell lines in the training set were used to identify genes that were correlated with radiosensitivity. This was performed using SAM analysis (Significance Analysis of Microarrays) [11] with a false discovery rate of 5%. Genes selected by SAM were then combined as radiosensitivity predictors during the construction of the classifier. A multivariable linear regression model was created using these probesets to predict the SF2 of the test sample and was shown to achieve a statistically significant ($p = 0.002$) predictive accuracy of 62%, within a continuous classification problem. The classifier predicts an actual SF2 value (range: 0.01–1.0) rather than a binary phenotype (radiosensitive versus radioresistant). Importantly, we biologically validated the model by demonstrating that three of the genes selected by the algorithm (rbap48, rgs-19, and top-1) were mechanistically involved in radiation response. Thus, we demonstrated that cellular radiosensitivity is predictable based on gene expression but, more importantly, we validated this approach as a strategy for the discovery of novel radiosensitivity biomarkers.
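The leave-one-out procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the published pipeline: probesets are ranked by absolute Pearson correlation with SF2 as a simple stand-in for SAM, and the function name `loocv_predict`, the parameter `n_genes`, and the array layout are hypothetical. What it is meant to show is that gene selection is repeated inside every fold, using only the training cell lines.

```python
import numpy as np

def loocv_predict(expr, sf2, n_genes=3):
    """Leave-one-out SF2 predictions (hypothetical helper).

    expr: array of shape (n_cell_lines, n_probesets)
    sf2:  array of shape (n_cell_lines,)
    """
    n = expr.shape[0]
    preds = np.empty(n)
    for held_out in range(n):
        train = np.arange(n) != held_out
        x_tr, y_tr = expr[train], sf2[train]

        # Gene selection on training lines only (correlation ranking is an
        # assumption standing in for SAM).
        xc = x_tr - x_tr.mean(axis=0)
        yc = y_tr - y_tr.mean()
        denom = np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum())
        corr = (xc * yc[:, None]).sum(axis=0) / np.where(denom == 0, np.inf, denom)
        top = np.argsort(-np.abs(corr))[:n_genes]

        # Multivariable least-squares fit on the selected probesets.
        design = np.column_stack([np.ones(train.sum()), x_tr[:, top]])
        coef, *_ = np.linalg.lstsq(design, y_tr, rcond=None)

        # Predict the SF2 of the single held-out cell line.
        preds[held_out] = coef[0] + expr[held_out, top] @ coef[1:]
    return preds
```

Selecting genes inside the loop keeps the held-out cell line from influencing which probesets enter the model.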

    Although we had developed a successful mathematical model correlating gene expression and radiosensitivity, we reasoned the model had a number of problems that, if overcome, would significantly improve its ability to impact the field of radiation biology. First, expansion of the cell line dataset from 35 samples should provide more reliable correlations. Second, there were few genes consistently selected by the classifier. A larger pool of genes would be desirable, as it would allow us to identify the biological networks that regulate cellular radiosensitivity. Third, gene expression was the only variable considered in the model, while there are several biologic factors besides gene expression that are known to influence radiosensitivity. Therefore we focused on strategies aimed at increasing the pool of candidate genes and incorporating biologic variables into the algorithm. One of the advantages of developing the classifier in the NCI-60 is that these cell lines are molecularly well characterized, thus allowing the inclusion of important biological variables into the process. We chose four variables that have been previously correlated to radiation sensitivity: gene expression [10], tissue type [8, 12], RAS mutation status [13–18], and p53 mutation status [19–21]. In addition, we expanded the cell line dataset from 35 to 48 cell lines.

    2. Material and Methods

    2.1. Microarrays. Gene expression profiles were from Affymetrix HU6800 chips (7,129 genes) from a previously published study [22]. These are publicly available as supplemental data to the published study. The gene expression data had been previously preprocessed using the Affymetrix MAS 5.0 algorithm in average difference units. Negative expression values were set to zero and the chips were normalized to the same mean intensity. Specific cell lines used are listed in Supplemental Table 1 in Supplementary Material available online at https://doi.org/10.1155/2017/6576840.

    2.2. Radiation Survival Assays (SF2). The SF2 values of the cell lines used in model development were previously reported [10, 23]. SF2 values are included in Supplemental Table 1.

    2.3. Permutation Analysis. Predictions were randomly permuted among cell lines 10,000 times and accuracies greater than or equal to the threshold were counted to calculate a $p$ value for significance relative to chance.
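A permutation test of this kind might look like the sketch below. Two assumptions are made loudly: the text does not define when a continuous SF2 prediction counts as correct, so the `tolerance` rule is hypothetical, and the threshold is taken to be the accuracy of the unpermuted predictions. The function name and parameters are illustrative only.

```python
import random

def permutation_p_value(predicted, observed, tolerance=0.10, n_perm=10_000, seed=0):
    """Return (observed accuracy, permutation p value)."""

    def accuracy(pred):
        # Hypothetical correctness rule: within `tolerance` of the measured SF2.
        hits = sum(abs(p - o) <= tolerance for p, o in zip(pred, observed))
        return hits / len(observed)

    threshold = accuracy(predicted)      # accuracy of the real assignment
    rng = random.Random(seed)
    shuffled = list(predicted)
    at_least_as_good = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)            # reassign predictions to cell lines
        if accuracy(shuffled) >= threshold:
            at_least_as_good += 1
    return threshold, at_least_as_good / n_perm
```

The returned fraction estimates how often a random assignment of predictions to cell lines does at least as well as the real one.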

    2.4. Gene Expression Model. Gene expression and radiation sensitivity were described through a linear relationship as described in (1). In this equation, $\mathrm{SF2}_n$ represents the radiation sensitivity (as measured by SF2) for cell line $n$ in the dataset. $k_i$ represents a model coefficient, computed during the training process, and $y_{ni}$ represents the gene expression value for the $i$th probeset for cell line $n$. The least-squares fit of the individual linear models was compared when selecting probesets of interest for modeling radiosensitivity.

    Gene Expression-Only Model

    $$\mathrm{SF2}_n = k_0 + k_1 (y_{ni}). \tag{1}$$
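For a single probeset, the fit in (1) reduces to ordinary simple linear regression. The sketch below computes $k_0$, $k_1$, and the resulting $R^2$ in plain Python; the function name `fit_expression_only` is hypothetical and the closed-form least-squares solution is a standard choice, not necessarily the authors' implementation.

```python
def fit_expression_only(y, sf2):
    """Least-squares fit of SF2_n = k0 + k1 * y_ni for one probeset.

    y:   expression values of the probeset across cell lines
    sf2: measured SF2 values for the same cell lines
    Returns (k0, k1, r_squared).
    """
    n = len(sf2)
    mean_y = sum(y) / n
    mean_s = sum(sf2) / n
    sxx = sum((v - mean_y) ** 2 for v in y)
    sxy = sum((v - mean_y) * (s - mean_s) for v, s in zip(y, sf2))
    k1 = sxy / sxx
    k0 = mean_s - k1 * mean_y
    sse = sum((s - (k0 + k1 * v)) ** 2 for v, s in zip(y, sf2))
    sst = sum((s - mean_s) ** 2 for s in sf2)
    return k0, k1, 1.0 - sse / sst
```

Repeating this for every probeset and sorting by the returned fit gives one way to compare the individual models.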

    2.5. Inclusion of Biological Covariates in Model Development. We hypothesized that incorporating biological covariates into the gene selection process would improve the ability of the algorithm to identify radiosensitivity biomarkers. To integrate biological covariates into model development we constructed individual gene-based models using two different equations to relate gene expression and the biological parameters to radiosensitivity (SF2). Specific biological parameters are tissue of origin (TO), RAS mutation status (RAS), and p53 mutation status (p53).

    Additive Model

    $$\mathrm{SF2}_n = k_0 + k_1 (y_{ni}) + k_2 (\mathrm{TO}_n) + k_3 (\mathrm{RAS}_n) + k_4 (\mathrm{p53}_n). \tag{2}$$

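    Interactive Model

    $$
    \begin{aligned}
    \mathrm{SF2}_n = {} & k_0 + k_1 (y_{ni}) + k_2 (\mathrm{TO}_n) + k_3 (\mathrm{RAS}_n) + k_4 (\mathrm{p53}_n) \\
    & + k_5 (y_{ni})(\mathrm{TO}_n) + k_6 (y_{ni})(\mathrm{RAS}_n) + k_7 (\mathrm{TO}_n)(\mathrm{RAS}_n) \\
    & + k_8 (y_{ni})(\mathrm{p53}_n) + k_9 (\mathrm{TO}_n)(\mathrm{p53}_n) + k_{10} (\mathrm{RAS}_n)(\mathrm{p53}_n) \\
    & + k_{11} (y_{ni})(\mathrm{TO}_n)(\mathrm{RAS}_n) + k_{12} (y_{ni})(\mathrm{RAS}_n)(\mathrm{p53}_n) \\
    & + k_{13} (\mathrm{TO}_n)(\mathrm{RAS}_n)(\mathrm{p53}_n) + k_{14} (y_{ni})(\mathrm{TO}_n)(\mathrm{RAS}_n)(\mathrm{p53}_n) \cdots .
    \end{aligned}
    \tag{3}
    $$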

    Table 1: Terms used in linear modeling. The term ($y$) represents gene expression. The operator × represents an interaction term between two or more variables.

    Terms                  Terms
    Intercept              $y$ × tissueTypeBREAST
    $y$ (gene expression)  $y$ × tissueTypeCNS
    tissueTypeBREAST       $y$ × tissueTypeCOLON
    tissueTypeCNS          $y$ × tissueTypeLEUK
    tissueTypeCOLON        $y$ × tissueTypeMELAN
    tissueTypeLEUK         $y$ × tissueTypeNSCLC
    tissueTypeMELAN        $y$ × tissueTypeOVAR
    tissueTypeNSCLC        $y$ × tissueTypePROSTATE
    tissueTypeOVAR         $y$ × RASmut
    tissueTypePROSTATE     $y$ × p53mut
    RASmut                 tissueTypeBREAST × RASmut
    p53mut                 tissueTypeCOLON × RASmut
                           tissueTypeMELAN × RASmut
                           tissueTypeNSCLC × RASmut
                           tissueTypeOVAR × RASmut
                           $y$ × tissueTypeBREAST × RASmut
                           $y$ × tissueTypeCOLON × RASmut

    In (2) and (3), the cell line radiosensitivity ($\mathrm{SF2}_n$) was modeled as a function of gene expression ($y$) and biological variables (TO, RAS, and p53). Specifically, $\mathrm{SF2}_n$ represents the radiosensitivity of cell line $n$ and $y_{ni}$ represents the gene expression value of an individual probeset ($i$) for the $n$th cell line in the dataset. A total of 9 different TO values were present in the 48 cell line database. $\mathrm{RAS}_n$ and $\mathrm{p53}_n$ were binary variables (wild-type/mutated) for the $n$th cell line. Thus, the additive model considered a total of 13 terms (an intercept, gene expression, 9 TO, RAS, and p53). The more complex interactive model initially considered all possible terms and 2-, 3-, and 4-way interactions among these terms. Without accounting for linearly dependent terms, there are 180 terms total, far more than the number of observations (48). These include an intercept, 14 terms involving a single variable (gene expression, 9 TO, 2 p53, and 2 RAS), 53 paired terms, 76 triples, and 36 terms with four variables interacting.
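The term counts quoted here can be checked with a short calculation. The sketch assumes, as the breakdown in the text implies, that gene expression contributes one level, TO nine, and p53 and RAS two each, and that a k-way interaction contributes one term per combination of levels; the variable names are illustrative.

```python
from itertools import combinations
from math import prod

# Number of levels each variable contributes, as listed in the text.
levels = {"gene expression": 1, "TO": 9, "p53": 2, "RAS": 2}

def count_terms(order):
    """Count terms that involve exactly `order` distinct variables."""
    return sum(
        prod(levels[name] for name in combo)
        for combo in combinations(levels, order)
    )

counts = {order: count_terms(order) for order in range(1, 5)}
total = 1 + sum(counts.values())  # the leading 1 is the intercept

print(counts)  # {1: 14, 2: 53, 3: 76, 4: 36}
print(total)   # 180
```

For example, the 53 paired terms are 9 + 2 + 2 for gene expression with each covariate, plus 18 + 18 + 4 for the covariate pairs.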

    While the equations represent models with a very large number of variables, the number of nonsingular terms was far smaller due to the small sample size. Additionally, linearly dependent variables (typically interactions with no examples present) are dropped from the model. Interactions of larger numbers of variables were dropped in favor of fewer in the case of linearly dependent variables. Thus there are only 29 terms in the linear model (an intercept, gene expression, 9 TO, p53, RAS, 15 two-way interactions, and 2 three-way interactions) (Table 1). A gene-based linear model was constructed for each gene (7168 probesets), correlating expression and biological parameters with the measured SF2 using a least-squares fit. We compared the sum squared error of the gene expression-based linear models to the null model, consisting of biological parameters and no expression (SSE = 1.2).
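One way to drop linearly dependent terms and compare a fitted model against the null model is sketched below. It is an assumption-laden illustration: the rank-based column filter, the helper names, and the eight-cell-line toy data are all made up for the example, and only RAS and p53 are included to keep it small. Listing lower-order terms first means that, when columns are dependent, the higher-order interaction is the one discarded.

```python
import numpy as np

def drop_dependent_columns(columns):
    """Keep each (name, values) column only if it raises the matrix rank."""
    names, kept = [], []
    for name, values in columns:
        candidate = np.column_stack(kept + [values])
        if np.linalg.matrix_rank(candidate) == len(kept) + 1:
            names.append(name)
            kept.append(values)
    return names, np.column_stack(kept)

def sum_squared_error(design, sf2):
    """SSE of the least-squares fit of sf2 on the design matrix."""
    coef, *_ = np.linalg.lstsq(design, sf2, rcond=None)
    residual = sf2 - design @ coef
    return float(residual @ residual)

# Toy data (made up): 8 cell lines, none carrying both mutations.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
ras = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0])
p53 = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0])
sf2 = np.array([0.30, 0.35, 0.60, 0.55, 0.45, 0.50, 0.20, 0.70])
ones = np.ones_like(y)

full_terms = [
    ("Intercept", ones), ("y", y), ("RASmut", ras), ("p53mut", p53),
    ("y x RASmut", y * ras), ("y x p53mut", y * p53),
    ("RASmut x p53mut", ras * p53), ("y x RASmut x p53mut", y * ras * p53),
]
names, full_design = drop_dependent_columns(full_terms)
_, null_design = drop_dependent_columns(
    [("Intercept", ones), ("RASmut", ras), ("p53mut", p53)]
)
sse_full = sum_squared_error(full_design, sf2)
sse_null = sum_squared_error(null_design, sf2)
print(names)
print(sse_null, sse_full)
```

In the toy data no cell line carries both mutations, so the two interaction columns involving RAS and p53 together are all zero and are removed, mirroring the "interactions with no examples present" case.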

    2.6. Random Variables. Random variables for exploring the effect of RAS and p53 mutation status were created and uniformly distributed into two states (one each for the mutated and wild-type status). The frequencies of these states were similar to the true distributions in the data. Similarly, a random variable was defined for TO, with each sample being assigned a tissue type at random. This new dataset with randomly assigned biological parameters was used to test whether the improvement in linear fit achieved by both the additive and interactive model was due to the integration of biological variables or due to chance.
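Randomly assigned covariates of this kind could be generated as in the sketch below. The sampling scheme is an assumption: mutation status is drawn with a probability equal to the observed mutant fraction (15/48 for RAS and 31/48 for p53, from the counts reported in this article), and tissue type is drawn uniformly from the nine tissue labels. The function name and dictionary layout are hypothetical.

```python
import random

TISSUES = ["BREAST", "CNS", "COLON", "LEUK", "MELAN",
           "NSCLC", "OVAR", "PROSTATE", "RENAL"]

def random_covariates(n_lines, ras_mut_frac, p53_mut_frac, seed=0):
    """Draw random TO, RAS, and p53 labels for each cell line."""
    rng = random.Random(seed)
    return [
        {
            "TO": rng.choice(TISSUES),
            "RAS": "mut" if rng.random() < ras_mut_frac else "wt",
            "p53": "mut" if rng.random() < p53_mut_frac else "wt",
        }
        for _ in range(n_lines)
    ]

rows = random_covariates(48, ras_mut_frac=15 / 48, p53_mut_frac=31 / 48)
```

Refitting the additive and interactive models with these labels in place of the real ones gives the chance baseline that the real covariates are compared against.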

    3. Results

    3.1. Expansion of Cell Line Dataset Lowers Classification Accuracy. As described above we previously developed a gene expression radiosensitivity classifier [10] as a continuous prediction rather than a binary classification problem (i.e., radiosensitive versus radioresistant). During development of the model we had observed that increasing the number of samples increased the classifier accuracy (data not shown). Thus we hypothesized that increasing the cell line dataset to 48 cell lines would result in a more accurate model. Surprisingly, the classifier technique was not as accurate when the cell line population was increased to 48 (compared to 35) cell lines. The best linear regression-based classifier using the 48 cell lines correctly classified 26/48 samples (54%) (Figure 1(a)) compared to 25/35 (71%) for the best classifier in the 35-cell line dataset. We explored the use of alternate normalization (Figure 1(b)); however, the maximum accuracy was 28/48 or 58%. Additionally, we looked at alternate predictors (Figure 1(c)) but the decreased accuracy in the 48-cell line dataset was consistent. Although the results were still statistically significant in that the classifier in the 48 cell line dataset performed better than chance ($p = 0.0094$), we were interested in understanding the reason for the decreased accuracy.

    3.2. Understanding the Influence of Confounding Factors. The decrease in classification accuracy suggested that the linear regression model based only on gene expression data did not fully represent the classification problem. We hypothesized that accounting for the biological diversity of cell lines in the database would be of importance. Several biological variables available for the NCI-60 cell lines include tissue of origin (TO), RAS mutational status (wt/mut) (RAS), and p53 mutational status (p53). These variables have been implicated in the biological regulation of radiation sensitivity [13, 24]. Among the 48 cell lines, the RAS-mutated cell lines represent only 31% (15/48) of cell lines whereas they represented 40% (14/35) in the 35-cell line database (Figure 2(a)). The p53 mutation status was also different between the two groups; 26 cell lines were p53 mutants in the 35 cell lines; however, only 5 additional mutants were added, changing the proportions from 74% down to 65% of the cell line population (Figure 2(b)). Tissue of origin was similar in proportions in the two groups (Figure 2(c)). Since only one additional RAS-mutated cell line was added when increasing the dataset to 48, we first focused on determining if RAS mutation status impacted the gene selection process.
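The shift in mutation frequencies follows directly from the counts given above. The short calculation below uses only those counts (14 and 15 RAS-mutated lines; 26 p53 mutants plus 5 added); the dictionary layout is illustrative.

```python
# (mutant count, total cell lines) for each dataset, taken from the text.
mutants = {
    "RAS": {"35": (14, 35), "48": (15, 48)},
    "p53": {"35": (26, 35), "48": (26 + 5, 48)},
}

percent = {
    gene: {size: round(100 * k / n) for size, (k, n) in by_size.items()}
    for gene, by_size in mutants.items()
}
print(percent)  # {'RAS': {'35': 40, '48': 31}, 'p53': {'35': 74, '48': 65}}
```

Adding 13 cell lines with only one new RAS mutant and five new p53 mutants lowers both mutant fractions by roughly nine percentage points.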

    The oncogenic protein RAS has been proposed to mediate a central mechanism in radiation resistance [16]. We tested whether the presence of a RAS mutation, which usually affords a chronically active RAS protein, was an important source of variability within the dataset. This was done by determining whether the genes selected by the 35 cell line classifier were dependent or independent of RAS status. We stratified the original 35 cell lines by RAS status and performed the gene selection step (correlation of gene expression and SF2) in each group of cell lines. The three genes (rbap48, rgs-19, and r5pia) selected by the original classifier (without

    [Figure 1 image not reproduced. Panels: (a) "Radiation sensitivity prediction versus number of features" (linear regression, not colinear; linear regression, colinear); (b) "Linear regression" by normalization method (MAS5.0, MAS4.0, RMA, RMA unlogged); (c) "MAS5 normalized versus number of features" (linear regression, not colinear; linear regression, colinear; leastmedsq). Axes: number of features or genes (10 to 100) versus accuracy (%).]

    Figure 1: Investigation of building predictors for radiation sensitivity in 48 cell lines. (a) Classification accuracy of radiation sensitivity predictor built from 48 cell lines, using different numbers of features in the regression model. (b) Classification accuracy of radiation sensitivity predictor built from 48 cell lines, using different types of normalization. MAS5.0 and MAS4.0 algorithms generated the most accurate predictors. (c) Classification accuracy of radiation sensitivity predictor built from 48 cell lines, using different types of classification algorithms, including linear regression, least median, and SMO.

    Table 2: Ranking of previously validated radiosensitivity genes when considering all cell lines ($n = 35$), RAS-mutated cell lines only ($n = 14$), and RAS wt cell lines only ($n = 21$). Significant differences in ranking occur when considering the biological variable of RAS mutation status.

    Gene      Overall ranking   RAS-mutated cell lines   RAS wt cell lines
    rbap48    5                 19                       743
    rgs-19    1                 46                       758
    r5pia     9                 262                      397

    RAS stratification) were previously shown to be highly useful in predicting radiosensitivity. These genes were highly ranked among the RAS-mutated cell lines but not in the wild-type lines, suggesting that the RAS-mutated cell lines were driving the classification process. RbAp48, rgs-19, and r5pia were ranked 19th, 46th, and 262nd out of 7,129 probesets by $R^2$ values from the RAS-mutated cell lines. In wild-type cell lines, these same genes are ranked 743rd, 758th, and 397th, respectively. Interestingly, these three genes ranked in the top 10 genes when all cell lines were considered together (5th, 1st, and 9th) (Table 2). These results suggest that the biological diversity of cell lines studied (e.g., RAS-mutated and RAS wt) can significantly impact the evaluation of genes with respect to outcomes. In particular, two diverse biological types mixed in different proportions can lead to highly variable ranking as demonstrated by our 35-cell line experiment.
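The stratified ranking described above can be illustrated with a small sketch. It ranks probesets by the squared Pearson correlation between expression and SF2 within a chosen subset of cell lines. The helper names and the six-cell-line toy data are made up for the example and are not values from the study.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)

def rank_probesets(expr, sf2, keep):
    """Rank probesets (1 = best) by R^2 using only cell lines where keep is True."""
    target = [s for s, k in zip(sf2, keep) if k]
    scores = {
        name: r_squared([v for v, k in zip(values, keep) if k], target)
        for name, values in expr.items()
    }
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}

# Toy data (made up): probeA tracks SF2 only in the RAS-mutated lines.
expr = {"probeA": [1, 2, 3, 2, 1, 2], "probeB": [3, 1, 2, 1, 2, 3]}
sf2 = [0.2, 0.4, 0.6, 0.3, 0.5, 0.7]
ras_mut = [True, True, True, False, False, False]

ranks_mut = rank_probesets(expr, sf2, ras_mut)
ranks_wt = rank_probesets(expr, sf2, [not m for m in ras_mut])
print(ranks_mut, ranks_wt)
```

In the toy data the probeset that ranks first among the mutated lines ranks last among the wild-type lines, which is the pattern Table 2 reports for the three validated genes.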

    [Figure 2 image not reproduced. Panels: (a) "Differences in RAS mutation status" (RAS wt versus RAS mut; percent of cell lines in the 35- and 48-cell line sets); (b) "Differences in p53 mutation status" (p53 wt versus p53 mut; percent of cell lines); (c) "Differences in tissue of origin" (Colon, Ovarian, Leukemia, Melanoma, NSCLC, CNS, Prostate, Renal, Breast; percent of cell lines); (d) Venn diagram for the 35 cell lines (RAS wild-type versus RAS mutant).]

    Figure 2: Biological characteristics differ when considering 35 cell lines versus an expanded set of 48 cell lines. (a) The proportion of RAS wild-type cell lines increased (60% to 69%). (b) The proportion of p53 wild-type cell lines increased (26% to 35%). (c) Tissue of origin of cell lines did not change significantly. (d) Venn diagram showing the lack of concordance in correlation when using a test for correlation ($p < 0.05$) using only RAS mutant or RAS wt cell lines in the 35-cell line set. Only 16 probesets were found correlated in both sets.

    3.3. Integrating Biological Covariates. As a result of the analysis of confounding factors, three variables (TO, RAS, and p53) were integrated in the gene expression analysis using two approaches: an additive model and an interaction-based linear model. The gene selection process was repeated using these approaches on the 48 cell lines. RAS and p53 status indicators were binary variables that indicate wild-type (wt) or mutational (mut) status of the gene for a cell line. The indicator for tissue of origin (TO) has 9 levels, one for each type of tissue from which the tumor cell line originated [22]. The analysis was performed for each probeset and the model fit parameter adjusted-$R^2$ (Adj-$R^2$) was used to determine if the model improved by inclusion of the covariates. The adjusted-$R^2$ was used instead of $R^2$ in these experiments to adjust for addition of regressors in the equations.
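The penalty that the adjusted statistic applies can be made concrete with the standard textbook formula, Adj-$R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1)$, where $n$ is the number of observations and $p$ the number of regressors excluding the intercept. The article does not state which software computed the value, so the function below is a sketch of the usual definition rather than the authors' code.

```python
def adjusted_r2(r2, n_obs, n_regressors):
    """Standard adjusted R^2; n_regressors excludes the intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)

# With 48 cell lines, the same raw fit is penalised more as regressors are added.
print(adjusted_r2(0.90, 48, 1))   # one regressor (gene expression only)
print(adjusted_r2(0.90, 48, 28))  # 28 regressors plus an intercept (29 terms)
```

With 48 observations and 28 regressors, a raw $R^2$ of 0.90 corresponds to an adjusted value of about 0.75, which is why the adjusted statistic is the fairer basis for comparing models of different size.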

    Figure 3 shows a box plot summarizing the Adj-$R^2$ values from all probeset models individually when correlated with radiation response (SF2) in the 48-cell line database. In the gene expression-only model, fewer probesets had a model fit better than 0.2 (

    Table 3: Change in Adj-$R^2$ value obtained by adding terms and complexity to the linear model. Results obtained with clinical indicators TO, RAS, and p53 are compared to Adj-$R^2$ values obtained using random variable for each indicator.

    Model terms              Model comparison              Mean $\Delta R^2$, clinical indicators   Mean $\Delta R^2$, random variables
    GeneEx : TO              GenEx only versus additive    0.254                                    0.256
                             Additive versus interaction   0.134                                    0.146
    GeneEx : RAS             GenEx only versus additive    0.060                                    0.004
                             Additive versus interaction   0.030                                    0.031
    GeneEx : p53             GenEx only versus additive    0.026                                    0.0007
                             Additive versus interaction   0.016                                    0.031
    GeneEx : TO : RAS        Basic versus additive         0.256                                    0.257
                             Additive versus interaction   0.272                                    −0.213
    GeneEx : TO : p53        Basic versus additive         0.262                                    0.257
                             Additive versus interaction   0.198                                    −0.211
    GeneEx : RAS : p53       Basic versus additive         0.062                                    0.022
                             Additive versus interaction   0.042                                    0.024
    GeneEx : TO : RAS : p53  Basic versus additive         0.265                                    0.258
                             Additive versus interaction   0.317                                    −0.103

    [Figure 3 image not reproduced. Box plot titled "Adjusted $R^2$ values for linear models with biological variables": Adj-$R^2$ (0.0 to 0.8) for gene expression-only models, additive models, and interactive models.]

    Figure 3: Adj-$R^2$ values for linear equations fitting SF2 on 48 cell lines. Adj-$R^2$ values increase systematically as more covariates are included in the linear model.

    obtained using variables with randomly generated values. Random variables that do not have any meaningful information and are uncorrelated to the outcome are expected to produce models with lower Adj-$R^2$ values.

    Table 3 shows the change in the model fit ($\Delta R^2$) when terms are added to a linear model. Both the change in fit from biological indicators and randomly generated variables are recorded. For each biological covariate (RAS, p53, and TO), inclusion of the variable in an additive model does not improve the model fit more than including randomly generated variables. The inclusion of TO in the additive model provides no more information than would be expected by chance (average change in $R^2$: TO 0.254, random 0.256). Even with the addition of multiple terms, the additive model improves no better than by chance. When gene expression, TO, and RAS are combined in the additive model, the correlation of the model improves by 0.256. However, the same improvement is observed when the random variable is added ($\Delta R^2 = 0.257$).

    The difference between including biological variables and random variables in the interaction-based models is more significant. For example, the change in $R^2$ for the additive model using RAS, TO, and gene expression was similar to that of random variables; however, in the interaction model, the correlation improves by 0.272 whereas the interaction of random variables (for TO and RAS) drops by 0.213 ($\Delta R^2 = -0.213$). When including all three terms in the interaction models, the Adj-$R^2$ improves by 0.317 but the random variables cause a drop in correlation ($\Delta R^2 = -0.103$).

    Figure 4 summarizes the trend that considering two or more biological variables results in better linear models than expected from randomly generated variables. The interaction of random variables with gene expression data alone provides a marginal improvement in the fit; however, when two or more random variables interact, the lack of information in each variable translates into poorer fit of the linear model to the radiation sensitivity outcome. In contrast, the interaction of the biological variables adds more information to the linear model, as shown by the improvement in Adj-$R^2$ values in Table 3 and Figure 4.

    4. Discussion

    The central aim of our research efforts is the development of a systems biology-based understanding of the biological

    TO RAS