
Mapping disease risk to individual genomes

Safyre Anderson | James Gray | Gunnar Kleemann

Berkeley MIDS W205 Summer 2015 Final Project

August 17, 2015


1.0 Introduction

1.1 Problem Overview

Diseases caused by random irregularities in the genome (either congenital or acquired) are inherently unique from case to case. This makes it difficult for medical practitioners to prescribe targeted treatments for a particular patient. Increasingly, researchers are applying powerful statistical tools that derive insights from a patient's genome and that, combined with medical evidence, can empower doctors to prescribe more effective treatments. However, there is a large gap in understanding between the experts who interpret genetic data and the patients who seek to benefit from their genomic data. Meanwhile, a large corpus of high-quality genetic data is freely available on the web. The primary intent of this project was to design an architecture and implement a solution that can read a patient's genome data and present disease information in an accessible format that encourages exploration.

1.2 Related Work

There are a number of services that attempt to provide context and insight into genetic mutation data. GeneCards provides rich gene-level reports, Malacards provides disease-specific data, and report services like OMIM and GeneReviews combine both disease and mutation data in a single report. Tools like Solvebio and Knome compile genomic data from multiple sources and attempt a general synthesis. However, these tools are largely targeted at experts. One company, 23andMe, has attempted to produce a similar analysis targeted at consumers. Notably, 23andMe has not been universally successful in providing this service, since it was ordered by the FDA to stop (Brandom 2013, FDA 2013; but see FDA 2015, Herper 2015).

1.3 Contributions

The intent of our system is to provide a compelling user experience for ordinary people who would like to learn more about their genomes. As mentioned above, genome profiling services such as 23andMe offer something similar, but we propose to present dynamically updated content and interactive visualizations. Our main intended contribution is a focus on the usability of the interface and on making that interface a compelling educational tool. While the user experience is not in its final form, we have assembled the computational infrastructure, including a set of data storage/retrieval strategies, that will enable us to layer an evolving user experience on top of it. Our approach is similar to others applying data science techniques to genomics in that it leverages available tools and infrastructure, including Amazon Web Services, Galaxy and Python.

1.4 Functional Overview

The main functional and data analysis tasks are to:

1. identify mutations in a user's genome (step 1)
2. combine those variations with pertinent clinical data about them (step 2)
3. present the user with a visualization of these variations by risk level, with references to medical resources for detailed facts (step 3)

Figure 1 - Functional Data Processing Pipeline

Step 1 - Genome processing: The workflow takes in the user's genetic sequencing data and uses alignment algorithms and cloud computing to find out how the user's genome differs from the reference human genome.

Step 2 - Dangerous mutation identification: Once the differences (mutations) are identified, the workflow culls the list to mutations that are already characterized. The result is a set of mutations that can be further analyzed with existing data.

Step 3 - Mutant report: After the mutation list is filtered to known mutations, data about each mutation are collected and stored as part of the cleaned mutation file. These mutations are presented in an overview and, when requested, a mutation report can be browsed and used as a starting point to explore the mutation further.
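The outline below maps these three steps to Python function stubs; the function names are illustrative placeholders rather than the project's actual code.

    # High-level outline of the three-step pipeline (names are illustrative placeholders).

    def identify_mutations(reads_fastq):
        """Step 1: align raw reads to the reference genome and call variants (VCF output)."""
        # In the project this stage runs bowtie2 + FreeBayes inside Galaxy (Section 4.1).

    def filter_known_mutations(vcf_path, clinvar_path, tumor_corpus_path):
        """Step 2: keep only variants that appear in the ClinVar / tumor mutation corpora."""
        # Implemented with pandas in FilterMutations.py (Section 4.2).

    def build_mutant_report(filtered_csv_path):
        """Step 3: export a browsable overview plus per-mutation report links (Section 4.3)."""
        # The report links out to dbSNP, GeneCards and NIH gene pages.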


2.0 Background Material

2.1 Data Storage and Retrieval Model

Addressing our problem statement requires acquisition and computation of large-scale genome data sets across an end-to-end pipeline. We therefore selected a cloud-based execution environment (Amazon Web Services) that would enable the genome alignment and variant detection process given the large file sizes (10-20 GB) and the ability to scale resources as needed.

Figure 2 - Technical Architecture

The data retrieval model acquires data from three primary sources:

1. Amazon S3 - human genomes (1000 Genomes Project)
2. NIH clinical variations (ClinVar)
3. The Cancer Genome Atlas (TCGA)

The first stage of the genome processing step is enabled by Galaxy, an open-source application for biomedical research. Our initial assumption was that data processing time would vary depending on file size and on how many genomes we could process in parallel. Data acquisition was scheduled in Galaxy, and these files were downloaded automatically from S3 to the EC2 attached storage (see the Reference section for details). The clinical variation data and Cancer Genome Atlas data were scheduled and downloaded via FTP. We allocated approximately 1 TB of EC2 storage for the staging and processing of these data files. Additional details on file size and processing are described in Section 3.
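As a rough illustration, the scheduled FTP pull for the clinical variation data could be scripted as below; the exact ClinVar path and file name are assumptions rather than the values used in the project.

    import urllib.request

    # Assumed ClinVar location on the NCBI FTP site; the release/path actually used may differ.
    CLINVAR_URL = "ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz"

    def fetch_clinvar(destination="variant_summary.txt.gz"):
        # urlretrieve handles ftp:// URLs as well as http(s)://
        urllib.request.urlretrieve(CLINVAR_URL, destination)
        return destination

    if __name__ == "__main__":
        print("downloaded", fetch_clinvar())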


The goal of the data processing model is to transform large, semi-structured data files into smaller curated data structures for analytics and visualization. The public genome data on S3 typically include multiple genome files for each person; these files range from 2 to 15 GB compressed. The output of the genome processing step is a mutation table file (VCF) of approximately 100-200 MB. Each individual's mutation table files are transferred to S3 for persistent storage using the AWS Command Line Interface (CLI) tool.
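That persistent-storage step can be scripted around the AWS CLI roughly as follows; the bucket, key prefix and file name below are placeholders, not the project's actual bucket.

    import subprocess

    def push_vcf_to_s3(local_vcf, bucket="example-mutation-store"):
        # Copy a per-individual mutation table (VCF) to S3 via the AWS CLI.
        key = local_vcf.split("/")[-1]
        subprocess.run(
            ["aws", "s3", "cp", local_vcf, f"s3://{bucket}/vcf/{key}"],
            check=True,  # raise if the CLI reports an error
        )

    push_vcf_to_s3("HG00125.filtered.vcf")  # hypothetical file name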

2.2 Analyses Used

The alignment program (bowtie2) indexes the genome using a Burrows-Wheeler transform and uses that index to position short reads relative to the reference. The variant calling step utilizes the FreeBayes algorithm, which uses Bayesian analysis to determine the likelihood of a specific change given the other changes seen in the same region across all reads. The tumor analysis used a K-nearest neighbors classifier to estimate probabilities of association with cancer.

2.3 Techniques Used for Data Representation

Our final data processing step produced a "known mutation" file that was exported into comma-separated values (CSV) format to enable data visualization. We utilized Microsoft Power BI cloud visualization software to represent the data and enable drill-downs and pivots across various dimensions. We chose this software for its ease of use and its ability to empower end users to explore the data and build visualizations depending on their perspective.

3.0 Data Storage and Retrieval Design Architecture

3.1 Known Challenges

The main data-related challenges are to 1) take a user's genetic data and identify mutations, 2) choose the dangerous mutations and 3) contextualize those mutations in a way that is meaningful to a generally intelligent but non-expert user. Identification of mutations involves using an alignment algorithm to organize small (~100 base pair) genetic reads relative to a 3 billion base pair genomic reference. A typical way of running bioinformatics packages is to use the Galaxy server, which supports genomic analyses such as sequence alignment and variant calling. However, the public Galaxy server is limited in that it is slow and not very customizable. This is one of the reasons we chose to implement our own version of Galaxy on AWS. A further challenge is the scope of the project: we are using a number of different types of datasets and analyses. We know that some of these processes are error prone, and wherever possible we have tried to identify contingency plans and alternative methods to complete the required data analysis tasks. For example, alignment was cumbersome and we are continuously investigating alternative alignment methods.


There are a number of specific problems and constraints for data retrieval and data storage that we addressed with our architecture:

1. There are multiple file formats that need to be combined.
2. Whole genome sequence files are large (up to 20 GB), and each of these needs to be stored and distilled to usable information.
3. The TCGA data exists as multiple .maf genomic feature files that need to be organized and then subjected to a machine learning classifier to identify mutations that are overrepresented in tumors. Methods for both tasks are currently being developed.
4. The machine learning section includes analysis of ~1,000 tumors with ~1,000 variants each. The process needed to retrieve, compile, and analyze these data to obtain a filtering set of mutations.
5. Some bioinformatics packages can be hard to install, and it was not clear whether we would be able to get them to work on AWS for the genome processing stage. To address this concern we set up a working version of the pipeline on the public Galaxy server (usegalaxy.org) running on AWS; in a production implementation of this pipeline, we would install our own version on an AWS virtual machine.

The Cancer Genome Atlas Training Data

TCGA stores all of its tumor sequence data (as well as its other clinical data and metadata) on a publicly available File Transfer Protocol (FTP) site. These data come from multiple research institutions and thus vary slightly in format. Generally, however, they contain mutation counts with their positions along each chromosome. A portion (from Washington University's Genome Institute) of the TCGA breast cancer repository is located here. In the repository, mutation data (and related metadata) files are stored in zipped and unzipped directories. The mutation data are stored in either .maf (mutation annotation format) or .txt files; both are structured with similar data table formats. File sizes vary greatly in the TCGA breast cancer repository, ranging between 0.2 and 18 MB. Since the files are stored on an open-access FTP site, we developed a web scraper with the Python scrapy package that crawled the breast cancer repository. The spider downloaded files with the .maf extension using urllib.request, and these were then manually uploaded to S3. We could have used the Python package boto to upload the data automatically; however, we found that some files contained duplicate data even though their names were different. In another AWS S3 bucket, raw TCGA .maf data were stored separately from the Amazon 1000 Genomes alignment process. Additionally, a sample of four aligned 1000 Genomes sequences (200 MB each) was imported to the S3 bucket and standardized from .vcf format. Each of these four aligned .vcf samples represented an entire genome for a random anonymous subject. Both sources of data were processed and reformatted in R, then saved as a .csv. The .csv file contains the sparse matrix of the mutations per sample as well as a column containing the label of each sample (cancer = 1, no cancer = 0). This format allows for relatively simple separation of the features and labels for Python's scikit-learn machine learning classifiers while still being usable for Amazon Machine Learning (AML). AML is able to run parallelized machine learning algorithms directly from S3, which improves the scalability of our workflow.
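A rough pandas equivalent of that reformatting step is sketched below; the file paths and column names are illustrative assumptions, not the project's actual R code.

    import pandas as pd

    # Hypothetical inputs: one row per observed mutation, with a sample ID and a
    # mutation identifier (e.g. gene or chromosome:position). Column names are assumptions.
    tumor = pd.read_csv("tcga_mutations.csv")                 # TCGA tumor mutations, label 1
    control = pd.read_csv("thousand_genomes_mutations.csv")   # 1000 Genomes controls, label 0

    def to_feature_matrix(df, label):
        # Rows = samples, columns = mutation sites, values = mutation counts.
        mat = pd.crosstab(df["sample_id"], df["mutation_id"])
        mat["label"] = label
        return mat

    combined = pd.concat(
        [to_feature_matrix(tumor, 1), to_feature_matrix(control, 0)]
    ).fillna(0)

    # Sparse feature matrix plus label column, ready for scikit-learn or Amazon ML.
    combined.to_csv("combined_training_matrix.csv")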


3.2 Data Retrieval

This project was scoped to process genomes from ten humans given the constraints of data processing time and data storage. We also selected a cross-section of humans with different ethnic backgrounds, including British, Finnish, Chinese, Puerto Rican, Colombian and Nigerian subjects. The objective was to analyze potential linkages of mutations with particular ethnic backgrounds. The genome data were retrieved from the AWS S3 bucket http://s3.amazonaws.com/1000genomes/. The following genome files were retrieved and loaded into the Galaxy server:

1. http://s3.amazonaws.com/1000genomes/phase3/data/HG00125/sequence_read/ERR031932_1.filt.fastq.gz (GBR female - 3.8 GB compressed)
2. http://s3.amazonaws.com/1000genomes/phase3/data/HG00138/sequence_read/ERR016162_2.filt.fastq.gz (GBR male - 2.2 GB compressed)
3. http://s3.amazonaws.com/1000genomes/phase3/data/HG00378/sequence_read/ERR031983_1.filt.fastq.gz (HG00378 Finnish female - 4.0 GB compressed)
4. http://s3.amazonaws.com/1000genomes/phase3/data/HG00382/sequence_read/ERR251879_2.filt.fastq.gz (HG00382 Finnish male - 4.7 GB compressed)
5. http://s3.amazonaws.com/1000genomes/phase3/data/HG00445/sequence_read/ERR251086_1.filt.fastq.gz (HG00445 Chinese male - 5.8 GB compressed)
6. http://s3.amazonaws.com/1000genomes/phase3/data/HG00446/sequence_read/ERR032021.filt.fastq.gz (HG00446 Chinese female - 5.8 GB compressed)
7. http://s3.amazonaws.com/1000genomes/phase3/data/HG01161/sequence_read/SRR793494_2.filt.fastq.gz (HG01161 Puerto Rican male - 3.7 GB compressed)
8. http://s3.amazonaws.com/1000genomes/phase3/data/HG01162/sequence_read/SRR792591_1.filt.fastq.gz (HG01162 Puerto Rican female - 3.9 GB compressed)
9. http://s3.amazonaws.com/1000genomes/phase3/data/HG01124/sequence_read/ERR022462_2.filt.fastq.gz (HG01124 Colombian male - 5.1 GB compressed)
10. http://s3.amazonaws.com/1000genomes/phase3/data/HG01125/sequence_read/SRR098961.filt.fastq.gz (HG01125 Colombian female - 6.5 GB compressed)
11. http://s3.amazonaws.com/1000genomes/phase3/data/HG02952/sequence_read/ERR250469_2.filt.fastq.gz (HG02952 Nigerian female - 3.1 GB compressed)
12. http://s3.amazonaws.com/1000genomes/phase3/data/HG02947/sequence_read/ERR183444.filt.fastq.gz (HG02947 Nigerian male - 5.6 GB compressed)

The files above were manually added to the Galaxy configuration and then automatically downloaded to the EC2 attached storage.

3.3 Data Storage

All S3 genome files were downloaded onto the EC2 Galaxy server for data processing. These data were aligned and variants were called. At completion of alignment, the original files were deleted and the resulting VCF file was loaded into an S3 bucket, then pushed to the variant analysis phase. The mutation reference dataset (ClinVar) as well as the filtered mutation files (CSV) resulting from successive analysis steps are also stored in the S3 bucket. We also retrieved the 1000 Genomes VCF files to serve as a reference for the same test individuals. The reference files are stored in another S3 bucket; an example individual file is here, and the overall summary file for all individuals provided by 1000 Genomes is here.

4.0 Architecture Implementation (Results)

The architecture is composed of three modular components that align closely to the data processing pipeline in Figure 1.

4.1 Implementation Details/Examples

In order to build an application that can take an individual's sequencing data from a consumer service such as Illumina and generate a report detailing the unique features of that genome and their biological implications, we start with raw sequencing data. The raw sequencing data come as .fastq.gz files (2-25 GB each) stored in the AWS 1000 Genomes S3 bucket. The data consist of raw sequence reads: series of roughly 100 characters drawn from the four bases (A, T, C, G). These "shotgun sequences" are overlapping reads of randomly fragmented DNA that need to be mapped back onto the reference genome (Figure 1, step 1), with important differences from the reference identified (Figure 1, step 2) and contextualized in terms of known biology (Figure 1, step 3). The workflow takes in raw (.fastq) sequencing data and processes the reads using bowtie2, which leverages a Burrows-Wheeler index to rapidly align short sequencing reads. Variants are then called on the resulting .BAM alignment file using FreeBayes, a Bayesian haplotype-based variant discovery tool. In order to use well-maintained versions of these tools, we chose to use the Galaxy bioinformatics service, which hosts and maintains all of the major bioinformatics packages. Our alignment pipeline is available here.
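Outside of Galaxy, the same alignment and variant-calling steps could be scripted roughly as below; the index, reference and read file names are placeholders, in the project these tools were run through Galaxy rather than invoked directly, and the samtools sort/index step is a standard intermediate assumed here.

    import subprocess

    READS = "HG00125_reads.fastq.gz"   # placeholder input
    INDEX = "grch37_index"             # a prebuilt bowtie2 index of the reference genome
    REFERENCE = "grch37.fa"            # reference FASTA used for variant calling

    # 1. Align short reads against the reference (bowtie2 writes SAM output).
    subprocess.run(f"bowtie2 -x {INDEX} -U {READS} -S aligned.sam", shell=True, check=True)

    # 2. Sort to BAM and index it so the variant caller can read it efficiently.
    subprocess.run("samtools sort -o aligned.sorted.bam aligned.sam", shell=True, check=True)
    subprocess.run("samtools index aligned.sorted.bam", shell=True, check=True)

    # 3. Call variants with FreeBayes, producing a VCF of candidate mutations.
    subprocess.run(f"freebayes -f {REFERENCE} aligned.sorted.bam > variants.vcf",
                   shell=True, check=True)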

We also chose to deploy and run an instance of the Galaxy server on AWS using the Cloudman distribution to enable scalability and flexibility. When running Galaxy on the cloud we had to consider three data volumes, as discussed here:

1. data volume
2. tools volume
3. indices volume - note that we can reduce the indices' size by focusing only on certain regions of the genome

Data for the Galaxy server are read directly from the Amazon 1000 Genomes S3 bucket. For new users of our application, we envision them uploading their genome files to a secure S3 bucket from which we would read them by the same process. The data pipeline is constructed in Galaxy to produce a mutations table (.vcf).


Figure 3 - Galaxy data processing

The scope for this project is ten genomes, but the architecture could scale out to multiple Galaxy servers on an EC2 instance, or even multiple EC2 instances running in parallel, to improve throughput. Since the Galaxy server is a manually initiated workflow framework, we can use it in the initial phases of this project. However, as the service gains velocity we will need to build an alignment step that can be customized and called in an automated manner. In order to do this, future phases of the project will involve installing the pipeline independently of Galaxy on an AWS virtual server; Bowtie2 and FreeBayes are both available as open-source packages.

4.2 Dangerous Mutation Identification (Step 2)

Once variants are identified in step 1, we need to distill the list of mutations to the few that we want to report on. Initial runs have generated variant lists of about 50k candidate mutations. These are too many mutations to report, and the vast majority of these variations have no known consequence. To generate a compressed list of meaningful mutations, we filter the personal variant data by merging it with a pathological mutation data corpus (ClinVar and our tumor mutation list), using our Python pandas script FilterMutations.py. The first corpus is a mutation list that we are generating from The Cancer Genome Atlas (TCGA) and contains sequence files from over 1,000 malignant tumors. The second is from the National Institutes of Health (NIH) clinical variant database (ClinVar) and contains ~95,000 mutations with associated data about their clinical implications and characterization. In our test cases we had between 1 and 200 mutations left after filtering with the ClinVar mutation list. This list can be browsed or prioritized using criteria such as the age of the data corpus on the mutation (first publication date) and publication count as a proxy for the amount of information available. Both pieces of information are available through the Human Genome Mutation Database (HGMD) and can be accessed programmatically by scraping the HGMD site, as shown here.
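A minimal pandas sketch of the kind of merge FilterMutations.py performs is shown below; the column names and file paths are illustrative assumptions rather than the script's actual interface.

    import pandas as pd

    # Hypothetical flattened inputs; real VCF/ClinVar parsing is more involved.
    personal = pd.read_csv("HG00125_variants.csv")    # columns: chrom, pos, ref, alt
    clinvar = pd.read_csv("clinvar_pathogenic.csv")   # columns: chrom, pos, ref, alt, clinical_significance

    # Keep only the personal variants that also appear in the pathological corpus.
    known = personal.merge(clinvar, on=["chrom", "pos", "ref", "alt"], how="inner")

    known.to_csv("filtered_mutations.csv", index=False)
    print(f"{len(known)} known mutations retained out of {len(personal)}")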


4.2.1 The Cancer Genome Atlas Data Acquisition

Mutation data are stored in tab-delimited .maf files within various directories on the TCGA file transfer protocol (FTP) site. The files can be downloaded manually; however, we found that the mutation files are scattered through different directories organized by data origin and research institution. We therefore constructed a scrapy web crawler that starts crawling from the breast cancer data repository. The web crawler "tcgaSpider" is a child class that inherits from scrapy's CrawlSpider class and follows links within the body of each page (ignoring header links). The links are extracted from within the <pre> and <a> tags of the HTML bodies. This method of extracting links to .maf files is quite scalable: the extension is unique to mutation data (as opposed to .txt), and each .maf file listed represents a unique sample. Furthermore, if the algorithm were expanded to include features from other types of cancers, we would merely need to change the starting URL to include the directories that contain data for those cancers.
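The crawler might be structured roughly as below; the start URL, link-extraction rules and output feed are simplified placeholders rather than the project's actual tcgaSpider code.

    from scrapy.crawler import CrawlerProcess
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class TcgaMafSpider(CrawlSpider):
        """Simplified sketch of a CrawlSpider that walks a TCGA-style directory listing."""
        name = "tcga_maf"
        start_urls = ["https://example.org/tcga/brca/"]  # placeholder repository URL
        rules = (
            # Follow links found inside <pre> blocks (directory listings) and parse each page.
            Rule(LinkExtractor(restrict_xpaths="//pre"), callback="parse_page", follow=True),
        )

        def parse_page(self, response):
            # Record any link that points at a .maf mutation annotation file.
            for href in response.xpath("//a/@href").getall():
                if href.lower().endswith(".maf"):
                    yield {"maf_url": response.urljoin(href)}

    if __name__ == "__main__":
        process = CrawlerProcess(settings={"FEEDS": {"maf_links.json": {"format": "json"}}})
        process.crawl(TcgaMafSpider)
        process.start()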

4.2.2 K Nearest Neighbors Classifier

The combined TCGA data were used to train a K-nearest neighbors classifier from Python's scikit-learn, weighted by Euclidean distance. Since the current data consist only of mutation location information, K-nearest neighbors should be a relatively powerful yet lightweight classifier. 80% of the training set was used to fit the classifier. From the posterior probabilities, we found that the genes with the most mutations were the following: ABCA13, COL14A1, MUC16, PIK3CA, TP53, TTN, and USH2A. The score of the classifier was surprisingly good: 0.62 in one iteration with over 50,000 features. This is likely due to overfitting the model with an abundance of breast cancer samples, and may be remedied by collecting more .vcf data and running the analysis on EC2 instances. Below is a figure displaying genes with higher rates of mutation than the rest of the genes represented by all of the TCGA data (Figure 4).


Figure 4. The TCGA data in total consisted of over 700 unique tumors. The combined training data encompassed over 623,000 mutations (excluding non-point mutations), of which the vast majority came from the 1000 Genomes "control" samples.
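The classifier training described above can be sketched with scikit-learn as follows; the CSV layout matches the label/feature format described in Section 3.1, but the file name, neighbor count and 80/20 split details are assumptions.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Combined TCGA + 1000 Genomes matrix: one row per sample, mutation-site columns
    # plus a "label" column (cancer = 1, no cancer = 0). The file name is a placeholder.
    data = pd.read_csv("combined_training_matrix.csv", index_col=0)
    X = data.drop(columns="label")
    y = data["label"]

    # Hold out 20% of samples for evaluation (the report trained on 80%).
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Distance-weighted KNN; scikit-learn's default metric is Euclidean distance.
    knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
    knn.fit(X_train, y_train)

    print("accuracy:", knn.score(X_test, y_test))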

4.3 Mutant Report (Step 3)

The final step in the data processing pipeline is the mutant report: an automatically generated web page that serves as an entry point into the literature. The code to generate the report is still being constructed, but it will contain two levels: 1) a landing page with a clickable heat map coded by priority as calculated in the HGMD scrape (discussed in Section 4.2), and 2) a mutation report. The heat map provides an interactive overview of all of the putative pathogenic mutations found, and allows the user to specify the mutation they are interested in by clicking a region on the figure. Each mutation report will be generated on the fly from the current information available on the web. Each ClinVar entry is associated with an 'rs' reference number that can serve as an entry point to mutation- and disease-specific content. The data harvesting strategy is as follows. We might find the mutation for Hb Constant Spring (rs41464951) in one of our patients. We can then access the NCBI SNP page by appending this number to the NCBI SNP address:

http://www.ncbi.nlm.nih.gov/snp/41464951
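For example, the rs-number-to-URL step could be a one-line helper (the function name is hypothetical):

    # Build the dbSNP entry-point URL from a ClinVar 'rs' reference number.
    def dbsnp_url(rs_number):
        # e.g. "rs41464951" -> "http://www.ncbi.nlm.nih.gov/snp/41464951"
        return "http://www.ncbi.nlm.nih.gov/snp/" + rs_number.removeprefix("rs")

    print(dbsnp_url("rs41464951"))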

From the dbSNP page, we can collect the "clinical significance" and "Gene name" fields. Additionally, the PubMed and OMIM links can be culled from this page. Finally, more distant sources such as GeneCards (1) and the NIH gene function databases (2) can be linked by appending the gene name to their URLs, as below. The assignment of the last fields, "Molecular model", "Domain affected" and "Disrupted function", is going to be more difficult because this information is distributed throughout papers and may require a degree of human curation. If this is the case, we will limit our case reports to those with clear structure/function relationships. In the future, the relevant domain information can be curated and downloaded into a MongoDB instance so that each mutation can be associated with a molecular image and a description of the types of domain-specific functions.

1) http://www.genecards.org/cgi-bin/carddisp.pl?gene=HBA2

2) http://ghr.nlm.nih.gov/gene/HBA2

Demonstration mutant report

This filtered mutations file was generated by subjecting unaligned sequencing reads from person HG00125 to our full workflow. The VCF-formatted data were merged with the ClinVar mutation reference and transformed into a CSV file using a pandas DataFrame. The CSV file was then consumed by Microsoft's Power BI cloud visualization software to generate the display below (Figure 5).


Figure 5. An analysis from person HG00125. The left panel shows an overview of all mutations on two chromosomes, and the right panel shows a drill-down to look at a mutation that conveys a high risk of pathology.

4.4 Test Cases

We used existing genomes from the 1000 Genomes database to test our pipeline. In this step we were able to push two genomes through our analysis pipeline, but four other genomes took a very long time to process (in excess of one day) and we ended up terminating those processing steps. In order to deal with the long processing times, we installed our own instance of the Galaxy bioinformatics server using the Cloudman software. However, the process stalled after two days of processing on the third and fourth individuals.

4.5 Scalability

We deployed a cloud-based version of Galaxy on AWS EC2 clusters using auto-scaling (see the Reference section for configuration details). We found that the genome alignment and variant identification part of the workflow (Step 1) required significant computational resources. Processing times varied from 3 hours to 2 days depending on the size of the genome file. It was typical for 2 GB files to be processed within a few hours, while 6 GB+ files running on parallel workflows required almost 2 days. The AWS auto-scaling feature was configured with 3 nodes to support processing the genomes in parallel.

4.6 Insights Learned

There were relatively few known dangerous mutations in the ten individuals that we looked at. However, when we compared our variant feature files to those generated by the 1000 Genomes project for individual HG00125, we found more known mutations, suggesting that our pipeline may be less restrictive and preserve more relevant data.

Genome alignment and variant calling is a major bottleneck and needs to be optimized. The time required for alignment appears to scale exponentially with file size and can span days for large files. This may be avoided by submitting smaller files for alignment and then re-combining the results after variant calling is finished. Such a strategy would not be practical if we had to combine complex intermediate files (.BAM and .BAI); however, once the files are reduced to simple summaries of variants (such as .VCF) they can be combined. This strategy would have to be studied to ensure that it really reduces processing time and does not introduce an intolerable amount of noise. Another way to reduce the storage and computational resources required for the genome processing step is to limit our reference file to 1000 base pair windows around variants with known pathological significance, as defined by the ClinVar pathological mutations data corpus.
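For the splitting strategy above, recombining per-chunk VCF outputs could be as simple as the sketch below; the file names are placeholders, and a production merge would also need to reconcile headers and re-sort records by chromosome and position.

    # Naive recombination of per-chunk VCF outputs into one file.
    chunks = ["variants_part1.vcf", "variants_part2.vcf", "variants_part3.vcf"]

    with open("variants_combined.vcf", "w") as out:
        for i, path in enumerate(chunks):
            with open(path) as vcf:
                for line in vcf:
                    # Keep the header block ('#' lines) only from the first chunk.
                    if line.startswith("#") and i > 0:
                        continue
                    out.write(line)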

Incorporating the 1000 Genomes data into the training set also proved to be cumbersome, and it is not yet clear how we could compress the 1000 Genomes data. We could improve the stability of the process by keeping the formatting code in Python (as opposed to R) while improving the flexibility of that stage to interface with other languages and techniques. The combined data set is several GB in size before compression, and running Python locally has its limitations with regard to memory and processing power. However, Python integrates more smoothly with AWS.

5.0 Related Works

As mentioned above, a number of genome profiling services (GeneCards, Malacards, OMIM and GeneReviews) provide detailed mutation reports, and other tools such as Solvebio and Knome compile genomic data. There are also services, such as 23andMe, that sequence users and produce genetic reports for consumers. However, what we propose here is different because we seek to go beyond a static report and draw the user into explorable web content.

6.0 Conclusion

We have prototyped a personal genetic mutation calling pipeline designed to consume sequence reads and to produce accessible biological information that can serve as an entryway into the literature for general users. Our pipeline is functional but is mainly intended as a scaffold to support a richer layer placed on top of it; we have laid a functional foundation, but more development is needed. One important point is that simple genetic data can be scraped and presented easily, but the most useful type of information is explanatory data that exists in functional descriptions and diagrams. Manual curation can be used to get the most useful information, but we do not have a good method to collect the richest sources of information automatically. While there is a lot of genomic data available, conveying useful insight from these data is the major challenge; mutation data by themselves are not predictive of cancer, as cancer can arise from multiple and interacting environmental factors in addition to family history.

6.1 Improvement Recommendations

Overall, our pipeline was composed of multiple modules whose design could be improved for integration and speed. AWS provides many tools that we could adopt at a higher level than we currently do.

1. The three functional parts of the workflow were constructed individually, and there is room to improve the integration and resiliency of the end-to-end process.

2. Some of the steps require manual action, such as scheduling and initiating the data acquisition workflow. We could envision a process where a new genome is detected and then automatically processed by the pipeline. Processing trials would give us an informed idea of how long processing would take, and this estimate could be provided back to the user.

3. Additional effort and focus are required to deliver a rich, intuitive and engaging user experience. The filtered mutations file (the final output) includes multiple dimensions, and we could envision using an analytical package to explore relationships among them.

4. Classifier training is designed with scaling up in mind: services like Amazon Machine Learning could handle larger combined datasets. Data for other cancers from TCGA could be included while reducing the bottleneck of merging different datasets together.

5. Possibilities exist in feature engineering to improve accuracy.


Citations

Garrison, E. and Marth, G. (2012). Haplotype-based variant detection from short-read sequencing. http://arxiv.org/abs/1207.3907

Langmead, B. and Salzberg, S. (2012). Fast gapped-read alignment with Bowtie 2. Nat Methods 9(4): 357-359. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3322381

The Cancer Genome Atlas. Mutation Annotation File Format Description. https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/other/somatic_mutations/MutationAnnotationFormatDescription.pdf


References

AWS Configuration

Figure r1 - AWS EC2 instances for Galaxy Server

Figure r2 - AWS EBS Storage


Figure r3 - Multi-node cluster configuration

Figure r4 - Galaxy Data Processing Pipeline running on AWS


Figure r5 - Genome Processing Workflows on Galaxy AWS Server
