Whole exome sequencing (WES)

WHOLE EXOME SEQUENCING (WES) 4-11-2016

WHAT IS WES?

• Sequencing of the whole exome (the protein-coding regions of the genome).

• Rabbani et al. report that 85% of Mendelian disorders are linked to mutations in exonic regions, so WES can have great clinical utility.

• Local connection: In 2010, Dr. Elizabeth Worthey of the Medical College of Wisconsin sequenced the exome of a child with severe, intractable inflammatory bowel disease.

http://www.nature.com/jhg/journal/v59/n1/full/jhg2013114a.html

http://www.nature.com/gim/journal/v13/n3/full/gim9201146a.html

WES DATA

• All NGS assays use the same data storage formats for output: FASTQ (raw reads) and BAM (aligned reads).

However, in RNA-Seq, we were interested in gene counts. In WES data, we are interested in the differences between the human reference sequence and the sample data.

We will annotate these differences to see if they are deleterious or not.
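As a toy illustration of what "differences between the reference and the sample" means, the sketch below compares two already-aligned sequences of equal length and lists the single-base mismatches. The sequences and the function name `list_mismatches` are made up for illustration. Real variant callers work from aligned reads in BAM files and also handle insertions, deletions, and base quality.

```python
def list_mismatches(reference, sample, start=1):
    """Return (position, ref_base, sample_base) for each single-base difference.

    Assumes both sequences are already aligned and have the same length.
    Positions are 1-based, as in VCF files.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    return [
        (start + i, ref, alt)
        for i, (ref, alt) in enumerate(zip(reference, sample))
        if ref != alt
    ]


# Hypothetical 12-base stretch of reference and sample sequence.
reference = "ACGTTGCAAGCT"
sample = "ACGTTGTAAGCA"
for pos, ref, alt in list_mismatches(reference, sample):
    print(f"position {pos}: {ref} -> {alt}")
```

Each reported position is a candidate variant that would then be annotated.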

WES DATA

• Genome sequencing is getting cheaper and cheaper.
• The SRA (NCBI Sequence Read Archive) holds trillions of base pairs' worth of data.

WHOLE EXOME PIPELINE• We will be using a program called SeqMule to automate the analysis of our whole exome data.

http://seqmule.openbioinformatics.org/en/latest/

PAIRED-END SEQUENCING

• NGS data is almost always in a paired-end format, which means that there are two files associated with a particular run.
• For more information on the concept, I refer you to http://goo.gl/7FKH6j.
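Because the two files of a paired-end run describe the same fragments, read N in the forward file should carry the same ID as read N in the reverse file. Here is a minimal sketch of that check, assuming well-formed 4-line FASTQ records and read IDs that may end in /1 or /2. The function names and the tiny in-memory records are made up for illustration.

```python
import io
from itertools import zip_longest


def read_ids(handle):
    """Yield the read ID from each 4-line FASTQ record in an open text handle."""
    for line_number, line in enumerate(handle):
        if line_number % 4 == 0:  # header line of each record starts with '@'
            # Drop '@', any description after whitespace, and a trailing /1 or /2.
            read_id = line[1:].split()[0]
            if read_id.endswith(("/1", "/2")):
                read_id = read_id[:-2]
            yield read_id


def is_properly_paired(forward, reverse):
    """True if both files list the same read IDs in the same order."""
    return all(f == r for f, r in zip_longest(read_ids(forward), read_ids(reverse)))


# Hypothetical two-read forward and reverse files.
forward_fastq = "@read1/1\nACGT\n+\nIIII\n@read2/1\nTTGA\n+\nIIII\n"
reverse_fastq = "@read1/2\nTGCA\n+\nIIII\n@read2/2\nCCAT\n+\nIIII\n"
print(is_properly_paired(io.StringIO(forward_fastq), io.StringIO(reverse_fastq)))
```

With real data you would pass open file handles instead of `io.StringIO` objects.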




STEP 1: DOWNLOAD DATA FROM SRA

• The HapMap project sequenced many populations, including individuals of European ancestry from Utah.
• One of these individuals, a child known only by the sample accession number NA12878, is probably the most sequenced individual on Earth. You will download and analyze this individual yourself.
• For demonstration, we will download another individual from the same cohort, NA07000.

STEP 1: DOWNLOAD DATA FROM SRA

• Go to the SRA-DNAnexus website and enter SRR766039.
• Find the SRA file and download it.

STEP 1: DOWNLOAD DATA FROM SRA

• Create a new folder in Linux and download the SRA file into the folder.
• Commands:
    mkdir NA07000; cd NA07000
    wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR766/SRR766039/SRR766039.sra
    fastq-dump --split-3 SRR766039.sra
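If you want to fetch a different run, the download URL can be assembled from the run accession. The sketch below infers the path pattern from the single SRR766039 URL used in this step, so treat that pattern as an assumption; NCBI may have changed or retired this FTP layout since these slides were written. It also lists the two FASTQ file names that `fastq-dump --split-3` writes for a paired-end run. Both function names are made up for illustration.

```python
def sra_ftp_url(run_accession):
    """Build the download URL for a run, following the SRR766039 example.

    Assumption: the path is <first 3 chars>/<first 6 chars>/<accession>/<accession>.sra,
    inferred from one example URL; the server layout may have changed.
    """
    base = "ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra"
    prefix, group = run_accession[:3], run_accession[:6]
    return f"{base}/{prefix}/{group}/{run_accession}/{run_accession}.sra"


def split_fastq_names(run_accession):
    """File names that `fastq-dump --split-3` writes for a paired-end run."""
    return f"{run_accession}_1.fastq", f"{run_accession}_2.fastq"


print(sra_ftp_url("SRR766039"))
print(split_fastq_names("SRR766039"))
```

The `_1` file holds the forward reads and the `_2` file holds the reverse reads.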




STEP 2: RUN FASTQC

• While still in the NA07000 folder, run FastQC to get quality metrics.


STEP 2: RUN FASTQC: FORWARD READ

• 101 bp sequences, good quality throughout.

STEP 2: RUN FASTQC: REVERSE READ

• Illumina instruments typically show quality degradation toward the 3’ end of reverse reads.
• This pair of FASTQs does not need trimming.
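To make the FastQC per-base quality plot less of a black box, here is a rough sketch of the underlying idea: decode each quality character into a Phred score and average the scores at each read position. It assumes Phred+33 encoding, and the quality strings are invented. FastQC itself reports box plots and several other modules, so this is only the core calculation.

```python
def mean_quality_by_position(quality_strings, offset=33):
    """Average Phred score at each read position.

    Assumes Phred+33 encoding. Reads may differ in length; each position
    averages only the reads that reach it.
    """
    totals, counts = [], []
    for quality in quality_strings:
        for i, char in enumerate(quality):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


# Hypothetical quality lines (4th line of each FASTQ record) from two reads.
# 'I' decodes to Q40, '5' to Q20, '+' to Q10.
qualities = ["III5", "II5+"]
print(mean_quality_by_position(qualities))
```

A falling average toward the end of the list corresponds to the 3’ quality drop seen in the plots.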

STEP 3: UNDERSTAND SEQMULE

• Once we have the FASTQ files, we will use a program called SeqMule to:
1. Align the reads to the reference genome.
2. De-duplicate the alignment to remove PCR duplicates.
3. Re-align the reads around insertions and deletions.
4. Call variants and create a VCF of consensus calls.
5. Produce plots of coverage.
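One way to picture "consensus calls" in step 4: when several variant callers are run on the same alignment, keep only the variants that enough of them agree on. The sketch below uses a 2-of-3 rule, which is an assumption for illustration and not necessarily the rule SeqMule applies. The variants and the function name `consensus_calls` are made up.

```python
from collections import Counter


def consensus_calls(call_sets, min_callers=2):
    """Return variants reported by at least `min_callers` of the given call sets.

    Each variant is a (chrom, pos, ref, alt) tuple. The default threshold is
    an illustrative assumption, not SeqMule's documented behavior.
    """
    counts = Counter(variant for calls in call_sets for variant in set(calls))
    return sorted(v for v, n in counts.items() if n >= min_callers)


# Hypothetical calls from three callers on the same sample.
gatk = {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")}
samtools = {("chr1", 1000, "A", "G"), ("chr3", 42, "G", "A")}
freebayes = {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")}

for variant in consensus_calls([gatk, samtools, freebayes]):
    print(variant)
```

Raising `min_callers` to 3 keeps only the variant that all three callers report.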

STEP 3: UNDERSTAND SEQMULE

• Type: less ~/NGSTools/SeqMule/advanced_config to see the config file.

• In this file, these lines have 1 beside them (run = true):
• 2P_bwamem=1 #BWA-MEM alignment
• 3p_samtools_rmdup=1 #use MarkDuplicates from Picard tools to mark duplicates
• 4p_samtools_filter=1 #use 'samtools view' command to filter reads under 30 MAPQ
• 6px_gatklite_realign=1 #use GenomeAnalysisTKLite from GATK to generate GATK intervals and then do realignment
• 8p_gatk_HaplotypeCaller=1
• 8p_samtools_mpileup=1
• 8p_freebayes=1
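Rather than scrolling through the config with less, you could list the enabled steps programmatically. This sketch assumes every setting follows the `name=value #comment` shape of the lines quoted on this slide; the real advanced_config file may contain other kinds of lines. The disabled step in the example is invented to show the filtering.

```python
def enabled_steps(config_text):
    """Return the names of steps set to 1 in `name=value #comment` lines."""
    steps = []
    for line in config_text.splitlines():
        setting = line.split("#", 1)[0].strip()  # drop the trailing comment
        if "=" not in setting:
            continue  # skip blank lines and anything that is not a setting
        name, value = (part.strip() for part in setting.split("=", 1))
        if value == "1":
            steps.append(name)
    return steps


example_config = """\
2P_bwamem=1 #BWA-MEM alignment
2p_hypothetical_aligner=0 #made-up disabled step, for illustration only
8p_freebayes=1
"""
print(enabled_steps(example_config))
```

To run it on the real file, read `~/NGSTools/SeqMule/advanced_config` into a string and pass it in.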

STEP 4: RUN SEQMULE

• While in the NA07000 folder, run this command:
    seqmule pipeline -a SRR766039_1.fastq -b SRR766039_2.fastq -e -prefix NA07000 -threads 7 -capture default
• -a: forward read
• -b: reverse read
• -e: exome data
• -prefix: what you want to name the sample
• -threads: how many cores you want for alignment. 7 is good enough.
• -capture: capture regions; "default" uses the default exome
• SeqMule should start running and should not exit immediately.
• Wait about 4 hours.
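If you plan to run many samples, it can help to assemble the command in a script so the forward and reverse files cannot be mixed up. The sketch below uses only the flags from the command on this slide. The function name `seqmule_command` is made up, and the check that the two read files differ is an extra safeguard, not a SeqMule feature.

```python
import shlex


def seqmule_command(forward, reverse, prefix, threads=7, capture="default"):
    """Assemble the `seqmule pipeline` argument list from the flags shown above."""
    if forward == reverse:
        raise ValueError("forward and reverse reads must be different files")
    return [
        "seqmule", "pipeline",
        "-a", forward,
        "-b", reverse,
        "-e",
        "-prefix", prefix,
        "-threads", str(threads),
        "-capture", capture,
    ]


command = seqmule_command("SRR766039_1.fastq", "SRR766039_2.fastq", "NA07000")
print(shlex.join(command))
# To launch it for real (assumes seqmule is on your PATH):
#   import subprocess; subprocess.run(command, check=True)
```

The printed line matches the command you would otherwise type by hand.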

STEP 5: EXAMINE OUTPUT

• Open the NA07000_report folder after completion.
• Open the summary.html file to observe the results of the SeqMule run.

STEP 6: ANNOTATE CONSENSUS VCF

• Go to http://wannovar.usc.edu/ to use wANNOVAR, a web tool to annotate genomic variants. I have used custom filtering to keep only variants that are found in less than 5% of the population (common variants are filtered out). Press Submit when ready.


STEP 7: DOWNLOAD CSV FILE AND FILTER

• When wANNOVAR is complete, you have two choices:
1. Download the full annotation in CSV or TXT format to upload into Excel for manipulation.
2. Download the Step 3 VCF (if you used Custom Filtering) and re-annotate the VCF a second time to only annotate your filtered variants.
• In either case, use IGV to confirm variant depth by opening the realigned BAM file.
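If you would rather filter the downloaded CSV in a script than in Excel, something like the following works. The column names here (`Gene`, `PopFreq`) and the example rows are hypothetical, so check the header of your own download and pass the real frequency column name. The 5% cutoff mirrors the custom filter described in Step 6.

```python
import csv
import io


def rare_variants(csv_text, freq_column, max_freq=0.05):
    """Keep rows whose population frequency is below `max_freq`.

    A blank or "." frequency is treated as "not seen in the population" and kept.
    """
    kept = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row[freq_column].strip()
        freq = 0.0 if raw in ("", ".") else float(raw)
        if freq < max_freq:
            kept.append(row)
    return kept


# Hypothetical annotation export with a made-up frequency column.
example_csv = (
    "Chr,Start,Ref,Alt,Gene,PopFreq\n"
    "chr1,1000,A,G,GENE_A,0.01\n"
    "chr2,500,C,T,GENE_B,0.32\n"
    "chr3,42,G,A,GENE_C,.\n"
)
for row in rare_variants(example_csv, "PopFreq"):
    print(row["Gene"], row["PopFreq"])
```

For a real file, read its contents with `open(path).read()` and pass the text in.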

NOW IT’S YOUR TURN!

• You will run sample NA12878 through our whole-exome pipeline.
1. Create the sample folder.
2. Download a high-quality exome run for NA12878 using these commands:
    wget https://s3.amazonaws.com/bcbio_nextgen/NA12878-NGv3-LAB1360-A_1.fastq.gz
    wget https://s3.amazonaws.com/bcbio_nextgen/NA12878-NGv3-LAB1360-A_2.fastq.gz
3. Run FastQC on the reads.
4. Run SeqMule:
    seqmule pipeline -a NA12878-NGv3-LAB1360-A_1.fastq.gz -b NA12878-NGv3-LAB1360-A_2.fastq.gz -e -prefix NA12878 -capture default
5. Upload the consensus VCF to wANNOVAR, open the realigned BAM in IGV, and explore the most sequenced genome in the world!





HAPPY VARIANT HUNTING!
