Kiran V Garimella ( [email protected] ), Mark A DePristo GENOME SEQUENCING AND ANALYSIS, BROAD INSTITUTE Research Informatics Group ELI LILLY AND COMPANY 20-Line Lifesavers: Coding simple solutions in the GATK Eli Lilly / September 14-15, 2011

20-Line Lifesavers: Coding simple solutions in the GATK

DESCRIPTION

Taken from here: https://www.dropbox.com/sh/55nfktmn7lgai98/48sIHw8bzJ/kvg_20_line_lifesavers_mad_v2.pptx.pdf and uploaded to SlideShare for convenience. Credits to: Kiran V Garimella ([email protected]), Mark A DePristo, GENOME SEQUENCING AND ANALYSIS, BROAD INSTITUTE; Research Informatics Group, ELI LILLY AND COMPANY

Page 1: 20-Line Lifesavers: Coding simple solutions in the GATK

Kiran V Garimella (kiran.garimella@gmail.com), Mark A DePristo, GENOME SEQUENCING AND ANALYSIS, BROAD INSTITUTE

Research Informatics Group, ELI LILLY AND COMPANY

20-Line Lifesavers: Coding simple solutions in the GATK

Eli Lilly / September 14-15, 2011

Page 2: 20-Line Lifesavers: Coding simple solutions in the GATK

Genome Analysis Toolkit (GATK) \ˈjē-ˌnōm(,)ə-ˈna-lə-səs(,)ˈtül(,)kit\

Noun

1.  A suite of tools for working with medical resequencing projects (e.g. 1,000 Genomes, The Cancer Genome Atlas)

2.  A structured software library that makes writing efficient analysis tools using next-generation sequencing data easy

Page 3: 20-Line Lifesavers: Coding simple solutions in the GATK

Most users think of the toolkit merely as a set of tools that implement our ideas…

Page 4: 20-Line Lifesavers: Coding simple solutions in the GATK

… but the GATK's real power is in how easy it makes it to instantiate your ideas.

This is what we will discuss today.

Page 5: 20-Line Lifesavers: Coding simple solutions in the GATK

Some tasks are made difficult by the wrong tools

"These BAMs have numeric, non-unique read IDs that collide when you merge them! How long will it take to fix?"

Convert to SAM format, read the header, parse the read group info into a hash table keyed on the ID, loop over the reads, look up the read group ID in the hash, find the platform unit tag, prepend it to the read name, convert back to BAM, reindex the BAM.

Lines of code: 500.

"All day!"

(With all apologies to Randall Munroe and XKCD)

Page 6: 20-Line Lifesavers: Coding simple solutions in the GATK

That same task, written in the GATK (20 lines of code)

package org.broadinstitute.sting.gatk.walkers.examples;

import net.sf.samtools.SAMFileWriter;
import net.sf.samtools.SAMRecord;
import org.broadinstitute.sting.commandline.Output;
import org.broadinstitute.sting.gatk.contexts.ReferenceContext;
import org.broadinstitute.sting.gatk.refdata.ReadMetaDataTracker;
import org.broadinstitute.sting.gatk.walkers.ReadWalker;

public class FixReadNames extends ReadWalker<Integer, Integer> {
    @Output
    SAMFileWriter out;

    @Override
    public Integer map(ReferenceContext ref, SAMRecord read, ReadMetaDataTracker metaDataTracker) {
        // Prepend the read group's platform unit to the read name, then write the read out.
        read.setReadName(read.getReadGroup().getPlatformUnit() + "." + read.getReadName());
        out.addAlignment(read);

        return null;
    }

    @Override
    public Integer reduceInit() { return null; }

    @Override
    public Integer reduce(Integer value, Integer sum) { return null; }
}
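
To see why prepending the platform unit removes the collisions, here is a minimal plain-Java sketch of the same renaming idea that runs without the GATK on the classpath. The Read class, the fixNames() method, and the platform unit values are stand-ins invented for this illustration; they are not GATK or samtools types.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java stand-in for the renaming step; not GATK or samtools code. */
final class ReadNameFixSketch {

    /** Hypothetical minimal read: just a name and a read group ID. */
    static final class Read {
        final String name;
        final String readGroupId;

        Read(String name, String readGroupId) {
            this.name = name;
            this.readGroupId = readGroupId;
        }
    }

    /** Prefix each read name with the platform unit of its read group. */
    static List<String> fixNames(List<Read> reads, Map<String, String> platformUnitById) {
        List<String> fixed = new ArrayList<>();
        for (Read read : reads) {
            String platformUnit = platformUnitById.get(read.readGroupId);
            fixed.add(platformUnit + "." + read.name);
        }
        return fixed;
    }

    public static void main(String[] args) {
        // Made-up read group IDs and platform units.
        Map<String, String> platformUnitById = new HashMap<>();
        platformUnitById.put("rg1", "FLOWCELL_A.1");
        platformUnitById.put("rg2", "FLOWCELL_B.3");

        // Two reads from different read groups that share the numeric name "1".
        List<Read> reads = new ArrayList<>();
        reads.add(new Read("1", "rg1"));
        reads.add(new Read("1", "rg2"));

        System.out.println(fixNames(reads, platformUnitById));
    }
}
```

The explicit map lookup in this sketch is the part the walker gets for free: in the GATK version, read.getReadGroup().getPlatformUnit() replaces the hand-built hash table.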

Page 7: 20-Line Lifesavers: Coding simple solutions in the GATK

That same task, written in the GATK (code that's not filled in for you by the IDE: 5 lines)

Most of the code is boilerplate, and the IDE can fill it in for you. The amount of code you have to write by hand is actually very small.

Page 8: 20-Line Lifesavers: Coding simple solutions in the GATK

Those tasks are simple when using the right tools…

"These BAMs have numeric, non-unique read IDs that collide when you merge them! How long will it take to fix?"

Write a GATK ReadWalker that modifies the read name and writes it out again.

Spend the rest of the time looking at LOLcats.

Lines of code: 5.

"Um, all day..."

(With all apologies to Randall Munroe and XKCD)

Page 9: 20-Line Lifesavers: Coding simple solutions in the GATK

…though whether you'll tell people that is up to you.

"Hehe, I can haz cheezburger INDEED."

(With all apologies to Randall Munroe and XKCD)

Page 10: 20-Line Lifesavers: Coding simple solutions in the GATK

We're going to write genuinely useful, deadline-defeating, lifesaving tools in < 20 lines of code

Page 11: 20-Line Lifesavers: Coding simple solutions in the GATK

Now we'll go through a bunch of programs and learn to write new GATK tools by example

•  We'll set up the environment and look at five tutorial programs:
–  HelloRead: A simple walker that prints read information from a BAM
–  FixReadNames: Modify read names and emit results to a new BAM file
–  HelloVariant: A simple walker that prints variant information from a VCF
–  ComputeCoverageFromVCF: Computes a coverage histogram from a VCF
–  FindExclusiveVariants: Create a new VCF of variants exclusive to a sample

•  Finished and commented versions are in the codebase at:
–  java/src/org/broadinstitute/sting/gatk/walkers/tutorial/

•  How these tutorials work:
–  A numbered icon (e.g. 3) enumerates the various steps in each tutorial.
–  The code that you should write at each step is in the IntelliJ window.
–  Text in boxes like this gives additional information on each step, emphasizes some information, and may clarify the command or code that you should write.

Page 12: 20-Line Lifesavers: Coding simple solutions in the GATK

Setting up for GATK development

Page 13: 20-Line Lifesavers: Coding simple solutions in the GATK

See our wiki resources

•  http://www.broadinstitute.org/gsa/wiki/index.php/Configuring_IntelliJ

•  http://www.broadinstitute.org/gsa/wiki/index.php/Queue_with_IntelliJ_IDEA

Page 14: 20-Line Lifesavers: Coding simple solutions in the GATK

Mechanics of a GATK "walker" (a program that "walks" along a dataset in a prescribed way)

Page 15: 20-Line Lifesavers: Coding simple solutions in the GATK

ReadWalker: "walks" over reads and allows a computation to be performed on each one

[Figure: reads aligned to the reference sequence TTTAAATCTGTGGGGTTAATCGGCGGGCTAA, numbered (1) to (5) in computation order]

ReadWalker: process one read at a time

Example use cases:
1.  Setting an extra metadata tag in a read
2.  Searching for mouse contaminant reads and excluding them
3.  Finding or realigning indels

Some example GATK programs: CycleQualityWalker, TableRecalibrationWalker, CountReadsWalker, IndelRealigner, etc.

Page 18: 20-Line Lifesavers: Coding simple solutions in the GATK

LocusWalker: "walks" over genomic positions and allows a computation to be performed at each one

[Figure: reads stacked over the reference sequence TTTAAATCTGTGGGGTTAATCGGCGGGCTAA, with positions numbered (1)(2)(3)(4)(5) … in computation order]

LocusWalker: process a single-base genomic position at a time

Example use cases:
1.  Variant calling
2.  Depth of coverage calculations
3.  Compute properties of regions (GC content, read error rates)

Note: reads are required for locus walkers. RefWalkers are a similar type of walker that examines each genomic locus, but they do not require reads.
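
The difference between walking over reads and walking over loci can be sketched in plain Java. This is only an illustration under simplifying assumptions: the Read class, readLengths(), and depthPerLocus() are hypothetical names, reads are reduced to a start and end coordinate on one contig, and nothing here is GATK code. The first method does one computation per read; the second does one computation per genomic position, using every read that overlaps it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Contrasts read-by-read and locus-by-locus traversal; not GATK code. */
final class TraversalOrderSketch {

    /** Hypothetical read: 1-based, inclusive start and end on a single contig. */
    static final class Read {
        final int start;
        final int end;

        Read(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Read-by-read view: one computation per read (here, its aligned length). */
    static int[] readLengths(List<Read> reads) {
        int[] lengths = new int[reads.size()];
        for (int i = 0; i < reads.size(); i++) {
            Read r = reads.get(i);
            lengths[i] = r.end - r.start + 1;
        }
        return lengths;
    }

    /** Locus-by-locus view: one computation per position (here, the read depth). */
    static int[] depthPerLocus(List<Read> reads, int regionStart, int regionEnd) {
        int[] depth = new int[regionEnd - regionStart + 1];
        for (int pos = regionStart; pos <= regionEnd; pos++) {
            for (Read r : reads) {
                if (r.start <= pos && pos <= r.end) {
                    depth[pos - regionStart]++;
                }
            }
        }
        return depth;
    }

    public static void main(String[] args) {
        List<Read> reads = new ArrayList<>();
        reads.add(new Read(1, 3)); // made-up coordinates
        reads.add(new Read(2, 5));

        System.out.println(Arrays.toString(readLengths(reads)));
        System.out.println(Arrays.toString(depthPerLocus(reads, 1, 5)));
    }
}
```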

Page 21: 20-Line Lifesavers: Coding simple solutions in the GATK

RodWalker: "walks" over positions in a file and allows a computation to be performed at each one

[Figure: variant sites (marked *) in SampleA, SampleB, and SampleC along the reference sequence TTTAAATCTGTGGGGTTAATCGGCGGGCTAA, numbered (1) to (4) in computation order]

RodWalker: process one genomic position from a file (e.g. VCF) at a time

Example use cases:
1.  Variant filtering
2.  Computing metrics on variants
3.  Refining variant calls by enforcing additional constraints

Some example GATK programs: VariantEval, PhaseByTransmission, VariantAnnotator, VariantRecalibrator, SelectVariants, etc.

Page 24: 20-Line Lifesavers: Coding simple solutions in the GATK

Writing your first GATK walkers

Page 25: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 1: Hello, Read!

1. Right-click on "walkers" and select New->Package.

Page 26: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 1: Hello, Read!

2. Type "examples" as the package name.

3. Click "OK".

Page 27: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 1: Hello, Read!

4. Right-click on "examples" and select New->Java class. Enter the name "HelloRead".

A file declaring the class and proper package name is created for you.

Page 28: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 1: Hello, Read!

5. Add the following text to the class declaration:

extends ReadWalker<Integer, Integer> {

This tells the GATK that you are creating a program that iterates over all of the reads in a BAM file, one at a time.

The "import" statement at the top will be added by the IDE.

Page 29: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 1: Hello, Read!

6. IntelliJ can detect which methods you need to implement in order to get your program working.

Make sure your cursor is on the class declaration and press Alt-Enter to get the contextual action menu.

Select "Implement Methods".

Page 30: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 1: Hello, Read!

7. Select all of the methods (usually they'll already be selected, so you won't need to do anything).

8. Click "OK".

Page 31: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 1: Hello, Read!

The three methods, map(), reduceInit(), and reduce(), are now implemented with placeholder code.

Page 32: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 1: Hello, Read!

9. Declare a PrintStream and mark it with the @Output annotation. This tells the GATK that we're going to channel our output through this object.

Don't worry about instantiating it; the GATK will do that automatically.

Page 33: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 1: Hello, Read!

10. In your map() method, add a line of code that prints "Hello" and the name of the read:

out.println("Hello, " + read.getReadName());

Or, just type read. and then hit Ctrl-Space. IntelliJ will show you a window of all the methods you can call, and you can select one from the list.

11. When you're done, hit the disk icon (or type Ctrl-S) to save your work.

Page 34: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 1: Hello, Read!

12. Back in the terminal window, change to your gatk-lilly directory and type:

ant dist

This will compile the GATK-Lilly codebase, including your new walker!

Page 35: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 1: Hello, Read!

It'll take about a minute to compile.

Page 36: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 1: Hello, Read!

13. Run your code by entering the following command:

java -jar dist/GenomeAnalysisTK.jar \
    -T HelloRead \
    -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
    -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam \
    | less

Every walker must be provided with a reference FASTA file.

Page 37: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 1: Hello, Read!

Your code is now running and saying “Hello” to every read in the file!

Page 38: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 1: Hello, Read!

14. Let's add some information to the output. Add the line:

out.println("Hello, " + read.getReadName() + " at " + read.getReferenceName() + ":" + read.getAlignmentStart());

This will print out the read name, the contig name, and the starting position of the read's alignment.

Page 39: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 1: Hello, Read!

15. Compile and run with a single command:

ant dist && java -jar dist/GenomeAnalysisTK.jar \
    -T HelloRead \
    -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
    -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam \
    | less -S

(The && instructs the shell to proceed only if the previous command was successful. If the compilation fails, HelloRead will not be run.)

Page 40: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 1: Hello, Read!

The updated command is running and showing us the alignment position in addition to the read name!

Page 41: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 1: Hello, Read!

16. You can run on just a specific region by supplying the -L argument, and redirect the output to a separate file with the -o argument:

java -jar dist/GenomeAnalysisTK.jar \
    -T HelloRead \
    -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
    -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam \
    -L chr21:9411000-9411200 \
    -o test.txt

No additional code is required on your part to enable this.

Page 42: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 1: Hello, Read!

The resultant file, with reads from chr21:9,411,000-9,411,200 only.

Page 43: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 2: Fix read names

Let's use what we've learned to write a program that changes read names, as discussed earlier in this tutorial.

Page 44: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 2: Fix read names

1. Now let's create a new example program called "FixReadNames".

Page 45: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 2: Fix read names

2. Make FixReadNames a ReadWalker.

Page 46: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 2: Fix read names

3. This time, we'll emit a BAM file by directing the output to a SAMFileWriter object instead of a PrintStream.

Page 47: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 2: Fix read names

4. Change the read name, tacking on the platform unit information.

Page 48: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 2: Fix read names

5. Add the alignment to the output stream.

Page 49: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 2: Fix read names

6. Compile and run your code:

ant dist && java -jar dist/GenomeAnalysisTK.jar \
    -T FixReadNames \
    -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
    -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam \
    -L chr21:9411000-9411200 \
    -o test.bam

Page 50: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 2: Fix read names

7. Run the following command to see your results:

samtools view test.bam | less -S

Page 51: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 2: Fix read names

All of the read names now have the platform unit prepended to them!

Page 52: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 3: Hello, Variant!

This will be a larger example, introducing variant processing, map-reduce calculations, and the onTraversalDone() method. All code required is listed here.

Page 53: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 3: Hello, Variant!

1. We've created a new program called "HelloVariant".

Page 54: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 3: Hello, Variant!

2. This program extends RodWalker<Integer, Integer>.

Page 55: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 3: Hello, Variant!

3. It declares a PrintStream.

Page 56: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 3: Hello, Variant!

4. In the map() function, we'll loop over lines in a VCF file and print metadata from each record.

Page 57: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 3: Hello, Variant!

5. Return 1. This will get passed to reduce() later.

Page 58: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 3: Hello, Variant!

6. reduceInit() gets called before the first reduce() call. By returning 0, we initialize the record counter.

Page 59: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 3: Hello, Variant!

7. All of the return values from map() get passed to reduce(), one at a time. Here, we add value to sum, effectively counting all the calls to map().

Page 60: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 3: Hello, Variant!

8. The onTraversalDone() method runs after the computation is complete. Here, we print the total number of map() calls made.
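
The call order described in steps 5 through 8 can be imitated in a few lines of plain Java. This is a sketch of the pattern only, not the GATK engine: the Walker interface, the traverse() driver, and CountingWalker are hypothetical names made up for this illustration, and the real method signatures in the GATK take more arguments.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java imitation of the walker call order; the real GATK engine is more involved. */
final class MapReduceSketch {

    /** Hypothetical stand-in for a walker: M is the map() result, R the running reduce value. */
    interface Walker<T, M, R> {
        M map(T item);
        R reduceInit();
        R reduce(M value, R sum);
        void onTraversalDone(R result);
    }

    /** Drives a walker over the items in the order the steps above describe. */
    static <T, M, R> R traverse(List<T> items, Walker<T, M, R> walker) {
        R sum = walker.reduceInit();          // called once, before any reduce()
        for (T item : items) {
            M value = walker.map(item);       // one map() call per item
            sum = walker.reduce(value, sum);  // fold the map() result into the running sum
        }
        walker.onTraversalDone(sum);          // called once, after the last reduce()
        return sum;
    }

    /** Counts records the way HelloVariant is described: map() returns 1, reduce() adds. */
    static final class CountingWalker implements Walker<String, Integer, Integer> {
        public Integer map(String record) { return 1; }
        public Integer reduceInit() { return 0; }
        public Integer reduce(Integer value, Integer sum) { return value + sum; }
        public void onTraversalDone(Integer result) {
            System.out.println("Processed " + result + " records");
        }
    }

    public static void main(String[] args) {
        // Made-up records standing in for lines of a VCF file.
        List<String> records = Arrays.asList("variant1", "variant2", "variant3");
        traverse(records, new CountingWalker());
    }
}
```

Keeping the running total in the reduce value rather than in a field of the walker mirrors the structure the tutorial uses: map() stays a per-record computation, and all accumulation happens in reduce().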

Page 61: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 3: Hello, Variant!

9. Compile and run the HelloVariant walker, but this time, rather than specifying a BAM file with the -I argument, we'll attach a VCF file:

ant dist && java -jar dist/GenomeAnalysisTK.jar \
    -T HelloVariant \
    -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
    -B:variant,VCF /lrlhps/apps/gatk-lilly/testdata/test.chr21.analysis_ready.vcf

Page 62: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 3: Hello, Variant!

The program prints out the reference allele, alternate allele, and locus for each VCF record, and finally prints out the number of records processed!

Page 63: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 4: Compute depth of coverage from a VCF file

Let's continue exploring variant processing by taking a closer look at the VariantContext object, the programmatic representation of a VCF record.

This program will compute a depth of coverage histogram using VCF metadata rather than a BAM file.

Page 64: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 4: Compute depth of coverage from a VCF file

1. Create a new program called ComputeCoverageFromVCF of type RodWalker<Integer, Integer> with the usual @Output PrintStream out declaration.

Page 65: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 4: Compute depth of coverage from a VCF file

2. Add a command-line argument with the following code:

@Argument(fullName="sample", shortName="sn", doc="Sample to process", required=false)
public String SAMPLE;

This adds the command-line argument --sample (aka -sn) and stores the supplied value in the String variable SAMPLE.

We'll use this to let the user choose between coverage for a specific sample and coverage for all of the samples (by specifying no sample at all).

Page 66: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 4: Compute depth of coverage from a VCF file

3. Declare a map to store the coverage counts.

private TreeMap<Integer, Integer> histogram = new TreeMap<Integer, Integer>();

A TreeMap is a map that, unlike a plain hashtable, returns its keys in sorted order.

Page 67: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 4: Compute depth of coverage from a VCF file

4. Loop over the variants. For each one, we'll print the coverage observed. We also make sure that we get the coverage for the sample requested (if the user specified a sample name to the --sample argument), or for all samples (if the user specified no sample name at all).

For every coverage level we observe, we increment the appropriate entry in the histogram object.

Page 68: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 4: Compute depth of coverage from a VCF file

5. In the onTraversalDone() method, we'll loop over every coverage level in the histogram and output the depth and the number of times we observed that depth.
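
The histogram bookkeeping from steps 3 through 5 can be tried on its own in plain Java. The sketch below assumes the depths have already been extracted from the VCF records (the values are made up), and the class and method names are hypothetical; only the TreeMap usage corresponds to the declaration shown in step 3.

```java
import java.util.Map;
import java.util.TreeMap;

/** Stand-alone sketch of the coverage histogram; the depths are supplied directly. */
final class CoverageHistogramSketch {
    private final TreeMap<Integer, Integer> histogram = new TreeMap<>();

    /** Record one observation of the given depth (the per-variant step). */
    void add(int depth) {
        Integer count = histogram.get(depth);
        histogram.put(depth, count == null ? 1 : count + 1);
    }

    /** Two tab-separated columns, sorted by depth: depth, then number of observations. */
    String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Integer, Integer> entry : histogram.entrySet()) {
            sb.append(entry.getKey()).append('\t').append(entry.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        CoverageHistogramSketch sketch = new CoverageHistogramSketch();
        int[] depths = {30, 12, 30, 8, 12, 30}; // made-up depths
        for (int depth : depths) {
            sketch.add(depth);
        }
        System.out.print(sketch.report());
    }
}
```

Because the TreeMap iterates its keys in ascending order, the report needs no explicit sort before printing.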

Page 69: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 4: Compute depth of coverage from a VCF file

6. Compile and run:

ant dist && java -jar dist/GenomeAnalysisTK.jar \
    -T ComputeCoverageFromVCF \
    -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
    -B:variant,VCF /lrlhps/apps/gatk-lilly/testdata/test.chr21.analysis_ready.vcf \
    -o histogram.txt

Page 70: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 4: Compute depth of coverage from a VCF file

Two columns of information are printed. The first column is the coverage level; the second is the number of times that coverage level was observed.

Page 71: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 5: Find variants unique to a single sample

For our last example, we'll write a simple program that can take an input VCF and write a new VCF containing only variants that are exclusive to one sample.

We'll also introduce the initialize() method, which can be used to prepare the environment for the computation.

Page 72: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 5: Find variants unique to a single sample

1. Create a new RodWalker called FindExclusiveVariants that has a command-line argument called "sample" (aka "sn") of type String.

Add an output stream, but rather than making it a PrintStream, make it a VCFWriter. We'll use this to output a new VCF file based on the input VCF.

Page 73: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 5: Find variants unique to a single sample

2. The initialize() method is called first, before any of the map() or reduce() calls are made. It is useful for preparing the environment, writing headers, setting up variables, etc.

Here, we'll write a VCF header to the output stream. While we're free to add/remove header lines and samples, we'll just copy the input file's header to the output file.

Page 74: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 5: Find variants unique to a single sample

3. Loop over each record in the VCF, and each Genotype object contained within the VariantContext object. Check the genotypes of each sample and, if only our sample of interest is variant, output the record to the new VCF file.
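
The "only our sample of interest is variant" test in step 3 can be sketched in plain Java. This is an illustration under an assumption: the genotypes of one VCF record are reduced to a map from sample name to a boolean saying whether that sample carries a non-reference allele. The isExclusiveTo() method and the sample names other than 113N are hypothetical, and no GATK classes are used.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Stand-alone sketch of the exclusivity check; not GATK code. */
final class ExclusiveVariantSketch {

    /**
     * True when the given sample is variant at this site and every other sample is not.
     * The map is a hypothetical stand-in for the genotypes of one VCF record.
     */
    static boolean isExclusiveTo(String sample, Map<String, Boolean> isVariantBySample) {
        if (!Boolean.TRUE.equals(isVariantBySample.get(sample))) {
            return false; // sample missing from the record, or not variant here
        }
        for (Map.Entry<String, Boolean> entry : isVariantBySample.entrySet()) {
            if (!entry.getKey().equals(sample) && Boolean.TRUE.equals(entry.getValue())) {
                return false; // some other sample is variant too
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // One made-up site: only 113N is variant. The other sample names are invented.
        Map<String, Boolean> site = new LinkedHashMap<>();
        site.put("sampleX", false);
        site.put("sampleY", false);
        site.put("113N", true);
        site.put("sampleZ", false);

        System.out.println(isExclusiveTo("113N", site));
        System.out.println(isExclusiveTo("sampleX", site));
    }
}
```

In the walker, a record for which this check passes would be the one written to the VCFWriter; every other record is skipped.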

Page 75: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 5: Find variants unique to a single sample

4. Compile and run:

ant dist && java -jar dist/GenomeAnalysisTK.jar \
    -T FindExclusiveVariants \
    -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
    -B:variant,VCF /lrlhps/apps/gatk-lilly/testdata/test.chr21.analysis_ready.vcf \
    -sn 113N \
    -o 113.exclusive.vcf

Page 76: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 5: Find variants unique to a single sample

5. After the program completes, look at the output.

Page 77: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 5: Find variants unique to a single sample

6. You can scroll left and right with the arrow keys, but let's clean up the output to make it easier to read. Supply this command instead:

grep -v '##' 113.exclusive.vcf | cut -f1-7,10- | head -10 | column -t | less -S

Page 78: 20-Line Lifesavers: Coding simple solutions in the GATK

Example 5: Find variants unique to a single sample

Observe how the third sample is variant and the other three samples are not. Our program is selecting only the variants that are exclusive to 113N!

Page 79: 20-Line Lifesavers: Coding simple solutions in the GATK

Conclusions

•  From the five example programs, we have learned how to:
–  configure IntelliJ for GATK development
–  create a new ReadWalker or RodWalker
–  declare output streams (PrintStream, SAMFileWriter, VCFWriter)
–  access and modify metadata in reads
–  access variants, samples, and metadata from a VCF file
–  declare command-line arguments
–  prepare for computations with the initialize() method
–  finish computations with the onTraversalDone() method
–  compile and run new GATK programs

•  This tutorial is more than enough to get started with writing new and useful GATK programs
–  Our FixReadNames, ComputeCoverageFromVCF, and FindExclusiveVariants walkers are fully realized programs, ready to be used for real work.
–  You now have enough information to write your own somatic variant finder.

Page 80: 20-Line Lifesavers: Coding simple solutions in the GATK

Additional resources

•  For more information on developing in the GATK and Java, see
–  http://www.broadinstitute.org/gsa/wiki/index.php/GATK_Development
–  http://download.oracle.com/javase/tutorial/java/index.html

•  Explore the GATK Git repository at
–  https://github.com/broadgsa
–  https://github.com/signup/free (to add your own code, sign up for a free account)

•  To learn Git, the codebase's version control system, see
–  http://gitref.org/
–  http://git-scm.com/course/svn.html (for those already familiar with SVN)

•  Read our papers on the GATK framework and tools
–  http://genome.cshlp.org/content/20/9/1297.long
–  http://www.nature.com/ng/journal/v43/n5/abs/ng.806.html

•  For more guidance, feel free to look at other programs in the GATK
–  Every program is a tutorial!