Accessing the 1000 Genomes Data
Paul Flicek, European Bioinformatics Institute


Data access

  • General information
  • File access
  • 1000 Genomes Browser
  • Tools
  • Where to find help

www.1000genomes.org

      

Introduction

The main goal of the 1000 Genomes Project is to establish a comprehensive and detailed catalogue of human genome variations, which in turn will empower association studies to identify disease-causing genes. The project now has data and variant genotypes for more than 1000 individuals in 14 populations. The ftp site contains more than 120 Tbytes of data in 200000 files.

DATA TYPE    FILE FORMAT    SIZE
sequence     FASTQ          43 Tbases raw sequence
alignment    BAM            56 Tbytes of BAM files
variants     VCF            389M SNPs, ~47M short indels

Discoverability

Sequence, alignment and variant data are made available as quickly as possible through the project ftp site (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp | ftp://ftp-trace.ncbi.nih.gov/1000genomes). With more than 200000 files, though, discovering new data can be difficult.

The ftp site has an index that is updated nightly. This index is searchable from our website: http://www.1000genomes.org/ftpsearch

The search allows users to specify which ftp site to get paths to, to get md5 checksums, and also to filter out high-volume results like bam and fastq files.

We also have various routes for users to discover new data:

  • Website: http://www.1000genomes.org/announcements
  • Twitter: @1000genomes
  • RSS: http://www.1000genomes.org/announcements/rss.xml
  • Email: 1000announce@1000genomes.org
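The md5 checksums returned by the ftp search can be used to confirm that a downloaded file arrived intact. The following is a minimal sketch using only the Python standard library; the function names and the demonstration file are hypothetical, and the expected checksum would in practice be the value listed for the file in the ftp index.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte BAM or FASTQ files never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a local file against a published checksum (hypothetical helper)."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Demonstration on a small temporary file standing in for a download.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"ACGT\n")
    demo_path = tmp.name

published = hashlib.md5(b"ACGT\n").hexdigest()  # stand-in for the index value
print(matches_checksum(demo_path, published))
os.remove(demo_path)
```

Reading in chunks is a design choice that matters here because many of the files are far larger than typical memory.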

1000 Genomes Project Resources
L. Clarke, H. Zheng-Bradley, R. Smith, E. Kulesha, I. Toneva, B. Vaughan, P. Flicek and 1000 Genomes Consortium
European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK

Visualization
http://browser.1000genomes.org

The 1000 Genomes project utilizes the Ensembl Browser to display our variant calls. We provide rapid access to project variant calls through the browser before they become available via dbSNP and DGVa.

Tracks of 1000 Genomes variants by population can be viewed in the location page.

A list of variants can be obtained for any given transcript. In addition to basic information about a variant, PolyPhen and SIFT annotations are displayed to indicate the clinical significance of the variant.

Allele frequency for individual variants in different populations is displayed on the 'Population Genetics' page.

Users can attach remote files as custom tracks. In the example below, the HG00120 track is a 1000 Genomes bam file added to the browser.

Laura Clarke, EBI
EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
laura@ebi.ac.uk

Accessibility
http://browser.1000genomes.org/tools.html

The project provides several tools to help users access and interpret the data provided.

Variant Effect Predictor
The predictor takes a list of variant positions and alleles and predicts the effects of each of these on any overlapping features (transcripts, regulatory features) annotated in Ensembl. An example output is shown below.

Data Slicer
Many of the 1000 Genomes files are large and cumbersome to handle. The Data Slicer allows users to get data for specific regions of the genome and to avoid having to download many gigabytes of data they don't need … samples, populations you choose. Below is the Data Slicer input interface.

Variation Pattern Finder
  • The Variation Pattern Finder (VPF) allows one to look for patterns of shared variation between individuals in a VCF file.
  • Within a VCF file, different samples have different combinations of variation genotypes. The VPF looks for distinct variation combinations within a user-specified region shared by different individuals.
  • The VPF only focuses on variations that have functional consequences for protein-coding genes, such as non-synonymous coding SNPs and splice site changes.

Acknowledgements
We would like to thank the Ensembl variation team for all their help, particularly Will McLaren and Graham Ritchie.
Funding: The Wellcome Trust

T +44 (0) 1223 494 444, F +44 (0) 1223 494 468, http://www.ebi.ac.uk

     


ftp://ftp.1000genomes.ebi.ac.uk
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp

  • Sequences & alignments by sample ID
  • Datasets to accompany the pilot data publication
  • Current and archive dataset releases
  • Pre-release datasets and project working materials
  • Site documentation


BIOINFORMATICS APPLICATIONS NOTE, Vol. 25 no. 16 2009, pages 2078–2079, doi:10.1093/bioinformatics/btp352

Sequence analysis

and thus provides universal tools for processing read alignments.
Availability: http://samtools.sourceforge.net
Contact: rd@sanger.ac.uk

1 INTRODUCTION
With the advent of novel sequencing technologies such as Illumina/Solexa, AB/SOLiD and Roche/454 (Mardis, 2008), a variety of new alignment tools (Langmead et al., 2009; Li et al., 2008) have been designed to realize efficient read mapping against large reference sequences, including the human genome. These tools generate alignments in different formats, however, complicating downstream processing. A common alignment format that supports all sequence types and aligners creates a well-defined interface between alignment and downstream analyses, including variant detection, genotyping and assembly.

The Sequence Alignment/Map (SAM) format is designed to achieve this goal. It supports single- and paired-end reads and combining reads of different types, including color space reads from AB/SOLiD. It is designed to scale to alignment sets of $10^{11}$ or more base pairs, which is typical for the deep resequencing of one human individual.

In this article, we present an overview of the SAM format and briefly introduce the companion SAMtools software package. A detailed format specification and the complete documentation of SAMtools are available at the SAMtools web site.

*To whom correspondence should be addressed.
†The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.

on the field) if the corresponding information is unavailable. The optional fields are presented as key-value pairs in the format of TAG:TYPE:VALUE. They store extra information from the platform or aligner. For example, the 'RG' tag keeps the 'read group' information for each read. In combination with the '@RG' header lines, this tag allows each read to be labeled with metadata about its origin, sequencing center and library. The SAM format specification gives a detailed description of each field and the predefined TAGs.
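The TAG:TYPE:VALUE layout of the optional fields can be illustrated with a short parser. This is only a sketch: the function names are hypothetical, the type codes handled ('i' for integer, 'f' for float, anything else kept as text) are taken from the SAM specification rather than from the passage above, and the example values are made up.

```python
def parse_optional_field(field):
    """Split one TAG:TYPE:VALUE field and convert the value by its type code.

    The split is limited to two colons because a string value may itself
    contain ':' characters.
    """
    tag, type_code, value = field.split(":", 2)
    if type_code == "i":      # integer
        return tag, int(value)
    if type_code == "f":      # floating point
        return tag, float(value)
    return tag, value         # 'Z' (string) and other types kept as text


def parse_optional_fields(fields):
    """Turn the optional columns of one alignment line into a dict."""
    return dict(parse_optional_field(field) for field in fields)


# Made-up optional columns: a read group, an integer and a float tag.
example = ["RG:Z:grp1", "NM:i:2", "XS:f:0.5"]
print(parse_optional_fields(example))
```

Looking up the 'RG' value in the resulting dict gives the key under which the matching '@RG' header line would describe the read's origin.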

2.1.2 Extended CIGAR  The standard CIGAR description of pairwise alignment defines three operations: 'M' for match/mismatch, 'I' for insertion compared with the reference and 'D' for deletion. The extended CIGAR proposed in SAM added four more operations: 'N' for skipped bases on the reference, 'S' for soft clipping, 'H' for hard clipping and 'P' for padding. These support splicing, clipping, multi-part and padded alignments. Figure 1 shows examples of CIGAR strings for different types of alignments.
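To make the seven operations concrete, the sketch below parses a CIGAR string and counts how many reference bases and how many read bases it covers. The function names and example strings are hypothetical. Which operations advance along the reference (M, D, N) and which along the stored read sequence (M, I, S) follows the SAM specification; it is not spelled out in the passage above.

```python
import re

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP])")
REFERENCE_OPS = {"M", "D", "N"}  # operations that advance along the reference
QUERY_OPS = {"M", "I", "S"}      # operations that advance along the read sequence


def parse_cigar(cigar):
    """Return a list of (length, operation) pairs, rejecting malformed input."""
    tokens = CIGAR_TOKEN.findall(cigar)
    if "".join(length + op for length, op in tokens) != cigar:
        raise ValueError("malformed CIGAR string: %r" % cigar)
    return [(int(length), op) for length, op in tokens]


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in REFERENCE_OPS)


def query_length(cigar):
    """Number of read bases present in the stored sequence."""
    return sum(n for n, op in parse_cigar(cigar) if op in QUERY_OPS)


print(parse_cigar("3S6M1I4M2D5M"))
print(reference_span("8M100N8M"), query_length("8M100N8M"))
```

A spliced read such as the hypothetical '8M100N8M' covers 116 reference bases while containing only 16 read bases, which is why the 'N' operation is needed in addition to 'D'.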

2.1.3 Binary Alignment/Map format  To improve the performance, we designed a companion format Binary Alignment/Map (BAM), which is the binary representation of SAM and keeps exactly the same information as SAM. BAM is compressed by the BGZF library, a generic library developed by us to achieve fast random access in a zlib-compatible compressed file. An example alignment of 112 Gbp of Illumina GA data requires 116 GB of disk space (1.0 byte per input base), including sequences, base qualities and all the meta information generated by MAQ. Most of this space is used to store the base qualities.

2.1.4 Sorting and indexing  A SAM/BAM file can be unsorted, but sorting by coordinate is used to streamline data processing and to avoid loading extra alignments into memory. A position-sorted BAM file can be indexed. We combine the UCSC binning scheme (Kent et al., 2002) and simple linear indexing to achieve fast random retrieval of alignments overlapping a

© 2009 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


ABSTRACT
Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API.
Availability: http://vcftools.sourceforge.net
Contact: rd@sanger.ac.uk

Received on October 28, 2010; revised on May 4, 2011; accepted on May 28, 2011

1 INTRODUCTION
One of the main uses of next-generation sequencing is to discover variation among large populations of related samples. Recently, a format for storing next-generation read alignments has been standardized by the SAM/BAM file format specification (Li et al., 2009). This has significantly improved the interoperability of next-generation tools for alignment, visualization and variant calling. We propose the variant call format (VCF) as a standardized format for storing the most prevalent types of sequence variation, including SNPs, indels and larger structural variants, together with rich annotations. The format was developed with the primary intention to represent human genetic variation, but its use is not restricted to diploid genomes and can be used in different contexts as well. Its flexibility and user extensibility allows representation of a wide variety of genomic variation with respect to a single reference sequence.

*To whom correspondence should be addressed.
†The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
‡http://www.1000genomes.org

Although generic feature format (GFF) has recently been extended to standardize storage of variant information in genome variant format (GVF) (Reese et al., 2010), this is not tailored for storing information across many samples. We have designed the VCF format to be scalable so as to encompass millions of sites with genotype data and annotations from thousands of samples. We have adopted a textual encoding, with complementary indexing, to allow easy generation of the files while maintaining fast data access. In this article, we present an overview of the VCF and briefly introduce the companion VCFtools software package. A detailed format specification and the complete documentation of VCFtools are available at the VCFtools web site.

2 METHODS

2.1 The VCF
2.1.1 Overview of the VCF  A VCF file (Fig. 1a) consists of a header section and a data section. The header contains an arbitrary number of meta-information lines, each starting with characters '##', and a TAB delimited field definition line starting with a single '#' character. The meta-information header lines provide a standardized description of tags and annotations used in the data section. The use of meta-information allows the information stored within a VCF file to be tailored to the dataset in question. It can be also used to provide information about the means of file creation, date of creation, version of the reference sequence, software used and any other information relevant to the history of the file. The field definition line names eight mandatory columns, corresponding to data columns representing the chromosome (CHROM), a 1-based position of the start of the variant (POS), unique identifiers of the variant (ID), the reference allele (REF), a comma separated list of alternate non-reference alleles (ALT), a phred-scaled quality score (QUAL), site filtering information (FILTER) and a semicolon separated list of additional, user extensible annotation (INFO). In addition, if samples are present in the file, the mandatory header columns are followed by a FORMAT column and an arbitrary number of sample IDs that define the samples included in the VCF file. The FORMAT column is used to define the information contained within each subsequent genotype column, which consists of a colon separated list of fields. For example, the FORMAT field GT:GQ:DP in the fourth data entry of Figure 1a indicates that the subsequent entries contain information regarding the genotype, genotype quality and read depth for each sample. All data lines are TAB delimited and the number of fields in each data line must match the number of fields in the header line. It is strongly recommended that all annotation tags used are declared in the VCF header section.
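The column layout described above can be sketched as a small parser for a single data line. This is an illustration under stated assumptions, not a replacement for VCFtools: the function names, the sample IDs 'S1' and 'S2' and the example values are hypothetical, and the handling of '.' as a missing value and of INFO keys without '=' as flags follows the VCF specification rather than the passage above.

```python
VCF_FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info):
    """Split the semicolon separated INFO column into a dict."""
    result = {}
    if info == ".":                       # assumed missing-value marker
        return result
    for item in info.split(";"):
        key, has_value, value = item.partition("=")
        result[key] = value if has_value else True   # bare keys act as flags
    return result


def parse_vcf_line(line, sample_ids):
    """Parse one TAB delimited data line given the sample IDs from the header."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_FIXED_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])            # 1-based start position
    record["ALT"] = record["ALT"].split(",")      # comma separated alleles
    record["INFO"] = parse_info(record["INFO"])
    record["samples"] = {}
    if len(fields) > 8:
        keys = fields[8].split(":")               # FORMAT, e.g. GT:GQ:DP
        for sample_id, column in zip(sample_ids, fields[9:]):
            record["samples"][sample_id] = dict(zip(keys, column.split(":")))
    return record


line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;DB\tGT:GQ:DP\t0|0:48:1\t1|0:48:8"
record = parse_vcf_line(line, ["S1", "S2"])
print(record["POS"], record["ALT"], record["samples"]["S2"])
```

Values inside the genotype columns are left as strings here; a fuller parser would convert them using the types declared in the meta-information lines.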

© The Author(s) 2011. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


BIOINFORMATICS APPLICATIONS NOTE, Vol. 27 no. 5 2011, pages 718–719, doi:10.1093/bioinformatics/btq671

Sequence analysis
Advance Access publication January 5, 2011

genome viewers to support huge data files and remote custom tracks over networks.
Availability and Implementation: http://samtools.sourceforge.net
Contact: hengli@broadinstitute.org

Received on October 14, 2010; revised and accepted on December 1, 2010

1 INTRODUCTION
When we examine local genomic features on the command line or via a genome viewer, we frequently need to perform interval queries, retrieving features overlapping specified regions. We may read through the entire data file if we only perform interval queries a few times, or preload the file in memory if interval queries are frequently performed (e.g. by genome viewers). However, reading the entire file makes both strategies inefficient given huge datasets. A solution to the efficiency problem is to build a database for the data file (Kent et al., 2002; Stein et al., 2002), but this is not optimal either, because generic database indexing algorithms are not specialized for biological data (Alekseyenko and Lee, 2007). The technical complexity of setting up databases and designing schemas also hampers the adoption of the database approach for an average end user.

Under the circumstances, a few specialized binary formats, including bigBed/bigWig (Kent et al., 2010) and BAM (Li et al., 2009), have been developed very recently to achieve efficient random access to huge datasets while supporting data compression and remote file access, which greatly helps routine data processing and data visualization. At present, these advanced indexing techniques are only applied to BED, Wiggle and BAM. Nonetheless, as most TAB-delimited biological data formats (e.g. PSL, GFF, SAM, VCF and many UCSC database dumps) contain chromosomal positions, one can imagine that a generic tool that indexes for all these formats is feasible. And this tool is Tabix. In this article, I will explain the technical advances in Tabix indexing, describe the algorithm and evaluate its performance on biological data.

sort. The sorted file should then be compressed with the bgzip program that comes with the Tabix package. Bgzip compresses the data file in the BGZF format, which is the concatenation of a series of gzip blocks, with each block holding at most $2^{16}$ bytes of uncompressed data. In the compressed file, each uncompressed byte in the text data file is assigned a unique 64-bit virtual file offset, where the higher 48 bits keep the real file offset of the start of the gzip block the byte falls in, and the lower 16 bits store the offset of the byte inside the gzip block. Given a virtual file offset, one can directly seek to the start of the gzip block using the higher 48 bits, decompress the block and retrieve the byte pointed to by the lower 16 bits of the virtual offset. Random access can thus be achieved without the help of additional index structures. As gzip works with concatenated gzip files, it can also seamlessly decompress a BGZF file. The detailed description of the BGZF format is given in the SAM specification.
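The 48-bit and 16-bit split of the virtual file offset amounts to a shift and a mask. The sketch below packs and unpacks such offsets; the function names are hypothetical and the example numbers are made up.

```python
BLOCK_BITS = 16                        # lower 16 bits: offset inside the gzip block
BLOCK_MASK = (1 << BLOCK_BITS) - 1


def make_virtual_offset(block_start, within_block):
    """Pack the file offset of a gzip block and a byte offset inside the
    uncompressed block into one 64-bit virtual offset."""
    if not 0 <= block_start < (1 << 48):
        raise ValueError("block start must fit in 48 bits")
    if not 0 <= within_block <= BLOCK_MASK:
        raise ValueError("offset inside the block must fit in 16 bits")
    return (block_start << BLOCK_BITS) | within_block


def split_virtual_offset(virtual_offset):
    """Return (block_start, within_block) for a virtual offset."""
    return virtual_offset >> BLOCK_BITS, virtual_offset & BLOCK_MASK


voffset = make_virtual_offset(100000, 1234)
print(voffset, split_virtual_offset(voffset))
```

A reader would seek to the first number of the pair in the compressed file, decompress that one block and then skip the second number of bytes in the decompressed data.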

2.2 Coupled binning and linear indices
Tabix builds two types of indices for a data file: a binning index and a linear index. We can actually achieve fast retrieval with only one of them. However, using the binning index alone may incur many unnecessary seek calls, while using the linear index alone has bad worst-case performance (when some records span very long distances). Using them together avoids their weakness.

2.2.1 The binning index  The basic idea of binning is to cluster records into large intervals, called bins. A record is assigned to bin k if k is the bin of the smallest size that fully contains the record. For each bin, we keep in the index file the virtual file offsets of all records assigned to the bin. When we search for records overlapping a query interval, we first collect bins overlapping the interval and then test each record in the collected bins for overlaps.

In principle, bins can be selected freely as long as each record can be assigned to a bin. In the Tabix binning index, we adopt a multilevel binning scheme where bins at the same level are non-overlapping and of the same size. In Tabix, each bin $k$, $0 \le k \le 37449$, represents a half-close-half-open interval $[(k - o_l)s_l, (k - o_l + 1)s_l)$, where $l = \lfloor \log_2(7k+1)/3 \rfloor$ is the level of the bin, $s_l = 2^{29-3l}$ is the size of the bin at level $l$ and $o_l = (2^{3l} - 1)/7$ is the offset at $l$. In this scheme, bin 0 spans 512 Mb, 1–8 span 64 Mb, 9–72 8 Mb, 73–584 1 Mb, 585–4680 128 kb and 4681–37449 span 16 kb intervals. The scheme is very similar to the UCSC binning (Kent et al., 2002), except that in UCSC $0 \le k \le 4681$ and therefore the smallest bin size is 128 kb.
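The bin formulas can be checked with a few lines of integer arithmetic. The sketch below is a hypothetical illustration, not the Tabix source: the level is found by comparing $k$ with the offsets $o_l$, which is equivalent to $\lfloor \log_2(7k+1)/3 \rfloor$ but avoids floating point, and `smallest_bin` applies the assignment rule from the binning index description (the smallest bin that fully contains the record) to a 0-based half-open interval.

```python
MAX_LEVEL = 5      # levels 0..5, from 512 Mb bins down to 16 kb bins
MAX_BIN = 37449    # upper bound on k as stated in the text


def bin_offset(level):
    """o_l = (2^(3l) - 1) / 7: number of the first bin at this level."""
    return ((1 << (3 * level)) - 1) // 7


def bin_size(level):
    """s_l = 2^(29 - 3l): bases covered by one bin at this level."""
    return 1 << (29 - 3 * level)


def bin_level(k):
    """Level of bin k, i.e. the largest l with o_l <= k."""
    if not 0 <= k <= MAX_BIN:
        raise ValueError("bin number out of range")
    level = 0
    while level < MAX_LEVEL and bin_offset(level + 1) <= k:
        level += 1
    return level


def bin_interval(k):
    """Half-open interval [(k - o_l) * s_l, (k - o_l + 1) * s_l) of bin k."""
    level = bin_level(k)
    start = (k - bin_offset(level)) * bin_size(level)
    return start, start + bin_size(level)


def smallest_bin(begin, end):
    """Smallest bin fully containing the half-open interval [begin, end)."""
    last = end - 1
    for level in range(MAX_LEVEL, 0, -1):
        shift = 29 - 3 * level
        if begin >> shift == last >> shift:
            return bin_offset(level) + (begin >> shift)
    return 0


k = smallest_bin(100000, 100100)
print(k, bin_level(k), bin_interval(k))
```

A record that straddles a 16 kb boundary falls back to a 128 kb bin or larger, which is why a query has to collect candidate bins at every level.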

© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com


Data formats and key tools

  • BAM alignment files

The Sequence Alignment/Map format and SAMtools

Heng Li1,†, Bob Handsaker2,†, Alec Wysoker2, Tim Fennell2, Jue Ruan3, Nils Homer4, Gabor Marth5, Goncalo Abecasis6, Richard Durbin1,* and 1000 Genome Project Data Processing Subgroup7

1Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK, 2Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA, 3Beijing Institute of Genomics, Chinese Academy of Science, Beijing 100029, China, 4Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, 5Department of Biology, Boston College, Chestnut Hill, MA 02467, 6Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA and 7http://1000genomes.org

Received on April 28, 2009; revised on May 28, 2009; accepted on May 30, 2009

Advance Access publication June 8, 2009

Associate Editor: Alfonso Valencia

ABSTRACT

Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer,

2 METHODS

2.1 The SAM format
2.1.1 Overview of the SAM format  The SAM format consists of one header section and one alignment section. The lines in the header section start with character '@', and lines in the alignment section do not. All lines are TAB delimited. An example is shown in Figure 1b.

In SAM, each alignment line has 11 mandatory fields and a variable number of optional fields. The mandatory fields are briefly described in Table 1. They must be present, but their value can be a '*' or a zero (depending

BIOINFORMATICS APPLICATIONS NOTE, Vol. 27 no. 15 2011, pages 2156–2158, doi:10.1093/bioinformatics/btr330

Sequence analysis
Advance Access publication June 7, 2011

  • VCF variant files

The variant call format and VCFtools

Petr Danecek1,†, Adam Auton2,†, Goncalo Abecasis3, Cornelis A. Albers1, Eric Banks4, Mark A. DePristo4, Robert E. Handsaker4, Gerton Lunter2, Gabor T. Marth5, Stephen T. Sherry6, Gilean McVean2,7, Richard Durbin1,* and 1000 Genomes Project Analysis Group‡

1Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, 2Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK, 3Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, 4Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02141, 5Department of Biology, Boston College, MA 02467, 6National Institutes of Health, National Center for Biotechnology Information, MD 20894, USA and 7Department of Statistics, University of Oxford, Oxford OX1 3TG, UK

Associate Editor: John Quackenbush

  • All indexed for fast retrieval

Tabix: fast retrieval of sequence features from generic TAB-delimited files

Heng Li
Program in Medical Population Genetics, The Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA

Associate Editor: Dmitrij Frishman

ABSTRACT
Summary: Tabix is the first generic tool that indexes position sorted files in TAB-delimited formats such as GFF, BED, PSL, SAM and SQL export, and quickly retrieves features overlapping specified regions. Tabix features include few seek function calls per query, data compression with gzip compatibility and direct FTP/HTTP access. Tabix is implemented as a free command-line tool as well as a library in C, Java, Perl and Python. It is particularly useful for manually examining local genomic features on the command line and enables

2 METHODS
Tabix indexing is a generalization of BAM indexing for generic TAB-delimited files. It inherits all the advantages of BAM indexing, including data compression and efficient random access in terms of few seek function calls per query.

2.1 Sorting and BGZF compression
Before being indexed, the data file needs to be sorted first by sequence name and then by leftmost coordinate, which can be done with the standard Unix

1000 Genomes is in the Amazon cloud

1KG pilot content (BAM) is available at s3://1000genomes.s3.amazonaws.com
You can see the XML at http://1000genomes.s3.amazonaws.com

     


http://browser.1000genomes.org

  • Gene variation zoom
  • Population
  • SIFT
  • PolyPhen

  

File upload to view with 1000 Genomes data

  • Supports popular file types
    – BAM, BED, bedGraph, BigWig, GBrowse, Generic, GFF, GTF, PSL, VCF, WIG

VCF must be indexed

Uploaded VCF example: comparison of August calls and technical/working/20110502_vqsr_phase1_wgs_snps/ALL.wgs.phase1.projectConsensus.snps.sites.vcf.gz

1000 Genomes Browser

  • For further information on the capabilities of the browser and its use, attend the Ensembl "New Users" Workshop on Saturday at 12:30

http://pilotbrowser.1000genomes.org

     


http://browser.1000genomes.org

Tools page

  

Ensembl Variant Effect Predictor (VEP)

  • Takes a list of variations and annotates them with respect to Ensembl features
  • Returns whether the SNP has been seen in the 1000 Genomes and whether it has an rs number (if one has been assigned)
  • Returns SIFT, PolyPhen and Condel scores
  • Extensive filtering options by MAF and populations
  • Web and command line versions

Data slicer for subsets of the data

http://trace.ncbi.nlm.nih.gov/Traces/1kg_slicer

Sliced BAM to files

   

Variation Pattern Finder

  • http://browser.1000genomes.org/Homo_sapiens/UserData/VariationsMapVCF
  • VCF input
  • Discovers patterns of Shared Inheritance
  • Variants with functional consequences considered
  • Web output with csv and excel downloads

  

  

Access to backend Ensembl databases

  • Public MySQL database at
    – mysql-db.1000genomes.org port 4272
  • Full programmatic access with Ensembl API
    – More information on the use of the Ensembl API at the Ensembl "Advanced Users" Workshop tomorrow

     


    

Credits & Contact

  • Eugene Kulesha, Iliana Toneva, Bren Vaughan
  • Will McLaren, Graham Ritchie, Fiona Cunningham
  • Laura Clarke, Holly Zheng-Bradley, Rick Smith
  • Steve Sherry, Chunlin Xiao

For more information contact info@1000genomes.org

Page 2: Accessing the 1000 Genomes Data - National Human Genome ...€¦ · Accessing the 1000 Genomes Data Paul Flicek European BioinformaMcs InsMtute

     

Dataaccess

bull GeneralinformaMon bull Fileaccess bull 1000GenomesBrowser bull Tools bull Wheretofindhelp

www1000genomesorg

www1000genomesorg

      

Introduction The main goal of the 1000 genomes project is to establish a comprehensive and detailed catalogue of human genome variations which in turn will empower association studies to identify disease-causing genes The project now has data and variant genotypes for more than 1000 individuals in 14 populations The ftp site contains more than 120Tbytes of data in 200000 files

DATA TYPE FILE FORMAT SIZE

sequence FASTQ 43 Tbases raw sequence

alignment BAM 56 Tbytes of BAM files

variants VCF 389M SNPs ~47M short indels

Discoverability

filter out high volume results like bam and fastq files

We also have various routes for users to discover new data

bull Website httpwww1000genomesorgannouncements bull Twitter 1000genomes bull RSS httpwww1000genomesorgannouncementsrssxml bull Email 1000announce1000genomesorg

Sequence alignment and variant data is made available as quickly as possible through the project ftp site (ftpftp1000genomesebiacukvol1ftp | ftpftp-tracencbinihgov1000genomes) With more than 200000 files though discovering new data can be difficult

The ftp site has a index updated nightly This index is searchable from our website httpwww1000genomeorgftpsearch

The search al lows users to specify which ftp site to get paths t o t o g e t m d 5 checksums and also

1000 Genomes Project Resources L Clarke H Zheng-Bradley R Smith E Kuleshea I Toneva B Vaughan P Flicek and 1000 Genomes Consortium European Bioinformatics Institute Wellcome Trust Genome Campus Cambridge CB10 1SA UK

Visualization httpbrowser1000genomesorg The 1000 Genomes project utilizes the Ensembl Browser to display our variant calls We provide rapid access to project variant calls through the browser before they become available via dbSNP and DGVa

Tracks of 1000 genomes variants by population can be viewed in the location page

A list of variants can be obtained for any given transcript In addition to basic information about a variant PolyPhen and SIFT annotation are displayed to indicate the clinic significance of the variant

Allele frequency for individual variants in different populations is displayed on the lsquoPopulation Geneticsrsquo page

Users can Attach remote files as custom tracks In example below the HG00120 track is 1000 Genomes bam file added to the browser

EMBL- EBI Wellcome Trust Genome Campus Laura Clarke HinxtonEBI Cambridge CB10 1SD UK

lauraebiacuk

Accessibility httpbrowser1000genomesorgtoolshtml The project provides several tools to help users access and interpret the data provided

Variant Effect Predictor The predictor takes a list of variant positions and alleles and predicts the effects of each of these on any overlapping features (transcripts regulatory features) annotated in Ensembl An example output is shown below

Data Slicer Many of the 1000 Genomes files are large and cumbersome to handle The Data Slicer allows users to get data for specific regions of the genome and to avoid having to download many gigabytes of data they donrsquot needl samples populations you choose Below is the Data Slicer input interface

Variation Pattern Finder bull The Variation Pattern Finder (VPF) allows one to look for patterns of

shared variation between individuals in the a VCF file bull Within a vcf file different samples have different combination of variation

genotypes The VPF looks for distinct variation combinations within a user specifed region shared by different individuals

bull The VPF only on variations that functional consequences for protein coding genes such as non-synonymous coding SNPs and splice site changes

Acknowledgements We would like to thank the Ensembl variation team for all their help particularly Will McLaren and Graham Ritchie Funding The Wellcome Trust

T +44 (0) 1223 494 444 F +44 (0) 1223 494 468 httpwwwebiacuk

     

Dataaccess

bull GeneralinformaMon bull Fileaccess bull 1000GenomesBrowser bull Tools bull Wheretofindhelp

gtpgtp1000genomesebiacuk gtpgtp‐tracencbinihgov1000genomesgtp

SequencesampalignmentsbysampleID

DatasetstoaccompanythepilotdatapublicaMon

Currentandarchivedatasetreleases

Pre‐releasedatasetsandprojectworkingmaterials

SitedocumentaMon

[1424 1472009 Bioinformatics-btp352tex] Page 2078 2078ndash2079

BIOINFORMATICS APPLICATIONS NOTE Vol 25 no 16 2009 pages 2078ndash2079doi101093bioinformaticsbtp352

Sequence analysis

and thus provides universal tools for processing read alignmentsAvailability httpsamtoolssourceforgenetContact rdsangeracuk

1 INTRODUCTIONWith the advent of novel sequencing technologies such asIlluminaSolexa ABSOLiD and Roche454 (Mardis 2008) avariety of new alignment tools (Langmead et al 2009 Li et al2008) have been designed to realize efficient read mapping againstlarge reference sequences including the human genome These toolsgenerate alignments in different formats however complicatingdownstream processing A common alignment format that supportsall sequence types and aligners creates a well-defined interfacebetween alignment and downstream analyses including variantdetection genotyping and assembly

The Sequence AlignmentMap (SAM) format is designed toachieve this goal It supports single- and paired-end reads andcombining reads of different types including color space reads fromABSOLiD It is designed to scale to alignment sets of 1011 or morebase pairs which is typical for the deep resequencing of one humanindividual

In this article we present an overview of the SAM format andbriefly introduce the companion SAMtools software package Adetailed format specification and the complete documentation ofSAMtools are available at the SAMtools web site

lowastTo whom correspondence should be addresseddaggerThe authors wish it to be known that in their opinion the first two authorsshould be regarded as Joint First Authors

on the field) if the corresponding information is unavailable The optionalfields are presented as key-value pairs in the format of TAGTYPEVALUEThey store extra information from the platform or aligner For example thelsquoRGrsquo tag keeps the lsquoread grouprsquo information for each read In combinationwith the lsquoRGrsquo header lines this tag allows each read to be labeled withmetadata about its origin sequencing center and library The SAM formatspecification gives a detailed description of each field and the predefinedTAGs

2.1.2 Extended CIGAR
The standard CIGAR description of pairwise alignment defines three operations: ‘M’ for match/mismatch, ‘I’ for insertion compared with the reference and ‘D’ for deletion. The extended CIGAR proposed in SAM added four more operations: ‘N’ for skipped bases on the reference, ‘S’ for soft clipping, ‘H’ for hard clipping and ‘P’ for padding. These support splicing, clipping, multi-part and padded alignments. Figure 1 shows examples of CIGAR strings for different types of alignments.
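As a rough illustration of the seven operations, the following Python sketch splits an extended CIGAR string into (length, operation) pairs and counts how many reference bases the alignment covers. The function names are hypothetical, and the rule that only ‘M’, ‘D’ and ‘N’ advance along the reference is taken from the SAM specification, not from the text above.

```python
import re

# One CIGAR element: a run length followed by one of the seven operations.
_CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP])")
_CIGAR_WHOLE = re.compile(r"(\d+[MIDNSHP])+")


def parse_cigar(cigar):
    """Return a list of (length, operation) pairs for an extended CIGAR."""
    if not _CIGAR_WHOLE.fullmatch(cigar):
        raise ValueError(f"not a valid extended CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in _CIGAR_ELEMENT.findall(cigar)]


def reference_span(ops):
    """Number of reference bases covered by the alignment.

    Assumption (from the SAM specification): M, D and N consume the
    reference; I, S, H and P do not.
    """
    return sum(n for n, op in ops if op in "MDN")


if __name__ == "__main__":
    ops = parse_cigar("3S6M1P1I4M")
    print(ops, reference_span(ops))
```

A spliced alignment such as `9M14N5M` spans 28 reference bases under the same rule, because the skipped region counts toward the reference span.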

2.1.3 Binary Alignment/Map format
To improve the performance, we designed a companion format, Binary Alignment/Map (BAM), which is the binary representation of SAM and keeps exactly the same information as SAM. BAM is compressed by the BGZF library, a generic library developed by us to achieve fast random access in a zlib-compatible compressed file. An example alignment of 112 Gbp of Illumina GA data requires 116 GB of disk space (1.0 byte per input base), including sequences, base qualities and all the meta information generated by MAQ. Most of this space is used to store the base qualities.

2.1.4 Sorting and indexing
A SAM/BAM file can be unsorted, but sorting by coordinate is used to streamline data processing and to avoid loading extra alignments into memory. A position-sorted BAM file can be indexed. We combine the UCSC binning scheme (Kent et al., 2002) and simple linear indexing to achieve fast random retrieval of alignments overlapping a

© 2009 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


ABSTRACT
Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API.
Availability: http://vcftools.sourceforge.net
Contact: rd@sanger.ac.uk

Received on October 28, 2010; revised on May 4, 2011; accepted on May 28, 2011

1 INTRODUCTION
One of the main uses of next-generation sequencing is to discover variation among large populations of related samples. Recently, a format for storing next-generation read alignments has been standardized by the SAM/BAM file format specification (Li et al., 2009). This has significantly improved the interoperability of next-generation tools for alignment, visualization and variant calling. We propose the variant call format (VCF) as a standardized format for storing the most prevalent types of sequence variation, including SNPs, indels and larger structural variants, together with rich annotations. The format was developed with the primary intention to represent human genetic variation, but its use is not restricted to diploid genomes and can be used in different contexts as well. Its flexibility and user extensibility allows representation of a wide variety of genomic variation with respect to a single reference sequence.

*To whom correspondence should be addressed.
†The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
‡http://www.1000genomes.org

Although generic feature format (GFF) has recently been extended to standardize storage of variant information in genome variant format (GVF) (Reese et al., 2010), this is not tailored for storing information across many samples. We have designed the VCF format to be scalable so as to encompass millions of sites with genotype data and annotations from thousands of samples. We have adopted a textual encoding, with complementary indexing, to allow easy generation of the files while maintaining fast data access. In this article, we present an overview of the VCF and briefly introduce the companion VCFtools software package. A detailed format specification and the complete documentation of VCFtools are available at the VCFtools web site.

2 METHODS

2.1 The VCF
2.1.1 Overview of the VCF
A VCF file (Fig. 1a) consists of a header section and a data section. The header contains an arbitrary number of meta-information lines, each starting with characters ‘##’, and a TAB delimited field definition line, starting with a single ‘#’ character. The meta-information header lines provide a standardized description of tags and annotations used in the data section. The use of meta-information allows the information stored within a VCF file to be tailored to the dataset in question. It can be also used to provide information about the means of file creation, date of creation, version of the reference sequence, software used and any other information relevant to the history of the file. The field definition line names eight mandatory columns, corresponding to data columns representing the chromosome (CHROM), a 1-based position of the start of the variant (POS), unique identifiers of the variant (ID), the reference allele (REF), a comma separated list of alternate non-reference alleles (ALT), a phred-scaled quality score (QUAL), site filtering information (FILTER) and a semicolon separated list of additional, user extensible annotation (INFO). In addition, if samples are present in the file, the mandatory header columns are followed by a FORMAT column and an arbitrary number of sample IDs that define the samples included in the VCF file. The FORMAT column is used to define the information contained within each subsequent genotype column, which consists of a colon separated list of fields. For example, the FORMAT field GT:GQ:DP in the fourth data entry of Figure 1a indicates that the subsequent entries contain information regarding the genotype, genotype quality and read depth for each sample. All data lines are TAB delimited and the number of fields in each data line must match the number of fields in the header line. It is strongly recommended that all annotation tags used are declared in the VCF header section.
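The column layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sample IDs and the example record are made up for the sketch, all values are kept as strings except POS, and a real parser (for example the VCFtools Perl API) handles many more cases.

```python
def parse_info(text):
    """Parse the semicolon separated INFO column into a dict.

    Entries without '=' are treated as flags and stored as True.
    """
    if text == ".":
        return {}
    info = {}
    for item in text.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def parse_vcf_line(line, sample_ids):
    """Parse one TAB delimited VCF data line into a dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, flt, info = fields[:8]
    record = {
        "CHROM": chrom,
        "POS": int(pos),          # 1-based start of the variant
        "ID": var_id,
        "REF": ref,
        "ALT": alt.split(","),    # comma separated alternate alleles
        "QUAL": qual,
        "FILTER": flt,
        "INFO": parse_info(info),
        "samples": {},
    }
    if len(fields) > 8:
        # FORMAT names the colon separated fields of each genotype column.
        keys = fields[8].split(":")
        for sample_id, column in zip(sample_ids, fields[9:]):
            record["samples"][sample_id] = dict(zip(keys, column.split(":")))
    return record


if __name__ == "__main__":
    # Hypothetical record and sample IDs, for illustration only.
    line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;DB\tGT:GQ:DP\t0|0:48:1\t1|0:48:8"
    rec = parse_vcf_line(line, ["SAMPLE_A", "SAMPLE_B"])
    print(rec["POS"], rec["INFO"], rec["samples"]["SAMPLE_B"])
```

In practice the sample IDs would be read from the field definition line rather than passed in by hand.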

© The Author(s) 2011. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


BIOINFORMATICS APPLICATIONS NOTE Vol. 27 no. 5 2011, pages 718–719 doi:10.1093/bioinformatics/btq671

Sequence analysis
Advance Access publication January 5, 2011

genome viewers to support huge data files and remote custom tracks over networks.
Availability and Implementation: http://samtools.sourceforge.net
Contact: hengli@broadinstitute.org

Received on October 14, 2010; revised and accepted on December 1, 2010

1 INTRODUCTION
When we examine local genomic features on the command line or via a genome viewer, we frequently need to perform interval queries: retrieving features overlapping specified regions. We may read through the entire data file if we only perform interval queries a few times, or preload the file in memory if interval queries are frequently performed (e.g. by genome viewers). However, reading the entire file makes both strategies inefficient given huge datasets. A solution to the efficiency problem is to build a database for the data file (Kent et al., 2002; Stein et al., 2002), but this is not optimal either, because generic database indexing algorithms are not specialized for biological data (Alekseyenko and Lee, 2007). The technical complexity of setting up databases and designing schemas also hampers the adoption of the database approach for an average end user.

Under the circumstances, a few specialized binary formats including bigBed/bigWig (Kent et al., 2010) and BAM (Li et al., 2009) have been developed very recently to achieve efficient random access to huge datasets while supporting data compression and remote file access, which greatly helps routine data processing and data visualization. At present, these advanced indexing techniques are only applied to BED, Wiggle and BAM. Nonetheless, as most TAB-delimited biological data formats (e.g. PSL, GFF, SAM, VCF and many UCSC database dumps) contain chromosomal positions, one can imagine that a generic tool that indexes for all these formats is feasible. And this tool is Tabix. In this article, I will explain the technical advances in Tabix indexing, describe the algorithm and evaluate its performance on biological data.

sort. The sorted file should then be compressed with the bgzip program that comes with the Tabix package. Bgzip compresses the data file in the BGZF format, which is the concatenation of a series of gzip blocks with each block holding at most $2^{16}$ bytes of uncompressed data. In the compressed file, each uncompressed byte in the text data file is assigned a unique 64-bit virtual file offset where the higher 48 bits keep the real file offset of the start of the gzip block the byte falls in, and the lower 16 bits store the offset of the byte inside the gzip block. Given a virtual file offset, one can directly seek to the start of the gzip block using the higher 48 bits, decompress the block and retrieve the byte pointed by the lower 16 bits of the virtual offset. Random access can thus be achieved without the help of additional index structures. As gzip works with concatenated gzip files, it can also seamlessly decompress a BGZF file. The detailed description of the BGZF format is described in the SAM specification.
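The 48-bit/16-bit split of a virtual file offset amounts to simple bit arithmetic. The sketch below shows only that packing and unpacking; the function names are hypothetical and no BGZF file is actually read.

```python
def make_virtual_offset(block_start, within_block):
    """Pack a gzip block start and an in-block offset into 64 bits.

    block_start:  real file offset of the start of the gzip block (48 bits)
    within_block: offset of the byte inside the uncompressed block (16 bits)
    """
    if not 0 <= block_start < (1 << 48):
        raise ValueError("block start must fit in 48 bits")
    if not 0 <= within_block < (1 << 16):
        raise ValueError("in-block offset must fit in 16 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset):
    """Return (block_start, within_block) for a 64-bit virtual offset."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


if __name__ == "__main__":
    v = make_virtual_offset(100000, 42)
    print(v, split_virtual_offset(v))
```

A reader would seek to the first value, decompress that one block, and then index into the decompressed data with the second value.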

2.2 Coupled binning and linear indices
Tabix builds two types of indices for a data file: a binning index and a linear index. We can actually achieve fast retrieval with only one of them. However, using the binning index alone may incur many unnecessary seek calls, while using the linear index alone has bad worst-case performance (when some records span very long distances). Using them together avoids their weakness.

2.2.1 The binning index
The basic idea of binning is to cluster records into large intervals, called bins. A record is assigned to bin k if k is the bin of the smallest size that fully contains the record. For each bin, we keep in the index file the virtual file offsets of all records assigned to the bin. When we search for records overlapping a query interval, we first collect bins overlapping the interval and then test each record in the collected bins for overlaps.

In principle, bins can be selected freely as long as each record can be assigned to a bin. In the Tabix binning index, we adopt a multilevel binning scheme where bins at same level are non-overlapping and of the same size. In Tabix, each bin $k$, $0 \le k \le 37449$, represents a half-close-half-open interval $[(k - o_l)s_l, (k - o_l + 1)s_l)$, where $l = \lfloor \log_2(7k+1)/3 \rfloor$ is the level of the bin, $s_l = 2^{29-3l}$ is the size of the bin at level $l$ and $o_l = (2^{3l} - 1)/7$ is the offset at $l$. In this scheme, bin 0 spans 512 Mb, 1–8 span 64 Mb, 9–72 8 Mb, 73–584 1 Mb, 585–4680 128 kb and 4681–37449 span 16 kb intervals. The scheme is very similar to the UCSC binning (Kent et al., 2002) except that in UCSC, $0 \le k \le 4681$ and therefore the smallest bin size is 128 kb.
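The level, size and offset formulas can be checked with a short Python sketch. The function names are hypothetical. The sketch assumes 0-based half-open coordinates below 2^29, and `smallest_bin` follows the region-to-bin calculation given in the SAM specification rather than anything stated in the text above.

```python
def bin_level(k):
    """Level l = floor(log2(7k + 1) / 3), computed with integers only."""
    return ((7 * k + 1).bit_length() - 1) // 3


def bin_interval(k):
    """Half-open interval [(k - o_l) * s_l, (k - o_l + 1) * s_l) of bin k."""
    level = bin_level(k)
    size = 1 << (29 - 3 * level)             # s_l = 2^(29 - 3l)
    offset = ((1 << (3 * level)) - 1) // 7   # o_l = (2^(3l) - 1) / 7
    return (k - offset) * size, (k - offset + 1) * size


def smallest_bin(beg, end):
    """Smallest bin that fully contains the record [beg, end).

    Assumption: 0-based, half-open coordinates with end <= 2^29.
    """
    last = end - 1
    for level in range(5, 0, -1):
        shift = 29 - 3 * level
        if beg >> shift == last >> shift:
            return ((1 << (3 * level)) - 1) // 7 + (beg >> shift)
    return 0


if __name__ == "__main__":
    for k in (0, 1, 9, 73, 585, 4681):
        print(k, bin_level(k), bin_interval(k))
    print(smallest_bin(16000, 17000))
```

A record that crosses a 16 kb boundary, such as [16000, 17000), falls back to the enclosing 128 kb bin, which is why long or boundary-crossing records end up in larger bins.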

© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com


Data formats and key tools

BAM alignment files

The Sequence Alignment/Map format and SAMtools

Heng Li1,†, Bob Handsaker2,†, Alec Wysoker2, Tim Fennell2, Jue Ruan3, Nils Homer4, Gabor Marth5, Goncalo Abecasis6, Richard Durbin1,* and 1000 Genome Project Data Processing Subgroup7

1Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SA, UK, 2Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA, 3Beijing Institute of Genomics, Chinese Academy of Science, Beijing 100029, China, 4Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, 5Department of Biology, Boston College, Chestnut Hill, MA 02467, 6Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA and 7http://1000genomes.org

Received on April 28, 2009; revised on May 28, 2009; accepted on May 30, 2009

Advance Access publication June 8, 2009

Associate Editor: Alfonso Valencia

ABSTRACT

Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer,

2 METHODS

2.1 The SAM format
2.1.1 Overview of the SAM format
The SAM format consists of one header section and one alignment section. The lines in the header section start with character ‘@’, and lines in the alignment section do not. All lines are TAB delimited. An example is shown in Figure 1b.
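The header/alignment split is easy to express in code. The Python sketch below assumes that header lines begin with ‘@’, as in the SAM specification; the function name and the example lines are made up for illustration and do not come from the project data.

```python
def split_sam(lines):
    """Separate SAM text lines into header and alignment records.

    Each returned record is the list of TAB delimited fields of one line.
    """
    header, alignments = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if line.startswith("@"):
            header.append(fields)
        else:
            alignments.append(fields)
    return header, alignments


if __name__ == "__main__":
    # Hypothetical two-line header and one alignment line.
    example = [
        "@HD\tVN:1.0\n",
        "@SQ\tSN:ref\tLN:45\n",
        "r001\t163\tref\t7\t30\t8M\t=\t37\t39\tTTAGATAA\t*\n",
    ]
    header, alignments = split_sam(example)
    print(len(header), len(alignments), len(alignments[0]))
```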

In SAM, each alignment line has 11 mandatory fields and a variable number of optional fields. The mandatory fields are briefly described in Table 1. They must be present but their value can be a ‘*’ or a zero (depending

BIOINFORMATICS APPLICATIONS NOTE Vol. 27 no. 15 2011, pages 2156–2158 doi:10.1093/bioinformatics/btr330

Sequence analysis
Advance Access publication June 7, 2011

VCF variant files

The variant call format and VCFtools

Petr Danecek1,†, Adam Auton2,†, Goncalo Abecasis3, Cornelis A. Albers1, Eric Banks4, Mark A. DePristo4, Robert E. Handsaker4, Gerton Lunter2, Gabor T. Marth5, Stephen T. Sherry6, Gilean McVean2,7, Richard Durbin1,* and 1000 Genomes Project Analysis Group‡

1Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SA, 2Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK, 3Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, 4Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02141, 5Department of Biology, Boston College, MA 02467, 6National Institutes of Health, National Center for Biotechnology Information, MD 20894, USA and 7Department of Statistics, University of Oxford, Oxford OX1 3TG, UK

Associate Editor: John Quackenbush

Tabix: fast retrieval of sequence features from generic TAB-delimited files

Heng Li
Program in Medical Population Genetics, The Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA

Associate Editor: Dmitrij Frishman

All indexed for fast retrieval

ABSTRACT
Summary: Tabix is the first generic tool that indexes position sorted files in TAB-delimited formats such as GFF, BED, PSL, SAM and SQL export, and quickly retrieves features overlapping specified regions. Tabix features include few seek function calls per query, data compression with gzip compatibility and direct FTP/HTTP access. Tabix is implemented as a free command-line tool as well as a library in C, Java, Perl and Python. It is particularly useful for manually examining local genomic features on the command line and enables

2 METHODS
Tabix indexing is a generalization of BAM indexing for generic TAB-delimited files. It inherits all the advantages of BAM indexing, including data compression and efficient random access in terms of few seek function calls per query.

2.1 Sorting and BGZF compression
Before being indexed, the data file needs to be sorted first by sequence name and then by leftmost coordinate, which can be done with the standard Unix

1000 Genomes is in the Amazon cloud

1KG pilot content (BAM) is available at s3://1000genomes.s3.amazonaws.com
You can see the XML at http://1000genomes.s3.amazonaws.com



http://browser.1000genomes.org

• Gene variation zoom
• Population
• SIFT
• PolyPhen

File upload to view with 1000 Genomes data

• Supports popular file types
  – BAM, BED, bedGraph, BigWig, GBrowse, Generic, GFF, GTF, PSL, VCF, WIG

VCF must be indexed

Uploaded VCF
Example: Comparison of August calls and technical/working/20110502_vqsr_phase1_wgs_snps/ALL.wgs.phase1.projectConsensus.snps.sites.vcf.gz

1000 Genomes Browser

• For further information on the capabilities of the browser and its use, attend the Ensembl “New Users” Workshop on Saturday at 12:30

http://pilotbrowser.1000genomes.org



http://browser.1000genomes.org

Tools page

Ensembl Variant Effect Predictor (VEP)

• Takes a list of variation and annotates with respect to Ensembl features
• Returns whether the SNP has been seen in the 1000 Genomes and if it has an rs number (if one has been assigned)
• Returns SIFT, PolyPhen and Condel scores
• Extensive filtering options by MAF and populations
• Web and command line versions

Data slicer for subsets of the data

http://trace.ncbi.nlm.nih.gov/Traces/1kg_slicer

Sliced BAM to files

Variation Pattern Finder

• http://browser.1000genomes.org/Homo_sapiens/UserData/VariationsMapVCF
• VCF input
• Discovers patterns of Shared Inheritance
• Variants with functional consequences considered
• Web output with csv and excel downloads

Access to backend Ensembl databases

• Public MySQL database at
  – mysql-db.1000genomes.org port 4272
• Full programmatic access with Ensembl API
  – More information on the use of the Ensembl API at the Ensembl “Advanced Users” Workshop tomorrow



Credits & Contact
• Eugene Kulesha, Iliana Toneva, Bren Vaughan
• Will McLaren, Graham Ritchie, Fiona Cunningham
• Laura Clarke, Holly Zheng-Bradley, Rick Smith
• Steve Sherry, Chunlin Xiao

For more information contact info@1000genomes.org



Visualization
http://browser.1000genomes.org
The 1000 Genomes Project uses the Ensembl Browser to display our variant calls. We provide rapid access to project variant calls through the browser before they become available via dbSNP and DGVa.

Tracks of 1000 Genomes variants by population can be viewed on the location page.

A list of variants can be obtained for any given transcript. In addition to basic information about a variant, PolyPhen and SIFT annotations are displayed to indicate the clinical significance of the variant.

Allele frequency for individual variants in different populations is displayed on the ‘Population Genetics’ page.

Users can attach remote files as custom tracks. In the example below, the HG00120 track is a 1000 Genomes BAM file added to the browser.

Laura Clarke
EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
laura@ebi.ac.uk

Accessibility
http://browser.1000genomes.org/tools.html
The project provides several tools to help users access and interpret the data provided.

Variant Effect Predictor
The predictor takes a list of variant positions and alleles, and predicts the effects of each of these on any overlapping features (transcripts, regulatory features) annotated in Ensembl. An example output is shown below.

Data Slicer
Many of the 1000 Genomes files are large and cumbersome to handle. The Data Slicer allows users to get data for specific regions of the genome and to avoid having to download many gigabytes of data they don’t need l samples populations you choose. Below is the Data Slicer input interface.

Variation Pattern Finder
• The Variation Pattern Finder (VPF) allows one to look for patterns of shared variation between individuals in a VCF file.
• Within a VCF file, different samples have different combinations of variation genotypes. The VPF looks for distinct variation combinations within a user-specified region shared by different individuals.
• The VPF only focuses on variations that have functional consequences for protein coding genes, such as non-synonymous coding SNPs and splice site changes.

Acknowledgements
We would like to thank the Ensembl variation team for all their help, particularly Will McLaren and Graham Ritchie.
Funding: The Wellcome Trust

T +44 (0) 1223 494 444
F +44 (0) 1223 494 468
http://www.ebi.ac.uk



ftp://ftp.1000genomes.ebi.ac.uk
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp

• Sequences & alignments by sample ID
• Datasets to accompany the pilot data publication
• Current and archive dataset releases
• Pre-release datasets and project working materials
• Site documentation

[1424 1472009 Bioinformatics-btp352tex] Page 2078 2078ndash2079

BIOINFORMATICS APPLICATIONS NOTE Vol 25 no 16 2009 pages 2078ndash2079doi101093bioinformaticsbtp352

Sequence analysis

and thus provides universal tools for processing read alignmentsAvailability httpsamtoolssourceforgenetContact rdsangeracuk

1 INTRODUCTIONWith the advent of novel sequencing technologies such asIlluminaSolexa ABSOLiD and Roche454 (Mardis 2008) avariety of new alignment tools (Langmead et al 2009 Li et al2008) have been designed to realize efficient read mapping againstlarge reference sequences including the human genome These toolsgenerate alignments in different formats however complicatingdownstream processing A common alignment format that supportsall sequence types and aligners creates a well-defined interfacebetween alignment and downstream analyses including variantdetection genotyping and assembly

The Sequence AlignmentMap (SAM) format is designed toachieve this goal It supports single- and paired-end reads andcombining reads of different types including color space reads fromABSOLiD It is designed to scale to alignment sets of 1011 or morebase pairs which is typical for the deep resequencing of one humanindividual

In this article we present an overview of the SAM format andbriefly introduce the companion SAMtools software package Adetailed format specification and the complete documentation ofSAMtools are available at the SAMtools web site

lowastTo whom correspondence should be addresseddaggerThe authors wish it to be known that in their opinion the first two authorsshould be regarded as Joint First Authors

on the field) if the corresponding information is unavailable The optionalfields are presented as key-value pairs in the format of TAGTYPEVALUEThey store extra information from the platform or aligner For example thelsquoRGrsquo tag keeps the lsquoread grouprsquo information for each read In combinationwith the lsquoRGrsquo header lines this tag allows each read to be labeled withmetadata about its origin sequencing center and library The SAM formatspecification gives a detailed description of each field and the predefinedTAGs

212 Extended CIGAR The standard CIGAR description of pairwisealignment defines three operations lsquoMrsquo for matchmismatch lsquoIrsquo for insertioncompared with the reference and lsquoDrsquo for deletion The extended CIGARproposed in SAM added four more operations lsquoNrsquo for skipped bases onthe reference lsquoSrsquo for soft clipping lsquoHrsquo for hard clipping and lsquoPrsquo for paddingThese support splicing clipping multi-part and padded alignments Figure 1shows examples of CIGAR strings for different types of alignments

213 Binary AlignmentMap format To improve the performance wedesigned a companion format Binary AlignmentMap (BAM) which is thebinary representation of SAM and keeps exactly the same information asSAM BAM is compressed by the BGZF library a generic library developedby us to achieve fast random access in a zlib-compatible compressed fileAn example alignment of 112 Gbp of Illumina GA data requires 116 GB ofdisk space (10 byte per input base) including sequences base qualities andall the meta information generated by MAQ Most of this space is used tostore the base qualities

214 Sorting and indexing A SAMBAM file can be unsorted but sortingby coordinate is used to streamline data processing and to avoid loadingextra alignments into memory A position-sorted BAM file can be indexedWe combine the UCSC binning scheme (Kent et al 2002) and simplelinear indexing to achieve fast random retrieval of alignments overlapping a

copy 2009 The Author(s)This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (httpcreativecommonsorglicensesby-nc20uk) which permits unrestricted non-commercial use distribution and reproduction in any medium provided the original work is properly cited

at Genom

e Research Ltd on O

ctober 13 2011bioinform

aticsoxfordjournalsorgD

ownloaded from

[1226 572011 Bioinformatics-btr330tex] Page 2156 2156ndash2158

ABSTRACTSummary The variant call format (VCF) is a generic format forstoring DNA polymorphism data such as SNPs insertions deletionsand structural variants together with rich annotations VCF is usuallystored in a compressed manner and can be indexed for fast dataretrieval of variants from a range of positions on the referencegenome The format was developed for the 1000 Genomes Projectand has also been adopted by other projects such as UK10KdbSNP and the NHLBI Exome Project VCFtools is a software suitethat implements various utilities for processing VCF files includingvalidation merging comparing and also provides a general Perl APIAvailability httpvcftoolssourceforgenetContact rdsangeracuk

Received on October 28 2010 revised on May 4 2011 acceptedon May 28 2011

1 INTRODUCTIONOne of the main uses of next-generation sequencing is to discovervariation among large populations of related samples Recentlya format for storing next-generation read alignments has beenstandardized by the SAMBAM file format specification (Li et al2009) This has significantly improved the interoperability of next-generation tools for alignment visualization and variant calling Wepropose the variant call format (VCF) as a standardized format forstoring the most prevalent types of sequence variation includingSNPs indels and larger structural variants together with richannotations The format was developed with the primary intentionto represent human genetic variation but its use is not restrictedto diploid genomes and can be used in different contexts as wellIts flexibility and user extensibility allows representation of a widevariety of genomic variation with respect to a single referencesequence

lowastTo whom correspondence should be addresseddaggerThe authors wish it to be known that in their opinion the first two authorsshould be regarded as joint First AuthorsDaggerhttpwww1000genomesorg

Although generic feature format (GFF) has recently been extendedto standardize storage of variant information in genome variantformat (GVF) (Reese et al 2010) this is not tailored for storinginformation across many samples We have designed the VCFformat to be scalable so as to encompass millions of sites withgenotype data and annotations from thousands of samples We haveadopted a textual encoding with complementary indexing to alloweasy generation of the files while maintaining fast data accessIn this article we present an overview of the VCF and brieflyintroduce the companion VCFtools software package A detailedformat specification and the complete documentation of VCFtoolsare available at the VCFtools web site

2 METHODS

21 The VCF211 Overview of the VCF A VCF file (Fig 1a) consists of a headersection and a data section The header contains an arbitrary number of meta-information lines each starting with characters lsquorsquo and a TAB delimitedfield definition line starting with a single lsquorsquocharacter The meta-informationheader lines provide a standardized description of tags and annotations usedin the data section The use of meta-information allows the informationstored within a VCF file to be tailored to the dataset in question It canbe also used to provide information about the means of file creation dateof creation version of the reference sequence software used and any otherinformation relevant to the history of the file The field definition line nameseight mandatory columns corresponding to data columns representing thechromosome (CHROM) a 1-based position of the start of the variant (POS)unique identifiers of the variant (ID) the reference allele (REF) a commaseparated list of alternate non-reference alleles (ALT) a phred-scaled qualityscore (QUAL) site filtering information (FILTER) and a semicolon separatedlist of additional user extensible annotation (INFO) In addition if samplesare present in the file the mandatory header columns are followed by aFORMAT column and an arbitrary number of sample IDs that define thesamples included in the VCF file The FORMAT column is used to definethe information contained within each subsequent genotype column whichconsists of a colon separated list of fields For example the FORMAT fieldGTGQDP in the fourth data entry of Figure 1a indicates that the subsequententries contain information regarding the genotype genotype quality andread depth for each sample All data lines are TAB delimited and the numberof fields in each data line must match the number of fields in the header lineIt is strongly recommended that all annotation tags used are declared in theVCF header section

copy The Author(s) 2011 Published by Oxford University PressThis is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (httpcreativecommonsorglicensesby-nc25) which permits unrestricted non-commercial use distribution and reproduction in any medium provided the original work is properly cited

at Genom

e Research Ltd on Septem

ber 19 2011bioinform

aticsoxfordjournalsorgD

ownloaded from

[1450 222011 Bioinformatics-btq671tex] Page 718 718ndash719

BIOINFORMATICS APPLICATIONS NOTE, Vol. 27 no. 5 2011, pages 718–719, doi:10.1093/bioinformatics/btq671

Sequence analysis. Advance Access publication January 5, 2011

genome viewers to support huge data files and remote custom tracks over networks.
Availability and Implementation: http://samtools.sourceforge.net
Contact: hengli@broadinstitute.org

Received on October 14, 2010; revised and accepted on December 1, 2010

1 INTRODUCTION

When we examine local genomic features on the command line or via a genome viewer, we frequently need to perform interval queries, retrieving features overlapping specified regions. We may read through the entire data file if we only perform interval queries a few times, or preload the file in memory if interval queries are frequently performed (e.g. by genome viewers). However, reading the entire file makes both strategies inefficient given huge datasets. A solution to the efficiency problem is to build a database for the data file (Kent et al., 2002; Stein et al., 2002), but this is not optimal either, because generic database indexing algorithms are not specialized for biological data (Alekseyenko and Lee, 2007). The technical complexity of setting up databases and designing schemas also hampers the adoption of the database approach for an average end user.

Under the circumstances, a few specialized binary formats including bigBed/bigWig (Kent et al., 2010) and BAM (Li et al., 2009) have been developed very recently to achieve efficient random access to huge datasets while supporting data compression and remote file access, which greatly helps routine data processing and data visualization. At present, these advanced indexing techniques are only applied to BED, Wiggle and BAM. Nonetheless, as most TAB-delimited biological data formats (e.g. PSL, GFF, SAM, VCF and many UCSC database dumps) contain chromosomal positions, one can imagine that a generic tool that indexes for all these formats is feasible. And this tool is Tabix. In this article, I will explain the technical advances in Tabix indexing, describe the algorithm and evaluate its performance on biological data.

sort. The sorted file should then be compressed with the bgzip program that comes with the Tabix package. Bgzip compresses the data file in the BGZF format, which is the concatenation of a series of gzip blocks with each block holding at most 2^16 bytes of uncompressed data. In the compressed file, each uncompressed byte in the text data file is assigned a unique 64-bit virtual file offset, where the higher 48 bits keep the real file offset of the start of the gzip block the byte falls in, and the lower 16 bits store the offset of the byte inside the gzip block. Given a virtual file offset, one can directly seek to the start of the gzip block using the higher 48 bits, decompress the block and retrieve the byte pointed by the lower 16 bits of the virtual offset. Random access can thus be achieved without the help of additional index structures. As gzip works with concatenated gzip files, it can also seamlessly decompress a BGZF file. The detailed description of the BGZF format is given in the SAM specification.
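The 48-bit/16-bit split of the virtual file offset can be written down in a few lines. The following Python sketch only illustrates the arithmetic described above; the function names are invented here and are not taken from the Tabix or SAMtools libraries.

```python
# Sketch of the 64-bit BGZF virtual file offset described in the text:
# higher 48 bits = file offset of the start of the gzip block,
# lower 16 bits  = offset of the byte inside the uncompressed block.

BLOCK_BITS = 16


def make_virtual_offset(block_start, within_block):
    """Pack a block start and an in-block offset into one integer."""
    if not 0 <= block_start < (1 << 48):
        raise ValueError("block start must fit in 48 bits")
    if not 0 <= within_block < (1 << BLOCK_BITS):
        raise ValueError("in-block offset must fit in 16 bits")
    return (block_start << BLOCK_BITS) | within_block


def split_virtual_offset(virtual_offset):
    """Recover (block_start, within_block) from a virtual offset."""
    block_start = virtual_offset >> BLOCK_BITS
    within_block = virtual_offset & ((1 << BLOCK_BITS) - 1)
    return block_start, within_block
```

A reader would seek to `block_start` in the compressed file, decompress that one block, and then skip `within_block` bytes of the decompressed data.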

2.2 Coupled binning and linear indices

Tabix builds two types of indices for a data file: a binning index and a linear index. We can actually achieve fast retrieval with only one of them. However, using the binning index alone may incur many unnecessary seek calls, while using the linear index alone has bad worst-case performance (when some records span very long distances). Using them together avoids their weaknesses.

2.2.1 The binning index. The basic idea of binning is to cluster records into large intervals, called bins. A record is assigned to bin k if k is the bin of the smallest size that fully contains the record. For each bin, we keep in the index file the virtual file offsets of all records assigned to the bin. When we search for records overlapping a query interval, we first collect bins overlapping the interval and then test each record in the collected bins for overlaps.

In principle, bins can be selected freely as long as each record can be assigned to a bin. In the Tabix binning index, we adopt a multilevel binning scheme where bins at the same level are non-overlapping and of the same size. In Tabix, each bin $k$, $0 \le k \le 37449$, represents a half-closed, half-open interval $[(k - o_l)s_l, (k - o_l + 1)s_l)$, where $l = \lfloor \log_2(7k+1)/3 \rfloor$ is the level of the bin, $s_l = 2^{29-3l}$ is the size of the bin at level $l$ and $o_l = (2^{3l} - 1)/7$ is the offset at $l$. In this scheme, bin 0 spans 512 Mb, 1–8 span 64 Mb, 9–72 8 Mb, 73–584 1 Mb, 585–4680 128 kb and 4681–37449 span 16 kb intervals. The scheme is very similar to the UCSC binning (Kent et al., 2002), except that in UCSC $0 \le k \le 4681$ and therefore the smallest bin size is 128 kb.
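The bin formulas can be checked with a short Python sketch. It is written directly from the expressions for the level, size and offset given above and is not the Tabix source code; the function names are invented for illustration. One assumption is worth flagging: applying the offset formula, the 8^5 level-5 bins are numbered 4681 to 37448, so the sketch rejects larger bin numbers.

```python
# Sketch of the multilevel binning arithmetic, using integer maths only.

MAX_LEVEL = 5


def level_offset(level):
    """Number of the first bin at a level: (2^(3l) - 1) / 7."""
    return (2 ** (3 * level) - 1) // 7


def level_size(level):
    """Length in bases of every bin at a level: 2^(29 - 3l)."""
    return 2 ** (29 - 3 * level)


def bin_level(k):
    """floor(log2(7k + 1) / 3), computed without floating point."""
    return ((7 * k + 1).bit_length() - 1) // 3


def bin_interval(k):
    """Half-closed, half-open interval [start, end) covered by bin k."""
    if not 0 <= k < level_offset(MAX_LEVEL + 1):
        raise ValueError("bin number out of range")
    level = bin_level(k)
    index_in_level = k - level_offset(level)
    size = level_size(level)
    return index_in_level * size, (index_in_level + 1) * size


def smallest_containing_bin(beg, end):
    """Smallest bin that fully contains the 0-based record [beg, end)."""
    if not 0 <= beg < end <= level_size(0):
        raise ValueError("record must lie within [0, 2^29)")
    for level in range(MAX_LEVEL, -1, -1):
        shift = 29 - 3 * level
        if beg >> shift == (end - 1) >> shift:
            return level_offset(level) + (beg >> shift)
    raise AssertionError("unreachable: level 0 contains every record")
```

For example, a record that crosses the boundary between the first two 16 kb bins cannot be placed at level 5 and falls back to the first 128 kb bin, number 585.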

© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com


Data formats and key tools

BAM alignment files

The Sequence Alignment/Map format and SAMtools

Heng Li1,†, Bob Handsaker2,†, Alec Wysoker2, Tim Fennell2, Jue Ruan3, Nils Homer4, Gabor Marth5, Goncalo Abecasis6, Richard Durbin1,∗ and 1000 Genome Project Data Processing Subgroup7

1Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK, 2Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA, 3Beijing Institute of Genomics, Chinese Academy of Science, Beijing 100029, China, 4Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, 5Department of Biology, Boston College, Chestnut Hill, MA 02467, 6Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA and 7http://1000genomes.org

Received on April 28, 2009; revised on May 28, 2009; accepted on May 30, 2009

Advance Access publication June 8, 2009

Associate Editor: Alfonso Valencia

ABSTRACT

Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer,

2 METHODS

2.1 The SAM format

2.1.1 Overview of the SAM format. The SAM format consists of one header section and one alignment section. The lines in the header section start with character '@', and lines in the alignment section do not. All lines are TAB delimited. An example is shown in Figure 1b.

In SAM, each alignment line has 11 mandatory fields and a variable number of optional fields. The mandatory fields are briefly described in Table 1. They must be present but their value can be a '*' or a zero (depending

BIOINFORMATICS APPLICATIONS NOTE, Vol. 27 no. 15 2011, pages 2156–2158, doi:10.1093/bioinformatics/btr330

Sequence analysis. Advance Access publication June 7, 2011

VCF variant files

The variant call format and VCFtools

Petr Danecek1,†, Adam Auton2,†, Goncalo Abecasis3, Cornelis A. Albers1, Eric Banks4, Mark A. DePristo4, Robert E. Handsaker4, Gerton Lunter2, Gabor T. Marth5, Stephen T. Sherry6, Gilean McVean2,7, Richard Durbin1,∗ and 1000 Genomes Project Analysis Group‡

1Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SA, 2Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK, 3Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, 4Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02141, 5Department of Biology, Boston College, MA 02467, 6National Institutes of Health, National Center for Biotechnology Information, MD 20894, USA and 7Department of Statistics, University of Oxford, Oxford OX1 3TG, UK

Associate Editor: John Quackenbush

Tabix: fast retrieval of sequence features from generic TAB-delimited files

Heng Li, Program in Medical Population Genetics, The Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA

Associate Editor: Dmitrij Frishman

ABSTRACT

Summary: Tabix is the first generic tool that indexes position sorted files in TAB-delimited formats such as GFF, BED, PSL, SAM and SQL export, and quickly retrieves features overlapping specified regions. Tabix features include few seek function calls per query, data compression with gzip compatibility and direct FTP/HTTP access. Tabix is implemented as a free command-line tool as well as a library in C, Java, Perl and Python. It is particularly useful for manually examining local genomic features on the command line and enables

2 METHODS

Tabix indexing is a generalization of BAM indexing for generic TAB-delimited files. It inherits all the advantages of BAM indexing, including data compression and efficient random access in terms of few seek function calls per query.

2.1 Sorting and BGZF compression. Before being indexed, the data file needs to be sorted first by sequence name and then by leftmost coordinate, which can be done with the standard Unix

All indexed for fast retrieval

1000 Genomes is in the Amazon cloud

1KG pilot content (BAM) is available at s3://1000genomes.s3.amazonaws.com
You can see the XML at http://1000genomes.s3.amazonaws.com

     

Data access

• General information
• File access
• 1000 Genomes Browser
• Tools
• Where to find help

http://browser.1000genomes.org

• Gene variation zoom
• Population
• SIFT
• PolyPhen

File upload to view with 1000 Genomes data

• Supports popular file types
  – BAM, BED, bedGraph, BigWig, GBrowse, Generic, GFF, GTF, PSL, VCF, WIG

VCF must be indexed

Uploaded VCF. Example: Comparison of August calls and technical/working/20110502_vqsr_phase1_wgs_snps/ALL.wgs.phase1.projectConsensus.snps.sites.vcf.gz

1000 Genomes Browser

• For further information on the capabilities of the browser and its use, attend the Ensembl "New Users" Workshop on Saturday at 12:30

http://pilotbrowser.1000genomes.org

     


http://browser.1000genomes.org

Tools page

Ensembl Variant Effect Predictor (VEP)

• Takes a list of variations and annotates them with respect to Ensembl features
• Returns whether the SNP has been seen in the 1000 Genomes data and whether it has an rs number (if one has been assigned)
• Returns SIFT, PolyPhen and Condel scores
• Extensive filtering options by MAF and populations
• Web and command line versions

Data slicer for subsets of the data

http://trace.ncbi.nlm.nih.gov/Traces/1kg_slicer

Sliced BAM to files

Variation Pattern Finder

• http://browser.1000genomes.org/Homo_sapiens/UserData/VariationsMapVCF
• VCF input
• Discovers patterns of shared inheritance
• Variants with functional consequences considered
• Web output with CSV and Excel downloads

Access to backend Ensembl databases

• Public MySQL database at
  – mysql-db.1000genomes.org, port 4272
• Full programmatic access with Ensembl API
  – More information on the use of the Ensembl API at the Ensembl "Advanced Users" Workshop tomorrow

     


    

Credits & Contact

• Eugene Kulesha, Iliana Toneva, Bren Vaughan
• Will McLaren, Graham Ritchie, Fiona Cunningham
• Laura Clarke, Holly Zheng-Bradley, Rick Smith
• Steve Sherry, Chunlin Xiao

For more information contact info@1000genomes.org


      


Visualization: http://browser.1000genomes.org

The 1000 Genomes project utilizes the Ensembl Browser to display our variant calls. We provide rapid access to project variant calls through the browser before they become available via dbSNP and DGVa.

Tracks of 1000 Genomes variants by population can be viewed in the location page.

A list of variants can be obtained for any given transcript. In addition to basic information about a variant, PolyPhen and SIFT annotations are displayed to indicate the clinical significance of the variant.

Allele frequency for individual variants in different populations is displayed on the 'Population Genetics' page.

Users can attach remote files as custom tracks. In the example below, the HG00120 track is a 1000 Genomes BAM file added to the browser.

EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Laura Clarke, EBI: laura@ebi.ac.uk

Accessibility: http://browser.1000genomes.org/tools.html

The project provides several tools to help users access and interpret the data provided.

Variant Effect Predictor: The predictor takes a list of variant positions and alleles, and predicts the effects of each of these on any overlapping features (transcripts, regulatory features) annotated in Ensembl. An example output is shown below.

Data Slicer: Many of the 1000 Genomes files are large and cumbersome to handle. The Data Slicer allows users to get data for specific regions of the genome and to avoid having to download many gigabytes of data they don't need ... samples or populations you choose. Below is the Data Slicer input interface.
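The idea of region slicing can be sketched in Python. This is not how the Data Slicer is implemented, which the poster does not describe; it is a plain linear scan over VCF text lines, and the function name `slice_vcf_lines` and the example lines are invented for illustration. A real service would presumably use an index to avoid reading the whole file.

```python
# Sketch: keep header lines plus the records that fall in one region.

def slice_vcf_lines(lines, chrom, start, end):
    """Yield header lines and records with POS in [start, end] (1-based)."""
    for line in lines:
        if line.startswith("#"):
            yield line          # meta-information and field definition lines
            continue
        fields = line.split("\t", 2)
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            yield line


example_lines = [
    "##fileformat=VCFv4.0\n",
    "#CHROM\tPOS\tID\n",
    "20\t100\t.\n",
    "20\t250\t.\n",
    "21\t150\t.\n",
]
sliced = list(slice_vcf_lines(example_lines, "20", 50, 200))
```

Because the function is a generator, it can be fed an open file handle and will never hold more than one line in memory.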

Variation Pattern Finder

• The Variation Pattern Finder (VPF) allows one to look for patterns of shared variation between individuals in a VCF file.
• Within a VCF file, different samples have different combinations of variation genotypes. The VPF looks for distinct variation combinations within a user-specified region shared by different individuals.
• The VPF only considers variations that have functional consequences for protein coding genes, such as non-synonymous coding SNPs and splice site changes.
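One way to picture "distinct variation combinations shared by different individuals" is to group samples by their tuple of genotypes across the variants in a region. The Python sketch below does only that; it is an illustration under that assumption, not the VPF algorithm, and the sample and variant identifiers are invented.

```python
# Sketch: group samples that carry the same combination of genotypes.
from collections import defaultdict


def shared_patterns(genotypes):
    """genotypes maps sample ID -> {variant ID: genotype string}.

    Returns (variants, groups) where groups maps each distinct genotype
    combination (a tuple ordered like `variants`) to the samples sharing it.
    """
    variants = sorted({v for calls in genotypes.values() for v in calls})
    groups = defaultdict(list)
    for sample in sorted(genotypes):
        # Missing calls are represented as "./." (an assumption of this sketch).
        pattern = tuple(genotypes[sample].get(v, "./.") for v in variants)
        groups[pattern].append(sample)
    return variants, dict(groups)


calls = {
    "sampleA": {"var1": "0|1", "var2": "1|1"},
    "sampleB": {"var1": "0|1", "var2": "1|1"},
    "sampleC": {"var1": "0|0", "var2": "1|1"},
}
variants, groups = shared_patterns(calls)
```

In this toy input, sampleA and sampleB share one pattern and sampleC has its own.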

Acknowledgements: We would like to thank the Ensembl variation team for all their help, particularly Will McLaren and Graham Ritchie.
Funding: The Wellcome Trust

T +44 (0) 1223 494 444, F +44 (0) 1223 494 468, http://www.ebi.ac.uk

     


ftp://ftp.1000genomes.ebi.ac.uk
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp

• Sequences & alignments by sample ID
• Data sets to accompany the pilot data publication
• Current and archive dataset releases
• Pre-release datasets and project working materials
• Site documentation


BIOINFORMATICS APPLICATIONS NOTE, Vol. 25 no. 16 2009, pages 2078–2079, doi:10.1093/bioinformatics/btp352

Sequence analysis

and thus provides universal tools for processing read alignments.
Availability: http://samtools.sourceforge.net
Contact: rd@sanger.ac.uk

1 INTRODUCTION

With the advent of novel sequencing technologies such as Illumina/Solexa, AB/SOLiD and Roche/454 (Mardis, 2008), a variety of new alignment tools (Langmead et al., 2009; Li et al., 2008) have been designed to realize efficient read mapping against large reference sequences, including the human genome. These tools generate alignments in different formats, however, complicating downstream processing. A common alignment format that supports all sequence types and aligners creates a well-defined interface between alignment and downstream analyses, including variant detection, genotyping and assembly.

The Sequence Alignment/Map (SAM) format is designed to achieve this goal. It supports single- and paired-end reads and combining reads of different types, including color space reads from AB/SOLiD. It is designed to scale to alignment sets of 10^11 or more base pairs, which is typical for the deep resequencing of one human individual.

In this article, we present an overview of the SAM format and briefly introduce the companion SAMtools software package. A detailed format specification and the complete documentation of SAMtools are available at the SAMtools web site.

∗To whom correspondence should be addressed.
†The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.

on the field) if the corresponding information is unavailable. The optional fields are presented as key-value pairs in the format of TAG:TYPE:VALUE. They store extra information from the platform or aligner. For example, the 'RG' tag keeps the 'read group' information for each read. In combination with the '@RG' header lines, this tag allows each read to be labeled with metadata about its origin, sequencing center and library. The SAM format specification gives a detailed description of each field and the predefined TAGs.
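A short Python sketch shows how an alignment line splits into the 11 mandatory fields followed by TAG:TYPE:VALUE pairs. It is not SAMtools code. The field names are an assumption taken from the current SAM specification rather than from the text above (Table 1 is not reproduced here), and the example read is invented.

```python
# Sketch: split one TAB delimited SAM alignment line.

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Return a dict of the mandatory fields plus a 'tags' dict."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < len(SAM_FIELDS):
        raise ValueError("an alignment line needs 11 mandatory fields")

    record = {}
    for name, value in zip(SAM_FIELDS, parts[:11]):
        record[name] = int(value) if name in INT_FIELDS else value

    tags = {}
    for item in parts[11:]:
        tag, value_type, value = item.split(":", 2)
        # Only the integer type is converted here; others stay as text.
        tags[tag] = int(value) if value_type == "i" else value
    record["tags"] = tags
    return record


example = ("read1\t99\tchr1\t7\t30\t8M2I4M1D3M\t=\t37\t39\t"
           "TTAGATAAAGGATACTG\t*\tRG:Z:grp1\tNM:i:3")
alignment = parse_sam_line(example)
```

Splitting each optional field with a maximum of two splits keeps any ':' characters inside the value intact.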

2.1.2 Extended CIGAR. The standard CIGAR description of pairwise alignment defines three operations: 'M' for match/mismatch, 'I' for insertion compared with the reference and 'D' for deletion. The extended CIGAR proposed in SAM added four more operations: 'N' for skipped bases on the reference, 'S' for soft clipping, 'H' for hard clipping and 'P' for padding. These support splicing, clipping, multi-part and padded alignments. Figure 1 shows examples of CIGAR strings for different types of alignments.
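The seven operations can be tallied with a few lines of Python. The sketch below is illustrative rather than SAMtools code and the function names are invented; it relies on the SAM convention that 'M', 'D' and 'N' advance along the reference while 'M', 'I' and 'S' account for bases stored in the read sequence.

```python
# Sketch: parse an extended CIGAR string and measure what it covers.
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP])")


def parse_cigar(cigar):
    """Return a list of (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Reject strings with characters the pattern silently skipped.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN")


def query_length(cigar):
    """Number of read bases stored in the sequence field."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MIS")
```

For a spliced alignment such as `9M14N5M`, the read contributes 14 bases but spans 28 bases of reference because the skipped region counts toward the reference only.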

2.1.3 Binary Alignment/Map format. To improve the performance, we designed a companion format Binary Alignment/Map (BAM), which is the binary representation of SAM and keeps exactly the same information as SAM. BAM is compressed by the BGZF library, a generic library developed by us to achieve fast random access in a zlib-compatible compressed file. An example alignment of 112 Gbp of Illumina GA data requires 116 GB of disk space (1.0 byte per input base), including sequences, base qualities and all the meta information generated by MAQ. Most of this space is used to store the base qualities.

2.1.4 Sorting and indexing. A SAM/BAM file can be unsorted, but sorting by coordinate is used to streamline data processing and to avoid loading extra alignments into memory. A position-sorted BAM file can be indexed. We combine the UCSC binning scheme (Kent et al., 2002) and simple linear indexing to achieve fast random retrieval of alignments overlapping a

© 2009 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


ABSTRACT

Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API.
Availability: http://vcftools.sourceforge.net
Contact: rd@sanger.ac.uk

Received on October 28, 2010; revised on May 4, 2011; accepted on May 28, 2011

1 INTRODUCTION

One of the main uses of next-generation sequencing is to discover variation among large populations of related samples. Recently, a format for storing next-generation read alignments has been standardized by the SAM/BAM file format specification (Li et al., 2009). This has significantly improved the interoperability of next-generation tools for alignment, visualization and variant calling. We propose the variant call format (VCF) as a standardized format for storing the most prevalent types of sequence variation, including SNPs, indels and larger structural variants, together with rich annotations. The format was developed with the primary intention to represent human genetic variation, but its use is not restricted to diploid genomes and can be used in different contexts as well. Its flexibility and user extensibility allows representation of a wide variety of genomic variation with respect to a single reference sequence.

∗To whom correspondence should be addressed.
†The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
‡http://www.1000genomes.org

Although generic feature format (GFF) has recently been extended to standardize storage of variant information in genome variant format (GVF) (Reese et al., 2010), this is not tailored for storing information across many samples. We have designed the VCF format to be scalable so as to encompass millions of sites with genotype data and annotations from thousands of samples. We have adopted a textual encoding, with complementary indexing, to allow easy generation of the files while maintaining fast data access. In this article, we present an overview of the VCF and briefly introduce the companion VCFtools software package. A detailed format specification and the complete documentation of VCFtools are available at the VCFtools web site.

2 METHODS

21 The VCF211 Overview of the VCF A VCF file (Fig 1a) consists of a headersection and a data section The header contains an arbitrary number of meta-information lines each starting with characters lsquorsquo and a TAB delimitedfield definition line starting with a single lsquorsquocharacter The meta-informationheader lines provide a standardized description of tags and annotations usedin the data section The use of meta-information allows the informationstored within a VCF file to be tailored to the dataset in question It canbe also used to provide information about the means of file creation dateof creation version of the reference sequence software used and any otherinformation relevant to the history of the file The field definition line nameseight mandatory columns corresponding to data columns representing thechromosome (CHROM) a 1-based position of the start of the variant (POS)unique identifiers of the variant (ID) the reference allele (REF) a commaseparated list of alternate non-reference alleles (ALT) a phred-scaled qualityscore (QUAL) site filtering information (FILTER) and a semicolon separatedlist of additional user extensible annotation (INFO) In addition if samplesare present in the file the mandatory header columns are followed by aFORMAT column and an arbitrary number of sample IDs that define thesamples included in the VCF file The FORMAT column is used to definethe information contained within each subsequent genotype column whichconsists of a colon separated list of fields For example the FORMAT fieldGTGQDP in the fourth data entry of Figure 1a indicates that the subsequententries contain information regarding the genotype genotype quality andread depth for each sample All data lines are TAB delimited and the numberof fields in each data line must match the number of fields in the header lineIt is strongly recommended that all annotation tags used are declared in theVCF header section

copy The Author(s) 2011 Published by Oxford University PressThis is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (httpcreativecommonsorglicensesby-nc25) which permits unrestricted non-commercial use distribution and reproduction in any medium provided the original work is properly cited

at Genom

e Research Ltd on Septem

ber 19 2011bioinform

aticsoxfordjournalsorgD

ownloaded from

[1450 222011 Bioinformatics-btq671tex] Page 718 718ndash719

BIOINFORMATICS APPLICATIONS NOTE Vol 27 no 5 2011 pages 718ndash719doi101093bioinformaticsbtq671

Sequence analysis Advance Access publication January 5 2011

genome viewers to support huge data files and remote custom tracksover networksAvailability and Implementation httpsamtoolssourceforgenetContact henglibroadinstituteorg

Received on October 14 2010 revised and accepted on December1 2010

1 INTRODUCTIONWhen we examine local genomic features on the command lineor via a genome viewer we frequently need to perform intervalqueries retrieving features overlapping specified regions We mayread through the entire data file if we only perform interval queriesa few times or preload the file in memory if interval queries arefrequently performed (eg by genome viewers) However readingthe entire file makes both strategies inefficient given huge datasetsA solution to the efficiency problem is to build a database for thedata file (Kent et al 2002 Stein et al 2002) but this is notoptimal either because generic database indexing algorithms arenot specialized for biological data (Alekseyenko and Lee 2007) Thetechnical complexity of setting up databases and designing schemasalso hampers the adoption of the database approach for an averageend user

Under the circumstances a few specialized binary formatsincluding bigBedbigWig (Kent et al 2010) and BAM (Li et al2009) have been developed very recently to achieve efficient randomaccess to huge datasets while supporting data compression andremote file access which greatly helps routine data processing anddata visualization At present these advanced indexing techniquesare only applied to BED Wiggle and BAM Nonetheless as mostTAB-delimited biological data formats (eg PSL GFF SAM VCFand many UCSC database dumps) contain chromosomal positionsone can imagine that a generic tool that indexes for all these formatsis feasible And this tool is Tabix In this article I will explain thetechnical advances in Tabix indexing describe the algorithm andevaluate its performance on biological data

sort The sorted file should then be compressed with the bgzip programthat comes with the Tabix package Bgzip compresses the data file in theBGZF format which is the concatenation of a series of gzip blocks with eachblock holding at most 216 bytes of uncompressed data In the compressedfile each uncompressed byte in the text data file is assigned a unique 64-bitvirtual file offset where the higher 48 bits keep the real file offset of the startof the gzip block the byte falls in and the lower 16 bits store the offset ofthe byte inside the gzip block Given a virtual file offset one can directlyseek to the start of the gzip block using the higher 48 bits decompressthe block and retrieve the byte pointed by the lower 16 bits of the virtualoffset Random access can thus be achieved without the help of additionalindex structures As gzip works with concatenated gzip files it can alsoseamlessly decompress a BGZF file The detailed description of the BGZFformat is described in the SAM specification

22 Coupled binning and linear indicesTabix builds two types of indices for a data file a binning index and alinear index We can actually achieve fast retrieval with only one of themHowever using the binning index alone may incur many unnecessary seekcalls while using the linear index alone has bad worst-case performance(when some records span very long distances) Using them together avoidstheir weakness

221 The binning index The basic idea of binning is to cluster recordsinto large intervals called bins A record is assigned to bin k if k is the binof the smallest size that fully contains the record For each bin we keepin the index file the virtual file offsets of all records assigned to the binWhen we search for records overlapping a query interval we first collectbins overlapping the interval and then test each record in the collected binsfor overlaps

In principle bins can be selected freely as long as each record can beassigned to a bin In the Tabix binning index we adopt a multilevel binningscheme where bins at same level are non-overlapping and of the same size InTabix each bin k 0lek le37449 represents a half-close-half-open interval[(kminusol)sl(kminusol +1)sl) where l=log2(7k+1)3 is the level of the binsl =229minus3l is the size of the bin at level l and ol = (23l minus1)7 is the offset at lIn this scheme bin 0 spans 512 Mb 1ndash8 span 64 Mb 9ndash72 8 Mb 73ndash5841 Mb 585ndash4680 128 kb and 4681ndash37449 span 16 kb intervals The schemeis very similar to the UCSC binning (Kent et al 2002) except that in UCSC0lek le4681 and therefore the smallest bin size is 128 kb

718 copy The Author 2011 Published by Oxford University Press All rights reserved For Permissions please email journalspermissionsoupcom

at Genom

e Research Ltd on O

ctober 13 2011bioinform

aticsoxfordjournalsorgD

ownloaded from

DataformatsandkeytoolsHeng Li1dagger Bob Handsaker2dagger Alec Wysoker2 Tim Fennell2 Jue Ruan3 Nils Homer4 Gabor Marth5 Goncalo Abecasis6 Richard Durbin1lowast and 1000 Genome Project Data Processing Subgroup7

1Wellcome Trust Sanger Institute Wellcome Trust Genome Campus Cambridge CB10 1SA UK 2Broad Institute of MIT and Harvard Cambridge MA 02141 USA 3Beijing Institute of Genomics Chinese Academy of Science Beijing 100029 China 4Department of Computer Science University of California Los Angeles Los Angeles CA 90095 5Department of Biology Boston College Chestnut Hill MA 02467 6Center for Statistical Genetics Department of Biostatistics University of Michigan Ann Arbor MI 48109 USA and 7http1000genomesorg

Received on April 28 2009 revised on May 28 2009 accepted on May 30 2009

The Sequence AlignmentMap format and SAMtools

BAMalignmentfilesAdvance Access publication June 8 2009

Associate Editor Alfonso Valencia

ABSTRACT

Summary The Sequence AlignmentMap (SAM) format is a generic alignment format for storing read alignments against reference sequences supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms It is flexible in style compact in size efficient in random access and is the format in which alignments from the 1000 Genomes Project are released SAMtools implements various utilities for post-processing alignments in the SAM format such as indexing variant caller and alignment viewer

2 METHODS

21 The SAM format 211 Overview of the SAM format The SAM format consists of one header section and one alignment section The lines in the header section start with character lsquorsquo and lines in the alignment section do not All lines are TAB delimited An example is shown in Figure 1b

In SAM each alignment line has 11 mandatory fields and a variable number of optional fields The mandatory fields are briefly described in Table 1 They must be present but their value can be a lsquorsquoor a zero (depending

BIOINFORMATICS APPLICATIONS NOTE Vol. 27 no. 15 2011, pages 2156–2158
doi:10.1093/bioinformatics/btr330

Sequence analysis Advance Access publication June 7, 2011

The variant call format and VCFtools

VCF variant files

Petr Danecek1,†, Adam Auton2,†, Goncalo Abecasis3, Cornelis A. Albers1, Eric Banks4, Mark A. DePristo4, Robert E. Handsaker4, Gerton Lunter2, Gabor T. Marth5, Stephen T. Sherry6, Gilean McVean2,7, Richard Durbin1,* and 1000 Genomes Project Analysis Group‡

1Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SA, 2Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK, 3Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, 4Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02141, 5Department of Biology, Boston College, MA 02467, 6National Institutes of Health, National Center for Biotechnology Information, MD 20894, USA and 7Department of Statistics, University of Oxford, Oxford OX1 3TG, UK

Associate Editor: John Quackenbush

Tabix: fast retrieval of sequence features from generic TAB-delimited files

Heng Li
Program in Medical Population Genetics, The Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA

Associate Editor: Dmitrij Frishman

ABSTRACT

Summary: Tabix is the first generic tool that indexes position sorted files in TAB-delimited formats such as GFF, BED, PSL, SAM and SQL export, and quickly retrieves features overlapping specified regions. Tabix features include few seek function calls per query, data compression with gzip compatibility and direct FTP/HTTP access. Tabix is implemented as a free command-line tool as well as a library in C, Java, Perl and Python. It is particularly useful for manually examining local genomic features on the command line and enables

2 METHODS

Tabix indexing is a generalization of BAM indexing for generic TAB-delimited files. It inherits all the advantages of BAM indexing, including data compression and efficient random access in terms of few seek function calls per query.

2.1 Sorting and BGZF compression

Before being indexed, the data file needs to be sorted first by sequence name and then by leftmost coordinate, which can be done with the standard Unix

All indexed for fast retrieval

1000 Genomes is in the Amazon cloud

1KG pilot content (BAM) is available at s3://1000genomes.s3.amazonaws.com
You can see the XML at http://1000genomes.s3.amazonaws.com

     

Data access

• General information
• File access
• 1000 Genomes Browser
• Tools
• Where to find help

http://browser.1000genomes.org

• Gene variation zoom

• Population

• SIFT
• PolyPhen

File upload to view with 1000 Genomes data

• Supports popular file types
  – BAM, BED, bedGraph, BigWig, GBrowse, Generic, GFF, GTF, PSL, VCF, WIG

VCF must be indexed

Uploaded VCF example: comparison of August calls and technical/working/20110502_vqsr_phase1_wgs_snps/ALL.wgs.phase1.projectConsensus.snps.sites.vcf.gz

1000 Genomes Browser

• For further information on the capabilities of the browser and its use, attend the Ensembl "New Users" Workshop on Saturday at 12:30

http://pilotbrowser.1000genomes.org

     


http://browser.1000genomes.org

Tools page

Ensembl Variant Effect Predictor (VEP)

• Takes a list of variations and annotates them with respect to Ensembl features
• Returns whether the SNP has been seen in the 1000 Genomes data and whether it has an rs number (if one has been assigned)
• Returns SIFT, PolyPhen and Condel scores
• Extensive filtering options by MAF and populations
• Web and command line versions

Data Slicer for subsets of the data

http://trace.ncbi.nlm.nih.gov/Traces/1kg_slicer

Sliced BAM to files

Variation Pattern Finder

• http://browser.1000genomes.org/Homo_sapiens/UserData/VariationsMapVCF
• VCF input
• Discovers patterns of shared inheritance
• Variants with functional consequences considered
• Web output with csv and Excel downloads

Access to backend Ensembl databases

• Public MySQL database at
  – mysql-db.1000genomes.org port 4272
• Full programmatic access with Ensembl API
  – More information on the use of the Ensembl API at the Ensembl "Advanced Users" Workshop tomorrow

     


    

Credits & Contact

• Eugene Kulesha, Iliana Toneva, Bren Vaughan
• Will McLaren, Graham Ritchie, Fiona Cunningham
• Laura Clarke, Holly Zheng-Bradley, Rick Smith
• Steve Sherry, Chunlin Xiao

For more information contact info@1000genomes.org


      

Introduction

The main goal of the 1000 Genomes Project is to establish a comprehensive and detailed catalogue of human genome variations, which in turn will empower association studies to identify disease-causing genes. The project now has data and variant genotypes for more than 1000 individuals in 14 populations. The ftp site contains more than 120 Tbytes of data in 200,000 files.

DATA TYPE    FILE FORMAT    SIZE
sequence     FASTQ          43 Tbases raw sequence
alignment    BAM            56 Tbytes of BAM files
variants     VCF            389M SNPs, ~47M short indels

Discoverability

Sequence, alignment and variant data are made available as quickly as possible through the project ftp site (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp | ftp://ftp-trace.ncbi.nih.gov/1000genomes). With more than 200,000 files, though, discovering new data can be difficult.

The ftp site has an index that is updated nightly. This index is searchable from our website: http://www.1000genomes.org/ftpsearch

The search allows users to specify which ftp site to get paths to, to get md5 checksums and also to filter out high volume results like bam and fastq files.

We also have various routes for users to discover new data:

• Website: http://www.1000genomes.org/announcements
• Twitter: @1000genomes
• RSS: http://www.1000genomes.org/announcements/rss.xml
• Email: 1000announce@1000genomes.org
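The two search options mentioned above (md5 checksums and filtering out high volume files) can be mimicked locally once a list of paths has been obtained. The following is only a minimal sketch: the function names and the list of file suffixes are assumptions made for illustration and are not part of the project's search tool.

```python
import hashlib

# Assumed suffixes for "high volume" files; adjust to the listing at hand.
HIGH_VOLUME_SUFFIXES = (".bam", ".fastq", ".fastq.gz")


def md5_of_stream(stream, chunk_size=1 << 20):
    """Return the hex md5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def drop_high_volume(paths):
    """Keep only paths that do not end in one of the high volume suffixes."""
    return [p for p in paths if not p.lower().endswith(HIGH_VOLUME_SUFFIXES)]
```

A downloaded file can then be checked with `md5_of_stream(open(path, "rb"))` against the checksum reported by the search.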

1000 Genomes Project Resources

L. Clarke, H. Zheng-Bradley, R. Smith, E. Kulesha, I. Toneva, B. Vaughan, P. Flicek and 1000 Genomes Consortium
European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SA, UK

Visualization http://browser.1000genomes.org

The 1000 Genomes Project utilizes the Ensembl Browser to display our variant calls. We provide rapid access to project variant calls through the browser before they become available via dbSNP and DGVa.

Tracks of 1000 Genomes variants by population can be viewed in the location page.

A list of variants can be obtained for any given transcript. In addition to basic information about a variant, PolyPhen and SIFT annotations are displayed to indicate the clinical significance of the variant.

Allele frequency for individual variants in different populations is displayed on the 'Population Genetics' page.

Users can attach remote files as custom tracks. In the example below, the HG00120 track is a 1000 Genomes bam file added to the browser.

Laura Clarke
EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

laura@ebi.ac.uk

Accessibility http://browser.1000genomes.org/tools.html

The project provides several tools to help users access and interpret the data provided.

Variant Effect Predictor: The predictor takes a list of variant positions and alleles, and predicts the effects of each of these on any overlapping features (transcripts, regulatory features) annotated in Ensembl. An example output is shown below.

Data Slicer: Many of the 1000 Genomes files are large and cumbersome to handle. The Data Slicer allows users to get data for specific regions of the genome and to avoid having to download many gigabytes of data they don't need [...] samples, populations you choose. Below is the Data Slicer input interface.

Variation Pattern Finder

• The Variation Pattern Finder (VPF) allows one to look for patterns of shared variation between individuals in a VCF file.
• Within a VCF file, different samples have different combinations of variation genotypes. The VPF looks for distinct variation combinations within a user-specified region shared by different individuals.
• The VPF only focuses on variations that have functional consequences for protein coding genes, such as non-synonymous coding SNPs and splice site changes.
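The idea of grouping individuals by their shared combination of genotypes in a region can be sketched in a few lines. This is an illustration only, not the VPF implementation: the function name is hypothetical, the input is assumed to be uncompressed VCF text lines, and no filtering by functional consequence is attempted.

```python
from collections import defaultdict


def shared_genotype_patterns(vcf_lines, chrom, start, end):
    """Group sample names by their tuple of GT values within chrom:start-end."""
    samples = []
    per_sample = defaultdict(list)
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank and meta-information lines
        fields = line.split("\t")
        if line.startswith("#"):
            samples = fields[9:]  # sample IDs follow the FORMAT column
            continue
        pos = int(fields[1])
        if fields[0] != chrom or not (start <= pos <= end):
            continue
        gt_index = fields[8].split(":").index("GT")
        for sample, entry in zip(samples, fields[9:]):
            per_sample[sample].append(entry.split(":")[gt_index])
    groups = defaultdict(list)
    for sample in samples:
        groups[tuple(per_sample[sample])].append(sample)
    return dict(groups)
```

Samples that end up under the same key carry the same genotype combination across every variant in the requested region.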

Acknowledgements: We would like to thank the Ensembl variation team for all their help, particularly Will McLaren and Graham Ritchie.
Funding: The Wellcome Trust

T +44 (0) 1223 494 444
F +44 (0) 1223 494 468
http://www.ebi.ac.uk

     


ftp://ftp.1000genomes.ebi.ac.uk
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp

• Sequences & alignments by sample ID
• Data sets to accompany the pilot data publication
• Current and archive data set releases
• Pre-release data sets and project working materials
• Site documentation


BIOINFORMATICS APPLICATIONS NOTE Vol. 25 no. 16 2009, pages 2078–2079
doi:10.1093/bioinformatics/btp352

Sequence analysis

and thus provides universal tools for processing read alignments.
Availability: http://samtools.sourceforge.net
Contact: rd@sanger.ac.uk

1 INTRODUCTION

With the advent of novel sequencing technologies such as Illumina/Solexa, AB/SOLiD and Roche/454 (Mardis, 2008), a variety of new alignment tools (Langmead et al., 2009; Li et al., 2008) have been designed to realize efficient read mapping against large reference sequences, including the human genome. These tools generate alignments in different formats, however, complicating downstream processing. A common alignment format that supports all sequence types and aligners creates a well-defined interface between alignment and downstream analyses, including variant detection, genotyping and assembly.

The Sequence Alignment/Map (SAM) format is designed to achieve this goal. It supports single- and paired-end reads and combining reads of different types, including color space reads from AB/SOLiD. It is designed to scale to alignment sets of 10^11 or more base pairs, which is typical for the deep resequencing of one human individual.

In this article, we present an overview of the SAM format and briefly introduce the companion SAMtools software package. A detailed format specification and the complete documentation of SAMtools are available at the SAMtools web site.

*To whom correspondence should be addressed.
†The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.

on the field) if the corresponding information is unavailable. The optional fields are presented as key-value pairs in the format of TAG:TYPE:VALUE. They store extra information from the platform or aligner. For example, the 'RG' tag keeps the 'read group' information for each read. In combination with the '@RG' header lines, this tag allows each read to be labeled with metadata about its origin, sequencing center and library. The SAM format specification gives a detailed description of each field and the predefined TAGs.
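To make the layout of an alignment line concrete, the sketch below splits one TAB-delimited line into its 11 mandatory fields and decodes the optional TAG:TYPE:VALUE pairs. The field names follow the current SAM specification rather than Table 1 of the article, the function name is hypothetical, and only the 'i' and 'f' value types are converted; everything else is kept as a string.

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_line(line):
    """Parse one alignment line into a dict of mandatory fields plus 'tags'."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("an alignment line needs 11 mandatory fields")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    tags = {}
    for item in cols[11:]:
        tag, value_type, value = item.split(":", 2)  # VALUE may contain ':'
        if value_type == "i":
            value = int(value)
        elif value_type == "f":
            value = float(value)
        tags[tag] = value
    record["tags"] = tags
    return record
```

Header lines (those starting with '@') are not handled here and should be filtered out before calling the function.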

2.1.2 Extended CIGAR The standard CIGAR description of pairwise alignment defines three operations: 'M' for match/mismatch, 'I' for insertion compared with the reference and 'D' for deletion. The extended CIGAR proposed in SAM added four more operations: 'N' for skipped bases on the reference, 'S' for soft clipping, 'H' for hard clipping and 'P' for padding. These support splicing, clipping, multi-part and padded alignments. Figure 1 shows examples of CIGAR strings for different types of alignments.
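As a small illustration of the seven operations listed above, the sketch below parses a CIGAR string and counts how many reference bases and how many read bases it covers. The function names are hypothetical, and the '=' and 'X' operations added in later versions of the specification are deliberately left out.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP])")
CONSUMES_REFERENCE = set("MDN")  # operations that advance along the reference
CONSUMES_QUERY = set("MIS")      # operations that advance along the read


def parse_cigar(cigar):
    """Return a list of (length, operation) pairs for an extended CIGAR string."""
    if not re.fullmatch(r"(\d+[MIDNSHP])+", cigar):
        raise ValueError("not a valid extended CIGAR string: %r" % cigar)
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def reference_length(cigar):
    """Number of reference bases spanned by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def query_length(cigar):
    """Number of read bases stored in SEQ for this alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)
```

Hard clipping ('H') and padding ('P') count towards neither total, because hard-clipped bases are absent from the stored read and padding has no counterpart on the reference.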

2.1.3 Binary Alignment/Map format To improve the performance, we designed a companion format Binary Alignment/Map (BAM), which is the binary representation of SAM and keeps exactly the same information as SAM. BAM is compressed by the BGZF library, a generic library developed by us to achieve fast random access in a zlib-compatible compressed file. An example alignment of 112 Gbp of Illumina GA data requires 116 GB of disk space (1.0 byte per input base), including sequences, base qualities and all the meta information generated by MAQ. Most of this space is used to store the base qualities.

2.1.4 Sorting and indexing A SAM/BAM file can be unsorted, but sorting by coordinate is used to streamline data processing and to avoid loading extra alignments into memory. A position-sorted BAM file can be indexed. We combine the UCSC binning scheme (Kent et al., 2002) and simple linear indexing to achieve fast random retrieval of alignments overlapping a

© 2009 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


ABSTRACT

Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API.
Availability: http://vcftools.sourceforge.net
Contact: rd@sanger.ac.uk

Received on October 28, 2010; revised on May 4, 2011; accepted on May 28, 2011

1 INTRODUCTION

One of the main uses of next-generation sequencing is to discover variation among large populations of related samples. Recently, a format for storing next-generation read alignments has been standardized by the SAM/BAM file format specification (Li et al., 2009). This has significantly improved the interoperability of next-generation tools for alignment, visualization and variant calling. We propose the variant call format (VCF) as a standardized format for storing the most prevalent types of sequence variation, including SNPs, indels and larger structural variants, together with rich annotations. The format was developed with the primary intention to represent human genetic variation, but its use is not restricted to diploid genomes and can be used in different contexts as well. Its flexibility and user extensibility allows representation of a wide variety of genomic variation with respect to a single reference sequence.

*To whom correspondence should be addressed.
†The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
‡http://www.1000genomes.org

Although generic feature format (GFF) has recently been extended to standardize storage of variant information in genome variant format (GVF) (Reese et al., 2010), this is not tailored for storing information across many samples. We have designed the VCF format to be scalable so as to encompass millions of sites with genotype data and annotations from thousands of samples. We have adopted a textual encoding, with complementary indexing, to allow easy generation of the files while maintaining fast data access. In this article, we present an overview of the VCF and briefly introduce the companion VCFtools software package. A detailed format specification and the complete documentation of VCFtools are available at the VCFtools web site.

2 METHODS

2.1 The VCF

2.1.1 Overview of the VCF A VCF file (Fig. 1a) consists of a header section and a data section. The header contains an arbitrary number of meta-information lines, each starting with characters '##', and a TAB delimited field definition line, starting with a single '#' character. The meta-information header lines provide a standardized description of tags and annotations used in the data section. The use of meta-information allows the information stored within a VCF file to be tailored to the dataset in question. It can be also used to provide information about the means of file creation, date of creation, version of the reference sequence, software used and any other information relevant to the history of the file.

The field definition line names eight mandatory columns, corresponding to data columns representing the chromosome (CHROM), a 1-based position of the start of the variant (POS), unique identifiers of the variant (ID), the reference allele (REF), a comma separated list of alternate non-reference alleles (ALT), a phred-scaled quality score (QUAL), site filtering information (FILTER) and a semicolon separated list of additional, user extensible annotation (INFO). In addition, if samples are present in the file, the mandatory header columns are followed by a FORMAT column and an arbitrary number of sample IDs that define the samples included in the VCF file. The FORMAT column is used to define the information contained within each subsequent genotype column, which consists of a colon separated list of fields. For example, the FORMAT field GT:GQ:DP in the fourth data entry of Figure 1a indicates that the subsequent entries contain information regarding the genotype, genotype quality and read depth for each sample. All data lines are TAB delimited and the number of fields in each data line must match the number of fields in the header line. It is strongly recommended that all annotation tags used are declared in the VCF header section.
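The structure described above (meta-information lines, one field definition line, then data lines whose genotype columns are decoded through FORMAT) can be illustrated with a short reader. This is a minimal sketch for uncompressed text, with a hypothetical function name; it does not validate tags against the header or parse the INFO column.

```python
FIXED_COLUMNS = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")


def parse_vcf(lines):
    """Return (meta_lines, sample_ids, records) from an iterable of VCF lines."""
    meta, samples, records = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line[2:])                 # meta-information line
        elif line.startswith("#"):
            samples = line[1:].split("\t")[9:]    # field definition line
        else:
            cols = line.split("\t")
            record = dict(zip(FIXED_COLUMNS, cols[:8]))
            record["POS"] = int(record["POS"])
            record["ALT"] = record["ALT"].split(",")
            keys = cols[8].split(":") if len(cols) > 8 else []
            record["genotypes"] = {
                sample: dict(zip(keys, entry.split(":")))
                for sample, entry in zip(samples, cols[9:])
            }
            records.append(record)
    return meta, samples, records
```

With a FORMAT value of GT:GQ:DP, each entry in `record["genotypes"]` maps a sample ID to a dictionary with the keys GT, GQ and DP.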

© The Author(s) 2011. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


BIOINFORMATICS APPLICATIONS NOTE Vol. 27 no. 5 2011, pages 718–719
doi:10.1093/bioinformatics/btq671

Sequence analysis Advance Access publication January 5, 2011

genome viewers to support huge data files and remote custom tracks over networks.
Availability and Implementation: http://samtools.sourceforge.net
Contact: hengli@broadinstitute.org

Received on October 14, 2010; revised and accepted on December 1, 2010

1 INTRODUCTION

When we examine local genomic features on the command line or via a genome viewer, we frequently need to perform interval queries, retrieving features overlapping specified regions. We may read through the entire data file if we only perform interval queries a few times, or preload the file in memory if interval queries are frequently performed (e.g. by genome viewers). However, reading the entire file makes both strategies inefficient given huge datasets. A solution to the efficiency problem is to build a database for the data file (Kent et al., 2002; Stein et al., 2002), but this is not optimal either, because generic database indexing algorithms are not specialized for biological data (Alekseyenko and Lee, 2007). The technical complexity of setting up databases and designing schemas also hampers the adoption of the database approach for an average end user.

Under the circumstances, a few specialized binary formats, including bigBed/bigWig (Kent et al., 2010) and BAM (Li et al., 2009), have been developed very recently to achieve efficient random access to huge datasets while supporting data compression and remote file access, which greatly helps routine data processing and data visualization. At present, these advanced indexing techniques are only applied to BED, Wiggle and BAM. Nonetheless, as most TAB-delimited biological data formats (e.g. PSL, GFF, SAM, VCF and many UCSC database dumps) contain chromosomal positions, one can imagine that a generic tool that indexes for all these formats is feasible. And this tool is Tabix. In this article, I will explain the technical advances in Tabix indexing, describe the algorithm and evaluate its performance on biological data.

sort. The sorted file should then be compressed with the bgzip program that comes with the Tabix package. Bgzip compresses the data file in the BGZF format, which is the concatenation of a series of gzip blocks with each block holding at most 2^16 bytes of uncompressed data. In the compressed file, each uncompressed byte in the text data file is assigned a unique 64-bit virtual file offset, where the higher 48 bits keep the real file offset of the start of the gzip block the byte falls in, and the lower 16 bits store the offset of the byte inside the gzip block. Given a virtual file offset, one can directly seek to the start of the gzip block using the higher 48 bits, decompress the block and retrieve the byte pointed by the lower 16 bits of the virtual offset. Random access can thus be achieved without the help of additional index structures. As gzip works with concatenated gzip files, it can also seamlessly decompress a BGZF file. The detailed description of the BGZF format is described in the SAM specification.
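The 64-bit virtual file offset described above is a simple bit-packing of two numbers. The sketch below shows only that arithmetic, with hypothetical function names; it does not read or write BGZF files.

```python
def make_virtual_offset(block_start, within_block):
    """Pack a gzip block start (upper 48 bits) and an in-block offset (lower 16 bits)."""
    if not 0 <= block_start < (1 << 48):
        raise ValueError("block start must fit in 48 bits")
    if not 0 <= within_block < (1 << 16):
        raise ValueError("offset inside a block must fit in 16 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset):
    """Return (block_start, within_block) for a 64-bit virtual file offset."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF
```

Because the block start occupies the high bits, virtual offsets sort in the same order as the uncompressed bytes they point to, which lets an index compare them as plain integers.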

2.2 Coupled binning and linear indices

Tabix builds two types of indices for a data file: a binning index and a linear index. We can actually achieve fast retrieval with only one of them. However, using the binning index alone may incur many unnecessary seek calls, while using the linear index alone has bad worst-case performance (when some records span very long distances). Using them together avoids their weakness.

2.2.1 The binning index The basic idea of binning is to cluster records into large intervals, called bins. A record is assigned to bin k if k is the bin of the smallest size that fully contains the record. For each bin, we keep in the index file the virtual file offsets of all records assigned to the bin. When we search for records overlapping a query interval, we first collect bins overlapping the interval and then test each record in the collected bins for overlaps.

In principle, bins can be selected freely as long as each record can be assigned to a bin. In the Tabix binning index, we adopt a multilevel binning scheme where bins at same level are non-overlapping and of the same size. In Tabix, each bin $k$, $0 \le k \le 37449$, represents a half-close-half-open interval $[(k - o_l)s_l, (k - o_l + 1)s_l)$, where $l = \lfloor \log_2(7k+1)/3 \rfloor$ is the level of the bin, $s_l = 2^{29-3l}$ is the size of the bin at level $l$ and $o_l = (2^{3l} - 1)/7$ is the offset at $l$. In this scheme, bin 0 spans 512 Mb, 1–8 span 64 Mb, 9–72 8 Mb, 73–584 1 Mb, 585–4680 128 kb and 4681–37449 span 16 kb intervals. The scheme is very similar to the UCSC binning (Kent et al., 2002) except that in UCSC, $0 \le k \le 4681$ and therefore the smallest bin size is 128 kb.
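The level, size and offset formulas above translate directly into integer arithmetic. The sketch below computes the interval covered by a bin and, in the other direction, the smallest bin that fully contains a zero-based half-open region. Function names are hypothetical, and the second function is written from the formulas in the text rather than copied from the Tabix source.

```python
MAX_LEVEL = 5  # level of the smallest (16 kb) bins


def bin_level(k):
    """l = floor(log2(7k + 1) / 3), computed with integers only."""
    return ((7 * k + 1).bit_length() - 1) // 3


def level_size(level):
    """s_l = 2^(29 - 3l): number of bases covered by one bin at this level."""
    return 1 << (29 - 3 * level)


def level_offset(level):
    """o_l = (2^(3l) - 1) / 7: index of the first bin at this level."""
    return ((1 << (3 * level)) - 1) // 7


def bin_interval(k):
    """Half-open interval [start, end) represented by bin k."""
    level = bin_level(k)
    size, offset = level_size(level), level_offset(level)
    return (k - offset) * size, (k - offset + 1) * size


def smallest_bin(beg, end):
    """Smallest bin fully containing the zero-based half-open region [beg, end)."""
    last = end - 1
    for level in range(MAX_LEVEL, 0, -1):
        shift = 29 - 3 * level
        if beg >> shift == last >> shift:
            return level_offset(level) + (beg >> shift)
    return 0
```

For example, a record spanning positions 16000 to 17000 crosses a 16 kb boundary, so it falls back to the 128 kb bin 585 rather than a level 5 bin.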

© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com



Page 6: Accessing the 1000 Genomes Data - National Human Genome ...€¦ · Accessing the 1000 Genomes Data Paul Flicek European BioinformaMcs InsMtute

     

Dataaccess

bull GeneralinformaMon bull Fileaccess bull 1000GenomesBrowser bull Tools bull Wheretofindhelp

gtpgtp1000genomesebiacuk gtpgtp‐tracencbinihgov1000genomesgtp

SequencesampalignmentsbysampleID

DatasetstoaccompanythepilotdatapublicaMon

Currentandarchivedatasetreleases

Pre‐releasedatasetsandprojectworkingmaterials

SitedocumentaMon

[1424 1472009 Bioinformatics-btp352tex] Page 2078 2078ndash2079

BIOINFORMATICS APPLICATIONS NOTE Vol 25 no 16 2009 pages 2078ndash2079doi101093bioinformaticsbtp352

Sequence analysis

and thus provides universal tools for processing read alignmentsAvailability httpsamtoolssourceforgenetContact rdsangeracuk

1 INTRODUCTIONWith the advent of novel sequencing technologies such asIlluminaSolexa ABSOLiD and Roche454 (Mardis 2008) avariety of new alignment tools (Langmead et al 2009 Li et al2008) have been designed to realize efficient read mapping againstlarge reference sequences including the human genome These toolsgenerate alignments in different formats however complicatingdownstream processing A common alignment format that supportsall sequence types and aligners creates a well-defined interfacebetween alignment and downstream analyses including variantdetection genotyping and assembly

The Sequence AlignmentMap (SAM) format is designed toachieve this goal It supports single- and paired-end reads andcombining reads of different types including color space reads fromABSOLiD It is designed to scale to alignment sets of 1011 or morebase pairs which is typical for the deep resequencing of one humanindividual

In this article we present an overview of the SAM format andbriefly introduce the companion SAMtools software package Adetailed format specification and the complete documentation ofSAMtools are available at the SAMtools web site

lowastTo whom correspondence should be addresseddaggerThe authors wish it to be known that in their opinion the first two authorsshould be regarded as Joint First Authors

on the field) if the corresponding information is unavailable The optionalfields are presented as key-value pairs in the format of TAGTYPEVALUEThey store extra information from the platform or aligner For example thelsquoRGrsquo tag keeps the lsquoread grouprsquo information for each read In combinationwith the lsquoRGrsquo header lines this tag allows each read to be labeled withmetadata about its origin sequencing center and library The SAM formatspecification gives a detailed description of each field and the predefinedTAGs

212 Extended CIGAR The standard CIGAR description of pairwisealignment defines three operations lsquoMrsquo for matchmismatch lsquoIrsquo for insertioncompared with the reference and lsquoDrsquo for deletion The extended CIGARproposed in SAM added four more operations lsquoNrsquo for skipped bases onthe reference lsquoSrsquo for soft clipping lsquoHrsquo for hard clipping and lsquoPrsquo for paddingThese support splicing clipping multi-part and padded alignments Figure 1shows examples of CIGAR strings for different types of alignments

2.1.3 Binary Alignment/Map format To improve the performance, we designed a companion format Binary Alignment/Map (BAM), which is the binary representation of SAM and keeps exactly the same information as SAM. BAM is compressed by the BGZF library, a generic library developed by us to achieve fast random access in a zlib-compatible compressed file. An example alignment of 112 Gbp of Illumina GA data requires 116 GB of disk space (1.0 byte per input base), including sequences, base qualities and all the meta information generated by MAQ. Most of this space is used to store the base qualities.

2.1.4 Sorting and indexing A SAM/BAM file can be unsorted, but sorting by coordinate is used to streamline data processing and to avoid loading extra alignments into memory. A position-sorted BAM file can be indexed. We combine the UCSC binning scheme (Kent et al., 2002) and simple linear indexing to achieve fast random retrieval of alignments overlapping a

© 2009 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


ABSTRACT

Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API.
Availability: http://vcftools.sourceforge.net
Contact: rd@sanger.ac.uk

Received on October 28, 2010; revised on May 4, 2011; accepted on May 28, 2011

1 INTRODUCTION

One of the main uses of next-generation sequencing is to discover variation among large populations of related samples. Recently, a format for storing next-generation read alignments has been standardized by the SAM/BAM file format specification (Li et al., 2009). This has significantly improved the interoperability of next-generation tools for alignment, visualization and variant calling. We propose the variant call format (VCF) as a standardized format for storing the most prevalent types of sequence variation, including SNPs, indels and larger structural variants, together with rich annotations. The format was developed with the primary intention to represent human genetic variation, but its use is not restricted to diploid genomes and can be used in different contexts as well. Its flexibility and user extensibility allows representation of a wide variety of genomic variation with respect to a single reference sequence.

*To whom correspondence should be addressed.
†The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
‡http://www.1000genomes.org

Although generic feature format (GFF) has recently been extended to standardize storage of variant information in genome variant format (GVF) (Reese et al., 2010), this is not tailored for storing information across many samples. We have designed the VCF format to be scalable so as to encompass millions of sites with genotype data and annotations from thousands of samples. We have adopted a textual encoding, with complementary indexing, to allow easy generation of the files while maintaining fast data access. In this article, we present an overview of the VCF and briefly introduce the companion VCFtools software package. A detailed format specification and the complete documentation of VCFtools are available at the VCFtools web site.

2 METHODS

2.1 The VCF

2.1.1 Overview of the VCF A VCF file (Fig. 1a) consists of a header section and a data section. The header contains an arbitrary number of meta-information lines, each starting with characters '##', and a TAB delimited field definition line, starting with a single '#' character. The meta-information header lines provide a standardized description of tags and annotations used in the data section. The use of meta-information allows the information stored within a VCF file to be tailored to the dataset in question. It can be also used to provide information about the means of file creation, date of creation, version of the reference sequence, software used and any other information relevant to the history of the file. The field definition line names eight mandatory columns, corresponding to data columns representing the chromosome (CHROM), a 1-based position of the start of the variant (POS), unique identifiers of the variant (ID), the reference allele (REF), a comma separated list of alternate non-reference alleles (ALT), a phred-scaled quality score (QUAL), site filtering information (FILTER) and a semicolon separated list of additional, user extensible annotation (INFO). In addition, if samples are present in the file, the mandatory header columns are followed by a FORMAT column and an arbitrary number of sample IDs that define the samples included in the VCF file. The FORMAT column is used to define the information contained within each subsequent genotype column, which consists of a colon separated list of fields. For example, the FORMAT field GT:GQ:DP in the fourth data entry of Figure 1a indicates that the subsequent entries contain information regarding the genotype, genotype quality and read depth for each sample. All data lines are TAB delimited and the number of fields in each data line must match the number of fields in the header line. It is strongly recommended that all annotation tags used are declared in the VCF header section.
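The structure described in this overview can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the function name `parse_vcf` and the three-line input are hypothetical, values are kept as strings except POS, and none of the validation that VCFtools performs is attempted.

```python
FIXED = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf(lines):
    """Split VCF text lines into meta-information, sample IDs and records."""
    meta, samples, records = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):        # meta-information line
            meta.append(line[2:])
        elif line.startswith("#"):       # field definition line
            samples = line[1:].split("\t")[9:]
        elif line:                       # data line
            cols = line.split("\t")
            rec = dict(zip(FIXED, cols[:8]))
            rec["POS"] = int(rec["POS"])
            rec["ALT"] = rec["ALT"].split(",")
            info = {}
            if rec["INFO"] != ".":
                for item in rec["INFO"].split(";"):
                    key, _, value = item.partition("=")
                    info[key] = value if value else True  # flag without a value
            rec["INFO"] = info
            if len(cols) > 8:
                # FORMAT names the colon separated fields of each genotype column.
                keys = cols[8].split(":")
                rec["genotypes"] = {
                    sample: dict(zip(keys, column.split(":")))
                    for sample, column in zip(samples, cols[9:])
                }
            records.append(rec)
    return meta, samples, records


# Hypothetical three-line file with two samples.
text = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\tSAMPLE2",
    "1\t100\trs1\tA\tG,T\t50\tPASS\tDP=14;DB\tGT:GQ:DP\t0|1:48:8\t1/2:43:5",
]
meta, samples, records = parse_vcf(text)
print(samples, records[0]["genotypes"]["SAMPLE1"])
```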

© The Author(s) 2011. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


BIOINFORMATICS APPLICATIONS NOTE Vol. 27 no. 5 2011, pages 718–719 doi:10.1093/bioinformatics/btq671

Sequence analysis Advance Access publication January 5, 2011

genome viewers to support huge data files and remote custom tracks over networks.
Availability and Implementation: http://samtools.sourceforge.net
Contact: hengli@broadinstitute.org

Received on October 14, 2010; revised and accepted on December 1, 2010

1 INTRODUCTION

When we examine local genomic features on the command line or via a genome viewer, we frequently need to perform interval queries: retrieving features overlapping specified regions. We may read through the entire data file if we only perform interval queries a few times, or preload the file in memory if interval queries are frequently performed (e.g. by genome viewers). However, reading the entire file makes both strategies inefficient given huge datasets. A solution to the efficiency problem is to build a database for the data file (Kent et al., 2002; Stein et al., 2002), but this is not optimal either because generic database indexing algorithms are not specialized for biological data (Alekseyenko and Lee, 2007). The technical complexity of setting up databases and designing schemas also hampers the adoption of the database approach for an average end user.

Under the circumstances, a few specialized binary formats including bigBed/bigWig (Kent et al., 2010) and BAM (Li et al., 2009) have been developed very recently to achieve efficient random access to huge datasets while supporting data compression and remote file access, which greatly helps routine data processing and data visualization. At present, these advanced indexing techniques are only applied to BED, Wiggle and BAM. Nonetheless, as most TAB-delimited biological data formats (e.g. PSL, GFF, SAM, VCF and many UCSC database dumps) contain chromosomal positions, one can imagine that a generic tool that indexes for all these formats is feasible. And this tool is Tabix. In this article, I will explain the technical advances in Tabix indexing, describe the algorithm and evaluate its performance on biological data.

sort. The sorted file should then be compressed with the bgzip program that comes with the Tabix package. Bgzip compresses the data file in the BGZF format, which is the concatenation of a series of gzip blocks with each block holding at most $2^{16}$ bytes of uncompressed data. In the compressed file, each uncompressed byte in the text data file is assigned a unique 64-bit virtual file offset where the higher 48 bits keep the real file offset of the start of the gzip block the byte falls in, and the lower 16 bits store the offset of the byte inside the gzip block. Given a virtual file offset, one can directly seek to the start of the gzip block using the higher 48 bits, decompress the block and retrieve the byte pointed by the lower 16 bits of the virtual offset. Random access can thus be achieved without the help of additional index structures. As gzip works with concatenated gzip files, it can also seamlessly decompress a BGZF file. The detailed description of the BGZF format is described in the SAM specification.
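The virtual file offset arithmetic can be sketched with plain integer operations. In the Python example below the function names are hypothetical and the numbers are made up. The last lines use the standard `gzip` module to show that concatenated gzip members decompress as one stream; these members are ordinary gzip output rather than true BGZF blocks, which carry an additional header field.

```python
import gzip


def make_virtual_offset(block_start, within_block):
    """Pack a compressed-file offset and an in-block offset into 64 bits."""
    if not 0 <= block_start < (1 << 48):
        raise ValueError("block start must fit in 48 bits")
    if not 0 <= within_block < (1 << 16):
        raise ValueError("in-block offset must fit in 16 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(voffset):
    """Recover (block start, offset inside the uncompressed block)."""
    return voffset >> 16, voffset & 0xFFFF


# A byte that is the 43rd of the block starting at compressed offset 1000.
v = make_virtual_offset(1000, 42)
print(v, split_virtual_offset(v))

# Two gzip members written back to back still decompress as one stream.
joined = gzip.compress(b"chr1\t100\n") + gzip.compress(b"chr1\t200\n")
print(gzip.decompress(joined))
```

Because the block start occupies the high bits, comparing two virtual offsets as plain integers orders them by their position in the file, which is convenient when offsets are stored in an index.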

2.2 Coupled binning and linear indices

Tabix builds two types of indices for a data file: a binning index and a linear index. We can actually achieve fast retrieval with only one of them. However, using the binning index alone may incur many unnecessary seek calls, while using the linear index alone has bad worst-case performance (when some records span very long distances). Using them together avoids their weakness.

2.2.1 The binning index The basic idea of binning is to cluster records into large intervals, called bins. A record is assigned to bin k if k is the bin of the smallest size that fully contains the record. For each bin, we keep in the index file the virtual file offsets of all records assigned to the bin. When we search for records overlapping a query interval, we first collect bins overlapping the interval and then test each record in the collected bins for overlaps.

In principle, bins can be selected freely as long as each record can be assigned to a bin. In the Tabix binning index, we adopt a multilevel binning scheme where bins at same level are non-overlapping and of the same size. In Tabix, each bin $k$, $0 \le k \le 37449$, represents a half-close-half-open interval $[(k - o_l)s_l, (k - o_l + 1)s_l)$, where $l = \lfloor \log_2(7k+1)/3 \rfloor$ is the level of the bin, $s_l = 2^{29-3l}$ is the size of the bin at level $l$ and $o_l = (2^{3l} - 1)/7$ is the offset at $l$. In this scheme, bin 0 spans 512 Mb, 1–8 span 64 Mb, 9–72 8 Mb, 73–584 1 Mb, 585–4680 128 kb and 4681–37449 span 16 kb intervals. The scheme is very similar to the UCSC binning (Kent et al., 2002) except that in UCSC, $0 \le k \le 4681$ and therefore the smallest bin size is 128 kb.
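The level, size and offset formulas can be checked with a short Python sketch. The function names are ours, and `smallest_bin` is one straightforward way to apply the assignment rule from Section 2.2.1, not necessarily how Tabix implements it. Applying the formulas literally, level 5 holds $8^5 = 32768$ bins numbered 4681 to 37448, so the sketch treats 37448 as the last bin; the upper bound of 37449 quoted above may be the total number of bins rather than the last index.

```python
def bin_level(k):
    """Level l = floor(log2(7k + 1) / 3), computed with integers only."""
    return ((7 * k + 1).bit_length() - 1) // 3


def bin_size(level):
    """s_l = 2^(29 - 3l): number of bases covered by a bin at this level."""
    return 1 << (29 - 3 * level)


def bin_offset(level):
    """o_l = (2^(3l) - 1) / 7: number of the first bin at this level."""
    return ((1 << (3 * level)) - 1) // 7


def bin_interval(k):
    """Half-open interval [(k - o_l) s_l, (k - o_l + 1) s_l) covered by bin k."""
    level = bin_level(k)
    size, offset = bin_size(level), bin_offset(level)
    return (k - offset) * size, (k - offset + 1) * size


def smallest_bin(beg, end):
    """Smallest bin fully containing the zero-based half-open range [beg, end)."""
    last = end - 1
    for level in range(5, 0, -1):
        shift = 29 - 3 * level
        if beg >> shift == last >> shift:  # both ends fall in the same bin
            return bin_offset(level) + (beg >> shift)
    return 0


print(bin_interval(4681))          # first 16 kb bin
print(smallest_bin(16383, 16385))  # straddles two 16 kb bins
```

A record that crosses a 16 kb boundary is pushed up to the 128 kb level, which is why long records end up in a small number of large bins.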

© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com


Data formats and key tools

Heng Li1,†, Bob Handsaker2,†, Alec Wysoker2, Tim Fennell2, Jue Ruan3, Nils Homer4, Gabor Marth5, Goncalo Abecasis6, Richard Durbin1,* and 1000 Genome Project Data Processing Subgroup7

1Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK, 2Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA, 3Beijing Institute of Genomics, Chinese Academy of Science, Beijing 100029, China, 4Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, 5Department of Biology, Boston College, Chestnut Hill, MA 02467, 6Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA and 7http://1000genomes.org

Received on April 28, 2009; revised on May 28, 2009; accepted on May 30, 2009

The Sequence Alignment/Map format and SAMtools

BAM alignment files

Advance Access publication June 8, 2009

Associate Editor: Alfonso Valencia

ABSTRACT

Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer

2 METHODS

2.1 The SAM format

2.1.1 Overview of the SAM format The SAM format consists of one header section and one alignment section. The lines in the header section start with character '@', and lines in the alignment section do not. All lines are TAB delimited. An example is shown in Figure 1b.

In SAM, each alignment line has 11 mandatory fields and a variable number of optional fields. The mandatory fields are briefly described in Table 1. They must be present, but their value can be a '*' or a zero (depending

Vol. 27 no. 15 2011, pages 2156–2158 BIOINFORMATICS APPLICATIONS NOTE doi:10.1093/bioinformatics/btr330

Sequence analysis Advance Access publication June 7, 2011

The variant call format and VCFtools

VCF variant files

Petr Danecek1,†, Adam Auton2,†, Goncalo Abecasis3, Cornelis A. Albers1, Eric Banks4, Mark A. DePristo4, Robert E. Handsaker4, Gerton Lunter2, Gabor T. Marth5, Stephen T. Sherry6, Gilean McVean2,7, Richard Durbin1,* and 1000 Genomes Project Analysis Group‡

1Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SA, 2Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK, 3Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, 4Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02141, 5Department of Biology, Boston College, MA 02467, 6National Institutes of Health, National Center for Biotechnology Information, MD 20894, USA and 7Department of Statistics, University of Oxford, Oxford OX1 3TG, UK

Associate Editor: John Quackenbush

Tabix: fast retrieval of sequence features from generic TAB-delimited files

Heng Li

Program in Medical Population Genetics, The Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA

Associate Editor: Dmitrij Frishman

ABSTRACT

Summary: Tabix is the first generic tool that indexes position sorted files in TAB-delimited formats such as GFF, BED, PSL, SAM and SQL export, and quickly retrieves features overlapping specified regions. Tabix features include few seek function calls per query, data compression with gzip compatibility and direct FTP/HTTP access. Tabix is implemented as a free command-line tool as well as a library in C, Java, Perl and Python. It is particularly useful for manually examining local genomic features on the command line and enables

2 METHODS

Tabix indexing is a generalization of BAM indexing for generic TAB-delimited files. It inherits all the advantages of BAM indexing, including data compression and efficient random access in terms of few seek function calls per query.

2.1 Sorting and BGZF compression Before being indexed, the data file needs to be sorted first by sequence name and then by leftmost coordinate, which can be done with the standard Unix

All indexed for fast retrieval

1000 Genomes is in the Amazon cloud

1KG pilot content (BAM) is available at s3://1000genomes.s3.amazonaws.com
You can see the XML at http://1000genomes.s3.amazonaws.com

     

Data access

• General information
• File access
• 1000 Genomes Browser
• Tools
• Where to find help

http://browser.1000genomes.org

• Gene variation zoom

• Population

• SIFT
• PolyPhen

  

File upload to view with 1000 Genomes data

• Supports popular file types
  – BAM, BED, bedGraph, BigWig, GBrowse, Generic, GFF, GTF, PSL, VCF, WIG

VCF must be indexed

Uploaded VCF example: Comparison of August calls and technical/working/20110502_vqsr_phase1_wgs_snps/ALL.wgs.phase1.projectConsensus.snps.sites.vcf.gz

1000 Genomes Browser

• For further information on the capabilities of the browser and its use, attend the Ensembl "New Users" Workshop on Saturday at 12:30

http://pilotbrowser.1000genomes.org

     


http://browser.1000genomes.org

Tools page

  

Ensembl Variant Effect Predictor (VEP)

• Takes a list of variations and annotates them with respect to Ensembl features
• Returns whether the SNP has been seen in the 1000 Genomes data and whether it has an rs number (if one has been assigned)
• Returns SIFT, PolyPhen and Condel scores
• Extensive filtering options by MAF and populations
• Web and command-line versions

Data slicer for subsets of the data

http://trace.ncbi.nlm.nih.gov/Traces/1kg_slicer

Sliced BAM to files

   

Variation Pattern Finder

• http://browser.1000genomes.org/Homo_sapiens/UserData/VariationsMapVCF
• VCF input
• Discovers patterns of shared inheritance
• Variants with functional consequences considered
• Web output with CSV and Excel downloads

  

  

Access to backend Ensembl databases

• Public MySQL database at
  – mysql-db.1000genomes.org port 4272
• Full programmatic access with Ensembl API
  – More information on the use of the Ensembl API at the Ensembl "Advanced Users" Workshop tomorrow

     


    

Credits & Contact

• Eugene Kulesha, Iliana Toneva, Bren Vaughan
• Will McLaren, Graham Ritchie, Fiona Cunningham
• Laura Clarke, Holly Zheng-Bradley, Rick Smith
• Steve Sherry, Chunlin Xiao

For more information contact info@1000genomes.org


ftp://ftp.1000genomes.ebi.ac.uk
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp

• Sequences & alignments by sample ID
• Datasets to accompany the pilot data publication
• Current and archive dataset releases
• Pre-release datasets and project working materials
• Site documentation


BIOINFORMATICS APPLICATIONS NOTE Vol. 25 no. 16 2009, pages 2078–2079 doi:10.1093/bioinformatics/btp352

Sequence analysis

and thus provides universal tools for processing read alignments.
Availability: http://samtools.sourceforge.net
Contact: rd@sanger.ac.uk

1 INTRODUCTIONWith the advent of novel sequencing technologies such asIlluminaSolexa ABSOLiD and Roche454 (Mardis 2008) avariety of new alignment tools (Langmead et al 2009 Li et al2008) have been designed to realize efficient read mapping againstlarge reference sequences including the human genome These toolsgenerate alignments in different formats however complicatingdownstream processing A common alignment format that supportsall sequence types and aligners creates a well-defined interfacebetween alignment and downstream analyses including variantdetection genotyping and assembly

The Sequence AlignmentMap (SAM) format is designed toachieve this goal It supports single- and paired-end reads andcombining reads of different types including color space reads fromABSOLiD It is designed to scale to alignment sets of 1011 or morebase pairs which is typical for the deep resequencing of one humanindividual

In this article we present an overview of the SAM format andbriefly introduce the companion SAMtools software package Adetailed format specification and the complete documentation ofSAMtools are available at the SAMtools web site

lowastTo whom correspondence should be addresseddaggerThe authors wish it to be known that in their opinion the first two authorsshould be regarded as Joint First Authors

on the field) if the corresponding information is unavailable The optionalfields are presented as key-value pairs in the format of TAGTYPEVALUEThey store extra information from the platform or aligner For example thelsquoRGrsquo tag keeps the lsquoread grouprsquo information for each read In combinationwith the lsquoRGrsquo header lines this tag allows each read to be labeled withmetadata about its origin sequencing center and library The SAM formatspecification gives a detailed description of each field and the predefinedTAGs

212 Extended CIGAR The standard CIGAR description of pairwisealignment defines three operations lsquoMrsquo for matchmismatch lsquoIrsquo for insertioncompared with the reference and lsquoDrsquo for deletion The extended CIGARproposed in SAM added four more operations lsquoNrsquo for skipped bases onthe reference lsquoSrsquo for soft clipping lsquoHrsquo for hard clipping and lsquoPrsquo for paddingThese support splicing clipping multi-part and padded alignments Figure 1shows examples of CIGAR strings for different types of alignments

213 Binary AlignmentMap format To improve the performance wedesigned a companion format Binary AlignmentMap (BAM) which is thebinary representation of SAM and keeps exactly the same information asSAM BAM is compressed by the BGZF library a generic library developedby us to achieve fast random access in a zlib-compatible compressed fileAn example alignment of 112 Gbp of Illumina GA data requires 116 GB ofdisk space (10 byte per input base) including sequences base qualities andall the meta information generated by MAQ Most of this space is used tostore the base qualities

214 Sorting and indexing A SAMBAM file can be unsorted but sortingby coordinate is used to streamline data processing and to avoid loadingextra alignments into memory A position-sorted BAM file can be indexedWe combine the UCSC binning scheme (Kent et al 2002) and simplelinear indexing to achieve fast random retrieval of alignments overlapping a

copy 2009 The Author(s)This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (httpcreativecommonsorglicensesby-nc20uk) which permits unrestricted non-commercial use distribution and reproduction in any medium provided the original work is properly cited

at Genom

e Research Ltd on O

ctober 13 2011bioinform

aticsoxfordjournalsorgD

ownloaded from

[1226 572011 Bioinformatics-btr330tex] Page 2156 2156ndash2158

ABSTRACTSummary The variant call format (VCF) is a generic format forstoring DNA polymorphism data such as SNPs insertions deletionsand structural variants together with rich annotations VCF is usuallystored in a compressed manner and can be indexed for fast dataretrieval of variants from a range of positions on the referencegenome The format was developed for the 1000 Genomes Projectand has also been adopted by other projects such as UK10KdbSNP and the NHLBI Exome Project VCFtools is a software suitethat implements various utilities for processing VCF files includingvalidation merging comparing and also provides a general Perl APIAvailability httpvcftoolssourceforgenetContact rdsangeracuk

Received on October 28 2010 revised on May 4 2011 acceptedon May 28 2011

1 INTRODUCTIONOne of the main uses of next-generation sequencing is to discovervariation among large populations of related samples Recentlya format for storing next-generation read alignments has beenstandardized by the SAMBAM file format specification (Li et al2009) This has significantly improved the interoperability of next-generation tools for alignment visualization and variant calling Wepropose the variant call format (VCF) as a standardized format forstoring the most prevalent types of sequence variation includingSNPs indels and larger structural variants together with richannotations The format was developed with the primary intentionto represent human genetic variation but its use is not restrictedto diploid genomes and can be used in different contexts as wellIts flexibility and user extensibility allows representation of a widevariety of genomic variation with respect to a single referencesequence

lowastTo whom correspondence should be addresseddaggerThe authors wish it to be known that in their opinion the first two authorsshould be regarded as joint First AuthorsDaggerhttpwww1000genomesorg

Although generic feature format (GFF) has recently been extendedto standardize storage of variant information in genome variantformat (GVF) (Reese et al 2010) this is not tailored for storinginformation across many samples We have designed the VCFformat to be scalable so as to encompass millions of sites withgenotype data and annotations from thousands of samples We haveadopted a textual encoding with complementary indexing to alloweasy generation of the files while maintaining fast data accessIn this article we present an overview of the VCF and brieflyintroduce the companion VCFtools software package A detailedformat specification and the complete documentation of VCFtoolsare available at the VCFtools web site

2 METHODS

21 The VCF211 Overview of the VCF A VCF file (Fig 1a) consists of a headersection and a data section The header contains an arbitrary number of meta-information lines each starting with characters lsquorsquo and a TAB delimitedfield definition line starting with a single lsquorsquocharacter The meta-informationheader lines provide a standardized description of tags and annotations usedin the data section The use of meta-information allows the informationstored within a VCF file to be tailored to the dataset in question It canbe also used to provide information about the means of file creation dateof creation version of the reference sequence software used and any otherinformation relevant to the history of the file The field definition line nameseight mandatory columns corresponding to data columns representing thechromosome (CHROM) a 1-based position of the start of the variant (POS)unique identifiers of the variant (ID) the reference allele (REF) a commaseparated list of alternate non-reference alleles (ALT) a phred-scaled qualityscore (QUAL) site filtering information (FILTER) and a semicolon separatedlist of additional user extensible annotation (INFO) In addition if samplesare present in the file the mandatory header columns are followed by aFORMAT column and an arbitrary number of sample IDs that define thesamples included in the VCF file The FORMAT column is used to definethe information contained within each subsequent genotype column whichconsists of a colon separated list of fields For example the FORMAT fieldGTGQDP in the fourth data entry of Figure 1a indicates that the subsequententries contain information regarding the genotype genotype quality andread depth for each sample All data lines are TAB delimited and the numberof fields in each data line must match the number of fields in the header lineIt is strongly recommended that all annotation tags used are declared in theVCF header section

copy The Author(s) 2011 Published by Oxford University PressThis is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (httpcreativecommonsorglicensesby-nc25) which permits unrestricted non-commercial use distribution and reproduction in any medium provided the original work is properly cited

at Genom

e Research Ltd on Septem

ber 19 2011bioinform

aticsoxfordjournalsorgD

ownloaded from

[1450 222011 Bioinformatics-btq671tex] Page 718 718ndash719

BIOINFORMATICS APPLICATIONS NOTE Vol 27 no 5 2011 pages 718ndash719doi101093bioinformaticsbtq671

Sequence analysis Advance Access publication January 5 2011

genome viewers to support huge data files and remote custom tracksover networksAvailability and Implementation httpsamtoolssourceforgenetContact henglibroadinstituteorg

Received on October 14 2010 revised and accepted on December1 2010

1 INTRODUCTIONWhen we examine local genomic features on the command lineor via a genome viewer we frequently need to perform intervalqueries retrieving features overlapping specified regions We mayread through the entire data file if we only perform interval queriesa few times or preload the file in memory if interval queries arefrequently performed (eg by genome viewers) However readingthe entire file makes both strategies inefficient given huge datasetsA solution to the efficiency problem is to build a database for thedata file (Kent et al 2002 Stein et al 2002) but this is notoptimal either because generic database indexing algorithms arenot specialized for biological data (Alekseyenko and Lee 2007) Thetechnical complexity of setting up databases and designing schemasalso hampers the adoption of the database approach for an averageend user

Under the circumstances a few specialized binary formatsincluding bigBedbigWig (Kent et al 2010) and BAM (Li et al2009) have been developed very recently to achieve efficient randomaccess to huge datasets while supporting data compression andremote file access which greatly helps routine data processing anddata visualization At present these advanced indexing techniquesare only applied to BED Wiggle and BAM Nonetheless as mostTAB-delimited biological data formats (eg PSL GFF SAM VCFand many UCSC database dumps) contain chromosomal positionsone can imagine that a generic tool that indexes for all these formatsis feasible And this tool is Tabix In this article I will explain thetechnical advances in Tabix indexing describe the algorithm andevaluate its performance on biological data

sort The sorted file should then be compressed with the bgzip programthat comes with the Tabix package Bgzip compresses the data file in theBGZF format which is the concatenation of a series of gzip blocks with eachblock holding at most 216 bytes of uncompressed data In the compressedfile each uncompressed byte in the text data file is assigned a unique 64-bitvirtual file offset where the higher 48 bits keep the real file offset of the startof the gzip block the byte falls in and the lower 16 bits store the offset ofthe byte inside the gzip block Given a virtual file offset one can directlyseek to the start of the gzip block using the higher 48 bits decompressthe block and retrieve the byte pointed by the lower 16 bits of the virtualoffset Random access can thus be achieved without the help of additionalindex structures As gzip works with concatenated gzip files it can alsoseamlessly decompress a BGZF file The detailed description of the BGZFformat is described in the SAM specification

22 Coupled binning and linear indicesTabix builds two types of indices for a data file a binning index and alinear index We can actually achieve fast retrieval with only one of themHowever using the binning index alone may incur many unnecessary seekcalls while using the linear index alone has bad worst-case performance(when some records span very long distances) Using them together avoidstheir weakness

221 The binning index The basic idea of binning is to cluster recordsinto large intervals called bins A record is assigned to bin k if k is the binof the smallest size that fully contains the record For each bin we keepin the index file the virtual file offsets of all records assigned to the binWhen we search for records overlapping a query interval we first collectbins overlapping the interval and then test each record in the collected binsfor overlaps

In principle bins can be selected freely as long as each record can beassigned to a bin In the Tabix binning index we adopt a multilevel binningscheme where bins at same level are non-overlapping and of the same size InTabix each bin k 0lek le37449 represents a half-close-half-open interval[(kminusol)sl(kminusol +1)sl) where l=log2(7k+1)3 is the level of the binsl =229minus3l is the size of the bin at level l and ol = (23l minus1)7 is the offset at lIn this scheme bin 0 spans 512 Mb 1ndash8 span 64 Mb 9ndash72 8 Mb 73ndash5841 Mb 585ndash4680 128 kb and 4681ndash37449 span 16 kb intervals The schemeis very similar to the UCSC binning (Kent et al 2002) except that in UCSC0lek le4681 and therefore the smallest bin size is 128 kb

© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com


Data formats and key tools

BAM alignment files

The Sequence Alignment/Map format and SAMtools

Heng Li1,†, Bob Handsaker2,†, Alec Wysoker2, Tim Fennell2, Jue Ruan3, Nils Homer4, Gabor Marth5, Goncalo Abecasis6, Richard Durbin1,* and 1000 Genome Project Data Processing Subgroup7

1Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK, 2Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA, 3Beijing Institute of Genomics, Chinese Academy of Science, Beijing 100029, China, 4Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, 5Department of Biology, Boston College, Chestnut Hill, MA 02467, 6Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA and 7http://1000genomes.org

Received on April 28, 2009; revised on May 28, 2009; accepted on May 30, 2009

Advance Access publication June 8, 2009

Associate Editor: Alfonso Valencia

ABSTRACT

Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer

2 METHODS

2.1 The SAM format

2.1.1 Overview of the SAM format

The SAM format consists of one header section and one alignment section. The lines in the header section start with character '@', and lines in the alignment section do not. All lines are TAB delimited. An example is shown in Figure 1b.

In SAM, each alignment line has 11 mandatory fields and a variable number of optional fields. The mandatory fields are briefly described in Table 1. They must be present, but their value can be a '*' or a zero (depending

VCF variant files

BIOINFORMATICS APPLICATIONS NOTE Vol. 27 no. 15 2011, pages 2156–2158, doi:10.1093/bioinformatics/btr330

Sequence analysis. Advance Access publication June 7, 2011

The variant call format and VCFtools

Petr Danecek1,†, Adam Auton2,†, Goncalo Abecasis3, Cornelis A. Albers1, Eric Banks4, Mark A. DePristo4, Robert E. Handsaker4, Gerton Lunter2, Gabor T. Marth5, Stephen T. Sherry6, Gilean McVean2,7, Richard Durbin1,* and 1000 Genomes Project Analysis Group‡

1Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, 2Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK, 3Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, 4Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02141, 5Department of Biology, Boston College, MA 02467, 6National Institutes of Health National Center for Biotechnology Information, MD 20894, USA and 7Department of Statistics, University of Oxford, Oxford OX1 3TG, UK

Associate Editor: John Quackenbush

Tabix: fast retrieval of sequence features from generic TAB-delimited files

Heng Li, Program in Medical Population Genetics, The Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA

Associate Editor: Dmitrij Frishman

ABSTRACT

Summary: Tabix is the first generic tool that indexes position sorted files in TAB-delimited formats such as GFF, BED, PSL, SAM and SQL export, and quickly retrieves features overlapping specified regions. Tabix features include few seek function calls per query, data compression with gzip compatibility and direct FTP/HTTP access. Tabix is implemented as a free command-line tool as well as a library in C, Java, Perl and Python. It is particularly useful for manually examining local genomic features on the command line and enables

2 METHODS

Tabix indexing is a generalization of BAM indexing for generic TAB-delimited files. It inherits all the advantages of BAM indexing, including data compression and efficient random access in terms of few seek function calls per query.

2.1 Sorting and BGZF compression

Before being indexed, the data file needs to be sorted first by sequence name and then by leftmost coordinate, which can be done with the standard Unix

All indexed for fast retrieval

1000 Genomes is in the Amazon cloud

1KG pilot content (BAM) is available at s3://1000genomes.s3.amazonaws.com
You can see the XML at http://1000genomes.s3.amazonaws.com

     


http://browser.1000genomes.org

• Gene variation zoom
• Population
• SIFT
• PolyPhen

File upload to view with 1000 Genomes data

• Supports popular file types
  – BAM, BED, bedGraph, BigWig, GBrowse, Generic, GFF, GTF, PSL, VCF, WIG

VCF must be indexed

Uploaded VCF

Example: Comparison of August calls and technical/working/20110502_vqsr_phase1_wgs_snps/ALL.wgs.phase1.projectConsensus.snps.sites.vcf.gz

1000 Genomes Browser

• For further information on the capabilities of the browser and its use, attend the Ensembl “New Users” Workshop on Saturday at 12:30

http://pilotbrowser.1000genomes.org

     


http://browser.1000genomes.org

Tools page

Ensembl Variant Effect Predictor (VEP)

• Takes a list of variations and annotates them with respect to Ensembl features
• Returns whether the SNP has been seen in the 1000 Genomes and if it has an rs number (if one has been assigned)
• Returns SIFT, PolyPhen and Condel scores
• Extensive filtering options by MAF and populations
• Web and command line versions

Data slicer for subsets of the data

http://trace.ncbi.nlm.nih.gov/Traces/1kg_slicer

Sliced BAM to files

   

Variation Pattern Finder

• http://browser.1000genomes.org/Homo_sapiens/UserData/VariationsMapVCF
• VCF input
• Discovers patterns of Shared Inheritance
• Variants with functional consequences considered
• Web output with csv and excel downloads

Access to backend Ensembl databases

• Public MySQL database at
  – mysql-db.1000genomes.org port 4272
• Full programmatic access with Ensembl API
  – More information on the use of the Ensembl API at the Ensembl “Advanced Users” Workshop tomorrow

     


    

Credits & Contact

• Eugene Kulesha, Iliana Toneva, Bren Vaughan
• Will McLaren, Graham Ritchie, Fiona Cunningham
• Laura Clarke, Holly Zheng-Bradley, Rick Smith
• Steve Sherry, Chunlin Xiao

For more information contact info@1000genomes.org


BIOINFORMATICS APPLICATIONS NOTE Vol. 25 no. 16 2009, pages 2078–2079, doi:10.1093/bioinformatics/btp352

Sequence analysis

and thus provides universal tools for processing read alignments.

Availability: http://samtools.sourceforge.net

Contact: rd@sanger.ac.uk

1 INTRODUCTION

With the advent of novel sequencing technologies such as Illumina/Solexa, AB/SOLiD and Roche/454 (Mardis, 2008), a variety of new alignment tools (Langmead et al., 2009; Li et al., 2008) have been designed to realize efficient read mapping against large reference sequences, including the human genome. These tools generate alignments in different formats, however, complicating downstream processing. A common alignment format that supports all sequence types and aligners creates a well-defined interface between alignment and downstream analyses, including variant detection, genotyping and assembly.

The Sequence Alignment/Map (SAM) format is designed to achieve this goal. It supports single- and paired-end reads and combining reads of different types, including color space reads from AB/SOLiD. It is designed to scale to alignment sets of $10^{11}$ or more base pairs, which is typical for the deep resequencing of one human individual.

In this article, we present an overview of the SAM format and briefly introduce the companion SAMtools software package. A detailed format specification and the complete documentation of SAMtools are available at the SAMtools web site.

*To whom correspondence should be addressed.
†The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.

on the field) if the corresponding information is unavailable. The optional fields are presented as key-value pairs in the format of TAG:TYPE:VALUE. They store extra information from the platform or aligner. For example, the 'RG' tag keeps the 'read group' information for each read. In combination with the '@RG' header lines, this tag allows each read to be labeled with metadata about its origin, sequencing center and library. The SAM format specification gives a detailed description of each field and the predefined TAGs.
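To make the TAG:TYPE:VALUE layout concrete, here is a small Python sketch that splits one optional field and converts integer and float values. The function name and the example field values are hypothetical, and only the 'i' and 'f' type codes are converted; every other type is returned as a string.

```python
# Sketch of parsing one SAM optional field (function name is hypothetical).
def parse_optional_field(field):
    """Split 'TAG:TYPE:VALUE' and convert the value for numeric types.

    The value may itself contain ':' characters, so split at most twice.
    """
    parts = field.split(":", 2)
    if len(parts) != 3:
        raise ValueError("expected TAG:TYPE:VALUE, got %r" % field)
    tag, type_code, value = parts
    if type_code == "i":        # signed integer
        return tag, int(value)
    if type_code == "f":        # floating point number
        return tag, float(value)
    return tag, value           # other types are kept as text


def parse_optional_fields(columns):
    """Turn the optional columns of an alignment line into a dict."""
    return dict(parse_optional_field(column) for column in columns)
```

For instance, `parse_optional_fields(["RG:Z:group1", "NM:i:2"])` yields a dictionary mapping 'RG' to the read group name and 'NM' to an integer.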

2.1.2 Extended CIGAR

The standard CIGAR description of pairwise alignment defines three operations: 'M' for match/mismatch, 'I' for insertion compared with the reference and 'D' for deletion. The extended CIGAR proposed in SAM added four more operations: 'N' for skipped bases on the reference, 'S' for soft clipping, 'H' for hard clipping and 'P' for padding. These support splicing, clipping, multi-part and padded alignments. Figure 1 shows examples of CIGAR strings for different types of alignments.
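The seven operations listed above can be illustrated with a short Python sketch that parses a CIGAR string and derives how many reference and read bases it covers. The function names are hypothetical. The sketch assumes that 'M', 'D' and 'N' advance along the reference and that 'M', 'I' and 'S' account for bases stored in the read, which follows from the definitions of the operations.

```python
# Sketch of extended CIGAR handling (function names are hypothetical).
import re

_CIGAR_ITEM = re.compile(r"(\d+)([MIDNSHP])")
REFERENCE_OPS = "MDN"   # operations that advance along the reference
READ_OPS = "MIS"        # operations that account for bases stored in the read


def parse_cigar(cigar):
    """Return a list of (length, operation) pairs, e.g. [(8, 'M'), (1, 'D')]."""
    items = [(int(length), op) for length, op in _CIGAR_ITEM.findall(cigar)]
    # Rebuild the string to reject anything the pattern silently skipped.
    if "".join("%d%s" % item for item in items) != cigar:
        raise ValueError("malformed CIGAR string: %r" % cigar)
    return items


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(length for length, op in parse_cigar(cigar) if op in REFERENCE_OPS)


def read_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(length for length, op in parse_cigar(cigar) if op in READ_OPS)
```

With a spliced alignment such as `9M32N8M`, the read contributes 17 bases while the alignment spans 49 reference bases, because the skipped region counts only on the reference.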

2.1.3 Binary Alignment/Map format

To improve the performance, we designed a companion format, Binary Alignment/Map (BAM), which is the binary representation of SAM and keeps exactly the same information as SAM. BAM is compressed by the BGZF library, a generic library developed by us to achieve fast random access in a zlib-compatible compressed file. An example alignment of 112 Gbp of Illumina GA data requires 116 GB of disk space (1.0 byte per input base), including sequences, base qualities and all the meta information generated by MAQ. Most of this space is used to store the base qualities.

2.1.4 Sorting and indexing

A SAM/BAM file can be unsorted, but sorting by coordinate is used to streamline data processing and to avoid loading extra alignments into memory. A position-sorted BAM file can be indexed. We combine the UCSC binning scheme (Kent et al., 2002) and simple linear indexing to achieve fast random retrieval of alignments overlapping a

© 2009 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/), which permits unrestricted non-commercial use, distribution and reproduction in any medium, provided the original work is properly cited.


ABSTRACT

Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging and comparing, and also provides a general Perl API.

Availability: http://vcftools.sourceforge.net

Contact: rd@sanger.ac.uk

Received on October 28, 2010; revised on May 4, 2011; accepted on May 28, 2011

1 INTRODUCTION

One of the main uses of next-generation sequencing is to discover variation among large populations of related samples. Recently, a format for storing next-generation read alignments has been standardized by the SAM/BAM file format specification (Li et al., 2009). This has significantly improved the interoperability of next-generation tools for alignment, visualization and variant calling. We propose the variant call format (VCF) as a standardized format for storing the most prevalent types of sequence variation, including SNPs, indels and larger structural variants, together with rich annotations. The format was developed with the primary intention to represent human genetic variation, but its use is not restricted to diploid genomes and it can be used in different contexts as well. Its flexibility and user extensibility allow representation of a wide variety of genomic variation with respect to a single reference sequence.

*To whom correspondence should be addressed.
†The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
‡http://www.1000genomes.org

Although generic feature format (GFF) has recently been extended to standardize storage of variant information in genome variant format (GVF) (Reese et al., 2010), this is not tailored for storing information across many samples. We have designed the VCF format to be scalable so as to encompass millions of sites with genotype data and annotations from thousands of samples. We have adopted a textual encoding, with complementary indexing, to allow easy generation of the files while maintaining fast data access. In this article, we present an overview of the VCF and briefly introduce the companion VCFtools software package. A detailed format specification and the complete documentation of VCFtools are available at the VCFtools web site.

2 METHODS

2.1 The VCF

2.1.1 Overview of the VCF

A VCF file (Fig. 1a) consists of a header section and a data section. The header contains an arbitrary number of meta-information lines, each starting with characters '##', and a TAB delimited field definition line, starting with a single '#' character. The meta-information header lines provide a standardized description of tags and annotations used in the data section. The use of meta-information allows the information stored within a VCF file to be tailored to the dataset in question. It can also be used to provide information about the means of file creation, date of creation, version of the reference sequence, software used and any other information relevant to the history of the file.

The field definition line names eight mandatory columns, corresponding to data columns representing the chromosome (CHROM), a 1-based position of the start of the variant (POS), unique identifiers of the variant (ID), the reference allele (REF), a comma separated list of alternate non-reference alleles (ALT), a phred-scaled quality score (QUAL), site filtering information (FILTER) and a semicolon separated list of additional, user extensible annotation (INFO). In addition, if samples are present in the file, the mandatory header columns are followed by a FORMAT column and an arbitrary number of sample IDs that define the samples included in the VCF file. The FORMAT column is used to define the information contained within each subsequent genotype column, which consists of a colon separated list of fields. For example, the FORMAT field GT:GQ:DP in the fourth data entry of Figure 1a indicates that the subsequent entries contain information regarding the genotype, genotype quality and read depth for each sample. All data lines are TAB delimited and the number of fields in each data line must match the number of fields in the header line. It is strongly recommended that all annotation tags used are declared in the VCF header section.
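The column layout described above can be sketched as a small Python parser for one data line. This is a simplified illustration, not the VCFtools API: the function names are hypothetical, the example header and data line are made up for the sketch, and QUAL and FILTER are left as plain strings.

```python
# Sketch of parsing one VCF data line (names and example data are hypothetical).
EXAMPLE_HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2"
EXAMPLE_LINE = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;DB\tGT:GQ:DP\t0|0:48:1\t1|0:48:8"


def parse_info(info):
    """Split the semicolon separated INFO column; bare keys become flags."""
    if info == ".":
        return {}
    parsed = {}
    for item in info.split(";"):
        key, has_value, value = item.partition("=")
        parsed[key] = value if has_value else True
    return parsed


def parse_vcf_record(header_line, data_line):
    """Return a dict for one data line, keyed by the header column names."""
    columns = header_line.rstrip("\n").lstrip("#").split("\t")
    fields = data_line.rstrip("\n").split("\t")
    # The number of fields must match the field definition line.
    if len(fields) != len(columns):
        raise ValueError("data line and header line differ in field count")
    record = dict(zip(columns[:8], fields[:8]))
    record["POS"] = int(record["POS"])            # 1-based start position
    record["ALT"] = record["ALT"].split(",")      # comma separated alleles
    record["INFO"] = parse_info(record["INFO"])
    samples = {}
    if len(columns) > 9:
        keys = fields[8].split(":")               # e.g. GT:GQ:DP
        for name, column in zip(columns[9:], fields[9:]):
            samples[name] = dict(zip(keys, column.split(":")))
    record["samples"] = samples
    return record
```

Calling `parse_vcf_record(EXAMPLE_HEADER, EXAMPLE_LINE)` gives a record whose `samples` entry maps each sample ID to its genotype, genotype quality and read depth strings.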

© The Author(s) 2011. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution and reproduction in any medium, provided the original work is properly cited.


BIOINFORMATICS APPLICATIONS NOTE Vol. 27 no. 5 2011, pages 718–719, doi:10.1093/bioinformatics/btq671

Sequence analysis. Advance Access publication January 5, 2011

genome viewers to support huge data files and remote custom tracks over networks.

Availability and Implementation: http://samtools.sourceforge.net

Contact: hengli@broadinstitute.org

Received on October 14, 2010; revised and accepted on December 1, 2010

1 INTRODUCTION

When we examine local genomic features on the command line or via a genome viewer, we frequently need to perform interval queries, retrieving features overlapping specified regions. We may read through the entire data file if we only perform interval queries a few times, or preload the file in memory if interval queries are frequently performed (e.g. by genome viewers). However, reading the entire file makes both strategies inefficient given huge datasets. A solution to the efficiency problem is to build a database for the data file (Kent et al., 2002; Stein et al., 2002), but this is not optimal either, because generic database indexing algorithms are not specialized for biological data (Alekseyenko and Lee, 2007). The technical complexity of setting up databases and designing schemas also hampers the adoption of the database approach for an average end user.

Under the circumstances, a few specialized binary formats, including bigBed/bigWig (Kent et al., 2010) and BAM (Li et al., 2009), have been developed very recently to achieve efficient random access to huge datasets while supporting data compression and remote file access, which greatly helps routine data processing and data visualization. At present, these advanced indexing techniques are only applied to BED, Wiggle and BAM. Nonetheless, as most TAB-delimited biological data formats (e.g. PSL, GFF, SAM, VCF and many UCSC database dumps) contain chromosomal positions, one can imagine that a generic tool that indexes for all these formats is feasible. And this tool is Tabix. In this article, I will explain the technical advances in Tabix indexing, describe the algorithm and evaluate its performance on biological data.



