The quality of data is very important for various downstream analyses, such as sequence assembly and single nucleotide polymorphism (SNP) identification. This presentation covers the parameters used for NGS data quality checks and the data formats of the major sequencing machines.
NGS
Data Formats & QC Analysis
Karan Veer Singh, Scientist, NBAGR
04/11/23
Sequence Formats
Most sequence formats are ASCII text containing the sequence ID, quality scores, annotation details, comments, and other descriptions of the sequence
Formats are designed to hold sequence data and other information about the sequence
Why so many formats?
Created based on the information required for each step of analysis
Efficient data and time management
Data formats vary in the information they contain
Types of sequence file formats
• Raw sequence files
• Co-ordinate files
• Parameter files
• Annotation files
• Metadata files
Read output formats
454
Solexa/Illumina
SOLiD
454 output formats
.sff (Standard Flowgram Format)
.fna (FASTA-format sequences)
.qual (per-base quality scores)
Illumina output formats
.seq.txt
.prb.txt
Illumina FASTQ (ASCII value - 64 = Illumina quality score)
Qseq (ASCII value - 64 = Phred score)
SCARF, the Illumina single-line format (Solexa Compact ASCII Read Format)
Phred quality scores
ASCII value of h = 104
Quality of base A at position 1 = 104 - 64 = 40, where 40 is the Phred score
Illumina FastQ
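The offset arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names are hypothetical, the default offset of 64 follows the Illumina FASTQ encoding described here, and 33 is the offset used by Sanger ("standard") FASTQ. The error probability uses the usual Phred definition Q = -10 log10(P).

```python
def phred_scores(quality, offset=64):
    """Convert an ASCII-encoded quality string to a list of Phred scores.

    offset=64 matches the Illumina FASTQ encoding on these slides;
    pass offset=33 for Sanger ("standard") FASTQ.
    """
    scores = [ord(ch) - offset for ch in quality]
    if any(score < 0 for score in scores):
        # A negative score usually means the wrong offset was assumed.
        raise ValueError("quality character below offset; check the encoding")
    return scores


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


print(phred_scores("hgfB"))        # Illumina encoding, offset 64
print(phred_scores("II5#", 33))    # Sanger encoding, offset 33
print(error_probability(20))       # Phred 20 = 1 error in 100 calls
```

Raising an error on negative scores is a design choice of this sketch: it catches the common mistake of reading a Sanger-encoded file with the Illumina offset.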
SOLiD output format(s)
CSFASTA: color-space sequence reads in a FASTA format
These reads can be retained and analyzed in color-space by software
The Format Conversion Tool offers options for cleaning of the CSFASTA files
Read Length
• Sanger read lengths: ~800-2000 bp
• Generally we define short reads as anything below 200 bp
− Illumina (100-250 bp)
− SOLiD (75 bp max)
− Ion Torrent (200-300 bp max, currently...)
− Roche 454 (400-800 bp)
• Even with these platforms it is cheaper to produce short reads (e.g. 50 bp) rather than 100 or 200 bp reads
• Diminishing returns: for some applications 50 bp is more than sufficient
− Resequencing of smaller organisms
− Bacterial de novo assembly
− ChIP-Seq
− Digital gene expression profiling
− Bacterial RNA-seq
Alignment/Assembly Format
Common (“standard”) formats for read alignments:
SAM
BAM (= binary SAM)
MAQ
Sequencers & Sequence Assembly Packages
Formats for Genome/Gene annotation
BED format (genome-browser tracks)
GFF format (gene/genome features)
BioXSD (XML) (any annotation; under development)
If reads should be deposited in a public repository:
SRA (Short Read Archive) at NCBI
Points to remember on Data Formats
For base-call data: “standard” FASTQ (Sanger, Phred)
For read alignments: SAM/BAM/MAQ format
For annotation results: e.g. GFF or BED format
QC analysis
All platforms have errors
Illumina, SOLiD/ABI-Life, Roche 454, Ion Torrent
1. Removal of low quality bases/low complexity regions
2. Removal of adaptor sequences
3. Homopolymer-associated base call errors (3 or more identical DNA bases) cause a higher number of (artificial) frameshifts
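To make the homopolymer definition concrete (3 or more identical bases in a row), here is a minimal Python sketch that locates such runs in a read. The function name and the returned tuple layout are hypothetical choices for illustration. It only flags where homopolymers occur and does not attempt to correct base calls.

```python
import re


def homopolymer_runs(seq, min_len=3):
    """Return (start, base, length) for each run of >= min_len identical bases.

    start is a 0-based index into seq. min_len=3 follows the
    "3 or more identical DNA bases" definition used on this slide.
    """
    # ([ACGTN]) captures one base; \1{k,} requires at least k more copies.
    pattern = re.compile(r"([ACGTN])\1{%d,}" % (min_len - 1))
    return [
        (match.start(), match.group(1), len(match.group(0)))
        for match in pattern.finditer(seq.upper())
    ]


print(homopolymer_runs("ACGTTTTAGGGC"))
```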
Illumina artefacts
Under-represented GC-rich regions (PCR, sequencing)
GGC/GCC motif is associated with low quality and mismatches
Low-quality reads (Phred score < 20)
Need for QC & Preprocessing
QC analysis of sequence data is extremely important for meaningful downstream analysis
To analyze problems in quality scores/ statistics of sequencing data
To check whether further analysis with sequence is possible
To remove redundancy (filtering)
To remove low quality reads from analysis
To remove adapter contamination
Highly efficient and fast processing tools are required to handle large volumes of data
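Two of the preprocessing steps listed above, adapter removal and low-quality read filtering, can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of any particular tool: the adapter is matched exactly (real trimmers tolerate mismatches and partial adapters), the adapter sequence in the usage example is made up, and the thresholds (Phred >= 20 on at least 70% of bases, offset 33) are assumed defaults that should be adjusted to the data.

```python
def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual          # adapter not present, read unchanged
    return seq[:pos], qual[:pos]  # keep sequence and qualities in sync


def is_high_quality(qual, offset=33, min_q=20, min_fraction=0.7):
    """True if at least min_fraction of bases have Phred score >= min_q.

    offset, min_q and min_fraction are assumptions for this sketch.
    """
    if not qual:
        return False  # an empty read (fully trimmed) is discarded
    good = sum(1 for ch in qual if ord(ch) - offset >= min_q)
    return good / len(qual) >= min_fraction


def preprocess(reads, adapter, **thresholds):
    """Yield (name, seq, qual) for reads that survive trimming and filtering."""
    for name, seq, qual in reads:
        seq, qual = trim_adapter(seq, qual, adapter)
        if is_high_quality(qual, **thresholds):
            yield name, seq, qual


reads = [
    ("r1", "ACGTACGTAGATCGGAAG", "IIIIIIIIIIIIIIIIII"),
    ("r2", "ACGTACGT", "########"),
]
for record in preprocess(reads, adapter="AGATCGGAAG"):
    print(record)
```

Writing `preprocess` as a generator keeps memory use flat, which matters for the large data volumes mentioned above.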
The quality of data is very important for various downstream analyses, such as sequence assembly and single nucleotide polymorphism (SNP) identification
Most of the programs available for downstream analyses do not provide a utility for quality checking and filtering of NGS data before processing
NGS QC Toolkit & FastQC
NGS QC Toolkit performs quality checks and filters for high-quality reads
This toolkit is a standalone, open source application freely available at http://www.nipgr.res.in/ngsqctoolkit.html
The applications are implemented in the Perl programming language
Supports QC of sequencing data generated using Roche 454 and Illumina platforms
Additional tools aid QC (sequence format converters and trimming tools) and analysis (statistics tools)
FastQC can be used only for preliminary analysis
NGSQC toolkit Output
Comparison - QC tools
FastQC
• Basic statistics
• Quality: per base position
• Per sequence quality distribution
• Nucleotide content per position
• Per sequence GC distribution
• Per base GC distribution
• Per base N content
• Length distribution
• Overrepresented/duplicated sequences
• K-mer content
FastQC (Box-Whisker plot)
Y axis: quality score
X axis: base position
2. Quality: per base position
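The box-whisker view (quality score against base position) boils down to a five-number summary per position. The sketch below shows one way to compute it. It is an illustration only and not FastQC's own implementation: the function name, the offset of 33, and the "inclusive" quartile method are assumptions, and FastQC draws its whiskers at the 10th and 90th percentiles rather than at the minimum and maximum used here.

```python
from statistics import median, quantiles


def per_position_quality(quals, offset=33):
    """Summarise Phred scores at each base position across reads.

    quals is a list of quality strings; reads may differ in length.
    Returns one dict per position with min, q1, median, q3 and max.
    """
    if not quals:
        return []
    longest = max(len(q) for q in quals)
    summary = []
    for pos in range(longest):
        # Collect the score at this position from every read long enough.
        column = sorted(ord(q[pos]) - offset for q in quals if len(q) > pos)
        if len(column) >= 2:
            q1, _, q3 = quantiles(column, n=4, method="inclusive")
        else:
            q1 = q3 = column[0]  # a single value has no spread
        summary.append({
            "position": pos + 1,  # 1-based, as on the plot's X axis
            "min": column[0],
            "q1": q1,
            "median": median(column),
            "q3": q3,
            "max": column[-1],
        })
    return summary


for row in per_position_quality(["IIII", "I5I#", "5I5"]):
    print(row)
```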
3. Per sequence quality distribution
4. Nucleotide content per position
5. Per sequence GC distribution
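A per-sequence GC distribution can be approximated by computing the GC percentage of each read and counting how many reads fall at each value. This is a rough sketch with hypothetical function names. Excluding N bases from the denominator and binning by rounding to whole percentages are choices made for this example, not documented behaviour of any tool.

```python
from collections import Counter


def gc_percent(seq):
    """GC content of one read as a percentage of its called (non-N) bases."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0  # read consists only of Ns (or is empty)
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_distribution(seqs):
    """Count reads per GC percentage, rounded to the nearest whole percent."""
    return Counter(round(gc_percent(seq)) for seq in seqs)


print(gc_distribution(["ATGC", "GGCC", "AATT", "ACGT"]))
```

A distribution that departs strongly from a single smooth peak can hint at contamination or bias, which is what this plot is inspected for.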
6. Per base GC distribution
7. Per base N content
8. Length distribution
9. K-mer content
Any k-mer showing more than a 3-fold overall enrichment or a 5-fold enrichment at any given base position will be reported by this module.
10. Overrepresented/duplicate sequences
The analysis of overrepresented sequences will spot an increase in any exactly duplicated sequences
Too many duplicated sequences may indicate sequencing problems
This module will issue a warning if any sequence is found to represent more than 0.1% of the total.
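The 0.1% warning rule can be illustrated with a simple exact-duplicate count. This is a simplified sketch rather than FastQC's actual procedure: it counts every read, whereas a tool working on large files may only sample part of the file. The function name and the `threshold` parameter are hypothetical.

```python
from collections import Counter


def overrepresented(seqs, threshold=0.001):
    """Return {sequence: fraction} for sequences above threshold of all reads.

    threshold=0.001 corresponds to the 0.1% warning level quoted above.
    Only exact duplicates are counted.
    """
    counts = Counter(seqs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        seq: count / total
        for seq, count in counts.items()
        if count / total > threshold
    }


# 1995 distinct reads plus one read repeated 5 times (0.25% of 2000).
unique = [format(i, "011b").replace("0", "A").replace("1", "C") for i in range(1995)]
reads = unique + ["AAAA"] * 5
print(overrepresented(reads))
```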
QC Report
Sequence Statistics
Total No. of Sequences: 6970943
Avg. Sequence Length: 54
Max Sequence Length: 54
Min Sequence Length: 54
Total Sequence Length: 376430922
Total N bases: 14254521
% N bases: 3.78676
No. of Sequences with Ns: 278635
% Sequences with Ns: 3.99709
Quality Statistics
Total HQ bases: 334195496
% HQ bases: 88.78
Total HQ reads: 6350256
% HQ reads: 91.0961
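The sequence statistics in this report are simple counts and ratios: for example, % N bases is total N bases divided by total sequence length (14254521 / 376430922, about 3.79%). The sketch below shows how such figures could be derived from a list of reads. It is an illustration with hypothetical names, not the NGS QC Toolkit's code, and it leaves out the HQ statistics because their cut-offs are not stated here.

```python
def sequence_statistics(seqs):
    """Compute length and N-base statistics for a list of read sequences."""
    if not seqs:
        raise ValueError("no sequences given")
    lengths = [len(seq) for seq in seqs]
    total_length = sum(lengths)
    n_bases = sum(seq.upper().count("N") for seq in seqs)
    seqs_with_n = sum(1 for seq in seqs if "N" in seq.upper())
    return {
        "total_sequences": len(seqs),
        "avg_length": total_length / len(seqs),
        "max_length": max(lengths),
        "min_length": min(lengths),
        "total_length": total_length,
        "total_n_bases": n_bases,
        # Guard against reads that are all empty strings.
        "pct_n_bases": 100.0 * n_bases / total_length if total_length else 0.0,
        "seqs_with_n": seqs_with_n,
        "pct_seqs_with_n": 100.0 * seqs_with_n / len(seqs),
    }


for key, value in sequence_statistics(["ACGT", "ACNN", "NNNN", "AC"]).items():
    print(key, value)
```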
Alignment statistics