NGS - QC & Dataformat

NGS

Data Formats & QC Analysis

Karan Veer Singh, Scientist, NBAGR

04/11/23

Sequence Formats

Most sequence formats are ASCII text (a few, such as SFF and BAM, are binary) containing the sequence ID, quality scores, annotation details, comments, and other descriptions of the sequence

Formats are designed to hold sequence data and other information about the sequence

Why so many formats?

Created based on the information required for each step of analysis

Efficient Data & time management

Data formats vary in the information they contain

Types of sequence file formats

• Raw sequence files
• Co-ordinate files
• Parameter files
• Annotation files
• Metadata files

Read output formats

454

Solexa/Illumina

SOLiD

454 output formats

.sff (Standard Flowgram Format)

.fna

.qual

Illumina output formats

.seq.txt

.prb.txt

Illumina FASTQ (ASCII value - 64 = Illumina quality score)

Qseq (ASCII value - 64 = Phred score)

SCARF (Solexa Compact ASCII Read Format): Illumina single-line format

Phred quality scores

ASCII value for h = 104. Quality of base A at position 1 = 104 - 64 = 40, where 40 is the Phred score

Illumina FastQ
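The offset arithmetic on this slide can be checked in a few lines of Python. This is a minimal sketch, assuming the older Illumina encoding with an ASCII offset of 64 described here; the function name `phred_scores` and the sample quality strings are made up for illustration. Passing `offset=33` covers the Sanger-style encoding.

```python
def phred_scores(quality_string, offset=64):
    """Convert an ASCII-encoded quality string into integer quality scores."""
    return [ord(ch) - offset for ch in quality_string]


# ord("h") is 104, so with offset 64 the score is 40.
print(phred_scores("h"))
# A made-up quality string with falling quality towards the 3' end.
print(phred_scores("hgfB"))
# The same score of 40 in the Sanger encoding (offset 33) is the character "I".
print(phred_scores("I", offset=33))
```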

SOLiD output format(s)

CSFASTA: color-space sequence reads in a FASTA format

These reads can be retained and analyzed in color-space by software

The Format Conversion Tool offers options for cleaning of the CSFASTA files
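To make "color-space" concrete, here is a minimal sketch of how a CSFASTA read maps back to bases, assuming the standard SOLiD di-base code (each digit 0-3 encodes the transition between two adjacent bases, and the read starts with a known primer base). The names `TRANSITIONS` and `decode_color_space` and the sample read are hypothetical, and missing colour calls (".") are not handled.

```python
# For each previous base, the string gives the next base for colours 0, 1, 2, 3.
# 0 keeps the base, 1 swaps A<->C and G<->T, 2 swaps A<->G and C<->T,
# 3 swaps A<->T and C<->G.
TRANSITIONS = {
    "A": "ACGT",
    "C": "CATG",
    "G": "GTAC",
    "T": "TGCA",
}


def decode_color_space(read):
    """Translate a colour-space read (primer base + colour digits) to bases."""
    previous = read[0]  # known primer base, not part of the sequenced insert
    bases = []
    for color in read[1:]:
        previous = TRANSITIONS[previous][int(color)]
        bases.append(previous)
    return "".join(bases)


print(decode_color_space("T0123"))
```

Because every base depends on the one before it, a single wrong colour call changes all downstream bases after conversion, which is one reason software often keeps and analyzes the reads in color-space.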

Read Length

• Sanger read lengths ~ 800-2000 bp

• Generally we define short reads as anything below 200 bp
−Illumina (100 bp – 250 bp)
−SOLiD (75 bp max)
−Ion Torrent (200-300 bp max – currently...)
−Roche 454 (400-800 bp)

• Even with these platforms it is cheaper to produce short reads (e.g. 50 bp) rather than 100 or 200 bp reads

• Diminishing returns: for some applications 50 bp is more than sufficient
−Resequencing of smaller organisms
−Bacterial de-novo assembly
−ChIP-Seq
−Digital Gene Expression profiling
−Bacterial RNA-seq

Alignment/Assembly Format

Common (“standard”) formats for read alignments:

SAM

BAM (= binary SAM)

MAQ
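A SAM alignment line is tab-separated text with eleven mandatory columns followed by optional tags. The sketch below splits one such line into named fields; the helper name `parse_sam_line` and the example record are invented for illustration, and header lines (starting with "@") are not handled.

```python
# The eleven mandatory SAM columns, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Split one SAM alignment line into a dict of its mandatory fields."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    record["TAGS"] = columns[11:]  # optional TAG:TYPE:VALUE fields, left as text
    return record


# Made-up alignment: a 5 bp read matching chr1 at position 100.
example = "read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0"
rec = parse_sam_line(example)
print(rec["RNAME"], rec["POS"], rec["CIGAR"])
```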

Sequencers & Sequence Assembly Packages

Formats for Genome/Gene annotation

BED format (genome-browser tracks)

GFF format (gene/genome features)

BioXSD (XML) (any annotation; under development)
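A practical difference between the first two formats is the coordinate convention: BED intervals are 0-based with an exclusive end, while GFF features are 1-based with an inclusive end. A minimal sketch of the conversion, with hypothetical function names:

```python
def bed_to_gff_coords(chrom_start, chrom_end):
    """BED (0-based, end-exclusive) -> GFF (1-based, end-inclusive)."""
    return chrom_start + 1, chrom_end


def gff_to_bed_coords(start, end):
    """GFF (1-based, end-inclusive) -> BED (0-based, end-exclusive)."""
    return start - 1, end


# The first 100 bases of a chromosome in both conventions.
print(bed_to_gff_coords(0, 100))
print(gff_to_bed_coords(1, 100))
```

In both conventions the feature length stays the same: end - start for BED, end - start + 1 for GFF.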

If reads should be deposited in a public repository:

SRA (Sequence Read Archive, formerly the Short Read Archive) at NCBI

Points to remember on Data Formats

For base-call data: “standard” FASTQ (Sanger, Phred)

For read alignments: SAM/BAM/MAQ format

For annotation results: e.g. GFF or BED format

QC analysis

All platforms have errors

Illumina, SOLiD/ABI-Life, Roche 454, Ion Torrent

1. Removal of low-quality bases / low-complexity regions
2. Removal of adaptor sequences
3. Homopolymer-associated base-call errors (3 or more identical DNA bases) cause a higher number of (artificial) frameshifts
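Homopolymer runs as defined above (3 or more identical bases in a row) are easy to locate with a regular expression. This is only a sketch for flagging such stretches in a read; the function name `find_homopolymers` and the example sequence are made up.

```python
import re

# One base followed by at least two more copies of itself.
HOMOPOLYMER = re.compile(r"([ACGTN])\1{2,}")


def find_homopolymers(seq):
    """Return (0-based start, run) for every run of 3+ identical bases."""
    return [(m.start(), m.group(0)) for m in HOMOPOLYMER.finditer(seq.upper())]


# GGG and AAAA are reported; TT is too short to count.
print(find_homopolymers("ACGGGTTAAAAC"))
```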

Illumina artefacts

Under-represented GC-rich regions (PCR, sequencing)

GGC/GCC motif is associated with low quality and mismatches

Low-quality reads: Phred score < 20
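The Phred scale ties a quality score Q to the probability P that the base call is wrong through Q = -10 log10(P), so a cutoff of 20 corresponds to one expected error in 100 bases. A small sketch of the conversion in both directions (function names are illustrative):

```python
import math


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def phred_from_probability(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, error_probability(q))
```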

Need for QC & Preprocessing

QC analysis of sequence data is extremely important for meaningful downstream analysis

To analyze problems in quality scores/ statistics of sequencing data

To check whether further analysis with sequence is possible

To remove redundancy (filtering)

To remove low quality reads from analysis

To remove adapter contamination

Highly efficient and fast processing tools are required to handle large volumes of data
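The preprocessing steps listed above (adapter removal, trimming of low-quality bases, filtering of low-quality reads) can be sketched as three small functions. This is an illustration only, not how any particular tool implements them: the cutoffs (score 20, 70% of bases), the exact-match adapter search, the function names and the adapter string are all assumptions.

```python
def remove_adapter(seq, quals, adapter):
    """Clip the read at the first exact occurrence of the adapter."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, quals
    return seq[:pos], quals[:pos]


def trim_low_quality_tail(seq, quals, min_q=20):
    """Cut trailing bases whose quality score is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def passes_filter(quals, min_q=20, min_fraction=0.7):
    """Keep a read only if enough of its bases reach min_q."""
    if not quals:
        return False
    good = sum(1 for q in quals if q >= min_q)
    return good / len(quals) >= min_fraction


# Made-up read: 8 bases of insert followed by a 7-base adapter fragment.
seq = "ACGTACGTAGATCGG"
quals = [30] * 6 + [10, 5] + [30] * 7
seq, quals = remove_adapter(seq, quals, "AGATCGG")
seq, quals = trim_low_quality_tail(seq, quals)
print(seq, quals, passes_filter(quals))
```

Real tools also tolerate mismatches and partial adapters at the read end, which an exact `find` does not.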

The quality of data is very important for various downstream analyses, such as sequence assembly and single nucleotide polymorphism identification

Most of the programs available for downstream analyses do not provide utilities for quality checking and filtering of NGS data before processing

NGS QC Toolkit & FastQC

NGS QC Toolkit performs quality checks and filters for high-quality reads

This toolkit is a standalone and open source application freely available at http://www.nipgr.res.in/ngsqctoolkit.html

The applications are implemented in the Perl programming language

Supports QC of sequencing data generated on Roche 454 and Illumina platforms

Additional tools aid QC (sequence format converters and trimming tools) and analysis (statistics tools)

FastQC can be used only for preliminary analysis

NGSQC toolkit Output

Comparison - QC tools

FastQC
• Basic statistics
• Quality: per base position
• Per sequence quality distribution
• Nucleotide content per position
• Per sequence GC distribution
• Per base GC distribution
• Per base N content
• Length distribution
• Overrepresented/duplicated sequences
• K-mer content

FastQC (Box-Whisker plot)

Y axis: quality score; X axis: base position
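The numbers behind such a plot are per-position summaries of the quality scores across all reads. The sketch below computes minimum, median, mean and maximum for each base position; it is not FastQC's own code, the function name and toy reads are invented, and the quartiles that a full box-whisker plot also draws are left out for brevity.

```python
import statistics


def per_base_quality_summary(quality_lists):
    """Summarise quality scores column by column (one column per base position)."""
    longest = max(len(q) for q in quality_lists)
    rows = []
    for pos in range(longest):
        # Reads shorter than this position simply do not contribute.
        column = sorted(q[pos] for q in quality_lists if len(q) > pos)
        rows.append({
            "position": pos + 1,
            "min": column[0],
            "median": statistics.median(column),
            "mean": statistics.mean(column),
            "max": column[-1],
        })
    return rows


# Three made-up reads whose quality drops towards the 3' end.
toy_reads = [[40, 38, 30, 12], [39, 37, 28, 10], [40, 36, 25, 8]]
rows = per_base_quality_summary(toy_reads)
for row in rows:
    print(row)
```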

2. Quality - Per base position

3. Per Sequence Quality Distribution

4. Nucleotide content per position

5. Per sequence GC distribution
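Per-sequence GC content is the share of G and C among the bases of each read, and the distribution is a histogram of those values over all reads. A minimal sketch, assuming that N bases are left out of the denominator (an assumption, not necessarily what a given tool does); function names and reads are made up.

```python
from collections import Counter


def gc_percent(seq):
    """Percentage of G and C among the called (A, C, G, T) bases of a read."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_distribution(reads):
    """Count how many reads fall at each whole-number GC percentage."""
    return Counter(round(gc_percent(seq)) for seq in reads)


toy_reads = ["GGCCAATT", "GCGCNN", "ATATATAT", "GGCCAATT"]
print([gc_percent(seq) for seq in toy_reads])
print(gc_distribution(toy_reads))
```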

6. Per base GC distribution

7. Per base N content

8. Length Distribution

9. K-mer content

Any k-mer showing more than a 3-fold overall enrichment or a 5-fold enrichment at any given base position will be reported by this module.

10. Overrepresented/duplicate sequences

The analysis of overrepresented sequences will spot an increase in any exactly duplicated sequences

Too many duplicated sequences are usually a sign of sequencing problems

This module will issue a warning if any sequence is found to represent more than 0.1% of the total.
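The 0.1% rule can be illustrated by counting exact duplicates and reporting every sequence whose share of the total exceeds the threshold. This is a simplified sketch rather than FastQC's implementation; the function name and the synthetic reads are assumptions.

```python
from collections import Counter


def overrepresented(reads, threshold_percent=0.1):
    """Return (sequence, count, percent) for sequences above the threshold."""
    counts = Counter(reads)
    total = len(reads)
    flagged = []
    for seq, n in counts.most_common():
        percent = 100.0 * n / total
        if percent > threshold_percent:
            flagged.append((seq, n, percent))
    return flagged


# 1995 distinct made-up reads plus one sequence that occurs 5 times (0.25%).
unique_reads = [format(i, "012b").replace("0", "A").replace("1", "C")
                for i in range(1995)]
reads = unique_reads + ["ACGT"] * 5
print(overrepresented(reads))
```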

QC Report

Sequence Statistics
Total No. of Sequences      6970943
Avg. Sequence Length        54
Max Sequence Length         54
Min Sequence Length         54
Total Sequence Length       376430922
Total N bases               14254521
% N bases                   3.78676
No. of Sequences with Ns    278635
% Sequences with Ns         3.99709

Quality Statistics
Total HQ bases              334195496
% HQ bases                  88.78
Total HQ reads              6350256
% HQ reads                  91.0961
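The figures in this report are simple counts and ratios; for instance, the total sequence length equals the number of sequences times the read length (6970943 x 54 = 376430922). The sketch below shows how such a report could be computed from reads held in memory. The definition of "HQ" used here (score of at least 20, and at least 70% of a read's bases reaching it) is an assumption for illustration, as are the function name and the toy reads.

```python
def qc_report(reads, min_q=20, min_fraction=0.7):
    """reads is a list of (sequence, list_of_quality_scores) pairs."""
    lengths = [len(seq) for seq, _ in reads]
    total_bases = sum(lengths)
    n_bases = sum(seq.upper().count("N") for seq, _ in reads)
    reads_with_n = sum(1 for seq, _ in reads if "N" in seq.upper())
    hq_bases = sum(1 for _, quals in reads for q in quals if q >= min_q)
    hq_reads = sum(
        1 for _, quals in reads
        if quals and sum(q >= min_q for q in quals) / len(quals) >= min_fraction
    )
    return {
        "total_sequences": len(reads),
        "avg_length": total_bases / len(reads),
        "max_length": max(lengths),
        "min_length": min(lengths),
        "total_length": total_bases,
        "total_n_bases": n_bases,
        "pct_n_bases": 100.0 * n_bases / total_bases,
        "sequences_with_n": reads_with_n,
        "pct_sequences_with_n": 100.0 * reads_with_n / len(reads),
        "hq_bases": hq_bases,
        "pct_hq_bases": 100.0 * hq_bases / total_bases,
        "hq_reads": hq_reads,
        "pct_hq_reads": 100.0 * hq_reads / len(reads),
    }


# Three made-up 5 bp reads with their quality scores.
toy_reads = [
    ("ACGTN", [30, 30, 30, 30, 2]),
    ("ACGTA", [30, 30, 10, 10, 10]),
    ("NNGTA", [2, 2, 30, 30, 30]),
]
report = qc_report(toy_reads)
for key, value in report.items():
    print(key, value)
```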

Alignment statistics
