Corona Lite Introduction



© 2009 Applied Biosystems

Section Outline – Corona Lite Introduction

• Workflow and Setup

• Matching pipeline

• Pairing pipeline

• Variation pipeline


Corona Lite Overview


GlobalSETS versus Corona Lite

| Category | SOLiD™ Global SETS v3.0 | Corona_Lite v4.2 |
|---|---|---|
| Mapping algorithm | MapReads | MapReads |
| Mapping scheme: progressive | Yes, for max throughput (default) | No |
| Mapping scheme: full length with fixed number of mismatches | Yes | Yes |
| Repeat Classifier | Yes, new in v3.0 | No |
| MatchingRepeat, Random, and Consolidate | Yes, new in v3.0 | No |
| SNP algorithm | diBayes | SNP caller |
| Multiple run combination analysis | No | Yes |
| Integrated small indel analysis | Jun-09 | Yes |


Global SETS Versus Corona Lite

| Category | SOLiD™ Global SETS v3.0 | Corona_Lite v4.2 |
|---|---|---|
| Matching: Fasta-like .ma files | Yes | Yes |
| Matching: gff v.2 | Default | Optional using MaToGff.sh |
| Pairing: .mates | Yes | Yes |
| Pairing: gff v.2 | Default | Optional using MatesToGff.sh |
| SNP pipe: SNP summary | Gff v.3 | SNP list text file |
| SNP pipe: Consensus base sequence | Yes | Yes |
| Stats files | New format | Old Stats file |


Corona Lite Setup

• Before you start

• Set the correct environment

• Make cmap file

• Validate reference

• Generate double-encoded reference


Corona Lite Setup – Environment Variables

• Set up environment variables

• For csh/tcsh:

• setenv CORONAROOT /share/apps/corona_lite

• source $CORONAROOT/etc/profile.d/corona.csh

• For sh/ksh/bash:

• export CORONAROOT=/share/apps/corona_lite

• source $CORONAROOT/etc/profile.d/corona.sh


Corona Lite Setup – Chromosome Map (cmap) File

• Prepare the chromosome map file (tab-delimited), one line per chromosome with four columns:

• Chromosome ID

• Chromosome Name

• FASTA Reference

• Double-Encoded Reference

• For example:

1 chr1 /path/to/file/chr1.fa /path/to/file/de_chr1.fa
2 chr2 /path/to/file/chr2.fa /path/to/file/de_chr2.fa
3 chr3 /path/to/file/chr3.fa /path/to/file/de_chr3.fa
4 chr4 /path/to/file/chr4.fa /path/to/file/de_chr4.fa
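A cmap file with this four-column layout can be generated instead of typed by hand. The following is a minimal Python sketch, assuming the references are named `<chromosome>.fa` and `de_<chromosome>.fa` in one directory, as in the example; the function names and the `/path/to/file` directory are illustrative and not part of Corona Lite.

```python
import csv
import io


def build_cmap_rows(chromosomes, ref_dir):
    """Return one (id, name, fasta, double-encoded fasta) row per chromosome."""
    rows = []
    for idx, name in enumerate(chromosomes, start=1):
        fasta = f"{ref_dir}/{name}.fa"
        # Assumed naming scheme: double-encoded files carry a "de_" prefix.
        de_fasta = f"{ref_dir}/de_{name}.fa"
        rows.append((str(idx), name, fasta, de_fasta))
    return rows


def write_cmap(rows, handle):
    """Write the rows as tab-delimited text, one chromosome per line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


# Build the cmap text in memory; pass an open file handle to write to disk.
buffer = io.StringIO()
write_cmap(build_cmap_rows(["chr1", "chr2", "chr3", "chr4"], "/path/to/file"), buffer)
cmap_text = buffer.getvalue()
print(cmap_text, end="")
```

Writing to a handle keeps the sketch testable; in practice the handle would be a file opened for writing.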


Corona Lite Setup – Validate and Double Encode Reference

• Validate reference

• reference_validation.pl -r chr1.fa -s 9999999999 -o chr1_validated.fa

• Generate double-encoded sequence

• encodeFasta.py -n -l sequence.fasta > de_sequence.fasta
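For orientation, the general idea behind double encoding in SOLiD color space can be sketched as follows. This assumes the standard two-base color code (each pair of adjacent bases maps to a color 0 to 3) and the usual relabeling of colors 0, 1, 2, 3 as A, C, G, T. It is an illustration only: the slides do not describe what `encodeFasta.py` does internally or what its `-n` and `-l` flags mean, and the function below is hypothetical.

```python
# Two-bit codes for the bases; the color of a dinucleotide is the XOR of the codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

# Double encoding writes the color digits 0, 1, 2, 3 as the letters A, C, G, T.
COLOR_LETTER = "ACGT"


def double_encode(bases):
    """Return the double-encoded color transitions of a base-space sequence.

    Only the transitions between adjacent bases are emitted, so the result is
    one character shorter than the input. Transitions that involve a base
    other than A, C, G or T are written as "N" (an assumption of this sketch).
    """
    bases = bases.upper()
    encoded = []
    for left, right in zip(bases, bases[1:]):
        if left in BASE_CODE and right in BASE_CODE:
            color = BASE_CODE[left] ^ BASE_CODE[right]
            encoded.append(COLOR_LETTER[color])
        else:
            encoded.append("N")
    return "".join(encoded)


print(double_encode("ACGTT"))
```

The point of the relabeling is that color-space reads and references can then be handled by tools that expect ordinary A/C/G/T strings.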


Section Outline – Corona Lite Introduction

• Workflow and Setup

• Matching pipeline

• Pairing pipeline

• Variation pipeline


Things To Consider

• Number of hits to report (-z)

• Default is 10 per chromosome

• What does it mean if a read hits 10 times?

• Recommended mismatches

• 2 for 25bp reads

• 3-4 for 35bp reads

• 4-6 for 50bp reads

• Can consider counting valid adjacent mismatches as 1 (-a=1)
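The recommended mismatch settings above can be kept as a small lookup when scripting runs. This is a minimal Python sketch; the table values come from the list above, while the function name and the decision to reject other read lengths are assumptions of the sketch.

```python
# Recommended mismatch range (low, high) per read length, as listed above.
RECOMMENDED_MISMATCHES = {
    25: (2, 2),
    35: (3, 4),
    50: (4, 6),
}


def recommended_mismatches(read_length):
    """Return the (low, high) recommended mismatch count for a read length."""
    try:
        return RECOMMENDED_MISMATCHES[read_length]
    except KeyError:
        # The slides give no guidance for other lengths, so do not guess.
        raise ValueError(f"no recommendation for {read_length}bp reads") from None


low, high = recommended_mismatches(35)
print(f"35bp reads: allow {low} to {high} mismatches")
```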


Matching Parameters (Required)

• matching_large_genomes_cmap_save.pl

• -csfasta – F3 or R3 reads

• -dir – Output directory

• -cmap – Chromosome map file

• -t – Tag length

• -e – Number of errors allowed


Matching Parameters (Optional)

• matching_large_genomes_cmap_save.pl

• -p – Pattern mask for reads

• -a – Count adjacent errors as one: 0 = no; 1 = valid adjacent errors; 2 = all adjacent errors (defaults to 0)

• -z – Maximum number of hits per chromosome (defaults to 10)

• -incremental – Remove reads that have already mapped
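The required and optional matching parameters can be assembled into a command line programmatically. The sketch below is a hedged Python illustration: it uses only flags named on the slides, and it assumes the `-flag=value` form because the slides write `-a=1`; whether the script also accepts other forms is not stated. The function name and the example file names are hypothetical.

```python
import shlex


def build_matching_command(csfasta, out_dir, cmap, tag_length, errors,
                           adjacent=None, max_hits=None):
    """Return the argument list for a matching run."""
    args = [
        "matching_large_genomes_cmap_save.pl",
        f"-csfasta={csfasta}",   # F3 or R3 reads
        f"-dir={out_dir}",       # output directory
        f"-cmap={cmap}",         # chromosome map file
        f"-t={tag_length}",      # tag length
        f"-e={errors}",          # number of errors allowed
    ]
    if adjacent is not None:
        if adjacent not in (0, 1, 2):
            raise ValueError("-a must be 0, 1 or 2")
        args.append(f"-a={adjacent}")
    if max_hits is not None:
        args.append(f"-z={max_hits}")
    return args


# Hypothetical file names, for illustration only.
cmd = build_matching_command("reads_F3.csfasta", "matching_out", "genome.cmap",
                             tag_length=35, errors=3, adjacent=1)
print(shlex.join(cmd))
```

Returning a list rather than a string keeps quoting correct if the command is later passed to `subprocess.run`.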


Submitting Jobs

• For PBS, use submit_scripts_to_PBS.pl

• Submission scripts exist for LSF, SGE and SMP machines

• Required Options

• -j – Job list file

• Optional

• -h – Usage description

• -q – Specify a queue

• -i – Interactive queue


Section Outline – Corona Lite Introduction

• Workflow and Setup

• Matching pipeline

• Pairing pipeline

• Variation pipeline


Find Insert Size

• pairing_by_group.pl

• -F3 – F3 match file (.csfasta.ma)

• -R3 – R3 match file (.csfasta.ma)

• -e – Total errors allowed during mapping

• -output_dir – Output directory

• -find_pairing_dist – Flag for finding distance distribution

• Look at pairingDist.freq.binned file


Insert Size Distribution

[Plot: insert size distribution; vertical axis ticks from 0 to 800, horizontal axis ticks from 0 to 3000]


Perform Mate-pair Rescue

• pairing_by_group.pl

• -F3 – F3 match file (.csfasta.ma)

• -R3 – R3 match file (.csfasta.ma)

• -e – Total errors allowed during mapping

• -output_dir – Output directory

• -min_insert_size – From distribution

• -max_insert_size – From distribution

• -ref – Multi FASTA reference file
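One way to read minimum and maximum insert sizes off a binned distribution is to take the contiguous run of bins around the peak whose counts stay above some fraction of the peak. The Python sketch below illustrates that heuristic only. The slides do not prescribe a method or describe the layout of pairingDist.freq.binned, so the `(bin_start, bin_end, count)` input, the 5% cutoff, and the toy counts are all assumptions.

```python
def insert_size_range(bins, min_fraction=0.05):
    """Return (min_insert_size, max_insert_size) around the distribution peak.

    bins: list of (bin_start, bin_end, count) tuples sorted by bin_start.
    A bin is kept while its count is at least min_fraction of the peak count.
    """
    if not bins:
        raise ValueError("empty distribution")
    peak = max(range(len(bins)), key=lambda i: bins[i][2])
    cutoff = bins[peak][2] * min_fraction

    # Walk outwards from the peak until the counts drop below the cutoff.
    low = peak
    while low > 0 and bins[low - 1][2] >= cutoff:
        low -= 1
    high = peak
    while high < len(bins) - 1 and bins[high + 1][2] >= cutoff:
        high += 1
    return bins[low][0], bins[high][1]


# Toy distribution for illustration; these counts are not taken from the slides.
toy_bins = [
    (0, 500, 3),
    (500, 1000, 40),
    (1000, 1500, 700),
    (1500, 2000, 650),
    (2000, 2500, 30),
    (2500, 3000, 2),
]
print(insert_size_range(toy_bins))
```

The two returned values would then be supplied as -min_insert_size and -max_insert_size; a stricter or looser cutoff trades rescued pairs against false pairings.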


Mate-pair Descriptions

• Mate-pairs are annotated with a three letter code


Section Outline – Corona Lite Introduction

• Workflow and Setup

• Matching pipeline

• Pairing pipeline

• Variation pipeline


SNP Pipeline

• Preparation

• Single tag: split_by_chromosome.pl

• -f – Unique match file (.unique.csfasta.ma)

• -c – Output chromosome directory

• Mate pair: multi_chr_pairing_parser.pl

• -mates – Mates file from pairing pipeline (.mates)

• -o_dir – Output directory


SNP Pipeline

• Consensus and SNP calling

• consensus_prep_and_wrapper_cmap_save_script.pl

• -mates/match_dir – Output from preparation step

• -cmap – Chromosome map file

• -mlf3/mlr3 – Tag length

• -ef3/er3 – Mismatches allowed

• -o_dir – Output directory

• -insert_start/_end – Pairing size for mate pair run


SNP Pipeline

• Consensus sequence generated from alignment to the reference sequence

• Files

• snps.txt

• snps_sorted.txt

• snp_probs.dat

• bp_consensus_confirmed_sequence_with_Ns.fasta


Corona Lite Overview


Quiz

• What do you need to do before running Corona Lite?

• What are the three main steps of Corona Lite?

• What is the workflow of each pipeline in Corona Lite?

• What do the three-letter annotations of mate-pairs (e.g., AAA, ABA) mean?

• What are the main differences between Corona Lite and GlobalSETS?


Appendix


General – Global SETS Versus Corona Lite

| Category | SOLiD™ Global SETS v3.0 | Corona_Lite v4.2 |
|---|---|---|
| Supported OS | Linux CentOS, Scyld Clusterware, PBS (Torque). Will test LSF, PBS Pro and SGE by June 2009 | Linux, PBS, LSF, SGE |
| Programming language | Java (algorithms in C++) | Scripting languages (some algorithms in C++) |
| Analysis setup and execution | Automatic through SETS GUI; integrated command line | Integrated command. Can run batch mode. |
| Integrate with custom pipeline | Yes (SAI): GUI and command line | Yes, command line interface |
| Speed | Optimized for compute performance for complex genome analysis | Supports complex genome analysis |
| Warranty | Yes | No |
| AB support to end users | Yes | Yes |
| License fee | Comes with SOLiD 3 System. Contact AB sales | Free open source |