
presto Documentation
Release 0.5.3-2017.02.14

Jason Anthony Vander Heiden

Feb 15, 2017

Getting Started

1 Overview
  1.1 Scope and Features
  1.2 Input and Output
  1.3 Annotation Scheme
2 Download
3 Installation
  3.1 Requirements
  3.2 Linux
  3.3 Mac OS X
  3.4 Windows
4 Workflows
  4.1 Roche 454 BCR mRNA with Multiplexed Samples
  4.2 Illumina MiSeq 2x250 BCR mRNA
  4.3 UMI Barcoded Illumina MiSeq 2x250 BCR mRNA
5 Fixing UMI Problems
  5.1 Correcting misaligned V-segment primers and indels in UMI groups
  5.2 Dealing with insufficient UMI diversity
  5.3 Combining split UMIs
  5.4 Estimating sequencing and PCR error rates with UMI data
6 Miscellaneous Tasks
  6.1 Importing data from SRA, ENA or GenBank into pRESTO
  6.2 Reducing file size for submission to IMGT/HighV-QUEST
  6.3 Subsetting sequence files by annotation
  6.4 Random sampling from sequence files
  6.5 Cleaning or removing poor quality sequences
  6.6 Assembling paired-end reads that do not overlap
  6.7 Assigning isotype annotations from the constant region sequence
7 Commandline Usage
  7.1 AlignSets
  7.2 AssemblePairs
  7.3 BuildConsensus
  7.4 ClusterSets
  7.5 CollapseSeq
  7.6 ConvertHeaders
  7.7 EstimateError
  7.8 FilterSeq
  7.9 MaskPrimers
  7.10 PairSeq
  7.11 ParseHeaders
  7.12 ParseLog
  7.13 SplitSeq
8 API
  8.1 presto.Annotation
  8.2 presto.Applications
  8.3 presto.Commandline
  8.4 presto.IO
  8.5 presto.Multiprocessing
  8.6 presto.Sequence
9 Release Notes
  9.1 Version 0.5.3: February 14, 2017
  9.2 Version 0.5.2: March 8, 2016
  9.3 Version 0.5.1: December 4, 2015
  9.4 Version 0.5.0: September 7, 2015
  9.5 Version 0.4.8: September 7, 2015
  9.6 Version 0.4.7: June 5, 2015
  9.7 Version 0.4.6: May 13, 2015
  9.8 Version 0.4.5: March 20, 2015
  9.9 Version 0.4.4: June 10, 2014
  9.10 Version 0.4.3: April 7, 2014
  9.11 Version 0.4.2: March 20, 2014
  9.12 Version 0.4.1: January 27, 2014
  9.13 Version 0.4.0: September 30, 2013
  9.14 Version 0.3 (prerelease 6): August 13, 2013
  9.15 Version 0.3 (prerelease 5): August 7, 2013
  9.16 Version 0.3 (prerelease 4): May 18, 2013
10 Contact
11 Citation
12 License
13 Indices
Python Module Index


pRESTO is a toolkit for processing raw reads from high-throughput sequencing of B cell and T cell repertoires.

Dramatic improvements in high-throughput sequencing technologies now enable large-scale characterization of lymphocyte repertoires, defined as the collection of trans-membrane antigen-receptor proteins located on the surface of B cells and T cells. The REpertoire Sequencing TOolkit (pRESTO) is composed of a suite of utilities to handle all stages of sequence processing prior to germline segment assignment. pRESTO is designed to handle either single reads or paired-end reads. It includes features for quality control, primer masking, annotation of reads with sequence embedded barcodes, generation of unique molecular identifier (UMI) consensus sequences, assembly of paired-end reads and identification of duplicate sequences. Numerous options for sequence sorting, sampling and conversion operations are also included.


CHAPTER 1

Overview

1.1 Scope and Features

pRESTO performs all stages of raw sequence processing prior to alignment against reference germline sequences. The toolkit is intended to be easy to use, but some familiarity with commandline applications is expected. Rather than providing a fixed solution to a small number of common workflows, we have designed pRESTO to be as flexible as possible. This design philosophy makes pRESTO suitable for many existing protocols and adaptable to future technologies, but requires users to construct a sequence of commands and options specific to their experimental protocol.

pRESTO is composed of a set of standalone tools to perform specific tasks, often with a series of subcommands providing different behaviors. A brief description of each tool is shown in the table below.

Tool            Subcommand   Description
AlignSets                    Multiple aligns sets of sequences sharing the same annotation
                muscle       Uses the program MUSCLE to align reads
                offset       Uses a table of primer alignments to align the 5’ region
                table        Creates a table of primer alignments for the offset subcommand
AssemblePairs                Assembles paired-end reads into a complete sequence
                align        Assembles paired-end reads by aligning the sequence ends
                join         Concatenates paired-end reads with intervening gaps
                reference    Assembles paired-end reads using V-segment references
BuildConsensus               Constructs UMI consensus sequences
ClusterSets                  Clusters UMI read groups into smaller sub-groups
CollapseSeq                  Removes duplicate sequences
ConvertHeaders               Converts sequence headers to the pRESTO format
                generic      Converts sequence headers with an unknown annotation system
                454          Converts Roche 454 sequence headers
                genbank      Converts NCBI GenBank and RefSeq sequence headers
                illumina     Converts Illumina sequence headers
                imgt         Converts sequence headers output by IMGT/GENE-DB
                sra          Converts NCBI SRA sequence headers
EstimateError                Estimates error rates for UMI data
FilterSeq                    Removes or modifies low quality reads
                length       Removes sequences under a defined length
                maskqual     Masks low Phred quality score positions with Ns
                missing      Removes sequences with a high number of Ns
                repeats      Removes sequences with long repeats of a single nucleotide
                quality      Removes sequences with low Phred quality scores
                trimqual     Trims sequences to segments with high Phred quality scores
MaskPrimers                  Identifies and removes primer regions, MIDs and UMI barcodes
                align        Matches primers by local alignment and reorients sequences
                score        Matches primers at a fixed user-defined start position
PairSeq                      Sorts paired-end reads and copies annotations between them
ParseHeaders                 Manipulates sequence annotations
                add          Adds a field and value annotation pair to all reads
                collapse     Compresses a set of annotation fields into a single field
                copy         Copies values between annotation fields
                delete       Deletes an annotation from all reads
                expand       Expands a field with multiple values into separate annotations
                rename       Renames annotation fields
                table        Outputs sequence annotations as a data table
ParseLog                     Converts the log output of pRESTO scripts into data tables
SplitSeq                     Performs conversion, sorting, and subsetting of sequence files
                count        Splits files into smaller files
                group        Splits files based on numerical or categorical annotation
                sample       Randomly samples sequences from a file
                samplepair   Randomly samples paired-end reads from two files
                select       Filters sequences based on annotations
                sort         Sorts sequences based on annotations

1.2 Input and Output

All tools take as input standard FASTA or FASTQ formatted files and output files in the same formats. This allows pRESTO to work seamlessly with other sequence processing tools that use either of these data formats; any steps within a pRESTO workflow can be exchanged for an alternate tool, if desired.

Each tool appends a specific suffix to its output files describing the step and output. For example, MaskPrimers will append _primers-pass to the output file containing successfully aligned sequences and _primers-fail to the file containing unaligned sequences.

See also:

Details regarding the suffixes used by pRESTO tools can be found in the Commandline Usage documentation for each tool.

1.3 Annotation Scheme

The majority of pRESTO tools manipulate and add sequence-specific annotations as part of their processing functions using the scheme shown below. Each annotation is delimited using a reserved character (| by default), with the annotation field name and values separated by a second reserved character (= by default), and each value within a field is separated by a third reserved character (, by default). These annotations follow the sequence identifier, which itself immediately follows the > (FASTA) or @ (FASTQ) symbol denoting the beginning of a new sequence entry. The sequence identifier is given the reserved field name ID. To mitigate potential analysis errors, each tool in pRESTO annotates sequences by appending values to existing annotation fields when they exist, and will not overwrite or delete annotations unless explicitly instructed to do so using the ParseHeaders tool. All reserved characters can be redefined using the command line options.

Listing 1.1: FASTA Annotation

>SEQUENCE_ID|PRIMER=IgHV-6,IgHC-M|BARCODE=DAY7|DUPCOUNT=8
NNNNCCACGATTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTTTCTACCAAAA

Listing 1.2: FASTQ Annotation

@SEQUENCE_ID|PRIMER=IgHV-6,IgHC-M|BARCODE=DAY7|DUPCOUNT=8
NNNNCCACGATTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTTTCTACCAAAA
+
!!!!nmoomllmlooj\Xlnngookkikloommononnoonnomnnlomononoojlmmkiklonooooooooomoo

See also:

• Details regarding the annotations added by pRESTO tools can be found in the Commandline Usage documentation for each tool.

• The ParseHeaders tool provides a number of options for manipulating annotations in the pRESTO format.

• The ConvertHeaders tool allows you to convert several common annotation schemes into the pRESTO annotation format.
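To make the annotation scheme concrete, the following Python sketch splits a header into its fields using the default reserved characters and rebuilds it again. The names parse_header and format_header are hypothetical helpers written for this illustration; they are not part of the pRESTO API, and the sketch assumes the default delimiters (|, = and ,).

```python
def parse_header(header, delimiters=("|", "=", ",")):
    """Split a pRESTO-style header into a dict of field name -> list of values.

    Hypothetical helper for illustration only; not a pRESTO function.
    """
    field_sep, value_sep, list_sep = delimiters
    # Drop the leading > (FASTA) or @ (FASTQ) symbol, if present
    parts = header.strip().lstrip(">@").split(field_sep)
    # The sequence identifier is stored under the reserved field name ID
    fields = {"ID": [parts[0]]}
    for part in parts[1:]:
        name, _, value = part.partition(value_sep)
        # Append to an existing field rather than overwriting it
        fields.setdefault(name, []).extend(value.split(list_sep))
    return fields


def format_header(fields, delimiters=("|", "=", ",")):
    """Rebuild the header string (without the > or @ symbol) from a field dict."""
    field_sep, value_sep, list_sep = delimiters
    parts = [fields["ID"][0]]
    for name, values in fields.items():
        if name != "ID":
            parts.append(name + value_sep + list_sep.join(values))
    return field_sep.join(parts)


header = ">SEQUENCE_ID|PRIMER=IgHV-6,IgHC-M|BARCODE=DAY7|DUPCOUNT=8"
fields = parse_header(header)
print(fields["PRIMER"])       # ['IgHV-6', 'IgHC-M']
print(format_header(fields))  # SEQUENCE_ID|PRIMER=IgHV-6,IgHC-M|BARCODE=DAY7|DUPCOUNT=8
```

Storing every field as a list of values mirrors the rule that tools append to existing fields instead of replacing them.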


CHAPTER 2

Download

The latest stable release of pRESTO may be downloaded from PyPI (https://pypi.python.org/pypi/presto) or Bitbucket (https://bitbucket.org/kleinstein/presto/downloads).

Development versions and source code are available on Bitbucket (https://bitbucket.org/kleinstein/presto/overview).

CHAPTER 3

Installation

The simplest way to install the latest stable release of pRESTO is via pip:

> pip3 install presto --user

The current development build can be installed using pip and mercurial in similar fashion:

> pip3 install hg+https://bitbucket.org/kleinstein/presto#default --user

If you currently have a development version installed, then you will likely need to add the arguments --upgrade --no-deps --force-reinstall to the pip3 command.

3.1 Requirements

• Python 3.4.0 (http://python.org)

• setuptools 2.0 (http://bitbucket.org/pypa/setuptools)

• NumPy 1.8 (http://numpy.org)

• SciPy 0.14 (http://scipy.org)

• pandas 0.15 (http://pandas.pydata.org)

• Biopython 1.65 (http://biopython.org)

• AlignSets requires MUSCLE v3.8 (http://www.drive5.com/muscle)

• ClusterSets requires USEARCH v7.0 (http://www.drive5.com/usearch) or vsearch v2.3.2 (https://github.com/torognes/vsearch)

• AssemblePairs-reference requires USEARCH v7.0 or BLAST+ 2.5 (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST)



3.2 Linux

1. The simplest way to install all Python dependencies is to install the full SciPy stack using the instructions (http://scipy.org/install.html), then install Biopython according to its instructions (http://biopython.org/DIST/docs/install/Installation.html).

2. Download the pRESTO bundle and run:

> pip3 install presto-x.y.z.tar.gz --user

3.3 Mac OS X

1. Install Xcode. Available from the Apple store or developer downloads (http://developer.apple.com/downloads).

2. Older versions of Mac OS X will require you to install XQuartz 2.7.5. Available from the XQuartz project (http://xquartz.macosforge.org/landing).

3. Install Homebrew (http://brew.sh) following the installation and post-installation instructions.

4. Install Python 3.4.0+ and set the path to the python3 executable:

> brew install python3
> echo 'export PATH=/usr/local/bin:$PATH' >> ~/.profile

5. Exit and reopen the terminal application so the PATH setting takes effect.

6. You may, or may not, need to install gfortran (required for SciPy). Try without first, as this can take an hour to install and is not needed on newer releases. If you do need gfortran to install SciPy, you can install it using Homebrew:

> brew install gfortran

If the above fails, run this instead:

> brew install --env=std gfortran

7. Install NumPy, SciPy, pandas and Biopython using the Python package manager:

> pip3 install numpy scipy pandas biopython

8. Download the pRESTO bundle, open a terminal window, change directories to the download location, and run:

> pip3 install presto-x.y.z.tar.gz

3.4 Windows

1. Install Python 3.4.0+ from Python (http://python.org/downloads), selecting both the options ‘pip’ and ‘Add python.exe to Path’.

2. Install NumPy, SciPy, pandas and Biopython using the packages available from the Unofficial Windows binary collection (http://www.lfd.uci.edu/~gohlke/pythonlibs).

3. Download the pRESTO bundle, open a Command Prompt, change directories to the download folder, and run:

> pip install presto-x.y.z.tar.gz

4. For a default installation of Python 3.4, the pRESTO scripts will be installed into C:\Python34\Scripts and should be directly executable from the Command Prompt. If this is not the case, then follow step 5 below.

5. Add both the C:\Python34 and C:\Python34\Scripts directories to your %Path%. On Windows 7 the %Path% setting is located under Control Panel -> System and Security -> System -> Advanced System Settings -> Environment variables -> System variables -> Path.

6. If you have trouble with the .py file associations, try adding .PY to your PATHEXT environment variable. Also, try opening a Command Prompt as Administrator and running:

> assoc .py=Python.File
> ftype Python.File="C:\Python34\python.exe" "%1" %*


CHAPTER 4

Workflows

4.1 Roche 454 BCR mRNA with Multiplexed Samples

4.1.1 Overview of Experimental Data

This example uses publicly available data from:

Lineage structure of the human antibody repertoire in response to influenza vaccination.
Jiang N, He J, and Weinstein JA, et al.
Sci Transl Med. 2013. 5(171):171ra19. doi:10.1126/scitranslmed.3004794.

The data may be downloaded from the NCBI Sequence Read Archive under accession ID: SRX190717. For this example, we will use the first 50,000 sequences of sample 43 (accession: SRR765688), which may be downloaded using fastq-dump from the SRA Toolkit (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software):

fastq-dump -X 50000 SRR765688

Primer and sample barcode (referred to as MID in Jiang, He and Weinstein et al, 2013) sequences are available in the published manuscript. This example assumes that the sample barcodes, forward primers (V-region), and reverse primers (C-region) have been extracted from the manuscript and placed into three corresponding FASTA files.

Read Configuration

Fig. 4.1: Schematic of the Roche 454 read configuration. The start of each sequence is labeled with the sample barcode. The following sequence may either be in the forward (top) orientation, proceeding 5’ to 3’ in the direction of the V(D)J reading frame, or the reverse complement orientation (bottom), proceeding in the opposite direction.



Example Data

We have hosted a small subset of the data (Accession: SRR765688) on the pRESTO website in FASTQ format, with accompanying primer and sample barcode (MID) files. The sample data set and workflow script may be downloaded from here:

Jiang, He and Weinstein et al, 2013 Example Files (http://clip.med.yale.edu/immcantation/examples/Jiang2013_Example.tar.gz)

4.1.2 Overview of the Workflow

The example that follows performs all processing steps to arrive at high-quality unique sequences using this example data set. The workflow is divided into three high-level tasks:

1. Quality control

2. Sample barcode and primer identification

3. Deduplication and filtering

A graphical representation of the workflow along with the corresponding sequence of pRESTO commands is shown below.

Flowchart

Fig. 4.2: Flowchart of processing steps. Each pRESTO tool is shown as a colored box. The workflow is divided into three primary tasks: (orange) quality control, (green) sample barcode and primer identification, (blue) deduplication and filtering of the repertoire. Grey boxes indicate the initial and final data files. The intermediate files output by each tool are not shown for the sake of brevity.

Commands

#!/usr/bin/env bash
# Quality control
FilterSeq.py length -s SRR765688.fastq -n 300 --outname S43 --log FSL.log
FilterSeq.py quality -s S43_length-pass.fastq -q 20 --outname S43 --log FSQ.log
# Sample barcode and primer identification
MaskPrimers.py score -s S43_quality-pass.fastq -p SRR765688_MIDs.fasta \
    --start 0 --maxerror 0.1 --mode cut --outname S43-MID --log MPM.log
MaskPrimers.py align -s S43-MID_primers-pass.fastq -p SRX190717_VPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --mode mask --outname S43-FWD --log MPV.log
MaskPrimers.py align -s S43-FWD_primers-pass.fastq -p SRX190717_CPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --revpr --skiprc --mode cut \
    --outname S43-REV --log MPC.log
ParseHeaders.py expand -s S43-REV_primers-pass.fastq -f PRIMER
ParseHeaders.py rename -s S43-REV*reheader.fastq -f PRIMER1 PRIMER2 PRIMER3 \
    -k MID VPRIMER CPRIMER --outname S43
# Deduplication and filtering
CollapseSeq.py -s S43_reheader.fastq -n 20 --inner --uf MID CPRIMER \
    --cf VPRIMER --act set --outname S43
SplitSeq.py group -s S43_collapse-unique.fastq -f DUPCOUNT --num 2 --outname S43
ParseHeaders.py table -s S43_atleast-2.fastq -f ID DUPCOUNT MID CPRIMER VPRIMER
# Convert the logs into tables
ParseLog.py -l FSL.log -f ID LENGTH
ParseLog.py -l FSQ.log -f ID QUALITY
ParseLog.py -l MPM.log MPV.log MPC.log -f ID PRSTART PRIMER ERROR



4.1.3 Quality control

The initial stage of the workflow involves two executions of the FilterSeq tool. First, the length subcommand is used to filter reads which are too short to yield full V(D)J sequences using a liberal minimum length requirement of 300 bp (-n 300):

FilterSeq.py length -s SRR765688.fastq -n 300 --outname S43 --log FSL.log

Next, the quality subcommand removes sequences having a mean Phred quality score below 20 (-q 20).

FilterSeq.py quality -s S43_length-pass.fastq -q 20 --outname S43 --log FSQ.log

The ParseLog tool is then used to extract the results from the two FilterSeq log files:

ParseLog.py -l FSL.log -f ID LENGTH
ParseLog.py -l FSQ.log -f ID QUALITY

To create two tab-delimited files containing the following results for each read:

Field     Description
ID        Sequence name
LENGTH    Sequence length
QUALITY   Quality score
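The two filtering criteria can be sketched in a few lines of Python. This is an illustration of the logic only, not the FilterSeq implementation: the function names are hypothetical, and the sketch assumes quality strings use the common Phred+33 (Sanger) encoding.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a FASTQ quality string (assumes Phred+33 encoding)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def passes_quality_control(sequence, quality, min_length=300, min_quality=20):
    """Hypothetical stand-in for the length (-n) and quality (-q) filters."""
    # Reads shorter than the minimum length are removed
    if len(sequence) < min_length:
        return False
    # Reads with a mean Phred score below the threshold are removed
    return mean_phred(quality) >= min_quality


print(mean_phred("II55"))  # 30.0
print(passes_quality_control("A" * 300, "5" * 300))  # True
print(passes_quality_control("A" * 299, "I" * 299))  # False
```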

4.1.4 Sample barcode and primer identification

Annotation of sample barcodes

Following the initial filtering steps, additional filtering is performed with three iterations of the MaskPrimers tool based upon the presence of recognized sample barcode (MID), forward primer, and reverse primer sequences. As the orientation and position of the sample barcode is known, the first pass through MaskPrimers uses the faster score subcommand which requires a fixed start position (--start 0) and a low allowable error rate (--maxerror 0.1):

MaskPrimers.py score -s S43_quality-pass.fastq -p SRR765688_MIDs.fasta \
    --start 0 --maxerror 0.1 --mode cut --outname S43-MID --log MPM.log
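The idea behind matching at a fixed start position can be sketched as follows. This is a simplified illustration, not the MaskPrimers code: the function names and barcode sequences are made up, the error rate is assumed to be the fraction of mismatched positions over the barcode length, and cutting is assumed to remove the barcode together with any bases preceding it.

```python
def score_primer(sequence, primer, start=0):
    """Fraction of mismatches between the primer and the read at a fixed offset."""
    window = sequence[start:start + len(primer)]
    if len(window) < len(primer):
        return 1.0  # read too short to contain the primer at this position
    mismatches = sum(1 for a, b in zip(window, primer) if a != b)
    return mismatches / len(primer)


def best_primer(sequence, primers, start=0, max_error=0.1):
    """Return (name, error) of the best match, or (None, error) if above max_error."""
    best_name, best_error = None, None
    for name, primer in primers.items():
        error = score_primer(sequence, primer, start)
        if best_error is None or error < best_error:
            best_name, best_error = name, error
    if best_error is not None and best_error <= max_error:
        return best_name, best_error
    return None, best_error


def cut_primer(sequence, primer, start=0):
    """Assumed behavior of cutting: drop the primer and everything before it."""
    return sequence[start + len(primer):]


# Hypothetical sample barcodes, for illustration only
mids = {"MID1": "ACGTACGTAC", "MID2": "TTGGCCAATT"}
read = "ACGTACGTAAGGGTTT"
print(best_primer(read, mids))        # ('MID1', 0.1)
print(cut_primer(read, mids["MID1"])) # GGGTTT
```

Because no alignment is performed, the cost is a single comparison per barcode, which is why a fixed-position match is cheaper than a local alignment when the position is known.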

Primer masking and annotation

The next MaskPrimers task uses the align subcommand to both identify the start position of the V-region primer and correct the orientation of the sequence, such that all reads are now oriented in the direction of the V(D)J reading frame, as determined by the orientation of the V-region primer match:

MaskPrimers.py align -s S43-MID_primers-pass.fastq -p SRX190717_VPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --mode mask --outname S43-FWD --log MPV.log

The final MaskPrimers task locates the C-region primer, which is used for isotype assignment of each read. As all sequences are assumed to have been properly oriented by the second MaskPrimers task, the additional arguments --revpr and --skiprc are added to the third execution. The --revpr argument informs the tool that primer sequences should be reverse complemented prior to alignment, and that a match should be searched for (and cut from) the tail end of the sequence. The --skiprc argument tells the tool to align against only the forward sequence; that is, it will not check primer matches against the reverse complement sequence and it will not reorient sequences:

MaskPrimers.py align -s S43-FWD_primers-pass.fastq -p SRX190717_CPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --revpr --skiprc --mode cut \
    --outname S43-REV --log MPC.log
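A rough Python sketch of reverse complementing a primer and cutting it from the tail of a read is given here. It is a simplification under stated assumptions: the align subcommand uses local alignment, whereas this sketch only slides an ungapped window over the last max_len bases; the function names and the example primer are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence):
    """Reverse complement of an uppercase DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def cut_tail_primer(sequence, primer, max_len=50, max_error=0.3):
    """Search the tail of the read for the reverse-complemented primer.

    Ungapped sliding-window stand-in for a local alignment. Returns the read
    with the primer and any trailing bases removed, or None if no window
    matches within max_error.
    """
    target = reverse_complement(primer)
    size = len(target)
    first = max(0, len(sequence) - max_len)
    best_error, best_pos = None, None
    for pos in range(first, len(sequence) - size + 1):
        window = sequence[pos:pos + size]
        error = sum(a != b for a, b in zip(window, target)) / size
        if best_error is None or error < best_error:
            best_error, best_pos = error, pos
    if best_error is None or best_error > max_error:
        return None
    return sequence[:best_pos]


read = "ATGCATGCAT" + "CCGGTT" + "AC"
print(reverse_complement("AACCGG"))     # CCGGTT
print(cut_tail_primer(read, "AACCGG"))  # ATGCATGCAT
```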

At this stage, a table of primers and alignment error rates may be generated by executing ParseLog on the log file of each MaskPrimers task:

ParseLog.py -l MPM.log MPV.log MPC.log -f ID PRSTART PRIMER ERROR

Which will contain the following information for each log file:

Field    Description
ID       Sequence name
PRIMER   Primer or sample barcode name
ERROR    Primer match error rate

Modifying the sample barcode and primer annotations

During each iteration of the MaskPrimers tool the PRIMER annotation field was updated with an additional value, such that after three iterations each sequence contains an annotation of the form:

PRIMER=sample barcode,V-region primer,C-region primer

To simplify later analysis, the ParseHeaders tool is used to first expand this single annotation into three separate annotations using the expand subcommand, which are then renamed to MID, VPRIMER, and CPRIMER using the rename subcommand:

ParseHeaders.py expand -s S43-REV_primers-pass.fastq -f PRIMER
ParseHeaders.py rename -s S43-REV*reheader.fastq -f PRIMER1 PRIMER2 PRIMER3 \
    -k MID VPRIMER CPRIMER --outname S43
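The effect of these two steps on a single set of annotations can be sketched in Python. The helpers expand_field and rename_fields are hypothetical and operate on a plain dict of field name to list of values; they only mimic the visible result of the expand and rename subcommands and ignore details such as renaming into a field that already exists. The annotation values are made up.

```python
def expand_field(fields, name):
    """Split a multi-valued field into numbered single-valued fields."""
    expanded = {}
    for key, values in fields.items():
        if key == name:
            # PRIMER=a,b,c becomes PRIMER1=a, PRIMER2=b, PRIMER3=c
            for index, value in enumerate(values, start=1):
                expanded["{}{}".format(name, index)] = [value]
        else:
            expanded[key] = values
    return expanded


def rename_fields(fields, old_names, new_names):
    """Rename fields pairwise, leaving all other fields untouched."""
    mapping = dict(zip(old_names, new_names))
    return {mapping.get(key, key): values for key, values in fields.items()}


# Hypothetical annotation after three MaskPrimers iterations
fields = {"ID": ["SEQ1"], "PRIMER": ["MID5", "IgHV-6", "IgHC-M"]}
fields = expand_field(fields, "PRIMER")
fields = rename_fields(fields,
                       ["PRIMER1", "PRIMER2", "PRIMER3"],
                       ["MID", "VPRIMER", "CPRIMER"])
print(fields)
# {'ID': ['SEQ1'], 'MID': ['MID5'], 'VPRIMER': ['IgHV-6'], 'CPRIMER': ['IgHC-M']}
```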

4.1.5 Deduplication and filtering

Removal of duplicate sequences

The final stage of the workflow involves two filtering steps to yield unique sequences for each sample barcode. First, the set of unique sequences is identified using the CollapseSeq tool, allowing for up to 20 interior N-valued positions (-n 20 and --inner), and requiring that all reads considered duplicated share the same isotype and sample barcode tag (--uf MID CPRIMER). Additionally, the V-region primer annotations of the set of duplicate reads are propagated into the annotation of each retained unique sequence (--cf VPRIMER and --act set):

CollapseSeq.py -s S43_reheader.fastq -n 20 --inner --uf MID CPRIMER \
    --cf VPRIMER --act set --outname S43
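The grouping logic can be sketched in Python as shown here. This is a deliberately reduced illustration: it treats only exactly identical sequences as duplicates and ignores the handling of N-valued positions (-n 20 and --inner), the function name collapse_exact is hypothetical, and the records are made-up (annotations, sequence) pairs.

```python
def collapse_exact(records, unique_fields=(), copy_fields=()):
    """Collapse exact duplicate sequences that agree on the unique_fields.

    records is a list of (fields, sequence) pairs, where fields maps a field
    name to a list of values. The first read of each group is retained, the
    set of copy_fields values seen in the group is attached to it, and the
    group size is stored in DUPCOUNT.
    """
    groups = {}
    for fields, sequence in records:
        key = (sequence,) + tuple(tuple(fields.get(f, ())) for f in unique_fields)
        if key not in groups:
            groups[key] = {"fields": dict(fields), "sequence": sequence,
                           "count": 0, "copied": {f: [] for f in copy_fields}}
        group = groups[key]
        group["count"] += 1
        for f in copy_fields:
            for value in fields.get(f, ()):
                # Keep each distinct value once, in order of first appearance
                if value not in group["copied"][f]:
                    group["copied"][f].append(value)
    collapsed = []
    for group in groups.values():
        out = dict(group["fields"])
        out.update(group["copied"])
        out["DUPCOUNT"] = [str(group["count"])]
        collapsed.append((out, group["sequence"]))
    return collapsed


reads = [
    ({"ID": ["r1"], "MID": ["M1"], "CPRIMER": ["IgM"], "VPRIMER": ["V1"]}, "ACGT"),
    ({"ID": ["r2"], "MID": ["M1"], "CPRIMER": ["IgM"], "VPRIMER": ["V2"]}, "ACGT"),
    ({"ID": ["r3"], "MID": ["M1"], "CPRIMER": ["IgG"], "VPRIMER": ["V1"]}, "ACGT"),
]
for fields, sequence in collapse_exact(reads, ("MID", "CPRIMER"), ("VPRIMER",)):
    print(fields["ID"][0], fields["DUPCOUNT"][0], fields["VPRIMER"])
# r1 2 ['V1', 'V2']
# r3 1 ['V1']
```

The third read has the same sequence but a different isotype, so it is kept as its own unique sequence.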

Filtering to repeated sequences

CollapseSeq stores the count of duplicate reads for each sequence in the DUPCOUNT annotation. Following duplicate removal, the data is subset to only those unique sequences with at least two representative reads by using the group subcommand of SplitSeq on the count field (-f DUPCOUNT) and specifying a numeric threshold (--num 2):

SplitSeq.py group -s S43_collapse-unique.fastq -f DUPCOUNT --num 2 --outname S43
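In code terms, the numeric split amounts to partitioning records around a threshold. The sketch below is illustrative only: split_by_threshold is a hypothetical helper, the atleast-2 label follows the output file name used in this workflow, and the under-2 label for the remaining records is an assumption.

```python
def split_by_threshold(records, field="DUPCOUNT", threshold=2):
    """Partition (fields, sequence) records by a numeric annotation."""
    at_least, under = [], []
    for fields, sequence in records:
        value = float(fields[field][0])
        if value >= threshold:
            at_least.append((fields, sequence))
        else:
            under.append((fields, sequence))
    return {"atleast-{}".format(threshold): at_least,
            "under-{}".format(threshold): under}  # label is an assumption


records = [
    ({"ID": ["a"], "DUPCOUNT": ["8"]}, "ACGT"),
    ({"ID": ["b"], "DUPCOUNT": ["1"]}, "TTGA"),
    ({"ID": ["c"], "DUPCOUNT": ["2"]}, "GGCC"),
]
split = split_by_threshold(records)
print([fields["ID"][0] for fields, _ in split["atleast-2"]])  # ['a', 'c']
print([fields["ID"][0] for fields, _ in split["under-2"]])    # ['b']
```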



Creating an annotation table

Finally, the annotations, including the sample barcode (MID), duplicate read count (DUPCOUNT), isotype (CPRIMER) and V-region primer (VPRIMER), of the final repertoire are then extracted from the SplitSeq output into a tab-delimited file using the table subcommand of ParseHeaders:

ParseHeaders.py table -s S43_atleast-2.fastq -f ID DUPCOUNT MID CPRIMER VPRIMER
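A minimal Python sketch of turning annotated headers into a tab-delimited table is shown here, using only the standard csv module. The function annotation_table and the example headers are hypothetical, the default reserved characters are assumed, and the exact layout of the file written by ParseHeaders may differ.

```python
import csv
import io


def annotation_table(headers, columns):
    """Render the requested annotation fields of each header as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for header in headers:
        parts = header.strip().lstrip(">@").split("|")
        fields = {"ID": parts[0]}
        for part in parts[1:]:
            name, _, value = part.partition("=")
            fields[name] = value
        # Missing fields are written as empty cells
        writer.writerow([fields.get(column, "") for column in columns])
    return buffer.getvalue()


headers = [
    "@SEQ1|MID=M1|VPRIMER=V1|CPRIMER=IgM|DUPCOUNT=3",
    "@SEQ2|MID=M2|VPRIMER=V4|CPRIMER=IgG|DUPCOUNT=2",
]
print(annotation_table(headers, ["ID", "DUPCOUNT", "MID", "CPRIMER", "VPRIMER"]))
```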

Note: Optionally, you may split each sample into separate files using the MID annotation and an alternate invocation of SplitSeq. The group subcommand may be used to split files on a categorical field, rather than a numerical field, by skipping the --num argument:

SplitSeq.py group -s S43_collapse-unique.fastq -f MID

This will split the unique sequence file into a set of separate files according to the value in the MID field (-f MID), such that each file will contain sequences from only one sample.
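The categorical case can be sketched the same way: records are bucketed by the value of the chosen field, and each bucket would then be written to its own file. The helper group_by_field and the sample records are hypothetical, and file naming is left out because it is not described here.

```python
def group_by_field(records, field="MID"):
    """Bucket (fields, sequence) records by the value of a categorical annotation."""
    groups = {}
    for fields, sequence in records:
        # Multi-valued fields are joined so that the key is a single string
        key = ",".join(fields.get(field, []))
        groups.setdefault(key, []).append((fields, sequence))
    return groups


records = [
    ({"ID": ["a"], "MID": ["M1"]}, "ACGT"),
    ({"ID": ["b"], "MID": ["M2"]}, "TTGA"),
    ({"ID": ["c"], "MID": ["M1"]}, "GGCC"),
]
for mid, members in group_by_field(records).items():
    print(mid, [fields["ID"][0] for fields, _ in members])
# M1 ['a', 'c']
# M2 ['b']
```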

4.1.6 Output files

The final set of sequences, which serve as input to a V(D)J reference aligner (e.g., IMGT/HighV-QUEST or IgBLAST), and tables that can be plotted for quality control are:

File                         Description
S43_collapse-unique.fastq    Total unique sequences
S43_atleast-2.fastq          Unique sequences represented by at least 2 reads
S43_atleast-2_headers.tab    Annotation table of the atleast-2 file
FSL_table.tab                Table of the FilterSeq-length log
FSQ_table.tab                Table of the FilterSeq-quality log
MPM_table.tab                Table of the MID MaskPrimers log
MPV_table.tab                Table of the V-region MaskPrimers log
MPC_table.tab                Table of the C-region MaskPrimers log

A number of other intermediate and log files are generated during the workflow, which allows easy tracking/reversion of processing steps. These files are not listed in the table above.

4.1.7 Performance

Example performance statistics for a comparable, but larger, 454 workflow are presented below. Performance was measured on a 64-core system with 2.3GHz AMD Opteron(TM) 6276 processors and 512GB of RAM, with memory usage measured at peak utilization. The data set contained 1,346,039 raw reads, and required matching of 11 sample barcodes, 11 V-segment primers, and 5 constant region primers.

Line  Tool                     Reads      Cores  MB     Minutes
01    FilterSeq.py length      1,346,039  10     669    12.2
02    FilterSeq.py quality     1,184,905  10     619    11.5
03    MaskPrimers.py score     1,184,219  10     621    18.5
04    MaskPrimers.py align     1,177,245  10     618    263.1
05    MaskPrimers.py align     959,920    10     563    122.9
07    ParseHeaders.py expand   548,658    1      329    3.6
08    ParseHeaders.py rename   548,658    1      329    4.1
09    CollapseSeq.py           548,658    1      1,509  5.6



4.2 Illumina MiSeq 2x250 BCR mRNA



4.2.1 Overview of Experimental Data

This example uses publicly available data from:

Quantitative assessment of the robustness of next-generation sequencing of antibody variable gene repertoires from immunized mice.
Greiff, V. et al.
BMC Immunol. 2014. 15(1):40. doi:10.1186/s12865-014-0040-5.

The data may be downloaded from the EBI European Nucleotide Archive under accession ID: ERP003950. For this example, we will use the first 25,000 sequences of sample Replicate-1-M1 (accession: ERR346600), which may be downloaded using fastq-dump from the SRA Toolkit:

fastq-dump --split-files -X 25000 ERR346600

Primer sequences are available in additional file 1 of the publication.

Read Configuration

Fig. 4.3: Schematic of Illumina MiSeq 2x250 stranded paired-end reads without UMI barcodes. Each 250 base-pair read was sequenced from one end of the target cDNA, so that the two reads together cover the entire variable region of the Ig heavy chain. The V(D)J reading frame proceeds from the start of read 2 to the start of read 1. Read 1 is in the opposite orientation (reverse complement), and contains the C-region primer sequence. Both reads begin with a random sequence of 4 nucleotides.

Example Data

We have hosted a small subset of the data (Accession: ERR346600) on the pRESTO website in FASTQ format,with accompanying primer files and an example workflow script. The sample data set and workflow script may bedownloaded from here:

Greiff et al, 2014 Example Files


In the following sections, we demonstrate each step of the workflow to move from raw sequence reads to a fullyannotated repertoire of complete V(D)J sequences. The workflow is divided into four high-level tasks:

1. Paired-end assembly

2. Quality control and primer annotation





http://clip.med.yale.edu/immcantation/examples/Greiff2014_Example.tar.gz


Fig. 4.4: Flowchart of processing steps. Each pRESTO tool is shown as a colored box. The workflow is divided intothree primary tasks: (orange) paired-end assembly, (green) quality control and primer annotation, and deduplicationand filtering (blue). The intermediate files output by each tool are not shown for the sake of brevity.

Flowchart

Commands

#!/usr/bin/env bash
# Paired-end assembly
AssemblePairs.py align -1 ERR346600_2.fastq -2 ERR346600_1.fastq \
    --coord sra --rc tail --outname M1 --log AP.log
# Quality control and primer annotation
FilterSeq.py quality -s M1_assemble-pass.fastq -q 20 --outname M1 --log FS.log
MaskPrimers.py score -s M1_quality-pass.fastq -p Greiff2014_VPrimers.fasta \
    --start 4 --mode mask --outname M1-FWD --log MPV.log
MaskPrimers.py score -s M1-FWD_primers-pass.fastq -p Greiff2014_CPrimers.fasta \
    --start 4 --mode cut --revpr --outname M1-REV --log MPC.log
ParseHeaders.py expand -s M1-REV_primers-pass.fastq -f PRIMER
ParseHeaders.py rename -s M1-REV_primers-pass_reheader.fastq -f PRIMER1 PRIMER2 \
    -k VPRIMER CPRIMER --outname M1
# Deduplication and filtering
CollapseSeq.py -s M1_reheader.fastq -n 20 --inner --uf CPRIMER \
    --cf VPRIMER --act set --outname M1
SplitSeq.py group -s M1_collapse-unique.fastq -f DUPCOUNT --num 2 --outname M1
# Annotation and log tables
ParseHeaders.py table -s M1_atleast-2.fastq -f ID DUPCOUNT CPRIMER VPRIMER
ParseLog.py -l AP.log -f ID LENGTH OVERLAP ERROR PVALUE
ParseLog.py -l FS.log -f ID QUALITY
ParseLog.py -l MPV.log MPC.log -f ID PRIMER ERROR


4.2.3 Paired-end assembly

Each set of paired-end mate-pairs is first assembled into a full length Ig sequence using the align subcommand of the AssemblePairs tool:

AssemblePairs.py align -1 ERR346600_2.fastq -2 ERR346600_1.fastq \
    --coord sra --rc tail --outname M1 --log AP.log

During assembly we have defined read 2 (V-region) as the head of the sequence (-1) and read 1 as the tail of the sequence (-2). The --coord argument defines the format of the sequence header so that AssemblePairs can properly identify mate-pairs; in this case, we use --coord sra as our headers are in the SRA/ENA format.

Note: For both the AssemblePairs and PairSeq commands, using the correct --coord argument is critical for matching mate-pairs. If this were raw data from Illumina, rather than data downloaded from SRA/ENA, then the appropriate argument would be --coord illumina.

The ParseLog tool is then used to build a tab-delimited file of results from the AssemblePairs log file:

ParseLog.py -l AP.log -f ID LENGTH OVERLAP ERROR PVALUE

Which will contain the following columns:



Field     Description
ID        Sequence name
LENGTH    Length of the assembled sequence
OVERLAP   Length of the overlap between mate-pairs
ERROR     Mismatch rate of the overlapping region
PVALUE    P-value for the assembly
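As a minimal sketch of how such a table might be inspected downstream, the following Python reads tab-delimited rows with the columns listed above and flags assemblies with a high overlap mismatch rate. The rows, the helper names and the 0.1 cutoff are illustrative assumptions, not output or defaults of pRESTO; in practice the rows would be read from the table file that ParseLog writes, and rows with missing values would need extra handling.

```python
import csv
import io

# Made-up rows standing in for the ParseLog table; real values will differ.
EXAMPLE_TABLE = (
    "ID\tLENGTH\tOVERLAP\tERROR\tPVALUE\n"
    "read_a\t412\t88\t0.0114\t1.2e-40\n"
    "read_b\t430\t70\t0.2571\t3.0e-02\n"
)


def load_assembly_table(handle):
    """Parse a tab-delimited assembly table into a list of typed records."""
    records = []
    for row in csv.DictReader(handle, delimiter="\t"):
        records.append({
            "ID": row["ID"],
            "LENGTH": int(row["LENGTH"]),
            "OVERLAP": int(row["OVERLAP"]),
            "ERROR": float(row["ERROR"]),
            "PVALUE": float(row["PVALUE"]),
        })
    return records


def high_error_ids(records, max_error=0.1):
    """Return the IDs whose overlap mismatch rate exceeds max_error."""
    return [record["ID"] for record in records if record["ERROR"] > max_error]


records = load_assembly_table(io.StringIO(EXAMPLE_TABLE))
flagged = high_error_ids(records)
```

Replacing the `io.StringIO` object with an open file handle would apply the same logic to a table on disk.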

See also:

Depending on the amplicon length in your data, not all mate-pairs may overlap. For the sake of simplicity, we have excluded a demonstration of assembly in such cases. pRESTO provides a couple of approaches to deal with such reads. The reference subcommand of AssemblePairs can use the ungapped V-region reference sequences to properly space non-overlapping reads. Or, if all else fails, the join subcommand can be used to simply stick mate-pairs together end-to-end with some intervening gap.

4.2.4 Quality control and primer annotation

Removal of low quality reads

Quality control begins with the identification and removal of low-quality reads using the quality subcommand of the FilterSeq tool. In this example, reads with mean Phred quality scores less than 20 (-q 20) are removed:

FilterSeq.py quality -s M1_assemble-pass.fastq -q 20 --outname M1 --log FS.log

The ParseLog tool is then used to build a tab-delimited file from the FilterSeq log:

ParseLog.py -l FS.log -f ID QUALITY

Capturing the following annotations:

Field     Description
ID        Sequence name
QUALITY   Quality score
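The filtering criterion described above can be illustrated with a short Python sketch. It assumes Phred+33 encoded quality strings and is not the FilterSeq implementation; the function names are our own.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    if not quality_string:
        return 0.0
    scores = [ord(char) - offset for char in quality_string]
    return sum(scores) / len(scores)


def passes_quality(quality_string, min_quality=20):
    """Keep a read unless its mean Phred score is below min_quality."""
    return mean_phred(quality_string) >= min_quality
```

A read whose mean score is exactly 20 is kept, matching the "less than 20 are removed" rule stated in the text.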

Read annotation and masking of primer regions

When dealing with Ig sequences, it is important to cut or mask the primers, as B cell receptors are subject to somatic hypermutation (the accumulation of point mutations in the DNA) and degenerate primer matches can look like mutations in downstream applications. The score subcommand of MaskPrimers is used to identify and remove the V-region and C-region PCR primers for both reads:

MaskPrimers.py score -s M1_quality-pass.fastq -p Greiff2014_VPrimers.fasta \
    --start 4 --mode mask --outname M1-FWD --log MPV.log
MaskPrimers.py score -s M1-FWD_primers-pass.fastq -p Greiff2014_CPrimers.fasta \
    --start 4 --mode cut --revpr --outname M1-REV --log MPC.log

In this data set the authors have added a random sequence of 4 bp to the start of each read before the primer sequence to increase sequence diversity and the reliability of cluster calling on the Illumina platform. As such, both primers begin at position 4 (--start 4), but the C-region primer begins 4 bases from the end of the assembled read. The addition of the --revpr argument to the second MaskPrimers step instructs the tool to reverse complement the primer sequences and check the tail of the read. The two primer regions have also been treated differently. The V-region primer has been masked (replaced by Ns) using the --mode mask argument to preserve the V(D)J length, while the C-region primer has been removed from the sequence using the --mode cut argument.
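To make the difference between the two modes concrete, here is a minimal Python sketch that applies either mode to a read whose primer sits at a known, fixed position at the head of the read. It skips the primer scoring step entirely, and it assumes (as we understand the behavior of MaskPrimers) that any sequence preceding the primer is discarded in both modes. The function name and the toy sequence are hypothetical.

```python
def apply_primer_mode(sequence, start, primer_length, mode):
    """Mask or cut a primer located at sequence[start:start + primer_length].

    Assumption: bases before the primer (for example the random 4 bp)
    are dropped in both modes.
    """
    end = start + primer_length
    if end > len(sequence):
        raise ValueError("primer region extends past the end of the read")
    remainder = sequence[end:]
    if mode == "mask":
        # Replace the primer with Ns so the V(D)J length is preserved.
        return "N" * primer_length + remainder
    if mode == "cut":
        # Remove the primer region entirely.
        return remainder
    raise ValueError("mode must be 'mask' or 'cut'")


read = "ACGT" + "GGGCCC" + "TTTTAAAA"  # random 4 bp + toy primer + insert
masked = apply_primer_mode(read, start=4, primer_length=6, mode="mask")
cut = apply_primer_mode(read, start=4, primer_length=6, mode="cut")
```

The C-region case handled by --revpr would apply the same idea to the tail of the read rather than the head.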



Note: This library was prepared in a stranded manner, meaning that the read orientation is constant for all reads: read 1 is always the C-region end of the amplicon and read 2 is always the V-region end. If your data is unstranded (50% of the reads are forward, 50% are reversed), then you must modify the first MaskPrimers step to account for this by using the align subcommand instead:

MaskPrimers.py align -s M1*quality-pass.fastq -p Greiff2014_VPrimers.fasta \
    --maxlen 30 --mode mask --log MP1.log

This will perform a slower process of locally aligning the primers, checking the reverse complement of each read for matches, and correcting the output sequences to the forward orientation (V to J).

During each iteration of the MaskPrimers tool the PRIMER annotation field was updated with an additional value, such that after both iterations each sequence contains an annotation of the form:

PRIMER=V-region primer,C-region primer

To simplify later analysis, the ParseHeaders tool is used to first expand this single annotation into two separate annotations using the expand subcommand. Then, the expanded fields are renamed to VPRIMER and CPRIMER using the rename subcommand:

ParseHeaders.py expand -s M1-REV_primers-pass.fastq -f PRIMER
ParseHeaders.py rename -s M1-REV_primers-pass_reheader.fastq -f PRIMER1 PRIMER2 \
    -k VPRIMER CPRIMER --outname M1
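The effect of these two header operations can be sketched in Python as follows. This is an illustration of the transformation on a single header string, not the ParseHeaders code; the primer names are hypothetical, and dropping the original PRIMER field after expansion is an assumption of the sketch.

```python
def parse_header(header):
    """Split 'ID|KEY=value|...' into the ID and a dict of annotation fields."""
    seq_id, *items = header.split("|")
    fields = dict(item.split("=", 1) for item in items)
    return seq_id, fields


def format_header(seq_id, fields):
    """Rebuild the header string from the ID and annotation fields."""
    return "|".join([seq_id] + [f"{key}={value}" for key, value in fields.items()])


def expand_field(fields, name, delimiter=","):
    """Replace NAME=a,b with NAME1=a and NAME2=b (original field dropped)."""
    expanded = {}
    for key, value in fields.items():
        if key == name:
            for index, part in enumerate(value.split(delimiter), start=1):
                expanded[f"{name}{index}"] = part
        else:
            expanded[key] = value
    return expanded


def rename_fields(fields, old_names, new_names):
    """Rename fields pairwise, keeping their order and values."""
    mapping = dict(zip(old_names, new_names))
    return {mapping.get(key, key): value for key, value in fields.items()}


seq_id, fields = parse_header("SEQ1|PRIMER=VH1,IGHG")  # hypothetical primer names
fields = expand_field(fields, "PRIMER")
fields = rename_fields(fields, ["PRIMER1", "PRIMER2"], ["VPRIMER", "CPRIMER"])
new_header = format_header(seq_id, fields)
```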

To summarize these steps, the ParseLog tool is used to build tab-delimited files from the two MaskPrimers logs:

ParseLog.py -l MPV.log MPC.log -f ID PRIMER ERROR

Capturing the following annotations:

Field     Description
ID        Sequence name
PRIMER    Primer name
ERROR     Primer match error rate



4.2.5 Deduplication and filtering

The last stage of the workflow involves two filtering steps to yield the final repertoire. First, the set of unique sequences is identified using the CollapseSeq tool, allowing for up to 20 interior N-valued positions (-n 20 and --inner), and requiring that all reads considered duplicates share the same C-region primer annotation (--uf CPRIMER). Additionally, the V-region primer annotations of the set of duplicate reads are propagated into the annotation of each retained unique sequence (--cf VPRIMER and --act set):

CollapseSeq.py -s M1_reheader.fastq -n 20 --inner --uf CPRIMER \
    --cf VPRIMER --act set --outname M1

Filtering to repeated sequences

CollapseSeq stores the count of duplicate reads for each sequence in the DUPCOUNT annotation. Following duplicate removal, the data is subset to only those unique sequences with at least two representative reads by using the group subcommand of SplitSeq on the count field (-f DUPCOUNT) and specifying a numeric threshold (--num 2):

SplitSeq.py group -s M1_collapse-unique.fastq -f DUPCOUNT --num 2 --outname M1
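The two filtering steps above can be sketched in Python. This is a simplified illustration under stated assumptions: duplicates are detected by exact sequence identity only (the handling of N-valued positions controlled by -n and --inner is left out), and the record layout and function names are hypothetical.

```python
def collapse_reads(reads):
    """Collapse identical sequences that share the same CPRIMER.

    reads is an iterable of (sequence, cprimer, vprimer) tuples. Each
    unique record keeps the set of VPRIMER values seen among its
    duplicates and a DUPCOUNT of how many reads were merged into it.
    """
    unique = {}
    for sequence, cprimer, vprimer in reads:
        entry = unique.setdefault(
            (sequence, cprimer),
            {"SEQUENCE": sequence, "CPRIMER": cprimer, "VPRIMER": set(), "DUPCOUNT": 0},
        )
        entry["VPRIMER"].add(vprimer)
        entry["DUPCOUNT"] += 1
    return list(unique.values())


def at_least(records, threshold=2):
    """Keep unique sequences supported by at least threshold reads."""
    return [record for record in records if record["DUPCOUNT"] >= threshold]


reads = [  # toy reads with hypothetical primer names
    ("ACGT", "IGHG", "V1"),
    ("ACGT", "IGHG", "V2"),
    ("ACGT", "IGHM", "V1"),
    ("TTTT", "IGHG", "V3"),
]
unique_records = collapse_reads(reads)
repeated = at_least(unique_records, threshold=2)
```

Note that the same sequence with a different C-region primer is kept as a separate record, which is the role of the --uf CPRIMER argument.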


Finally, the annotations of the final repertoire, including duplicate read count (DUPCOUNT), isotype (CPRIMER) and V-region primer (VPRIMER), are extracted from the SplitSeq output into a tab-delimited file using the table subcommand of ParseHeaders:

ParseHeaders.py table -s M1_atleast-2.fastq -f ID DUPCOUNT CPRIMER VPRIMER

4.2.6 Output files


File                       Description
M1_collapse-unique.fastq   Total unique sequences
M1_atleast-2.fastq         Unique sequences represented by at least 2 reads
M1_atleast-2_headers.tab   Annotation table of the atleast-2 file
AP_table.tab               Table of the AssemblePairs log
FS_table.tab               Table of the FilterSeq log
MPV_table.tab              Table of the V-region MaskPrimers log
MPC_table.tab              Table of the C-region MaskPrimers log


4.2.7 Performance

4.3 UMI Barcoded Illumina MiSeq 2x250 BCR mRNA



B cells populating the multiple sclerosis brain mature in the draining cervical lymph nodes.
Stern JNH, Yaari G, and Vander Heiden JA, et al.
Sci Transl Med. 2014. 6(248):248ra107. doi:10.1126/scitranslmed.3008879.

Which may be downloaded from the NCBI Sequence Read Archive under BioProject accession ID: PRJNA248475. For this example, we will use the first 25,000 sequences of sample M12 (accession: SRR1383456), which may be downloaded using fastq-dump from the SRA Toolkit:

fastq-dump --split-files -X 25000 SRR1383456

Primer sequences are available online at the supplemental website for the publication.



http://clip.med.yale.edu/papers/Stern2014STM


Fig. 4.5: Schematic of the Illumina MiSeq 2x250 paired-end reads with UMI barcodes. Each 250 base-pair read was sequenced from one end of the target cDNA, so that the two reads together cover the entire variable region of the Ig heavy chain. The V(D)J reading frame proceeds from the start of read 2 to the start of read 1. Read 1 is in the opposite orientation (reverse complement), and contains a 15 nucleotide UMI barcode preceding the C-region primer sequence.

Read Configuration

Example Data

We have hosted a small subset of the data (Accession: SRR1383456) on the pRESTO website in FASTQ format with accompanying primer files. The sample data set and workflow script may be downloaded from here:

Stern, Yaari and Vander Heiden et al, 2014 Example Files


In the following sections, we demonstrate each step of the workflow to move from raw sequence reads to a fully annotated repertoire of complete V(D)J sequences. The workflow is divided into four high-level tasks:

1. Quality control, UMI annotation and primer masking

2. Generation of UMI consensus sequences

3. Paired-end assembly of UMI consensus sequences

4. Deduplication and filtering



Flowchart

Fig. 4.6: Flowchart of processing steps. Each pRESTO tool is shown as a colored box. The workflow is divided into four primary tasks: (red) quality control, UMI annotation and primer masking; (orange) generation of UMI consensus sequences; (green) paired-end assembly of UMI consensus sequences; and (blue) deduplication and filtering to obtain the high-fidelity repertoire. Grey boxes indicate the initial and final data files. The intermediate files output by each tool are not shown for the sake of brevity.

Commands

#!/usr/bin/env bash
# Quality control, UMI annotation and primer masking
FilterSeq.py quality -s SRR1383456_1.fastq -q 20 --outname MS12_R1 --log FS1.log
FilterSeq.py quality -s SRR1383456_2.fastq -q 20 --outname MS12_R2 --log FS2.log
MaskPrimers.py score -s MS12_R1_quality-pass.fastq -p Stern2014_CPrimers.fasta \
    --start 15 --mode cut --barcode --outname MS12_R1 --log MP1.log
MaskPrimers.py score -s MS12_R2_quality-pass.fastq -p Stern2014_VPrimers.fasta \
    --start 0 --mode mask --outname MS12_R2 --log MP2.log
# Generation of UMI consensus sequences
PairSeq.py -1 MS12_R1_primers-pass.fastq -2 MS12_R2_primers-pass.fastq \
    --1f BARCODE --coord sra
BuildConsensus.py -s MS12_R1_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
    --prcons 0.6 --maxerror 0.1 --maxgap 0.5 --outname MS12_R1 --log BC1.log
BuildConsensus.py -s MS12_R2_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
    --maxerror 0.1 --maxgap 0.5 --outname MS12_R2 --log BC2.log
# Paired-end assembly of UMI consensus sequences
PairSeq.py -1 MS12_R1_consensus-pass.fastq -2 MS12_R2_consensus-pass.fastq \
    --coord presto
AssemblePairs.py align -1 MS12_R2_consensus-pass_pair-pass.fastq \
    -2 MS12_R1_consensus-pass_pair-pass.fastq --coord presto --rc tail \
    --1f CONSCOUNT --2f CONSCOUNT PRCONS --outname MS12 --log AP.log
# Deduplication and filtering
ParseHeaders.py collapse -s MS12_assemble-pass.fastq -f CONSCOUNT --act min
CollapseSeq.py -s MS12*reheader.fastq -n 20 --inner --uf PRCONS \
    --cf CONSCOUNT --act sum --outname MS12
SplitSeq.py group -s MS12_collapse-unique.fastq -f CONSCOUNT --num 2 --outname MS12
# Annotation and log tables
ParseHeaders.py table -s MS12_atleast-2.fastq -f ID PRCONS CONSCOUNT DUPCOUNT
ParseLog.py -l FS1.log FS2.log -f ID QUALITY
ParseLog.py -l MP1.log MP2.log -f ID PRIMER BARCODE ERROR
ParseLog.py -l BC1.log BC2.log -f BARCODE SEQCOUNT CONSCOUNT PRIMER PRCONS PRCOUNT \
    PRFREQ ERROR
ParseLog.py -l AP.log -f ID LENGTH OVERLAP ERROR PVALUE FIELDS1 FIELDS2

http://clip.med.yale.edu/immcantation/examples/Stern2014_Example.tar.gz


4.3.3 Quality control, UMI annotation and primer masking

Removal of low quality reads

Quality control begins with the identification and removal of low-quality reads using the quality subcommand of the FilterSeq tool. In this example, reads with mean Phred quality scores less than 20 (-q 20) are removed:

FilterSeq.py quality -s SRR1383456_1.fastq -q 20 --outname MS12_R1 --log FS1.log
FilterSeq.py quality -s SRR1383456_2.fastq -q 20 --outname MS12_R2 --log FS2.log

The ParseLog tool is then used to extract results from the FilterSeq logs into tab-delimited files:

ParseLog.py -l FS1.log FS2.log -f ID QUALITY

Extracting the following information from the log:

Field     Description
ID        Sequence name
QUALITY   Quality score

UMI annotation and masking of primer regions

Next, the score subcommand of MaskPrimers is used to identify and remove the PCR primers for both reads. When dealing with Ig sequences, it is important to cut or mask the primers, as B cell receptors are subject to somatic hypermutation (the accumulation of point mutations in the DNA) and degenerate primer matches can look like mutations in downstream applications. The MaskPrimers tool is also used to annotate each read 1 sequence with the 15 nucleotide UMI that precedes the C-region primer (MaskPrimers score --barcode):

MaskPrimers.py score -s MS12_R1_quality-pass.fastq -p Stern2014_CPrimers.fasta \
    --start 15 --mode cut --barcode --outname MS12_R1 --log MP1.log
MaskPrimers.py score -s MS12_R2_quality-pass.fastq -p Stern2014_VPrimers.fasta \
    --start 0 --mode mask --outname MS12_R2 --log MP2.log



To summarize these steps, the ParseLog tool is used to build a tab-delimited file from the MaskPrimers log:

ParseLog.py -l MP1.log MP2.log -f ID PRIMER BARCODE ERROR

Containing the following information:

Field     Description
ID        Sequence name
PRIMER    Primer name
BARCODE   UMI sequence
ERROR     Primer match error rate

Note: For this data set the UMI is immediately upstream of the C-region primer. Another common approach for UMI barcoding involves placing the UMI immediately upstream of a 5’RACE template switch site. Modifying the workflow is simple for this case. You just need to replace the V-region primers with a fasta file containing the TS sequences and move the --barcode argument to the appropriate read:

MaskPrimers.py score -s R1_quality-pass.fastq -p CPrimers.fasta \
    --start 0 --mode cut --outname R1 --log MP1.log

MaskPrimers.py score -s R2_quality-pass.fastq -p TSSites.fasta \
    --start 17 --barcode --mode cut --maxerror 0.5 \
    --outname R2 --log MP2.log

In the above we have moved the UMI annotation to read 2, increased the allowable error rate for matching the TS site (--maxerror 0.5), cut the TS site (--mode cut), and increased the size of the UMI from 15 to 17 nucleotides (--start 17).

4.3.4 Generation of UMI consensus sequences

Copying the UMI annotation across paired-end files

In this task, a single consensus sequence is constructed for each set of reads annotated with the same UMI barcode. As the UMI barcode is part of read 1, the BARCODE annotation identified by MaskPrimers must first be copied to the read 2 mate-pair of each read 1 sequence. Propagation of annotations between mate-pairs is performed using PairSeq, which also removes unpaired reads and ensures that paired reads are sorted in the same order across files:

PairSeq.py -1 MS12_R1_primers-pass.fastq -2 MS12_R2_primers-pass.fastq \
    --1f BARCODE --coord sra

Note: For both the PairSeq and AssemblePairs commands, using the correct --coord argument is critical for matching mate-pairs. If this were raw data from Illumina, rather than data downloaded from SRA/ENA, then the appropriate argument would be --coord illumina.

Note: If you have followed the 5’RACE modification above, then you must also modify the first PairSeq step to copy the UMI from read 2 to read 1, instead of vice versa (--2f BARCODE):

PairSeq.py -1 R1_primers-pass.fastq -2 R2_primers-pass.fastq \
    --2f BARCODE --coord sra
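The pairing logic described in this section can be sketched in Python as follows. The sketch works on in-memory dictionaries rather than FASTQ files, assumes the read IDs have already been reduced to a shared mate-pair identifier (the job of --coord), and assumes the copied field is absent from read 2. The function name, read IDs and annotation values are hypothetical.

```python
def pair_reads(reads_1, reads_2, copy_from_1=()):
    """Synchronize two read sets and copy annotations from read 1 to read 2.

    reads_1 and reads_2 map a mate-pair ID to a dict of annotations.
    Reads without a mate are dropped, and both outputs share one order.
    """
    paired_1, paired_2 = [], []
    for read_id, annotations_1 in reads_1.items():
        if read_id not in reads_2:
            continue  # unpaired read 1
        annotations_2 = dict(reads_2[read_id])
        for field in copy_from_1:
            if field in annotations_1:
                annotations_2[field] = annotations_1[field]
        paired_1.append((read_id, dict(annotations_1)))
        paired_2.append((read_id, annotations_2))
    return paired_1, paired_2


reads_1 = {
    "r1": {"PRIMER": "IGHG", "BARCODE": "AAACCCGGGTTTAAA"},
    "r2": {"PRIMER": "IGHM", "BARCODE": "TTTGGGCCCAAATTT"},
}
reads_2 = {
    "r3": {"PRIMER": "V2"},
    "r2": {"PRIMER": "V1"},
}
paired_1, paired_2 = pair_reads(reads_1, reads_2, copy_from_1=("BARCODE",))
```

Copying in the other direction, as in the 5’RACE note above, would swap the roles of the two inputs.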



Multiple alignment of UMI read groups

Before generating a consensus for a set of reads sharing a UMI barcode, the sequences must be properly aligned. Sequences may not be aligned if more than one PCR primer is identified in a UMI read group, leading to variations in the start positions of the reads. Ideally, each set of reads originating from a single mRNA molecule should be amplified with the same primer. However, different primers in the multiplex pool may be incorporated into the same UMI read group during amplification if the primers are sufficiently similar. This type of primer misalignment can be corrected using the AlignSets tool. In the example data used here, this step was not necessary due to the aligned primer design for the 45 V-region primers, though this does require that the V-region primers be masked, rather than cut, during the MaskPrimers step (--mode mask).

See also:

If your data requires alignment, then you can create multiple aligned UMI read groups as follows:

AlignSets.py muscle -s R1_primers-pass_pair-pass.fastq --bf BARCODE \
    --exec ~/bin/muscle --outname R1 --log AS1.log

AlignSets.py muscle -s R2_primers-pass_pair-pass.fastq --bf BARCODE \
    --exec ~/bin/muscle --outname R2 --log AS2.log

Where --bf BARCODE defines the field containing the UMI and --exec ~/bin/muscle is the location of the MUSCLE executable.

For additional details see the section on fixing UMI alignments.

Generating UMI consensus reads

After alignment, a single consensus sequence is generated for each UMI barcode using BuildConsensus:

BuildConsensus.py -s MS12_R1_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
    --prcons 0.6 --maxerror 0.1 --maxgap 0.5 --outname MS12_R1 --log BC1.log
BuildConsensus.py -s MS12_R2_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
    --maxerror 0.1 --maxgap 0.5 --outname MS12_R2 --log BC2.log

To correct for UMI chemistry and sequencing errors, UMI read groups having high error statistics (mismatch rate from consensus) are removed by specifying the --maxerror 0.1 threshold. As the accuracy of the primer assignment in read 1 is critical for correct isotype identification, additional filtering of read 1 is carried out during this step. Specifying the --prcons 0.6 threshold: (a) removes individual sequences that do not share a common primer annotation with the majority of the set, (b) removes entire read groups which have ambiguous primer assignments, and (c) constructs a consensus primer assignment for each UMI.

Note: The --maxgap 0.5 argument tells BuildConsensus to use a majority rule to delete any gap positions which occur in more than 50% of the reads. The --maxgap argument is not really necessary for this example data set as we did not perform a multiple alignment of the UMI read groups. However, if you have performed an alignment, then use of --maxgap during consensus generation is highly recommended.
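The roles of the gap and error thresholds can be illustrated with a simplified majority-rule consensus in Python. This is a sketch under stated assumptions, not the BuildConsensus algorithm: it ignores quality scores, requires equal-length (already aligned) reads, and defines the error rate as the fraction of non-gap bases that differ from the consensus. All names are our own.

```python
from collections import Counter


def build_consensus(reads, max_gap=0.5, max_error=0.1, gap="-"):
    """Return (consensus, error_rate) for one UMI read group.

    Positions where the gap fraction exceeds max_gap are deleted.
    The consensus is None when error_rate exceeds max_error.
    """
    if not reads or len({len(read) for read in reads}) != 1:
        raise ValueError("reads must be a non-empty list of equal-length strings")
    consensus = []
    mismatches = 0
    compared = 0
    for column in zip(*reads):
        bases = [base for base in column if base != gap]
        gap_fraction = 1 - len(bases) / len(column)
        if gap_fraction > max_gap or not bases:
            continue  # gap-dominated position is removed from the consensus
        majority, count = Counter(bases).most_common(1)[0]
        consensus.append(majority)
        mismatches += len(bases) - count
        compared += len(bases)
    error_rate = mismatches / compared if compared else 0.0
    if error_rate > max_error:
        return None, error_rate
    return "".join(consensus), error_rate


good_consensus, good_error = build_consensus(["ACGT-A", "ACGT-A", "ACGTTA", "ACCT-A"])
bad_consensus, bad_error = build_consensus(["AAAA", "CCCC", "GGGG"])
```

In the first group, the fifth position is a gap in three of four reads and is therefore dropped; the second group is rejected because its reads disagree at every position.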

The ParseLog tool is then used to build a tab-delimited file containing the consensus results:

ParseLog.py -l BC1.log BC2.log -f BARCODE SEQCOUNT CONSCOUNT PRIMER PRCONS PRCOUNT \
    PRFREQ ERROR

With the following annotations:

Field       Description
BARCODE     UMI sequence
SEQCOUNT    Number of total reads in the UMI group
CONSCOUNT   Number of reads used for the UMI consensus
PRIMER      Set of primer names in the UMI group
PRCONS      Consensus primer name
PRCOUNT     Count of primers in the UMI group
PRFREQ      Frequency of primers in the UMI group
ERROR       Average mismatch rate from consensus
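The consensus primer assignment controlled by --prcons 0.6 can be sketched in Python. The code below is an illustration of the three effects listed earlier (dropping minority-primer reads, dropping ambiguous groups, and assigning a consensus primer), with hypothetical function names and primer labels; it is not the BuildConsensus implementation.

```python
from collections import Counter


def consensus_primer(primers, min_freq=0.6):
    """Return (primer, count, frequency) for the most common primer.

    The primer is None when its frequency falls below min_freq.
    """
    if not primers:
        raise ValueError("empty UMI read group")
    primer, count = Counter(primers).most_common(1)[0]
    frequency = count / len(primers)
    return (primer if frequency >= min_freq else None), count, frequency


def filter_group(reads, min_freq=0.6):
    """Keep only reads carrying the consensus primer of their UMI group.

    reads is a list of (sequence, primer) tuples. An ambiguous group
    yields (None, []), meaning the whole group is discarded.
    """
    primer, _, _ = consensus_primer([p for _, p in reads], min_freq)
    if primer is None:
        return None, []
    return primer, [sequence for sequence, p in reads if p == primer]


clear_group = [("AAA", "IGHG"), ("AAT", "IGHG"), ("AAA", "IGHG"), ("CCC", "IGHM")]
ambiguous_group = [("AAA", "IGHG"), ("CCC", "IGHM")]
clear_result = filter_group(clear_group)
ambiguous_result = filter_group(ambiguous_group)
```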

4.3.5 Paired-end assembly of UMI consensus sequences

Synchronizing paired-end files

Following UMI consensus generation, the read 1 and read 2 files may again be out of sync due to differences in UMI read group filtering by BuildConsensus. To synchronize the reads another instance of PairSeq must be run, but without any annotation manipulation:

PairSeq.py -1 MS12_R1_consensus-pass.fastq -2 MS12_R2_consensus-pass.fastq \
    --coord presto

Assembling UMI consensus mate-pairs

Once the files have been synchronized, each paired-end UMI consensus sequence is assembled into a full length Ig sequence using the align subcommand of AssemblePairs:

AssemblePairs.py align -1 MS12_R2_consensus-pass_pair-pass.fastq \
    -2 MS12_R1_consensus-pass_pair-pass.fastq --coord presto --rc tail \
    --1f CONSCOUNT --2f CONSCOUNT PRCONS --outname MS12 --log AP.log

During assembly, the consensus isotype annotation (PRCONS) from read 1 and the number of reads used to define the consensus sequence (CONSCOUNT) for both reads are propagated into the annotations of the full length Ig sequence (--1f CONSCOUNT --2f CONSCOUNT PRCONS).

ParseLog is then used to extract the results from the AssemblePairs log into a tab-delimited file:

ParseLog.py -l AP.log -f ID LENGTH OVERLAP ERROR PVALUE FIELDS1 FIELDS2

Containing the following information:

Field     Description
ID        Sequence name (UMI)
LENGTH    Length of the assembled sequence
OVERLAP   Length of the overlap between mate-pairs
ERROR     Mismatch rate of the overlapping region
PVALUE    P-value for the assembly
FIELDS1   Annotations copied from read 2 into the assembled sequence
FIELDS2   Annotations copied from read 1 into the assembled sequence

See also:

Depending on the amplicon length in your data, not all mate-pairs may overlap. For the sake of simplicity, we have excluded a demonstration of assembly in such cases. pRESTO provides a couple of approaches to deal with such reads. The reference subcommand of AssemblePairs can use the ungapped V-region reference sequences to properly space non-overlapping reads. Or, if all else fails, the join subcommand can be used to simply stick mate-pairs together end-to-end with some intervening gap.


4.3.6 Deduplication and filtering

Combining UMI read group size annotations

In the final stage of the workflow, the high-fidelity Ig repertoire is obtained by a series of filtering steps. First, the annotation specifying the number of raw reads used to build each sequence (-f CONSCOUNT) is updated to be the minimum (--act min) of the forward and reverse reads using the collapse subcommand of ParseHeaders:

ParseHeaders.py collapse -s MS12_assemble-pass.fastq -f CONSCOUNT --act min
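As a rough illustration of what collapsing an annotation means, the Python sketch below reduces a comma-delimited field in a single header string to one value. The header layout shown (two CONSCOUNT values after assembly) and the function name are assumptions for the example; only the min, sum and cat actions mentioned in this document are covered.

```python
def collapse_field(header, field, action, delimiter=","):
    """Collapse the delimited values of one annotation field into one value."""
    actions = {
        "min": lambda values: str(min(int(value) for value in values)),
        "sum": lambda values: str(sum(int(value) for value in values)),
        "cat": lambda values: "".join(values),
    }
    seq_id, *items = header.split("|")
    collapsed = [seq_id]
    for item in items:
        key, value = item.split("=", 1)
        if key == field:
            value = actions[action](value.split(delimiter))
        collapsed.append(f"{key}={value}")
    return "|".join(collapsed)


# Hypothetical header of an assembled sequence with counts from both reads.
assembled = "UMI1|CONSCOUNT=5,3|PRCONS=IGHG"
collapsed_min = collapse_field(assembled, "CONSCOUNT", "min")
```

Taking the minimum is the conservative choice: the assembled sequence is credited only with the number of reads supporting its weaker half.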


Second, duplicate nucleotide sequences are removed using the CollapseSeq tool with the requirement that duplicate sequences share the same isotype primer (--uf PRCONS). The duplicate removal step also removes sequences with a high number of interior N-valued nucleotides (-n 20 and --inner) and combines the read counts for each UMI read group (--cf CONSCOUNT and --act sum).

CollapseSeq.py -s MS12*reheader.fastq -n 20 --inner --uf PRCONS \
    --cf CONSCOUNT --act sum --outname MS12

Filtering to sequences with at least two representative reads

Finally, unique sequences are filtered to those with at least 2 contributing sequences using the group subcommand of SplitSeq, by splitting the file on the CONSCOUNT annotation with a numeric threshold (-f CONSCOUNT and --num 2):

SplitSeq.py group -s MS12_collapse-unique.fastq -f CONSCOUNT --num 2 --outname MS12

For further analysis, the annotations of the final repertoire are then converted into a table using the table subcommand of ParseHeaders:

ParseHeaders.py table -s MS12_atleast-2.fastq -f ID PRCONS CONSCOUNT DUPCOUNT

4.3.7 Output files




File                         Description
MS12_collapse-unique.fastq   Total unique sequences
MS12_atleast-2.fastq         Unique sequences represented by at least 2 reads
MS12_atleast-2_headers.tab   Annotation table of the atleast-2 file
FS1_table.tab                Table of the read 1 FilterSeq log
FS2_table.tab                Table of the read 2 FilterSeq log
MP1_table.tab                Table of the C-region MaskPrimers log
MP2_table.tab                Table of the V-region MaskPrimers log
BC1_table.tab                Table of the read 1 BuildConsensus log
BC2_table.tab                Table of the read 2 BuildConsensus log
AP_table.tab                 Table of the AssemblePairs log


4.3.8 Performance

Example performance statistics for a comparable, but larger, MiSeq workflow are presented below. Performance was measured on a 64-core system with 2.3GHz AMD Opteron(TM) 6276 processors and 512GB of RAM, with memory usage measured at peak utilization. The data set contained 1,723,558 x 2 raw reads, and required matching of 1 constant region primer, 45 V-segment primers, and averaged 24.3 reads per UMI.

Line  Tool                      Reads      Cores  MB     Minutes
01    R1: FilterSeq.py quality  1,723,558  10     1,219  13.0
02    R2: FilterSeq.py quality  1,723,558  10     1,219  12.8
03    R1: MaskPrimers.py score  1,722,116  10     1,221  18.9
04    R2: MaskPrimers.py score  1,684,050  10     1,221  46.7
06    PairSeq.py                1,665,584  1      4,734  25.5
07    R1: BuildConsensus.py     1,565,017  10     1,228  50.1
08    R2: BuildConsensus.py     1,565,017  10     1,229  58.6
11    AssemblePairs.py align    66,285     10     358    3.0
13    ParseHeaders.py collapse  56,104     1      88     0.4
14    CollapseSeq.py            55,480     1      822    0.7
15    SplitSeq.py group         51,047     1      88     0.2




CHAPTER 5

Fixing UMI Problems

5.1 Correcting misaligned V-segment primers and indels in UMI groups

Before generating a consensus for a set of reads sharing a UMI barcode, the sequences must be properly aligned. Sequences may not be aligned if more than one PCR primer is identified in a UMI read group, leading to variations in the start positions of the reads. Ideally, each set of reads originating from a single mRNA molecule should be amplified with the same primer. However, different primers in the multiplex pool may be incorporated into the same UMI read group during amplification if the primers are sufficiently similar.

Fig. 5.1: Correction of misaligned sequences. (A) Discrepancies in the location of primer binding (colored bases, with primer name indicated to the left) may cause misalignment of sequences sharing a UMI. (B) Following multiple alignment of the reads the non-primer regions are correctly aligned and suitable for UMI consensus generation.

This type of primer misalignment can be corrected using one of two approaches using the AlignSets tool. The first approach, which is conceptually simpler but computationally more expensive, is to perform a full multiple alignment of each UMI read group using the muscle subcommand of AlignSets. The --bf BARCODE argument tells AlignSets to multiple align reads sharing the same BARCODE annotation. The --exec ~/bin/muscle argument is a pointer to where the MUSCLE executable is located:

AlignSets.py muscle -s reads.fastq --bf BARCODE --exec ~/bin/muscle

The above approach will also insert gaps into the sequences where an insertion/deletion has occurred in the reads. As such, you will need to provide a reasonable gap character threshold to BuildConsensus, such as --maxgap 0.5, defining how you want to handle positions with gap characters when generating a UMI consensus sequence.

Note: Using the muscle subcommand, along with the --maxgap argument to BuildConsensus, will also address issues with insertions/deletions in UMI read groups. However, in UMI read groups with a sufficient number of reads, consensus generation will resolve insertions/deletions without the need for multiple alignment, as any misaligned reads will simply be washed out by the majority. Whether to perform a multiple alignment prior to consensus generation is a matter of taste. A multiple alignment may improve consensus quality in small UMI read groups (e.g., fewer than 4 sequences), but the extent to which small UMI read groups should be trusted is debatable.

MUSCLE: http://www.drive5.com/muscle

The second approach will correct only the primer regions and will not address insertions/deletions within the sequence, but is much quicker to perform. The first step involves creation of a primer offset table using the table subcommand of AlignSets:

AlignSets.py table -p primers.fasta --exec ~/bin/muscle

Which performs a multiple alignment on sequences in primers.fasta (sequences shown in the primer alignment figure above) to generate a file containing a primer offset table:

Listing 5.1: primers_offsets-forward.tab

VP1    2
VP2    0
VP3    1

Then the offset table can be input into the offset subcommand of AlignSets to align the reads:

AlignSets.py offset -s reads.fastq -d primers_offsets-forward.tab \
    --bf BARCODE --pr VPRIMER --mode pad

In the above command we have specified the field containing the primer annotation using --pr VPRIMER and set the behavior of the tool to add gap characters to align the reads with the --mode pad argument. These options will generate the correction shown in (B) of the primer alignment figure above. Alternatively, we could have deleted unaligned positions using the argument --mode cut.
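The padding behavior can be sketched in Python using the offset table from Listing 5.1. The sketch assumes that padding simply prepends one gap character per offset position to each read, which is our reading of the description above rather than a statement about the AlignSets internals. The toy reads and function names are hypothetical.

```python
def parse_offsets(lines):
    """Parse 'PRIMER<whitespace>OFFSET' lines into a dict of integer offsets."""
    offsets = {}
    for line in lines:
        if line.strip():
            primer, offset = line.split()
            offsets[primer] = int(offset)
    return offsets


def pad_reads(reads, offsets, gap="-"):
    """Prepend gap characters to each read according to its primer offset.

    reads is a list of (sequence, primer_name) tuples from one UMI group.
    """
    return [gap * offsets[primer] + sequence for sequence, primer in reads]


offsets = parse_offsets(["VP1\t2", "VP2\t0", "VP3\t1"])
umi_group = [("GTACC", "VP1"), ("AAGTACC", "VP2"), ("AGTACC", "VP3")]
padded = pad_reads(umi_group, offsets)
```

After padding, the shared "GTACC" region occupies the same columns in all three reads, which is the precondition for building a consensus.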

Note: You may need to alter how the offset table is generated if you have used the --mode cut argument to MaskPrimers rather than --mode mask, as this will cause the ends of the primer regions, rather than the front, to be the cause of the ragged edges within the UMI read groups. For primers that have been cut you would add the --reverse argument to the table operation of AlignSets, which will create an offset table that is based on the tail end of the primers.

5.2 Dealing with insufficient UMI diversity

Due to errors in the UMI region and/or insufficient UMI length, UMI read groups are not always homogeneous with respect to the mRNA of origin. This can cause difficulties in generating a valid UMI consensus sequence. In most cases, the --prcons and --maxerror (or --maxdiv) arguments to BuildConsensus are sufficient to filter out invalid reads and/or entire invalid UMI groups. However, if there is significant nucleotide diversity within UMI groups due to insufficient UMI length or low UMI diversity, the ClusterSets tool can help correct for this. ClusterSets will cluster sequences by similarity and add an additional annotation dividing sequences within a UMI read group into sub-clusters:

ClusterSets.py -s reads.fastq -f BARCODE -k CLUSTER --exec ~/bin/usearch

The above command will add an annotation to each sequence named CLUSTER (-k CLUSTER) containing a cluster identifier for each sequence within the UMI barcode group. The -f BARCODE argument specifies the UMI annotation and --exec ~/bin/usearch is a pointer to where the USEARCH executable is located. After assigning cluster annotations via ClusterSets, the BARCODE and CLUSTER fields can be merged using the copy operation of ParseHeaders:




ParseHeaders.py copy -s reads_cluster-pass.fastq -f BARCODE -k CLUSTER --act cat

Which will copy the UMI annotation (-f BARCODE) into the cluster annotation (-k CLUSTER) and concatenate them together (--act cat). Thus converting the annotations from:

>SEQ1|BARCODE=ATGTCG|CLUSTER=1
>SEQ2|BARCODE=ATGTCG|CLUSTER=2

To:

>SEQ1|BARCODE=ATGTCG|CLUSTER=1ATGTCG
>SEQ2|BARCODE=ATGTCG|CLUSTER=2ATGTCG

You may then specify --bf CLUSTER to BuildConsensus to tell it to generate UMI consensus sequences by UMI sub-cluster, rather than by UMI barcode annotation.
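A small Python sketch shows why the two fields are merged before consensus generation: cluster identifiers are only meaningful within one UMI group, so grouping by the merged value separates sub-clusters without mixing reads from different UMIs. The records mirror the example annotations above, with one additional hypothetical read (SEQ3); the helper names are our own.

```python
from collections import defaultdict


def merge_cluster_and_barcode(fields):
    """Return a copy of fields with the BARCODE value appended to CLUSTER."""
    merged = dict(fields)
    merged["CLUSTER"] = fields["CLUSTER"] + fields["BARCODE"]
    return merged


def group_by(records, field):
    """Map each distinct value of field to the IDs of the reads carrying it."""
    groups = defaultdict(list)
    for record in records:
        groups[record[field]].append(record["ID"])
    return dict(groups)


records = [
    {"ID": "SEQ1", "BARCODE": "ATGTCG", "CLUSTER": "1"},
    {"ID": "SEQ2", "BARCODE": "ATGTCG", "CLUSTER": "2"},
    {"ID": "SEQ3", "BARCODE": "GGCATT", "CLUSTER": "1"},  # hypothetical extra read
]
merged = [merge_cluster_and_barcode(record) for record in records]
by_barcode = group_by(merged, "BARCODE")
by_cluster = group_by(merged, "CLUSTER")
```

Grouping on the unmerged CLUSTER value alone would wrongly place SEQ1 and SEQ3 together, even though they carry different UMIs.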

5.3 Combining split UMIs

Typically, a UMI barcode is attached to only one end of a paired-end mate-pair and can be copied to the other read by a simple invocation of PairSeq. But in some cases, the UMI may be split such that there are two UMIs, each located on a different mate-pair. To deal with these sorts of UMIs, you would first employ PairSeq similarly to how you would in the single UMI case:

PairSeq.py -1 reads-1.fastq -2 reads-2.fastq --1f BARCODE --2f BARCODE --coord illumina

The main difference from the single UMI case is that the BARCODE annotation is being simultaneously copied from read 1 to read 2 (--1f BARCODE) and from read 2 to read 1 (--2f BARCODE). This creates a set of annotations that look like:

>READ1|BARCODE=ATGTCGTT,GGCTAGTC
>READ2|BARCODE=ATGTCGTT,GGCTAGTC

These annotations can then be cleaned up using the collapse operation of ParseHeaders:

ParseHeaders.py collapse -s reads-[1-2]_pair-pass.fastq -f BARCODE --act cat

Which concatenates (--act cat) the two values in the BARCODE field (-f BARCODE), yielding UMI annotations suitable for input to BuildConsensus:

>READ1|BARCODE=ATGTCGTTGGCTAGTC
>READ2|BARCODE=ATGTCGTTGGCTAGTC

5.4 Estimating sequencing and PCR error rates with UMI data

The EstimateError tool provides methods for estimating the combined PCR and sequencing error rates from large UMI read groups. The underlying assumptions are that consensus sequences generated from sufficiently large UMI read groups are accurate representations of the true sequences, and that the rate of mismatches from consensus is therefore an accurate estimate of the error rate in the data. Neither assumption is guaranteed to hold, so this approach should only be considered an estimate of a data set's error profile. The following command generates an error profile from UMI read groups with 50 or more sequences (-n 50), using a majority rule consensus sequence (--mode freq), and excluding UMI read groups with high nucleotide diversity (--maxdiv 0.1):



EstimateError.py -s reads.fastq -n 50 --mode freq --maxdiv 0.1

This generates the following tab-delimited files containing error rates broken down by various criteria:

File                          Error profile
reads_error-position.tab      Error rates by read position
reads_error-quality.tab       Error rates by quality score
reads_error-nucleotide.tab    Error rates by nucleotide identity
reads_error-set.tab           Error rates by UMI read group size
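The idea behind the estimate, counting mismatches against a majority rule consensus within one UMI read group, can be sketched as below. This is a simplified illustration under stated assumptions: reads are already aligned and of equal length, quality scores are ignored, and the function names are hypothetical rather than part of pRESTO.

```python
from collections import Counter


def majority_consensus(reads):
    """Most common character at each position of equal-length reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))


def mismatch_rate(reads):
    """Fraction of observed bases that differ from the group consensus."""
    consensus = majority_consensus(reads)
    mismatches = sum(a != b for read in reads for a, b in zip(read, consensus))
    observations = sum(len(read) for read in reads)
    return mismatches / observations


group = ["ACGT", "ACGT", "ACGA", "ACGT"]
print(majority_consensus(group), mismatch_rate(group))
```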


CHAPTER 6

Miscellaneous Tasks

6.1 Importing data from SRA, ENA or GenBank into pRESTO

If you have downloaded a data set from GenBank, SRA or ENA, the format of the sequence headers is different from the raw Roche 454 and Illumina header formats. As such, they may or may not be compatible with pRESTO, depending on how the headers have been modified by the sequence archive. The ConvertHeaders tool allows you to change incompatible header formats into the pRESTO format. For example, to convert from SRA or ENA headers the sra subcommand would be used:

ConvertHeaders.py sra -s reads.fastq

ConvertHeaders provides the following conversion subcommands:

Subcommand    Formats Converted
generic       Headers with an unknown annotation system
454           Roche 454
genbank       NCBI GenBank and RefSeq
illumina      Illumina HiSeq or MiSeq
imgt          IMGT/GENE-DB
sra           NCBI SRA or EBI ENA

6.2 Reducing file size for submission to IMGT/HighV-QUEST

IMGT/HighV-QUEST currently limits the size of uploaded files to 500,000 sequences. To accommodate this limit, you can use the count subcommand of SplitSeq to divide your files into smaller pieces:

SplitSeq.py count -s reads.fastq -n 500000 --fasta

The -n 500000 argument sets the maximum number of sequences in each file and the --fasta tells the tool to output a FASTA, rather than FASTQ, formatted file.
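The splitting logic amounts to cutting a stream of records into consecutive chunks of at most n items. A minimal sketch, in which the chunk_records name and the use of integers as stand-in records are illustrative assumptions:

```python
from itertools import islice


def chunk_records(records, n):
    """Yield consecutive lists of at most n records from any iterable."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, n))
        if not chunk:
            return
        yield chunk


# 1.2 million stand-in records split with a 500,000 record limit
sizes = [len(chunk) for chunk in chunk_records(range(1_200_000), 500_000)]
print(sizes)
```

Only the last chunk may be smaller than the limit, so every output file satisfies the upload restriction.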


http://imgt.org/HighV-QUEST


Note: You can usually avoid the necessity of reducing file sizes by removing duplicate sequences first using the CollapseSeq tool.

6.3 Subsetting sequence files by annotation

The group subcommand of SplitSeq allows you to split one file into multiple files based on the values in a sequence annotation. For example, splitting one file with multiple SAMPLE annotations into separate files (one for each sample) would be accomplished by:

SplitSeq.py group -s reads.fastq -f SAMPLE

Which will create a set of files labelled SAMPLE-M1 and SAMPLE-M2, if samples are named M1 and M2.

If you wanted to split based on a numeric value, rather than a set of categorical values, then you would add the --num argument. SplitSeq would then create two files: one containing sequences with values less than the threshold specified by the --num argument and one file containing sequences with values greater than or equal to the threshold:

SplitSeq.py group -s reads.fastq -f DUPCOUNT --num 2

Which will create two files with the labels atleast-2 and under-2.
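The numeric split described above can be sketched as a simple partition. The record layout, a tuple of sequence ID and annotation dict, and the function name split_by_threshold are assumptions made for illustration. Only the atleast and under labels come from the text.

```python
def split_by_threshold(records, field, threshold):
    """Partition records by whether a numeric annotation reaches a threshold."""
    groups = {f"atleast-{threshold}": [], f"under-{threshold}": []}
    for seq_id, fields in records:
        label = "atleast" if float(fields[field]) >= threshold else "under"
        groups[f"{label}-{threshold}"].append(seq_id)
    return groups


records = [
    ("SEQ1", {"DUPCOUNT": "1"}),
    ("SEQ2", {"DUPCOUNT": "2"}),
    ("SEQ3", {"DUPCOUNT": "5"}),
]
groups = split_by_threshold(records, "DUPCOUNT", 2)
print(groups)
```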

6.4 Random sampling from sequence files

The sample subcommand of SplitSeq may be used to generate a random sample from a sequence file or set of paired-end files. The example below will select a random sample of 1,000 sequences (-n 1000) which all contain the annotation SAMPLE=M1 (-f SAMPLE and -u M1):

SplitSeq.py sample -s reads.fastq -f SAMPLE -u M1 -n 1000

Performing an analogous sampling of Illumina paired-end reads would be accomplished using the samplepair subcommand:

SplitSeq.py samplepair -s reads.fastq -f SAMPLE -u M1 -n 1000 --coord illumina

Note: Both the -f and -n arguments will accept a list of values (e.g., -n 1000 100 10), allowing you to sample multiple times from multiple files in one command.
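Conceptually, sampling by annotation means restricting the records to those with a matching field value and then drawing without replacement. A small sketch, assuming records are held in memory as tuples of sequence ID and annotation dict. The function name and the seed parameter are hypothetical and not options of SplitSeq.

```python
import random


def sample_by_annotation(records, field, value, n, seed=None):
    """Draw up to n records, without replacement, whose field equals value."""
    matching = [rec for rec in records if rec[1].get(field) == value]
    rng = random.Random(seed)
    return rng.sample(matching, min(n, len(matching)))


records = [(f"SEQ{i}", {"SAMPLE": "M1" if i % 2 else "M2"}) for i in range(100)]
picked = sample_by_annotation(records, "SAMPLE", "M1", 10, seed=1)
print([seq_id for seq_id, _ in picked])
```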

6.5 Cleaning or removing poor quality sequences

Data sets can be cleaned using one or more invocations of FilterSeq, which provides multiple sequence quality control operations. Four subcommands remove sequences from the data that fail to meet some threshold: sequence length (length), number of N or gap characters (missing), homopolymeric tract length (repeats), or mean Phred quality score (quality). Two subcommands modify sequences without removing them: trimqual truncates the sequences when the mean Phred quality score decays under a threshold, and maskqual replaces positions with low Phred quality scores with N characters.

FilterSeq provides the following quality control subcommands:



Subcommand    Operation
length        Removes short sequences
missing       Removes sequences with too many Ns or gaps
repeats       Removes sequences with long homopolymeric tracts
quality       Removes sequences with low mean quality scores
trimqual      Truncates sequences where quality scores decay
maskqual      Masks low quality positions
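To make the quality and maskqual operations concrete, here is a minimal sketch of a mean quality filter and of low quality masking. It assumes Phred+33 encoded FASTQ quality strings and treats a score equal to the threshold as passing. Both are assumptions, and the function names are hypothetical rather than pRESTO internals.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def passes_quality(quality, min_qual):
    """True if the mean Phred score of the read is at least min_qual."""
    scores = phred_scores(quality)
    return sum(scores) / len(scores) >= min_qual


def mask_low_quality(sequence, quality, min_qual):
    """Replace positions whose Phred score is below min_qual with N."""
    return "".join(
        base if score >= min_qual else "N"
        for base, score in zip(sequence, phred_scores(quality))
    )


print(passes_quality("II#I", 20), mask_low_quality("ACGT", "II#I", 20))
```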

6.6 Assembling paired-end reads that do not overlap

The typical way to assemble paired-end reads is via de novo assembly using the align subcommand of AssemblePairs. However, some sequences with long CDR3 regions may fail to assemble due to insufficient, or completely absent, overlap between the mate-pairs. The reference or sequential subcommands can be used to assemble mate-pairs that do not overlap using the ungapped V-segment reference sequences as a guide.

To handle such sequences in two separate steps, a normal align command would be performed first. The --failed argument is added so that the reads failing de novo alignment are output to separate files:

AssemblePairs.py align -1 reads-1.fastq -2 reads-2.fastq --rc tail \
    --coord illumina --failed --outname align

Then, the files labeled assemble-fail, along with the ungapped V-segment reference sequences (-r vref.fasta), would be input into the reference subcommand of AssemblePairs:

AssemblePairs.py reference -1 align-1_assemble-fail.fastq -2 align-2_assemble-fail.fastq \
    --rc tail -r vref.fasta --coord illumina --outname ref

This will result in two separate assemble-pass files, one from each step. You may process them separately or concatenate them together into a single file:

cat align_assemble-pass.fastq ref_assemble-pass.fastq > merged_assemble-pass.fastq

However, if you intend to process them together, you may simplify this by performing both steps using the sequential subcommand, which will attempt de novo assembly followed by reference guided assembly if de novo assembly fails:

AssemblePairs.py sequential -1 reads-1.fastq -2 reads-2.fastq --rc tail \
    --coord illumina -r vref.fasta

Note: The sequences output by the reference or sequential subcommands may contain an appropriate length spacer of Ns between any mate-pairs that do not overlap. The --fill argument may be specified to force AssemblePairs to insert the germline sequence into the missing positions, but this should be used with caution as the inserted sequence may not be biologically correct.

6.7 Assigning isotype annotations from the constant region sequence

MaskPrimers is usually used to remove primer regions and annotate sequences with primer identifiers. However, it can be used for any other case where you need to align a set of short sequences against the reads. One alternate use is where you either do not know the C-region primer sequences or do not trust the primer region to provide an accurate isotype assignment.

If you build a FASTA file containing the reverse-complement of short sequences from the front of CH-1, then you can annotate the reads with these sequences in the same way you would with C-region specific primers:

MaskPrimers.py align -s reads.fastq -p IGHC.fasta --maxlen 100 --maxerror 0.3 \
    --mode cut --revpr

Where --revpr tells MaskPrimers to reverse-complement the "primer" sequences and look for them at the end of the reads, --maxlen 100 restricts the search to the last 100 bp, --maxerror 0.3 allows for up to 30% mismatches, and -p IGHC.fasta specifies the file containing the CH-1 sequences. An example CH-1 sequence file would look like:

>IGHD
CTGATATGATGGGGAACACATCCGGAGCCTTGGTGGGTGC
>IGHM
AGGAGACGAGGGGGAAAAGGGTTGGGGCGGATGCACTCCC
>IGHG
AGGGYGCCAGGGGGAAGACSGATGGGCCCTTGGTGGAAGC
>IGHA
MGAGGCTCAGCGGGAAGACCTTGGGGCTGGTCGGGGATGC
>IGHE
AGCGGGTCAAGGGGAAGACGGATGGGCTCTGTGTGGAGGC


See also:

Constant region reference sequences may be downloaded from IMGT and the sequence headers can be reformatted using the imgt subcommand of ConvertHeaders. Note, you may need to clean up the reference sequences a bit before running ConvertHeaders if you receive an error about duplicate sequence names (e.g., remove duplicate alleles with different artificial splicing). To cut and reverse-complement the constant region sequences use something like seqmagick.
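If you only need the reverse-complement step, it can also be done with a few lines of Python. The sketch below is not part of pRESTO or seqmagick. It handles the IUPAC ambiguity codes that appear in constant region sequences such as the example file (Y, S, M), and the function name is a hypothetical choice.

```python
# Complement table covering the standard IUPAC nucleotide codes
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("AGGGYGCC"))
```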


http://imgt.org/vquest/refseqh.html

http://seqmagick.readthedocs.io

CHAPTER 7

Commandline Usage

7.1 AlignSets

Performs multiple alignment of input sequences by group

usage: AlignSets [--version] [-h] ...

--version
    show program's version number and exit

-h, --help
    show this help message and exit

output files:

align-pass multiple aligned reads.

align-fail raw reads failing multiple alignment.

offsets-forward 5’ offset table for input into offset subcommand.

offsets-reverse 3’ offset table for input into offset subcommand.

output annotation fields: None

7.1.1 AlignSets muscle

Align sequence sets using muscle.

usage: AlignSets muscle [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                        [--fasta] [--failed] [--log LOG_FILE]
                        [--delim DELIMITER DELIMITER DELIMITER]
                        [--nproc NPROC] [--outdir OUT_DIR]
                        [--outname OUT_NAME] [--bf BARCODE_FIELD] [--div]
                        [--exec ALIGNER_EXEC]





-s <seq_files>
    A list of FASTA/FASTQ files containing sequences to process.

--fasta
    Specify to force output as FASTA rather than FASTQ.

--failed
    If specified create files containing records that fail processing.

--log <log_file>
    Specify to write verbose logging to a file. May not be specified with multiple input files.

--delim <delimiter>
    A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>
    The number of simultaneous computational processes to execute (CPU cores to utilize).

--outdir <out_dir>
    Specify to change the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>
    Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--bf <barcode_field>
    The annotation field containing barcode labels for sequence grouping.

--div
    Specify to calculate nucleotide diversity of each set (average pairwise error rate).

--exec <aligner_exec>
    The name or location of the muscle executable.
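The phrase "average pairwise error rate" can be read as the mean fraction of mismatched positions over all pairs of reads in a set. The sketch below illustrates that reading only. How pRESTO treats gaps, N characters and reads of unequal length is not specified here, so the handling in this example is an assumption and the function name is hypothetical.

```python
from itertools import combinations


def pairwise_diversity(reads):
    """Mean mismatch fraction over all pairs of reads in a set."""
    pairs = list(combinations(reads, 2))
    if not pairs:
        return 0.0
    rates = [
        sum(a != b for a, b in zip(x, y)) / min(len(x), len(y))
        for x, y in pairs
    ]
    return sum(rates) / len(rates)


print(pairwise_diversity(["ACGT", "ACGT", "ACGA"]))
```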

7.1.2 AlignSets offset

Align sequence sets using predefined 5’ offset.

usage: AlignSets offset [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                        [--fasta] [--failed] [--log LOG_FILE]
                        [--delim DELIMITER DELIMITER DELIMITER]
                        [--nproc NPROC] [--outdir OUT_DIR]
                        [--outname OUT_NAME] [-d OFFSET_TABLE]
                        [--bf BARCODE_FIELD] [--pf PRIMER_FIELD]
                        [--mode {pad,cut}] [--div]













-d <offset_table>
    The tab delimited file of offset tags and values.

--bf <barcode_field>
    The annotation field containing barcode labels for sequence grouping.

--pf <primer_field>
    The primer field to use for offset assignment.

--mode {pad,cut}
    Specifies whether to align sequences by padding with gaps or by cutting the 5' sequence to a common start position.

--div
    Specify to calculate nucleotide diversity of each set (average pairwise error rate).

7.1.3 AlignSets table

Create a 5’ offset table by primer multiple alignment.

usage: AlignSets table [--version] [-h] [--failed]
                       [--delim DELIMITER DELIMITER DELIMITER]
                       [--outdir OUT_DIR] [--outname OUT_NAME] -p PRIMER_FILE
                       [--reverse] [--exec ALIGNER_EXEC]









-p <primer_file>
    A FASTA or REGEX file containing primer sequences

--reverse
    If specified create a 3' offset table instead

--exec <aligner_exec>
    The name or location of the muscle executable

7.2 AssemblePairs

Assembles paired-end reads into a single sequence

usage: AssemblePairs [--version] [-h] ...



output files:

assemble-pass successfully assembled reads.

assemble-fail raw reads failing paired-end assembly.

output annotation fields:

<user defined> annotation fields specified by the --1f or --2f arguments.

7.2.1 AssemblePairs align

Assemble pairs by aligning ends.

usage: AssemblePairs align [--version] [-h] -1 SEQ_FILES_1 [SEQ_FILES_1 ...]
                           -2 SEQ_FILES_2 [SEQ_FILES_2 ...] [--fasta]
                           [--failed] [--log LOG_FILE]
                           [--delim DELIMITER DELIMITER DELIMITER]
                           [--nproc NPROC] [--outdir OUT_DIR]
                           [--outname OUT_NAME]
                           [--coord {illumina,solexa,sra,454,presto}]
                           [--rc {head,tail,both}]
                           [--1f HEAD_FIELDS [HEAD_FIELDS ...]]
                           [--2f TAIL_FIELDS [TAIL_FIELDS ...]]
                           [--alpha ALPHA] [--maxerror MAX_ERROR]
                           [--minlen MIN_LEN] [--maxlen MAX_LEN] [--scanrev]



-1 <seq_files_1>
    An ordered list of FASTA/FASTQ files containing head/primary sequences.

-2 <seq_files_2>
    An ordered list of FASTA/FASTQ files containing tail/secondary sequences.








--coord {illumina,solexa,sra,454,presto}
    The format of the sequence identifier which defines shared coordinate information across paired ends.

--rc {head,tail,both}
    Specify to reverse complement sequences before stitching.

--1f <head_fields>
    Specify annotation fields to copy from head records into assembled record.

--2f <tail_fields>
    Specify annotation fields to copy from tail records into assembled record.

--alpha <alpha>
    Significance threshold for de novo paired-end assembly.

--maxerror <max_error>
    Maximum allowable error rate for de novo assembly.

--minlen <min_len>
    Minimum sequence length to scan for overlap in de novo assembly.

--maxlen <max_len>
    Maximum sequence length to scan for overlap in de novo assembly.

--scanrev
    If specified, scan past the end of the tail sequence in de novo assembly to allow the head sequence to overhang the end of the tail sequence.

7.2.2 AssemblePairs join

Assemble pairs by concatenating ends.

usage: AssemblePairs join [--version] [-h] -1 SEQ_FILES_1 [SEQ_FILES_1 ...]
                          -2 SEQ_FILES_2 [SEQ_FILES_2 ...] [--fasta] [--failed]
                          [--log LOG_FILE]
                          [--delim DELIMITER DELIMITER DELIMITER]
                          [--nproc NPROC] [--outdir OUT_DIR]
                          [--outname OUT_NAME]
                          [--coord {illumina,solexa,sra,454,presto}]
                          [--rc {head,tail,both}]
                          [--1f HEAD_FIELDS [HEAD_FIELDS ...]]
                          [--2f TAIL_FIELDS [TAIL_FIELDS ...]] [--gap GAP]


















--gap <gap>
    Number of N characters to place between ends.

7.2.3 AssemblePairs reference

Assemble pairs by aligning reads against a reference database.

usage: AssemblePairs reference [--version] [-h] -1 SEQ_FILES_1 [SEQ_FILES_1 ...]
                               -2 SEQ_FILES_2 [SEQ_FILES_2 ...]
                               [--fasta] [--failed] [--log LOG_FILE]
                               [--delim DELIMITER DELIMITER DELIMITER]
                               [--nproc NPROC] [--outdir OUT_DIR]
                               [--outname OUT_NAME]
                               [--coord {illumina,solexa,sra,454,presto}]
                               [--rc {head,tail,both}]
                               [--1f HEAD_FIELDS [HEAD_FIELDS ...]]
                               [--2f TAIL_FIELDS [TAIL_FIELDS ...]]
                               -r REF_FILE [--minident MIN_IDENT]
                               [--evalue EVALUE] [--maxhits MAX_HITS] [--fill]
                               [--aligner {blastn,usearch}]
                               [--exec ALIGNER_EXEC] [--dbexec DB_EXEC]


















-r <ref_file>
    A FASTA file containing the reference sequence database.

--minident <min_ident>
    Minimum identity of the assembled sequence required to call a valid reference guided assembly (between 0 and 1).

--evalue <evalue>
    Minimum E-value for reference alignment for both the head and tail sequence.

--maxhits <max_hits>
    Maximum number of hits from the reference alignment to check for matching head and tail sequence assignments.

--fill
    Specify to change the behavior of inserted characters when the head and tail sequences do not overlap during reference guided assembly. If specified, this will result in insertion of the V region reference sequence instead of a sequence of Ns in the non-overlapping region. Warning: you could end up making chimeric sequences by using this option.

--aligner {blastn,usearch}
    The local alignment tool to use. Must be one of blastn (blast+ nucleotide) or usearch (ublast algorithm).

--exec <aligner_exec>
    The name or location of the aligner executable file (blastn or usearch). Defaults to the name specified by the --aligner argument.

--dbexec <db_exec>
    The name or location of the executable file that builds the reference database. This defaults to makeblastdb when blastn is specified to the --aligner argument, and usearch when usearch is specified.

7.2.4 AssemblePairs sequential

Assemble pairs by first attempting de novo assembly, then reference guided assembly.

usage: AssemblePairs sequential [--version] [-h] -1 SEQ_FILES_1 [SEQ_FILES_1 ...]
                                -2 SEQ_FILES_2 [SEQ_FILES_2 ...]
                                [--fasta] [--failed] [--log LOG_FILE]
                                [--delim DELIMITER DELIMITER DELIMITER]
                                [--nproc NPROC] [--outdir OUT_DIR]
                                [--outname OUT_NAME]
                                [--coord {illumina,solexa,sra,454,presto}]
                                [--rc {head,tail,both}]
                                [--1f HEAD_FIELDS [HEAD_FIELDS ...]]
                                [--2f TAIL_FIELDS [TAIL_FIELDS ...]]
                                [--alpha ALPHA] [--maxerror MAX_ERROR]
                                [--minlen MIN_LEN] [--maxlen MAX_LEN]
                                [--scanrev] -r REF_FILE [--minident MIN_IDENT]
                                [--evalue EVALUE] [--maxhits MAX_HITS]
                                [--fill] [--aligner {blastn,usearch}]
                                [--exec ALIGNER_EXEC] [--dbexec DB_EXEC]
















--alpha <alpha>
    Significance threshold for de novo paired-end assembly.

--maxerror <max_error>
    Maximum allowable error rate for de novo assembly.

--minlen <min_len>
    Minimum sequence length to scan for overlap in de novo assembly.

--maxlen <max_len>
    Maximum sequence length to scan for overlap in de novo assembly.

--scanrev
    If specified, scan past the end of the tail sequence in de novo assembly to allow the head sequence to overhang the end of the tail sequence.

-r <ref_file>
    A FASTA file containing the reference sequence database.

--minident <min_ident>
    Minimum identity of the assembled sequence required to call a valid reference guided assembly (between 0 and 1).

--evalue <evalue>
    Minimum E-value for reference alignment for both the head and tail sequence.

--maxhits <max_hits>
    Maximum number of hits from the reference alignment to check for matching head and tail sequence assignments.

--fill
    Specify to change the behavior of inserted characters when the head and tail sequences do not overlap during reference guided assembly. If specified, this will result in insertion of the V region reference sequence instead of a sequence of Ns in the non-overlapping region. Warning: you could end up making chimeric sequences by using this option.

--aligner {blastn,usearch}
    The local alignment tool to use. Must be one of blastn (blast+ nucleotide) or usearch (ublast algorithm).

--exec <aligner_exec>
    The name or location of the aligner executable file (blastn or usearch). Defaults to the name specified by the --aligner argument.

--dbexec <db_exec>
    The name or location of the executable file that builds the reference database. This defaults to makeblastdb when blastn is specified to the --aligner argument, and usearch when usearch is specified.

7.3 BuildConsensus

Builds a consensus sequence for each set of input sequences

usage: BuildConsensus [--version] [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
                      [--failed] [--log LOG_FILE]
                      [--delim DELIMITER DELIMITER DELIMITER] [--nproc NPROC]
                      [--outdir OUT_DIR] [--outname OUT_NAME] [-n MIN_COUNT]
                      [--bf BARCODE_FIELD] [-q MIN_QUAL] [--freq MIN_FREQ]
                      [--maxgap MAX_GAP] [--pf PRIMER_FIELD]
                      [--prcons PRIMER_FREQ]
                      [--cf COPY_FIELDS [COPY_FIELDS ...]]
                      [--act {min,max,sum,set,majority} [{min,max,sum,set,majority} ...]]
                      [--dep] [--maxdiv MAX_DIVERSITY | --maxerror MAX_ERROR]













-n <min_count>
    The minimum number of sequences needed to define a valid consensus.

--bf <barcode_field>
    Position of description barcode field to group sequences by.

-q <min_qual>
    Consensus quality score cut-off under which an ambiguous character is assigned; does not apply when quality scores are unavailable.

--freq <min_freq>
    Fraction of character occurrences under which an ambiguous character is assigned.

--maxgap <max_gap>
    If specified, this defines a cut-off for the frequency of allowed gap values for each position. Positions exceeding the threshold are deleted from the consensus. If not defined, positions are always retained.

--pf <primer_field>
    Specifies the field name of the primer annotations

--prcons <primer_freq>
    Specify to define a minimum primer frequency required to assign a consensus primer, and filter out sequences with minority primers from the consensus building step.

--cf <copy_fields>
    Specifies a set of additional annotation fields to copy into the consensus sequence annotations.

--act {min,max,sum,set,majority}
    List of actions to take for each copy field which defines how each annotation will be combined into a single value. The actions "min", "max", "sum" perform the corresponding mathematical operation on numeric annotations. The action "set" combines annotations into a comma delimited list of unique values and adds an annotation named <FIELD>_COUNT specifying the count of each item in the set. The action "majority" assigns the most frequent annotation to the consensus annotation and adds an annotation named <FIELD>_FREQ specifying the frequency of the majority value.

--dep
    Specify to calculate consensus quality with a non-independence assumption

--maxdiv <max_diversity>
    Specify to calculate the nucleotide diversity of each read group (average pairwise error rate) and remove groups exceeding the given diversity threshold. Diversity is calculated for all positions within the read group, ignoring any character filtering imposed by the -q, --freq and --maxgap arguments. Mutually exclusive with --maxerror.

--maxerror <max_error>
    Specify to calculate the error rate of each read group (rate of mismatches from consensus) and remove groups exceeding the given error threshold. The error rate is calculated against the final consensus sequence, which may include masked positions due to the -q and --freq arguments and may have deleted positions due to the --maxgap argument. Mutually exclusive with --maxdiv.
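The --act actions described above can be pictured with a small Python sketch. It illustrates only the documented behaviour and is not pRESTO's code. The combine function is hypothetical, and the ordering of values produced by "set" (first appearance) is an assumption.

```python
from collections import Counter


def combine(values, action):
    """Combine the annotation values of one read group into a single result."""
    if action in ("min", "max", "sum"):
        func = {"min": min, "max": max, "sum": sum}[action]
        return func(float(v) for v in values)
    if action == "set":
        counts = Counter(values)
        # Unique values plus the matching <FIELD>_COUNT style list of counts
        return ",".join(counts), ",".join(str(c) for c in counts.values())
    if action == "majority":
        value, count = Counter(values).most_common(1)[0]
        # Majority value plus its <FIELD>_FREQ style frequency
        return value, count / len(values)
    raise ValueError(f"unknown action: {action}")


print(combine(["IGHM", "IGHG", "IGHM"], "set"))
print(combine(["IGHM", "IGHG", "IGHM"], "majority"))
```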

output files:

consensus-pass consensus reads.

consensus-fail raw reads failing consensus filtering criteria.


output annotation fields:

PRIMER a comma delimited list of unique primer annotations found within the barcode read group.

PRCOUNT a comma delimited list of the corresponding counts of unique primer annotations.

PRCONS the majority primer within the barcode read group.

PRFREQ the frequency of the majority primer.

CONSCOUNT the count of reads within the barcode read group which contributed to the consensus sequence. This is the total size of the read group, minus sequences excluded due to user defined filtering criteria.

7.4 ClusterSets

Cluster sequences by group

usage: ClusterSets [--version] [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
                   [--failed] [--log LOG_FILE]
                   [--delim DELIMITER DELIMITER DELIMITER] [--nproc NPROC]
                   [--outdir OUT_DIR] [--outname OUT_NAME] [-f BARCODE_FIELD]
                   [-k CLUSTER_FIELD] [--id IDENT] [--start SEQ_START]
                   [--end SEQ_END] [--exec CLUSTER_EXEC]













-f <barcode_field>
    The annotation field containing annotations, such as UID barcode, for sequence grouping.

-k <cluster_field>
    The name of the output annotation field to add with the cluster information for each sequence.

--id <ident>
    The sequence identity threshold for the uclust algorithm.

--start <seq_start>
    The start of the region to be used for clustering. Together with --end, this parameter can be used to specify a subsequence of each read to use in the clustering algorithm.

--end <seq_end>
    The end of the region to be used for clustering.

--exec <cluster_exec>
    The name or location of the usearch or vsearch executable.

output files:

cluster-pass clustered reads.

cluster-fail raw reads failing clustering.


output annotation fields:

CLUSTER a numeric cluster identifier defining the within-group cluster.

7.5 CollapseSeq

Removes duplicate sequences from FASTA/FASTQ files

usage: CollapseSeq [--version] [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
                   [--failed] [--log LOG_FILE]
                   [--delim DELIMITER DELIMITER DELIMITER] [--outdir OUT_DIR]
                   [--outname OUT_NAME] [-n MAX_MISSING]
                   [--uf UNIQ_FIELDS [UNIQ_FIELDS ...]]
                   [--cf COPY_FIELDS [COPY_FIELDS ...]]
                   [--act {min,max,sum,set} [{min,max,sum,set} ...]] [--inner]
                   [--keepmiss] [--maxf MAX_FIELD | --minf MIN_FIELD]










-n <max_missing>
    Maximum number of missing nucleotides to consider for collapsing sequences. A sequence will be considered undetermined if it contains too many missing nucleotides.

--uf <uniq_fields>
    Specifies a set of annotation fields that must match for sequences to be considered duplicates.

--cf <copy_fields>
    Specifies a set of annotation fields to copy into the unique sequence output.

--act {min,max,sum,set}
    List of actions to take for each copy field which defines how each annotation will be combined into a single value. The actions "min", "max", "sum" perform the corresponding mathematical operation on numeric annotations. The action "set" collapses annotations into a comma delimited list of unique values.

--inner
    If specified, exclude consecutive missing characters at either end of the sequence.

--keepmiss
    If specified, sequences with more missing characters than the threshold set by the -n parameter will be written to the unique sequence output file with a DUPCOUNT=1 annotation. If not specified, such sequences will be written to a separate file.

--maxf <max_field>
    Specify the field whose maximum value determines the retained sequence; mutually exclusive with --minf.

--minf <min_field>
    Specify the field whose minimum value determines the retained sequence; mutually exclusive with --maxf.



output files:

collapse-unique unique sequences. Contains one representative from each set of duplicate sequences. The retained representative is determined by user defined criteria.

collapse-duplicate raw reads which are duplicates of the sequences retained in the collapse-unique file.

collapse-undetermined raw reads which were excluded from consideration due to having too many N characters in the sequence.


output annotation fields:

DUPCOUNT total number of sequences within the set of duplicates for each retained unique sequence. That is, the copy number of each unique sequence within the data file.

<user defined> annotation fields specified by the --cf parameter.

7.6 ConvertHeaders

Converts sequence headers to the pRESTO format

usage: ConvertHeaders [--version] [-h] ...



output files:

convert-pass reads passing header conversion.

convert-fail raw reads failing header conversion.


output annotation fields:

<format defined> the annotation fields added are specific to the header format of the input file.

7.6.1 ConvertHeaders 454

Converts Roche 454 sequence headers.

usage: ConvertHeaders 454 [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                          [--fasta] [--failed]
                          [--delim DELIMITER DELIMITER DELIMITER]
                          [--outdir OUT_DIR] [--outname OUT_NAME]











7.6.2 ConvertHeaders genbank

Converts NCBI GenBank and RefSeq sequence headers.

usage: ConvertHeaders genbank [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                              [--fasta] [--failed]
                              [--delim DELIMITER DELIMITER DELIMITER]
                              [--outdir OUT_DIR] [--outname OUT_NAME]









7.6.3 ConvertHeaders generic

Converts sequence headers without a known annotation system.

usage: ConvertHeaders generic [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                              [--fasta] [--failed]
                              [--delim DELIMITER DELIMITER DELIMITER]
                              [--outdir OUT_DIR] [--outname OUT_NAME]









7.6.4 ConvertHeaders illumina

Converts Illumina sequence headers.

usage: ConvertHeaders illumina [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                               [--fasta] [--failed]
                               [--delim DELIMITER DELIMITER DELIMITER]
                               [--outdir OUT_DIR] [--outname OUT_NAME]











7.6.5 ConvertHeaders imgt

Converts sequence headers output by IMGT/GENE-DB.

usage: ConvertHeaders imgt [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                           [--fasta] [--failed]
                           [--delim DELIMITER DELIMITER DELIMITER]
                           [--outdir OUT_DIR] [--outname OUT_NAME] [--simple]









--simple
    If specified, only the allele name, and no other annotations, will appear in the converted sequence header.

7.6.6 ConvertHeaders sra

Converts NCBI SRA sequence headers.

usage: ConvertHeaders sra [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                          [--fasta] [--failed]
                          [--delim DELIMITER DELIMITER DELIMITER]
                          [--outdir OUT_DIR] [--outname OUT_NAME]











7.7 EstimateError

Calculates annotation set error rates

usage: EstimateError [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                     [--log LOG_FILE] [--delim DELIMITER DELIMITER DELIMITER]
                     [--nproc NPROC] [--outdir OUT_DIR] [--outname OUT_NAME]
                     [-f SET_FIELD] [-n MIN_COUNT] [--mode {freq,qual}]
                     [-q MIN_QUAL] [--freq MIN_FREQ] [--maxdiv MAX_DIVERSITY]











-f <set_field>
    The name of the annotation field to group sequences by

-n <min_count>
    The minimum number of sequences needed to consider a set

--mode {freq,qual}
    Specifies which method to use to determine the consensus sequence. The "freq" method will determine the consensus by nucleotide frequency at each position and assign the most common value. The "qual" method will weight values by their quality scores to determine the consensus nucleotide at each position.

-q <min_qual>
    Consensus quality score cut-off under which an ambiguous character is assigned.

--freq <min_freq>
    Fraction of character occurrences under which an ambiguous character is assigned.

--maxdiv <max_diversity>
    Specify to calculate the nucleotide diversity of each read group (average pairwise error rate) and exclude groups which exceed the given diversity threshold.

output files:

error-position estimated error by read position.

error-quality estimated error by the quality score assigned within the input file.

error-nucleotide estimated error by nucleotide.

error-set estimated error by barcode read group size.

output fields:

POSITION read position with base zero indexing.

Q Phred quality score.

OBSERVED observed nucleotide value.

REFERENCE consensus nucleotide for the barcode read group.

SET_COUNT barcode read group size.

REPORTED_Q mean Phred quality score reported within the input file for the given position, quality score, nucleotide or read group.

MISMATCHES count of observed mismatches from consensus for the given position, quality score, nucleotide or read group.

OBSERVATIONS total count of observed values for each position, quality score, nucleotide or read group size.

ERROR estimated error rate.

EMPIRICAL_Q estimated error rate converted to a Phred quality score.
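The relationship between the MISMATCHES, OBSERVATIONS, ERROR and EMPIRICAL_Q columns can be illustrated with the standard Phred definition, Q = -10 log10(P). The sketch below assumes that definition applies here. How pRESTO handles a zero error rate is not stated, so returning infinity is an assumption, and the function name is hypothetical.

```python
import math


def empirical_q(mismatches, observations):
    """Return the error rate and its Phred-scaled quality score."""
    error = mismatches / observations
    q = -10 * math.log10(error) if error > 0 else float("inf")
    return error, q


error, q = empirical_q(1, 1000)
print(error, round(q, 2))
```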

7.8 FilterSeq

Filters sequences in FASTA/FASTQ files



usage: FilterSeq [--version] [-h] ...



output files:

<command>-pass reads passing filtering operation and modified accordingly, where <command> is the name of the filtering operation that was run.

<command>-fail raw reads failing filtering criteria, where <command> is the name of the filtering operation.


7.8.1 FilterSeq length

Filters reads by length.

usage: FilterSeq length [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                        [--fasta] [--failed] [--log LOG_FILE] [--nproc NPROC]
                        [--outdir OUT_DIR] [--outname OUT_NAME]
                        [-n MIN_LENGTH] [--inner]










-n <min_length>
    Minimum sequence length to retain.

--inner
    If specified exclude consecutive missing characters at either end of the sequence.



7.8.2 FilterSeq maskqual

Masks low quality positions.

usage: FilterSeq maskqual [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                          [--fasta] [--failed] [--log LOG_FILE]
                          [--nproc NPROC] [--outdir OUT_DIR]
                          [--outname OUT_NAME] [-q MIN_QUAL]










-q <min_qual>Quality score threshold.
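As a rough illustration of quality masking, the sketch below replaces every base whose Phred score falls below a threshold with N. It is not the pRESTO implementation: the function name, the default threshold of 20, and the choice of a strict "less than" comparison are assumptions made for the example.

```python
def mask_low_quality(seq, quals, min_qual=20, mask_char="N"):
    """Replace positions with quality below min_qual by mask_char.

    seq   -- nucleotide string
    quals -- list of integer Phred scores, one per base
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    return "".join(
        mask_char if q < min_qual else base
        for base, q in zip(seq, quals)
    )


if __name__ == "__main__":
    print(mask_low_quality("ACGT", [30, 10, 25, 5]))
```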

7.8.3 FilterSeq missing

Filters reads by N or gap character count.

usage: FilterSeq missing [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                         [--fasta] [--failed] [--log LOG_FILE] [--nproc NPROC]
                         [--outdir OUT_DIR] [--outname OUT_NAME]
                         [-n MAX_MISSING] [--inner]

-n <max_missing>
    Threshold for fraction of gap or N nucleotides.


7.8.4 FilterSeq quality

Filters reads by quality score.

usage: FilterSeq quality [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                         [--fasta] [--failed] [--log LOG_FILE] [--nproc NPROC]
                         [--outdir OUT_DIR] [--outname OUT_NAME] [-q MIN_QUAL]
                         [--inner]














7.8.5 FilterSeq repeats

Filters reads by consecutive nucleotide repeats.

usage: FilterSeq repeats [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                         [--fasta] [--failed] [--log LOG_FILE] [--nproc NPROC]
                         [--outdir OUT_DIR] [--outname OUT_NAME]
                         [-n MAX_REPEAT] [--missing] [--inner]

-n <max_repeat>
    Threshold for fraction of repeating nucleotides.

--missing
    If specified, count consecutive gap and N characters in addition to {A,C,G,T}.




7.8.6 FilterSeq trimqual

Trims sequences by quality score decay.

usage: FilterSeq trimqual [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                          [--fasta] [--failed] [--log LOG_FILE]
                          [--nproc NPROC] [--outdir OUT_DIR]
                          [--outname OUT_NAME] [-q MIN_QUAL] [--win WINDOW]
                          [--reverse]

--win <window>
    Nucleotide window size for moving average calculation.

--reverse
    Specify to trim the head of the sequence rather than the tail.
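One way to picture trimming by quality score decay is a moving average over the quality scores, cutting the read at the first window whose mean drops below the threshold. The sketch below follows that idea; the exact windowing and cut-point rules used by pRESTO are not described here, so the function name, defaults, and cut rule are assumptions.

```python
def trim_by_quality(seq, quals, min_qual=20, window=10, reverse=False):
    """Cut a read where the windowed mean quality first falls below min_qual.

    With reverse=True the scan runs from the tail, so the head is trimmed.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    if reverse:
        seq, quals = seq[::-1], quals[::-1]
    end = len(seq)
    for start in range(len(seq)):
        win = quals[start:start + window]
        if sum(win) / len(win) < min_qual:
            # Assumed rule: discard everything from the start of the failing window.
            end = start
            break
    kept = seq[:end]
    return kept[::-1] if reverse else kept


if __name__ == "__main__":
    print(trim_by_quality("ACGTACGT", [30, 30, 30, 30, 5, 5, 5, 5], window=2))
```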

7.9 MaskPrimers

Removes primers and annotates sequences with primer and barcode identifiers

usage: MaskPrimers [--version] [-h] ...





output files:

mask-pass processed reads with successful primer matches.

mask-fail raw reads failing primer identification.


SEQORIENT the orientation of the output sequence. Either F (input) or RC (reverse complement of input).

PRIMER name of the best primer match.

BARCODE the sequence preceding the primer match. Only output when the --barcode flag is specified.

7.9.1 MaskPrimers align

Find primer matches using pairwise local alignment.

usage: MaskPrimers align [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                         [--fasta] [--failed] [--log LOG_FILE]
                         [--delim DELIMITER DELIMITER DELIMITER]
                         [--nproc NPROC] [--outdir OUT_DIR]
                         [--outname OUT_NAME] -p PRIMER_FILE
                         [--mode {cut,mask,trim,tag}] [--revpr] [--barcode]
                         [--maxerror MAX_ERROR] [--maxlen MAX_LEN] [--skiprc]
                         [--gap GAP_PENALTY GAP_PENALTY]

-p <primer_file>
    A FASTA or REGEX file containing primer sequences.

--mode {cut,mask,trim,tag}
    Specifies the action to take with the primer sequence. The “cut” mode will remove both the primer region and the preceding sequence. The “mask” mode will replace the primer region with Ns and remove the preceding sequence. The “trim” mode will remove the region preceding the primer, but leave the primer region intact. The “tag” mode will leave the input sequence unmodified.

--revpr
    Specify to match the tail-end of the sequence against the reverse complement of the primers. This also reverses the behavior of the --maxlen argument, such that the search window begins at the tail-end of the sequence.

--barcode
    Specify to encode sequences with barcode sequences (unique molecular identifiers) found preceding the primer region.

--maxerror <max_error>
    Maximum allowable error rate.

--maxlen <max_len>
    Length of the sequence window to scan for primers.

--skiprc
    Specify to prevent checking of sample reverse complement sequences.

--gap <gap_penalty>
    A list of two positive values defining the gap open and gap extension penalties for aligning the primers. Note: the error rate is calculated as the percentage of mismatches from the primer sequence with gap penalties reducing the match count accordingly; this may lead to error rates that differ from strict mismatch percentage when gaps are present in the alignment.
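The four --mode actions can be illustrated with a small sketch that takes a read and the coordinates of an already-located primer match. Only the forward orientation is shown (no --revpr handling), and the function names and the half-open (start, end) coordinate convention are assumptions made for the example rather than pRESTO internals.

```python
def apply_primer_mode(seq, start, end, mode="mask"):
    """Apply a MaskPrimers-style action to a read.

    start, end -- half-open interval of the primer match within seq
    """
    if mode == "cut":
        # Remove the primer region and everything before it.
        return seq[end:]
    if mode == "mask":
        # Replace the primer with Ns and drop the preceding sequence.
        return "N" * (end - start) + seq[end:]
    if mode == "trim":
        # Drop the preceding sequence but keep the primer.
        return seq[start:]
    if mode == "tag":
        # Leave the read untouched.
        return seq
    raise ValueError("unknown mode: %s" % mode)


def extract_barcode(seq, start):
    """Sequence preceding the primer match, as reported when --barcode is set."""
    return seq[:start]


if __name__ == "__main__":
    read = "AAAA" + "CCC" + "GGGGG"  # barcode + primer + insert
    for mode in ("cut", "mask", "trim", "tag"):
        print(mode, apply_primer_mode(read, 4, 7, mode))
    print("barcode", extract_barcode(read, 4))
```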

7.9.2 MaskPrimers score

Find primer matches by scoring primers at a fixed position.

usage: MaskPrimers score [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                         [--fasta] [--failed] [--log LOG_FILE]
                         [--delim DELIMITER DELIMITER DELIMITER]
                         [--nproc NPROC] [--outdir OUT_DIR]
                         [--outname OUT_NAME] -p PRIMER_FILE
                         [--mode {cut,mask,trim,tag}] [--revpr] [--barcode]
                         [--maxerror MAX_ERROR] [--start START]

-p <primer_file>
    A FASTA or REGEX file containing primer sequences.

--mode {cut,mask,trim,tag}
    Specifies the action to take with the primer sequence. The “cut” mode will remove both the primer region and the preceding sequence. The “mask” mode will replace the primer region with Ns and remove the preceding sequence. The “trim” mode will remove the region preceding the primer, but leave the primer region intact. The “tag” mode will leave the input sequence unmodified.

--revpr
    Specify to match the tail-end of the sequence against the reverse complement of the primers. This also reverses the behavior of the --maxlen argument, such that the search window begins at the tail-end of the sequence.

--barcode
    Specify to encode sequences with barcode sequences (unique molecular identifiers) found preceding the primer region.

--maxerror <max_error>
    Maximum allowable error rate.

--start <start>
    The starting position of the primer.

7.10 PairSeq

Sorts and matches sequence records with matching coordinates across files

usage: PairSeq [--version] [-h] -1 SEQ_FILES_1 [SEQ_FILES_1 ...] -2
               SEQ_FILES_2 [SEQ_FILES_2 ...] [--fasta] [--failed]
               [--delim DELIMITER DELIMITER DELIMITER] [--outdir OUT_DIR]
               [--outname OUT_NAME] [--1f FIELDS_1 [FIELDS_1 ...]]
               [--2f FIELDS_2 [FIELDS_2 ...]]
               [--coord {illumina,solexa,sra,454,presto}]

--1f <fields_1>
    The annotation fields to copy from file 1 records into file 2 records. If a copied annotation already exists in a file 2 record, then the annotations copied from file 1 will be added to the front of the existing annotation.

--2f <fields_2>
    The annotation fields to copy from file 2 records into file 1 records. If a copied annotation already exists in a file 1 record, then the annotations copied from file 2 will be added to the end of the existing annotation.

--coord {illumina,solexa,sra,454,presto}
    The format of the sequence identifier which defines shared coordinate information across mate pairs.

output files:

pair-pass successfully paired reads with modified annotations.

pair-fail raw reads that could not be assigned to a mate-pair.


<user defined> annotation fields specified by the --1f or --2f arguments.

7.11 ParseHeaders

Parses pRESTO annotations in FASTA/FASTQ sequence headers

usage: ParseHeaders [--version] [-h] ...



output files:

reheader-pass reads passing annotation operation and modified accordingly.

reheader-fail raw reads failing annotation operation.

headers tab delimited table of the selected annotations.




<user defined> annotation fields specified by the -f argument.

7.11.1 ParseHeaders add

Adds field/value pairs to header annotations

usage: ParseHeaders add [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                        [--fasta] [--failed]
                        [--delim DELIMITER DELIMITER DELIMITER]
                        [--outdir OUT_DIR] [--outname OUT_NAME] -f FIELDS
                        [FIELDS ...] -u VALUES [VALUES ...]

-f <fields>
    List of fields to add.

-u <values>
    List of values to add for each field.

7.11.2 ParseHeaders collapse

Collapses header annotations with multiple entries

usage: ParseHeaders collapse [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                             [--fasta] [--failed]
                             [--delim DELIMITER DELIMITER DELIMITER]
                             [--outdir OUT_DIR] [--outname OUT_NAME] -f FIELDS
                             [FIELDS ...] --act
                             {min,max,sum,first,last,set,cat}
                             [{min,max,sum,first,last,set,cat} ...]

-f <fields>
    List of fields to collapse.

--act {min,max,sum,first,last,set,cat}
    List of actions to take for each field defining how each annotation will be combined into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The actions “first” and “last” choose the value from the corresponding position in the annotation. The action “set” collapses annotations into a comma delimited list of unique values. The action “cat” concatenates the values together into a single string.
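The collapse actions described above can be sketched on a plain list of annotation values. This is an illustrative stand-in, not the pRESTO code: the function name is hypothetical, numeric results are returned as floats, and "set" is assumed to keep values in first-seen order.

```python
def collapse_values(values, action, sep=","):
    """Combine a list of annotation values into a single value."""
    if not values:
        raise ValueError("no values to collapse")
    if action in ("min", "max", "sum"):
        numbers = [float(v) for v in values]
        return {"min": min, "max": max, "sum": sum}[action](numbers)
    if action == "first":
        return values[0]
    if action == "last":
        return values[-1]
    if action == "set":
        # dict.fromkeys drops duplicates while preserving order (assumed ordering).
        return sep.join(dict.fromkeys(values))
    if action == "cat":
        return "".join(values)
    raise ValueError("unknown action: %s" % action)


if __name__ == "__main__":
    print(collapse_values("IGHM,IGHG,IGHM".split(","), "set"))
    print(collapse_values("3,1,2".split(","), "sum"))
```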

7.11.3 ParseHeaders copy

Copies header annotation fields

usage: ParseHeaders copy [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                         [--fasta] [--failed]
                         [--delim DELIMITER DELIMITER DELIMITER]
                         [--outdir OUT_DIR] [--outname OUT_NAME] -f FIELDS
                         [FIELDS ...] -k NAMES [NAMES ...]
                         [--act {min,max,sum,first,last,set,cat} [{min,max,sum,first,last,set,cat} ...]]

-f <fields>
    List of fields to copy.

-k <names>
    List of names for each copied field. If the new field is already present, the copied field will be merged into the existing field.

--act {min,max,sum,first,last,set,cat}
    List of collapse actions to take on each new field following the copy operation defining how each annotation will be combined into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The actions “first” and “last” choose the value from the corresponding position in the annotation. The action “set” collapses annotations into a comma delimited list of unique values. The action “cat” concatenates the values together into a single string.

7.11.4 ParseHeaders delete

Deletes fields from header annotations

usage: ParseHeaders delete [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                           [--fasta] [--failed]
                           [--delim DELIMITER DELIMITER DELIMITER]
                           [--outdir OUT_DIR] [--outname OUT_NAME] -f FIELDS
                           [FIELDS ...]

-f <fields>
    List of fields to delete.

7.11.5 ParseHeaders expand

Expands annotation fields with multiple values

usage: ParseHeaders expand [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                           [--fasta] [--failed]
                           [--delim DELIMITER DELIMITER DELIMITER]
                           [--outdir OUT_DIR] [--outname OUT_NAME] -f FIELDS
                           [FIELDS ...] [--sep SEPARATOR]

-f <fields>
    List of fields to expand.

--sep <separator>
    The character separating each value in the fields.

7.11.6 ParseHeaders rename

Renames header annotation fields



usage: ParseHeaders rename [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                           [--fasta] [--failed]
                           [--delim DELIMITER DELIMITER DELIMITER]
                           [--outdir OUT_DIR] [--outname OUT_NAME] -f FIELDS
                           [FIELDS ...] -k NAMES [NAMES ...]
                           [--act {min,max,sum,first,last,set,cat} [{min,max,sum,first,last,set,cat} ...]]

-f <fields>
    List of fields to rename.

-k <names>
    List of new names for each field. If the new field is already present, the renamed field will be merged into the existing field and the old field will be deleted.

--act {min,max,sum,first,last,set,cat}
    List of collapse actions to take on each new field following the rename operation defining how each annotation will be combined into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The actions “first” and “last” choose the value from the corresponding position in the annotation. The action “set” collapses annotations into a comma delimited list of unique values. The action “cat” concatenates the values together into a single string.

7.11.7 ParseHeaders table

Writes sequence headers to a table

usage: ParseHeaders table [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                          [--failed] [--delim DELIMITER DELIMITER DELIMITER]
                          [--outdir OUT_DIR] [--outname OUT_NAME] -f FIELDS
                          [FIELDS ...]

-f <fields>
    List of fields to collect. The sequence identifier may be specified using the hidden field name “ID”.

7.12 ParseLog

Parses records in the console log of pRESTO modules

usage: ParseLog [--version] [-h] [--delim DELIMITER DELIMITER DELIMITER]
                [--outdir OUT_DIR] [--outname OUT_NAME] -l RECORD_FILES
                [RECORD_FILES ...] -f FIELDS [FIELDS ...]

-l <record_files>
    List of log files to parse.

-f <fields>
    List of fields to collect. The sequence identifier may be specified using the hidden field name “ID”.

output files:



table tab delimited table of the selected annotations.


<user defined> annotation fields specified by the -f argument.

7.13 SplitSeq

Sorts, samples and splits FASTA/FASTQ sequence files

usage: SplitSeq [--version] [-h] ...



output files:

part<part> reads partitioned by count, where <part> is the partition number.

<field>-<value> reads partitioned by annotation <field> and <value>.

under-<number> reads partitioned by numeric threshold where the annotation value is strictly less than the threshold <number>.

atleast-<number> reads partitioned by numeric threshold where the annotation value is greater than or equal to the threshold <number>.

sorted reads sorted by annotation value.

sorted-part<part> reads sorted by annotation value and partitioned by count, where <part> is the partition number.

sample<i>-n<count> randomly sampled reads where <i> is a number specifying the sampling instance and <count> is the number of sampled reads.

selected reads passing selection criteria.


7.13.1 SplitSeq count

Splits sequence files by number of records.

usage: SplitSeq count [--version] [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
                      [--outdir OUT_DIR] [--outname OUT_NAME] -n MAX_COUNT

-n <max_count>
    Maximum number of sequences in each new file.

7.13.2 SplitSeq group

Splits sequence files by annotation.

usage: SplitSeq group [--version] [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
                      [--delim DELIMITER DELIMITER DELIMITER]
                      [--outdir OUT_DIR] [--outname OUT_NAME] -f FIELD
                      [--num THRESHOLD]

-f <field>
    Annotation field to split sequence files by.

--num <threshold>
    Specify to define the split field as numeric and group sequences by value.
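The numeric split performed with --num can be pictured as a two-way partition that matches the under-<number> and atleast-<number> output labels. The sketch below works on plain dictionaries standing in for parsed annotations; the function name and record representation are assumptions, not pRESTO internals.

```python
def split_by_threshold(records, field, threshold):
    """Partition records into values strictly below and at or above a threshold."""
    under, atleast = [], []
    for rec in records:
        target = under if float(rec[field]) < threshold else atleast
        target.append(rec)
    return {
        "under-%g" % threshold: under,
        "atleast-%g" % threshold: atleast,
    }


if __name__ == "__main__":
    recs = [
        {"ID": "a", "DUPCOUNT": "1"},
        {"ID": "b", "DUPCOUNT": "2"},
        {"ID": "c", "DUPCOUNT": "5"},
    ]
    for label, group in split_by_threshold(recs, "DUPCOUNT", 2).items():
        print(label, [r["ID"] for r in group])
```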

7.13.3 SplitSeq sample

Randomly samples from unpaired sequence files.

usage: SplitSeq sample [--version] [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
                       [--delim DELIMITER DELIMITER DELIMITER]
                       [--outdir OUT_DIR] [--outname OUT_NAME] -n MAX_COUNT
                       [MAX_COUNT ...] [-f FIELD] [-u VALUES [VALUES ...]]

-n <max_count>
    Maximum number of sequences to sample from each file, field or annotation set. The default behavior, without the -f argument, is to sample from the complete set of sequences in the input file.

-f <field>
    The annotation field for sampling criteria. If the -u argument is not also specified, then sampling will be performed for each unique annotation value in the declared field separately.

-u <values>
    If specified, sampling will be restricted to sequences that contain one of the declared annotation values in the specified field. Requires the -f argument.
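The three sampling behaviors described by -n, -f and -u can be sketched as follows. This is one reading of the option text, not the pRESTO implementation: records are plain dictionaries, the function name and result keys ("all", "selected") are hypothetical, and with -u the matching records are assumed to be pooled into a single sample.

```python
import random
from collections import defaultdict


def sample_records(records, max_count, field=None, values=None, seed=None):
    """Sample up to max_count records overall, per field value, or from a value subset."""
    if values is not None and field is None:
        raise ValueError("values requires field")
    rng = random.Random(seed)

    def draw(pool):
        return rng.sample(pool, min(max_count, len(pool)))

    if field is None:
        # No field: sample from the complete record set.
        return {"all": draw(list(records))}
    if values is not None:
        # Field and values: restrict to matching records (assumed to be pooled).
        wanted = set(values)
        return {"selected": draw([r for r in records if r.get(field) in wanted])}
    # Field only: sample each unique annotation value separately.
    groups = defaultdict(list)
    for rec in records:
        if field in rec:
            groups[rec[field]].append(rec)
    return {value: draw(pool) for value, pool in groups.items()}


if __name__ == "__main__":
    recs = [{"ID": "s%d" % i, "PRIMER": "A" if i % 2 == 0 else "B"} for i in range(10)]
    print(sample_records(recs, 2, field="PRIMER", seed=1))
```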

7.13.4 SplitSeq samplepair

Randomly samples from paired-end sequence files.

usage: SplitSeq samplepair [--version] [-h] -1 SEQ_FILES_1 [SEQ_FILES_1 ...]
                           -2 SEQ_FILES_2 [SEQ_FILES_2 ...] [--fasta]
                           [--delim DELIMITER DELIMITER DELIMITER]
                           [--outdir OUT_DIR] [--outname OUT_NAME] -n
                           MAX_COUNT [MAX_COUNT ...] [-f FIELD]
                           [-u VALUES [VALUES ...]]
                           [--coord {illumina,solexa,sra,454,presto}]

-n <max_count>
    Maximum number of paired sequences to sample from each set of files, fields or annotations. The default behavior, without the -f argument, is to sample from the complete set of paired sequences in the input files.

-f <field>
    The annotation field for sampling criteria. If the -u argument is not also specified, then sampling will be performed for each unique annotation value in the declared field separately.

-u <values>
    If specified, sampling will be restricted to sequences that contain one of the declared annotation values in the specified field. Requires the -f argument.

--coord {illumina,solexa,sra,454,presto}
    The format of the sequence identifier which defines shared coordinate information across paired read files.

7.13.5 SplitSeq select

Selects sequences from sequence files by annotation.

usage: SplitSeq select [--version] [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
                       [--delim DELIMITER DELIMITER DELIMITER]
                       [--outdir OUT_DIR] [--outname OUT_NAME] -f FIELD
                       [-u VALUE_LIST [VALUE_LIST ...] | -t VALUE_FILE]
                       [--not]

-f <field>
    The annotation field for selection criteria.

-u <value_list>
    A list of values to select for in the specified field. Mutually exclusive with -t.

-t <value_file>
    A tab delimited file specifying values to select for in the specified field. The file must be formatted with the given field name in the header row. Values will be taken from that column. Mutually exclusive with -u.

--not
    If specified, will perform negative matching: sequences will be selected only if they fail to match all of the specified values.

7.13.6 SplitSeq sort

Sorts sequence files by annotation.

usage: SplitSeq sort [--version] [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
                     [--delim DELIMITER DELIMITER DELIMITER]
                     [--outdir OUT_DIR] [--outname OUT_NAME] -f FIELD
                     [-n MAX_COUNT] [--num]

-f <field>
    The annotation field to sort sequences by.

-n <max_count>
    Maximum number of sequences in each new file.

--num
    Specify to define the sort field as numeric rather than textual.


CHAPTER 8

API

8.1 presto.Annotation

Annotation functions

presto.Annotation.annotationConsensus(seq_iter, field, delimiter=('|', '=', ','))

Calculate a consensus annotation for a set of sequences

Parameters

• seq_iter – an iterator or list of SeqRecord objects

• field – the annotation field to take a consensus of

• delimiter – a tuple of delimiters for (annotations, field/values, value lists)

Returns

Dictionary with keys set containing a list of unique annotation values, count containing annotation counts, cons containing the consensus annotation, freq containing the majority annotation frequency

Return type dict
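The shape of the returned dictionary can be illustrated with a small stand-in that takes a plain list of annotation values instead of SeqRecord objects. The function name and the tie-breaking rule (first value seen wins) are assumptions; only the four keys come from the description above.

```python
from collections import Counter


def annotation_consensus(values):
    """Summarize a list of annotation values as set, count, cons and freq."""
    if not values:
        raise ValueError("no annotation values given")
    counts = Counter(values)
    cons, top = counts.most_common(1)[0]
    return {
        "set": list(counts),                      # unique values, first-seen order
        "count": [counts[v] for v in counts],     # count for each unique value
        "cons": cons,                             # most frequent value
        "freq": top / len(values),                # frequency of the consensus value
    }


if __name__ == "__main__":
    print(annotation_consensus(["IGHM", "IGHM", "IGHG"]))
```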

presto.Annotation.collapseAnnotation(ann_dict, action, fields=None, delimiter=('|', '=', ','))

Collapses multiple annotations into new single annotations for each field

Parameters

• ann_dict – Dictionary of field/value pairs

• action – Collapse action to take; one of {min, max, sum, first, last, set, cat}

• fields – Subset of ann_dict to collapse; if None collapse all but the ID field

• delimiter – Tuple of delimiters for (fields, values, value lists)

Returns Modified field dictionary

Return type OrderedDict



presto.Annotation.flattenAnnotation(ann_dict, delimiter=('|', '=', ','))

Converts annotations from a dictionary to a FASTA/FASTQ sequence description

Parameters



Returns Formatted sequence description string

Return type str

presto.Annotation.getAnnotationValues(seq_iter, field, unique=False, delimiter=('|', '=', ','))

Gets the set of unique annotation values in a sequence set

Parameters

• seq_iter – Iterator or list of SeqRecord objects

• field – Annotation field to retrieve values for

• unique – If True return a list of only the unique values; if False return a list of all values


Returns List of values for the field

Return type list

presto.Annotation.getCoordKey(header, coord_type='presto', delimiter=('|', '=', ','))

Return the coordinate identifier for a sequence description

Parameters

• header – Sequence header string

• coord_type – Sequence header format; one of ['illumina', 'solexa', 'sra', '454', 'presto']; if unrecognized type or None return sequence ID.


Returns Coordinate identifier as a string

Return type str

presto.Annotation.mergeAnnotation(ann_dict_1, ann_dict_2, prepend=False, delimiter=('|', '=', ','))

Merges non-ID field annotations from one field dictionary into another

Parameters

• ann_dict_1 – Dictionary of field/value pairs to append to

• ann_dict_2 – Dictionary of field/value pairs to merge with ann_dict_1

• prepend – If True then add ann_dict_2 values to the front of any ann_dict_1 values thatare already present, rather than the default behavior of appending ann_dict_2 values.


Returns Modified ann_dict_1 dictionary of field/value pairs
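The append versus prepend behavior can be sketched with plain dictionaries. This is a simplified stand-in rather than the library code: the function name is hypothetical, it returns a copy instead of modifying its first argument, and merged values are assumed to be joined with the value-list delimiter.

```python
def merge_annotation(ann_1, ann_2, prepend=False, delimiter=("|", "=", ",")):
    """Merge the non-ID fields of ann_2 into a copy of ann_1."""
    merged = dict(ann_1)
    for key, value in ann_2.items():
        if key == "ID":
            continue
        if key in merged:
            # Existing field: place the new value before or after the old one.
            pair = [value, merged[key]] if prepend else [merged[key], value]
            merged[key] = delimiter[2].join(pair)
        else:
            merged[key] = value
    return merged


if __name__ == "__main__":
    read_1 = {"ID": "s1", "PRIMER": "V1"}
    read_2 = {"ID": "s1", "PRIMER": "C1", "BARCODE": "ACGT"}
    print(merge_annotation(read_1, read_2))
    print(merge_annotation(read_1, read_2, prepend=True))
```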


presto.Annotation.parseAnnotation(record, fields=None, delimiter=('|', '=', ','))

Extracts annotations from a FASTA/FASTQ sequence description




Parameters

• record – Description string to extract annotations from

• fields – List of fields to subset the return dictionary to; if None return all fields

• delimiter – a tuple of delimiters for (fields, values, value lists)

Returns An OrderedDict of field/value pairs
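A minimal sketch of how a header in the ID|FIELD=value style might be parsed and written back is shown below. It only mirrors the delimiter tuple described above; the function names are hypothetical, entries without a field/value delimiter are skipped, and the fields argument is assumed to keep only the named keys.

```python
from collections import OrderedDict


def parse_annotation(header, fields=None, delimiter=("|", "=", ",")):
    """Split a sequence description into an ordered field/value mapping."""
    parts = header.split(delimiter[0])
    ann = OrderedDict([("ID", parts[0])])
    for part in parts[1:]:
        if delimiter[1] not in part:
            continue  # assumed: ignore entries that are not field=value pairs
        key, value = part.split(delimiter[1], 1)
        ann[key] = value
    if fields is not None:
        ann = OrderedDict((k, v) for k, v in ann.items() if k in fields)
    return ann


def flatten_annotation(ann, delimiter=("|", "=", ",")):
    """Rebuild a description string, writing the ID first."""
    parts = [ann.get("ID", "")]
    parts += [k + delimiter[1] + str(v) for k, v in ann.items() if k != "ID"]
    return delimiter[0].join(parts)


if __name__ == "__main__":
    header = "SEQ1|PRIMER=IGHM|BARCODE=ACGT"
    parsed = parse_annotation(header)
    print(dict(parsed))
    print(flatten_annotation(parsed))
```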


presto.Annotation.renameAnnotation(ann_dict, old_field, new_field, delimiter=('|', '=', ','))

Renames an annotation and merges annotations if the new name already exists

Parameters


• old_field – Old field name

• new_field – New field name


Returns Modified fields dictionary


8.2 presto.Applications

External application wrappers

presto.Applications.makeBlastnDb(ref_file, db_exec='makeblastdb')

Makes a blastn database file

Parameters

• ref_file – the path to the reference database file

• db_exec – the path to the makeblastdb executable

Returns (name and location of the database, handle of the tempfile.TemporaryDirectory)

Return type tuple

presto.Applications.makeUBlastDb(ref_file, db_exec='usearch')

Makes a ublast database file

Parameters

• ref_file – path to the reference database file.

• db_exec – path to the usearch executable.

Returns (location of the database, handle of the tempfile.NamedTemporaryFile)

Return type tuple

presto.Applications.runBlastn(seq, database, evalue=1e-05, max_hits=100, aligner_exec='blastn')

Aligns a sequence against a reference database using BLASTN

Parameters

• seq – a list of SeqRecord objects to align.




• database – the path and name of the blastn database.

• evalue – the E-value cut-off.

• max_hits – the maximum number of hits returned.

• aligner_exec – the path to the blastn executable.

Returns Alignment results.

Return type pandas.DataFrame

presto.Applications.runMuscle(seq_list, aligner_exec='muscle')

Performs a multiple alignment of a set of sequences using MUSCLE

Parameters

• seq_list – a list of SeqRecord objects to align

• aligner_exec – the MUSCLE executable

Returns Multiple alignment results.

Return type Bio.Align.MultipleSeqAlignment

presto.Applications.runUBlast(seq, database, evalue=1e-05, max_hits=100, aligner_exec='usearch')

Aligns a sequence against a reference database using the usearch_local algorithm of USEARCH

Parameters

• seq – a list of SeqRecord objects to align.

• database – the path to the ublast database or a fasta file.

• evalue – the E-value cut-off.

• max_hits – the maximum number of hits returned.

• aligner_exec – the path to the usearch executable.

Returns Alignment results.

Return type pandas.DataFrame

presto.Applications.runUClust(seq_list, ident=0.9, seq_start=0, seq_end=None, cluster_exec='usearch')

Cluster a set of sequences using the UCLUST algorithm from USEARCH

Parameters

• seq_list – a list of SeqRecord objects to align.

• ident – the sequence identity cutoff to be passed to usearch.

• seq_start – the start position to trim sequences at before clustering.

• seq_end – the end position to trim sequences at before clustering.

• cluster_exec – the path to the usearch executable.

Returns {sequence id: cluster id}.

Return type dict




8.3 presto.Commandline

Commandline interface functions

class presto.Commandline.CommonHelpFormatter(prog, indent_increment=2,max_help_position=24, width=None)

Bases: argparse.RawDescriptionHelpFormatter, argparse.ArgumentDefaultsHelpFormatter

Custom argparse.HelpFormatter
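Combining the two argparse base classes listed above yields a formatter that keeps the description text as written and also appends default values to option help. The sketch below shows that general pattern with the standard library only; the class name, parser, and arguments are made up for the example and are not the pRESTO definitions.

```python
import argparse


class CombinedHelpFormatter(argparse.RawDescriptionHelpFormatter,
                            argparse.ArgumentDefaultsHelpFormatter):
    """Keep raw description formatting and show argument defaults."""
    pass


def build_parser():
    parser = argparse.ArgumentParser(
        prog="example",
        description="Line one.\n  Line two keeps its indentation.",
        formatter_class=CombinedHelpFormatter,
    )
    # The default is appended to the help text by ArgumentDefaultsHelpFormatter.
    parser.add_argument("--nproc", type=int, default=1,
                        help="Number of processes.")
    return parser


if __name__ == "__main__":
    print(build_parser().format_help())
```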

presto.Commandline.getCommonArgParser(seq_in=True, seq_out=True, paired=False,db_in=False, db_out=False, failed=True, log=True,annotation=True, multiproc=False, add_help=True)

Defines an ArgumentParser object with common pRESTO arguments

Parameters

• seq_in – If True include sequence input arguments

• seq_out – If True include sequence output arguments

• paired – If True define paired-end sequence input and output arguments

• db_in – If True include tab delimited database input arguments

• db_out – If True include tab delimited database output arguments

• failed – If True include arguments for output of failed results

• log – If True include log arguments

• annotation – If True include annotation arguments

• multiproc – If True include multiprocessing arguments

• add_help – If True add help and version arguments

Returns An ArgumentParser object

Return type ArgumentParser

presto.Commandline.parseCommonArgs(args, in_arg=None, in_types=None)

Checks common arguments from getCommonArgParser and transforms output options to a dictionary

Parameters

• args – Argument Namespace defined by ArgumentParser.parse_args

• in_arg – String defining a non-standard input file argument to verify; by default ['db_files', 'seq_files', 'seq_files_1', 'seq_files_2', 'primer_file'] are supported in that order

• in_types – List of types (file extensions as strings) to allow for files in in_arg; if None do not check type

Returns Dictionary copy of args with output arguments embedded in the dictionary out_args

Return type dict

8.4 presto.IO

File I/O and logging functions




presto.IO.countSeqFile(seq_file)

Counts the records in FASTA/FASTQ files

Parameters seq_file – FASTA or FASTQ file containing sample sequences

Returns Count of records in the sequence file

Return type int

presto.IO.countSeqSets(seq_file, field='BARCODE', delimiter=('|', '=', ','))

Identifies sets of sequences with the same ID field

Parameters

• seq_file – FASTA or FASTQ file containing sample sequences

• field – Annotation field containing set IDs


Returns Count of unique set IDs in the sequence file

Return type int

presto.IO.getFileType(filename)

Determines the type of a file by file extension

Parameters filename – Filename

Returns String defining the sequence type for SeqIO operations

Return type str

presto.IO.getOutputHandle(in_file, out_label=None, out_dir=None, out_name=None,out_type=None)

Opens an output file handle

Parameters

• in_file – Input filename

• out_label – Text to be inserted before the file extension; if None do not add a label

• out_type – the file extension of the output file; if None use input file extension

• out_dir – the output directory; if None use directory of input file

• out_name – the short filename to use for the output file; if None use input file short name

Returns File handle

Return type file

presto.IO.printLog(record, handle=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>, inset=None)

Formats a dictionary into an IgPipeline log string

Parameters

• record – a dict or OrderedDict of field names mapping to values

• handle – the file handle to write the log to; if None do not write to file

• inset – minimum field name inset; if None automatically space field names

Returns Formatted multi-line string in IgPipeline log format

Return type str




presto.IO.printMessage(message, start_time=None, end=False, width=20)

Prints a progress message to standard out

Parameters

• message – Current task message

• start_time – task start time returned by time.time(); if None do not add run time to progress

• end – If True print final message (add newline)

• width – Maximum number of characters for messages

Returns None

presto.IO.printProgress(current, total=None, step=None, start_time=None, end=False)

Prints a progress bar to standard out

Parameters

• current – Count of completed tasks

• total – Total task count; if None do not print percentage

• step – Float defining the fractional progress increment to print if total is defined; an int defining the progress increment to print at if total is not defined; if None always output the progress

• start_time – Task start time returned by time.time(); if None do not add run time to progress

• end – if True print final log (add newline)

Returns None

presto.IO.readPrimerFile(primer_file)

Processes primer sequences from file

Parameters primer_file – name of file containing primer sequences

Returns Dictionary mapping primer id to primer sequence

Return type dict

presto.IO.readReferenceFile(ref_file)

Create a dictionary of cleaned and ungapped reference sequences.

Parameters ref_file – reference sequences in fasta format.

Returns

cleaned and ungapped reference sequences; with the key as the sequence ID and value as a Bio.SeqRecord for each reference sequence.

Return type dict

presto.IO.readSeqFile(seq_file, index=False, key_func=None)

Reads FASTA/FASTQ files

Parameters

• seq_file – FASTA or FASTQ file containing sample sequences

• index – If True return a dictionary from SeqIO.index(); if False return an iterator from SeqIO.parse()

• key_func – the key_function argument to pass to SeqIO.index if index=True





Returns Tuple of (input file type, sequence record object)

Return type tuple

8.5 presto.Multiprocessing

Multiprocessing functions

class presto.Multiprocessing.SeqData(key, records)

Bases: object

A class defining sequence data objects for worker processes

class presto.Multiprocessing.SeqResult(key, records)

Bases: object

A class defining sequence result objects for collector processes

data_count

presto.Multiprocessing.collectSeqQueue(alive, result_queue, collect_queue, seq_file,task_label, out_args, index_field=None)

Pulls from results queue, assembles results and manages log and file IO

Parameters

• alive – a multiprocessing.Value boolean controlling whether processing continues; when False function returns

• result_queue – multiprocessing.Queue holding worker results

• collect_queue – multiprocessing.Queue to store collector return values

• seq_file – Sample sequence file name

• task_label – Task label used to tag the output files

• out_args – Common output argument dictionary from parseCommonArgs

• index_field – Field defining set membership for sequence sets; if None data queue contained individual records

Returns

Adds a dictionary with key value pairs to collect_queue containing 'log' defining a log object, 'out_files' defining the output file names

Return type None

presto.Multiprocessing.feedSeqQueue(alive, data_queue, seq_file, index_func=None, index_args={})

Feeds the data queue with SeqRecord objects

Parameters

• alive – multiprocessing.Value boolean controlling whether processing continues; when False function returns

• data_queue – multiprocessing.Queue to hold data for processing

• seq_file – Sequence file to read input from

• index_func – Function to use to define sequence sets; if None do not index sets and feed individual records



• index_args – Dictionary of arguments to pass to index_func

Returns None

presto.Multiprocessing.manageProcesses(feed_func, work_func, collect_func, feed_args={}, work_args={}, collect_args={}, nproc=None, queue_size=None)

Manages feeder, worker and collector processes

Parameters

• feed_func – Data Queue feeder function

• work_func – Worker function

• collect_func – Result Queue collector function

• feed_args – Dictionary of arguments to pass to feed_func

• work_args – Dictionary of arguments to pass to work_func

• collect_args – Dictionary of arguments to pass to collect_func

• nproc – Number of processQueue processes; if None defaults to the number of CPUs

• queue_size – Maximum size of the argument queue; if None defaults to 2*nproc

Returns Dictionary of collector results

Return type dict
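The feeder, worker and collector arrangement can be pictured with a small pipeline built on queues. To keep the example short and portable it uses threads and sentinel objects instead of separate processes and a shared alive flag, so it is a stand-in for the design rather than a reproduction of it; every name in it is hypothetical. The only detail borrowed from the description above is the default queue size of 2*nproc.

```python
import queue
import threading

_STOP = object()  # sentinel marking the end of the data stream


def feed(data_queue, items, n_workers):
    """Feeder: push every item, then one stop marker per worker."""
    for item in items:
        data_queue.put(item)
    for _ in range(n_workers):
        data_queue.put(_STOP)


def work(data_queue, result_queue, func):
    """Worker: apply func to each item until the stop marker arrives."""
    while True:
        item = data_queue.get()
        if item is _STOP:
            result_queue.put(_STOP)
            break
        result_queue.put(func(item))


def collect(result_queue, n_workers, out):
    """Collector: gather results until every worker has finished."""
    finished = 0
    while finished < n_workers:
        result = result_queue.get()
        if result is _STOP:
            finished += 1
        else:
            out.append(result)


def run_pipeline(items, func, n_workers=2, queue_size=None):
    queue_size = queue_size or 2 * n_workers
    data_queue = queue.Queue(maxsize=queue_size)
    result_queue = queue.Queue()
    out = []
    threads = [threading.Thread(target=feed, args=(data_queue, items, n_workers))]
    threads += [threading.Thread(target=work, args=(data_queue, result_queue, func))
                for _ in range(n_workers)]
    threads.append(threading.Thread(target=collect, args=(result_queue, n_workers, out)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


if __name__ == "__main__":
    print(sorted(run_pipeline(["acgt", "ttaa", "gg"], str.upper)))
```

Results arrive in completion order, so callers that need a stable order should sort or key them, as the usage line does.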

presto.Multiprocessing.processSeqQueue(alive, data_queue, result_queue, process_func, process_args={})

Pulls from data queue, performs calculations, and feeds results queue

Parameters

• alive – multiprocessing.Value boolean controlling whether processing continues; when False, the function returns

• data_queue – multiprocessing.Queue holding data to process

• result_queue – multiprocessing.Queue to hold processed results

• process_func – function to use for filtering sequences

• process_args – Dictionary of arguments to pass to process_func

Returns None
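The feeder, worker and collector roles described above can be illustrated with a minimal sketch. It is not the pRESTO implementation: it uses threads and `queue.Queue` from the standard library in place of `multiprocessing`, and the names `feed`, `work`, `collect`, `run_pipeline` and `SENTINEL` are hypothetical. The bounded data queue mirrors the documented default of 2*nproc.

```python
import queue
import threading

SENTINEL = object()  # hypothetical end-of-data marker, not part of pRESTO


def feed(data_queue, records, n_workers):
    """Feeder: push every record, then one sentinel per worker."""
    for rec in records:
        data_queue.put(rec)
    for _ in range(n_workers):
        data_queue.put(SENTINEL)


def work(data_queue, result_queue, process_func):
    """Worker: pull records, apply process_func, push results."""
    while True:
        rec = data_queue.get()
        if rec is SENTINEL:
            result_queue.put(SENTINEL)
            break
        result_queue.put(process_func(rec))


def collect(result_queue, n_workers):
    """Collector: gather results until every worker has signalled completion."""
    results, finished = [], 0
    while finished < n_workers:
        item = result_queue.get()
        if item is SENTINEL:
            finished += 1
        else:
            results.append(item)
    return results


def run_pipeline(records, process_func, nproc=2):
    """Wire feeder, workers and collector together with two queues."""
    data_queue = queue.Queue(maxsize=2 * nproc)  # bounded, as in the 2*nproc default
    result_queue = queue.Queue()
    threads = [threading.Thread(target=work,
                                args=(data_queue, result_queue, process_func))
               for _ in range(nproc)]
    threads.append(threading.Thread(target=feed,
                                    args=(data_queue, records, nproc)))
    for t in threads:
        t.start()
    results = collect(result_queue, nproc)
    for t in threads:
        t.join()
    return results


lengths = run_pipeline(["ACGT", "AC", "ACGTAC"], len)
```

Because several workers run concurrently, the order of `lengths` is not guaranteed to match the input order.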

8.6 presto.Sequence

Sequence processing functions

presto.Sequence.calculateDiversity(seq_list, score_dict=getDNAScoreDict())

Determine the average pairwise error rate for a list of sequences

Parameters

• seq_list – List of SeqRecord objects to score

• score_dict – Optional dictionary of alignment scores as {(char1, char2): score}

Returns Average pairwise error rate for the list of sequences

Return type float
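As a rough illustration of an average pairwise error rate, the sketch below works on plain strings of equal, non-zero length and counts simple character mismatches. The function names are hypothetical and the scoring is deliberately simpler than a score dictionary; it is not the library's code.

```python
from itertools import combinations


def pair_error_rate(a, b):
    """Fraction of positions at which two equal-length strings differ."""
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


def average_pairwise_error(seqs):
    """Mean error rate over all unordered pairs; 0.0 for fewer than two sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    return sum(pair_error_rate(a, b) for a, b in pairs) / len(pairs)


diversity = average_pairwise_error(["ACGT", "ACGA", "ACGT"])
```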


presto.Sequence.calculateSetError(seq_list, ref_seq, ignore_chars=['n', 'N'], score_dict=getDNAScoreDict())

Counts the occurrence of nucleotide mismatches from a reference in a set of sequences

Parameters

• seq_list – list of SeqRecord objects with aligned sequences.

• ref_seq – SeqRecord object containing the reference sequence to match against.

• ignore_chars – list of characters to exclude from mismatch counts.

• score_dict – optional dictionary of alignment scores as {(char1, char2): score}.

Returns error rate for the set.

Return type float

presto.Sequence.checkSeqEqual(seq1, seq2, ignore_chars={'-', 'n', '.', 'N'})

Determine if two sequences are equal, excluding missing positions

Parameters

• seq1 – SeqRecord object

• seq2 – SeqRecord object

• ignore_chars – Set of characters to ignore

Returns True if the sequences are equal

Return type bool
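A minimal sketch of equality that skips missing positions, using plain strings rather than SeqRecord objects. The function name is hypothetical, and treating sequences of different length as unequal is an assumption of this sketch.

```python
def seqs_equal(seq1, seq2, ignore_chars=frozenset("nN-.")):
    """Compare two strings position by position, skipping ignored characters."""
    if len(seq1) != len(seq2):
        return False  # assumption: length mismatch means not equal
    return all(a == b for a, b in zip(seq1, seq2)
               if a not in ignore_chars and b not in ignore_chars)


same = seqs_equal("ACGTN", "ACNTA")
different = seqs_equal("ACGT", "ACGA")
```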

presto.Sequence.compilePrimers(primers)

Translates IUPAC Ambiguous Nucleotide characters to regular expressions and compiles them

Parameters primers – Dictionary of sequences to translate

Returns Dictionary of compiled regular expressions

Return type dict
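The translation from IUPAC ambiguity codes to regular expressions might look like the sketch below. The mapping table, the function name and the primer names are assumptions made for illustration; in particular, mapping N to `[ACGT]` is a choice of this sketch.

```python
import re

# IUPAC ambiguity codes expressed as regular expression character classes
IUPAC_REGEX = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def compile_primers(primers):
    """Map each primer name to a compiled regular expression."""
    return {name: re.compile("".join(IUPAC_REGEX[c] for c in seq.upper()))
            for name, seq in primers.items()}


# Hypothetical primer names and sequences
compiled = compile_primers({"VH1": "ACGRT", "VH2": "TTYNA"})
hit = compiled["VH1"].match("ACGATTT") is not None
miss = compiled["VH1"].match("ACGCTTT") is not None
```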

presto.Sequence.deleteSeqPositions(seq, positions)

Deletes a list of positions from a SeqRecord

Parameters

• seq – SeqRecord object

• positions – Set of positions (indices) to delete

Returns Modified SeqRecord with the specified positions removed

Return type SeqRecord

presto.Sequence.findGapPositions(seq_list, max_gap, gap_chars={'-', '.'})

Finds positions in a set of aligned sequences with a high number of gap characters.

Parameters

• seq_list – List of SeqRecord objects with aligned sequences

• max_gap – Float of the maximum gap frequency to consider a position as non-gapped

• gap_chars – Set of characters to consider as gaps

Returns Positions (indices) with gap frequency greater than max_gap

Return type list
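Finding heavily gapped columns and then deleting them can be sketched as follows, using plain aligned strings instead of SeqRecord objects. The function names are hypothetical and this is an illustration rather than the library's code.

```python
def find_gap_positions(seqs, max_gap, gap_chars=frozenset("-.")):
    """Indices of alignment columns whose gap frequency exceeds max_gap."""
    n = len(seqs)
    return [i for i, column in enumerate(zip(*seqs))
            if sum(c in gap_chars for c in column) / n > max_gap]


def delete_positions(seq, positions):
    """Copy of seq with the given indices removed."""
    drop = set(positions)
    return "".join(c for i, c in enumerate(seq) if i not in drop)


aligned = ["AC-GT", "A--GT", "ACTG."]
gapped = find_gap_positions(aligned, max_gap=0.5)
trimmed = [delete_positions(s, gapped) for s in aligned]
```

Only the third column (index 2) is gapped in more than half of the sequences, so it is the only one removed.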




presto.Sequence.frequencyConsensus(seq_list, min_freq=0.6, ignore_chars={'-', 'n', '.', 'N'})

Builds a consensus sequence from a set of sequences

Parameters

• seq_list – List of SeqRecord objects

• min_freq – Frequency cutoff to assign a base

• ignore_chars – Set of characters to exclude when building a consensus sequence

Returns Consensus SeqRecord object
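A frequency-based consensus can be sketched on plain aligned strings as below. Emitting N when no base reaches the cutoff, and excluding ignored characters from the denominator, are assumptions of this sketch; the function name is hypothetical.

```python
from collections import Counter


def frequency_consensus(seqs, min_freq=0.6, ignore_chars=frozenset("nN-.")):
    """Per column, keep the most common base if its frequency reaches min_freq."""
    consensus = []
    for column in zip(*seqs):
        counts = Counter(c for c in column if c not in ignore_chars)
        total = sum(counts.values())
        if total == 0:
            consensus.append("N")  # no informative characters in this column
            continue
        base, count = counts.most_common(1)[0]
        consensus.append(base if count / total >= min_freq else "N")
    return "".join(consensus)


cons = frequency_consensus(["ACGT", "ACGA", "ANGC", "ACTT"])
```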


presto.Sequence.getAAScoreDict(mask_score=None, gap_score=None)

Generates a score dictionary

Parameters

• mask_score – Tuple of length two defining scores for all matches against an X character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity

• gap_score – Tuple of length two defining score for all matches against a [-, .] character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity

Returns Score dictionary with keys (char1, char2) mapping to scores

Return type dict

presto.Sequence.getDNAScoreDict(mask_score=None, gap_score=None)

Generates a score dictionary

Parameters

• mask_score – Tuple of length two defining scores for all matches against an N character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity

• gap_score – Tuple of length two defining score for all matches against a [-, .] character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity

Returns Score dictionary with keys (char1, char2) mapping to scores

Return type dict
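To make the precedence rule for mask_score and gap_score concrete, here is a self-contained sketch of how such a dictionary could be built. It assumes that two IUPAC codes match when their base sets overlap and that a gap only matches an identical gap character when gap_score is None; both are assumptions of this sketch, as are the names `score_dna` and `dna_score_dict`.

```python
from itertools import product

# Bases represented by each IUPAC nucleotide code
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
GAPS = "-."


def score_dna(a, b, mask_score=None, gap_score=None):
    """Score one character pair; overrides for character a take precedence."""
    if mask_score is not None:
        if a == "N":
            return mask_score[0]
        if b == "N":
            return mask_score[1]
    if gap_score is not None:
        if a in GAPS:
            return gap_score[0]
        if b in GAPS:
            return gap_score[1]
    if a in GAPS or b in GAPS:
        return 1 if a == b else 0  # assumption: gaps match only themselves
    return 1 if set(IUPAC_DNA[a]) & set(IUPAC_DNA[b]) else 0


def dna_score_dict(mask_score=None, gap_score=None):
    """Dictionary of {(char1, char2): score} over the IUPAC and gap characters."""
    chars = list(IUPAC_DNA) + list(GAPS)
    return {(a, b): score_dna(a, b, mask_score, gap_score)
            for a, b in product(chars, repeat=2)}


symmetric = dna_score_dict()
asymmetric = dna_score_dict(mask_score=(0, 1))
```

With `mask_score=(0, 1)`, an N in the first sequence always scores 0 while an N in the second sequence scores 1, and the first position wins when both are N.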

presto.Sequence.indexSeqSets(seq_dict, field='BARCODE', delimiter=('|', '=', ','))

Identifies sets of sequences with the same ID field

Parameters

• seq_dict – a dictionary index of sequences returned from SeqIO.index()

• field – the annotation field containing set IDs

• delimiter – a tuple of delimiters for (fields, values, value lists)

Returns Dictionary mapping set name to a list of record names

Return type dict
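Grouping records by an annotation field can be sketched with plain header strings. The header layout `ID|FIELD=value|...` follows the delimiter tuple above; the function names and example headers are hypothetical, and the sketch operates on a list of headers rather than a SeqIO index.

```python
from collections import defaultdict


def parse_annotation(header, delimiter=("|", "=", ",")):
    """Split a header of the form ID|FIELD=value|... into a dictionary."""
    fields = header.split(delimiter[0])
    ann = {"ID": fields[0]}
    for field in fields[1:]:
        if delimiter[1] in field:
            key, value = field.split(delimiter[1], 1)
            ann[key] = value
    return ann


def index_sets(headers, field="BARCODE", delimiter=("|", "=", ",")):
    """Map each value of the given field to the headers that carry it."""
    sets = defaultdict(list)
    for header in headers:
        ann = parse_annotation(header, delimiter)
        if field in ann:
            sets[ann[field]].append(header)
    return dict(sets)


headers = ["SEQ1|BARCODE=AAAA|PRIMER=IGHM", "SEQ2|BARCODE=CCCC", "SEQ3|BARCODE=AAAA"]
barcode_sets = index_sets(headers)
```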

presto.Sequence.qualityConsensus(seq_list, min_qual=20, min_freq=0.6, dependent=False, ignore_chars={'-', 'n', '.', 'N'})

Builds a consensus sequence from a set of sequences






Parameters

• seq_list – List of SeqRecord objects

• min_qual – Quality cutoff to assign a base

• min_freq – Frequency cutoff to assign a base

• dependent – If False assume sequences are independent for quality calculation

• ignore_chars – Set of characters to exclude when building a consensus sequence

Returns Consensus SeqRecord object


presto.Sequence.reverseComplement(seq)

Takes the reverse complement of a sequence

Parameters seq – a SeqRecord object, Seq object or string to reverse complement

Returns Object of the same type as the input with the reverse complement sequence

Return type Seq
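For plain strings, a reverse complement that also handles IUPAC ambiguity codes can be sketched with a translation table. The function name is hypothetical, and the sketch does not handle SeqRecord or Seq inputs.

```python
# Complement table for IUPAC nucleotide codes, upper and lower case
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVNacgtryswkmbdhvn",
                           "TGCAYRSWMKVHDBNtgcayrswmkvhdbn")


def reverse_complement(seq):
    """Complement each character, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


rc = reverse_complement("ACGGT")
rc_ambiguous = reverse_complement("AARN")
```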

presto.Sequence.scoreAA(a, b, mask_score=None, gap_score=None)

Returns the score for a pair of IUPAC Extended Protein characters

Parameters

• a – First character

• b – Second character

• mask_score – Tuple of length two defining scores for all matches against an X character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity

• gap_score – Tuple of length two defining score for all matches against a gap (-, .) character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity

Returns Score for the character pair

Return type int

presto.Sequence.scoreDNA(a, b, mask_score=None, gap_score=None)

Returns the score for a pair of IUPAC Ambiguous Nucleotide characters

Parameters

• a – First character

• b – Second character

• mask_score – Tuple of length two defining scores for all matches against an N character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity

• gap_score – Tuple of length two defining score for all matches against a gap (-, .) character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity

Returns Score for the character pair

Return type int





presto.Sequence.scoreSeqPair(seq1, seq2, ignore_chars=set(), score_dict=getDNAScoreDict())

Determine the error rate for a pair of sequences

Parameters

• seq1 – SeqRecord object

• seq2 – SeqRecord object



• ignore_chars – Set of characters to ignore when scoring and counting the weight

• score_dict – Optional dictionary of alignment scores

Returns Tuple of the (score, minimum weight, error rate) for the pair of sequences

Return type Tuple
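The (score, minimum weight, error rate) tuple can be illustrated with plain strings. In this sketch the weight is the count of non-ignored characters, the error rate is one minus score over weight, and a weight of zero yields an error rate of 1.0; these choices and the function names are assumptions rather than a statement of the library's behavior.

```python
def weight_seq(seq, ignore_chars=frozenset()):
    """Number of characters in seq that are not ignored."""
    return sum(1 for c in seq if c not in ignore_chars)


def score_seq_pair(seq1, seq2, ignore_chars=frozenset(), score_dict=None):
    """Return (score, minimum weight, error rate) for two aligned strings."""
    score = 0
    for a, b in zip(seq1, seq2):
        if a in ignore_chars or b in ignore_chars:
            continue
        if score_dict is None:
            score += 1 if a == b else 0  # plain identity scoring
        else:
            score += score_dict[(a, b)]
    weight = min(weight_seq(seq1, ignore_chars), weight_seq(seq2, ignore_chars))
    error = 1.0 - score / weight if weight > 0 else 1.0
    return score, weight, error


result = score_seq_pair("ACGTN", "ACGAN", ignore_chars={"N"})
```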

presto.Sequence.subsetSeqIndex(seq_dict, field, values, delimiter=('|', '=', ','))

Subsets a sequence set by annotation value

Parameters

• seq_dict – Dictionary index of sequences returned from SeqIO.index()

• field – Annotation field to select keys by

• values – List of annotation values that define the retained keys

• delimiter – Tuple of delimiters for (annotations, field/values, value lists)

Returns List of keys

Return type list

presto.Sequence.subsetSeqSet(seq_iter, field, values, delimiter=('|', '=', ','))

Subsets a sequence set by annotation value

Parameters

• seq_iter – Iterator or list of SeqRecord objects

• field – Annotation field to select by

• values – List of annotation values that define the retained sequences

• delimiter – Tuple of delimiters for (annotations, field/values, value lists)

Returns Modified list of SeqRecord objects

Return type list
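Subsetting by annotation value can be sketched on header strings of the form `ID|FIELD=value|...`. Matching a record when any entry of a delimited value list is among the requested values is an assumption of this sketch; the function name and example headers are hypothetical.

```python
def subset_by_annotation(headers, field, values, delimiter=("|", "=", ",")):
    """Keep headers whose annotation field contains one of the given values."""
    keep = set(values)
    selected = []
    for header in headers:
        parts = header.split(delimiter[0])[1:]
        ann = dict(p.split(delimiter[1], 1) for p in parts if delimiter[1] in p)
        if field not in ann:
            continue
        # assumption: a value list matches if any of its entries is requested
        if any(v in keep for v in ann[field].split(delimiter[2])):
            selected.append(header)
    return selected


headers = ["S1|PRIMER=IGHM", "S2|PRIMER=IGHG", "S3|PRIMER=IGHM,IGHA", "S4|BARCODE=AAAA"]
igm = subset_by_annotation(headers, "PRIMER", ["IGHM"])
```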

presto.Sequence.translateAmbigDNA(key)

Translates IUPAC Ambiguous Nucleotide characters to or from character sets

Parameters key – String or re.search object containing the character set to translate

Returns Character translation

Return type str

presto.Sequence.weightSeq(seq, ignore_chars=set())

Returns the length of a sequence excluding ignored characters

Parameters

• seq – SeqRecord or Seq object

• ignore_chars – Set of characters to ignore when counting sequence length

Returns Sum of the character scores for the sequence






Return type int



CHAPTER 9

Release Notes

9.1 Version 0.5.3: February 14, 2017

License changed to Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

AssemblePairs:

• Changed the behavior of the --failed argument so that failed output is in the same orientation as the input sequences; that is, the --rc argument is ignored for failed output.

• Added the sequential subcommand which will first attempt de novo assembly (align subcommand) followed by reference guided assembly (reference subcommand) if de novo assembly fails.

• Added blastn compatibility to reference subcommand.

• Added the option --aligner to the reference subcommand to allow use of either blastn or usearch for performing the local alignment. Defaults to the usearch algorithm used in previous releases.

• Added the option --dbexec to the reference subcommand to allow specification of the reference database build tool (eg, makeblastdb).

• Changed masking behavior to none and word length to 9 in reference subcommand when using usearch as the aligner.

• Internal modifications to the reference subcommand to rebuild the database before alignments for performance reasons.

• Fixed a deprecation warning appearing with newer versions of numpy.

BuildConsensus:

• Fixed a bug in the read group error rate calculation wherein either a consensus sequence or read group that was completely N characters would cause the program to exit with a division by zero error. Now, such non-informative read groups will be assigned an error rate of 1.0.

ClusterSets:

• Added vsearch compatibility.



• Fixed a bug wherein sets containing empty sequences were being fed to usearch, rather than automatically failed, which would cause usearch v8 to hang indefinitely.

• Fixed an incompatibility with usearch v9 due to changes in the way usearch outputs sequence labels.

• Changed masking behavior of usearch to none.

• Changed how gaps are handled before passing sequences to usearch. Gaps are now masked (with Ns) for clustering, instead of removed.

EstimateError:

• Fixed a fatal error with newer versions of pandas.

SplitSeq:

• Added the select subcommand, which allows filtering of sequences based on annotation value matches or mismatches.

• Altered the behavior of the -u argument for both the sample and samplepair subcommands. If -u is specified, sampling is performed as in previous versions wherein samples will be drawn from only fields with the specified annotation values up to n total reads. However, if -u is not specified with -f, repeated sampling will now be performed for each unique annotation value in the specified field, generating output with up to n reads per unique annotation value.
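The two sampling behaviors described for -u can be contrasted with a short sketch. It is an illustration only: reads are modeled as (id, annotation dictionary) tuples, and the function name, field name and values are hypothetical rather than taken from SplitSeq.

```python
import random
from collections import defaultdict


def sample_reads(records, n, field, values=None, seed=None):
    """Sample reads by annotation, with or without an explicit value list."""
    rng = random.Random(seed)
    if values is not None:
        # Value list given: pool the matching reads and draw up to n in total
        pool = [r for r in records if r[1].get(field) in values]
        return rng.sample(pool, min(n, len(pool)))
    # No value list: draw up to n reads for each unique value of the field
    groups = defaultdict(list)
    for r in records:
        if field in r[1]:
            groups[r[1][field]].append(r)
    sampled = []
    for value in sorted(groups):
        members = groups[value]
        sampled.extend(rng.sample(members, min(n, len(members))))
    return sampled


# Twelve hypothetical reads: four annotated IGHG, eight annotated IGHM
reads = [("r%d" % i, {"PRIMER": "IGHM" if i % 3 else "IGHG"}) for i in range(12)]
pooled = sample_reads(reads, n=2, field="PRIMER", values={"IGHM"}, seed=1)
per_value = sample_reads(reads, n=2, field="PRIMER", seed=1)
```

Here `pooled` holds two IGHM reads in total, while `per_value` holds two reads for each of the two annotation values.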

9.2 Version 0.5.2: March 8, 2016

Fixed a bug with installation on Windows due to old file paths lingering in presto.egg-info/SOURCES.txt.

Improvements to commandline usage help messages.

Updated license from CC BY-NC-SA 3.0 to CC BY-NC-SA 4.0.

AssemblePairs:

• Added the flag --fill to the reference subcommand to allow insertion of the reference sequence into the non-overlapping region of assembled sequences. Use caution when using this flag, as this may lead to chimeric sequences.

• Changed default --minlen to 8 in align subcommand.

9.3 Version 0.5.1: December 4, 2015

ClusterSets:

• Fixed bug wherein --failed flag did not work.

9.4 Version 0.5.0: September 7, 2015

Conversion to a proper Python package which uses pip and setuptools for installation.

The package now requires Python 3.4. Python 2.7 is no longer supported.

The required dependency versions have been bumped to numpy 1.8, scipy 0.14, pandas 0.15, and biopython 1.65.

IgCore:



• Divided IgCore functionality into the separate modules: Annotation, Commandline, Defaults, IO, Multiprocessing and Sequence.

9.5 Version 0.4.8: September 7, 2015

Added support for additional input FASTA (.fna, .fa), FASTQ (.fq) and tab-delimited (.tsv) file extensions.

ParseHeaders:

• Fixed a bug in the rename subcommand wherein renaming to an existing field deleted the old annotation, but did not merge the renamed annotation into the existing field.

• Added the copy subcommand which will copy annotations into new field names or merge the annotations of existing fields.

• Added the --act argument to the copy and rename subcommands allowing collapse following the copy or rename operation.

• Added a commandline check to ensure that the -f, -k and --act arguments contain the same number of fields for both the rename and copy subcommands.

9.6 Version 0.4.7: June 5, 2015

IgCore:

• Modified scoring functions to permit asymmetrical scores for N and gap characters.

AssemblePairs:

• Added support for SRA style coordinate information where the read number has been appended to the spot number.

• Altered scoring so gap characters are counted as mismatches in the error rate and identity calculations.

BuildConsensus:

• Altered scoring so gap characters are counted as mismatches in the diversity and error rate calculations.

ConvertHeaders:

• Added support for SRA style sequence headers where the read number has been appended to the spot number; eg, output from fastq-dump -I --split-files file.sra.

ClusterSets:

• Added missing OUTPUT console log field.

• Changed --bf and --cf arguments to -f and -k, respectively.

MaskPrimers:

• Altered scoring behavior for N characters such that Ns in the input sequence are always counted as a mismatch, while Ns in the primer sequence are counted as a match, with priority given to the input sequence score.

• Added --gap argument to the align subcommand which allows users to specify the gap open and gap extension penalties for aligning primers. Note: gap penalties reduce the match count for purposes of calculating ERROR.

PairSeq:

• Added support for SRA style coordinate information where the read number has been appended to the spot number.


9.7 Version 0.4.6: May 13, 2015

BuildConsensus:

• Changed --maxmiss argument to --maxgap and altered the behavior to only perform deletion of positions based on gap characters (only “-” or “.” and not “N” characters).

• Added an error rate (--maxerror) calculation based on mismatches from consensus. The --maxerror argument is mutually exclusive with the --maxdiv argument and provides similar functionality. However, the calculations are not equivalent, and --maxerror should be considerably faster than --maxdiv.

• Added exclusion of positions from the error rate calculation that are deleted due to exceeding the --maxgap threshold.

• Fixed misalignment of consensus sequence against input sequences when positions are deleted due to exceeding the --maxgap threshold.

ClusterSets:

• New script to subdivide read groups defined by a barcode field (eg, UID barcodes) into clusters within each read group.

ConvertHeaders:

• New script to handle conversion of different sequence description formats to the pRESTO format.

FilterSeq:

• Added count of masked characters to log output of maskqual subcommand.

• Changed repeats subcommand log field REPEAT to REPEATS.

PairSeq:

• Changed -f argument to --1f argument.

• Added --2f argument to copy file 2 annotations to file 1.

ParseHeaders:

• Moved convert subcommand to the generic subcommand of the new ConvertHeaders script and modified the conversion behavior.

9.8 Version 0.4.5: March 20, 2015

Added details to the usage documentation for each tool which describes both the output files and annotation fields.

Renamed --clean argument to --failed argument with opposite behavior, such that the default behavior of all scripts is now clean output.

IgCore:

• Features added for Change-O compatibility.

• Features added for PairSeq performance improvements.

• Added custom help formatter.

• Modifications to internals of multiprocessing code.

• Fixed a few typos in error messages.

AssemblePairs:

• Added reference subcommand which uses V-region germline alignments from ublast to assemble paired-ends.



• Removed mate-pair matching operation to increase performance. Now requires both input files to contain matched and uniformly ordered reads. If files are not synchronized then PairSeq must be run first. AssemblePairs will check that coordinate info matches and error if the files are not synchronized. Unpaired reads are no longer output.

• Added support for cases where one mate pair is the subsequence of the other.

• Added --scanrev argument to allow for head sequence to overhang end of tail.

• Removed truncated (quick) error calculation in align subcommand.

• Changed default values of the --maxerror and --alpha arguments of the align subcommand to better tuned parameters.

• Changed internal selection of top scoring alignment to use Z-score approximation rather than a combination of error rate and binomial mid-p value.

• Internal changes to multiprocessing structure.

• Changed inserted gap character from - to N in join subcommand for better compatibility with the behavior of IMGT/HighV-QUEST.

• Changed PVAL log field to PVALUE.

• Changed HEADSEQ and TAILSEQ log fields to SEQ1 and SEQ2.

• Changed HEADFIELDS and TAILFIELDS log fields to FIELDS1 and FIELDS2.

• Changed precision of ERROR and PVALUE log fields.

• Added more verbose logging.

BuildConsensus:

• Fixed bug where low quality positions were not being masked in single sequence barcode groups.

• Added copy field (--cf) and copy action (--act) arguments to generate consensus annotations for barcode read groups.

• Changed maximum consensus quality score from 93 to 90.

CollapseSeq:

• Added --keep argument to allow retention of sequences with high missing character counts in unique sequence output file.

• Removed case insensitivity for performance reasons. Now requires all sequences to have matching case.

• Removed first and last from --act choices to avoid unexpected behavior.

MaskPrimers:

• Changed behavior of N characters in primer identification. Ns now count as a match against any character, rather than a mismatch.

• Changed behavior of mask mode such that positions masked with Ns are now assigned quality scores of 0, rather than retaining their previous scores.

• Fixed a bug with the align subcommand where deletions within the input sequence (gaps in the alignment) were causing an incorrect barcode start position.

PairSeq:

• Performance improvements. The tool should now be considerably faster on very large files.

• Specifying the --failed argument to request output of sequences which do not have a mate pair will increase run time and memory usage.



ParseHeaders:

• Added ‘cat’ action to collapse subcommand which concatenates strings into a single annotation.

SplitSeq:

• Removed --clean (and --failed) flag from all subcommands.

• Added progress updates to sample and samplepair subcommands.

• Performance improvements to samplepair subcommand.

9.9 Version 0.4.4: June 10, 2014

SplitSeq:

• Removed a linux-specific dependency, allowing SplitSeq to work on Windows.

9.10 Version 0.4.3: April 7, 2014

CollapseSeq:

• Fixed bug that occurs with Python 2.7.5 on OS X.

SplitSeq:

• Fixed bug in samplepairs subcommand that occurs with Python 2.7.5 on OS X.

9.11 Version 0.4.2: March 20, 2014

Increased verbosity of exception reporting.

IgCore:

• Updates to consensus functions to support changes to BuildConsensus.

AssemblePairs:

• Set default alpha to 0.01.

BuildConsensus:

• Added support for --freq value parameter to quality consensus method and set default value to 0.6.

• Fixed a bug in the frequency consensus method where missing values were contributing to the total character count at each position.

• Added the parameter --maxmiss value which provides a cut-off for removal of positions with too many N or gap characters.

MaskPrimers:

• Renamed the --reverse parameter to --revpr.

SplitSeq:

• Removed convert subcommand.



9.12 Version 0.4.1: January 27, 2014

Changes to the internals of multiple tools to provide support for multiprocessing in Windows environments.

Changes to the internals of multiple tools to provide clean exit of child processes upon kill signal or exception in sibling process.

Fixed unexpected behavior of --outname and --log arguments with multiple input files.

IgCore:

• Added reporting of unknown exceptions when reading sequence files.

• Fixed scoring of lowercase sequences.

AlignSets:

• Fixed a typo in the log output.

BuildConsensus:

• Fixed a typo in the log output.

EstimateError:

• Fixed bug where tool would improperly exit if no sets passed threshold criteria.

• Fixed typo in console output.

MaskPrimers:

• Added trim mode which will cut the region before primer alignment, but leave primer region unmodified.

• Fixed a bug with lowercase sequence data.

• Fixed bug in the console and log output.

• Added support for primer matching when setting --maxerr 1.0.

ParseHeaders:

• Added count of sequences without any valid fields (FAIL) to console output.

ParseLog:

• Added count of records without any valid fields (FAIL) to console output.

SplitSeq:

• Fixed typo in console output of samplepair subcommand.

• Added increase of the open file limit to the group subcommand to allow for a large number of groups.

9.13 Version 0.4.0: September 30, 2013


Minor name changes were made to multiple scripts, functions, parameters, and output files.

AlignSets, AssemblePairs, BuildConsensus, EstimateError, FilterSeq, and MaskPrimers are now multithreaded. The number of simultaneous processes may be specified using --nproc value. Note this means file ordering is no longer preserved between the input and output sequence files.

Performance improvements were made to several tools.

The universal --verbose parameter was replaced with --log file_name which specifies a log file for verbose output, and disables verbose logging if not specified.

The report of input parameters and sequence counts is now separate from the log and is always printed to standard output.

Added a progress bar to the standard output of most tools.

Added a universal --outname file_prefix parameter which changes the leading portion of the output file name. If not specified, the current file name is used (excluding the file extension, as per the previous behavior).

Added a universal --clean parameter which if specified forces the tool not to create an output file of sequences which failed processing.

IgCore:

• Changes to parameters and internals of multiple functions.

• Added functions to support multithreading for single-end reads, paired-end reads, and barcode sets.

• Added safe annotation field renaming.

• Added progress bar, logging and output file name conversion support.

• Moved reusable AssemblePairs, BuildConsensus, PairSeq, and SplitSeq operations into IgCore.

AssemblePairs:

• Coordinate information is now specified by a coordinate type, rather than a delimiter, using the --coord header_type parameter, where the header type may be one of illumina, solexa, sra, 454, presto.

CollapseSeq:

• Sequences with a missing character count exceeding the user limit defined by -n maximum_missing_count are now exported to a separate collapse-undetermined output file, rather than included in the collapse-unique sequence output.

EstimateError:

• Now outputs error estimations for positions, quality scores, nucleotide pairs, and annotation sets.

• Machine reported quality scores and empirical quality scores have been added to all output tables.

FilterSeq:

• Added length subcommand to filter sequences by minimum length.

PairSeq:

• Coordinate information has been redefined as per AssemblePairs.

ParseHeaders:

• Added new subcommand convert which attempts to reformat sequence headers into the pRESTO format.

• The rename subcommand will now append entries if the new field name already exists in the sequence header, rather than replace the entry.

9.14 Version 0.3 (prerelease 6): August 13, 2013

Toolkit is now dependent upon pandas 0.12 for the estimateError tool.

alignSets:

• Changed MUSCLE execution to faster settings (-diags, -maxiters 2).

filterQuality:

• Added repeat subcommand to filter sequences with -n (value) repetitions of a single character.



• Changed -n parameter of ambig subcommand from fractional value to a raw count.

estimateError:

• New tool which estimates error of sequence sets by comparison to a consensus.

maskPrimers:

• Bug fixes to alignment position calculation of align subcommand when primer alignment begins before start of sequence.

• Removed --ann parameter.

9.15 Version 0.3 (prerelease 5): August 7, 2013

License changed to Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

IgPipeline Core:

• Bug fixes to diversity calculation.

• Added support for files where all sequences do not share the same annotation fields.

• Added support for alternate scoring of gap and N-valued nucleotides.

alignSets:

• Added --mode parameter with options of pad and cut to specify whether to extend or trim read groups to the same start position.

• Fixed intermittent ‘muscle’ subcommand stdout pipe deadlock when executing MUSCLE.

assemblePairs:

• Added join subcommand to support library preps where paired-end reads do not overlap.

• Speed improvements to p-value calculations.

buildConsensus:

• --div parameter converted to --maxdiv value to allow filtering of read groups by diversity.

• Bug fixes to nucleotide frequency consensus method.

• -q parameter renamed to --qual.

collapseSequences:

• Added support for files where all sequences do not share the same annotation fields.

splitSeqFile:

• samplepair subcommand added to allow random sampling from paired-end file sets.

• The behavior of the -c parameter of the sample and samplepair subcommands changed to allow multiple samplings with the same command.

9.16 Version 0.3 (prerelease 4): May 18, 2013

Initial public prerelease




CHAPTER 10

Contact

If you have questions you can email Steven Kleinstein and/or Jason Vander Heiden.

If you’ve discovered a bug or have a feature request, you can create an issue on Bitbucket using the Issue Tracker.


http://bitbucket.org/kleinstein/presto/issues



CHAPTER 11

Citation

To cite pRESTO in publications please use:

Vander Heiden JA*, Yaari G*, Uduman M, Stern JNH, O’Connor KC, Hafler DA, Vigneault F, Kleinstein SH. pRESTO: a toolkit for processing high-throughput sequencing raw reads of lymphocyte receptor repertoires. Bioinformatics 2014; doi: 10.1093/bioinformatics/btu138


CHAPTER 12

License

This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license


https://creativecommons.org/licenses/by-sa/4.0/



CHAPTER 13

Indices

• genindex

• modindex


Python Module Index

p

presto.Annotation

presto.Applications

presto.Commandline

presto.IO

presto.Multiprocessing

presto.Sequence


FilterSeq-repeats command line option, 62

–keepmiss
CollapseSeq command line option, 52

–log <log_file>
AlignSets-muscle command line option, 40AlignSets-offset command line option, 41AssemblePairs-align command line option, 43AssemblePairs-join command line option, 44AssemblePairs-reference command line option, 45AssemblePairs-sequential command line option, 47BuildConsensus command line option, 49ClusterSets command line option, 51CollapseSeq command line option, 52EstimateError command line option, 57FilterSeq-length command line option, 59FilterSeq-maskqual command line option, 60FilterSeq-missing command line option, 61FilterSeq-quality command line option, 61FilterSeq-repeats command line option, 62FilterSeq-trimqual command line option, 63MaskPrimers-align command line option, 64MaskPrimers-score command line option, 65

–maxdiv <max_diversity>BuildConsensus command line option, 50EstimateError command line option, 58

–maxerror <max_error>AssemblePairs-align command line option, 43AssemblePairs-sequential command line option, 47BuildConsensus command line option, 50MaskPrimers-align command line option, 65MaskPrimers-score command line option, 66

–maxf <max_field>CollapseSeq command line option, 52

–maxgap <max_gap>BuildConsensus command line option, 49

–maxhits <max_hits>AssemblePairs-reference command line option, 46AssemblePairs-sequential command line option, 48

–maxlen <max_len>AssemblePairs-align command line option, 43AssemblePairs-sequential command line option, 48MaskPrimers-align command line option, 65

–minf <min_field>CollapseSeq command line option, 52

–minident <min_ident>AssemblePairs-reference command line option, 46AssemblePairs-sequential command line option, 48

–minlen <min_len>AssemblePairs-align command line option, 43AssemblePairs-sequential command line option, 47

–missingFilterSeq-repeats command line option, 62

–mode {cut,mask,trim,tag}
MaskPrimers-align command line option, 64
MaskPrimers-score command line option, 66

–mode {freq,qual}
EstimateError command line option, 58

–mode {pad,cut}
AlignSets-offset command line option, 41

–not
SplitSeq-select command line option, 78

–nproc <nproc>
AlignSets-muscle command line option, 40AlignSets-offset command line option, 41AssemblePairs-align command line option, 43AssemblePairs-join command line option, 44AssemblePairs-reference command line option, 45AssemblePairs-sequential command line option, 47BuildConsensus command line option, 49ClusterSets command line option, 51EstimateError command line option, 57FilterSeq-length command line option, 59FilterSeq-maskqual command line option, 60FilterSeq-missing command line option, 61FilterSeq-quality command line option, 61FilterSeq-repeats command line option, 62FilterSeq-trimqual command line option, 63MaskPrimers-align command line option, 64MaskPrimers-score command line option, 66

–numSplitSeq-sort command line option, 78

–num <threshold>SplitSeq-group command line option, 75

–outdir <out_dir>AlignSets-muscle command line option, 40AlignSets-offset command line option, 41AlignSets-table command line option, 42AssemblePairs-align command line option, 43AssemblePairs-join command line option, 44AssemblePairs-reference command line option, 45AssemblePairs-sequential command line option, 47BuildConsensus command line option, 49ClusterSets command line option, 51CollapseSeq command line option, 52ConvertHeaders-454 command line option, 54ConvertHeaders-genbank command line option, 54ConvertHeaders-generic command line option, 55ConvertHeaders-illumina command line option, 55ConvertHeaders-imgt command line option, 56ConvertHeaders-sra command line option, 57EstimateError command line option, 57FilterSeq-length command line option, 59FilterSeq-maskqual command line option, 60FilterSeq-missing command line option, 61FilterSeq-quality command line option, 61FilterSeq-repeats command line option, 62FilterSeq-trimqual command line option, 63MaskPrimers-align command line option, 64

MaskPrimers-score command line option, 66PairSeq command line option, 67ParseHeaders-add command line option, 68ParseHeaders-collapse command line option, 69ParseHeaders-copy command line option, 70ParseHeaders-delete command line option, 70ParseHeaders-expand command line option, 71ParseHeaders-rename command line option, 72ParseHeaders-table command line option, 73ParseLog command line option, 73SplitSeq-count command line option, 74SplitSeq-group command line option, 75SplitSeq-sample command line option, 76SplitSeq-samplepair command line option, 77SplitSeq-select command line option, 77SplitSeq-sort command line option, 78

–outname <out_name>AlignSets-muscle command line option, 40AlignSets-offset command line option, 41AlignSets-table command line option, 42AssemblePairs-align command line option, 43AssemblePairs-join command line option, 44AssemblePairs-reference command line option, 45AssemblePairs-sequential command line option, 47BuildConsensus command line option, 49ClusterSets command line option, 51CollapseSeq command line option, 52ConvertHeaders-454 command line option, 54ConvertHeaders-genbank command line option, 54ConvertHeaders-generic command line option, 55ConvertHeaders-illumina command line option, 56ConvertHeaders-imgt command line option, 56ConvertHeaders-sra command line option, 57EstimateError command line option, 57FilterSeq-length command line option, 59FilterSeq-maskqual command line option, 60FilterSeq-missing command line option, 61FilterSeq-quality command line option, 61FilterSeq-repeats command line option, 62FilterSeq-trimqual command line option, 63MaskPrimers-align command line option, 64MaskPrimers-score command line option, 66PairSeq command line option, 67ParseHeaders-add command line option, 68ParseHeaders-collapse command line option, 69ParseHeaders-copy command line option, 70ParseHeaders-delete command line option, 71ParseHeaders-expand command line option, 71ParseHeaders-rename command line option, 72ParseHeaders-table command line option, 73ParseLog command line option, 73SplitSeq-count command line option, 75SplitSeq-group command line option, 75SplitSeq-sample command line option, 76

SplitSeq-samplepair command line option, 77SplitSeq-select command line option, 77SplitSeq-sort command line option, 78

–pf <primer_field>AlignSets-offset command line option, 41BuildConsensus command line option, 49

–prcons <primer_freq>BuildConsensus command line option, 49

–rc {head,tail,both}AssemblePairs-align command line option, 43AssemblePairs-join command line option, 44AssemblePairs-reference command line option, 46AssemblePairs-sequential command line option, 47

–reverseAlignSets-table command line option, 42FilterSeq-trimqual command line option, 63

–revprMaskPrimers-align command line option, 65MaskPrimers-score command line option, 66

–scanrevAssemblePairs-align command line option, 43AssemblePairs-sequential command line option, 48

–sep <separator>ParseHeaders-expand command line option, 71

–simpleConvertHeaders-imgt command line option, 56

–skiprcMaskPrimers-align command line option, 65

–start <seq_start>ClusterSets command line option, 51

–start <start>MaskPrimers-score command line option, 66

–uf <uniq_fields>CollapseSeq command line option, 52

–versionAlignSets command line option, 39AlignSets-muscle command line option, 39AlignSets-offset command line option, 40AlignSets-table command line option, 41AssemblePairs command line option, 42AssemblePairs-align command line option, 43AssemblePairs-join command line option, 44AssemblePairs-reference command line option, 45AssemblePairs-sequential command line option, 47BuildConsensus command line option, 48ClusterSets command line option, 50CollapseSeq command line option, 52ConvertHeaders command line option, 53ConvertHeaders-454 command line option, 53ConvertHeaders-genbank command line option, 54ConvertHeaders-generic command line option, 55ConvertHeaders-illumina command line option, 55ConvertHeaders-imgt command line option, 56ConvertHeaders-sra command line option, 56

EstimateError command line option, 57FilterSeq command line option, 59FilterSeq-length command line option, 59FilterSeq-maskqual command line option, 60FilterSeq-missing command line option, 60FilterSeq-quality command line option, 61FilterSeq-repeats command line option, 62FilterSeq-trimqual command line option, 63MaskPrimers command line option, 63MaskPrimers-align command line option, 64MaskPrimers-score command line option, 65PairSeq command line option, 66ParseHeaders command line option, 67ParseHeaders-add command line option, 68ParseHeaders-collapse command line option, 68ParseHeaders-copy command line option, 69ParseHeaders-delete command line option, 70ParseHeaders-expand command line option, 71ParseHeaders-rename command line option, 72ParseHeaders-table command line option, 72ParseLog command line option, 73SplitSeq command line option, 74SplitSeq-count command line option, 74SplitSeq-group command line option, 75SplitSeq-sample command line option, 75SplitSeq-samplepair command line option, 76SplitSeq-select command line option, 77SplitSeq-sort command line option, 78

–win <window>FilterSeq-trimqual command line option, 63

-1 <seq_files_1>AssemblePairs-align command line option, 43AssemblePairs-join command line option, 44AssemblePairs-reference command line option, 45AssemblePairs-sequential command line option, 47PairSeq command line option, 66SplitSeq-samplepair command line option, 76

-2 <seq_files_2>AssemblePairs-align command line option, 43AssemblePairs-join command line option, 44AssemblePairs-reference command line option, 45AssemblePairs-sequential command line option, 47PairSeq command line option, 66SplitSeq-samplepair command line option, 76

-d <offset_table>AlignSets-offset command line option, 41

-f <barcode_field>ClusterSets command line option, 51

-f <field>SplitSeq-group command line option, 75SplitSeq-sample command line option, 76SplitSeq-samplepair command line option, 77SplitSeq-select command line option, 78SplitSeq-sort command line option, 78

-f <fields>ParseHeaders-add command line option, 68ParseHeaders-collapse command line option, 69ParseHeaders-copy command line option, 70ParseHeaders-delete command line option, 71ParseHeaders-expand command line option, 71ParseHeaders-rename command line option, 72ParseHeaders-table command line option, 73ParseLog command line option, 73

-f <set_field>EstimateError command line option, 58

-h, –helpAlignSets command line option, 39AlignSets-muscle command line option, 40AlignSets-offset command line option, 40AlignSets-table command line option, 41AssemblePairs command line option, 42AssemblePairs-align command line option, 43AssemblePairs-join command line option, 44AssemblePairs-reference command line option, 45AssemblePairs-sequential command line option, 47BuildConsensus command line option, 49ClusterSets command line option, 50CollapseSeq command line option, 52ConvertHeaders command line option, 53ConvertHeaders-454 command line option, 53ConvertHeaders-genbank command line option, 54ConvertHeaders-generic command line option, 55ConvertHeaders-illumina command line option, 55ConvertHeaders-imgt command line option, 56ConvertHeaders-sra command line option, 56EstimateError command line option, 57FilterSeq command line option, 59FilterSeq-length command line option, 59FilterSeq-maskqual command line option, 60FilterSeq-missing command line option, 60FilterSeq-quality command line option, 61FilterSeq-repeats command line option, 62FilterSeq-trimqual command line option, 63MaskPrimers command line option, 63MaskPrimers-align command line option, 64MaskPrimers-score command line option, 65PairSeq command line option, 66ParseHeaders command line option, 67ParseHeaders-add command line option, 68ParseHeaders-collapse command line option, 69ParseHeaders-copy command line option, 69ParseHeaders-delete command line option, 70ParseHeaders-expand command line option, 71ParseHeaders-rename command line option, 72ParseHeaders-table command line option, 73ParseLog command line option, 73SplitSeq command line option, 74SplitSeq-count command line option, 74

SplitSeq-group command line option, 75SplitSeq-sample command line option, 76SplitSeq-samplepair command line option, 76SplitSeq-select command line option, 77SplitSeq-sort command line option, 78

-k <cluster_field>ClusterSets command line option, 51

-k <names>ParseHeaders-copy command line option, 70ParseHeaders-rename command line option, 72

-l <record_files>ParseLog command line option, 73

-n <max_count>SplitSeq-count command line option, 75SplitSeq-sample command line option, 76SplitSeq-samplepair command line option, 77SplitSeq-sort command line option, 78

-n <max_missing>CollapseSeq command line option, 52FilterSeq-missing command line option, 61

-n <max_repeat>FilterSeq-repeats command line option, 62

-n <min_count>BuildConsensus command line option, 49EstimateError command line option, 58

-n <min_length>FilterSeq-length command line option, 59

-p <primer_file>AlignSets-table command line option, 42MaskPrimers-align command line option, 64MaskPrimers-score command line option, 66

-q <min_qual>BuildConsensus command line option, 49EstimateError command line option, 58FilterSeq-maskqual command line option, 60FilterSeq-quality command line option, 62FilterSeq-trimqual command line option, 63

-r <ref_file>AssemblePairs-reference command line option, 46AssemblePairs-sequential command line option, 48

-s <seq_files>AlignSets-muscle command line option, 40AlignSets-offset command line option, 40BuildConsensus command line option, 49ClusterSets command line option, 50CollapseSeq command line option, 52ConvertHeaders-454 command line option, 53ConvertHeaders-genbank command line option, 54ConvertHeaders-generic command line option, 55ConvertHeaders-illumina command line option, 55ConvertHeaders-imgt command line option, 56ConvertHeaders-sra command line option, 57EstimateError command line option, 57FilterSeq-length command line option, 59

FilterSeq-maskqual command line option, 60FilterSeq-missing command line option, 60FilterSeq-quality command line option, 61FilterSeq-repeats command line option, 62FilterSeq-trimqual command line option, 63MaskPrimers-align command line option, 64MaskPrimers-score command line option, 65ParseHeaders-add command line option, 68ParseHeaders-collapse command line option, 69ParseHeaders-copy command line option, 69ParseHeaders-delete command line option, 70ParseHeaders-expand command line option, 71ParseHeaders-rename command line option, 72ParseHeaders-table command line option, 73SplitSeq-count command line option, 74SplitSeq-group command line option, 75SplitSeq-sample command line option, 76SplitSeq-select command line option, 77SplitSeq-sort command line option, 78

-t <value_file>SplitSeq-select command line option, 78

-u <value_list>SplitSeq-select command line option, 78

-u <values>ParseHeaders-add command line option, 68SplitSeq-sample command line option, 76SplitSeq-samplepair command line option, 77

A

AlignSets command line option
–version, 39
-h, –help, 39

AlignSets-muscle command line option–bf <barcode_field>, 40–delim <delimiter>, 40–div, 40–exec <aligner_exec>, 40–failed, 40–fasta, 40–log <log_file>, 40–nproc <nproc>, 40–outdir <out_dir>, 40–outname <out_name>, 40–version, 39-h, –help, 40-s <seq_files>, 40

AlignSets-offset command line option–bf <barcode_field>, 41–delim <delimiter>, 41–div, 41–failed, 41–fasta, 41–log <log_file>, 41–mode {pad,cut}, 41

–nproc <nproc>, 41–outdir <out_dir>, 41–outname <out_name>, 41–pf <primer_field>, 41–version, 40-d <offset_table>, 41-h, –help, 40-s <seq_files>, 40

AlignSets-table command line option–delim <delimiter>, 41–exec <aligner_exec>, 42–failed, 41–outdir <out_dir>, 42–outname <out_name>, 42–reverse, 42–version, 41-h, –help, 41-p <primer_file>, 42

annotationConsensus() (in module presto.Annotation), 79

AssemblePairs command line option
–version, 42
-h, –help, 42

AssemblePairs-align command line option–1f <head_fields>, 43–2f <tail_fields>, 43–alpha <alpha>, 43–coord {illumina,solexa,sra,454,presto}, 43–delim <delimiter>, 43–failed, 43–fasta, 43–log <log_file>, 43–maxerror <max_error>, 43–maxlen <max_len>, 43–minlen <min_len>, 43–nproc <nproc>, 43–outdir <out_dir>, 43–outname <out_name>, 43–rc {head,tail,both}, 43–scanrev, 43–version, 43-1 <seq_files_1>, 43-2 <seq_files_2>, 43-h, –help, 43

AssemblePairs-join command line option–1f <head_fields>, 44–2f <tail_fields>, 45–coord {illumina,solexa,sra,454,presto}, 44–delim <delimiter>, 44–failed, 44–fasta, 44–gap <gap>, 45–log <log_file>, 44–nproc <nproc>, 44–outdir <out_dir>, 44

–outname <out_name>, 44–rc {head,tail,both}, 44–version, 44-1 <seq_files_1>, 44-2 <seq_files_2>, 44-h, –help, 44

AssemblePairs-reference command line option–1f <head_fields>, 46–2f <tail_fields>, 46–aligner {blastn,usearch}, 46–coord {illumina,solexa,sra,454,presto}, 46–dbexec <db_exec>, 46–delim <delimiter>, 45–evalue <evalue>, 46–exec <aligner_exec>, 46–failed, 45–fasta, 45–fill, 46–log <log_file>, 45–maxhits <max_hits>, 46–minident <min_ident>, 46–nproc <nproc>, 45–outdir <out_dir>, 45–outname <out_name>, 45–rc {head,tail,both}, 46–version, 45-1 <seq_files_1>, 45-2 <seq_files_2>, 45-h, –help, 45-r <ref_file>, 46

AssemblePairs-sequential command line option–1f <head_fields>, 47–2f <tail_fields>, 47–aligner {blastn,usearch}, 48–alpha <alpha>, 47–coord {illumina,solexa,sra,454,presto}, 47–dbexec <db_exec>, 48–delim <delimiter>, 47–evalue <evalue>, 48–exec <aligner_exec>, 48–failed, 47–fasta, 47–fill, 48–log <log_file>, 47–maxerror <max_error>, 47–maxhits <max_hits>, 48–maxlen <max_len>, 48–minident <min_ident>, 48–minlen <min_len>, 47–nproc <nproc>, 47–outdir <out_dir>, 47–outname <out_name>, 47–rc {head,tail,both}, 47–scanrev, 48

–version, 47-1 <seq_files_1>, 47-2 <seq_files_2>, 47-h, –help, 47-r <ref_file>, 48

B

BuildConsensus command line option
–act {min,max,sum,set,majority}, 49
–bf <barcode_field>, 49
–cf <copy_fields>, 49
–delim <delimiter>, 49
–dep, 50
–failed, 49
–fasta, 49
–freq <min_freq>, 49
–log <log_file>, 49
–maxdiv <max_diversity>, 50
–maxerror <max_error>, 50
–maxgap <max_gap>, 49
–nproc <nproc>, 49
–outdir <out_dir>, 49
–outname <out_name>, 49
–pf <primer_field>, 49
–prcons <primer_freq>, 49
–version, 48
-h, –help, 49
-n <min_count>, 49
-q <min_qual>, 49
-s <seq_files>, 49

C

calculateDiversity() (in module presto.Sequence), 87
calculateSetError() (in module presto.Sequence), 87
checkSeqEqual() (in module presto.Sequence), 88

ClusterSets command line option
–delim <delimiter>, 51
–end <seq_end>, 51
–exec <cluster_exec>, 51
–failed, 50
–fasta, 50
–id <ident>, 51
–log <log_file>, 51
–nproc <nproc>, 51
–outdir <out_dir>, 51
–outname <out_name>, 51
–start <seq_start>, 51
–version, 50
-f <barcode_field>, 51
-h, –help, 50
-k <cluster_field>, 51
-s <seq_files>, 50

collapseAnnotation() (in module presto.Annotation), 79

CollapseSeq command line option
–act {min,max,sum,set}, 52
–cf <copy_fields>, 52
–delim <delimiter>, 52
–failed, 52
–fasta, 52
–inner, 52
–keepmiss, 52
–log <log_file>, 52
–maxf <max_field>, 52
–minf <min_field>, 52
–outdir <out_dir>, 52
–outname <out_name>, 52
–uf <uniq_fields>, 52
–version, 52
-h, –help, 52
-n <max_missing>, 52
-s <seq_files>, 52

collectSeqQueue() (in module presto.Multiprocessing), 86
CommonHelpFormatter (class in presto.Commandline), 83
compilePrimers() (in module presto.Sequence), 88

ConvertHeaders command line option
–version, 53
-h, –help, 53

ConvertHeaders-454 command line option–delim <delimiter>, 54–failed, 53–fasta, 53–outdir <out_dir>, 54–outname <out_name>, 54–version, 53-h, –help, 53-s <seq_files>, 53

ConvertHeaders-genbank command line option–delim <delimiter>, 54–failed, 54–fasta, 54–outdir <out_dir>, 54–outname <out_name>, 54–version, 54-h, –help, 54-s <seq_files>, 54

ConvertHeaders-generic command line option–delim <delimiter>, 55–failed, 55–fasta, 55–outdir <out_dir>, 55–outname <out_name>, 55–version, 55-h, –help, 55-s <seq_files>, 55

ConvertHeaders-illumina command line option–delim <delimiter>, 55

–failed, 55–fasta, 55–outdir <out_dir>, 55–outname <out_name>, 56–version, 55-h, –help, 55-s <seq_files>, 55

ConvertHeaders-imgt command line option–delim <delimiter>, 56–failed, 56–fasta, 56–outdir <out_dir>, 56–outname <out_name>, 56–simple, 56–version, 56-h, –help, 56-s <seq_files>, 56

ConvertHeaders-sra command line option–delim <delimiter>, 57–failed, 57–fasta, 57–outdir <out_dir>, 57–outname <out_name>, 57–version, 56-h, –help, 56-s <seq_files>, 57

countSeqFile() (in module presto.IO), 83
countSeqSets() (in module presto.IO), 84

D

data_count (presto.Multiprocessing.SeqResult attribute), 86
deleteSeqPositions() (in module presto.Sequence), 88

E

EstimateError command line option
–delim <delimiter>, 57
–freq <min_freq>, 58
–log <log_file>, 57
–maxdiv <max_diversity>, 58
–mode {freq,qual}, 58
–nproc <nproc>, 57
–outdir <out_dir>, 57
–outname <out_name>, 57
–version, 57
-f <set_field>, 58
-h, –help, 57
-n <min_count>, 58
-q <min_qual>, 58
-s <seq_files>, 57

F

feedSeqQueue() (in module presto.Multiprocessing), 86

FilterSeq command line option
–version, 59
-h, –help, 59

FilterSeq-length command line option–failed, 59–fasta, 59–inner, 59–log <log_file>, 59–nproc <nproc>, 59–outdir <out_dir>, 59–outname <out_name>, 59–version, 59-h, –help, 59-n <min_length>, 59-s <seq_files>, 59

FilterSeq-maskqual command line option–failed, 60–fasta, 60–log <log_file>, 60–nproc <nproc>, 60–outdir <out_dir>, 60–outname <out_name>, 60–version, 60-h, –help, 60-q <min_qual>, 60-s <seq_files>, 60

FilterSeq-missing command line option–failed, 61–fasta, 60–inner, 61–log <log_file>, 61–nproc <nproc>, 61–outdir <out_dir>, 61–outname <out_name>, 61–version, 60-h, –help, 60-n <max_missing>, 61-s <seq_files>, 60

FilterSeq-quality command line option–failed, 61–fasta, 61–inner, 62–log <log_file>, 61–nproc <nproc>, 61–outdir <out_dir>, 61–outname <out_name>, 61–version, 61-h, –help, 61-q <min_qual>, 62-s <seq_files>, 61

FilterSeq-repeats command line option–failed, 62–fasta, 62–inner, 62–log <log_file>, 62

–missing, 62–nproc <nproc>, 62–outdir <out_dir>, 62–outname <out_name>, 62–version, 62-h, –help, 62-n <max_repeat>, 62-s <seq_files>, 62

FilterSeq-trimqual command line option–failed, 63–fasta, 63–log <log_file>, 63–nproc <nproc>, 63–outdir <out_dir>, 63–outname <out_name>, 63–reverse, 63–version, 63–win <window>, 63-h, –help, 63-q <min_qual>, 63-s <seq_files>, 63

findGapPositions() (in module presto.Sequence), 88
flattenAnnotation() (in module presto.Annotation), 79
frequencyConsensus() (in module presto.Sequence), 88

G

getAAScoreDict() (in module presto.Sequence), 89
getAnnotationValues() (in module presto.Annotation), 80
getCommonArgParser() (in module presto.Commandline), 83
getCoordKey() (in module presto.Annotation), 80
getDNAScoreDict() (in module presto.Sequence), 89
getFileType() (in module presto.IO), 84
getOutputHandle() (in module presto.IO), 84

I

indexSeqSets() (in module presto.Sequence), 89

M

makeBlastnDb() (in module presto.Applications), 81
makeUBlastDb() (in module presto.Applications), 81
manageProcesses() (in module presto.Multiprocessing), 87

MaskPrimers command line option
–version, 63
-h, –help, 63

MaskPrimers-align command line option–barcode, 65–delim <delimiter>, 64–failed, 64–fasta, 64–gap <gap_penalty>, 65–log <log_file>, 64–maxerror <max_error>, 65

–maxlen <max_len>, 65–mode {cut,mask,trim,tag}, 64–nproc <nproc>, 64–outdir <out_dir>, 64–outname <out_name>, 64–revpr, 65–skiprc, 65–version, 64-h, –help, 64-p <primer_file>, 64-s <seq_files>, 64

MaskPrimers-score command line option–barcode, 66–delim <delimiter>, 65–failed, 65–fasta, 65–log <log_file>, 65–maxerror <max_error>, 66–mode {cut,mask,trim,tag}, 66–nproc <nproc>, 66–outdir <out_dir>, 66–outname <out_name>, 66–revpr, 66–start <start>, 66–version, 65-h, –help, 65-p <primer_file>, 66-s <seq_files>, 65

mergeAnnotation() (in module presto.Annotation), 80

P

PairSeq command line option
–1f <fields_1>, 67
–2f <fields_2>, 67
–coord {illumina,solexa,sra,454,presto}, 67
–delim <delimiter>, 67
–failed, 67
–fasta, 67
–outdir <out_dir>, 67
–outname <out_name>, 67
–version, 66
-1 <seq_files_1>, 66
-2 <seq_files_2>, 66
-h, –help, 66

parseAnnotation() (in module presto.Annotation), 80
parseCommonArgs() (in module presto.Commandline), 83

ParseHeaders command line option
–version, 67
-h, –help, 67

ParseHeaders-add command line option–delim <delimiter>, 68–failed, 68–fasta, 68

–outdir <out_dir>, 68–outname <out_name>, 68–version, 68-f <fields>, 68-h, –help, 68-s <seq_files>, 68-u <values>, 68

ParseHeaders-collapse command line option–act {min,max,sum,first,last,set,cat}, 69–delim <delimiter>, 69–failed, 69–fasta, 69–outdir <out_dir>, 69–outname <out_name>, 69–version, 68-f <fields>, 69-h, –help, 69-s <seq_files>, 69

ParseHeaders-copy command line option–act {min,max,sum,first,last,set,cat}, 70–delim <delimiter>, 70–failed, 69–fasta, 69–outdir <out_dir>, 70–outname <out_name>, 70–version, 69-f <fields>, 70-h, –help, 69-k <names>, 70-s <seq_files>, 69

ParseHeaders-delete command line option–delim <delimiter>, 70–failed, 70–fasta, 70–outdir <out_dir>, 70–outname <out_name>, 71–version, 70-f <fields>, 71-h, –help, 70-s <seq_files>, 70

ParseHeaders-expand command line option–delim <delimiter>, 71–failed, 71–fasta, 71–outdir <out_dir>, 71–outname <out_name>, 71–sep <separator>, 71–version, 71-f <fields>, 71-h, –help, 71-s <seq_files>, 71

ParseHeaders-rename command line option–act {min,max,sum,first,last,set,cat}, 72–delim <delimiter>, 72

–failed, 72–fasta, 72–outdir <out_dir>, 72–outname <out_name>, 72–version, 72-f <fields>, 72-h, –help, 72-k <names>, 72-s <seq_files>, 72

ParseHeaders-table command line option–delim <delimiter>, 73–failed, 73–outdir <out_dir>, 73–outname <out_name>, 73–version, 72-f <fields>, 73-h, –help, 73-s <seq_files>, 73

ParseLog command line option–delim <delimiter>, 73–outdir <out_dir>, 73–outname <out_name>, 73–version, 73-f <fields>, 73-h, –help, 73-l <record_files>, 73

presto.Annotation (module), 79
presto.Applications (module), 81
presto.Commandline (module), 83
presto.IO (module), 83
presto.Multiprocessing (module), 86
presto.Sequence (module), 87
printLog() (in module presto.IO), 84
printMessage() (in module presto.IO), 84
printProgress() (in module presto.IO), 85
processSeqQueue() (in module presto.Multiprocessing), 87

Q

qualityConsensus() (in module presto.Sequence), 89

R

readPrimerFile() (in module presto.IO), 85
readReferenceFile() (in module presto.IO), 85
readSeqFile() (in module presto.IO), 85
renameAnnotation() (in module presto.Annotation), 81
reverseComplement() (in module presto.Sequence), 90
runBlastn() (in module presto.Applications), 81
runMuscle() (in module presto.Applications), 82
runUBlast() (in module presto.Applications), 82
runUClust() (in module presto.Applications), 82

S

scoreAA() (in module presto.Sequence), 90

scoreDNA() (in module presto.Sequence), 90
scoreSeqPair() (in module presto.Sequence), 90
SeqData (class in presto.Multiprocessing), 86
SeqResult (class in presto.Multiprocessing), 86

SplitSeq command line option
–version, 74
-h, –help, 74

SplitSeq-count command line option–fasta, 74–outdir <out_dir>, 74–outname <out_name>, 75–version, 74-h, –help, 74-n <max_count>, 75-s <seq_files>, 74

SplitSeq-group command line option–delim <delimiter>, 75–fasta, 75–num <threshold>, 75–outdir <out_dir>, 75–outname <out_name>, 75–version, 75-f <field>, 75-h, –help, 75-s <seq_files>, 75

SplitSeq-sample command line option–delim <delimiter>, 76–fasta, 76–outdir <out_dir>, 76–outname <out_name>, 76–version, 75-f <field>, 76-h, –help, 76-n <max_count>, 76-s <seq_files>, 76-u <values>, 76

SplitSeq-samplepair command line option–coord {illumina,solexa,sra,454,presto}, 77–delim <delimiter>, 77–fasta, 76–outdir <out_dir>, 77–outname <out_name>, 77–version, 76-1 <seq_files_1>, 76-2 <seq_files_2>, 76-f <field>, 77-h, –help, 76-n <max_count>, 77-u <values>, 77

SplitSeq-select command line option–delim <delimiter>, 77–fasta, 77–not, 78–outdir <out_dir>, 77

–outname <out_name>, 77–version, 77-f <field>, 78-h, –help, 77-s <seq_files>, 77-t <value_file>, 78-u <value_list>, 78

SplitSeq-sort command line option–delim <delimiter>, 78–fasta, 78–num, 78–outdir <out_dir>, 78–outname <out_name>, 78–version, 78-f <field>, 78-h, –help, 78-n <max_count>, 78-s <seq_files>, 78

subsetSeqIndex() (in module presto.Sequence), 91
subsetSeqSet() (in module presto.Sequence), 91

T

translateAmbigDNA() (in module presto.Sequence), 91

W

weightSeq() (in module presto.Sequence), 91
