Custom Bait Designs using eArray XD
Owen Hardy, Applications Scientist, Genomics
April 14, 2011


Agenda:

• Library design overview and tiling in eArray XD
  – Part of Genomics Workbench, no license required for eArray XD portion
• Importing custom sequence into eArray XD
• Additional tools for bait library creation:
  – Identifying orphan (singleton) baits and binning by GC content
    • Galaxy Workflows: http://main.g2.bx.psu.edu/
  – Finding homology to other genomic loci
    • BLAT (UCSC): http://genome.ucsc.edu/cgi-bin/hgBlat?command=start

Typical Workflow

Basic design with catalog species

• Define and load intervals

• Tile with masking ON

• Add each desired bait group created from tiling to library

• Submit for manufacture

Custom Designs

• Import sequence as custom genome or use existing catalog species

• Define and load intervals

• Initial tile with masking ON

• Find uncovered targets and submit as new intervals

• Iterate tiling with new intervals

– Can turn masking OFF to return the full set of baits that would otherwise be rejected

– Can relax the overlap threshold, e.g. to 40 bp, to return a subset of the baits initially rejected. For 1X designs, may also increase the tiling factor for more flexible bait placement

• Take location and sequence from TDT result file and convert to FASTA

• BLAT the FASTA and keep baits with low score and few hits

• Assess GC content of baits

• Create bait groups for high GC, orphan, BLAT, and rest

• Create Library from bait groups and boost high GC and orphans

Bait Tiling Design: Centered versus Justified

[Figure: baits tiled over a target region, Centered versus Justified, for three cases]

(a) Target region is large (example of 2X tiling)
• Centered baits extend past the target boundaries, but have even coverage across the region.
• Justified baits do not extend past the boundaries, but may have uneven coverage. This method is used for our catalog designs for exons >120 bp.

(b) Target region is 2 times the bait length
• Design is the same for Centered and Justified.

(c) Target region is shorter than the bait length
• Design is the same for Centered and Justified.
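The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration under our own assumptions (0-based half-open coordinates, a fixed step of bait length / tiling factor, evenly spread justified baits); it is not eArray's actual placement algorithm, and the function name `tile` is hypothetical.

```python
import math

BAIT_LEN = 120  # the only bait length offered, per the slides


def tile(start, end, factor=2, mode="centered", bait_len=BAIT_LEN):
    """Return bait start positions for the target [start, end).

    Illustrative only: the real eArray placement rules may differ.
    """
    target_len = end - start
    step = bait_len // factor
    if target_len <= bait_len:
        # Short target: a single bait centered on it, same for both modes.
        return [start - (bait_len - target_len) // 2]
    n = math.ceil((target_len - bait_len) / step) + 1
    if mode == "centered":
        # Fixed step; the whole block of baits is centered on the target,
        # so the outer baits may hang over the target boundaries.
        span = (n - 1) * step + bait_len
        first = start - (span - target_len) // 2
        return [first + i * step for i in range(n)]
    # Justified: first bait flush left, last bait flush right,
    # the rest spread between them (spacing may be slightly uneven).
    last = end - bait_len
    return [start + (i * (last - start)) // (n - 1) for i in range(n)]


print(tile(0, 330, 2, "centered"))   # baits overhang both ends
print(tile(0, 330, 2, "justified"))  # baits stay inside the target
```

With a 240 bp target (2 times the bait length) both modes return the same three baits, which matches case (b) above.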

Bait Tiling Design: Design Strategy

1. Change the parameters: To change these parameters, first clear the “Use Optimized Parameters” checkbox.

2. Centered versus Justified: see the Centered versus Justified slide for visuals.

3. Bait Length: Currently, 120 bp is the only bait length option. All baits will be designed to be 120 bp in length.

4. Bait Tiling Frequency: Options are 1X, 2X, 3X, 4X, and 5X and indicate the amount of bait overlap. Tiling frequency is not enforced at target edges. Increasing the frequency reduces the number or size of regions that a single library can cover.

5. Allowed overlap into avoid regions: Centered baits may overlap with regions adjacent to the target. If the targets are adjacent to Avoid Regions, enter the acceptable amount of overlap with these in bp. To ensure that there is no overlap with any Avoid Regions, select “0 bp”.

[Figure: a target with baits at 1X, 2X, 3X, 4X, and 5X tiling]

What tiling factor to use?

Find your total capture size in bp (size your FASTA or sum the intervals)

• Approximately 25-50% will be masked with RepeatMasker ON and a soft-masked target

• A single library can hold up to 57,680 unique 120mer baits (~6.9 Mb after masking, at 1X tiling)

Example: 25% masked out

• 4 Mb of input x 75% = 3 Mb net tileable

• 3.0 Mb / 57,680 baits = 52 bp spacing

• 120mer / 52 base stagger = 2.3

• So you can use up to 2X tiling, with some room for replication and add-ons.
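The same arithmetic as a small Python sketch, so it can be rerun for other input sizes and masking fractions. The function name is ours; the 57,680-bait capacity and 120 bp bait length come from the slide.

```python
MAX_BAITS = 57_680  # unique 120mer baits per library, per the slide
BAIT_LEN = 120


def max_tiling_factor(total_bp, masked_fraction):
    """Return (net tileable bp, bait spacing in bp, achievable tiling factor)."""
    net_bp = total_bp * (1 - masked_fraction)
    spacing = net_bp / MAX_BAITS
    return net_bp, spacing, BAIT_LEN / spacing


net, spacing, factor = max_tiling_factor(4_000_000, 0.25)
print(f"{net / 1e6:.1f} Mb tileable, {spacing:.0f} bp spacing, factor {factor:.1f}")
# prints: 3.0 Mb tileable, 52 bp spacing, factor 2.3
```

Rounding the factor down gives the highest whole tiling factor that fits (2X here); the leftover capacity is what is available for replicates and add-ons.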

Tiling Factor   Bait Spacing                 Read Length                  Max Target Size (1 Lib)
1X              End to end (every 120 bp)    Long (76 bp), SE or PE       6.9 Mb
2X              Every 60 bp                  Long or Short, SE or PE      3.4 Mb
3X              Every 40 bp                  Long or Short, SE or PE      2.3 Mb
4X              Every 30 bp                  Long or Short, SE or PE      1.7 Mb
5X              Every 24 bp                  Long or Short, SE or PE      1.4 Mb

Determine your regions of interest:

List them in a single-column, tab-delimited text file, using one of these identifier types: Accessions, Gene Symbols, Cytobands, OR Coordinates

Accessions:
NM_004457
NM_001009185
NM_004302
NM_207517
NM_001619
NM_144508
NM_006818
NM_016453
NM_014423
NM_001199199

Coordinates:
chr22:19702304-19712314
chr3:154013073-154023086
chr10:27060004-27065018
chr9:131371930-131381944
chr2:99165418-99195432
chr9:132678245-132678259
chr3:47168138-47168153
chr9:131390204-131390221
chr11:113931288-113931306

Gene Symbols:
ABCF1
ATP6V1G2
BRD2
CENPBD1
KDM4DL
LSM2
NFKBIL1
OR10C1
MSH5
PSMB8
WDR46

Cytobands:
1p22.2
3p14.1
5p2.3
etc.

Custom Sequence Import for eArray XD

• Create a *.zip file that contains one or more FASTA format sequence files

• Each FASTA file within the *.zip must have a unique name and be saved with the .fa extension, e.g. “chr1.fa”

• The header line in each sequence file must be the name of the specific chromosome associated with the sequence in the file, e.g. “>chr1”. The simpler the better, with no extra special characters, as this header must be the SAME as the target chromosome in your interval file.

• Each file must contain exactly one nucleotide sequence.

• The sequence in each file must be at least 60 nucleotides in length.

• You can “soft mask” repeat sequences.

• You can also import prior builds of assembled genomes as is (e.g. hg18) from sources like UCSC. Download the soft-masked *.gzip, extract the chromosomal .fa files, and re-zip them into a single file for upload to eArray XD.
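A pre-flight check along these lines can catch formatting problems before the upload. This is a minimal sketch, not an Agilent tool: the function names are hypothetical, and the rule that a "simple" header means letters, digits, and underscores only is our assumption.

```python
import re
import zipfile
from pathlib import Path

MIN_LENGTH = 60  # minimum sequence length stated in the slides


def check_fasta(path):
    """Return a list of problems for one FASTA file; an empty list means it looks OK."""
    path = Path(path)
    problems = []
    if path.suffix != ".fa":
        problems.append("extension is not .fa")
    lines = [ln.strip() for ln in path.read_text().splitlines() if ln.strip()]
    headers = [ln for ln in lines if ln.startswith(">")]
    if len(headers) != 1 or not lines[0].startswith(">"):
        problems.append("expected exactly one sequence, with its header on the first line")
        return problems
    name = headers[0][1:]
    # Assumption: "no extra special characters" = letters, digits, underscore.
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        problems.append(f"header '>{name}' contains spaces or special characters")
    length = sum(len(ln) for ln in lines[1:])
    if length < MIN_LENGTH:
        problems.append(f"sequence is {length} nt, shorter than {MIN_LENGTH}")
    return problems


def build_zip(fasta_paths, zip_path):
    """Check every FASTA file, then write them all into one zip for upload."""
    names = [Path(p).name for p in fasta_paths]
    if len(set(names)) != len(names):
        raise ValueError("file names inside the zip must be unique")
    for p in fasta_paths:
        problems = check_fasta(p)
        if problems:
            raise ValueError(f"{p}: " + "; ".join(problems))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in fasta_paths:
            zf.write(p, arcname=Path(p).name)
```

Soft-masked (lowercase) bases pass through untouched, since the check only counts sequence length.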

Custom Sequence Import

Possible issues:

• You have hundreds to thousands of scaffolds in a single file that need to be separated into their own files, which can be unwieldy
  – Import could fail on one file that is not formatted correctly, etc.
  – Each FASTA file is treated as its own chromosome, and base numbering will start “fresh” in each file
  – Coordinates for target intervals would need to be created for each file and adjusted based upon the size of each sequence
  – Short exons or ESTs that are ~60-119 bp cannot be tiled as is and will need to be “padded” in some way, either with flanking sequence or with a “foreign” sequence that should not hybridize with target or adaptor sequence

One possible route for joining multiple sequences:

• Create an artificial single chromosome
• Merge all files with “N” spacers between each scaffold
  – Galaxy workflow “Sureselect Pack Multi-Fasta for Earray Import”
• eArray will not tile across “N” junctions
• Single target interval
• Absolute base position of baits is lost but can be back-calculated if desired
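To make the packing and the back-calculation concrete, here is a minimal Python sketch. It is not the Galaxy workflow itself; the function names and the 200-base spacer length are our assumptions, and positions are 0-based.

```python
from bisect import bisect_right

SPACER = "N" * 200  # assumed spacer length; the slides do not specify one


def pack(scaffolds, spacer=SPACER):
    """Join (name, sequence) pairs into one artificial chromosome.

    Returns (joined_sequence, offsets), where offsets is a list of
    (start_in_joined, name, length) in the order joined.
    """
    parts, offsets, pos = [], [], 0
    for i, (name, seq) in enumerate(scaffolds):
        if i:
            parts.append(spacer)
            pos += len(spacer)
        offsets.append((pos, name, len(seq)))
        parts.append(seq)
        pos += len(seq)
    return "".join(parts), offsets


def unpack_position(offsets, joined_pos):
    """Map a position on the artificial chromosome back to (scaffold, position)."""
    starts = [start for start, _, _ in offsets]
    i = bisect_right(starts, joined_pos) - 1
    if i < 0:
        raise ValueError("position is before the first scaffold")
    start, name, length = offsets[i]
    local = joined_pos - start
    if local >= length:
        raise ValueError("position falls in an N spacer or past the end")
    return name, local
```

Keeping the offsets table (scaffold name and size in the order joined) is what makes the back-calculation possible later, for example to report bait positions per scaffold.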


Galaxy Web Tools

http://main.g2.bx.psu.edu

• Many useful tools to manipulate interval and tab files (e.g. BED), sequences, and others

• Ties into UCSC genome browser and others

• Can create and share workflows for repetitive tasks

Galaxy: example set of multiple fasta sequences

Galaxy Workflows

Galaxy Workflow: convert multiple fasta to a single file for eArray

• The output lists each scaffold name and size, in the order joined
• Download the result, save as .fa, and zip

Genomics Workbench Home Tab

• Choose Workspace from the toolbar
• Import your own custom sequence as a zip file
• Choose the eArray workspace from the toolbar

eArray XD Workspace

• Pending Jobs for Tiling, Upload, and Import. Job status moves from gray to yellow to green upon completion.
• Analogous to Design Wizards in web eArray: choose to tile or upload baits
• Results from QC Analyzer
• Results from Literature Searches
• Catalog source for download/search
• Your Workgroup
• Imported External designs
• Custom designs, including those on the eArray website

Bait Tiling: to create a single bait group

Create Library from Bait Tiling Wizard

Check Design and Download Result Files

View the results of the bait tiling job

Download the four files included in the job’s zip file:

• BED file
• TDT (tab delimited table) file
• Fate file
• Summary file

[Figure: an example of each of the four files]

Customizations

• Find precise sub-intervals that are uncovered after initial tiling
• Find any “orphan” baits, separate them into their own bait group, and replicate them in the library
• Separate out high GC baits into their own bait group
• Retile with increased avoid overlaps or with RepeatMasker OFF
• BLAT retiled baits for homology to off-target loci, and add the acceptable ones to your library

Finding Uncovered Target Regions After Tiling

• Download and open the fate file
  – Status=“Fail” means the intervals are invalid (out of range or not in + strand orientation)
  – Status=“Pass” & “Baits Generated”=0 means any baits made were rejected due to overlap with repeat/avoid intervals
  – Can re-tile these intervals with relaxed/no masking and BLAT the results
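The two fate-file rules above can be applied with a short script. This sketch assumes the fate file is tab-delimited with a header row containing columns named exactly "Status" and "Baits Generated" (the names quoted above); the real file layout may differ. The "Interval" column and the sample rows are made up for illustration.

```python
import csv
import io


def split_fate(fate_lines):
    """Return (invalid, uncovered) rows from an iterable of fate-file lines.

    invalid:   Status == "Fail"
    uncovered: Status == "Pass" with zero baits generated
    """
    invalid, uncovered = [], []
    for row in csv.DictReader(fate_lines, delimiter="\t"):
        if row["Status"] == "Fail":
            invalid.append(row)
        elif row["Status"] == "Pass" and int(row["Baits Generated"]) == 0:
            uncovered.append(row)
    return invalid, uncovered


# Made-up example data, not real eArray output.
example = io.StringIO(
    "Interval\tStatus\tBaits Generated\n"
    "chr1:100-500\tPass\t6\n"
    "chr1:900-1000\tPass\t0\n"
    "chr9:999999999-1000000100\tFail\t0\n"
)
invalid, uncovered = split_fate(example)
print([r["Interval"] for r in uncovered])  # candidates for re-tiling
```

With a real file, pass an open file handle instead of the `io.StringIO` object.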

Finding Uncovered Target Regions After Tiling

Some problems with using just the fate file:

• Although the fate file tells you the number of baits for a given target interval, it shows only whether the interval is hit or not; you don’t necessarily get a good idea of fractional coverage or missed areas.
• When tiling with few, long input intervals, it is hard to judge coverage based on bait count.
• You may want to iterate on a missed exon in a gene/interval that is otherwise well covered.

Can use Galaxy Workflow

– “Sureselect Find Uncovered Target Intervals”

Example Coverage with Orphan

[Figure: sequence coverage (axis from 0 to 269) on a non-PAR portion of Chromosome X using 2X tiling, with the bait coordinates at 2X tiling shown underneath. The orphan bait has lower coverage without “boosting” its reps.]

Later designs added reps to the “orphan baits” on short exons to even out coverage.

Finding Orphans After Tiling

• Orphans can result from:
  – tiling an interval <120 bp (only 1 bait made, assuming no repeat)
  – tiling an interval >121 bp where subsequent baits overlap repeats and are rejected
• Typically a gap of 20 bp can still get good coverage, so an orphan may be defined as being >20 bp from its neighbor on both sides
• Can locate orphans by clustering baits within 20 bp
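The ">20 bp from a neighbor on both sides" definition can be sketched directly in Python. This is an illustration, not the Galaxy workflow; it assumes baits are given as (chromosome, start, end) with 0-based half-open coordinates, and that all baits have the same length so that sorting by start also sorts by end.

```python
def find_orphans(baits, max_gap=20):
    """Return baits with no neighbor within max_gap bp on either side."""
    baits = sorted(baits)
    orphans = []
    for i, (chrom, start, end) in enumerate(baits):
        prev = baits[i - 1] if i > 0 and baits[i - 1][0] == chrom else None
        nxt = baits[i + 1] if i + 1 < len(baits) and baits[i + 1][0] == chrom else None
        # A gap is negative when neighboring baits overlap.
        near_left = prev is not None and start - prev[2] <= max_gap
        near_right = nxt is not None and nxt[1] - end <= max_gap
        if not near_left and not near_right:
            orphans.append((chrom, start, end))
    return orphans


baits = [
    ("chr1", 0, 120), ("chr1", 60, 180),     # overlapping pair
    ("chr1", 500, 620),                      # isolated: orphan
    ("chr1", 700, 820), ("chr1", 830, 950),  # 10 bp gap: still a cluster
    ("chr2", 0, 120),                        # only bait on chr2: orphan
]
print(find_orphans(baits))
# prints: [('chr1', 500, 620), ('chr2', 0, 120)]
```

The orphans returned here are the baits to move into their own bait group for extra replicates.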

To reduce Orphans before tiling: separate or pad short intervals

• Intervals at the top of the list, at <121 bp, would get 1 bait regardless of tiling factor (or zero if lying in a repeat overlap)
• Can split this list into 2 files, tile separately, and assign 1 or 2 more reps to the short exon bait group
• Can also pad short intervals and tile all together

What to do with High GC?

• Above ~65% GC, coverage begins to decline on average
  – Multiple possible factors, including bait hybridization, PCR amplification, and sequencing
  – Low GC % typically does not correlate with coverage as strongly
• Not straightforward to identify GC rich targets prior to tiling
• After tiling you can simply calculate the GC % of all baits and move the high GC baits into a separate bait group
• Boost this group as you would orphans
• If your “main design” is 1X and you have room in the library, you could also retile the high GC group at a higher tiling factor
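Calculating GC % per bait and splitting at the ~65% mark takes only a few lines. A minimal sketch follows; the function names are ours and the bait names in the example are made up.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a sequence (case-insensitive)."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def split_by_gc(baits, threshold=65.0):
    """Split (name, sequence) pairs into (high_gc, rest) at the threshold."""
    high, rest = [], []
    for name, seq in baits:
        (high if gc_percent(seq) > threshold else rest).append((name, seq))
    return high, rest


# Made-up 120mers for illustration.
baits = [
    ("bait_1", "GC" * 60),             # 100% GC
    ("bait_2", "AT" * 60),             # 0% GC
    ("bait_3", "G" * 79 + "A" * 41),   # just above 65% GC
]
high, rest = split_by_gc(baits)
print([name for name, _ in high])
# prints: ['bait_1', 'bait_3']
```

Soft-masked (lowercase) bases are counted too, because the sequence is uppercased first.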

To find uncovered targets and bin baits by orphan status and GC %

External tool Galaxy: http://main.g2.bx.psu.edu/

Workflows:

• Sureselect Find Uncovered Target Intervals
• Sureselect Bin Orphan and High GC Baits

Initial Tiling in eArray

Tiling job gives you 4 files

• Fate, Summary, BED, TDT
  – Open your BED and TDT files, for example in Excel. Copy these columns and paste them into the BED file, to keep track of everything in one combined file going forward:
    • baitlocation column
    • sequence column

Example Combined BED/TDT

• You can use the score column as a way to filter or keep track of baits you add, for example after BLAT.
• For example, you might set the score of baits from the initial tiling to 1000, and the score of baits added by BLAT to 500
• This will color code them in UCSC as well

[Figure: example combined file, with the Score column marked and the columns labeled as coming from the BED or from the TDT]

Upload BED/TDT and target intervals to Galaxy


Find Uncovered Intervals

Column   Value
1        Target chromosome
2        Target start
3        Target stop
4        Bases covered in target interval
5        Fraction of target covered
6        # baits that fall completely in the target interval
7        # baits that partially overlap the target interval
8        # bp in target interval not covered

Additional outputs:

• Sub-intervals with no bait
• Intervals with more than 20bp to cover
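For readers without access to the Galaxy workflow, the same eight values and the uncovered sub-intervals can be computed with a short script. This is a sketch under our own assumptions (0-based half-open coordinates, hypothetical function name), not the workflow's actual implementation.

```python
def coverage_report(target, baits, min_gap=20):
    """Summarize bait coverage of one target interval.

    target: (chrom, start, stop); baits: list of (chrom, start, stop).
    Returns (row, uncovered, to_retile), where row follows the eight
    columns above, uncovered lists sub-intervals with no bait, and
    to_retile keeps only the sub-intervals longer than min_gap bp.
    """
    chrom, t_start, t_stop = target
    hits = [(s, e) for c, s, e in baits if c == chrom and s < t_stop and e > t_start]
    full = sum(1 for s, e in hits if s >= t_start and e <= t_stop)
    partial = len(hits) - full
    # Walk the clipped baits left to right and record the holes.
    uncovered, pos = [], t_start
    for s, e in sorted((max(s, t_start), min(e, t_stop)) for s, e in hits):
        if s > pos:
            uncovered.append((pos, s))
        pos = max(pos, e)
    if pos < t_stop:
        uncovered.append((pos, t_stop))
    missing = sum(e - s for s, e in uncovered)
    length = t_stop - t_start
    row = (chrom, t_start, t_stop, length - missing,
           (length - missing) / length, full, partial, missing)
    to_retile = [(s, e) for s, e in uncovered if e - s > min_gap]
    return row, uncovered, to_retile


row, uncovered, to_retile = coverage_report(
    ("chr1", 100, 500),
    [("chr1", 100, 220), ("chr1", 160, 280), ("chr1", 450, 570), ("chr2", 100, 220)],
)
print(row)
print(to_retile)
```

The sub-intervals in `to_retile` are the ones to submit as new intervals for the next tiling round.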

Finding Orphans and Computing GC%

Input: combined BED/TDT

Will give you:

• total base coverage
• file for baits above 65% GC
• file for orphans
• file for remaining baits

BLAT

Web-BLAT (via UCSC) will output a homology score for your input sequence(s) vs the target sequence (genome)

– Useful for target enrichment, as a given 120mer may have been rejected during tiling for repeat overlap but may have low potential for off-target hybridization relative to a “perfect match.”
– Homology score is calculated as:
  • matches - (mismatches + gaps in query + gaps in target)
  • Perfect score is 120 for a bait

The web BLAT server is limited to batches of 25 sequences

Standalone BLAT can be used for frequent or large batches
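The score formula and the 25-sequence batch limit are easy to script. The sketch below implements the formula exactly as stated on this slide; the function names are ours, and nothing here calls the BLAT server itself.

```python
def homology_score(matches, mismatches, query_gaps, target_gaps):
    """Score as given on the slide: matches - (mismatches + gaps)."""
    return matches - (mismatches + query_gaps + target_gaps)


def batches(items, size=25):
    """Yield consecutive chunks of at most `size` items (web BLAT limit)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


print(homology_score(120, 0, 0, 0))   # perfect 120mer hit
print(homology_score(100, 5, 1, 2))   # a weaker secondary hit

bait_names = [f"bait_{i}" for i in range(60)]  # made-up names
print([len(chunk) for chunk in batches(bait_names)])
# prints: 120, then 92, then [25, 25, 10]
```

Each chunk would then be pasted or uploaded to web BLAT as its own submission.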


UCSC Genome Browser

Batch BLAT of candidate baits converted to FASTA file
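Converting the bait table to FASTA for BLAT can be done with a few lines of Python. This sketch assumes a tab-delimited file with a header row and columns named "baitlocation" and "sequence"; those names are taken from the slides, but the exact headers and capitalization in a real TDT file may differ, so they are parameters. The sample rows are made up.

```python
import csv
import io


def table_to_fasta(lines, name_col="baitlocation", seq_col="sequence"):
    """Build FASTA text from tab-delimited lines, one record per row."""
    records = []
    for row in csv.DictReader(lines, delimiter="\t"):
        records.append(f">{row[name_col]}\n{row[seq_col]}\n")
    return "".join(records)


# Made-up example rows, not real eArray output.
example = io.StringIO(
    "baitlocation\tsequence\n"
    "chr1:100-219\tACGTACGT\n"
    "chr1:160-279\tGGCCGGCC\n"
)
print(table_to_fasta(example), end="")
```

Using the bait location as the FASTA record name makes it easy to match BLAT hits back to rows in the combined BED/TDT file.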

BLAT Search Result

No concrete rules, but:

• Look for baits with few secondary hits and low secondary scores
• Try to avoid long spans (more than ~40 bp) of identity to an off-target locus

[Figure: example BLAT results, one labeled “Bad” and one labeled “Good”]

Preparing Baits for eArray

• From Galaxy download results for:

– High GC group

– Orphan group

– Remaining Non-orphans and mid-GC %

• Add any baits filtered via BLAT

• Upload to eArray, each file as its own bait group

• Create Library:

– Add each bait group to the library

– Replicate the orphan and high GC groups 2X relative to the others
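Before creating the library, it helps to check that the bait groups still fit once the replicates are counted. A minimal sketch, assuming replicates count against the same 57,680-bait capacity quoted earlier; the group sizes below are made up.

```python
MAX_BAITS = 57_680  # library capacity quoted earlier in the slides


def library_size(groups):
    """Total baits in a library; groups maps name -> (n_baits, replicates)."""
    return sum(n * reps for n, reps in groups.values())


# Hypothetical group sizes for illustration only.
groups = {
    "remaining": (40_000, 1),
    "orphans": (3_000, 2),
    "high_gc": (2_500, 2),
    "blat_rescued": (1_500, 1),
}
total = library_size(groups)
print(total, "baits used,", MAX_BAITS - total, "free")
# prints: 52500 baits used, 5180 free
```

If the total exceeds the capacity, lower the replicate count of a boosted group or drop to a lower tiling factor for the main group.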

Preparing Baits for eArray Website

Format like this:

[Figure: bait upload table. Some columns come from the Galaxy output (downloaded from Galaxy), some you create yourself, and some can be blank.]

Create Library and Submit

Listen to our past recorded webinars and register for upcoming live webinars

Feb 8: Cost-Effective Options using SureSelect System in Exome Sequencing and Targeted Applications (Josh Wang, PhD)
March 1: Bioanalyzer applications for Next Gen Sequencing: Updates and Tips (Charmian Cher, PhD)
March 3: SureSelect Custom Library Design Optimization (David Willmot, PhD)
March 8: Assessing the quantity of index-tagged libraries by QPCR (Charmian Cher, PhD)
March 24: QPCR: Mx Software Training - How to set up an absolute quantitation experiment and a relative quantitation experiment on the MX (Cathy Cutler)
April 14: SureSelect - Designing baits for custom genomes using eArray XD (Owen Hardy)
April 20: Interpretation of Next Gen Sequencing Performance metrics in Agilent Genomic Workbench (David Willmot, PhD)

www.agilent.com/genomics/eSeminarSeries

Thank you for attending! Questions?

For more information:

www.genomics.agilent.com

[email protected]

1-800-227-9770 options x3x4x4

Genomics Workbench:

genomics.agilent.com
