Owen Hardy
Applications Scientist
Genomics
April 14, 2011
Custom Bait Designs using eArray XD
Agenda:
• Library design overview and tiling in eArray XD
– Part of Genomics Workbench, no license required for eArray XD portion
• Importing custom sequence into eArray XD
• Additional tools for bait library creation:
– Identifying orphan (singleton) baits and binning by GC content
  • Galaxy Workflows: http://main.g2.bx.psu.edu/
– Finding homology to other genomic loci
  • BLAT (UCSC): http://genome.ucsc.edu/cgi-bin/hgBlat?command=start
Typical Workflow
Basic design with catalog species
• Define and load intervals
• Tile with masking ON
• Add each desired bait group created from tiling to library
• Submit for manufacture
Custom Designs
• Import sequence as custom genome or use existing catalog species
• Define and load intervals
• Initial tile with masking ON
• Find uncovered targets and submit as new intervals
• Iterate tiling with new intervals
– Can turn masking OFF to return the full set of baits that would otherwise be rejected
– Can relax the overlap threshold to e.g. 40 bp to return a subset of the baits initially rejected. For 1X designs, may also increase the tiling factor for more flexible bait placement
• Take location and sequence from the TDT result file and convert to FASTA
• BLAT the FASTA and keep baits with few secondary hits and low secondary scores
• Assess GC content of baits
• Create bait groups for high GC, orphan, BLAT, and rest
• Create Library from bait groups and boost high GC and orphans
Bait Tiling Design: Centered versus Justified
[Figure: baits laid across a target region under Centered and Justified tiling, for three cases]
(a) Target region is large (example of 2x tiling)
(b) Target region is 2 times the bait length
(c) Target region is shorter than the bait length
• Centered baits extend past boundaries, but have even coverage across the region.
• Justified baits do not extend past boundaries, but may have uneven coverage. This method is used for our catalog designs for exons >120 bp.
• For two of the cases shown, the design is the same for Centered and Justified.
Bait Tiling Design: Design Strategy
1. Change the parameters: To be able to change these parameters, first clear the “Use Optimized Parameters” checkbox.
2. Centered versus Justified: (see the “Centered versus Justified” slide for visuals)
3. Bait Length: Currently, 120 bp is available as the only bait
length option. All baits will be designed to be 120 bp in length.
4. Bait Tiling Frequency:
Options include 1X, 2X, 3X, 4X, and 5X and indicate the amount of bait overlap. Tiling frequency is not enforced at target edges. Because each library holds a fixed number of baits, increasing the frequency means a library can cover fewer or smaller regions.
5. Allowed overlap into avoid regions: Centered baits may overlap with regions adjacent to the target. If the targets are adjacent to Avoid Regions, enter the acceptable amount of overlap with these in bp. To ensure that there is no overlap with any Avoid Regions, select “0 bp”.
[Figure: baits laid across a target at 1X, 2X, 3X, 4X, and 5X tiling]
What tiling factor to use?
Find your total capture bp (size your FASTA or sum the intervals)
• Approximately 25-50% will be masked with RepeatMasker ON and a soft-masked target
• A single library can hold up to 57,680 unique 120mer baits (~6.9 Mb after masking, 1X tiling)
Example: 25% masked out
• 4 Mb of input x 75% = 3 Mb net tileable
• 3.0 Mb / 57,680 baits = 52 bp spacing
• 120mer / 52 base stagger = 2.3
• So you can use up to 2X tiling, with some room for replication and add-ons.
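The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the slide's worked example: the function name and its return shape are made up here, and the cap at 5X simply reflects the highest tiling option listed for eArray XD.

```python
import math

MAX_BAITS = 57_680   # unique 120mer baits per library (figure from the slide)
BAIT_LENGTH = 120    # bp, the only bait length offered

def max_tiling_factor(input_bp, masked_fraction,
                      max_baits=MAX_BAITS, bait_length=BAIT_LENGTH):
    """Return (net tileable bp, bait spacing in bp, largest whole tiling factor)."""
    net_bp = input_bp * (1.0 - masked_fraction)   # bp left after repeat masking
    spacing = net_bp / max_baits                  # stagger if every bait is used
    factor = bait_length / spacing                # fold coverage at that stagger
    return net_bp, spacing, min(5, math.floor(factor))

# Worked example from the slide: 4 Mb of input, 25% masked out.
net_bp, spacing, factor = max_tiling_factor(4_000_000, 0.25)
print(f"{net_bp / 1e6:.1f} Mb tileable, {spacing:.0f} bp spacing, up to {factor}X tiling")
```

A result of 0 would mean the input is too large for a single library even at 1X.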
Tiling Factor   Bait Spacing                 Read Length                Max Target Size (1 Lib)
1X              End to end (every 120 bp)    Long (76 bp), SE or PE     6.9 Mb
2X              Every 60 bp                  Long or Short, SE or PE    3.4 Mb
3X              Every 40 bp                  Long or Short, SE or PE    2.3 Mb
4X              Every 30 bp                  Long or Short, SE or PE    1.7 Mb
5X              Every 24 bp                  Long or Short, SE or PE    1.4 Mb
Determine your regions of interest:
List them in a single-column, tab-delimited text file, using one of these identifier types:
Accessions, Gene Symbols, Cytobands, OR Coordinates
Accessions:
NM_004457, NM_001009185, NM_004302, NM_207517, NM_001619, NM_144508, NM_006818, NM_016453, NM_014423, NM_001199199
Coordinates:
chr22:19702304-19712314, chr3:154013073-154023086, chr10:27060004-27065018, chr9:131371930-131381944, chr2:99165418-99195432, chr9:132678245-132678259, chr3:47168138-47168153, chr9:131390204-131390221, chr11:113931288-113931306
Gene symbols:
ABCF1, ATP6V1G2, BRD2, CENPBD1, KDM4DL, LSM2, NFKBIL1, OR10C1, MSH5, PSMB8, WDR46
Cytobands:
1p22.2, 3p14.1, 5p2.3, etc.
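For coordinate-style lists, the total capture size can be found by summing the intervals. The sketch below is a rough illustration: the helper names are hypothetical, and it assumes 1-based inclusive coordinates and non-overlapping intervals, neither of which the slides state.

```python
import re

# chr<name>:<start>-<stop>, as in the coordinate examples on the slide
COORD_RE = re.compile(r"^(chr[\w.]+):(\d+)-(\d+)$")

def parse_interval(text):
    """Split 'chr22:19702304-19712314' into (chrom, start, stop)."""
    match = COORD_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a chr:start-stop coordinate: {text!r}")
    chrom, start, stop = match.group(1), int(match.group(2)), int(match.group(3))
    if stop < start:
        raise ValueError(f"stop is before start: {text!r}")
    return chrom, start, stop

def total_capture_bp(lines):
    """Sum interval lengths (assumes inclusive coordinates, no overlaps)."""
    return sum(stop - start + 1 for _, start, stop in map(parse_interval, lines))

intervals = [
    "chr22:19702304-19712314",
    "chr3:154013073-154023086",
    "chr10:27060004-27065018",
]
print(total_capture_bp(intervals))
```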
Custom Sequence Import for eArray XD
• Create a *.zip file that contains one or more FASTA format sequence files
• Each FASTA file within the *.zip must have a unique name and be saved with a .fa extension, e.g. “chr1.fa”
• The header line in each sequence file must be the name of the specific chromosome associated with the sequence in the file, e.g. “>chr1”. The simpler the better, with no extra special characters, as this header is the SAME as the target chromosome in your interval file.
• Each file must contain exactly one nucleotide sequence.
• The sequence in each file must be at least 60 nucleotides in length.
• You can “soft mask” repeat sequences.
• You can also import prior builds of assembled genomes as is (e.g. hg18) from sources like UCSC. Download the soft-masked *.gzip, extract the chromosomal .fa files, and re-zip them into a single file for upload to eArray XD.
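Before uploading, the import rules above can be checked locally. The following is a minimal sketch using only the Python standard library; `check_fasta` and `build_zip` are hypothetical helpers written for this illustration, not part of eArray XD, and they cover only the rules listed on the slide.

```python
import io
import zipfile

def check_fasta(name, text, min_length=60):
    """Return a list of problems found in one FASTA file's text (empty if none)."""
    problems = []
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    headers = [ln for ln in lines if ln.startswith(">")]
    if not name.endswith(".fa"):
        problems.append("file name must end in .fa")
    if len(headers) != 1 or not lines[0].startswith(">"):
        problems.append("file must contain exactly one sequence with a header line")
    seq_len = sum(len(ln) for ln in lines if not ln.startswith(">"))
    if seq_len < min_length:
        problems.append(f"sequence is {seq_len} nt, shorter than {min_length}")
    return problems

def build_zip(files):
    """files: dict of file name -> FASTA text. Returns the zip archive as bytes."""
    for name, text in files.items():
        problems = check_fasta(name, text)
        if problems:
            raise ValueError(f"{name}: {'; '.join(problems)}")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in files.items():
            archive.writestr(name, text)   # dict keys keep file names unique
    return buffer.getvalue()

# Example: one 80 nt sequence whose header matches the chromosome name.
payload = build_zip({"chr1.fa": ">chr1\n" + "ACGT" * 20 + "\n"})
```

Writing `payload` to disk gives a single archive to import; whether the import accepts it still depends on checks inside eArray XD that this sketch cannot see.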
Custom Sequence Import
Possible issues:
• You have hundreds to thousands of scaffolds in a single file that need to be separated into their own files, which can be unwieldy
– Import could fail on one file that is not formatted correctly etc.
– Each FASTA file is treated as its own chromosome and base numbering will start “fresh” in each file
– Coordinates for target intervals would need to be created for each file and adjusted based upon the size of each sequence
– Short exons or ESTs that are ~60-119 bp cannot be tiled as is and will need to be “padded” in some way, either with flanking sequence or with a “foreign” sequence that should not hybridize with target or adaptor sequence.
One possible route for joining multiple sequences:
• Create an artificial single chromosome
• Merge all files with “N” spacers between each scaffold
– Galaxy workflow “Sureselect Pack Multi-Fasta for Earray Import”
• eArray will not tile across “N” junctions
• Single target interval
• Absolute base position of baits is lost but can be back-calculated if desired
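The joining route above, and the back-calculation of scaffold positions, might look roughly like this in Python. It is not the Galaxy workflow itself: the function names are invented for illustration, and the spacer length is an arbitrary assumption because the slides do not give one.

```python
def pack_scaffolds(scaffolds, spacer_length=200):
    """scaffolds: list of (name, sequence) in the order to join.

    Returns (merged sequence, offsets), where offsets maps each scaffold
    name to its 0-based start position in the merged sequence.
    The default spacer length is an assumption, not a documented value.
    """
    spacer = "N" * spacer_length
    offsets = {}
    parts = []
    position = 0
    for index, (name, sequence) in enumerate(scaffolds):
        if index > 0:
            parts.append(spacer)          # "N" run between scaffolds
            position += spacer_length
        offsets[name] = position
        parts.append(sequence)
        position += len(sequence)
    return "".join(parts), offsets

def to_scaffold_coordinate(merged_position, scaffolds, offsets):
    """Back-calculate (scaffold name, 0-based position within it)."""
    for name, sequence in scaffolds:
        start = offsets[name]
        if start <= merged_position < start + len(sequence):
            return name, merged_position - start
    return None  # position falls inside an N spacer or past the end

scaffolds = [("scaf1", "ACGTACGT"), ("scaf2", "GGCC")]
merged, offsets = pack_scaffolds(scaffolds, spacer_length=5)
print(merged)                                          # ACGTACGTNNNNNGGCC
print(to_scaffold_coordinate(14, scaffolds, offsets))  # ('scaf2', 1)
```

Keeping the `offsets` table alongside the merged file is what makes the "lost" absolute positions recoverable later.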
Galaxy Web Tools
http://main.g2.bx.psu.edu
• Many useful tools to manipulate interval and tab files (eg
BED), sequences, and others
• Ties into UCSC genome browser and others
• Can create and share workflows for repetitive tasks
Galaxy Workflow: convert multiple FASTA to a single file for eArray
Scaffold name and size, in order joined
Download, save as .fa and zip
Genomics Workbench Home Tab
Choose Workspace from Toolbar
Import your own custom sequence as a zip file
Choose eArray workspace from toolbar
eArray XD Workspace
Pending Jobs for Tiling, Upload, and Import. Job status moves from gray to yellow to green upon completion.
Analogous to Design Wizards in web eArray: choose to tile or upload baits
Results from QC Analyzer
Results from Literature Searches
--Catalog source for download/search
--Your Workgroup
--Imported external designs
--Custom designs, including those on the eArray website
View the results of the bait tiling job
Download the four files included in the job’s zip file:
Example of a BED file:
Example of a TDT (tab delimited table) file:
Example of a Fate file:
Example of a Summary file:
Customizations
• Find precise sub-intervals that are uncovered after initial tiling
• Find any “orphan” baits, separate them into their own bait group, and replicate them in the library
• Separate high GC baits into their own bait group
• Retile with increased avoid overlaps or with RepeatMasker OFF
• BLAT retiled baits for homology to off-target loci and add those that pass to your library
Finding Uncovered Target Regions After Tiling
• Download and open fate file
– Status=“Fail” means the intervals are invalid (out of range or not in +
strand orientation)
– Status=“Pass” & “Baits Generated” =0 means any baits made were
rejected due to overlap with repeat/avoid intervals
– Can re-tile these intervals with relaxed/no masking and BLAT results
Finding Uncovered Target Regions After Tiling
Some problems using just the fate file:
• Although the fate file tells you the number of baits for a given target interval, it only shows whether that interval is hit or not; it does not give a good idea of fractional coverage or missed areas.
• When tiling with few/long input intervals, it is hard to judge coverage based on bait count.
• A missed exon can hide in a gene/interval that is otherwise well covered, and you may want to iterate on it.
Can use Galaxy Workflow
– “Sureselect Find Uncovered Target Intervals”
Example Coverage with Orphan
Agilent SureSelect™ Platform: Enabling Products for the Next-Generation Sequencing Workflow
[Figure: sequence coverage (axis 0 to 269) on a non-PAR portion of Chromosome X using 2x tiling, with bait coordinates at 2x tiling shown alongside]
Later designs added reps to the “orphan baits” on short exons to even out coverage.
Orphan bait has lower coverage without “boosting” its reps
Finding Orphans After Tiling
• Orphans can result from:
– tiling an interval <120 bp (only 1 bait made, assuming no repeat)
– tiling an interval >121 bp where subsequent baits overlap repeats and are rejected
• Typically a gap of 20 bp can still get good coverage, so an orphan may be defined as being >20 bp from its neighbor on both sides
• Can locate orphans by clustering baits within 20 bp
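The orphan definition (more than 20 bp from the nearest bait on both sides) can be applied directly to bait coordinates. The sketch below is an illustration, not the Galaxy workflow: `find_orphans` is a hypothetical helper, it assumes BED-style half-open coordinates, and it assumes equal-length baits so that sorting by start also sorts by end.

```python
from collections import defaultdict

def find_orphans(baits, max_gap=20):
    """baits: list of (chrom, start, end) in BED-style coordinates.

    Returns baits that are more than max_gap bp away from the neighboring
    bait on each side (a bait with no neighbor on a side counts as far).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in baits:
        by_chrom[chrom].append((start, end))

    orphans = []
    for chrom, spans in by_chrom.items():
        spans.sort()
        for i, (start, end) in enumerate(spans):
            # Negative gaps mean the neighboring bait overlaps this one.
            left_gap = start - spans[i - 1][1] if i > 0 else None
            right_gap = spans[i + 1][0] - end if i + 1 < len(spans) else None
            far_left = left_gap is None or left_gap > max_gap
            far_right = right_gap is None or right_gap > max_gap
            if far_left and far_right:
                orphans.append((chrom, start, end))
    return orphans

baits = [
    ("chrX", 100, 220), ("chrX", 160, 280),      # overlapping pair (2X tiling)
    ("chrX", 1000, 1120), ("chrX", 1130, 1250),  # 10 bp apart: still a cluster
    ("chrX", 5000, 5120),                        # isolated bait
]
print(find_orphans(baits))   # [('chrX', 5000, 5120)]
```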
To reduce Orphans before tiling: separate or pad short intervals
--Top intervals at <121 bp would get 1 bait regardless of tiling factor (or zero if lying in a repeat overlap)
--Can split this list into 2 files, tile separately, & assign 1 or 2 more reps to the short exon bait group
--Can also pad short intervals and tile all together
What to do with High GC?
• Above ~65% GC, coverage begins to decline on average
– Multiple possible factors including bait hybridization, PCR amplification, and sequencing
– Low GC % typically does not correlate with coverage as strongly
• Not straightforward to identify GC rich targets prior to tiling
• After tiling you can simply calculate GC % of all baits and move the high GC baits to a separate bait group
• Boost this group as you would orphans
• If your “main design” is 1X and you have room in the library, you could also retile the high GC group at a higher tiling factor
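Calculating GC % per bait and splitting at the ~65% mark takes only a few lines. This is a minimal sketch with hypothetical helper names; it upper-cases sequences so soft-masked (lowercase) bases are counted, and it ignores N and other ambiguity codes in the denominator, which is a choice made here rather than anything the slides specify.

```python
def gc_percent(sequence):
    """GC content of a sequence as a percentage of its A/C/G/T bases."""
    sequence = sequence.upper()              # count soft-masked bases too
    called = sum(sequence.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (sequence.count("G") + sequence.count("C")) / called

def split_by_gc(baits, threshold=65.0):
    """baits: dict of bait id -> sequence. Returns (high_gc, rest) dicts."""
    high_gc = {k: s for k, s in baits.items() if gc_percent(s) > threshold}
    rest = {k: s for k, s in baits.items() if k not in high_gc}
    return high_gc, rest

baits = {
    "bait_1": "GC" * 40 + "AT" * 20,   # 80 of 120 bases are G or C
    "bait_2": "ACGT" * 30,             # 60 of 120 bases are G or C
}
high_gc, rest = split_by_gc(baits)
print(sorted(high_gc), sorted(rest))   # ['bait_1'] ['bait_2']
```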
To find uncovered targets and bin baits by
orphan status and GC %
External tool Galaxy: http://main.g2.bx.psu.edu/
Workflows:
• Sureselect Find Uncovered Target Intervals
• Sureselect Bin Orphan and High GC Baits
Initial Tiling in eArray
Tiling job gives you 4 files
• Fate, Summary, BED, TDT
– Open your BED and TDT files, for example in Excel. Copy these TDT columns and paste them into the BED file to keep one combined file going forward:
• baitlocation column
• sequence column
Example Combined BED/TDT
• You can use the score column as a way to filter or keep track of baits you add, for example after BLAT.
• You might leave baits from initial tiling at 1000, and set baits added by BLAT to 500
• This will color code them in UCSC as well
[Figure: combined BED/TDT table with the Score column highlighted and columns marked “From BED” and “From TDT”]
Find Uncovered Intervals
Column   Value
1        Target chromosome
2        Target start
3        Target stop
4        Bases covered in target interval
5        Fraction of target covered
6        # baits that fall completely in the target interval
7        # baits that partially overlap the target interval
8        # bp in target interval not covered
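The per-target values in the table can be derived from bait coordinates alone. The sketch below is one way to compute them and is not the Galaxy workflow's implementation: `target_coverage` is a hypothetical helper, it handles one target at a time on a single chromosome, and it assumes half-open BED-style coordinates. It also returns the uncovered sub-intervals, which are what would be resubmitted for retiling.

```python
def target_coverage(target, baits):
    """target: (start, end); baits: list of (start, end) on the same chromosome.

    Coordinates are half-open (BED style). Returns coverage statistics
    corresponding to columns 4 to 8 of the table, plus the uncovered gaps.
    """
    t_start, t_end = target
    inside = partial = 0
    clipped = []
    for b_start, b_end in baits:
        lo, hi = max(b_start, t_start), min(b_end, t_end)
        if lo >= hi:
            continue                      # bait does not touch the target
        if b_start >= t_start and b_end <= t_end:
            inside += 1
        else:
            partial += 1
        clipped.append((lo, hi))

    # Walk the clipped bait spans in order and collect the gaps between them.
    gaps = []
    cursor = t_start
    for lo, hi in sorted(clipped):
        if lo > cursor:
            gaps.append((cursor, lo))
        cursor = max(cursor, hi)
    if cursor < t_end:
        gaps.append((cursor, t_end))

    length = t_end - t_start
    uncovered = sum(hi - lo for lo, hi in gaps)
    return {
        "covered_bp": length - uncovered,
        "fraction_covered": (length - uncovered) / length,
        "baits_inside": inside,
        "baits_partial": partial,
        "uncovered_bp": uncovered,
        "uncovered_intervals": gaps,
    }

stats = target_coverage((1000, 1500), [(950, 1070), (1060, 1180), (1300, 1420)])
print(stats["uncovered_intervals"])   # [(1180, 1300), (1420, 1500)]
```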
Finding Orphans and Computing GC %
Combined BED/TDT
Will give you:
--total base coverage
--file for baits above 65% GC
--file for orphans
--file for remaining
BLAT
Web-BLAT (via UCSC) will output a homology score for your
input sequence(s) vs the target sequence (genome)
– Useful for target enrichment as a given 120mer may have been rejected
during tiling for repeat overlap, but may have low potential for off-target
hybridization relative to a “perfect match.”
– Homology score is calculated as:
• matches - (mismatches + gaps in query + gaps in target)
• Perfect score is 120 for a bait
Web-BLAT server is limited to batches of 25
Standalone BLAT can be used for frequent or large batches
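Because the web server takes 25 sequences at a time, it helps to write the bait FASTA in batches. A minimal sketch, assuming each bait is available as a (location, sequence) pair taken from the TDT file; the function name and the use of the location as the FASTA header are choices made for this illustration.

```python
def to_fasta_batches(baits, batch_size=25):
    """baits: list of (location, sequence) pairs.

    Returns a list of FASTA-formatted strings, each holding at most
    batch_size records, ready to paste into web-BLAT one at a time.
    """
    batches = []
    for i in range(0, len(baits), batch_size):
        chunk = baits[i:i + batch_size]
        batches.append(
            "".join(f">{location}\n{sequence}\n" for location, sequence in chunk)
        )
    return batches

# Example with 60 placeholder baits: yields batches of 25, 25, and 10 records.
example_baits = [(f"bait{i}", "ACGT" * 30) for i in range(60)]
batches = to_fasta_batches(example_baits)
print([batch.count(">") for batch in batches])   # [25, 25, 10]
```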
BLAT Search Result
No concrete rules, but:
Look for baits with few secondary hits and low secondary scores
Try to avoid long spans (more than ~40 bp) of identity to an off-target locus
[Figure: example BLAT search results, one labeled “Bad” and one labeled “Good”]
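As a rough illustration of this screening, the sketch below computes the homology score as given on the BLAT slide and applies a simple keep/reject rule. Since there are no concrete rules, the two thresholds are placeholders chosen for this example, not recommended values, and the function names are hypothetical. The best-scoring hit is treated as the intended target.

```python
def homology_score(matches, mismatches, query_gaps, target_gaps):
    """Score as given on the slide: matches - (mismatches + gaps in query + gaps in target)."""
    return matches - (mismatches + query_gaps + target_gaps)

def keep_bait(hit_scores, max_secondary_hits=3, max_secondary_score=40):
    """hit_scores: scores of all BLAT hits for one bait.

    The top hit is assumed to be the on-target match; every other hit is
    secondary. Both thresholds are illustrative assumptions only.
    """
    secondary = sorted(hit_scores, reverse=True)[1:]
    return (len(secondary) <= max_secondary_hits
            and all(score <= max_secondary_score for score in secondary))

print(homology_score(120, 0, 0, 0))        # 120, a perfect 120mer match
print(keep_bait([120, 30, 25]))            # True: two weak secondary hits
print(keep_bait([120, 118, 95, 60, 44]))   # False: many strong secondary hits
```

A score threshold does not by itself catch a long uninterrupted span of identity, so the alignment view is still worth checking for borderline baits.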
Preparing Baits for eArray
• From Galaxy download results for:
– High GC group
– Orphan group
– Remaining Non-orphans and mid-GC %
• Add any baits filtered via BLAT
• Upload to eArray, each file as its own bait group
• Create Library:
– Add each bait group to the library
– Replicate the orphan and high GC groups 2X relative to the others
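Before creating the library, it can be useful to check that the bait groups and their replicates fit in a single library. The sketch below assumes that replicated baits count against the 57,680-bait capacity; the earlier remark about leaving "room for replication" suggests this, but the slides do not state it outright. The group sizes are made-up numbers.

```python
MAX_BAITS = 57_680   # single-library capacity quoted earlier in the slides

def library_size(groups):
    """groups: dict of bait group name -> (number of unique baits, replicates)."""
    return sum(count * reps for count, reps in groups.values())

# Hypothetical group sizes; orphans and high GC are replicated 2X vs the others.
groups = {
    "high_gc": (2_000, 2),
    "orphans": (1_500, 2),
    "blat":    (800, 1),
    "rest":    (40_000, 1),
}
total = library_size(groups)
print(total, "bait slots used;", MAX_BAITS - total, "to spare")
```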
Listen to our past recorded webinars and register for upcoming live webinars
Feb 8: Cost-Effective Options using SureSelect System in Exome Sequencing and Targeted Applications (Josh Wang, PhD)
March 1: Bioanalyzer applications for Next Gen Sequencing: Updates and Tips (Charmian Cher, PhD)
March 3: SureSelect Custom Library Design Optimization (David Willmot, PhD)
March 8: Assessing the quantity of index-tagged libraries by QPCR (Charmian Cher, PhD)
March 24: QPCR: Mx Software Training - How to set up an absolute quantitation experiment and a relative quantitation experiment on the MX (Cathy Cutler)
April 14: SureSelect - Designing baits for custom genomes using eArray XD (Owen Hardy)
April 20: Interpretation of Next Gen Sequencing Performance metrics in Agilent Genomic Workbench (David Willmot, PhD)
www.agilent.com/genomics/eSeminarSeries
Thank you for attending! Questions?
For more information:
www.genomics.agilent.com
1-800-227-9770 options x3x4x4
Genomics Workbench:
genomics.agilent.com