An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design Won-Hyong Chung and Seong-Bae Park Dept. of Computer Engineering Kyungpook National University, South Korea


Page 1: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

Won-Hyong Chung and Seong-Bae Park

Dept. of Computer Engineering, Kyungpook National University, South Korea

Page 2: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

Motivation

• Issues for designing oligonucleotides

– To minimize cross-hybridization
– To minimize computing time

• Seeding (or indexing) has been widely used to address these issues by pre-screening unreliable sequence regions before cross-hybridization is calculated.

• Although many seeding methods have been proposed, no measure has yet been proposed for evaluating how adequate and efficient a seed is for oligonucleotide design.

Page 3: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

Difference between alignment and oligonucleotide design

• Alignment

– To find all possible alignments that score highly enough.
– Sensitivity is important, while specificity is usually guaranteed by the seed's own specificity.

• Oligonucleotide design

– To find optimal oligonucleotides that differentiate target sequences from the others.
– Specificity should be considered as well as sensitivity when checking for cross-hybridization.

Page 4: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

Objectives

• We propose novel measures for evaluating seeds based on discriminability and efficiency.

• We examine five seeding methods in oligonucleotide design.

– Continuous, spaced, transition-constrained, BLAT, and Vector seeds

• We provide a software package, SeedChooser, which enables users to obtain adequate seeds under their own experimental conditions.

Page 5: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

What is a Seed?

• Seeding process

– Filtering step: short fixed-length words that occur in both the query and the target sequences are selected.
– Extension step: the selected words are extended to the size of an oligonucleotide and checked for cross-hybridization.

Seed = the filtering template of the fixed-length words
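
The filtering step can be illustrated with a small Python sketch. It assumes the simplest case, a continuous seed of length k, and the function names `index_words` and `common_word_hits` are hypothetical, not part of SeedChooser.

```python
from collections import defaultdict

def index_words(seq, k):
    """Map every k-length word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def common_word_hits(query, target, k):
    """Return (query_pos, target_pos) pairs where the same k-length word occurs."""
    target_index = index_words(target, k)
    hits = []
    for q in range(len(query) - k + 1):
        # .get() avoids creating empty entries in the defaultdict
        for t in target_index.get(query[q:q + k], []):
            hits.append((q, t))
    return hits

hits = common_word_hits("ACGTACGT", "TTACGTAA", 4)
```

Each returned pair marks a shared word that the extension step would then grow to oligonucleotide size and check for cross-hybridization.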

Page 6: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

Seeding methods (1/2)

• Continuous seed: a seed that finds exact matches of length k

– BLAST employs an 11-bp seed, 11111111111.

• Spaced seed: allows a don't-care letter, labeled '0', in the seed

– PatternHunter uses the 18-bp seed 101101100111001011, which contains 11 match positions.

• Transition-constrained seed: adopts a transition (A <-> G, C <-> T) letter, '@', in the seed

– YASS used the seed 1110@10010@1010111, which is 18 bp long with 10 match positions and 2 transition positions.
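
As a rough illustration of how such templates behave, the following Python sketch checks whether two equal-length words agree under a seed written with the letters '1', '0', and '@'. The function names `seed_weight` and `seed_hit` are hypothetical, and this is not the matching code of BLAST, PatternHunter, or YASS.

```python
# Transition pairs: purine <-> purine (A, G) and pyrimidine <-> pyrimidine (C, T)
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def seed_weight(seed):
    """Number of positions that must match exactly (the '1' letters)."""
    return seed.count("1")

def seed_hit(seed, word_a, word_b):
    """True if two words of the seed's length agree under the seed template."""
    if not (len(seed) == len(word_a) == len(word_b)):
        raise ValueError("seed and words must have the same length")
    for s, a, b in zip(seed, word_a, word_b):
        if s == "1" and a != b:
            return False  # required match position differs
        if s == "@" and a != b and (a, b) not in TRANSITIONS:
            return False  # only an exact match or a transition is allowed
        # s == "0" is a don't-care position, so anything is accepted
    return True
```

For example, `seed_weight("1110@10010@1010111")` counts the 10 match positions of the seed quoted above, while the two '@' positions accept either a match or a transition.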

Page 7: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

Seeding methods (2/2)

• BLAT seed: a continuous seed that allows one or two mismatches at any position of the seed.

• Vector seed: a generalized seed that combines the ideas of the BLAT seed and the spaced seed.

• BLAT seed and Vector seed allow some mismatches at any position.

– They greatly increase sensitivity but spend much more computing time than the previous seeds.
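
A simplified Python sketch of mismatch-tolerant seeds follows. It assumes a seed hit means "at most a given number of mismatches at the '1' positions"; the published BLAT and Vector seed definitions are more detailed, and the names `blat_like_hit` and `vector_like_hit` are hypothetical.

```python
def mismatches(seed, word_a, word_b):
    """Count mismatches at the '1' positions of a 0/1 seed template."""
    return sum(1 for s, a, b in zip(seed, word_a, word_b) if s == "1" and a != b)

def blat_like_hit(word_a, word_b, max_mismatches=1):
    """Continuous seed that tolerates a few mismatches anywhere in the word."""
    if len(word_a) != len(word_b):
        return False
    return mismatches("1" * len(word_a), word_a, word_b) <= max_mismatches

def vector_like_hit(seed, word_a, word_b, max_mismatches=1):
    """Spaced seed ('0' = don't care) that also tolerates a few mismatches."""
    if not (len(seed) == len(word_a) == len(word_b)):
        return False
    return mismatches(seed, word_a, word_b) <= max_mismatches
```

Because every word must be compared against near matches as well as exact ones, this kind of seed accepts more candidate regions, which is consistent with the higher sensitivity and higher computing cost noted above.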

Page 8: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

The Issues of seeds for oligo design

• An ideal seed should filter out, as fast as possible, all regions that have no possibility of being chosen as an oligo.

– A seed should find as many oligos as possible.
– A seed should avoid finding non-oligo regions.
– A seed should minimize the cost of indexing to find oligos.

Page 9: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

Discriminability

The discriminability is a balance between precision and recall to minimize both false positives and false negatives.

$$P = \frac{\#\text{ of seed indices that hit oligos}}{\#\text{ of seed indices}}$$

$$R = \frac{\#\text{ of oligos containing seed hit(s)}}{\#\text{ of oligos}}$$
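
A minimal Python sketch of how the two ratios might be counted, assuming seed indices are stored as start positions and oligos as half-open (start, end) intervals; both representations and all function names are assumptions. The slide text does not show how P and R are combined into the discriminability, so `weighted_f` below is only a stand-in that uses the standard weighted F-measure with a weight alpha, not necessarily the authors' formula.

```python
def precision_recall(seed_indices, oligos, seed_span):
    """Return (P, R) for seed hit positions against oligo intervals."""
    def inside(idx, oligo):
        start, end = oligo
        return start <= idx and idx + seed_span <= end

    # Seed indices that fall entirely inside at least one oligo
    hitting = [i for i in seed_indices if any(inside(i, o) for o in oligos)]
    # Oligos that contain at least one seed index
    covered = [o for o in oligos if any(inside(i, o) for i in seed_indices)]
    p = len(hitting) / len(seed_indices) if seed_indices else 0.0
    r = len(covered) / len(oligos) if oligos else 0.0
    return p, r

def weighted_f(p, r, alpha=1.0):
    """ASSUMED stand-in: weighted harmonic mean of precision and recall."""
    denom = alpha ** 2 * p + r
    return (1 + alpha ** 2) * p * r / denom if denom else 0.0

p, r = precision_recall([0, 10, 40, 90], [(0, 50), (60, 110), (120, 170)], 11)
score = weighted_f(p, r, alpha=1.0)
```

In this toy input, three of the four seed indices fall inside an oligo and two of the three oligos contain a seed index.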

Page 10: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

Efficiency

The efficiency is the proportion of useful regions filtered by a seed.

– The duplication ratio of generated indices

$$D = \frac{\#\text{ of generated seed indices}}{\#\text{ of unique seed indices}}$$

– The average number of indices in each oligo

$$A = \frac{\#\text{ of seed indices in oligos}}{\#\text{ of oligos}}$$
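
The two counts behind the efficiency can be sketched in Python as follows, assuming the generated seed indices are available as a list of indexed words and the per-oligo index counts as a list of integers. The function names are hypothetical, and the way D and A are combined with β and γ is not shown in the slide text, so it is not reproduced here.

```python
def duplication_ratio(seed_words):
    """D: number of generated seed indices divided by number of unique ones."""
    return len(seed_words) / len(set(seed_words)) if seed_words else 0.0

def average_indices_per_oligo(indices_per_oligo):
    """A: total seed indices found in oligos divided by the number of oligos."""
    if not indices_per_oligo:
        return 0.0
    return sum(indices_per_oligo) / len(indices_per_oligo)

d = duplication_ratio(["ACG", "ACG", "CGT", "GTA"])  # 4 generated, 3 unique
a = average_indices_per_oligo([3, 0, 1])             # 4 indices over 3 oligos
```

A value of D close to 1 means few duplicated indices were generated.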

Page 11: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

Efficient discriminability

The efficient discriminative seed is the seed that has the maximum efficient discriminability value for the given

Page 12: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

Experiments

• Empirically chosen seeds were evaluated by three measures: discriminability, efficiency, and efficient discriminability.

• We tested the seeds for designing 50mer oligos.

– The parameters (α, β, and γ) are set to 1 for evaluation.

• Simulated data set

– A set of random sequences generated by OligoGenerator in SeedChooser.

• Biological data set

– Ecologically important genes involved in the nitrogen and carbon cycles.
– nirS: nitrite reductase gene set
– pmoA: methane monooxygenase gene set

Page 13: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

Discriminability of the five seeding methods

[Figure: discriminability (y-axis, 0.2 to 1.0) plotted against seed weight (x-axis, 5 to 30) for the Continuous, Spaced, Transition, BLAT, and Vector seeds.]

Page 14: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

Efficiency of the five seeding methods

[Figure: efficiency (y-axis, 0.06 to 0.18) plotted against seed weight (x-axis, 5 to 30) for the Continuous, Spaced, Transition, BLAT, and Vector seeds.]

Page 15: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

Efficient Discriminability of the five seeding methods

[Figure: efficient discriminability (y-axis, 0.02 to 0.12) plotted against seed weight (x-axis, 5 to 30) for the Continuous, Spaced, Transition, BLAT, and Vector seeds.]

Page 16: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

Evaluation results of pmoA data set

Page 17: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

Evaluation results of nirS data set

Page 18: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

SeedChooser: Seed Evaluation and Recommendation Tools

• SeedChooser: recommends the best seeds according to the evaluation parameters. It uses a genetic algorithm to find the best seeds.

• SeedEvaluator: evaluates a set of seeds according to the parameters.

• OligoGenerator: generates a set of oligos for the desired experimental conditions.

• SeedChooser homepage: http://ml.knu.ac.kr/~whchung/seedchooser.html
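
The slides say only that SeedChooser uses a genetic algorithm, without describing it. The Python sketch below is a generic genetic search over 0/1 seed templates, offered purely as an illustration: the function `evolve_seed`, its parameters, and the `toy_fitness` scoring function are all hypothetical and do not reflect SeedChooser's actual implementation or its evaluation measure.

```python
import random

def evolve_seed(fitness, length=12, pop_size=20, generations=30,
                mutation_rate=0.05, rng=None):
    """Search for a high-fitness 0/1 seed template with a simple genetic algorithm."""
    rng = rng or random.Random(0)

    def random_seed():
        return "".join(rng.choice("01") for _ in range(length))

    def mutate(seed):
        # Flip each letter independently with probability mutation_rate
        return "".join(("1" if c == "0" else "0") if rng.random() < mutation_rate else c
                       for c in seed)

    def crossover(a, b):
        cut = rng.randrange(1, length)  # single-point crossover
        return a[:cut] + b[cut:]

    population = [random_seed() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]  # keep the better half unchanged
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)

def toy_fitness(seed):
    """Hypothetical score: prefer templates with exactly 8 match positions."""
    return -abs(seed.count("1") - 8)

best = evolve_seed(toy_fitness, rng=random.Random(1))
```

In a real setting the fitness function would be replaced by a measure computed on the user's own sequences, such as the evaluation measures described in this presentation.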

Page 19: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

CONCLUSION

• We proposed a novel measure for evaluating seeds in oligo design, based on discriminability and efficiency.

• The spaced seed was generally preferred to the other seeding methods.

• Our study can be applied to oligo design programs to improve their performance by suggesting experiment-specific seeds.

• We expect that our study will also be helpful for other genomic tasks.

Page 20: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

Supplementary materials

Page 21: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

• T1, T2, T3: the target sequences.

• P1 and P2 are the matched oligos for an oligo P0.

• S1, S2 and S3 are the seed indices for S0 by a seed.

[Figure: schematic with the labels T0, T1, T2, T3, P0, P1, P2, S0, S1, S2, and S3.]

Page 22: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

Relations of precision, recall and discriminability

[Figure: precision, recall, and discriminability (y-axis, 0.2 to 1.2) plotted against seed weight (x-axis, 6 to 26).]

Page 23: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

Discriminability according to values of α

[Figure: discriminability (y-axis, 0.2 to 1.2) plotted against seed weight (x-axis, 6 to 26), with one curve for each α value: 8, 4, 2, 1, 1/2, 1/4, and 1/8.]

Page 24: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

Efficiency according to values of β and γ

[Figure: efficiency (0.0 to 1.2) plotted as a function of Beta and Gamma, each with axis ticks up to 1.0.]

Page 25: An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design

Efficient Discriminability for 70mer Oligos

[Figure: efficient discriminability (y-axis, 0.00 to 0.10) plotted against seed weight (x-axis, 5 to 30) for the Continuous, Spaced, Transition, BLAT, and Vector seeds.]