String Barcoding Uncovering Optimal Virus Signatures Sam Rash, Dan Gusfield University of California, Davis.

Page 1: String Barcoding

String Barcoding

Uncovering Optimal Virus Signatures

Sam Rash, Dan Gusfield

University of California, Davis.

Page 2: String Barcoding

Motivation

• Need for rapid virus detection
  – Given
    • unknown virus
    • database of known viruses
  – Problem
    • identify unknown virus quickly
  – Ideal solution
    • have sequence of
      – viruses in database
      – unknown virus
    • Solution
      – use BLAST (or any sequence similarity program/algorithm)

Page 3: String Barcoding

Motivation

– Real World
  • only have sequence for pathogens in database
    – not possible to quickly sequence an unknown virus
  • can test for presence of small (<= 50 bp) strings in unknown virus
    – substring tests
– Another Idea
  • String Barcoding
    – use substring tests to uniquely identify each virus in the database
    – acquire unique barcode for each virus in database

Page 4: String Barcoding

Similar Work

• Borneman et al., 2001
  – Work similar to String Barcoding
  – Focused on bacterial size data
    • used a different approach tailored to their needs

Page 5: String Barcoding

Problem Definition

• Formal Definition
  – given
    • set of strings S
  – goal
    • find set of strings S’, the testing set
      – for each pair s1, s2 in S, there exists at least one u in S’ that is a substring of exactly one of the two (wlog, only s1)
      – u is a signature substring
    • minimize |S’|
  – result
    • barcode for each element of S
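The definition can be checked mechanically. The following is a minimal sketch, not the authors' code: `barcode` and `is_testing_set` are hypothetical helper names, and the sample strings are the three used in the worked example later in the slides.

```python
from itertools import combinations

def barcode(s, tests):
    """Return the 0/1 vector recording which substring tests occur in s."""
    return tuple(int(u in s) for u in tests)

def is_testing_set(strings, tests):
    """True if every pair of strings gets a different barcode, i.e. some
    test u is a substring of exactly one string of the pair."""
    codes = [barcode(s, tests) for s in strings]
    return all(a != b for a, b in combinations(codes, 2))

strings = ["cagtgc", "cagttc", "catgga"]
print(is_testing_set(strings, ["tg", "atgga"]))  # True
print(is_testing_set(strings, ["tg"]))           # False: strings 1 and 3 both contain "tg"
```

Two strings are distinguished exactly when their barcodes differ, so the pairwise condition reduces to all barcodes being distinct.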

Page 6: String Barcoding

Problem Complexity

• Complexity
  – unknown if NP-hard when size of any u in S’ is unbounded
  – Max-Length String Barcoding
    • additional parameter k, a maximum length of any u in S’
    • this variant is NP-Hard
    • reduction from Minimum Testing Set (Garey, Johnson, 1979)
    • means all real world uses have to deal with NP-hard result

Page 7: String Barcoding

Implementation

• Basic Idea: Formulate problem as an ILP
  – Enumerate some “useful” set of substrings from S
    • variable in ILP for each substring
  – Constraint for each pair of strings in S
    • means that at least one substring will be chosen to distinguish each pair
  – Objective Function
    • Minimize sum of variables in ILP
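Written out, the formulation above takes roughly the following form. The notation is an assumption introduced here and is not in the slides: C is the enumerated candidate set, x_u is the 0/1 variable for candidate u, and D(s_i, s_j) is the set of candidates that occur in exactly one of s_i and s_j.

```latex
\begin{aligned}
\text{minimize}\quad   & \sum_{u \in C} x_u \\
\text{subject to}\quad & \sum_{u \in D(s_i, s_j)} x_u \ge 1
                         && \text{for every pair } s_i \ne s_j \text{ in } S, \\
                       & x_u \in \{0, 1\}
                         && \text{for every } u \in C.
\end{aligned}
```

The testing set is then S’ = {u in C : x_u = 1}.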

Page 8: String Barcoding

Implementation

– Key point: complexity of ILP primarily a function of the number of variables

– reducing number of candidate substring tests reduces the number of variables in ILP

– how to reduce?
  • Key to our method: suffix trees
  • finds minimum cardinality set of “useful” substrings for use as candidate signature substrings

Page 9: String Barcoding

Implementation: Suffix Trees

• Key Properties of Suffix Tree built for set of strings S
  – tree with character sequences labeling edges
  – nodes labeled with a subset of original string IDs
  – every substring of original input set appears as a root-edge walk exactly once
    • root-node walk is considered root-edge walk into node’s in-edge from parent

Page 10: String Barcoding

Implementation: Suffix Trees

– root-edge walk
  • Creates string
    – appears in exactly the strings that label the node at which it ends
  • 2 root-edge walks ending on the same edge
    – Both strings created by the walks occur in exactly the same set of original strings
    • Can use either string

[Figure: example - a root-edge walk]
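The practical consequence of the root-edge walk property is that candidates can be grouped by the set of input strings they occur in, with one representative kept per group. The sketch below reproduces that grouping by brute-force enumeration of all substrings. It is a naive stand-in for the suffix tree (far slower on long inputs), and `candidates_by_occurrence` is a hypothetical name, not something from the slides.

```python
def candidates_by_occurrence(strings):
    """Map each distinct occurrence set (1-based string IDs) to one
    representative substring, preferring the shortest."""
    substrings = {s[i:j] for s in strings
                  for i in range(len(s))
                  for j in range(i + 1, len(s) + 1)}
    groups = {}
    for u in sorted(substrings, key=lambda x: (len(x), x)):
        ids = frozenset(k for k, s in enumerate(strings, start=1) if u in s)
        groups.setdefault(ids, u)  # keep only the first (shortest) representative
    return groups

groups = candidates_by_occurrence(["cagtgc", "cagttc", "catgga"])
for ids, u in sorted(groups.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(ids), u)
```

For the strings cagtgc, cagttc, and catgga this yields six occurrence sets. The set {1,2,3} distinguishes no pair, which leaves five useful candidates.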

Page 11: String Barcoding

Implementation: Solving

• If two substrings occur in exactly the same set of original strings, only one need be considered
  – Use strings from suffix tree for each uniquely labeled node
• Build ILP as discussed
• Solve ILP using CPLEX
• Acquire barcode and signatures for each original string
  – signature is the set of substring tests occurring in a string

Page 12: String Barcoding

Implementation: Example

• strings:
  1. cagtgc
  2. cagttc
  3. catgga
• Each node in the suffix tree has a corresponding set of string IDs below it

Figure 1.1 - suffix tree for set of strings cagtgc, cagttc, and catgga

  v1 - {1,2,3}    v2 - {1,2,3}    v3 - {3}        v4 - {1}        v5 - {3}
  v6 - {1,2}      v7 - {2}        v8 - {1}        v9 - {1,2,3}    v10 - {1,2,3}
  v11 - {1,2}     v12 - {1}       v13 - {2}       v14 - {3}       v15 - {1,2,3}
  v16 - {2}       v17 - {2}       v18 - {1,3}     v19 - {1}       v20 - {3}
  v21 - {1,2,3}   v22 - {3}       v23 - {2}       v24 - {1,2}     v25 - {1}

Figure 1.2 - table of string labels for each node in suffix tree from figure 1.1

Page 13: String Barcoding

Implementation: Example

  minimize
    V18 + V22 + V11 + V17 + V8          #objective function
  st
    V18 + V22 + V11 + V17 + V8 >= 2     #this is the theoretical minimum
    V18 + V17 + V8 >= 1                 #constraint to cover pair 1,2
    V22 + V11 + V8 >= 1                 #constraint to cover pair 1,3
    V18 + V22 + V11 + V17 >= 1          #constraint to cover pair 2,3
  binaries                              #all variables are 0/1
    V18 V22 V11 V17 V8
  end

Figure 1.3 - ILP constructed for suffix tree in figure 1.1 using no additional constraints (length, etc)

            tg (V18)    atgga (V22)
  cagtgc    1           0
  cagttc    0           0
  catgga    1           1

Figure 1.4 - barcodes

  cagtgc    {“tg”}
  cagttc    {}
  catgga    {“tg”, “atgga”}

Figure 1.5 - signatures

Page 14: String Barcoding

Implementation: Extensions

• minimum and maximum lengths on signature substrings

• acquire barcodes/signatures for only a subset of input strings (with respect to the whole set)

• minimum string edit distance between chosen signature substrings

• redundancy
  – require r signature substrings to differentiate each pair
  – adds a higher level of confidence that signatures remain valid even with mutations
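One way to read the redundancy requirement: a pair of strings is differentiated by r tests exactly when their barcodes differ in at least r positions, so redundancy r amounts to a minimum pairwise Hamming distance of r between barcodes. The sketch below measures that distance for a given testing set. The helper names and the four-test example are illustrative assumptions, not results from the slides.

```python
from itertools import combinations

def barcode(s, tests):
    """Return the 0/1 vector recording which substring tests occur in s."""
    return tuple(int(u in s) for u in tests)

def redundancy(strings, tests):
    """Smallest number of tests separating any pair of strings
    (minimum Hamming distance between barcodes)."""
    codes = [barcode(s, tests) for s in strings]
    return min(sum(x != y for x, y in zip(a, b))
               for a, b in combinations(codes, 2))

strings = ["cagtgc", "cagttc", "catgga"]
print(redundancy(strings, ["tg", "atgga"]))              # 1
print(redundancy(strings, ["tg", "atgga", "gc", "tt"]))  # 2
```

In the ILP, the same requirement would replace the right-hand side ">= 1" of each pair constraint with ">= r".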

Page 15: String Barcoding

Results: Summary

• Works quickly on most moderately sized datasets (especially when redundancy >= 2)
  – dataset properties
    • ~50k virus genomes taken from NCBI (GenBank)
    • 50-150 virus genomes
    • average length of each genome ~1000 characters
    • total input size ranged from approximately 50,000 – 150,000 characters
    • increasing dataset size scaled approximately linearly
  – reach 25% gap (at most 1/3 more than optimum) in just a few minutes
  – reach small gap (often < 1%) in 4 hours
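The "at most 1/3 more than optimum" figure follows from the gap if the gap is measured relative to the current solution, which is an assumption here (it matches how relative MIP gaps are usually reported). Let z be the size of the current testing set, ℓ the solver's lower bound, and z* the optimum. These symbols are introduced for this note only.

```latex
\frac{z - \ell}{z} \le 0.25
\;\Longrightarrow\;
z \le \frac{\ell}{0.75} = \tfrac{4}{3}\,\ell \le \tfrac{4}{3}\,z^{*}
```

So a 25% gap guarantees a testing set no more than one third larger than the smallest possible one.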

Page 16: String Barcoding

Results: Summary

– increasing redundancy greatly decreases run time and % gap at 4 hours in all cases tested

Figure 2.1 - effect of redundancy on avg 25% gap

Figure 2.2 - effect of redundancy on avg gap at 4 hours

Page 17: String Barcoding

Conclusion

• Practical sized testing sets obtained on reasonable sized input datasets
  – testing set consisting of 50 – 270 substring tests on input sets of ~100 genomes
  – works well with reactions that have high number of assays (substring tests) per reaction
    • GeneChip – 400 assays per reaction
• Redundancy
  – Good concept in theory
  – Reduces solution space and hence computation time
  – GeneChip makes higher number of assays needed cost-effective

Page 18: String Barcoding

Future Work

• Expand to work on even larger datasets

• Improve ILP solving
  – use other ILP approximations

• Determine if unconstrained String Barcoding is NP-hard

• More Applications?