String Barcoding


Uncovering Optimal Virus Signatures

Sam Rash, Dan Gusfield

University of California, Davis.

Motivation

• Need for rapid virus detection
  – Given
    • an unknown virus
    • a database of known viruses
  – Problem
    • identify the unknown virus quickly
  – Ideal solution
    • have the sequence of
      – the viruses in the database
      – the unknown virus
    • Solution
      – use BLAST (or any sequence similarity program/algorithm)

Motivation

– Real world
  • only have sequences for the pathogens in the database
    – not possible to quickly sequence an unknown virus
  • can test for the presence of short (<= 50 bp) strings in the unknown virus
    – substring tests
– Another idea
  • String Barcoding
    – use substring tests to uniquely identify each virus in the database
    – acquire a unique barcode for each virus in the database

Similar Work

• Borneman et al., 2001
  – Work similar to String Barcoding
  – Focused on bacterial size data
    • used a different approach tailored to their needs

Problem Definition

• Formal definition
  – Given
    • a set of strings S
  – Goal
    • find a set of strings S’, the testing set
      – for each pair s1, s2 in S, there exists at least one u in S’ that is a substring of exactly one of the two (wlog, of s1 only)
      – u is a signature substring
    • minimize |S’|
  – Result
    • a barcode for each element of S: a 0/1 entry per u in S’ recording whether u occurs in that string
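The definition above can be checked mechanically. The following is a minimal Python sketch, not the authors' implementation; the helper names `barcode` and `is_testing_set` are hypothetical, and the strings and tests are the ones used in the worked example later in the deck.

```python
from itertools import combinations

def barcode(s, tests):
    """Barcode of s: one 0/1 entry per substring test (hypothetical helper)."""
    return tuple(int(u in s) for u in tests)

def is_testing_set(strings, tests):
    """True if every pair of strings receives a different barcode."""
    codes = [barcode(s, tests) for s in strings]
    return all(a != b for a, b in combinations(codes, 2))

strings = ["cagtgc", "cagttc", "catgga"]
tests = ["tg", "atgga"]
print([barcode(s, tests) for s in strings])  # [(1, 0), (0, 0), (1, 1)]
print(is_testing_set(strings, tests))        # True
```

Two barcodes differ exactly when some test occurs in one string but not the other, so distinct barcodes for every pair is the same condition as the pairwise requirement in the definition.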

Problem Complexity

• Complexity
  – unknown whether the problem is NP-hard when the length of any u in S’ is unbounded
  – Max-Length String Barcoding
    • additional parameter k, the maximum length of any u in S’
    • this variant is NP-hard
    • reduction from Minimum Testing Set (Garey and Johnson, 1979)
    • means all real-world uses have to deal with an NP-hard problem

Implementation

• Basic idea: formulate the problem as an ILP
  – Enumerate some “useful” set of substrings from S
    • one variable in the ILP for each substring
  – One constraint for each pair of strings in S
    • ensures that at least one substring is chosen to distinguish each pair
  – Objective function
    • minimize the sum of the variables in the ILP
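To make the formulation concrete, here is a small Python sketch under stated assumptions. It enumerates every substring as a candidate, builds one "must pick at least one of these" set per pair of strings, and then finds a minimum selection by exhaustive search. The exhaustive search is a stand-in for an ILP solver and is only practical for toy inputs; all function names are hypothetical.

```python
from itertools import combinations

def all_substrings(strings, max_len=None):
    """Every distinct substring of the inputs, optionally capped at max_len characters."""
    subs = set()
    for s in strings:
        for i in range(len(s)):
            stop = len(s) if max_len is None else min(len(s), i + max_len)
            for j in range(i + 1, stop + 1):
                subs.add(s[i:j])
    return sorted(subs)

def pair_constraints(strings, candidates):
    """For each pair (i, j): the candidates that occur in exactly one of the two strings."""
    cons = {}
    for i, j in combinations(range(len(strings)), 2):
        cons[(i, j)] = {u for u in candidates
                        if (u in strings[i]) != (u in strings[j])}
    return cons

def min_testing_set(strings, candidates):
    """Smallest subset of candidates meeting every pair constraint (brute force, not an ILP)."""
    cons = pair_constraints(strings, candidates)
    for size in range(len(candidates) + 1):
        for chosen in combinations(candidates, size):
            picked = set(chosen)
            if all(picked & allowed for allowed in cons.values()):
                return list(chosen)
    return None  # some pair cannot be separated (e.g. duplicate strings)

strings = ["cagtgc", "cagttc", "catgga"]
best = min_testing_set(strings, all_substrings(strings))
print(len(best), best)
```

Each set returned by `pair_constraints` corresponds to one ILP constraint (the sum of those variables must be at least 1), and the search over increasing `size` plays the role of the objective that minimizes the number of selected variables.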

Implementation

– Key point: the complexity of the ILP is primarily a function of the number of variables
– reducing the number of candidate substring tests reduces the number of variables in the ILP
– how to reduce?
  • Key to our method: suffix trees
  • find a minimum-cardinality set of “useful” substrings for use as candidate signature substrings

Implementation: Suffix Trees

• Key properties of a suffix tree built for a set of strings S
  – tree with character sequences labeling its edges
  – nodes labeled with a subset of the original string IDs
  – every substring of the original input set appears as a root-edge walk exactly once
    • a walk from the root to a node counts as a root-edge walk ending on that node’s in-edge from its parent

Implementation: Suffix Trees

– root-edge walk
  • creates a string that appears in exactly the strings labeling the node at which the walk ends
  • 2 root-edge walks ending on the same edge
    – both strings created by the walks occur in exactly the same set of original strings
    – can use either string

[Figure: example of a root-edge walk in a suffix tree]

Implementation: Solving

• If two substrings occur in exactly the same set of original strings, only one need be considered
  – use one string from the suffix tree for each uniquely labeled node
• Build the ILP as discussed
• Solve the ILP using CPLEX
• Acquire a barcode and a signature for each original string
  – a signature is the set of substring tests occurring in a string
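The reduction to one candidate per occurrence set can be illustrated without building a suffix tree. The sketch below is an assumption-laden stand-in: it enumerates substrings naively (quadratic per string, so toy inputs only) and keeps one representative per distinct set of string IDs, which is the grouping that the uniquely labeled suffix tree nodes provide. The function name `occurrence_classes` and the tie-breaking rule (shortest, then alphabetical) are hypothetical choices.

```python
def occurrence_classes(strings):
    """Map each distinct occurrence set (1-based string IDs) to one representative substring."""
    reps = {}
    for s in strings:
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                u = s[i:j]
                ids = frozenset(k for k, t in enumerate(strings, start=1) if u in t)
                # Keep the shortest (then alphabetically first) substring per occurrence set.
                if ids not in reps or (len(u), u) < (len(reps[ids]), reps[ids]):
                    reps[ids] = u
    return reps

strings = ["cagtgc", "cagttc", "catgga"]
reps = occurrence_classes(strings)
# A substring occurring in every string separates no pair, so it is not a useful candidate.
useful = {ids: u for ids, u in reps.items() if len(ids) < len(strings)}
for ids, u in sorted(useful.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(ids), u)
```

For the three example strings this leaves five useful occurrence sets, which matches the five variables in the ILP of Figure 1.3.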

Implementation: Example

• strings:
  1. cagtgc
  2. cagttc
  3. catgga
• Each node in the suffix tree has a corresponding set of string IDs below it

Figure 1.1 - suffix tree for the set of strings cagtgc, cagttc, and catgga

v1  - {1,2,3}   v2  - {1,2,3}   v3  - {3}       v4  - {1}       v5  - {3}
v6  - {1,2}     v7  - {2}       v8  - {1}       v9  - {1,2,3}   v10 - {1,2,3}
v11 - {1,2}     v12 - {1}       v13 - {2}       v14 - {3}       v15 - {1,2,3}
v16 - {2}       v17 - {2}       v18 - {1,3}     v19 - {1}       v20 - {3}
v21 - {1,2,3}   v22 - {3}       v23 - {2}       v24 - {1,2}     v25 - {1}

Figure 1.2 - table of string labels for each node in the suffix tree from figure 1.1

Implementation: Example

minimize
V18 + V22 + V11 + V17 + V8          #objective function
st
V18 + V22 + V11 + V17 + V8 >= 2     #this is the theoretical minimum
V18 + V17 + V8 >= 1                 #constraint to cover pair 1,2
V22 + V11 + V8 >= 1                 #constraint to cover pair 1,3
V18 + V22 + V11 + V17 >= 1          #constraint to cover pair 2,3
binaries                            #all variables are 0/1
V18 V22 V11 V17 V8
end

Figure 1.3 - ILP constructed for suffix tree in figure 1.1 using no additional constraints (length, etc)
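As an illustration of how such an ILP can be generated from the node labels, the Python sketch below rebuilds the rows of Figure 1.3 from the occurrence sets of the five candidate variables. The function name `build_lp` is hypothetical, and the output only mirrors the layout of the figure; it is not claimed to be the authors' generator. The bound in the first constraint follows from the fact that m binary tests can produce at most 2^m distinct barcodes, so at least ceil(log2 n) tests are needed for n strings.

```python
from itertools import combinations
from math import ceil, log2

def build_lp(occ, n):
    """occ maps a variable name to the set of string IDs containing its substring; n = |S|."""
    names = list(occ)
    total = " + ".join(names)
    lines = ["minimize", total, "st",
             f"{total} >= {ceil(log2(n))}"]  # lower bound: 2^m barcodes must cover n strings
    for i, j in combinations(range(1, n + 1), 2):
        # A variable helps separate pair (i, j) if its substring is in exactly one of them.
        row = [v for v in names if (i in occ[v]) != (j in occ[v])]
        lines.append(" + ".join(row) + " >= 1")
    lines += ["binaries", " ".join(names), "end"]
    return "\n".join(lines)

# Occurrence sets taken from the node table in Figure 1.2.
occ = {"V18": {1, 3}, "V22": {3}, "V11": {1, 2}, "V17": {2}, "V8": {1}}
print(build_lp(occ, 3))
```

To require redundancy r, the right-hand side of each pair constraint would become r instead of 1.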

         tg (V18)   atgga (V22)
cagtgc      1           0
cagttc      0           0
catgga      1           1

Figure 1.4 - barcodes

cagtgc   {“tg”}
cagttc   {}
catgga   {“tg”, “atgga”}

Figure 1.5 - signatures

Implementation: Extensions

• minimum and maximum lengths on signature substrings

• acquire barcodes/signatures for only a subset of the input strings (with respect to the whole set)

• minimum string edit distance between chosen signature substrings

• redundancy
  – require r signature substrings to differentiate each pair
  – adds a higher level of confidence that signatures remain valid even with mutations
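Two of these extensions are easy to sketch in Python. The helpers below are hypothetical illustrations, not the authors' code: `filter_by_length` restricts the candidate substrings to a length window (the default cap of 50 echoes the 50 bp limit mentioned earlier), and `redundancy` reports the smallest number of chosen tests that separate any pair, which is the value r that a given testing set actually achieves.

```python
from itertools import combinations

def filter_by_length(candidates, min_len=1, max_len=50):
    """Keep only candidates whose length lies in [min_len, max_len]."""
    return [u for u in candidates if min_len <= len(u) <= max_len]

def redundancy(strings, tests):
    """Smallest number of tests separating any pair of strings (needs at least two strings)."""
    return min(sum((u in a) != (u in b) for u in tests)
               for a, b in combinations(strings, 2))

strings = ["cagtgc", "cagttc", "catgga"]
print(redundancy(strings, ["tg", "atgga"]))              # 1
print(redundancy(strings, ["tg", "atgga", "gc", "tt"]))  # 2
print(filter_by_length(["t", "tg", "atgga"], min_len=2, max_len=4))  # ['tg']
```

With redundancy 2, any single test can fail, for instance because a mutation destroys one signature substring, and every pair of strings is still told apart by at least one remaining test.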

Results: Summary

• Works quickly on most moderately sized datasets (especially when redundancy >= 2)
  – dataset properties
    • ~50k virus genomes taken from NCBI (GenBank)
    • 50-150 virus genomes per dataset
    • average length of each genome ~1000 characters
    • total input size ranged from approximately 50,000 to 150,000 characters
    • scaling with dataset size was approximately linear
  – reaches a 25% gap (at most 1/3 more than the optimum) in just a few minutes
  – reaches a small gap (often < 1%) in 4 hours
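The "25% gap means at most 1/3 more than the optimum" arithmetic can be spelled out in a few lines of Python. This sketch assumes the gap is measured relative to the best solution found, as (best found − lower bound) / best found; that definition is an assumption, chosen because it reproduces the 1/3 figure quoted above. Both function names are hypothetical.

```python
def relative_gap(best_found, lower_bound):
    """Gap between the best solution found and the solver's lower bound, relative to the former."""
    return (best_found - lower_bound) / best_found

def max_excess_over_optimum(gap):
    """Worst-case fraction by which the best solution found can exceed the true optimum."""
    # lower_bound = best_found * (1 - gap) and optimum >= lower_bound,
    # so best_found / optimum <= 1 / (1 - gap).
    return gap / (1.0 - gap)

print(relative_gap(80, 60))           # 0.25
print(max_excess_over_optimum(0.25))  # about 0.333, i.e. at most 1/3 above the optimum
```

For example, a testing set of 80 substrings with a proven lower bound of 60 has a 25% gap, and 80 is exactly 1/3 more than 60.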

Results: Summary

– increasing redundancy greatly decreases run time and % gap at 4 hours in all cases tested

Figure 2.1 - effect of redundancy on avg 25% gap

Figure 2.2 - effect of redundancy on avg gap at 4 hours

Conclusion

• Practical-sized testing sets obtained on reasonably sized input datasets
  – testing sets consisting of 50 to 270 substring tests on input sets of ~100 genomes
  – works well with reactions that have a high number of assays (substring tests) per reaction
    • GeneChip: 400 assays per reaction
• Redundancy
  – Good concept in theory
  – Reduces solution space and hence computation time
  – GeneChip makes the higher number of assays needed cost-effective

Future Work

• Expand to work on even larger datasets

• Improve ILP solving
  – use other ILP approximations

• Determine if unconstrained String Barcoding is NP-hard

• More Applications?
