Page 1: Memory-aware BWT by Segmenting Sequences

Memory-aware BWT by Segmenting Sequences

presented by Jiaying Wang

April 12, 2012

Northeastern University, China

The 14th Asia-Pacific Web Conference (APWeb)

Page 2: Memory-aware BWT by Segmenting Sequences

Motivation

• Most interesting massive data sets contain string data (web data, record data, genome data, etc.)

• BWT as a full text index provides fast substring search over large text collections

• Building the BWT has an enormous memory cost: n log n + n log σ bits

Page 3: Memory-aware BWT by Segmenting Sequences

Preliminaries

• Text: T[0..n − 1], T[i] ∈ Σ, |Σ| = σ
• We add a $ to the end of the text; $ does not belong to Σ
• T[i..j] is the sequence starting at position i and ending at position j
  – empty string iff i > j
  – prefix iff i = 0
  – suffix iff j = n − 1

Page 4: Memory-aware BWT by Segmenting Sequences

Problem definition

• Let T[0..n−1] be a text and P[0..m−1] be a query. The subsequence matching problem is to find all start positions of occurrences of P in T, i.e. {i | 0 ≤ i ≤ n − m, T[i..i+m−1] = P[0..m−1]}.

• We take the memory cost into account.

• The process should guarantee query efficiency and low memory cost at the same time.

Page 5: Memory-aware BWT by Segmenting Sequences

BWT transformation

text: mississippi$

Sorted rotations (F is the first character of each row, L the last; the LF mapping links each character in L to its position in F):

SA   sorted rotation   L
11   $mississippi      i
10   i$mississipp      p
 7   ippi$mississ      s
 4   issippi$miss      s
 1   ississippi$m      m
 0   mississippi$      $
 9   pi$mississip      p
 8   ppi$mississi      i
 6   sippi$missis      s
 3   sissippi$mis      s
 5   ssippi$missi      i
 2   ssissippi$mi      i

bwt: ipssm$pissii
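The transformation on this slide can be reproduced with a few lines of Python. This is a minimal sketch that sorts all suffixes naively, which is fine for a toy string but is not how a large index would be built; the function names `suffix_array` and `bwt_from_sa` are illustrative and do not come from the slides.

```python
def suffix_array(text):
    """Return the suffix array by sorting all suffixes (naive, for small inputs only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    """BWT[i] is the character just before suffix sa[i], wrapping around at position 0."""
    return "".join(text[i - 1] for i in sa)


text = "mississippi" + "$"  # '$' sorts before every letter in ASCII
sa = suffix_array(text)
bwt = bwt_from_sa(text, sa)
print(sa)   # [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
print(bwt)  # ipssm$pissii
```

Holding `text` and `sa` in memory at the same time is exactly the cost the talk sets out to reduce.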

Page 6: Memory-aware BWT by Segmenting Sequences

Backward search on BWT

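l ← 0, h ← bwt.length
For i from pat.length − 1 down to 0
    k = pat[i]
    l = C[k] + occ(k, l)
    h = C[k] + occ(k, h)
Return h − l

(C[k] is the number of characters in the text smaller than k; occ(k, x) is the number of occurrences of k in bwt[0..x−1].)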

Example: searching "ssi"

(Figure: the sorted rotations of mississippi$ with their F and L columns; the matching range narrows with each pattern character, read from right to left.)
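The backward search above can be sketched in runnable form as follows. The table layout (`C` as a dictionary, `occ` as full prefix-count lists) is an assumption chosen for readability; a practical index would sample or compress these tables. The helper names are illustrative.

```python
def build_index(bwt):
    """Build C (number of smaller characters) and occ (prefix counts) for a BWT string."""
    alphabet = sorted(set(bwt))
    C, total = {}, 0
    for ch in alphabet:
        C[ch] = total
        total += bwt.count(ch)
    # occ[ch][x] = number of occurrences of ch in bwt[0:x]
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for x, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][x + 1] = occ[ch][x] + (1 if b == ch else 0)
    return C, occ


def backward_search(bwt, C, occ, pat):
    """Return the half-open suffix-array range [l, h) of suffixes starting with pat."""
    l, h = 0, len(bwt)
    for k in reversed(pat):
        if k not in C:
            return 0, 0          # character never occurs in the text
        l = C[k] + occ[k][l]
        h = C[k] + occ[k][h]
        if l >= h:
            return 0, 0          # range became empty: no occurrence
    return l, h


bwt = "ipssm$pissii"
sa = [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
C, occ = build_index(bwt)
l, h = backward_search(bwt, C, occ, "ssi")
print(h - l)            # 2
print(sorted(sa[l:h]))  # [2, 5]
```

The count `h - l` needs only the BWT tables; reporting the positions additionally needs the suffix array (or a sample of it).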

Page 7: Memory-aware BWT by Segmenting Sequences

Memory cost analysis

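• Building the BWT has an enormous memory cost.
• n log n + n log σ bits, about 5n bytes (1 GB of text needs about 5 GB).
• For example: mississippi
  – text: mississippi$ (12 characters, 12 bytes)
  – SA: 11 10 7 4 1 0 9 8 6 3 5 2 (12 × 4 bytes)
  – bwt: ipssm$pissii
  – total: 12 × 4 + 12 = 12 × 5 bytes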

Page 8: Memory-aware BWT by Segmenting Sequences

Our idea(1/2)

mississippi

Split into disjoint segments: missis | sippi

Search ssi: loading one segment at a time saves memory.

How do we find an occurrence that is split across two segments?

Page 9: Memory-aware BWT by Segmenting Sequences

Our idea(2/2)

mississippi

mississi issippi

search ssi

Oops, we find another one
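The two ideas can be compared on the slide's own example. In this sketch a plain string scan (`find_all`, a hypothetical helper) stands in for the per-segment BWT search, since only the effect of the segmentation is being illustrated.

```python
def find_all(text, pattern):
    """Return every start position of pattern in text (plain scan, stand-in for an index)."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def search_segments(segments, pattern):
    """Search each (offset, segment) pair on its own and report global start positions."""
    return [offset + pos for offset, seg in segments for pos in find_all(seg, pattern)]


text = "mississippi"
disjoint = [(0, text[0:6]), (6, text[6:11])]    # "missis", "sippi"
overlapped = [(0, text[0:8]), (4, text[4:11])]  # "mississi", "issippi"

print(find_all(text, "ssi"))               # [2, 5]
print(search_segments(disjoint, "ssi"))    # [2]
print(search_segments(overlapped, "ssi"))  # [2, 5, 5]
```

Disjoint segments miss the occurrence at position 5, which crosses the cut, while overlapped segments find it but report it twice.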

Page 10: Memory-aware BWT by Segmenting Sequences

BWT on Overlapped Segments

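(Figure: the text T is split into overlapping segments T1, T2, …, Tk, with segment length L and overlap length l; a BWT is built on each segment separately, giving BWT1, BWT2, …, BWTk.)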

Page 11: Memory-aware BWT by Segmenting Sequences

Searching cases

• Prerequisite: query length m ≤ l

• For the second case, we have to remove duplicate results

Page 12: Memory-aware BWT by Segmenting Sequences

Filtering method

Filter interval f = l - m

All the occurrences starting at positions in a filter interval should be filtered.
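One way to realize such a filter is sketched below. It assumes that l is the overlap length between neighbouring segments, that all segments except possibly the last have the same length, and that an occurrence found in a later segment is dropped when it lies entirely inside the overlap with the previous segment (local start ≤ l − m). The names `seg_len`, `overlap`, `overlapped_segments` and `search_overlapped` are illustrative, a plain scan stands in for the per-segment BWT search, and the exact interval convention on the slide may differ by one.

```python
def find_all(text, pattern):
    """Return every start position of pattern in text (stand-in for a per-segment BWT search)."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def overlapped_segments(text, seg_len, overlap):
    """Cut text into (offset, segment) pairs; neighbours share `overlap` characters."""
    stride = seg_len - overlap
    segments, offset = [], 0
    while True:
        segments.append((offset, text[offset:offset + seg_len]))
        if offset + seg_len >= len(text):
            break
        offset += stride
    return segments


def search_overlapped(text, pattern, seg_len, overlap):
    """Report each occurrence once, filtering hits already covered by the previous segment."""
    m = len(pattern)
    if not 0 < overlap < seg_len:
        raise ValueError("need 0 < overlap < seg_len")
    if m > overlap:
        raise ValueError("query must not be longer than the overlap")
    f = overlap - m  # assumed filter interval: local starts 0..f are duplicates
    results = []
    for index, (offset, seg) in enumerate(overlapped_segments(text, seg_len, overlap)):
        for pos in find_all(seg, pattern):
            if index > 0 and pos <= f:
                continue  # the previous segment already contains this occurrence
            results.append(offset + pos)
    return results


print(search_overlapped("mississippi", "ssi", seg_len=8, overlap=4))  # [2, 5]
```

Because every duplicate is recognized from its local start position alone, no result set has to be kept in memory for deduplication.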


Page 13: Memory-aware BWT by Segmenting Sequences

Searching algorithm

Page 14: Memory-aware BWT by Segmenting Sequences

BWT on Disjoint Segments

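(Figure: the text T is split into disjoint segments T1, T2, …, Tk; a BWT is built on each segment separately, giving BWT1, BWT2, …, BWTk.)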

Page 15: Memory-aware BWT by Segmenting Sequences

Searching cases

• For the second case, we need to
  – 1. Find a suffix of the query that is a prefix of a segment.
  – 2. Verify that the remaining prefix of the query matches the end of the left segment.
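The two steps can be sketched on plain strings as follows. This is only an illustration of the case analysis, not the talk's index-based algorithm: it keeps the segment text available, uses string operations in place of BWT searches, and assumes every segment is at least as long as the query so that an occurrence crosses at most one boundary. The name `search_disjoint` is hypothetical.

```python
def find_all(text, pattern):
    """Return every start position of pattern in text (stand-in for a per-segment BWT search)."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def search_disjoint(segments, pattern):
    """segments: disjoint strings in text order. Returns sorted global start positions."""
    m = len(pattern)
    results, offset = [], 0
    for j, seg in enumerate(segments):
        # Case 1: the occurrence lies inside a single segment.
        results.extend(offset + pos for pos in find_all(seg, pattern))
        # Case 2: the occurrence crosses the boundary between segments j-1 and j.
        if j > 0:
            left = segments[j - 1]
            for s in range(1, m):  # s = number of query characters in the left segment
                # Step 1: the query suffix must be a prefix of this segment.
                # Step 2: the remaining query prefix must end the left segment.
                if seg.startswith(pattern[s:]) and left.endswith(pattern[:s]):
                    results.append(offset - s)
        offset += len(seg)
    return sorted(results)


print(search_disjoint(["missis", "sippi"], "ssi"))  # [2, 5]
```

The boundary loop tries m − 1 split points per boundary, so this naive version is slower than a single backward walk over the query.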

Page 16: Memory-aware BWT by Segmenting Sequences

Suffix checking

Time complexity: Θ(m)

Page 17: Memory-aware BWT by Segmenting Sequences

Prefix verification

• To verify the prefix, we can
  – 1. Keep the text (wastes space).
  – 2. Recover the text from the BWT on the fly (costs a little time).
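Option 2 relies on the fact that a BWT can be walked backwards with the LF mapping, which yields the text from right to left. The sketch below recovers only the last `count` characters of a segment, which is what a prefix check against the end of the left segment would need. It assumes the terminator is the smallest character, so row 0 of the sorted rotations starts with it; the function name `tail_from_bwt` and the uncompressed rank table are illustrative choices.

```python
def tail_from_bwt(bwt, count):
    """Return the last `count` characters of the indexed text (terminator excluded)."""
    # C[ch] = number of characters in bwt strictly smaller than ch
    C, total = {}, 0
    for ch in sorted(set(bwt)):
        C[ch] = total
        total += bwt.count(ch)
    # rank[i] = number of occurrences of bwt[i] in bwt[0:i]
    seen, rank = {}, []
    for ch in bwt:
        rank.append(seen.get(ch, 0))
        seen[ch] = rank[-1] + 1
    # Row 0 starts with the terminator, so bwt[0] is the last text character.
    row, chars = 0, []
    for _ in range(min(count, len(bwt) - 1)):
        ch = bwt[row]
        chars.append(ch)
        row = C[ch] + rank[row]  # LF mapping: jump to the row that starts with this character
    return "".join(reversed(chars))


bwt = "ipssm$pissii"
print(tail_from_bwt(bwt, 3))   # ppi
print(tail_from_bwt(bwt, 11))  # mississippi
```

Only as many LF steps as characters needed are taken, which is why this option trades a little time for not storing the text.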

Page 18: Memory-aware BWT by Segmenting Sequences

Searching algorithm

Page 19: Memory-aware BWT by Segmenting Sequences

Analysis

• Overlap method
  – Memory cost: (n + l + k) × (log σ + log(n + l + k) − log k)/k
  – Time complexity: Θ(occ + δ + mk)

• Backwalk method
  – Memory cost: n(log σ + log n − log k)/k bits
  – Time complexity: Θ(occ + (η + k)m)

Page 20: Memory-aware BWT by Segmenting Sequences

Experiment

• Environment
  – C++ language
  – PC with 2.93 GHz Intel Core CPU
  – 4 GB main memory
  – Ubuntu operating system (Linux distribution)

• Data sets
  – English text from the Pizza&Chili Corpus
  – Genome sequence from UCSC goldenPath

Page 21: Memory-aware BWT by Segmenting Sequences

Performance on English

(Figure panels: memory cost, build time, and two query time plots.)

Page 22: Memory-aware BWT by Segmenting Sequences

Performance on genome

(Figure panels: memory cost, build time, and two query time plots.)

Page 23: Memory-aware BWT by Segmenting Sequences

More performance

Page 24: Memory-aware BWT by Segmenting Sequences

Conclusion

• We propose a novel variation of BWT called S-BWT.

• Our index uses less memory than BWT.

• Two query methods are based on S-BWT.

• Our method is faster than the BWT method on large text.

Page 25: Memory-aware BWT by Segmenting Sequences

Thank you!

Q&A