Pattern Processing and Searching In RAM

Michael Robinson, Ph.D. candidate
Advisor: Dr. Giri Narasimhan

School of Computing and Information Sciences
BioRG: Bioinformatics Research Group

Florida International University
11200 SW 8th Street
Miami, FL 33199
{mrobi002, giri}@cs.fiu.edu

Presented by Michael Robinson, January 15, 2008

At: Florida International University, Global CyberBridges

National Science Foundation Program Award Id: OCI-0636031, October 1, 2006 - December 31, 2009


Agenda

- What are Suffix Trees – Suffix Arrays

- Suffix Trees – Importance in Bioinformatics

- Main Memory Bottleneck

- Sadakane’s Compressed Suffix Tree Implementation

- Compressed Suffix Tree Problem

- Engineering a Compressed Suffix Tree Implementation: Authors' Improvements and Algorithm Results

- Example Required File: LCP Array Solution

- Implementation Design – Software

- My Current Work

- Experimental Results

- My Future Work

- References

Suffix Tree

Suffixes of BANANAS: BANANAS, ANANAS, NANAS, ANAS, NAS, AS, S

Suffix trees inventor: Peter Weiner, 1973.

[Figure: suffix tree for BANANAS]

Suffix Array Implementation

A simplified version of a suffix array: lexicographically ordered text.

Sequence = ABRACADABRA

Suffix        Index    Sorted Index  Sorted Suffix
ABRACADABRA   0        10            A
BRACADABRA    1        7             ABRA
RACADABRA     2        0             ABRACADABRA
ACADABRA      3        3             ACADABRA
CADABRA       4        5             ADABRA
ADABRA        5        8             BRA
DABRA         6        1             BRACADABRA
ABRA          7        4             CADABRA
BRA           8        6             DABRA
RA            9        9             RA
A             10       2             RACADABRA

Suffix arrays inventors: Udi Manber and Gene Myers, 1989 [1]
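The sorted index column can be reproduced with a few lines of Python. This is a minimal sketch, not the implementation described in this talk: the function name `build_suffix_array` is chosen here for illustration, and sorting whole suffixes like this only works for short strings, not for genome-sized input.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order."""
    # Naive construction: sort the start positions by the suffix that begins there.
    return sorted(range(len(text)), key=lambda i: text[i:])


sequence = "ABRACADABRA"
suffix_array = build_suffix_array(sequence)

# Print rank, start position, and the suffix itself.
for rank, start in enumerate(suffix_array):
    print(rank, start, sequence[start:])
```

Each printed row gives the rank, the start position in the original sequence, and the suffix found there.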

Suffix Trees: Importance in Bioinformatics

Biological Data Type (A C G T) vs. Search Engines Data (inverted)

Applying Suffix Trees to Real Genomic Sequences is Impractical

Main Memory Bottleneck: Suffix Array Index Storage

Suffix        Characters
ABRACADABRA   11
BRACADABRA    10
RACADABRA     9
ACADABRA      8
CADABRA       7
ADABRA        6
DABRA         5
ABRA          4
BRA           3
RA            2
A             1
Total         66 = n(n+1)/2 = 11(12)/2

PA01 (about 6 million bases): ~18 terabytes

Human genome, 3,164,700,000 nucleotides: (3,164,700,000 * 3,164,700,001)/2 = 5,007,663,046,582,350,000 characters = 5,007,663 terabytes
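The totals on this slide follow from n(n+1)/2. A short sketch of the arithmetic is below. The helper name `total_suffix_characters` is made up for illustration, the PA01 length of 6,264,404 bases is taken from the results slide, and the terabyte figure assumes one byte per character and decimal terabytes (10^12 bytes).

```python
def total_suffix_characters(n):
    """Characters needed to store every suffix of an n-character sequence explicitly."""
    return n * (n + 1) // 2


examples = [
    ("ABRACADABRA", 11),
    ("PA01", 6_264_404),           # length assumed from the results slide
    ("Human genome", 3_164_700_000),
]

for name, n in examples:
    chars = total_suffix_characters(n)
    # Assumes 1 byte per character and 1 terabyte = 10**12 bytes.
    print(f"{name}: {chars:,} characters, about {chars / 1e12:,.1f} terabytes")
```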


Sadakane’s Compressed Suffix Tree Implementation

A = 00, C = 01, G = 10, T = 11

Storage = n log2 |∑| bits = 2n bits for ∑ = {A, C, G, T}, ~20% of original space

Suffix        Index    Compressed storage
ABRACADABRA   0        22 bits = 2n
BRACADABRA    1        20
RACADABRA     2        18
ACADABRA      3        16
CADABRA       4        14
ADABRA        5        12
DABRA         6        10
ABRA          7        8
BRA           8        6
RA            9        4
A             10       2

Total uncompressed: 66 characters * 8 bits = 528 bits
Total compressed: (n(n+1)/2) * 2 bits = 66 * 2 = 132 bits

Unfortunately it is not linear: 100 MB ~ 5 GB [2] (Kunihiko Sadakane)
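The 2-bit code can be illustrated with a small packing routine. This is only a sketch of the encoding idea, not Sadakane's data structure. The names `pack` and `unpack` are hypothetical, and the example uses the DNA string GTCAAGTC because ABRACADABRA contains letters outside the four-letter alphabet.

```python
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # BASES[code] maps a 2-bit code back to its base


def pack(sequence):
    """Pack a DNA string into an integer, 2 bits per base, first base in the highest bits."""
    value = 0
    for base in sequence:
        value = (value << 2) | CODES[base]
    return value


def unpack(value, length):
    """Recover a DNA string of the given length from its packed integer."""
    return "".join(
        BASES[(value >> (2 * (length - 1 - i))) & 0b11] for i in range(length)
    )


packed = pack("GTCAAGTC")
print(format(packed, "016b"))  # 8 bases occupy 16 bits
print(unpack(packed, 8))
```

The length has to be stored alongside the packed value, since leading A bases are all-zero bits and would otherwise be lost.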

Compressed Suffix Tree Problem

Unfortunately the suffix tree is not linear: 100 MB ~ 5 GB

The sequence is linear, n log2 |∑| = 2n bits for ∑ = {A, C, G, T}:
ACGT = 4 bases = 8 bits
GTCAAGTC = 8 bases = 16 bits

But the suffix array is not, (n(n+1)/2)*2 bits:
ACGT = 4 bases = (4(5)/2)*2 = 20 bits
GTCAAGTC = 8 bases = (8(9)/2)*2 = 72 bits

In the first data structure the 2nd sequence is twice as long as the first one, but in the second data structure it is more than three times as large (72/20 = 3.6).

It is 30% slower than non-compressed trees.

Engineering a Compressed Suffix Tree Implementation: Authors' Improvements and Algorithm Results

Algorithm results, space:

Sequence             Authors: |∑| log2 n    Sadakane: n log2 |∑| = 2n bits
GTCA                 4*2 = 8 bits           2*4 = 8 bits
GTCAAGTC             4*3 = 12 bits          2*8 = 16 bits
TACAAGTAGTCAAGTC     4*4 = 16 bits          2*16 = 32 bits
A 2048 base sequence 4*11 = 44 bits         2*2048 = 4096 bits

Here ∑ = {A, C, G, T}, so |∑| = 4.
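The rows of this comparison can be recomputed directly from the two formulas as they appear on the slide. The sketch below only reproduces that arithmetic. It makes no claim about the actual space bounds of either published structure, and the function names are made up for illustration.

```python
import math

SIGMA = 4  # alphabet size for A, C, G, T


def authors_bits(n):
    """The slide's figure for the authors' structure: sigma * log2(n) bits."""
    return SIGMA * math.log2(n)


def sadakane_bits(n):
    """The slide's figure for Sadakane's structure: n * log2(sigma) = 2n bits."""
    return n * math.log2(SIGMA)


# Sequence lengths used on the slide.
for n in (4, 8, 16, 2048):
    print(n, authors_bits(n), sadakane_bits(n))
```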

Space needed during construction: 1.4 times the final space.

The authors created an abstract suffix array using a succinct suffix array, based on a wavelet tree (for sound), built on the Burrows-Wheeler transform.

[3] Niko Välimäki


Example Required File: LCP Array Solution

A useful additional data structure: an array of the lengths of the longest common prefixes between each suffix and its predecessor in the suffix array.

Lexicographically ordered text.

Sequence = ABRACADABRA

Suffix        Index    Sorted Index  LCP  Sorted Suffix
ABRACADABRA   0        10            0    A
BRACADABRA    1        7             1    ABRA
RACADABRA     2        0             4    ABRACADABRA
ACADABRA      3        3             1    ACADABRA
CADABRA       4        5             1    ADABRA
ADABRA        5        8             0    BRA
DABRA         6        1             3    BRACADABRA
ABRA          7        4             0    CADABRA
BRA           8        6             0    DABRA
RA            9        9             0    RA
A             10       2             2    RACADABRA
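The LCP column can be checked with a direct comparison of neighbouring suffixes. This is a naive sketch for short strings. The function names are assumptions chosen for illustration, and linear-time LCP algorithms exist but are not shown here.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_length(a, b):
    """Number of leading characters that a and b share."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def build_lcp_array(text, suffix_array):
    """lcp[k] = common prefix length of sorted suffixes k-1 and k; lcp[0] = 0."""
    lcp = [0] * len(suffix_array)
    for k in range(1, len(suffix_array)):
        previous = text[suffix_array[k - 1]:]
        current = text[suffix_array[k]:]
        lcp[k] = common_prefix_length(previous, current)
    return lcp


sequence = "ABRACADABRA"
print(build_lcp_array(sequence, build_suffix_array(sequence)))
```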


Implementation Design – Software

- C++ object oriented.

- Each Data Structure is its own class.

- Generic code (i.e., from Sadakane) retrieves short sequences.

- New code for construction and for retrieving long sequences.

- Tailored code is as time/space efficient as generic code.

My Current Work - Approach

- Dissertation, not published yet.

- Suffix arrays approach.

- Google’s construction approach.

- Construction time and space problems. Sequences from 11 bases to PA01 with 6.2 million bases. PA01: run on 7 different computers. Fastest time: 5 days.

- All files contain uncompressed information.

- Space required: sequence file = n; one index file = from n to ~8 times the PA01 sequence size.

- Loading time and RAM space problems.

- Solution: break the index file into 64 or 1024 sub-indexes, improving loading and processing times and allowing processing of larger sequences.


My Current Work - Applications

- Finding Patterns: How Many Times, and Where a Probe Appears in a Given Sequence

acgttg ….. acgttg ….. acgttg ….. acgttg ….. acgttg ….. acgttg

- Finding Inverted Patterns: Same as Finding Patterns plus inverted

acgttg ….. gttgca ….. acgttg ….. gttgca ..… acgttg ….. gttgca

- Finding Inverted Reciprocal Patterns:

acgttg ….. caacgt ….. acgttg ….. caacgt ….. acgttg ….. caacgt

- Above Programs Generate a Text File Report for Further Processing
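The three kinds of probe can be sketched as follows. This is an illustration only: plain `str.find` stands in for the suffix array index lookup, the function names and the test sequence are invented here, and "inverted reciprocal" is read as the reverse complement, which matches the acgttg / caacgt example.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def find_all(sequence, probe):
    """Return every start position of probe in sequence, including overlapping ones."""
    positions = []
    start = sequence.find(probe)
    while start != -1:
        positions.append(start)
        start = sequence.find(probe, start + 1)
    return positions


def inverted(probe):
    """The probe read backwards."""
    return probe[::-1]


def inverted_reciprocal(probe):
    """The probe read backwards with each base complemented (reverse complement)."""
    return probe[::-1].translate(COMPLEMENT)


# Hypothetical test sequence containing one probe of each kind plus a repeat.
sequence = "acgttgaagttgcattcaacgtacgttg"
probe = "acgttg"

for label, pattern in [
    ("pattern", probe),
    ("inverted", inverted(probe)),
    ("inverted reciprocal", inverted_reciprocal(probe)),
]:
    hits = find_all(sequence, pattern)
    print(f"{label}: {pattern} appears {len(hits)} time(s) at {hits}")
```

The printed lines give the count and positions for each probe, the same two pieces of information the text file report is described as containing.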

Sadakane’s Experimental Results

Using: one 2.4 GHz Pentium 4 computer with 1 GB RAM, Red Hat OS; programs compiled with g++ (GCC).


Engineering a Compressed Suffix Tree Implementation Experimental Results

My Current Work - Results

Finding patterns in PA01: 6,264,404 bases

Pattern   Occurrences   Seconds
a         1056134       3.4370
aa        190634        0.6410
aaa       8948          0.0940
aaaa      6858          0.0150
aaaaa     2050          0.2340
c         2102684       6.8120
cc        595921        1.9530
ccc       108532        0.3590
cccc      18417         0.0630
ccccc     2883          0.0160
g         2066632       6.8120
gg        575879        2.0470
ggg       101892        0.3440
gggg      16526         0.0630
ggggg     2506          0.0160
t         1038950       3.3750
tt        185978        0.6400
ttt       27693         0.0930
tttt      6659          0.0150
ttttt     1904          0.0000

My Future Work

Do construction for: the human genome and all Pseudomonas aeruginosa bacteria.

Consensus Pattern Search:

• To solve the Bioinformatics Consensus Problem to n-1 of a given probe. At the present time there are applications that solve this problem to value 3.

• For a probe with 50 bases, with alphabet A C G T, if we check for 3 mutations we need to do 4 * 4^(mutations-1) = 64 pattern searches, for each group of 3 bases.

• For a sequence of 3.6 billion bases, a probe of 1,000 bases, and a mutation rank of n-1, we need to do 4 * 4^999 pattern searches on that sequence.

• Excel calculates up to 4 * 4^510 = 4.4942E+307. The only way to do this work is with a Distributed System.

• Solving this problem for proteins will require more time because proteins have an alphabet of length 20.
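The search counts quoted above grow far past what a spreadsheet cell can hold, but they are easy to evaluate with arbitrary-precision integers. The sketch below only evaluates the slide's formula. The function name `pattern_searches` is an assumption, and applying the same formula to a 20-letter protein alphabet is an extrapolation made here for comparison.

```python
def pattern_searches(mutations, alphabet_size=4):
    """The slide's count: alphabet_size * alphabet_size ** (mutations - 1)."""
    return alphabet_size * alphabet_size ** (mutations - 1)


# Three mutations over A, C, G, T.
print(pattern_searches(3))

# The slide's figure for the 1,000-base probe, 4 * 4^999, as an exact integer.
large = 4 * 4 ** 999
print(len(str(large)), "decimal digits")

# Same formula with a 20-letter alphabet (extrapolation, see lead-in).
print(pattern_searches(3, alphabet_size=20))
```

Python integers have no fixed upper limit, so the count itself is representable. The obstacle is performing that many searches, not writing the number down.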

Conclusions

• Due to advances in computer hardware and reductions in price, today a 1.5 terabyte hard disk costs around 400 US dollars.

• Recent implementations of suffix trees and suffix arrays concentrate on compressing the data, causing large delays in user processing.

• We believe the previous hard disk space bottleneck has been resolved. Compressing data on hard disk is therefore no longer necessary, especially when user applications slow down by factors of 30 for suffix trees, and by an additional log n for suffix arrays, compared to uncompressed data.

• With advances in operating systems that access 128 gigabytes of RAM in workstations, and with advances in Distributed (Grid) Computing, we believe that using uncompressed data with new methods like our implementation, we can produce applications that were not possible before.

References

[1] Udi Manber and Gene Myers. "Suffix arrays: a new method for on-line string searches". SIAM Journal on Computing, Volume 22, Issue 5 (October 1993), pp. 935-948.

[2] Kunihiko Sadakane, Department of Computer Science and Communication Engineering, Kyushu University, Hakozaki 6-10-1, Higashi-ku, Fukuoka 812-8581, Japan.

[3] Niko Välimäki, Wolfgang Gerlach, Kashyap Dixit, and Veli Mäkinen. "Engineering a Compressed Suffix Tree Implementation". Department of Computer Science, University of Helsinki, Finland, {nvalimak,vmakinen}@cs.helsinki.fi; Technische Fakultät, Universität Bielefeld, Germany; Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur, India.








Questions

Thank you!!

Presented by: Michael Robinson

Florida International University

January 15, 2008

