1
Pattern Processing and Searching In RAM
Michael Robinson, Ph.D. candidate. Advisor: Dr. Giri Narasimhan
School of Computing and Information Sciences
BioRG Bioinformatics Research Group
Florida International University
11200 SW 8th Street
Miami, FL 33199
{mrobi002, giri}@cs.fiu.edu
Presented by Michael Robinson, January 15, 2008
At: Florida International University
Global CyberBridges, National Science Foundation Program
Award Id: OCI-0636031, October 1, 2006 - December 31, 2009
2
Agenda
- What are Suffix Trees – Suffix Arrays
- Suffix Trees – Importance in Bioinformatics
- Main Memory Bottleneck
- Sadakane’s Compressed Suffix Tree Implementation
- Compressed Suffix Tree Problem
- Engineering a Compressed Suffix Tree Implementation: Authors' Improvements and Algorithm Results
- Example Required File: LCP Array Solution
- Implementation Design – Software
- My Current Work
- Experimental Results
- My Future Work
- References
3
Suffix Tree
BANANAS ANANAS NANAS ANAS NAS AS S
Suffix trees inventor: Peter Weiner, 1973.
[Figure: suffix tree for BANANAS]
4
Suffix Array Implementation
Simplified version of a suffix array.
Lexicographically ordered text.
Sequence = ABRACADABRA

Suffix        Index    Sorted index   Sorted suffix
ABRACADABRA   0        10             A
BRACADABRA    1        7              ABRA
RACADABRA     2        0              ABRACADABRA
ACADABRA      3        3              ACADABRA
CADABRA       4        5              ADABRA
ADABRA        5        8              BRA
DABRA         6        1              BRACADABRA
ABRA          7        4              CADABRA
BRA           8        6              DABRA
RA            9        9              RA
A             10       2              RACADABRA
Suffix arrays inventors: Udi Manber and Gene Myers, 1989 [1]
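The table above can be reproduced with a few lines of Python. This is a minimal sketch that sorts suffix start positions directly, which is fine for a short example string but far too slow and memory-hungry for genomic input; the function name `build_suffix_array` is only illustrative.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in lexicographic order."""
    # Naive construction: compare whole suffixes while sorting.
    return sorted(range(len(text)), key=lambda start: text[start:])


text = "ABRACADABRA"
suffix_array = build_suffix_array(text)
print(suffix_array)  # [10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]

# Print the "Sorted index / Sorted suffix" columns of the table.
for start in suffix_array:
    print(start, text[start:])
```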
5
Suffix Trees: Importance in Bioinformatics
Biological Data Type (A C G T) vs.
Search Engines Data (inverted)
Applying Suffix Trees to Real Genomic Sequences is Impractical
6
Main Memory Bottleneck
Suffix Array Index Storage

Suffix        Length
ABRACADABRA   11
BRACADABRA    10
RACADABRA     9
ACADABRA      8
CADABRA       7
ADABRA        6
DABRA         5
ABRA          4
BRA           3
RA            2
A             1
Total         66 = n(n+1)/2 = 11(12)/2

PA01, ~6 million bases: ~18 terabytes
Human genome, 3,164,700,000 nucleotides: (3,164,700,000 * 3,164,700,001)/2 = 5,007,663,046,582,350,000 characters, about 5,007,663 terabytes
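The arithmetic on this slide can be checked with exact integers. The sketch below assumes one byte per stored character and 1 terabyte = 10^12 bytes, and rounds PA01 to 6 million bases as the slide does; the function name is illustrative.

```python
def explicit_suffix_chars(n):
    """Total characters needed when all n suffixes are written out in full."""
    return n * (n + 1) // 2


# Sequence lengths as used on the slide (PA01 rounded to 6 million bases).
examples = [
    ("ABRACADABRA", 11),
    ("PA01", 6_000_000),
    ("Human genome", 3_164_700_000),
]

for name, n in examples:
    chars = explicit_suffix_chars(n)
    terabytes = chars / 1e12  # assumption: 1 byte per character
    print(f"{name}: n = {n:,}  characters = {chars:,}  ~{terabytes:,.1f} TB")
```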
7
Sadakane’s Compressed Suffix Implementation
A = 00, C = 01, G = 10, T = 11
Storage = n log2 ∑(ACGT) bits = 2n bits, ~20% of original space

Suffix        Index    Compressed storage (bits)
ABRACADABRA   0        22 (= 2n)
BRACADABRA    1        20
RACADABRA     2        18
ACADABRA      3        16
CADABRA       4        14
ADABRA        5        12
DABRA         6        10
ABRA          7        8
BRA           8        6
RA            9        4
A             10       2

Total uncompressed: 66 characters * 8 bits = 528 bits
Total compressed: (n(n+1)/2) * 2 = 132 bits

Unfortunately it is not linear: 100 MB ~ 5 GB
[2] Kunihiko Sadakane
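The 2-bit code for A, C, G and T can be illustrated as follows. This is only a sketch of the encoding idea, packing a DNA string into a Python integer; it is not Sadakane's data structure, and the names `pack` and `unpack` are hypothetical.

```python
# 2-bit code from the slide: A = 00, C = 01, G = 10, T = 11.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def pack(seq):
    """Pack a DNA string into an integer, 2 bits per base, first base in the highest bits."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def unpack(value, n):
    """Recover the n-base DNA string from its packed integer."""
    return "".join(BASES[(value >> (2 * (n - 1 - i))) & 0b11] for i in range(n))


seq = "GTCAAGTC"
packed = pack(seq)
bits = format(packed, f"0{2 * len(seq)}b")
print(bits, len(bits))  # 16 bits for 8 bases
print(unpack(packed, len(seq)))
```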
8
Compressed Suffix Tree Problem
Unfortunately the Suffix Tree is not linear: 100 MB ~ 5 GB
The sequence is linear:
  ACGT = 4 bases = 2n bits = 8 bits = n log2 ∑(ACGT)
  GTCAAGTC = 8 bases = 2n bits = 16 bits = n log2 ∑(ACGT)
But the Suffix Array is not:
  ACGT = 4 bases = (4(5)/2)*2 = 20 bits = (n(n+1)/2)*2
  GTCAAGTC = 8 bases = (8(9)/2)*2 = 72 bits = (n(n+1)/2)*2
In the first data structure the second sequence is twice as large as the first one, but in the second data structure it is more than three times as large (72/20 = 3.6).
It is 30% slower than non-compressed trees.
9
Engineering a Compressed Suffix Tree Implementation: Authors' Improvements and Algorithm Results
Algorithm results (space):

Sequence               Authors: ∑(AGCT) log2 n   Sadakane: n log2 ∑(AGCT) = 2n bits
GTCA                   4*2 = 8 bits              2*4 = 8 bits
GTCAAGTC               4*3 = 12 bits             2*8 = 16 bits
TACAAGTAGTCAAGTC       4*4 = 16 bits             2*16 = 32 bits
A 2048 base sequence   4*11 = 44 bits            2*2048 = 4096 bits

Space needed during construction: 1.4 times final space
The authors created an abstract suffix array using a succinct suffix array, based on a wavelet tree (for sound), built on the Burrows-Wheeler transform.
[3] Niko Välimäki et al.
10
Example Required File: LCP Array Solution
A useful additional data structure: an array of the lengths of the longest common prefixes between each suffix and its predecessor in the suffix array.
Lexicographically ordered text.
Sequence = ABRACADABRA

Suffix        Index    Sorted index   LCP   Sorted suffix
ABRACADABRA   0        10             0     A
BRACADABRA    1        7              1     ABRA
RACADABRA     2        0              4     ABRACADABRA
ACADABRA      3        3              1     ACADABRA
CADABRA       4        5              1     ADABRA
ADABRA        5        8              0     BRA
DABRA         6        1              3     BRACADABRA
ABRA          7        4              0     CADABRA
BRA           8        6              0     DABRA
RA            9        9              0     RA
A             10       2              2     RACADABRA
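The LCP column can be computed directly from its definition. The sketch below uses a naive suffix sort and a character-by-character comparison, which is adequate for the example string only; function names are illustrative.

```python
def build_suffix_array(text):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(text)), key=lambda start: text[start:])


def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def build_lcp_array(text, suffix_array):
    """lcp[k] = LCP of the k-th sorted suffix and its predecessor; lcp[0] = 0."""
    lcp = [0] * len(suffix_array)
    for k in range(1, len(suffix_array)):
        previous = text[suffix_array[k - 1]:]
        current = text[suffix_array[k]:]
        lcp[k] = common_prefix_len(previous, current)
    return lcp


text = "ABRACADABRA"
suffix_array = build_suffix_array(text)
lcp = build_lcp_array(text, suffix_array)
print(lcp)  # [0, 1, 4, 1, 1, 0, 3, 0, 0, 0, 2]
```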
11
Implementation Design – Software
- C++ object oriented.
- Each Data Structure is its own class.
- Generic code, e.g. from Sadakane, to retrieve short sequences.
- New code for construction and for retrieving long sequences.
- Tailored code is as time/space efficient as generic code.
12
My Current Work - Approach
- Dissertation, not published yet.
- Suffix arrays approach.
- Google’s construction approach.
- Construction time and space problems.
  Sequences from 11 bases to PA01 with 6.2 million bases.
  PA01: run on 7 different computers. Fastest time: 5 days.
- All files contain uncompressed information.
- Space required:
  Sequence file = n
  One index file = from n to ~8 times the PA01 sequence size
- Loading time and RAM space problems.
- Solution: break the index file into 64 or 1024 sub-indexes, improving loading and processing times and allowing processing of larger sequences.
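The slides do not say how the index is split. One plausible reading, and purely an assumption here, is that 64 = 4^3 and 1024 = 4^5 correspond to grouping suffixes by their first 3 or 5 bases, so that a probe only needs the sub-index matching its own prefix. The sketch below illustrates that idea on a short string; `build_sub_indexes` is a hypothetical name, not the author's code.

```python
from collections import defaultdict


def build_sub_indexes(text, k):
    """Group suffix start positions by their first k bases.

    Each group is sorted like a small suffix array. Suffixes shorter
    than k bases are left out of this sketch.
    """
    buckets = defaultdict(list)
    for start in range(len(text) - k + 1):
        buckets[text[start:start + k]].append(start)
    for positions in buckets.values():
        positions.sort(key=lambda i: text[i:])
    return dict(buckets)


text = "acgttgacgttgca"
sub_indexes = build_sub_indexes(text, 3)

# At most 4**3 = 64 sub-indexes can exist for k = 3 over the alphabet a, c, g, t.
print(len(sub_indexes), "non-empty sub-indexes out of", 4 ** 3)
print(sub_indexes["acg"])  # only this sub-index is needed for probes starting with "acg"
```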
13
My Current Work - Applications
- Finding Patterns: How Many Times, and Where a Probe Appears in a Given Sequence
acgttg ….. acgttg ….. acgttg ….. acgttg ….. acgttg ….. acgttg
- Finding Inverted Patterns: Same as Finding Patterns, plus the probe reversed
acgttg ….. gttgca ….. acgttg ….. gttgca ….. acgttg ….. gttgca
- Finding Inverted Reciprocal Patterns: the probe reversed and complemented
acgttg ….. caacgt ….. acgttg ….. caacgt ….. acgttg ….. caacgt
- Above Programs Generate a Text File Report for Further Processing
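The three searches can be sketched with a suffix array and binary search. This is a minimal illustration under stated assumptions: the suffix array is built naively, "inverted" is taken to mean the reversed probe, and "inverted reciprocal" the reverse complement, which matches the example strings on the slide. All function names are hypothetical.

```python
def build_suffix_array(text):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(text)), key=lambda start: text[start:])


def find_occurrences(text, suffix_array, probe):
    """Return sorted start positions of probe, found by binary search."""
    m = len(probe)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose m-character prefix is >= probe
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < probe:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose m-character prefix is > probe
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= probe:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


def reverse(probe):
    """Inverted pattern: the probe read backwards."""
    return probe[::-1]


def reverse_complement(probe):
    """Inverted reciprocal pattern: complement each base, then reverse."""
    return probe.translate(str.maketrans("acgt", "tgca"))[::-1]


text = "acgttg" + "gttgca" + "caacgt" + "acgttg"
suffix_array = build_suffix_array(text)
probe = "acgttg"
for label, pattern in [("pattern", probe),
                       ("inverted", reverse(probe)),
                       ("inverted reciprocal", reverse_complement(probe))]:
    positions = find_occurrences(text, suffix_array, pattern)
    print(label, pattern, "count =", len(positions), "positions =", positions)
```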
14
Sadakane’s Experimental Results
Using: one 2.4 GHz Pentium 4 computer with 1 GB RAM, Red Hat OS; programs compiled using g++ (GCC)
15
Engineering a Compressed Suffix Tree Implementation Experimental Results
16
My Current Work - Results
Finding Patterns in PA01: 6,264,404 bases

Sequence   Bases      Seconds
a          1056134    3.4370
aa         190634     0.6410
aaa        8948       0.0940
aaaa       6858       0.0150
aaaaa      2050       0.2340
c          2102684    6.8120
cc         595921     1.9530
ccc        108532     0.3590
cccc       18417      0.0630
ccccc      2883       0.0160
g          2066632    6.8120
gg         575879     2.0470
ggg        101892     0.3440
gggg       16526      0.0630
ggggg      2506       0.0160
t          1038950    3.3750
tt         185978     0.6400
ttt        27693      0.0930
tttt       6659       0.0150
ttttt      1904       0.0000
17
My Future Work
Do construction for: the human genome and all Pseudomonas aeruginosa bacteria
Consensus Pattern Search:
• To solve the bioinformatics consensus problem up to n-1 mutations of a given probe. At the present time there are applications that solve this problem up to a value of 3.
• For a probe with 50 bases, with alphabet A C G T, if we check for 3 mutations we need to do 4 * 4^(mutations-1) = 4 * 4^2 = 64 pattern searches for each group of 3 bases.
• For a sequence of 3.6 billion bases, a probe of 1,000 bases, and a mutation rank of n-1, we need to do 4 * 4^999 pattern searches on that sequence.
• Excel calculates up to 4 * 4^510 = 4^511 = 4.4942E+307. The only way to do this work is with a distributed system.
• Solving this problem for proteins will require more time because proteins have an alphabet of size 20.
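The search counts above grow too fast for floating-point arithmetic, but they can be computed exactly with arbitrary-precision integers. The sketch below applies the slide's formula, alphabet size times alphabet size to the power (mutations - 1); extending the same formula to a 20-letter protein alphabet is an assumption made here for illustration.

```python
def pattern_searches(mutations, alphabet_size=4):
    """Number of pattern searches per the slide's formula: s * s**(mutations - 1)."""
    return alphabet_size * alphabet_size ** (mutations - 1)


print(pattern_searches(3))  # 64 searches for 3 mutations over A C G T

# Python integers are exact, so 4 * 4**999 is representable; print its size only.
big = pattern_searches(1000)
print("4 * 4^999 has", len(str(big)), "decimal digits")

# Double-precision floats run out just above 4**511.
print(float(4 ** 511))
try:
    float(4 ** 512)
    overflowed = False
except OverflowError:
    overflowed = True
print("4^512 overflows a double:", overflowed)

# Assumption: the same formula applied to a 20-letter protein alphabet.
print(pattern_searches(3, alphabet_size=20))
```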
18
Conclusions
• Due to advances in computer hardware and reductions in price, today a 1.5 terabyte hard disk costs around 400 US dollars.
• Recent implementations of suffix trees and suffix arrays concentrate on compressing the data, causing large delays in user processing.
• We believe the previous hard disk space bottleneck has been resolved. Compressing data on hard disk is therefore no longer necessary, especially when user applications slow down by a factor of 30 for suffix trees, and by an additional factor of log n for suffix arrays, compared to uncompressed data.
• With operating systems now able to access 128 gigabytes of RAM in workstations, and with advances in distributed (grid) computing, we believe that by using uncompressed data with new methods like our implementation, we can produce applications that were not possible before.
21
References
[1] Udi Manber and Gene Myers. "Suffix arrays: a new method for on-line string searches". SIAM Journal on Computing, Volume 22, Issue 5 (October 1993), pp. 935-948.
[2] Kunihiko Sadakane, Department of Computer Science and Communication Engineering, Kyushu University, Hakozaki 6-10-1, Higashi-ku, Fukuoka 812-8581, Japan. [email protected]
[3] Niko Välimäki (1), Wolfgang Gerlach (2), Kashyap Dixit (3), and Veli Mäkinen (1). "Engineering a Compressed Suffix Tree Implementation".
    (1) Department of Computer Science, University of Helsinki, Finland. {nvalimak,vmakinen}@cs.helsinki.fi
    (2) Technische Fakultät, Universität Bielefeld, Germany. [email protected]
    (3) Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur, India. [email protected]
22
Questions
Thank you!!
Presented by: Michael Robinson
Florida International University
[email protected]
January 15, 2007