




ABSTRACT

The use of compression techniques in various fields of data management has been very encouraging lately. DNA data sizes have become large, and this causes problems of storage and data transfer. The common approach is to place these data on a server, which adds to the cost of data management. Furthermore, online transfer is no longer the best solution; for a research centre with a low-speed Internet connection, the transfer is almost impossible to carry out. This study proposes an enhancement of the LZ77 algorithm, a common non-greedy, dictionary-based method that uses the sliding-window concept for compressing alphabetical data. By introducing sectioned sliding windows with a hash-table approach, the proposed compression algorithm can address the storage problem of large DNA sequences. This implementation speeds up compression and improves compression rates. Two formats of DNA data (binary and FASTA) were tested and analysed. Simulation showed promising compression rates that scale with the size of the DNA, compressing at a rate of 56% per bit. Compared with BioCompress, an LZ77-based DNA compression algorithm with a 44% compression rate, the proposed algorithm outperforms it by 12 percentage points. The implications of this study allow cost reduction in handling large-scale DNA data.


ABSTRAK

The use of compression techniques in various fields of data management has been very encouraging lately. However, with the emergence of various new techniques, DNA data sizes have grown ever larger, and this causes problems of data storage and transfer. The common approach is to place these data on a server, but this adds to the cost of data management. For a research centre with low-speed Internet access, such transfers are almost impossible to carry out. This study discusses an enhancement of the LZ77 algorithm, a non-greedy, dictionary-based method that applies a sliding-window approach to compress alphabetic sequence data. The algorithm is further improved with sectioned sliding windows together with a hash-table approach. This method reduces compression time and increases the compression rate. Two formats of DNA data (binary and FASTA) were tested and analysed. The simulation results scale with the increase in DNA size, where the method can compress at a rate of 56% per bit. For comparison, this rate surpasses the latest LZ77-based compression technique, BioCompress, which compresses only at a rate of 44%; higher by 12 percentage points. This study can reduce the cost of maintaining large DNA data.


TABLE OF CONTENTS

CHAPTER TITLE

DECLARATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
ABSTRAK
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF TERMINOLOGIES
LIST OF ABBREVIATIONS
LIST OF APPENDICES

1 INTRODUCTION
1.1 Overview
1.2 Background of DNA Sequencing
1.2.1 DNA Sequence Identification
1.2.2 Large-scale DNA Sequencing
1.2.3 Benefits of Genome Research
1.3 Motivation of the Research
1.4 Statement of the Problem
1.5 Objectives of the Study
1.6 Scope of the Study
1.7 Thesis Outline

2 LITERATURE REVIEW
2.1 Overview
2.2 Basic Compression Method
2.3 Dictionary Compression Method
2.3.1 The Huffman Coding
2.3.2 Lempel-Ziv Coding
2.4 Dictionary Based Compression
2.4.1 LZ77
2.4.2 LZW
2.5 Existing DNA Compression Algorithms
2.5.1 BioCompress
2.5.2 GenCompress
2.5.3 DNACompress
2.6 Hash Table
2.6.1 Hash Function
2.7 Discussion

3 RESEARCH METHODOLOGY
3.1 Overview
3.2 Preliminary Study
3.3 Research Framework
3.4 Research Approach
3.4.1 Function Based Algorithm
3.4.2 Reconstructing LZ77 Algorithm and Hash Table Approach Code
3.4.3 Combination between LZ77 and Hash Table
3.4.4 Testing and Simulation with Data
3.4.5 Database
3.4.6 Hardware and Software
3.5 Hash Table Allocation
3.5.1 Write Method
3.5.2 Read Method
3.6 Compression Algorithm / Code
3.7 Decompression Algorithm / Code
3.8 Summary

4 IMPLEMENTATION AND RESULT
4.1 Overview
4.2 DNA Sequence for Sampling
4.3 DNA Sequence Data Set
4.3.1 FASTA Format
4.3.2 Binary Format
4.4 The Compression Metrics
4.4.1 Bit Rate
4.4.2 Percentage of Compression
4.4.3 Time Consumption
4.5 Experiment Results
4.6 Summary

5 ANALYSIS AND DISCUSSION
5.1 Overview
5.2 Bit Rate Discussion among Algorithms
5.3 Compression Percentage between FASTA and Binary Format
5.4 Speed to Compress / Decompress FASTA
5.5 Prediction Graph for Large Scale DNA Sequence

6 CONCLUSION AND FUTURE WORKS
6.1 Discussion
6.2 Future Works
6.2.1 A Solution for Binary Data Type
6.2.2 Multiple Architecture Capabilities
6.2.3 Greedy Method
6.2.4 Data Transfer Through Network
6.2.5 Pattern of DNA Sequence
6.3 Conclusion

REFERENCES
APPENDIX A
APPENDIX B

CHAPTER 1

INTRODUCTION

1.1 Overview

Data compression is important in making maximal use of limited information storage and transmission capabilities. One might think that as such capabilities increase, data compression would become less relevant, but so far this has not been the case, since the volume of data always seems to grow more rapidly than the capacity for storing and transmitting it. Wolfram (2002) noted that compression is likely to remain relevant in the future wherever there are physical constraints, such as transmission by electromagnetic radiation that is not spatially localized.

There are many types of specialized compression, such as for text, images, and sound. In this research, DNA sequences are the subject of the experiments; they consist of a specific kind of text only. Deoxyribonucleic acid (DNA) constitutes the physical medium in which all the properties of living organisms are encoded. Biological databases such as EMBL, GenBank, and DDBJ were developed around the world to store nucleotide sequences (DNA, RNA) and the amino-acid sequences of proteins, and the size of these collections is nowadays increasing exponentially fast (Grumbach and Tahi, 1994). Although not as big as some other scientific databases, their size is in the hundreds of gigabytes.

The first widely used compression scheme, Morse code, was invented in 1838 for use in telegraphy. It applies data compression by assigning shorter codewords to letters such as "e" and "t" that are more common in English. In 1949 Claude Shannon and Robert Fano developed a systematic way to assign codewords based on the probabilities of blocks (Wolfram, 2002). In the mid-1970s the idea emerged of dynamically updating the codewords of Huffman encoding based on the actual data encountered (Huffman, 1952). In the late 1970s, with online storage of text files becoming common, software compression programs began to be developed, almost all based on adaptive Huffman coding. In 1977 Abraham Lempel and Jacob Ziv (1977, 1978) suggested the basic idea of pointer-based encoding. In the mid-1980s, following work by Terry Welch (1984), the so-called LZW algorithm rapidly became the method of choice for most general-purpose compression systems. It was used in programs such as PKZIP, as well as in hardware devices such as modems (Nevill, Witten and Olson, 1996).

This research focuses on enhancing a currently used character-compression scheme to handle large-scale DNA sequences. Selected large-scale genes are tested using the proposed scheme. The next section discusses the background of DNA sequencing, which leads, in the motivation of the research, to an understanding of why compression of large-scale DNA sequences must be done. The objectives of the research are presented in Section 1.5, the scope of the research in Section 1.6, and the thesis outline in Section 1.7.

1.2 Background of DNA Sequencing

Finding a single gene amid the vast stretches of DNA that make up the human genome - three billion base pairs' worth - requires a set of powerful tools. The Human Genome Project (HGP) was devoted to developing new and better tools to make gene hunts faster, cheaper and practicable for almost any scientist to accomplish (Watson, 1990; Francis et al., 1998).

These tools include genetic maps, physical maps and DNA sequence - a detailed description of the order of the chemical building blocks, or bases, in a given stretch of DNA. Indeed, the monumental achievement of the HGP was its successful sequencing of the entire length of human DNA, also called the human genome (Adams et al., 1991).

Scientists need to know the sequence of bases because it tells them the kind of genetic information that is carried in a particular segment of DNA. For example, they can use sequence information to determine which stretches of DNA contain genes, as well as to analyze those genes for changes in sequence, called mutations, that may cause disease.

DNA sequencing involves the polymerase chain reaction (PCR). The purpose of sequencing is to determine the order of the nucleotides of a gene; this order is the key to understanding the human genome. Frederick Sanger is credited with the invention of DNA sequencing techniques (Roberts, 1987).

Sanger's approach involved copying DNA strands in a way that would show the location of the nucleotides in the strands through the use of X-ray machines. This technique is very slow and tedious, usually taking many years to sequence only a few million letters in a string of DNA that often contains hundreds of millions or even billions of letters. Modern techniques make use of fluorescent tags instead of X-rays, which significantly reduces the time required to process a given batch of DNA.

In 1991, working with Nobel laureate Hamilton Smith, Venter's genomic research institute (TIGR) created a bold new sequencing process coined 'shotgunning' (Weber and Myers, 1997):

"Using an ordinary kitchen blender, they would shatter the organism's DNA

into millions of small fragments, run them through the sequencers (which can read

500 letters at a time), then reassemble them into full genomes using a high speed

computer and novel software written by in-house computer "(Weber and Myers,

1997).

This new method not only uses super-fast automated machines but also the fluorescent detection process and the PCR DNA-copying procedure. It is very fast and accurate compared with older techniques.


1.2.1 DNA Sequence Identification

DNA sequencing is a complex nucleotide-sequencing technique including three identifiable steps:

1. Polymerase Chain Reaction (PCR)
2. Sequencing Reaction
3. Gel Electrophoresis & Computer Processing

Chromosomes (Roberts, 1987), which range from 50 million to 250 million bases, must first be broken into much shorter pieces (PCR step). Each short piece is used as a template to generate a set of fragments that differ in length from each other by a single base that will be identified in a later step (template preparation and sequencing reaction steps).

The fragments in a set are separated by gel electrophoresis (separation step). New fluorescent dyes allow separation of all four fragments in a single lane on the gel.


Figure 1.1: The Separation of the Molecules with Electrophoresis

The final base at the end of each fragment is identified (base-calling step). This process recreates the original sequence of As, Ts, Cs, and Gs for each short piece generated in the first step. Current electrophoresis limits are about 500 to 700 bases sequenced per read. Automated sequencers analyze the resulting electropherograms, and the output is a four-color chromatogram showing peaks that represent each of the four DNA bases, as shown in Figure 1.1.

The fluorescently labeled fragments that migrate through the gel are passed through a laser beam at the bottom of the gel. The laser excites the fluorescent molecule, which sends out light of a distinct color. That light is collected and focused by lenses into a spectrograph. Based on the wavelength, the spectrograph separates the light across a CCD (charge-coupled device) camera. Each base has its own color, so the sequencer can detect the order of the bases in the sequenced gene, as shown in Figure 1.2.

Figure 1.2: The Scanning and Detection System on the ABI Prism 377 Sequencer

After the bases are "read," computers are used to assemble the short sequences (in blocks of about 500 bases each, called the read length) into long continuous stretches that are analyzed for errors, gene-coding regions, and other characteristics. This work refers to the ABI Prism 377 sequencer shown in Figure 1.2.


Figure 1.3: A Snapshot of the Detection of the Molecules on the Sequencer

After the sequencer finishes its job, a window similar to Figure 1.3 is shown. Each dot and its color represent one of the A, C, T, and G codes. This image is then studied to produce a DNA sequence.

In the end, the DNA data are made available to the public to serve human needs. Figure 1.4 summarizes the DNA sequencing steps.


Figure 1.4: DNA Sequencing Work Flow Summary

1.2.2 Large-scale DNA Sequencing

The evolution of the Human Genome Project (HGP) promises that the cells of all organisms can be mapped. The human genome is about three billion (3,000,000,000) base pairs long (Collins et al., 2003); if the average fragment length is 500 bases, it would take a minimum of 6 million reads (3 billion / 500) to sequence the human genome (not allowing for overlap, i.e. 1-fold coverage). Keeping track of such a high number of sequences presents significant challenges that can be met only by developing and coordinating several procedural and computational algorithms, such as efficient database development and management.

Advancement of this knowledge will motivate further research towards completing other genome projects. Therefore, a huge database with a good algorithm will make such large-scale DNA sequencing reliable and achievable, without limitations.
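As a quick check of the arithmetic above, the minimum read count is simply the genome length divided by the read length; the following short Python sketch reproduces the figures quoted in the paragraph:

    # Minimum number of reads for 1-fold coverage of the human genome,
    # using the figures quoted above (3 billion bases, 500-base reads).
    genome_size = 3_000_000_000   # base pairs (Collins et al., 2003)
    read_length = 500             # bases per fragment

    min_reads = genome_size // read_length
    print(min_reads)              # 6000000, i.e. 6 million reads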

1.2.3 Benefits of Genome Research

Rapid progress in genome science and a glimpse into its potential applications have spurred observers to predict that biology will be the foremost science of the 21st century. Technology and resources generated by the Human Genome Project and other genomics research are already having a major impact on research across the life sciences. The potential for commercial development of genomics research presents U.S. industry with a wealth of opportunities, and sales of DNA-based products and technologies in the biotechnology industry are projected to exceed $45 billion by 2009 (Consulting Resources Corporation Newsletter, Spring 1999).

Technology and resources promoted by the HGP are starting to have profound impacts on biomedical research and promise to revolutionize the wider spectrum of biological research and clinical medicine. Increasingly detailed genome maps have aided researchers seeking genes associated with dozens of genetic conditions, including myotonic dystrophy, fragile X syndrome, neurofibromatosis types 1 and 2, inherited colon cancer, Alzheimer's disease, and familial breast cancer.

On the horizon is a new era of molecular medicine characterized less by treating symptoms and more by looking to the most fundamental causes of disease. Rapid and more specific diagnostic tests will make possible earlier treatment of countless maladies. Medical researchers also will be able to devise novel therapeutic regimens based on new classes of drugs, immunotherapy techniques, avoidance of environmental conditions that may trigger disease, and possible augmentation or even replacement of defective genes through gene therapy.

Other benefits include:

• Decoding of microbes
• Finding out about our potential weaknesses and problems
• Finding out about evolution and our links with life
• Helping to solve crimes
• Agricultural benefits

1.3 Motivation of the Research

The rapid advancement of next-generation DNA sequencers has been possible due to vast improvements in computer technology, specifically in speed and size. These new systems produce enormous amounts of data - one run could generate close to one terabyte of data - and bioinformatics and data-management tools have to play catch-up to enable the analysis and storage of these data.

Data management and storage will always be an issue for the life science and medical research industries, and is something that vendors will constantly have to improve to appease the research world. Luckily, there is hope for software vendors. Researchers will only begin to warm to the idea that next-generation technologies produce better data, and will provide time and cost savings, if there are adequate software applications to analyze the data.

However, no matter how much researchers spend on storage devices, the transmission problem will remain. Even transferring data between computers can take several hours for a 30 gigabyte file; what about terabytes of data?
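To make the transfer problem concrete, transfer time is roughly file size divided by effective bandwidth. A minimal Python sketch, assuming an illustrative 25 Mbit/s link (no link speed is stated in the text):

    # Rough transfer-time estimate: time = size / bandwidth.
    # The 25 Mbit/s effective link speed is an illustrative assumption.
    size_gb = 30                     # file size in (decimal) gigabytes
    bandwidth_mbps = 25              # megabits per second (assumed)

    size_bits = size_gb * 8 * 1000**3
    hours = size_bits / (bandwidth_mbps * 1e6) / 3600
    print(f"{hours:.1f} hours")      # about 2.7 hours for 30 GB

At this assumed speed, a one-terabyte run would take several days, which is the scale of the problem motivating compression.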

Therefore, specific compression techniques for DNA have been invented lately. Most of them use the LZ77 idea, because its dictionary function makes sequential data easy to compress. Many compression algorithms focus on schemes to shorten the process, enhance the compression ratio, and speed up the process. From BioCompress to the newest algorithm, Graph Compression, these studies concentrate on the compression of sequence data. Logically, if one sequence of nucleotides (GTACCTATG…) is compressed using any technique, its size will be reduced. For example, using the BioCompress algorithm on the CHNTXX sequence, the compression rate is 16.26% (Susan, 1998). For more details about existing DNA sequence compression, please refer to Chapter 2.
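The compression rates quoted here and later in the thesis can be computed from the original and compressed sizes. A minimal sketch, assuming the rate is defined as the space saved relative to the original size (the exact metric definitions are given in Chapter 4):

    # Compression percentage as space saved relative to the original size.
    # This definition is an assumption here; Chapter 4 defines the metrics.
    def compression_percentage(original_bits: int, compressed_bits: int) -> float:
        return 100.0 * (1 - compressed_bits / original_bits)

    # A sequence of 1,000,000 bases stored at 8 bits per character and
    # compressed to 2 bits per base (hypothetical figures):
    print(compression_percentage(8_000_000, 2_000_000))  # 75.0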

Based on the issue raised in the previous section, the Human Genome Project, terabytes of DNA sequence data are not suitable for non-specific DNA compression. Mostly, small sequences have been tested, and it has been proven that existing methods can solve the problem at that scale. Based on those experiments, however, the compression ratio becomes worse as the data grow bigger.

1.4 Statement of the Problem

i. The size of DNA sequence databases, and of the chains themselves, rises drastically in line with the advancement of sequencing technology. The storage problem will occur soon.

ii. Enhancements of LZ77 (LZSS) have always focused on general data, and for DNA sequences many researchers keep testing and experimenting on popular, small sequences instead of applying it to large-scale data.

iii. Huge data are not suitable for mobile-device usage and data transfer (Kwong and Ho, 2001). A good compression scheme for large-scale data must be implemented to support mobile technology.

iv. The transfer rate between two research centres (e.g. the National Center for Biotechnology Information in the United States and the Institute for Medical Research in Malaysia) must be enhanced to cater for knowledge transfer. Mostly, large-scale data take a lot of time to transfer.


1.5 Objectives of the Study

i. To find the best solution for large-scale DNA sequence compression. This research will be the first to focus on large-scale DNA sequences.

ii. To enhance LZ77 (LZSS) from a universal data compression scheme to suit the large-scale DNA sequence problem. This is viable because the characteristics of LZ77 are well matched to DNA sequences.

iii. To study the hash-table approach, which has been applied to many types of data (e.g. sequence data and images such as JPEG), and implement it in the LZ77 environment. This approach keeps the DNA sequence in computer memory while it is being compressed or decompressed (see the sketch after this list).

iv. To optimize the hash table to suit large-scale DNA sequence data with a suitable method. A hash table cannot achieve optimum performance if the data environment is not suitable for the hashing scheme.
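Objective iii rests on the idea that a hash table can map short substrings of a sequence to the positions where they occur, so candidate matches are found without rescanning the window. A minimal Python sketch of that idea, where the substring length k = 4 and the chaining scheme are illustrative assumptions rather than the design developed in Chapter 3:

    from collections import defaultdict

    # Index every k-mer of a DNA sequence to its start positions, so a
    # compressor can look up candidate match positions in O(1) on average.
    # k = 4 is an illustrative choice, not the thesis's tuned parameter.
    def build_kmer_index(seq: str, k: int = 4) -> dict:
        index = defaultdict(list)
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append(i)
        return index

    index = build_kmer_index("GTACCTATGGTAC")
    print(index["GTAC"])  # [0, 9]: both positions where this 4-mer occurs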

1.6 Scope of the Study

Compression of DNA sequences is a huge area within bioinformatics. Many weighted factors have been identified for compressing the sequences. Some approaches use the uniqueness of the DNA sequence or palindromes within a sequence; this research focuses only on similarities between the characters. However, the latest compression scheme, which uses dynamic programming, did not use any of these factors.

There are two types of sequence stored in the NCBI databases: the FASTA and binary formats. Sometimes bioinformatics applications need to use one or both formats. Unfortunately, existing DNA-specific compression focuses only on FASTA, which makes it easy for researchers to identify which DNA belongs to whom, unlike binary. On the other hand, binary is very good for data transmission: it tends to minimize size, which eases transfer. This research will use and analyze DNA sequence data in both the FASTA and binary (text sequence) formats.
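For illustration, a FASTA record is a '>' header line followed by sequence lines. A minimal reading sketch in Python; the record shown is invented for illustration, and real NCBI records carry accession identifiers:

    # Parse FASTA text into {header: sequence}. The record below is
    # invented for illustration.
    def parse_fasta(text: str) -> dict:
        records, header = {}, None
        for line in text.splitlines():
            line = line.strip()
            if line.startswith(">"):
                header = line[1:]
                records[header] = []
            elif header is not None:
                records[header].append(line)
        return {h: "".join(parts) for h, parts in records.items()}

    sample = ">example_record\nGTACCTATG\nGTAC\n"
    print(parse_fasta(sample))  # {'example_record': 'GTACCTATGGTAC'}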

In the modern world, many DNA databases (servers) reside in research centres, and some of them focus only on a certain part of the living world. For example, a bioinformatics database in Japan focuses on bacteria, while in Malaysia databases tend to store the DNA of crops such as Jatropha and oil palm. Only one universal server supports all databases around the world: the National Center for Biotechnology Information (NCBI) in the United States. Bioinformaticians trust this server because of its ability to store multiple types of organisms, including humans. Following this trend, this research uses it as the primary data source.

The special characteristic highlighted in this research is the similarity among sequences. In computer science several compression algorithms have been introduced, and the algorithm that best suits the needs of this research is LZ77. It uses a sliding window and builds a dictionary to compare against future characters. Using the sliding-window technique, the compression work can be done without any mistake: the original data remain recoverable from the simplified representation.
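To make the sliding-window idea concrete, classic LZ77 scans the window for the longest match with the upcoming characters and emits (offset, length, next character) triples, from which the original text can be rebuilt exactly. A minimal, unoptimized Python sketch; the window and lookahead sizes are illustrative, and this is plain LZ77 rather than the sectioned, hash-assisted variant developed in this thesis:

    # Plain LZ77 factorization into (offset, length, next_char) triples.
    # Window/lookahead sizes are illustrative; the enhanced variant in this
    # thesis sections the window and replaces the linear scan with a hash table.
    def lz77_compress(data: str, window: int = 32, lookahead: int = 8):
        i, out = 0, []
        while i < len(data):
            best_off = best_len = 0
            for j in range(max(0, i - window), i):   # scan the window
                length = 0
                while (length < lookahead and i + length < len(data) - 1
                       and data[j + length] == data[i + length]):
                    length += 1
                if length > best_len:
                    best_off, best_len = i - j, length
            out.append((best_off, best_len, data[i + best_len]))
            i += best_len + 1
        return out

    print(lz77_compress("GTACGTACGTAT"))
    # [(0, 0, 'G'), (0, 0, 'T'), (0, 0, 'A'), (0, 0, 'C'), (4, 7, 'T')]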


1.7 Thesis Outline

This section gives a general description of the contents of the subsequent chapters of this thesis. Chapter 2 reviews the various techniques for solving the DNA sequence compression problem. Chapter 3 describes the methodology adopted to achieve the objectives of this research. Chapter 4 discusses the construction of the algorithm, focusing on the enhancement of the hash table to suit large DNA sequences. Chapter 5 presents various experiments using several types of data and environments. Chapter 6 summarizes the findings of the research and future works.