DNA Compression (Encoded using Huffman Encoding Method)


Marwa K. Al-Rikaby
University of Babylon, College of Information Technology

DNA
One of the building blocks of living organisms. It consists of four chemical bases:

Adenine (A), Thymine (T), Cytosine (C), Guanine (G).

DNA bases pair up with each other, A with T and C with G, to form units called base pairs.

Human DNA contains around 3 billion bases, and about 99% of them are the same in any two people.

DNA Compression Basics
Goal: support analysis and save storage space and time.
DNA sequences are built from the alphabet {A, T, C, G} and contain many repeats, which are usually approximate rather than exact.

Only lossless algorithms are valid.
A DNA compression model should preferably:

• Be based on biological knowledge.
• Achieve compression.
• Be simple, with few parameters.
• Give the per-symbol information content.
• Be an efficient algorithm.

How to compress DNA?
Since DNA sequences contain only the four bases {a, c, g, t}, they can be stored using two bits per input symbol.
Standard compression tools, such as gzip and bzip2, usually fail to improve on this baseline because they end up using more than two bits per symbol.

When compressing 229354 bases (57338 bytes at two bits per base), we get:

• HEHCMVCG: 57338 bytes (without compression).
• gzip: 66741 bytes (negative compression).
• bzip2: 62169 bytes (negative compression).
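The two-bits-per-base baseline can be illustrated with a minimal Python sketch. The helper names `pack` and `unpack` and the particular base-to-bits mapping are assumptions made for this illustration; the slides do not prescribe them.

```python
# Hypothetical mapping; any fixed assignment of 2-bit values to bases works.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq: str) -> bytes:
    """Pack a DNA string into 2 bits per base, 4 bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base.upper()]
        # Left-align a short final chunk so unpacking reads from the top bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, n_bases: int) -> str:
    """Recover the first n_bases bases from packed data."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:n_bases])

seq = "AACGACTAGTAATTTG"
packed = pack(seq)
print(len(seq), "bases ->", len(packed), "bytes")
```

The number of bases has to be stored alongside the packed bytes, because the last byte may be only partly filled.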

In the case of multiple genomes from the same species, as produced by ‘resequencing’ technologies, the flat text file approach is clearly wasteful, since for the most part the sequences are identical.

A simple approach is to store a reference sequence and then, for each other sequence, encode only the differences (or ‘deltas’) with respect to the reference.

Consider the sequences AACGACTAGTAATTTG and CACGTCTAGTAATGTG, which are identical except for substitutions at positions 1 (A→C), 5 (A→T) and 14 (T→G). Each single-nucleotide polymorphism (SNP) can be encoded by a pair (i, X), where i is an integer encoding the position and X is the substituted base relative to the reference.
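The (i, X) encoding of the example can be sketched in Python as follows. The function names `encode_deltas` and `apply_deltas` are hypothetical, and the sketch assumes substitutions only, so both sequences must have the same length.

```python
def encode_deltas(reference: str, other: str) -> list[tuple[int, str]]:
    """Return (position, base) pairs where `other` differs from `reference`.

    Positions are 1-based, matching the example in the text.
    """
    if len(reference) != len(other):
        raise ValueError("this sketch only handles substitutions")
    return [
        (i, b)
        for i, (a, b) in enumerate(zip(reference, other), start=1)
        if a != b
    ]

def apply_deltas(reference: str, deltas: list[tuple[int, str]]) -> str:
    """Rebuild the other sequence from the reference and its deltas."""
    bases = list(reference)
    for pos, base in deltas:
        bases[pos - 1] = base
    return "".join(bases)

ref = "AACGACTAGTAATTTG"
alt = "CACGTCTAGTAATGTG"
deltas = encode_deltas(ref, alt)
print(deltas)  # [(1, 'C'), (5, 'T'), (14, 'G')]
```

Calling `apply_deltas(ref, deltas)` gives back the second sequence, so only the reference and the three pairs need to be stored.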

Although the basic idea is easy to understand, and not new, a precise implementation requires addressing a number of important technical issues:

• One can use local relative addresses, i.e. intervals, rather than absolute addresses. Using intervals, the above example ‘1C5T14G’ becomes ‘0C4T9G’. With intervals, the dynamic range of the integers to be encoded may be considerably smaller than with absolute addresses. The relatively modest price to pay is that intervals must be added up to recover absolute coordinates.

• If the positions at which variations occur in the population are fixed and form a relatively small subset of all possible positions, then additional savings may result from focusing only on those positions.

• The choice of the reference sequence.
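The conversion between absolute addresses and intervals mentioned in the first point can be sketched as below. The function names are hypothetical, and the `start=1` default is an assumption chosen so that the positions 1, 5, 14 map to 0, 4, 9 as in the example.

```python
def to_intervals(positions, start=1):
    """Convert sorted absolute positions to gaps from the previous position."""
    intervals = []
    prev = start
    for pos in positions:
        intervals.append(pos - prev)
        prev = pos
    return intervals

def to_positions(intervals, start=1):
    """Add the gaps back up to recover absolute positions."""
    positions = []
    pos = start
    for gap in intervals:
        pos += gap
        positions.append(pos)
    return positions

print(to_intervals([1, 5, 14]))   # [0, 4, 9]
print(to_positions([0, 4, 9]))    # [1, 5, 14]
```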

All applications of the basic ideas hinge on a fundamental technical problem: how to encode integers, representing for instance absolute or relative genomic addresses or read lengths, into binary strings?

We are interested in binary encoding schemes for sequences of integers that can be parsed automatically and that, consistent with information theory, are entropy efficient, in the sense that fewer bits are used to encode more frequent events.
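One well-known self-delimiting integer code is the Elias gamma code. The slides do not name a specific scheme, so it is used here only as an illustration: it can be parsed without separators and gives shorter codes to smaller integers, which pays off when small intervals are the most frequent. The function names are hypothetical.

```python
def gamma_encode(n: int) -> str:
    """Elias gamma code of n >= 1: (len-1) zeros followed by n in binary."""
    if n < 1:
        raise ValueError("Elias gamma is defined for integers >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def gamma_decode_all(bits: str) -> list[int]:
    """Parse a concatenation of gamma codes (assumes well-formed input)."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values

intervals = [0, 4, 9]
# Gamma codes start at 1, so each interval is shifted by one before encoding.
stream = "".join(gamma_encode(n + 1) for n in intervals)
decoded = [n - 1 for n in gamma_decode_all(stream)]
print(stream)   # 1001010001010
print(decoded)  # [0, 4, 9]
```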

Common components of most DNA compression algorithms:

1. Finding the candidate repeat segments.
2. Considering approximate repeats.
3. Selecting the best subset of compatible repeats.
4. Encoding of the repeat segments.
5. Encoding of the non-repeat segments.

Suppose we have the following DNA sequence:

TGATAGGTGATAGATATGATTGATAGATGATAGAAGATTGATAGATGATAGGGATATCACGTAGTCCCTAGCTCTTGGCGCTGGATGGGGCGGACGGTAAGGGAAATCGACCGTTGATAGTCCAAATTCGGTCGTATGATAGAAATTTCGAATGGAAATTCTGATACATAGGTGATAGTAGATGTAAGATGATAGATGATAGATAGATAGATGATAGACAGATTGATAGATGATAGAGAGA

1. Finding the candidate repeat segments:
Let “TGATAG” be a candidate segment; we then find its repetitions in the example sequence.


The total number of “TGATAG” occurrences is 14.

The repetitions of every candidate segment should be found in this way.

The counts are kept for use in the encoding.
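Counting the exact repetitions of a candidate segment can be sketched as a simple string scan. The function name is hypothetical, and for brevity the sketch runs on a prefix of the example sequence rather than the whole of it.

```python
def count_occurrences(sequence: str, segment: str) -> int:
    """Count exact occurrences of segment, allowing overlaps."""
    count = 0
    start = sequence.find(segment)
    while start != -1:
        count += 1
        start = sequence.find(segment, start + 1)
    return count

# Prefix of the example sequence, used here to keep the sketch short.
prefix = "TGATAGGTGATAGATATGATTGATAGATGATAGAAGATTGATAGATGATAGGG"
print(count_occurrences(prefix, "TGATAG"))  # 6
```

Running the same function on the full example sequence is how the count for the whole sequence would be obtained.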

2. Considering approximate repeats:
Scan the sequence for similar segments, i.e. segments that become identical after applying one of four basic operations (positions are counted from 0):

• Insertion: “AAATTCG” becomes “AAATTCTG” after Ins(T,6).
• Deletion: “AAATTCG” becomes “AAATTG” after Del(5,1).
• Replacement: “AAATTCG” becomes “AATTTCG” after Rep(2,T).
• Reverse: “AAATTCG” becomes “GCTTAAA” after Rev().
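The four operations can be written as small Python string functions. The function names and argument order are assumptions modelled on the Ins, Del, Rep and Rev notation above, with 0-based positions as implied by the examples.

```python
def ins(segment: str, base: str, pos: int) -> str:
    """Ins(base, pos): insert base before index pos."""
    return segment[:pos] + base + segment[pos:]

def delete(segment: str, pos: int, length: int = 1) -> str:
    """Del(pos, length): remove `length` bases starting at index pos."""
    return segment[:pos] + segment[pos + length:]

def rep(segment: str, pos: int, base: str) -> str:
    """Rep(pos, base): replace the base at index pos."""
    return segment[:pos] + base + segment[pos + 1:]

def rev(segment: str) -> str:
    """Rev(): reverse the segment."""
    return segment[::-1]

print(ins("AAATTCG", "T", 6))   # AAATTCTG
print(delete("AAATTCG", 5, 1))  # AAATTG
print(rep("AAATTCG", 2, "T"))   # AATTTCG
print(rev("AAATTCG"))           # GCTTAAA
```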

Let “ATATGA” be a reference segment. In the example sequence:

• “ATATCA” matches “ATATGA” after replacing “G” by “C”.
• “ATAGA” matches “ATATGA” after deleting the “T” at position 3.
• “GGCGG” matches “GGCGC” after replacing “C” by “G” at position 4.
• “AATGG” matches “GGTAA” after reversing it.

3. Selecting the best subset of compatible repeats:
Choosing the reference segments is a major and very sensitive step: the design of the reference sequence affects not only the variants to be recorded but also the intervals, so it must also take into account any constraints a particular implementation places on the intervals and their encodings.

In our example, each detected segment is assigned an integer index into the reference table:

Segment   Index
A         0
T         1
C         2
G         3
TGATAG    4
ATATGA    5
AAATTCG   6
GGTAA     7
GGCGC     8
RepC      9
Del       10
InsT      11
Rev       12
RepG      13
RepT      14

The reference table contains:

• The four basic symbols {A, T, G, C}.

• The candidate segments.

• The basic operations, each with the parameters applied to the sequence.

4. Encoding of the repeat segments:
First, the repetitions of each candidate segment are counted in the same way as shown in step 1. The single bases and the operations have no counts at this stage.

Segment   Index   Repetitions
TGATAG    4       14
ATATGA    5       1
AAATTCG   6       1
GGTAA     7       1
GGCGC     8       1

5. Encoding of the non-repeat segments:
The approximate segments are processed in the same way as shown in step 2: an approximate match adds one to the count of its reference segment and one to the count of the operation applied. The remaining single bases A, T, C and G are then counted. The resulting table is:

Segment   Index   Repetitions
A         0       28
T         1       19
C         2       14
G         3       25
TGATAG    4       14
ATATGA    5       3
AAATTCG   6       4
GGTAA     7       2
GGCGC     8       2
RepC      9       1
Del       10      2
InsT      11      2
Rev       12      1
RepG      13      1
RepT      14      1

Encoding by Huffman method

First, find the probability of each segment:

Segment   Index   Repetitions   Probability
A         0       28            28/119
T         1       19            19/119
C         2       14            14/119
G         3       25            25/119
TGATAG    4       14            14/119
ATATGA    5       3             3/119
AAATTCG   6       4             4/119
GGTAA     7       2             2/119
GGCGC     8       2             2/119
RepC      9       1             1/119
Del       10      2             2/119
InsT      11      2             2/119
Rev       12      1             1/119
RepG      13      1             1/119
RepT      14      1             1/119

Total number of segment occurrences = 119

Arrange the segments in non-decreasing order of probability:

Index   Probability
14      1/119
13      1/119
12      1/119
9       1/119
11      2/119
10      2/119
8       2/119
7       2/119
5       3/119
6       4/119
4       14/119
2       14/119
1       19/119
3       25/119
0       28/119
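The probabilities and the ordering can be reproduced with a short Python sketch. The dictionary name `counts` is an assumption; its values are the repetition counts from the table. Ties are kept in table order here, so segments with equal probability may appear in a different order than on the slide.

```python
from fractions import Fraction

# Repetition counts taken from the reference table.
counts = {
    "A": 28, "T": 19, "C": 14, "G": 25,
    "TGATAG": 14, "ATATGA": 3, "AAATTCG": 4, "GGTAA": 2, "GGCGC": 2,
    "RepC": 1, "Del": 2, "InsT": 2, "Rev": 1, "RepG": 1, "RepT": 1,
}

total = sum(counts.values())
probabilities = {seg: Fraction(n, total) for seg, n in counts.items()}

# sorted() is stable, so equal counts keep their table order.
ordered = sorted(counts, key=lambda seg: counts[seg])

for seg in ordered:
    # Printed unreduced; Fraction itself would show 28/119 as 4/17.
    print(f"{seg}: {counts[seg]}/{total}")
```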

Build the Huffman coding tree: repeatedly merge the two nodes with the smallest probabilities into a new node whose probability is their sum. The internal nodes are created in this order:

2/119, 2/119, 4/119, 4/119, 4/119, 7/119, 8/119, 11/119, 19/119, 28/119, 38/119, 53/119, 66/119, 119/119 (the root).
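The tree construction can be sketched with Python's `heapq` module. This is a minimal illustration rather than the slides' own implementation: the function name is hypothetical, only code lengths are tracked instead of an explicit tree, and ties are broken by insertion order, so individual code lengths for equally frequent segments may differ from the slides while the total encoded size stays the same.

```python
import heapq
from itertools import count

counts = {
    "A": 28, "T": 19, "C": 14, "G": 25,
    "TGATAG": 14, "ATATGA": 3, "AAATTCG": 4, "GGTAA": 2, "GGCGC": 2,
    "RepC": 1, "Del": 2, "InsT": 2, "Rev": 1, "RepG": 1, "RepT": 1,
}

def huffman_code_lengths(counts):
    """Return (code length per segment, weights of the merged nodes)."""
    tiebreak = count()  # unique counter so heap entries never compare lists
    heap = [(n, next(tiebreak), [seg]) for seg, n in counts.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    merged_weights = []
    while len(heap) > 1:
        w1, _, segs1 = heapq.heappop(heap)
        w2, _, segs2 = heapq.heappop(heap)
        # Every leaf under the new node moves one level deeper.
        for seg in segs1 + segs2:
            lengths[seg] += 1
        merged_weights.append(w1 + w2)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), segs1 + segs2))
    return lengths, merged_weights

lengths, merged_weights = huffman_code_lengths(counts)
total_bits = sum(counts[seg] * lengths[seg] for seg in counts)
print(merged_weights)
print(total_bits)
```

The printed weights are the numerators of the internal nodes, and `total_bits` is the size of the encoded segment stream in bits.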

Next, starting from the root and working back to the leaves, label the branch to the child with the smaller value 0 and the branch to the child with the larger value 1.

Finally, encode each segment by reading its code along the path from the root to its leaf. For example:

Code(0) = code(A) = 10
Code(3) = code(G) = 00
Code(9) = code(RepC) = 1100111

The final reference table is:

Segment   Index   Repetitions   Probability   Code
A         0       28            28/119        10
T         1       19            19/119        111
C         2       14            14/119        011
G         3       25            25/119        00
TGATAG    4       14            14/119        010
ATATGA    5       3             3/119         110110
AAATTCG   6       4             4/119         110111
GGTAA     7       2             2/119         110101
GGCGC     8       2             2/119         110100
RepC      9       1             1/119         1100111
Del       10      2             2/119         110001
InsT      11      2             2/119         110000
Rev       12      1             1/119         1100110
RepG      13      1             1/119         1100101
RepT      14      1             1/119         1100100

Keep in mind that only the segments and the codes are important for the decoder.

The previous coding satisfies both the prefix property and information theory, in that:
• No segment's code is a prefix of another segment's code.
• The shortest codes are given to the most frequent segments, while the longer ones are assigned to the less frequent segments.
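Both properties can be checked on the final table with a few lines of Python. The helper names `is_prefix_free` and `decode` are hypothetical; the codes and counts are copied from the reference table.

```python
codes = {
    "A": "10", "T": "111", "C": "011", "G": "00", "TGATAG": "010",
    "ATATGA": "110110", "AAATTCG": "110111",
    "GGTAA": "110101", "GGCGC": "110100",
    "RepC": "1100111", "Del": "110001", "InsT": "110000",
    "Rev": "1100110", "RepG": "1100101", "RepT": "1100100",
}
counts = {
    "A": 28, "T": 19, "C": 14, "G": 25, "TGATAG": 14,
    "ATATGA": 3, "AAATTCG": 4, "GGTAA": 2, "GGCGC": 2,
    "RepC": 1, "Del": 2, "InsT": 2, "Rev": 1, "RepG": 1, "RepT": 1,
}

def is_prefix_free(codes):
    """After sorting, a prefix is always directly followed by a word it prefixes."""
    words = sorted(codes.values())
    return not any(b.startswith(a) for a, b in zip(words, words[1:]))

def decode(bits, codes):
    """Read bits left to right, emitting a segment whenever a code matches."""
    inverse = {code: seg for seg, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out

total_bits = sum(counts[seg] * len(codes[seg]) for seg in codes)
print(is_prefix_free(codes))               # True
print(total_bits, total_bits / sum(counts.values()))
print(decode("01000010", codes))           # ['TGATAG', 'G', 'TGATAG']
```

The prefix property is what lets `decode` split the bit stream without any separators between codes.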

Thank you
