Similarity Analysis by Data Compression

Research carried out by Rudi Cilibrasi, Teemu Roos, Hannes Wettig.

CWI is the National Centre of Mathematics and Computer Science in the Netherlands.

CoSCo is the Complex Systems Computation Research Group.

Peter Grünwald, CWI, Amsterdam

Petri Myllymäki, University of Helsinki, CoSCo

Data Compression…

• Consider two files A and B

• Let’s compress these with your favourite general-purpose data compressor, e.g. gzip

• Let L(A) and L(B) be the compressed lengths (in bits) of A and B, respectively

…and Similarity

• Suppose we want to compress both A and B.

• We can either first compress A and then B
– Resulting length: L(A)+L(B)

• Or we can glue A and B together and compress the resulting file AB
– Resulting length: L(AB)

CLAIM: if (and only if) A and B are ‘similar’, then

L(AB) << L(A) + L(B)
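The claim can be tried out with a small sketch. It assumes Python's zlib module (DEFLATE, the algorithm gzip uses) as the general-purpose compressor; the helper name `compressed_bits` and the generated inputs are illustrative assumptions, not data from the talk.

```python
import random
import zlib


def compressed_bits(data: bytes) -> int:
    """L(x): length in bits of x after zlib (DEFLATE) compression."""
    return 8 * len(zlib.compress(data, 9))


# Illustrative inputs (assumptions, not data from the talk):
# B is A with a small insertion; C is unrelated pseudo-random bytes.
rng = random.Random(0)
words = ["compress", "similar", "file", "data", "length", "tree", "text", "copy"]
a = " ".join(rng.choice(words) for _ in range(400)).encode()
b = a[:1500] + b" a small change " + a[1500:]
c = bytes(rng.getrandbits(8) for _ in range(len(a)))

results = {}
for name, other in (("similar", b), ("unrelated", c)):
    separate = compressed_bits(a) + compressed_bits(other)  # L(A) + L(B)
    joint = compressed_bits(a + other)                      # L(AB)
    results[name] = (separate, joint)
    print(f"{name}: L(A)+L(B) = {separate} bits, L(AB) = {joint} bits")
```

For the similar pair the joint length should come out well below the sum, while for the unrelated pair the two numbers should be close to each other.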

“Domain-Independent” Notion of Similarity

• Consider the same ASCII text in many different languages, e.g., the Declaration of Human Rights
– English close to German
– English reasonably close to French
– German farther from French
– All three far from, say, Polish

• Consider DNA of different species
– Human very close to Chimpanzee, somewhat less close to Gorilla, even less close to Baboon… and very far from Wheat

• Consider MIDI-files of popular songs…

Background

• For a given compressor with length function L, define the Normalized Compression Distance as

NCD(A, B) = (L(AB) − min{L(A), L(B)}) / max{L(A), L(B)}

• If L is taken to be Kolmogorov complexity, this becomes a “universal metric”
– essentially, whenever two objects are close according to some computable distance function, they will be close according to NCD as well

• For practical applications, use a computationally practical general-purpose compressor
– gzip, bzip, ppm, etc.
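A minimal sketch of NCD with off-the-shelf compressors, assuming the standard definition NCD(x, y) = (L(xy) − min{L(x), L(y)}) / max{L(x), L(y)} and using Python's zlib and bz2 modules as stand-ins for gzip and bzip. The function names and the generated texts are hypothetical, chosen only for illustration.

```python
import bz2
import random
import zlib


def ncd(x: bytes, y: bytes, compress=zlib.compress) -> float:
    """NCD(x, y) = (L(xy) - min(L(x), L(y))) / max(L(x), L(y))."""
    lx = len(compress(x))
    ly = len(compress(y))
    lxy = len(compress(x + y))
    return (lxy - min(lx, ly)) / max(lx, ly)


def random_text(seed: int, vocab: list, n_words: int = 400) -> bytes:
    """Hypothetical test data: a pseudo-random sequence of words."""
    rng = random.Random(seed)
    return " ".join(rng.choice(vocab) for _ in range(n_words)).encode()


text = random_text(1, ["king", "bishop", "legend", "saint", "church", "winter"])
near_copy = text.replace(b"bishop", b"bisshop", 3)  # a few small changes
unrelated = random_text(2, ["gene", "protein", "codon", "helix", "base", "cell"])

for label, compress in (("zlib", zlib.compress), ("bz2", bz2.compress)):
    print(label,
          f"near copy: {ncd(text, near_copy, compress):.2f}",
          f"unrelated: {ncd(text, unrelated, compress):.2f}")
```

Lengths are measured in bytes rather than bits here; the ratio is the same either way. With a real compressor the distance is only approximately symmetric and need not be exactly 0 for identical inputs.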

Applications

• For a set of N possibly related files, compute the N² pairwise normalized compression distances

• To visualize, create a binary tree such that close objects are close to each other on the tree
– e.g. using the quartet puzzling method
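The first step above, the table of pairwise distances, can be sketched as follows. This assumes zlib as the compressor and four hypothetical in-memory "files"; it stops at the distance table and a nearest-neighbour lookup, and does not attempt the quartet-based tree construction.

```python
import itertools
import random
import zlib


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance with zlib as the compressor."""
    lx, ly = len(zlib.compress(x)), len(zlib.compress(y))
    lxy = len(zlib.compress(x + y))
    return (lxy - min(lx, ly)) / max(lx, ly)


def make_text(seed: int, vocab: list, n_words: int = 300) -> str:
    """Hypothetical stand-in for a real file: pseudo-random words."""
    rng = random.Random(seed)
    return " ".join(rng.choice(vocab) for _ in range(n_words))


base1 = make_text(1, ["alpha", "beta", "gamma", "delta", "epsilon"])
base2 = make_text(2, ["north", "south", "east", "west", "centre"])
files = {
    "doc1a": base1.encode(),
    "doc1b": base1.replace("gamma", "GAMMA", 5).encode(),
    "doc2a": base2.encode(),
    "doc2b": base2.replace("east", "EAST", 5).encode(),
}

# All N x N ordered pairs, since compressor-based NCD is not exactly symmetric.
dist = {(p, q): ncd(files[p], files[q])
        for p, q in itertools.product(files, repeat=2)}

# For each file, the closest other file according to the table.
nearest = {p: min((q for q in files if q != p), key=lambda q: dist[p, q])
           for p in files}
print(nearest)
```

A tree-building method would take `dist` as its input; the nearest-neighbour lookup is only a quick sanity check that related files end up close together.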

You can do this at home!

You cannot do this at home!

Pump-Priming

Pre-Pump Priming:

– Theory developed and tested on several data sets at CWI; featured in New Scientist, Pour La Science, Izvestija…

– Successes include: SARS is CORONA

Pump Priming:

1. Development of popular Open-Source Package CompLearn (www.complearn.org, Rudi Cilibrasi)

2. Application of CompLearn and other compression-based methods to stemmatology



Compression-Based Methods in Stemmatic Analysis

Legend of St. Henry of Finland, Manuscript H, Helsinki University Library

Before Gutenberg...

• Historical manuscripts were repeatedly copied by hand

• Typical ‘errors’ include misspellings, omissions, changes of word order, etc.

Manuscript Evolution

• The texts spread out in a number of copies, following a tree-like graph

• Typically only a fraction of the manuscripts survive to the present day

Stemmatic Analysis

• Stemmatology: “Discipline that attempts to reconstruct the transmission of a text on the basis of relations between the various surviving manuscripts.”

• Cf. Phylogenetics: “The study of evolutionary relatedness among various groups of organisms.”

Stemmatology       Phylogenetics
manuscript         individual
written text       DNA
copying            reproduction
modification       mutation
‘contamination’    horizontal transfer

Compression-Based Approach

• Most existing approaches (distance-based methods, parsimony methods, Bayesian methods, etc.) are based on methods developed for biological phylogeny.

• Pascal pump-priming: a compression-based approach for stemmatic analysis

• Cost function: amount of information required to describe B given A.
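One way to make such a cost concrete is to approximate "information required to describe B given A" by L(AB) − L(A), the extra compressed length B adds once A is known. The sketch below assumes that approximation and zlib as the compressor; it is not necessarily the exact cost function used in the project, and the function names and texts are hypothetical.

```python
import random
import zlib


def L(data: bytes) -> int:
    """Compressed length in bytes, using zlib as the compressor."""
    return len(zlib.compress(data, 9))


def cost_given(a: bytes, b: bytes) -> int:
    """Approximate cost of describing b given a: L(ab) - L(a)."""
    return L(a + b) - L(a)


def random_text(seed: int, vocab: list, n_words: int = 400) -> bytes:
    """Hypothetical manuscript text: a pseudo-random sequence of words."""
    rng = random.Random(seed)
    return " ".join(rng.choice(vocab) for _ in range(n_words)).encode()


exemplar = random_text(1, ["sanctus", "episcopus", "rex", "populus", "terra", "fides"])
# A 'copy' with a misspelling and an omission (illustrative only).
copy = exemplar.replace(b"episcopus", b"episcopvs", 2).replace(b"terra ", b"", 1)
unrelated = random_text(2, ["ritter", "burg", "schwert", "wald", "pferd", "kampf"])

print("copy on its own:       ", L(copy))
print("copy given exemplar:   ", cost_given(exemplar, copy))
print("copy given unrelated:  ", cost_given(unrelated, copy))
```

Under this approximation a manuscript is cheap to describe given its exemplar and expensive given an unrelated text, which is the property a stemma cost built from such terms would rely on.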

Constructing the stemma

• Dynamic programming for handling the missing nodes

• With 52 existing documents, the number of trees is about 2.7 × 10⁷⁸, hence simulated annealing search
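The size of the search space can be checked with a short calculation. The sketch assumes the quoted figure counts unrooted bifurcating trees with 52 labelled leaves, for which the standard count is the double factorial (2n − 5)!! = 3 · 5 · … · (2n − 5); the function name is illustrative.

```python
def unrooted_binary_trees(n_leaves: int) -> int:
    """Number of unrooted bifurcating trees on n labelled leaves: (2n-5)!!"""
    count = 1
    for k in range(3, 2 * n_leaves - 4, 2):  # 3, 5, ..., 2n-5
        count *= k
    return count


n_trees = unrooted_binary_trees(52)
print(f"{n_trees:.3e}")
```

Under that assumption the count for 52 leaves is 99!!, which agrees with the figure of about 2.7 × 10⁷⁸ and shows why exhaustive enumeration is out of the question.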

How Does It Work?

• Actually, surprisingly well!

• In Helsinki, we have started a 2-year project with the historians, funded by the Emil Aaltonen Foundation, to study this approach further

The Pascal Computer-Assisted Stemmatology Challenge

• Data set #1: Heinrichi data, collected specifically for this challenge

• Data set #2: The Parzival data - text is the beginning of the German poem Parzival by Wolfram von Eschenbach (translated to English by A.T. Hatto). Data kindly provided to us by M. Spencer and H. F. Windram.

• Data set #3: Notre Besoin - text is from Stig Dagerman's Notre besoin de consolation est impossible à rassasier, Paris: Actes Sud, 1952 (translated to French from Swedish by P. Bouquet). Data kindly provided to us by Caroline Macé.

Challenge results

• No clear overall winner over all data sets

• CompLearn performed very well on Parzival, but poorly on Heinrichi. Why? More research is required

• Nice side result: the Heinrichi data set is internationally quite unique; a platform for future collaboration with other sciences?

Future work

• Analysis of Challenge results
– New Challenge?

• Application to the Finnish Cultural Foundation to fund a two-year European research network on stemmatology
– built around a series of 4-5 international workshops gathering top experts of the field
– names in the application represent various disciplines including historical studies, theology, philology, computer science, mathematics and biology

• Workshop on information-theoretic approaches to modeling in Helsinki?
– July 2008, during ICML, UAI & COLT
