Page 1: Similarity Analysis by  Data Compression

Similarity Analysis by Data Compression

Research carried out by Rudi Cilibrasi, Teemu Roos, Hannes Wettig.

CWI is the National Centre of Mathematics and Computer Science in the Netherlands.

CoSCo is the Complex Systems Computation Research Group.

Peter Grünwald, CWI, Amsterdam

Petri Myllymäki, University of Helsinki, CoSCo

Page 2: Similarity Analysis by  Data Compression

Data Compression…

• Consider two files A and B

• Let’s compress these with your favourite general-purpose data compressor, e.g. gzip

• Let L(A) and L(B) be the compressed lengths (in bits) of A and B, respectively

Page 4: Similarity Analysis by  Data Compression

…and Similarity

• Suppose we want to compress both A and B.

• We can either first compress A and then B

– Resulting length: L(A)+L(B)

• Or we can glue A and B together and compress the resulting file AB

– Resulting length: L(AB)

CLAIM: if (and only if) A and B are ‘similar’, then

L(AB) << L(A) + L(B)
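The claim can be tried out with a few lines of Python. This is a minimal sketch, not the presenters' code: it assumes zlib (the DEFLATE algorithm that gzip also uses) as the compressor, and `toy_text` is a hypothetical helper that generates stand-in "files" so the example runs without any input data.

```python
import random
import zlib


def compressed_bits(data: bytes) -> int:
    """Compressed length of `data` in bits, using zlib (DEFLATE) at level 9."""
    return 8 * len(zlib.compress(data, 9))


def toy_text(seed: int, n_chars: int = 3000) -> bytes:
    """Hypothetical stand-in for a file: pseudo-random lowercase letters."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz "
    return "".join(rng.choice(alphabet) for _ in range(n_chars)).encode("ascii")


a = toy_text(seed=1)
b = a[:1500] + b" a small edit " + a[1500:]  # B is A with a minor change
c = toy_text(seed=2)                          # C is unrelated to A

# Compress separately (L(A)+L(B)) versus glued together (L(AB)).
separate_similar = compressed_bits(a) + compressed_bits(b)
glued_similar = compressed_bits(a + b)
separate_unrelated = compressed_bits(a) + compressed_bits(c)
glued_unrelated = compressed_bits(a + c)

print("similar:   L(AB) =", glued_similar, " L(A)+L(B) =", separate_similar)
print("unrelated: L(AC) =", glued_unrelated, " L(A)+L(C) =", separate_unrelated)
```

For the similar pair the glued length should come out far below the sum, while for the unrelated pair the two numbers should be nearly equal.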

Page 5: Similarity Analysis by  Data Compression

“Domain-Independent” Notion of Similarity

• Consider the same ASCII text in many different languages, e.g., the Declaration of Human Rights

– English close to German

– English reasonably close to French

– German farther from French

– All three far from, say, Polish

• Consider DNA of different species

– Human very close to Chimpanzee, somewhat less close to Gorilla, even less close to Baboon… and very far from Wheat

• Consider MIDI files of popular songs…

Page 6: Similarity Analysis by  Data Compression

Background

• For a given compressor with length function L, define the Normalized Compression Distance as

NCD(A, B) = ( L(AB) − min{L(A), L(B)} ) / max{L(A), L(B)}

• If L is taken to be Kolmogorov complexity, this becomes a “universal metric”

– essentially, whenever two objects are close according to some computable distance function, they will be close according to NCD as well

• For practical applications, use a computationally practical general-purpose compressor

– gzip, bzip, ppm etc.
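A minimal sketch of the distance with a practical compressor, assuming the usual NCD definition (glued length minus the smaller individual length, divided by the larger individual length) and zlib standing in for gzip. The helper `toy_text` and the sample inputs are hypothetical.

```python
import random
import zlib


def clen(data: bytes) -> int:
    """Compressed length in bytes; the unit cancels out in the NCD ratio."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance under the zlib length function."""
    lx, ly, lxy = clen(x), clen(y), clen(x + y)
    return (lxy - min(lx, ly)) / max(lx, ly)


def toy_text(seed: int, n_chars: int = 3000) -> bytes:
    """Hypothetical stand-in for a file: pseudo-random lowercase letters."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz "
    return "".join(rng.choice(alphabet) for _ in range(n_chars)).encode("ascii")


a = toy_text(1)
near = a.replace(b"e", b"E", 20)  # same text with 20 small changes
far = toy_text(2)                 # unrelated text

print("NCD(a, near) =", round(ncd(a, near), 3))
print("NCD(a, far)  =", round(ncd(a, far), 3))
```

With a real compressor the values are only approximately in [0, 1]: near 0 for almost identical inputs and near 1 for unrelated ones. Note that zlib only finds repeats within a 32 KB window, so for larger files a compressor with a longer memory (bzip2, PPM) is the safer choice.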

Page 7: Similarity Analysis by  Data Compression

Applications

• For a set of N possibly related files, compute the N² pairwise normalized compression distances

• To visualize, create a binary tree such that close objects are close to each other on the tree

– e.g. using the quartet puzzling method

You can do this at home!

You cannot do this at home!
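The first step, the pairwise distance matrix, is easy to sketch. The following is an illustration under assumptions rather than the CompLearn implementation: zlib is the compressor, `toy_file` generates hypothetical inputs, and a simple nearest-neighbour lookup stands in for the tree-building step, which is not attempted here.

```python
import random
import zlib
from typing import Dict, List, Tuple


def clen(data: bytes) -> int:
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    lx, ly, lxy = clen(x), clen(y), clen(x + y)
    return (lxy - min(lx, ly)) / max(lx, ly)


def distance_matrix(files: Dict[str, bytes]) -> Tuple[List[str], List[List[float]]]:
    """Return the file names and the full N x N matrix of NCD values."""
    names = list(files)
    matrix = [[ncd(files[p], files[q]) for q in names] for p in names]
    return names, matrix


def nearest_neighbour(names: List[str], matrix: List[List[float]], name: str) -> str:
    """Name of the closest other file (not a tree, just a sanity check)."""
    i = names.index(name)
    others = [j for j in range(len(names)) if j != i]
    return names[min(others, key=lambda j: matrix[i][j])]


def toy_file(seed: int, n_bytes: int = 2000) -> bytes:
    """Hypothetical input: pseudo-random bytes from a fixed seed."""
    rng = random.Random(seed)
    return bytes(rng.randrange(256) for _ in range(n_bytes))


x1, y1 = toy_file(1), toy_file(2)
files = {
    "x1": x1,
    "x2": x1[:1000] + b"###" + x1[1000:],  # lightly edited copy of x1
    "y1": y1,
    "y2": y1[:500] + y1[520:],             # y1 with a short omission
}
names, matrix = distance_matrix(files)
for name, row in zip(names, matrix):
    print(name, [round(d, 2) for d in row])
```

The matrix is computed in full because NCD with a real compressor is not exactly symmetric; halving the work by assuming symmetry is a common shortcut.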

Page 8: Similarity Analysis by  Data Compression

Pump-Priming

Pre-Pump Priming:

– Theory developed and tested on several data sets at CWI; featured in New Scientist, Pour La Science, Izvestija…

– Successes include: the SARS virus identified as a coronavirus

Pump Priming:

1. Development of popular Open-Source Package CompLearn (www.complearn.org, Rudi Cilibrasi)

2. Application of CompLearn and other compression-based methods to stemmatology

Page 9: Similarity Analysis by  Data Compression

Compression-Based Methods in Stemmatic Analysis

Legend of St. Henry of Finland, Manuscript H, Helsinki University Library

Page 10: Similarity Analysis by  Data Compression

Before Gutenberg...

• Historical manuscripts were repeatedly copied by hand

• Typical ‘errors’ include misspellings, omissions, changes of word order, etc.

Page 11: Similarity Analysis by  Data Compression

Manuscript Evolution

• The texts spread out in a number of copies, following a tree-like graph

• Typically only a fraction of the manuscripts survive to the present day

Page 12: Similarity Analysis by  Data Compression

Stemmatic Analysis

• Stemmatology: “Discipline that attempts to reconstruct the transmission of a text on the basis of relations between the various surviving manuscripts.”

• Cf. Phylogenetics: “The study of evolutionary relatedness among various groups of organisms.”

Stemmatology       Phylogenetics

manuscript         individual

written text       DNA

copying            reproduction

modification       mutation

‘contamination’    horizontal transfer

Page 13: Similarity Analysis by  Data Compression

Compression-Based Approach

• Most existing approaches (distance-based methods, parsimony methods, Bayesian methods, etc.) are based on methods developed for biological phylogeny

• Pascal pump priming: a compression-based approach for stemmatic analysis

• Cost function: the amount of information required to describe B given A
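The slides do not say how this cost is computed. One common approximation, assumed here purely for illustration, is the extra compressed length that B adds once A has been seen: L(AB) − L(A). The sketch below uses zlib, and the "manuscripts" are hypothetical random byte strings.

```python
import random
import zlib


def clen_bits(data: bytes) -> int:
    """Compressed length in bits under zlib at level 9."""
    return 8 * len(zlib.compress(data, 9))


def cost_given(b: bytes, a: bytes) -> int:
    """Approximate bits needed to describe b once a is known: L(AB) - L(A).

    This is an assumed approximation, not necessarily the cost function
    used in the project described on the slide.
    """
    return clen_bits(a + b) - clen_bits(a)


def toy_manuscript(seed: int, n_bytes: int = 2000) -> bytes:
    """Hypothetical manuscript text: pseudo-random bytes from a fixed seed."""
    rng = random.Random(seed)
    return bytes(rng.randrange(256) for _ in range(n_bytes))


exemplar = toy_manuscript(1)
copy = exemplar[:700] + exemplar[730:]  # a copy with one omission
unrelated = toy_manuscript(2)

print("cost(copy | exemplar)      =", cost_given(copy, exemplar))
print("cost(unrelated | exemplar) =", cost_given(unrelated, exemplar))
print("cost(copy) on its own      =", clen_bits(copy))
```

A direct copy with a few scribal errors should be cheap to describe given its exemplar, whereas an unrelated text costs about as much as compressing it from scratch.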

Page 14: Similarity Analysis by  Data Compression

Constructing the stemma

• Dynamic programming for handling the missing nodes

• With 52 existing documents, the number of possible trees is about 2.7 × 10^78, so an exhaustive search is infeasible → simulated annealing search
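The slide does not state which kind of tree is being counted. Assuming it means unrooted binary trees with 52 labelled leaves, the standard count is the double factorial (2n − 5)!! = 1 · 3 · 5 · … · (2n − 5), which reproduces a figure of this size:

```python
def unrooted_binary_trees(n_leaves: int) -> int:
    """Number of unrooted binary trees with n labelled leaves: (2n - 5)!!.

    Assumption: this is the quantity the slide refers to; the slide itself
    does not give the formula.
    """
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product for n <= 3).
    for k in range(3, 2 * n_leaves - 4, 2):
        count *= k
    return count


print(unrooted_binary_trees(4))              # 3 possible trees for 4 leaves
print(f"{unrooted_binary_trees(52):.1e}")    # about 2.7e78 for 52 leaves
```

Python integers are arbitrary precision, so the exact count is available without overflow.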

Page 15: Similarity Analysis by  Data Compression

How Does It Work?

• Actually, surprisingly well!

• In Helsinki, we have started a 2-year project with historians, funded by the Emil Aaltonen Foundation, to study this approach further

Page 16: Similarity Analysis by  Data Compression

The Pascal Computer-Assisted Stemmatology Challenge

• Data set #1: Heinrichi data, collected specifically for this challenge

• Data set #2: The Parzival data. The text is the beginning of the German poem Parzival by Wolfram von Eschenbach (translated into English by A.T. Hatto). Data kindly provided to us by M. Spencer and H. F. Windram

• Data set #3: Notre Besoin. The text is from Stig Dagerman's Notre besoin de consolation est impossible à rassasier, Paris: Actes Sud, 1952 (translated into French from Swedish by P. Bouquet). Data kindly provided to us by Caroline Macé

Page 17: Similarity Analysis by  Data Compression

Challenge results

• No clear overall winner over all data sets

• CompLearn performed very well on Parzival, but poorly on Heinrichi. Why? More research is required

• Nice side result: the Heinrichi data set is internationally quite unique → a platform for future collaboration with other sciences?

Page 18: Similarity Analysis by  Data Compression

Future work

• Analysis of Challenge results

– New Challenge?

• Application to the Finnish Cultural Foundation to fund a two-year European research network on stemmatology

– built around a series of 4-5 international workshops gathering top experts of the field

– names in the application represent various disciplines including historical studies, theology, philology, computer science, mathematics and biology

• Workshop on information-theoretic approaches to modeling in Helsinki?

– July 2008, during ICML, UAI & COLT
