97
Redescription Mining An Introduction Esther Galbrun Department of Computer Science, Boston University December, 2014

Redescription Mining - An Introduction€¦ · Chile United Kingdom Russia China Mexico United States 4/55. Example Redescription Mining — Countries outside the Americas with land

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Redescription MiningAn Introduction

Esther Galbrun

Department of Computer Science, Boston University December, 2014

2009-2013

2014

Example

RedescriptionsAn Example with World countries

Canada France MozambiqueChile United Kingdom RussiaChina Mexico United States

3/55

Example

RedescriptionsAn Example with World countries

— Countries outside the Americas with land area above8 billion square kilometers— Permanent members of the UN Security Council with ahistory of state communism

Canada France MozambiqueChile United Kingdom RussiaChina Mexico United States

4/55

Example

RedescriptionsAn Example with World countries

— Countries outside the Americas with land area above8 billion square kilometers— Permanent members of the UN Security Council with ahistory of state communism

Canada France MozambiqueChile United Kingdom RussiaChina Mexico United States

4/55

Example

RedescriptionsAn Example with World countries

— Countries outside the Americas with land area above8 billion square kilometers— Permanent members of the UN Security Council with ahistory of state communism

Canada France MozambiqueChile United Kingdom RussiaChina Mexico United States

4/55

Example

RedescriptionsAn Example with World countries

— Countries outside the Americas with land area above8 billion square kilometers— Permanent members of the UN Security Council with ahistory of state communism

Canada France MozambiqueChile United Kingdom RussiaChina Mexico United States

4/55

Example

RedescriptionsAn Example with World countries

— Countries outside the Americas with land area above8 billion square kilometers— Permanent members of the UN Security Council with ahistory of state communism

Canada France MozambiqueChile United Kingdom RussiaChina Mexico United States

4/55

Example

Redescription Mining

— Countries outside the Americas with land area above8 billion square kilometers— Permanent members of the UN Security Council with ahistory of state communism

Finding different waysto characterize the same things …

5/55

Example

Redescription Mining

Canada France MozambiqueChile United Kingdom RussiaChina Mexico United States

…finding multiple thingsthat share common characterizations

5/55

Example

RedescriptionsAn Example with World countries

Let’s get a bit more specific…

6/55

Example

Geographic attributes

Canada

South Hemisphere

No

A Border to the Atlantic Ocean

Yes

I Border to the Indian Ocean

No

P Border to the Pacific Ocean

Yes

Continent

Americas

Land area

9985M km2

Highest elevation

5959 m

7/55

Example

Geographic attributesCanada

South Hemisphere NoA Border to the Atlantic Ocean YesI Border to the Indian Ocean NoP Border to the Pacific Ocean Yes

Continent AmericasLand area 9985M km2

Highest elevation 5959 m

A I P

7/55

Example

Geographic attributes

A I P

8/55

Example

Geographic descriptions

A I P

9/55

Example

Geographic descriptions

A

ACountries borderingthe Atlantic Ocean

9/55

Example

Geographic descriptions

≤ ≤Countries with highest elevationbetween 2400 and 5600 meters

9/55

Example

Geographic descriptions

A

¬ AND ACountries in the North hemisphere

bordering the Atlantic Ocean

9/55

Example

Geopolitical attributes

History of state communismHistory of colonialismPermanent member of the UNSCType of governmentPopulation

10/55

Example

Geopolitical attributes

11/55

Example

Geopolitical descriptions

12/55

Example

Geopolitical descriptions

Countries with a history of statecommunism

12/55

Example

Geopolitical descriptions

=

Monarchies

12/55

Example

Geopolitical descriptions

ANDCountries with a history of

colonialism members of UNSC

12/55

Example

Two views on the objects

A I P

13/55

Example

Redescriptions

— Countries outside the Americas with land area above8 billion square kilometers— Permanent members of the UN Security Council with ahistory of state communism

Canada France MozambiqueChile United Kingdom RussiaChina Mexico United States

14/55

Example

Redescriptions

6= AND ≤ AND

15/55

Example

Definitions

Redescription Given two datasets with identity between therows, a redescription is a pair of queries(qL,qR) over the columns characterizingapproximately the same sets of rows.

Redescription Mining Given such a pair of datasets and aset of constraints, find the best redescriptionssatisfying the constraints.

16/55

Example

Definitions

A I P

Dataset Two data matrices

Queries Logical formulaeAccuracy Jaccard coefficient

17/55

Example

Definitions

A I P Dataset Two data matricesQueries Logical formulae

Accuracy Jaccard coefficient

6= AND ≤

AND

17/55

Example

Definitions

A I PDataset Two data matricesQueries Logical formulae

Accuracy Jaccard coefficient

J(qL,qR) =|supp(qL)∩supp(qR)||supp(qL)∪supp(qR)|

17/55

Example

Definitions

A I P Dataset Two data matricesQueries Logical formulae

Accuracy Jaccard coefficientConstraints Support, accuracy,

length of the query,p-value, …

17/55

Example

Special Cases

A I P

? AND ? ⇒ ? AND ? AND ?

? AND ? ⇐ ? AND ? AND ?

Only conjunctive queries:bi-directionalassociation rules

18/55

Example

Special Cases

A I P? ? ? ⇒ ( AND ) OR

One query given: classificationtask

18/55

Example

Special Cases

A I P

(6= AND ≤ ,

AND)

{ , }( A AND ≤ ,= OR ≤ ≤

){ , , , , }

19/55

Algorithms for Redescription Mining

Exploration Strategies

How do we find redescriptions?

20/55

Algorithms for Redescription Mining

Exploration Strategies

6= AND ≤A AND ≤

AND= AND ≤ ≤=

A I P

21/55

Algorithms for Redescription Mining

Exploration Strategies

6= AND ≤A AND ≤

AND= AND ≤ ≤=

6= AND ≤ = AND ≤ ≤

A I P

21/55

Algorithms for Redescription Mining

Exploration Strategies

Mine and pair/split

Atomic updates

Alternating scheme

HeuristicExhaustive

22/55

Algorithms for Redescription Mining

Exploration Strategies

Mine and pair/split

Atomic updates

Alternating scheme

HeuristicExhaustive

23/55

Algorithms for Redescription Mining

Exploration Strategies

(¬ A AND ¬ I

)OR

(A AND P

)

A I P

24/55

Algorithms for Redescription Mining

Exploration Strategies

(¬ A AND ¬ I

)OR

(A AND P

)OR ¬

A I P

24/55

Algorithms for Redescription Mining

Exploration Strategies

OR ¬

A I P

24/55

Algorithms for Redescription Mining

Exploration Strategies

(= AND A

)OR 6=

OR ¬

A I P

24/55

Algorithms for Redescription Mining

Exploration Strategies

Mine and pair/split

Atomic updates

Alternating scheme

HeuristicExhaustive

25/55

Algorithms for Redescription Mining

Exploration Strategies

Mine and pair/split

Atomic updates

Alternating scheme

HeuristicExhaustive

26/55

Algorithms for Redescription Mining

Exploration Strategies

=

A I P

27/55

Algorithms for Redescription Mining

Exploration Strategies

= AND ≤

A I P

27/55

Algorithms for Redescription Mining

Exploration Strategies

= OR ≤

A I P

27/55

Algorithms for Redescription Mining

Exploration Strategies

= AND A

A I P

27/55

Algorithms for Redescription Mining

Exploration Strategies

= OR A

A I P

27/55

Algorithms for Redescription Mining

Exploration Strategies

= OR

A I P

27/55

Algorithms for Redescription Mining

Exploration Strategies

= AND

A I P

27/55

Algorithms for Redescription Mining

Exploration Strategies

= OR ≤

A I P

27/55

Algorithms for Redescription Mining

Exploration Strategies

= OR ≤ OR

A I P

27/55

Algorithms for Redescription Mining

Exploration Strategies

= OR ≤ AND

A I P

27/55

Algorithms for Redescription Mining

Exploration Strategies

= OR ≤ AND

A I P

27/55

Algorithms for Redescription Mining

Exploration Strategies

Mine and pair/split

Atomic updates

Alternating scheme

HeuristicExhaustive

28/55

Algorithms for Redescription Mining

Related work

Redescription mining for Boolean data

29/55

Algorithms for Redescription Mining

Related workGeopolitical Boolean attributes

30/55

Algorithms for Redescription Mining

Related work

Turning CARTwheels:An Alternating Algorithmfor Mining Redescriptions.

N. Ramakrishnan, D. Kumar, B. Mishra, M. Potts,and R. F. Helm.In KDD, 2004.

Redescription Mining:Algorithms andApplications in Bioinformatics.

D. Kumar.PhD Thesis, Virginia Tech, 2007.

Query Languages

Propositional

CARTbranches

Exploration Strategies

Mine and pair/split

Atomic updates

Alternating scheme

HeuristicExhaustive

31/55

Algorithms for Redescription Mining

Related work

Redescription Mining:Structure Theory and Algorithms.

L. Parida and N. Ramakrishnan.In AAAI, 2005.

Query Languages

Monotoneconjunctive

Propositional

CARTbranches

Exploration Strategies

Mine and pair/split

Atomic updates

Alternating scheme

HeuristicExhaustive

32/55

Algorithms for Redescription Mining

Related work

Reasoning About SetsUsing Redescription Mining.

M. J. Zaki and N. Ramakrishnan.In KDD, 2005.

Query Languages

Monotoneconjunctive

Propositional

CARTbranches

Conditionalrules

Exploration Strategies

Mine and pair/split

Atomic updates

Alternating scheme

HeuristicExhaustive

33/55

Algorithms for Redescription Mining

Related work

Finding SubgroupsHaving Several Descriptions:Algorithms for Redescription Mining.

A. Gallo, P. Miettinen, and H. Mannila.In SDM, 2008.

Query Languages

Monotoneconjunctive

Propositional

CARTbranches

Conditionalrules

Exploration Strategies

Mine and pair/split

Atomic updates

Alternating scheme

HeuristicExhaustive

34/55

Algorithms for Redescription Mining

Related work

Finding SubgroupsHaving Several Descriptions:Algorithms for Redescription Mining.

A. Gallo, P. Miettinen, and H. Mannila.In SDM, 2008.

Query Languages

Linearlyparsable

Monotoneconjunctive

Propositional

CARTbranches

Conditionalrules

Exploration Strategies

Mine and pair/split

Atomic updates

Alternating scheme

HeuristicExhaustive

34/55

Algorithms for Redescription Mining

Own work

Selecting a good set of redescriptions

35/55

Algorithms for Redescription Mining

Own work

“MDL for Redescription Mining”

with Matthijs van Leeuwen,Under review.

Query Languages

Linearlyparsable

Monotoneconjunctive

Propositional

CARTbranches

Conditionalrules

Exploration Strategies

Mine and pair/split

Atomic updates

Alternating scheme

HeuristicExhaustive

36/55

Algorithms for Redescription Mining

Own work

Extending redescription mining to non-Boolean data

37/55

Algorithms for Redescription Mining

Own workGeopolitical Boolean attributes

38/55

Algorithms for Redescription Mining

Own workGeopolitical attributes

38/55

Algorithms for Redescription Mining

Related work

Finding SubgroupsHaving Several Descriptions:Algorithms for Redescription Mining.

A. Gallo, P. Miettinen, and H. Mannila.In SDM, 2008.

Query Languages

Linearlyparsable

Monotoneconjunctive

Propositional

CARTbranches

Conditionalrules

Exploration Strategies

Mine and pair/split

Atomic updates

Alternating scheme

HeuristicExhaustive

39/55

Algorithms for Redescription Mining

Own work

From Black and White to Full Color:Extending Redescription MiningOutside the Boolean Worldwith Pauli Miettinen,In Statistical Analysis and Data Mining, 2012.

Query Languages

Linearlyparsable

Monotoneconjunctive

Propositional

CARTbranches

Conditionalrules

Exploration Strategies

Mine and pair/split

Atomic updates

Alternating scheme

HeuristicExhaustive

40/55

Algorithms for Redescription Mining

Own work

Extending redescription mining to relational data

41/55

Algorithms for Redescription Mining

Own workGeopolitical attributes

42/55

Algorithms for Redescription Mining

Own workGeopolitical attributes and relations

French

Russian

English

Portuguese

Spanish

Chinese

Arabic

London

Bruxelles

Washington

New-York

Beijing

NATO

UN Security Council

EU

Commonwealth

OAS

member

membermember

language

language

language

language

language

language

language

language

language

language

capital

capital

capital

locatedin

located in

locatedin headquarters

headquarters

headquarters

headquarters

language

language

language

language

language

language

language

language

language

language

language

language

member

member

membermember

membermember

member

member

member

member

membermember

member

member

member

language

headquarters

locatedin

42/55

Algorithms for Redescription Mining

Own work

Finding Relational Redescriptions

with Angelika Kimmig,In Machine Learning, 2013.

Query Languages

Relational

Graphicalbinary

Linearlyparsable

Monotoneconjunctive

Propositional

CARTbranches

Conditionalrules

Exploration Strategies

Mine and pair/split

Atomic updates

Alternating scheme

HeuristicExhaustive

43/55

Example Redescriptions

Example Redescriptions

Computer science bibliography

Political candidates profilesBioclimatic nichesEthnology

Researchers with multiple publications in SoCG and CCCGconferences often collaborate with Profs M. Overmars orE. D. Demaine.

44/55

Example Redescriptions

Example Redescriptions

Computer science bibliographyPolitical candidates profiles

Bioclimatic nichesEthnology

Candidates to the 2011 Finnish parliamentary electionbelow age sixty accord little importance to the question ofpension indices.

44/55

Example Redescriptions

Example Redescriptions

Computer science bibliographyPolitical candidates profilesBioclimatic niches

Ethnology

Scandinavia and Baltia, which are characterized by theirspecific cold climate, are the habitat of the European Elk.

44/55

Example Redescriptions

Bioclimatic Niche Finding

Dataset: Spatial land areas of Europe (2575 entities)Presence/absence of mammals

(194 species)Climatic data

(48 temperature and rainfall variables)Question: Find a query over climatic variables that

describes the area inhabited by (a group of)mammal species (and vice versa)

45/55

Example Redescriptions

Bioclimatic Niche Finding

European Elk

([−9.80≤ tmaxFeb ≤ 0.40]∧ [12.20≤ tmax

Jul ≤ 24.60]∧[56.852≤ pavg

Aug ≤ 136.46])∨ [183.27≤ pavgSep ≤ 238.78]

J = 0.814 supp = 582 46/55

Example Redescriptions

Bioclimatic Niche Finding

Wood Mouse ∧ Natterer’s Bat ∧ Eurasian Pygmy Shrew

([3.20≤ tmaxMar ≤ 14.50]∧ [17.30≤ tmax

Aug ≤ 25.20]∧[14.90≤ tmax

Sep ≤ 22.80])∨ [19.60≤ tavgJul ≤ 19.956]

J = 0.623 supp = 681 47/55

Example Redescriptions

Example Redescriptions

Computer science bibliographyPolitical candidates profilesBioclimatic niches

Ethnology

Scandinavia and Baltia, which are characterized by theirspecific cold climate, are the habitat of the European Elk.

48/55

Example Redescriptions

Example Redescriptions

Computer science bibliographyPolitical candidates profilesBioclimatic nichesEthnology

Among Alyawarra, “Aleriya” refers to the son of a malespeaker or to the child of the speaker’s brother.

48/55

Example Redescriptions

Elicit Kinship Terminology

Dataset: Ethnographic information about AustralianAlyawarra tribe

Kinship terminologyGenealogic, age and sex informations

Question: Elicit the meaning of kinship terms

49/55

Example Redescriptions

Elicit Kinship Terminology

50/55

Example Redescriptions

Elicit Kinship Terminology

Dataset: Ethnographic information about AustralianAlyawarra tribe

Kinship terminologyGenealogic, age and sex informations

Question: Elicit the meaning of kinship terms

51/55

Example Redescriptions

Elicit Kinship Terminology

Aleriya#A #Z

#Z

male

#A #1spouse

parent

age <

female

#1

#2

#Z

#A

male

male

parent

parentparent

age <

g16.1

k16 g16.2

Aleriya is used to refer to one’s father or one’s brother’s child

52/55

The SIREN interface

The SIREN interface

Visualizing and interactively mining redescriptions

with Pauli Miettinen,In Instant Interactive Data Mining Workshop at ECML/PKDD, 2012.Demo at KDD 2012 and SIGMOD 2014.

53/55

The SIREN interface

The SIREN interface

54/55

The SIREN interface

The SIREN interface

54/55

The SIREN interface

The SIREN interface

54/55

The SIREN interface

The SIREN interface

54/55

The SIREN interface

The SIREN interface

54/55

The SIREN interface

The SIREN interface

54/55

Conclusion

Conclusion

Redescription Mining is a versatile andpowerful data-mining tool, applicable in various domains.

For more details:

[email protected]

http://www.cs.helsinki.fi/u/galbrun/

Thank you!

55/55

Conclusion

Conclusion

Redescription Mining is a versatile andpowerful data-mining tool, applicable in various domains.

For more details:

[email protected]

http://www.cs.helsinki.fi/u/galbrun/

Thank you!

55/55