Redescription Mining: An Introduction
Esther Galbrun
Department of Computer Science, Boston University
December 2014
Example
Redescriptions: An Example with World Countries
Canada, France, Mozambique, Chile, United Kingdom, Russia, China, Mexico, United States
- Countries outside the Americas with land area above 8 million square kilometers
- Permanent members of the UN Security Council with a history of state communism
Redescription Mining
Finding different ways to characterize the same things …
… finding multiple things that share common characterizations
Geographic attributes
Canada
Southern Hemisphere: No
A (Border to the Atlantic Ocean): Yes
I (Border to the Indian Ocean): No
P (Border to the Pacific Ocean): Yes
Continent: Americas
Land area: 9.985 million km²
Highest elevation: 5959 m
Geographic descriptions
2400 ≤ highest elevation ≤ 5600
Countries with highest elevation between 2400 and 5600 meters
¬ Southern Hemisphere AND A
Countries in the Northern Hemisphere bordering the Atlantic Ocean
Geopolitical attributes
History of state communism
History of colonialism
Permanent member of the UNSC
Type of government
Population
Definitions
Redescription: Given two datasets whose rows correspond to the same entities, a redescription is a pair of queries $(q_L, q_R)$, one over the columns of each dataset, that characterize approximately the same set of rows.
Redescription Mining: Given such a pair of datasets and a set of constraints, find the best redescriptions satisfying the constraints.
Definitions
Dataset: Two data matrices
Queries: Logical formulae
Accuracy: Jaccard coefficient
$J(q_L, q_R) = \frac{|\mathrm{supp}(q_L) \cap \mathrm{supp}(q_R)|}{|\mathrm{supp}(q_L) \cup \mathrm{supp}(q_R)|}$
Constraints: Support, accuracy, length of the query, p-value, …
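As a rough illustration of the definitions above, the sketch below computes supports and the Jaccard coefficient for the world-countries redescription. The attribute sets are an assumed encoding from general knowledge (not data read off the slides), the land-area threshold is read as 8 million km², and the names `supp`, `jaccard`, `q_left` and `q_right` are hypothetical helpers.

```python
# Entities: the nine countries from the running example.
COUNTRIES = ["Canada", "France", "Mozambique", "Chile", "United Kingdom",
             "Russia", "China", "Mexico", "United States"]

# Assumed Boolean attributes (general knowledge, not taken from the slides).
IN_AMERICAS = {"Canada", "Chile", "Mexico", "United States"}
AREA_ABOVE_8M_KM2 = {"Canada", "Russia", "China", "United States"}
UNSC_PERMANENT = {"France", "United Kingdom", "Russia", "China", "United States"}
STATE_COMMUNISM = {"Mozambique", "Russia", "China"}


def supp(query):
    """Support of a query: the set of entities for which it holds."""
    return {c for c in COUNTRIES if query(c)}


def jaccard(q_l, q_r):
    """J(qL, qR) = |supp(qL) ∩ supp(qR)| / |supp(qL) ∪ supp(qR)|."""
    s_l, s_r = supp(q_l), supp(q_r)
    union = s_l | s_r
    return len(s_l & s_r) / len(union) if union else 0.0


def q_left(c):
    # Geographic side: outside the Americas AND land area above the threshold.
    return c not in IN_AMERICAS and c in AREA_ABOVE_8M_KM2


def q_right(c):
    # Geopolitical side: permanent UNSC member AND history of state communism.
    return c in UNSC_PERMANENT and c in STATE_COMMUNISM


print(sorted(supp(q_left)), sorted(supp(q_right)), jaccard(q_left, q_right))
```

Under this encoding both queries select the same two countries, so the pair is a perfectly accurate redescription on this small dataset.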
Special Cases
? AND ? ⇒ ? AND ? AND ?
? AND ? ⇐ ? AND ? AND ?
Only conjunctive queries: bi-directional association rules
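To make the link with association rules concrete, here is a minimal sketch that treats a pair of conjunctive queries as two rules, one in each direction, and compares their confidences with the Jaccard coefficient. The supports are hypothetical row identifiers, and `confidence` and `jaccard` are illustrative helper names.

```python
def confidence(antecedent, consequent):
    """conf(X => Y) = |supp(X) ∩ supp(Y)| / |supp(X)|, with supports given as sets."""
    return len(antecedent & consequent) / len(antecedent) if antecedent else 0.0


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical supports (row ids) of two conjunctive queries.
supp_left = {1, 2, 3, 4}
supp_right = {2, 3, 4, 5, 6}

forward = confidence(supp_left, supp_right)    # rule qL => qR
backward = confidence(supp_right, supp_left)   # rule qR => qL
accuracy = jaccard(supp_left, supp_right)

print(forward, backward, accuracy)
```

Because the union of the two supports is at least as large as either support, the Jaccard coefficient can never exceed the smaller of the two confidences, so a high accuracy requires both rules to hold well.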
Algorithms for Redescription Mining
Exploration Strategies
[Figure: search over candidate queries built from conditions (≠, =, ≤) on the attributes, combined with AND]
Mine and pair/split
Atomic updates
Alternating scheme
Heuristic / Exhaustive
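One way to picture the atomic-updates strategy is a greedy loop that starts from the best pair of single attributes and repeatedly appends one literal to either query while the accuracy improves. The sketch below is a simplified illustration on Boolean data, not the algorithm of any particular paper; the datasets `LEFT` and `RIGHT`, the function names, and the left-to-right evaluation of queries are all assumptions, and attribute names are assumed to contain no spaces.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score(pair):
    """Accuracy of a pair of queries, each stored as (literals, support)."""
    return jaccard(pair[0][1], pair[1][1])


def extend(query, columns):
    """Yield every query obtained by appending one unused attribute with AND or OR."""
    literals, support = query
    used = {lit.split()[-1] for lit in literals}
    for name, rows in columns.items():
        if name not in used:
            yield literals + ("AND " + name,), support & rows
            yield literals + ("OR " + name,), support | rows


def greedy_atomic_updates(left, right):
    # Initial pair: the most accurate pair of single attributes.
    pairs = [(((a,), left[a]), ((b,), right[b])) for a in left for b in right]
    q_l, q_r = max(pairs, key=score)
    best = score((q_l, q_r))
    while True:
        # Try one atomic update on either side, keeping the other side fixed.
        candidates = [(cand, q_r) for cand in extend(q_l, left)]
        candidates += [(q_l, cand) for cand in extend(q_r, right)]
        if not candidates:
            break
        challenger = max(candidates, key=score)
        if score(challenger) <= best:
            break  # no update improves the accuracy
        (q_l, q_r), best = challenger, score(challenger)
    return " ".join(q_l[0]), " ".join(q_r[0]), best


# Hypothetical Boolean datasets: attribute name -> set of rows where it holds.
LEFT = {"a": {1, 2, 3}, "b": {4, 5}}
RIGHT = {"x": {1, 2, 3, 4, 5}, "y": {6}}

result = greedy_atomic_updates(LEFT, RIGHT)
print(result)
```

Being greedy, such a loop can miss redescriptions that require a temporarily worse intermediate query, which is the usual trade-off between heuristic and exhaustive exploration.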
Exploration Strategies
[Figure: example query of the form (¬A AND ¬I) OR (A AND P) OR ¬…]
Related work
Turning CARTwheels: An Alternating Algorithm for Mining Redescriptions.
N. Ramakrishnan, D. Kumar, B. Mishra, M. Potts, and R. F. Helm. In KDD, 2004.
Redescription Mining: Algorithms and Applications in Bioinformatics.
D. Kumar. PhD Thesis, Virginia Tech, 2007.
Query Languages
Propositional
CART branches
Related work
Redescription Mining: Structure Theory and Algorithms.
L. Parida and N. Ramakrishnan. In AAAI, 2005.
Query Languages
Monotone conjunctive
Propositional
CART branches
Related work
Reasoning About Sets Using Redescription Mining.
M. J. Zaki and N. Ramakrishnan. In KDD, 2005.
Query Languages
Monotone conjunctive
Propositional
CART branches
Conditional rules
Related work
Finding Subgroups Having Several Descriptions: Algorithms for Redescription Mining.
A. Gallo, P. Miettinen, and H. Mannila. In SDM, 2008.
Query Languages
Linearly parsable
Monotone conjunctive
Propositional
CART branches
Conditional rules
Own work
“MDL for Redescription Mining”
with Matthijs van Leeuwen. Under review.
Own work
Extending redescription mining to non-Boolean data
Own work
From Black and White to Full Color: Extending Redescription Mining Outside the Boolean World
with Pauli Miettinen. In Statistical Analysis and Data Mining, 2012.
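Moving beyond Boolean data means that a literal on a numerical attribute becomes an interval condition, such as a bound on highest elevation. The sketch below shows one naive way to choose such an interval: try every pair of observed values as bounds and keep the one whose support best matches a target set of rows. This brute-force search is only an illustration and is not claimed to be the method of the paper; the row identifiers, values, and the name `best_interval` are hypothetical.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_interval(values, target):
    """Return (low, high, J) maximizing the Jaccard coefficient between
    the rows with low <= value <= high and the target set of rows.

    values: dict mapping row id -> numerical value
    target: set of row ids (support of the query on the other side)
    """
    thresholds = sorted(set(values.values()))
    best = (None, None, 0.0)
    for i, low in enumerate(thresholds):
        for high in thresholds[i:]:
            support = {row for row, v in values.items() if low <= v <= high}
            accuracy = jaccard(support, target)
            if accuracy > best[2]:
                best = (low, high, accuracy)
    return best


# Hypothetical numerical column and target support.
elevation = {"r1": 1200, "r2": 2500, "r3": 4800, "r4": 5600, "r5": 6900}
target_rows = {"r2", "r3", "r4"}

interval = best_interval(elevation, target_rows)
print(interval)
```

The search is quadratic in the number of distinct values, which is acceptable for a toy column but would need bucketing or pruning on real data.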
Own work
Extending redescription mining to relational data
Own work
Geopolitical attributes and relations
[Figure: graph linking countries to languages (French, Russian, English, Portuguese, Spanish, Chinese, Arabic), cities (London, Bruxelles, Washington, New-York, Beijing) and organizations (NATO, UN Security Council, EU, Commonwealth, OAS); edge labels: member, language, capital, located in, headquarters]
Own work
Finding Relational Redescriptions
with Angelika Kimmig. In Machine Learning, 2013.
Query Languages
Relational
Graphical binary
Linearly parsable
Monotone conjunctive
Propositional
CART branches
Conditional rules
Example Redescriptions
Computer science bibliography: Researchers with multiple publications in the SoCG and CCCG conferences often collaborate with Profs M. Overmars or E. D. Demaine.
Political candidates profiles: Candidates in the 2011 Finnish parliamentary election under the age of sixty attach little importance to the question of pension indices.
Bioclimatic niches: Scandinavia and the Baltics, which are characterized by their specific cold climate, are the habitat of the European Elk.
Bioclimatic Niche Finding
Dataset: Spatial land areas of Europe (2575 entities)
Presence/absence of mammals (194 species)
Climatic data (48 temperature and rainfall variables)
Question: Find a query over climatic variables that describes the area inhabited by (a group of) mammal species (and vice versa)
Bioclimatic Niche Finding
European Elk
$([-9.80 \le t^{\max}_{\mathrm{Feb}} \le 0.40] \land [12.20 \le t^{\max}_{\mathrm{Jul}} \le 24.60] \land [56.852 \le p^{\mathrm{avg}}_{\mathrm{Aug}} \le 136.46]) \lor [183.27 \le p^{\mathrm{avg}}_{\mathrm{Sep}} \le 238.78]$
J = 0.814, supp = 582
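As a small illustration of how such a climatic query selects land areas, the sketch below encodes the European Elk query as a disjunction of conjunctions of interval conditions and evaluates it on a few cells. The interval bounds are copied from the query above; the variable names (`tmax_feb`, `pavg_sep`, and so on) are my own labels, and the three cells are invented values, not rows of the actual dataset.

```python
# Disjunction (outer tuple) of conjunctions (inner lists) of interval conditions.
# Each condition is (variable, lower bound, upper bound), both bounds inclusive.
ELK_CLIMATE_QUERY = (
    [("tmax_feb", -9.80, 0.40), ("tmax_jul", 12.20, 24.60), ("pavg_aug", 56.852, 136.46)],
    [("pavg_sep", 183.27, 238.78)],
)


def holds(query, cell):
    """True if at least one conjunction has all of its intervals satisfied."""
    return any(
        all(low <= cell[var] <= high for var, low, high in conjunction)
        for conjunction in query
    )


# Hypothetical cells (illustrative values only).
cold_cell = {"tmax_feb": -4.0, "tmax_jul": 20.0, "pavg_aug": 80.0, "pavg_sep": 60.0}
warm_cell = {"tmax_feb": 12.0, "tmax_jul": 30.0, "pavg_aug": 20.0, "pavg_sep": 40.0}
wet_cell = {"tmax_feb": 12.0, "tmax_jul": 30.0, "pavg_aug": 20.0, "pavg_sep": 200.0}

for name, cell in [("cold", cold_cell), ("warm", warm_cell), ("wet", wet_cell)]:
    print(name, holds(ELK_CLIMATE_QUERY, cell))
```

The support of the query is then simply the set of cells for which `holds` returns true, which can be compared with the cells where the species is present.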
Bioclimatic Niche Finding
Wood Mouse ∧ Natterer’s Bat ∧ Eurasian Pygmy Shrew
$([3.20 \le t^{\max}_{\mathrm{Mar}} \le 14.50] \land [17.30 \le t^{\max}_{\mathrm{Aug}} \le 25.20] \land [14.90 \le t^{\max}_{\mathrm{Sep}} \le 22.80]) \lor [19.60 \le t^{\mathrm{avg}}_{\mathrm{Jul}} \le 19.956]$
J = 0.623, supp = 681
Ethnology: Among Alyawarra, “Aleriya” refers to the son of a male speaker or to the child of the speaker’s brother.
Elicit Kinship Terminology
Dataset: Ethnographic information about the Australian Alyawarra tribe
Kinship terminology
Genealogical, age and sex information
Question: Elicit the meaning of kinship terms
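To give a feel for what a relational description of a kinship term looks like, the sketch below encodes the gloss given for “Aleriya” in the ethnology example (the son of a male speaker, or the child of the speaker's brother) as a query over parent links and sex. The genealogy is entirely invented, the helper names are hypothetical, and this hand-written rule stands in for, rather than reproduces, a mined query.

```python
# Hypothetical genealogy: all names, sexes and parent links are invented.
SEX = {"Gus": "male", "Adam": "male", "Bob": "male", "Cara": "female",
       "Dan": "male", "Eve": "female", "Finn": "male"}
PARENTS = {  # child -> set of known parents
    "Adam": {"Gus"}, "Bob": {"Gus"}, "Cara": {"Gus"},
    "Dan": {"Adam"}, "Eve": {"Bob"}, "Finn": {"Cara"},
}


def children(person):
    return {c for c, parents in PARENTS.items() if person in parents}


def brothers(person):
    """Male individuals sharing at least one known parent with the person."""
    own = PARENTS.get(person, set())
    return {p for p in SEX
            if p != person and SEX[p] == "male" and PARENTS.get(p, set()) & own}


def aleriya(speaker):
    """Relatives the speaker would call Aleriya under the stated gloss."""
    own_sons = set()
    if SEX[speaker] == "male":
        own_sons = {c for c in children(speaker) if SEX[c] == "male"}
    brothers_children = {c for b in brothers(speaker) for c in children(b)}
    return own_sons | brothers_children


for speaker in ["Adam", "Bob", "Cara", "Gus"]:
    print(speaker, sorted(aleriya(speaker)))
```

In the mining setting the direction is reversed: the pairs of speaker and referent labelled with a term are observed, and the task is to find a relational query over genealogy, age and sex whose support matches them.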
Elicit Kinship Terminology
[Figure: query graphs for the term Aleriya, connecting speaker #A and referent #Z through parent and spouse edges, with male/female labels and age comparisons]
Aleriya is used to refer to one's father or one's brother's child
The SIREN interface
Visualizing and interactively mining redescriptions
with Pauli Miettinen. In Instant Interactive Data Mining Workshop at ECML/PKDD, 2012. Demo at KDD 2012 and SIGMOD 2014.
Conclusion
Redescription Mining is a versatile and powerful data-mining tool, applicable in various domains.
For more details: http://www.cs.helsinki.fi/u/galbrun/
Thank you!