Current Research in Data Mining Research Group

Jiawei Han, Data Mining Research Group, Department of Computer Science, University of Illinois at Urbana-Champaign. Acknowledgements: NSF, ARL, NASA, AFOSR (MURI), DHS, Microsoft, IBM, Yahoo!, HP Lab & Boeing. October 27, 2022.

Page 1: Current Research in Data Mining Research Group

Current Research in Data Mining Research Group

Jiawei Han
Data Mining Research Group
Department of Computer Science
University of Illinois at Urbana-Champaign

Acknowledgements: NSF, ARL, NASA, AFOSR (MURI), DHS, Microsoft, IBM, Yahoo!, HP Lab & Boeing

April 20, 2023

Page 2: Current Research in Data Mining Research Group

Outline

An Introduction to Data Mining Research Group
Mining and OLAPing Information Networks
Mining Heterogeneous Information Networks
Mining Text-Rich Information Networks
OLAPing (Multi-dimensional analysis) of information networks: TextCube, OLAP heterogeneous networks
Taming the Web: WINACS (Integrated mining of Web structures and contents)
Mining Cyber-Physical Systems and Networks
Conclusions

Page 3: Current Research in Data Mining Research Group

Data Mining and Data Warehousing: Jiawei Han’s Group at CS, UIUC

Mining patterns and knowledge discovery from massive data
Data mining in heterogeneous information networks
Exploring broad applications of data mining
Developed many effective data mining algorithms, e.g., FPgrowth, PrefixSpan, gSpan, StarCubing, CrossMine, RankingCube, CrossClus, RankClus, and NetClus
600+ research papers in conferences and journals
Fellow of ACM, Fellow of IEEE, ACM SIGKDD Innovation Award, W. McDowell Award, Daniel Drucker Eminent Faculty Award
Textbook, “Data Mining: Concepts and Techniques,” adopted worldwide
Project lead for NASA EventCube for Aviation Safety [2008-2012]
Director of Information Network Academic Research Center funded by Army Research Lab (ARL) [2009-2014]

Page 4: Current Research in Data Mining Research Group

Data Mining Research Group at CS, UIUC

Page 5: Current Research in Data Mining Research Group

New Books on Data Mining & Link Mining

Han, Kamber and Pei, Data Mining, 3rd ed., 2011
Yu, Han and Faloutsos (eds.), Link Mining, 2010
Sun and Han, Mining Heterogeneous Information Networks, 2012

Page 7: Current Research in Data Mining Research Group

Mining Heterogeneous Information Networks

RankClus/NetClus

Figure: conferences (SIGMOD, VLDB, EDBT, KDD, ICDM, SDM, AAAI, ICML) and authors (Tom, Jim, Lucy, Mike, Jack, Tracy, Cindy, Bob, Mary, Alice) shown before and after clustering; axes: Objects, Ranking

RankCompete: A Competing Random Walk Model for Rank-Based Clustering

                  Database   Data Mining     AI          IR
Top-5 ranked      VLDB       KDD             IJCAI       SIGIR
conferences       SIGMOD     SDM             AAAI        ECIR
                  ICDE       ICDM            ICML        CIKM
                  PODS       PKDD            CVPR        WWW
                  EDBT       PAKDD           ECML        WSDM
Top-5 ranked      data       mining          learning    retrieval
terms             database   data            knowledge   information
                  query      clustering      reasoning   web
                  system     classification  logic       search
                  xml        frequent        cognition   text

RankClass [KDD11]

Knowledge Propagation in Heterogeneous Network

Page 8: Current Research in Data Mining Research Group

Similarity Search and Role Discovery in Information Networks

Path: ITI
Path: ITIGITI

Which images are most similar to me in Flickr?

PathSim [VLDB11]: Meta Path-Guided Similarity Search in Networks
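The PathSim measure can be sketched in a few lines. For a symmetric meta path such as ITI, the score of two objects x and y is twice the number of path instances between them, divided by the sum of the path instances from each object back to itself. The image-by-tag counts below are made up for illustration, and the helper names `path_counts` and `pathsim` are not from the slides.

```python
def path_counts(w):
    """Count path instances of the symmetric meta path built from w.

    w[x][k] is the number of links from object x to intermediate object k
    (e.g. image x to tag k), so the result is the matrix product w * w^T.
    """
    n = len(w)
    return [[sum(a * b for a, b in zip(w[x], w[y])) for y in range(n)]
            for x in range(n)]


def pathsim(w):
    """PathSim: s(x, y) = 2 * |paths x~>y| / (|paths x~>x| + |paths y~>y|)."""
    m = path_counts(w)
    n = len(m)
    scores = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            denom = m[x][x] + m[y][y]
            if denom:  # objects with no paths at all keep similarity 0
                scores[x][y] = 2 * m[x][y] / denom
    return scores


# Hypothetical image-by-tag link counts for three images and three tags.
image_tag = [
    [2, 1, 0],
    [1, 1, 0],
    [0, 0, 3],
]
scores = pathsim(image_tag)
```

Because the denominator uses each object's own path count, a heavily tagged image does not dominate the ranking merely by having many links.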

A “dirty” Information Network (imaginary)
Automatically infer: Cleaned/Inferred Adversarial Network, with roles such as Chief, Cell Lead, Insurgent

Role Discovery in Information Networks [KDD’10]

Advisee         Top Ranked Advisor      Time    Note
David M. Blei   1. Michael I. Jordan    01-03   PhD advisor, 2004
                2. John D. Lafferty     05-06   Postdoc, 2006
Hong Cheng      1. Qiang Yang           02-03   MS advisor, 2003
                2. Jiawei Han           04-08   PhD advisor, 2008
Sergey Brin     1. Rajeev Motwani       97-98   Unofficial advisor

Page 9: Current Research in Data Mining Research Group

Meta-Paths & Their Prediction Power

List all the meta-paths in the bibliographic network up to length 4
Investigate their respective power for coauthor relationship prediction
  Which meta-path has more prediction power?
  How to combine them to achieve the best quality of prediction?
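Listing meta-paths up to a length bound amounts to a bounded walk over the network schema. The sketch below assumes a simplified, undirected schema with node types A (author), P (paper), V (venue) and T (topic); it ignores link direction (so citing and cited-by are not distinguished), and the names `SCHEMA` and `enumerate_meta_paths` are illustrative only.

```python
# Assumed schema: which node types are directly linked to each type.
SCHEMA = {
    "A": ["P"],                 # author writes paper
    "P": ["A", "P", "V", "T"],  # paper: authors, citations, venue, topics
    "V": ["P"],
    "T": ["P"],
}


def enumerate_meta_paths(schema, start, end, max_len):
    """Return all meta-paths from start to end with 1..max_len links."""
    found = []

    def extend(path):
        links = len(path) - 1
        if links >= 1 and path[-1] == end:
            found.append("-".join(path))
        if links == max_len:
            return
        for nxt in schema[path[-1]]:
            extend(path + [nxt])

    extend([start])
    return found


author_paths = enumerate_meta_paths(SCHEMA, "A", "A", max_len=4)
```

Each resulting path (for example A-P-V-P-A, two authors publishing in the same venue) defines one candidate topological feature whose prediction power can then be measured.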

Page 10: Current Research in Data Mining Research Group

Relationship Prediction in Heterogeneous Info Networks

Why predict co-author relationships in DBLP?
  Prediction of relationships between different types of nodes in heterogeneous networks
  E.g., what papers should Faloutsos write?
Traditional link prediction: homogeneous networks
  Co-author networks in DBLP, friendship networks in Facebook
Relationship prediction
  Study the roles of topological features in heterogeneous networks in predicting the building of co-author relationships
  Meta-path guided prediction!

Y. Sun, et al., "Co-Author Relationship Prediction in Heterog. Bibliographic Networks", ASONAM'11, July 2011

Page 11: Current Research in Data Mining Research Group

Guidance: Meta Path in Bibliographic Network

Relationship prediction: meta path-guided prediction
Meta path relationships among similar typed links share similar semantics and are comparable and inferable

Figure: bibliographic network schema with node types paper, topic, venue, author and relations publish/publish-1, mention/mention-1, write/write-1, contain/contain-1, cite/cite-1

Co-author prediction (A—P—A) using topological features also encoded by meta paths, e.g., citation relations between authors (A—P→P—A)
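Topological features of this kind can be computed as path counts, that is, products of the adjacency matrices along a meta path. The toy data below (three authors, three papers, one citation) is invented for illustration, and `matmul` and `transpose` are small helpers written here so that the sketch needs no external library.

```python
def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


# writes[a][p] = 1 if author a wrote paper p (hypothetical data).
writes = [
    [1, 0, 0],  # author 0 wrote paper 0
    [1, 1, 0],  # author 1 wrote papers 0 and 1
    [0, 0, 1],  # author 2 wrote paper 2
]
# cites[p][q] = 1 if paper p cites paper q: here paper 2 cites paper 0.
cites = [
    [0, 0, 0],
    [0, 0, 0],
    [1, 0, 0],
]

# A-P-A: number of papers two authors wrote together.
apa = matmul(writes, transpose(writes))
# A-P->P-A: number of times the row author's papers cite the column author's.
appa = matmul(matmul(writes, cites), transpose(writes))
```

Here `apa[0][1]` counts the shared paper of authors 0 and 1, while `appa[2][0]` records that author 2 cites author 0, a feature that may hint at a future collaboration.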

Page 12: Current Research in Data Mining Research Group

Case Study in CS Bibliographic Network

The learned significance for each meta path under measure “normalized path count” for HP-3hop dataset

Page 13: Current Research in Data Mining Research Group

Case Study: Predicting Concrete Co-Authors

High quality predictive power for such a difficult task

Using data in T0 = [1989; 1995] and T1 = [1996; 2002]

Predict new coauthor relationship in T2 = [2003; 2009]
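The time-sliced setup can be sketched as follows: collect co-author pairs per interval, then treat as prediction targets only the pairs that appear in T2 but not in the earlier intervals. The paper list and the helper names `coauthor_pairs` and `new_pairs` are hypothetical; how the original study builds its training and test sets may differ in detail.

```python
from itertools import combinations

T0, T1, T2 = (1989, 1995), (1996, 2002), (2003, 2009)


def coauthor_pairs(papers, interval):
    """Unordered author pairs that share a paper within the interval."""
    lo, hi = interval
    pairs = set()
    for year, authors in papers:
        if lo <= year <= hi:
            for pair in combinations(sorted(set(authors)), 2):
                pairs.add(frozenset(pair))
    return pairs


def new_pairs(papers, history, target):
    """Pairs first seen in `target`, given the earlier `history` intervals."""
    seen = set()
    for interval in history:
        seen |= coauthor_pairs(papers, interval)
    return coauthor_pairs(papers, target) - seen


# Hypothetical (year, authors) records.
papers = [
    (1990, ["a", "b"]),
    (1998, ["a", "c"]),
    (2005, ["a", "b"]),  # already co-authors in T0, so not "new"
    (2005, ["b", "c"]),  # first collaboration: a prediction target
    (2010, ["a", "d"]),  # outside all three intervals
]
targets = new_pairs(papers, history=[T0, T1], target=T2)
```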

Page 15: Current Research in Data Mining Research Group

iTopicModel: Model Set-Up & Objective Function

Graphical model:
  ϴi = (ϴi1, ϴi2, …, ϴiT): topic distribution for document xi
  Structural layer: follows the same topology as the document network
  Text layer: follows PLSA, i.e., for each word, pick a topic z ~ multi(ϴi), then pick a word w ~ multi(βz)

Objective function: joint probability
  X: observed text information
  G: document network
  Parameters: ϴ: topic distribution; β: word distribution
  ϴ is the most critical: it needs to be consistent with the text as well as the network structure
  Structure part and text part: can model them separately!
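The text layer's generative step can be illustrated directly: for each word position, draw a topic z from the document's topic distribution, then draw a word from that topic's word distribution. The distributions below are made-up numbers, and `generate_words` is a name chosen for this sketch; it covers only the PLSA-style text layer, not the structural layer.

```python
import random


def generate_words(theta_i, beta, n_words, rng):
    """Sample (topic, word) index pairs for one document.

    theta_i: topic distribution of the document, length T
    beta:    beta[z] is the word distribution of topic z, length V
    """
    sample = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta_i)), weights=theta_i)[0]
        w = rng.choices(range(len(beta[z])), weights=beta[z])[0]
        sample.append((z, w))
    return sample


rng = random.Random(0)            # fixed seed for repeatability
theta_i = [0.7, 0.3]              # two topics
beta = [
    [0.5, 0.4, 0.1],              # topic 0 over a three-word vocabulary
    [0.1, 0.1, 0.8],              # topic 1
]
sample = generate_words(theta_i, beta, n_words=10, rng=rng)
```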

Page 16: Current Research in Data Mining Research Group

Case Study: Topic Hierarchy Building for DBLP

Page 17: Current Research in Data Mining Research Group

Probabilistic Topic Models with Network-Based Biased Propagation

Text-rich heterogeneous information network
  Ubiquitous textual documents (news, papers)
  Connect with users and other objects: topic propagation

How to discover latent topics and identify clusters of multi-typed objects simultaneously?

How can text data and heterogeneous information network mutually enhance each other in topic modeling and other text mining tasks?

Deng, Han et al., “Probabilistic Topic Models with Biased Propagation on Heterogeneous Information Networks”, KDD’11

Page 18: Current Research in Data Mining Research Group

Biased Topic Propagation

Intuition: InfoNet provides valuable information
Different objects have their own inherent information (e.g., D with rich text and U without explicit text)
Treat documents with rich text and other objects without explicit text differently
  Topic(D): inherent text + connected U
  Topic(U): connected D

Basic criterion (biased topic propagation):
  The topic of an object without explicit text depends on the topics of the documents it connects to
  The topic of a document is correlated with its objects to some extent, and should be principally determined by the inherent content of its text
  A simple and unbiased topic propagation does not make much sense
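One way to picture the biased criterion is a single propagation round: objects without text take the average topic distribution of their documents, while each document keeps most of its text-based distribution and mixes in only a small share from its connected objects. This is a sketch under assumptions, not the update rule of the KDD’11 paper; the mixing weight `xi`, the function names, and the data are all hypothetical.

```python
def mean_dist(dists):
    """Component-wise average of equally long probability vectors."""
    return [sum(col) / len(dists) for col in zip(*dists)]


def propagate(text_topics, doc_objects, xi=0.3):
    """One round of biased propagation between documents D and objects U.

    text_topics: {doc: topic distribution estimated from its own text}
    doc_objects: {doc: list of connected objects without text}
    xi:          assumed share of a document's topic taken from its objects
    """
    object_docs = {}
    for doc, objects in doc_objects.items():
        for obj in objects:
            object_docs.setdefault(obj, []).append(doc)

    # Topic(U): determined entirely by the connected documents.
    object_topics = {obj: mean_dist([text_topics[d] for d in docs])
                     for obj, docs in object_docs.items()}

    # Topic(D): mainly the inherent text, nudged by connected objects.
    doc_topics = {}
    for doc, own in text_topics.items():
        objects = doc_objects.get(doc, [])
        if not objects:
            doc_topics[doc] = list(own)
            continue
        neighbor = mean_dist([object_topics[o] for o in objects])
        doc_topics[doc] = [(1 - xi) * a + xi * b for a, b in zip(own, neighbor)]
    return doc_topics, object_topics


text_topics = {"d1": [1.0, 0.0], "d2": [0.0, 1.0]}
doc_objects = {"d1": ["u1"], "d2": ["u1"]}
doc_topics, object_topics = propagate(text_topics, doc_objects)
```

With `xi` well below 0.5 the text remains the principal source of a document's topic, which is the bias the slide argues for.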

Page 19: Current Research in Data Mining Research Group

Incorporating Heterogeneous Info. Network

L(C): Topic model
R(G): Biased propagation

Page 20: Current Research in Data Mining Research Group

Experiments: DBLP & NSF Awards

Data collection: DBLP, NSF-Awards
Metrics: Accuracy (AC), Normalized mutual information (NMI)
Results
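For reference, NMI between a ground-truth labeling and a clustering can be computed from label counts alone. Several normalizations are in use; the sketch below divides the mutual information by the geometric mean of the two entropies, which is an assumption here since the slide does not say which variant was used.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(truth, pred):
    """Normalized mutual information of two labelings of the same items."""
    n = len(truth)
    joint = Counter(zip(truth, pred))
    count_t, count_p = Counter(truth), Counter(pred)
    mi = sum((c / n) * log(c * n / (count_t[t] * count_p[p]))
             for (t, p), c in joint.items())
    h_t, h_p = entropy(truth), entropy(pred)
    if h_t == 0 or h_p == 0:
        # Degenerate case: a labeling with a single cluster.
        return 1.0 if h_t == h_p else 0.0
    return mi / sqrt(h_t * h_p)


same_up_to_renaming = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

NMI is invariant to renaming clusters, so it needs no matching step between predicted and true labels, unlike clustering accuracy.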

Page 22: Current Research in Data Mining Research Group

Event Cube: An Overview

Figure: Multidimensional Text Database, Event Cube Representation, Analysis Support (Multidimensional OLAP, Ranking, Cause Analysis, Topic Summarization/Comparison, …), Analyst
  Fine-level dimensions: Time (98.01, 98.02, 99.01, 99.02), Location (LAX, SJC, MIA, AUS), Topic (overshoot, undershoot, birds, turbulence)
  Coarse-level dimensions: Time (1998, 1999), Location (CA, FL, TX), Topic (Deviation, Encounter)
  Operations between levels: drill-down, roll-up

Event Cube: An Organized Approach for Mining and Understanding Anomalous Aviation Events

Funded by NASA (2008-2010)

Page 23: Current Research in Data Mining Research Group

Text/Topic Cube: General Idea

Heterogeneous: categorical attributes + unstructured text

How to combine? Our solution:

Event report record: ACN, Time, Location, Place, Environment, …, plus text data
Cube: categorical attributes
Measure: text/topic model over the unstructured text, stored as Term/Topic and Weight pairs (T1 W1, T2 W2, T3 W3, …)
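A minimal sketch of the idea: cells are keyed by the categorical attributes, and each cell's measure is a bag of term weights aggregated from the reports that fall into it. Rolling up merges cells along a dimension hierarchy. The reports are invented, raw term frequency stands in for whatever weighting the real system uses, and `build_cells` and `roll_up` are illustrative names.

```python
from collections import Counter, defaultdict

# Location hierarchy used for roll-up: airport code to state.
AIRPORT_TO_STATE = {"LAX": "CA", "SJC": "CA", "MIA": "FL", "AUS": "TX"}


def build_cells(reports):
    """Base cuboid: (month, airport) -> term-frequency measure."""
    cells = defaultdict(Counter)
    for month, airport, text in reports:
        cells[(month, airport)].update(text.lower().split())
    return cells


def roll_up(cells):
    """Aggregate (month, airport) cells into (year, state) cells."""
    coarse = defaultdict(Counter)
    for (month, airport), terms in cells.items():
        year = "19" + month.split(".")[0]  # e.g. "98.01" -> "1998"
        coarse[(year, AIRPORT_TO_STATE[airport])].update(terms)
    return coarse


# Hypothetical event reports: (month, airport, narrative text).
reports = [
    ("98.01", "LAX", "overshoot runway"),
    ("98.02", "SJC", "birds overshoot"),
    ("99.01", "MIA", "turbulence"),
]
cells = build_cells(reports)
coarse = roll_up(cells)
```

Drill-down is the inverse view: from a coarse cell such as ("1998", "CA") back to the finer cells that contributed to it.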

Page 24: Current Research in Data Mining Research Group

Effective Keyword Search

TopCells (ICDE’10): Ranking aggregated cells (objects) in TextCube.

Figure: example query “Healthcare Reform”
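Ranking aggregated cells for a keyword query can be sketched as scoring every document against the query and then ranking cells by an aggregate of their documents' scores. The version below assumes the aggregate is the average and uses a plain term-count relevance; both choices, the documents, and the name `rank_cells` are assumptions made for illustration rather than details taken from the TopCells paper.

```python
from collections import defaultdict


def rank_cells(docs, query, top_k=2):
    """Return the top_k (cell, average relevance) pairs for the query.

    docs: iterable of (cell, text), where cell is a tuple of dimension values
    """
    terms = query.lower().split()
    scores = defaultdict(list)
    for cell, text in docs:
        words = text.lower().split()
        scores[cell].append(sum(words.count(t) for t in terms))
    averages = {cell: sum(s) / len(s) for cell, s in scores.items()}
    ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical documents tagged with (year, country) cells.
docs = [
    (("2010", "US"), "healthcare reform debate in senate"),
    (("2010", "US"), "senate passes healthcare reform law"),
    (("2010", "UK"), "football results"),
    (("2009", "US"), "tax reform talk"),
]
top = rank_cells(docs, "healthcare reform")
```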

Page 25: Current Research in Data Mining Research Group

Effective OLAP Exploration

TEXplorer (submitted): Integrating keyword-based ranking and OLAP exploration

Figure: example query “Healthcare Reform”

Page 26: Current Research in Data Mining Research Group

Effective Event Tracking

PET (KDD’10): tracking popularity and textual representation of events in social communities (Twitter)

Figure: the event “Healthcare Reform” tracked over time, with representative terms per stage: “debate, cost, senate, …”; “pass, success, law, …”; “benefit, profit, effective, …”

Page 28: Current Research in Data Mining Research Group

Growing Parallel Paths (WWW 2011)

Figure: DOM trees of Pages A-F. Anchors (A) reached through tag paths such as HTML/DIV/UL/LI, HTML/TABLE/TR/TD and HTML/DIV/P are labeled B-F and X, Y, Z, W, U, V; steps are numbered 1-6, and the grown path is shown as the result.
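The notion of links that sit on the same tag path can be illustrated with the standard-library HTML parser: record the root-to-anchor tag path of every link and group links by that path. This only shows the grouping idea, not the algorithm of the WWW 2011 paper; it assumes well-formed HTML without void elements such as `<br>`, and the class name and sample markup are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorPathCollector(HTMLParser):
    """Group link targets by the tag path leading to their <a> element."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # currently open tags
        self.groups = defaultdict(list)  # tag path -> list of hrefs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.groups["/".join(self.stack)].append(href)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


page = (
    "<html><div><ul>"
    "<li><a href='/x'>X</a></li>"
    "<li><a href='/y'>Y</a></li>"
    "</ul><p><a href='/e'>E</a></p></div></html>"
)
collector = AnchorPathCollector()
collector.feed(page)
groups = dict(collector.groups)
```

Links that share a path, here the two list items, are natural candidates for belonging to the same list of records.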

Page 29: Current Research in Data Mining Research Group

Mapping Pages to Records (CIKM’10)

Figure: site hierarchy under / (root) [cs.illinois.edu], with nodes People (/people), Faculty (/people/faculty), Research and DataMining (/areas/data under /research); faculty pages /people/faculty/jiawei-han, /people/faculty/dan-roth and /people/faculty/vikram-adve each point to a PersonalSite (www.cs.illinois.edu/homes/hanj/, l2r.cs.uiuc.edu/~danr/, llvm.cs.uiuc.edu/~vadve/Home.html)

Structured data (names): Tarek Abdelzaher, Sarita Adve, Vikram Adve, Gul Agha, Eyal Amir, Dan Roth, Jiawei Han
Web pages (URLs): llvm.cs.uiuc.edu/~vadve/Home.html, rsim.cs.illinois.edu/~sadve/, www.cs.illinois.edu/homes/hanj/, l2r.cs.uiuc.edu/~danr/
Columns shown: Name, URL, Zipcode; mappings connect structured data to Web pages

Database records can be found on link paths!

Page 30: Current Research in Data Mining Research Group

WinaCS: Web Information Network Analysis for Computer Science

Integration of Web structure mining and information network analysis

Tim Weninger, Marina Danilevsky, et al., “WinaCS: Construction and Analysis of Web-Based Computer Science Information Networks”, ACM SIGMOD'11 (system demo), Athens, Greece, June 2011.

Page 32: Current Research in Data Mining Research Group

Discovery of Swarms and Periodic Patterns in Moving Object Data

A system that mines moving object patterns: Z. Li, et al., “MoveMine: Mining Moving Object Databases”, SIGMOD’10 (system demo)

Z. Li, B. Ding, J. Han, and R. Kays, “Mining Hidden Periodic Behaviors for Moving Objects”, KDD’10 (sub)

Z. Li, B. Ding, J. Han, and R. Kays, “Swarm: Mining Relaxed Temporal Moving Object Clusters”, VLDB’10 (sub)

← Bird flying paths shown on Google Earth

Mined periodic patterns by our new method →

← Convoy discovers only restricted patterns

Swarm discovers more patterns →
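The difference between the two pattern types can be made concrete with a small check. The sketch assumes that a convoy requires the objects to share a cluster for k consecutive timestamps, while a swarm only requires at least min_t timestamps in total, not necessarily consecutive. It takes per-timestamp clusters as given, treats the sorted timestamps as consecutive, skips the closedness condition used in swarm mining, and uses invented data and function names.

```python
def shared_timestamps(clusters_by_time, objects):
    """Timestamps at which all the given objects fall into one cluster."""
    wanted = set(objects)
    return [t for t in sorted(clusters_by_time)
            if any(wanted <= set(cluster) for cluster in clusters_by_time[t])]


def is_swarm(clusters_by_time, objects, min_o, min_t):
    """Relaxed pattern: together at >= min_t timestamps, gaps allowed."""
    together = shared_timestamps(clusters_by_time, objects)
    return len(set(objects)) >= min_o and len(together) >= min_t


def is_convoy(clusters_by_time, objects, min_o, k):
    """Restricted pattern: together for k consecutive timestamps."""
    if len(set(objects)) < min_o:
        return False
    together = set(shared_timestamps(clusters_by_time, objects))
    run = 0
    for t in sorted(clusters_by_time):
        run = run + 1 if t in together else 0
        if run >= k:
            return True
    return False


# Hypothetical clusters of moving objects at five timestamps.
clusters_by_time = {
    1: [{"o1", "o2"}, {"o3"}],
    2: [{"o1"}, {"o2", "o3"}],
    3: [{"o1", "o2", "o3"}],
    4: [{"o1", "o3"}, {"o2"}],
    5: [{"o1", "o2"}],
}
pair = ["o1", "o2"]
swarm_found = is_swarm(clusters_by_time, pair, min_o=2, min_t=3)
convoy_found = is_convoy(clusters_by_time, pair, min_o=2, k=2)
```

In this toy data o1 and o2 travel together at timestamps 1, 3 and 5, so they form a swarm but never a convoy of two consecutive timestamps.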

Page 33: Current Research in Data Mining Research Group

GeoTopic Discovery: Mining Spatial Text

Figure: results of LDM, TDM, GeoFolk and LGTA on geo-tagged photos with landscape topics (coast vs. desert vs. mountain)

Z. Yin, et al., GeoTopic Discovery and Comparison, WWW'11

Page 35: Current Research in Data Mining Research Group

Conclusions: Towards Mining Data Semantics in Integrated Heterog. Networks

Most data objects are linked, forming heterogeneous information networks
  Most datasets can be “organized” or “transformed” into “structured” multi-typed heterogeneous info. networks
  Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, …
Structures can be progressively mined from less organized data sets by info. network analysis
Surprisingly rich knowledge can be mined from such structured heterogeneous info. networks
  Clustering, ranking, classification, data cleaning, trust analysis, role discovery, similarity search, relationship prediction, …
It is promising to mine data semantics from rich info. networks!

Page 36: Current Research in Data Mining Research Group

References for the Talk

J. Han, Y. Sun, X. Yan, and P. S. Yu, “Mining Heterogeneous Information Networks” (tutorial), KDD'10.
Ming Ji, Jiawei Han, and Marina Danilevsky, “Ranking-Based Classification of Heterogeneous Information Networks”, KDD'11.
Y. Sun, J. Han, et al., “RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis”, EDBT'09.
Y. Sun, Y. Yu, and J. Han, “Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema”, KDD'09.
Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, “PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks”, VLDB'11.
Y. Sun, R. Barber, M. Gupta, C. Aggarwal, and J. Han, “Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks”, ASONAM'11.
C. Wang, J. Han, et al., “Mining Advisor-Advisee Relationships from Research Publication Networks”, KDD'10.
Tim Weninger, Marina Danilevsky, et al., “WinaCS: Construction and Analysis of Web-Based Computer Science Information Networks”, ACM SIGMOD'11 (system demo).
X. Yin, J. Han, and P. S. Yu, “Truth Discovery with Multiple Conflicting Information Providers on the Web”, IEEE TKDE, 20(6), 2008.