Page 1: Serious Data Mining - SIM-Flanders · iMinds Dept. MEDICAL IT KU Leuven ESAT-STADIUS Serious Data Mining Prof.Dr. Bart De Moor Bart.DeMoor@iminds.be 1

iMinds Dept. MEDICAL IT

KU Leuven ESAT-STADIUS

Serious Data Mining

Prof.Dr. Bart De Moor

Bart.DeMoor@iminds.be

Content

• Big Data
  – What
  – Who
• Six issues
  – Data
  – Compute Infrastructure
  – Storage Infrastructure
  – Analytics
  – Visualization
  – Security & Privacy
• Machine learning as a commodity
• Expertise of ESAT-STADIUS, KU Leuven
  – Books & Spin-offs
  – Algorithms
  – Applications

1 million = 1 000 000
1 billion = 1 000 000 000
1 trillion = 1 000 000 000 000
1 quadrillion = 1 000 000 000 000 000

1 kB = 1 000 bytes
1 MB = 1 000 000 bytes
1 GB = 1 000 000 000 bytes
1 TB = 1 000 000 000 000 bytes
1 PB = 1 000 000 000 000 000 bytes

1 TB = a large university library = 212 DVDs = 1430 CDs = 3 years of music in CD quality
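The disc counts can be checked with a few lines of arithmetic. This is only a sketch: the media capacities (4.7 GB for a single-layer DVD, 700 MB for a CD) are assumptions, since the slide does not state which capacities were used.

```python
# Assumed media capacities; the slide does not say which ones it used.
TB = 10**12
DVD_BYTES = 4.7e9   # single-layer DVD (assumption)
CD_BYTES = 700e6    # 700 MB CD (assumption)

def media_per_terabyte(capacity_bytes: float) -> float:
    """Number of media of the given capacity needed to hold 1 TB."""
    return TB / capacity_bytes

if __name__ == "__main__":
    print(f"DVDs per TB: {media_per_terabyte(DVD_BYTES):.1f}")
    print(f"CDs per TB:  {media_per_terabyte(CD_BYTES):.1f}")
```

With these capacities the DVD count truncates to 212 and the CD count rounds to 1430, matching the slide.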

Moore’s law: computing power doubles every 18 months

Next Generation Sequencing

Carlson’s law: complexity/cost evolves exponentially
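To make the 18-month doubling concrete, the growth factor after a given number of years can be sketched as follows. The function name and the default doubling time are illustrative choices; only the 18-month figure comes from the slide.

```python
def growth_factor(years: float, doubling_time_years: float = 1.5) -> float:
    """Multiplicative growth after `years` when capacity doubles every `doubling_time_years`."""
    return 2.0 ** (years / doubling_time_years)

if __name__ == "__main__":
    for years in (1.5, 5, 10, 15):
        print(f"after {years:>4} years: x{growth_factor(years):,.0f}")
```

Fifteen years at this pace corresponds to ten doublings, a factor of 1024.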

Genome data

• Human genome project (2003)
  – 13-year project
  – $300 million value with 2002 technology
• Personal genome (2007)
  – Genome of James Watson, 2 months
  – $1 000 000
• €1000 genome
  – Expected 2012-2020
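The two dollar figures above imply a cost-halving time that can be compared with Moore's 18 months. The sketch below assumes a smooth exponential decline between 2002 ($300 million) and 2007 ($1 million); that smoothness is an assumption made for illustration, not a claim from the slide.

```python
import math

def halving_time_years(cost_start: float, cost_end: float, years: float) -> float:
    """Time for the cost to halve, assuming a constant exponential decline."""
    halvings = math.log2(cost_start / cost_end)
    return years / halvings

if __name__ == "__main__":
    t = halving_time_years(300e6, 1e6, 2007 - 2002)
    print(f"implied halving time: {t:.2f} years ({12 * t:.1f} months)")
```

Under these assumptions the cost halves roughly every 7 months, noticeably faster than an 18-month doubling of computing power.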

[Figure: cost per base pair and genome cost on a logarithmic axis (1,00E-07 to 1,00E+11), 1990 to 2015. Sequencers: GS-FLX, Roche Applied Science 454.]

Tsunami of medical data

• PACS UZ Leuven: 1.6 PetaByte
• Genomics core, HiSeq 2000 full speed exome sequencing: 1 TeraByte / week
• 1 small animal image: 1 GigaByte
• 1 CD-ROM: 750 MegaByte
• Sequencing all newborns by 2020 (125k births / year): 125 PetaByte / year
• Index of 20 million biomedical PubMed records: 23 GigaByte
• 1 slice mouse brain MSI at 10 μm resolution: 81 GigaByte
• Raw NGS data of 1 full genome: 1 TeraByte
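The newborn-sequencing estimate follows directly from two other figures on the slide: 125k births per year and 1 TeraByte of raw NGS data per full genome. A minimal sketch of that arithmetic, using decimal units (1 PB = 1000 TB):

```python
BIRTHS_PER_YEAR = 125_000   # from the slide
TB_PER_GENOME = 1           # raw NGS data of one full genome, from the slide

def yearly_volume_pb(births: int, tb_per_genome: float) -> float:
    """Raw data volume per year in petabytes (1 PB = 1000 TB)."""
    return births * tb_per_genome / 1000

if __name__ == "__main__":
    print(f"{yearly_volume_pb(BIRTHS_PER_YEAR, TB_PER_GENOME):.0f} PB / year")
```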

Data explosion in finance

Growing ~ 30-50% every year, half of this is unstructured!

Data storage became 30% cheaper, yet budgets for data storage are still rising.
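How growth and falling prices combine can be sketched with one multiplication. The reading used here is an assumption: the 30% price drop is taken as a yearly drop in cost per unit of storage, which the slide does not state explicitly.

```python
def budget_ratio(data_growth: float, price_drop: float) -> float:
    """Next year's storage spend relative to this year's (1.0 means unchanged)."""
    return (1.0 + data_growth) * (1.0 - price_drop)

if __name__ == "__main__":
    for growth in (0.30, 0.50):
        print(f"growth {growth:.0%}: budget x {budget_ratio(growth, 0.30):.2f}")
```

Under this reading, spend rises once data grows faster than about 43% a year (1 / 0.7 - 1), which sits inside the 30-50% range quoted above.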

How big is big?

• Google: ?
• Microsoft: 1 million servers (2013)
• Amazon: ?

Big Data

Data

Compute infrastructure

Storage

Analytics
Flowchart

Visualization

Security

Big Data Landscape

More and more analytics as a commodity!

Machine Learning as a commodity

Big Data Landscape

Many possible applications!

• Energy
• Industry
• Environment
• Social networks
• Fraud and predictive analysis
• Health

Focus on Serious Big Data

Analytics

Big Data Analytics

• Numerical algorithms
• Rule-based

Main tasks

• Prediction: regression, classification
• Segmentation: clustering
• Anomalies: outlier detection

What can we do?

• Shopping cart analysis
• Fraud detection
• Face recognition
• Movie recommendation
• Just-in-time production
• Credit worthiness
• Disease spreading
• Traffic management

Clustering/Classification

• Features: hair color (black, blond), hair length (long, short), color of clothes (orange, blue)
• Steps shown: Features, Clusters, Similarity, Decision
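The idea of grouping people by feature similarity can be illustrated with plain k-means. This is a generic textbook sketch, not the method used on the slide; the feature vectors (hair length in cm, hair darkness between 0 and 1) are hypothetical.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on a list of equal-length numeric tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

# Hypothetical feature vectors: (hair length in cm, hair darkness 0..1).
people = [(5, 0.9), (7, 0.8), (6, 1.0), (40, 0.2), (45, 0.1), (38, 0.3)]
centroids, labels = kmeans(people, k=2)
print(labels)
```

In practice the features would be scaled first, since here hair length dominates the distance.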

Forecasting

Deep Learning

● Neural networks.
● New algorithms.
● Multiple layers on top of each other.
● Each layer learns a more complex representation.
● Learn feature hierarchies.
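The phrase "multiple layers on top of each other" can be shown with a forward pass through stacked layers. This is a minimal sketch with random, untrained weights and hypothetical layer widths; it shows only how each layer consumes the previous layer's output, not how the weights are learned.

```python
import math
import random

def dense(inputs, weights, biases):
    """One fully connected layer followed by a tanh non-linearity."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for a layer mapping n_in inputs to n_out outputs."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(x, layers):
    """Feed x through the stacked layers and return every intermediate representation."""
    activations = [x]
    for weights, biases in layers:
        activations.append(dense(activations[-1], weights, biases))
    return activations

rng = random.Random(0)
sizes = [4, 8, 8, 2]  # hypothetical layer widths
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
activations = forward([0.5, -1.0, 0.25, 2.0], layers)
print([len(a) for a in activations])
```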

Stadius - Books

Stadius - Spin-offs

Algorithms in Stadius

DATA ANALYTICS: Supervised learning, Unsupervised learning, Statistical methods, Other analytical techniques

• General methodology: Dusan Popovic, Marc Claesen, Jaak Simm, Peter Roelants
• Ensemble methods: Dusan Popovic, Marc Claesen, Jaak Simm
• Semi-supervised learning: Marc Claesen
• Large-scale learning: Marc Claesen, Nico Verbeeck
• Hyperparameter optimization: Marc Claesen, Dusan Popovic
• Optimization meta-heuristics: Dusan Popovic, Maira Rodriguez
• Neural networks: Peter Roelants, Dusan Popovic
• Deep learning: Peter Roelants
• ICA: Nico Verbeeck
• Clustering: Yousef el Alamaat
• Matrix factorization: Jaak Simm
• Manifold learning: Xian Mao
• Decision trees: Dusan Popovic, Jaak Simm
• Regularized regression & GLM: Yousef el Alamaat, Arnaud Installe
• Wavelet compression: Nico Verbeeck
• Visualization: Toni Verbeiren, Ryo Sakai
• Multi-task learning: Jaak Simm
• Data fusion: Dusan Popovic, Pooya Zakeri, Gorana Nikolic
• General methodology: Yousef el Alamaat, Dusan Popovic, Marc Claesen
• SVMs & kernel methods: Marc Claesen, Jaak Simm, Pooya Zakeri
• Survival analysis: Marc Claesen

Big Data Applications (Stadius)

• Large-scale science
• Bio-Informatics
• Data Assimilation
• Health
• Smart City
• Electricity
• Internet of Things
• Traffic
• Finance
• Risk Assessment
• Stock Trading
• Customer Relation Management
• Fraud Detection
• Churn
• Text Mining

BIOMEDICAL APPLICATIONS: Diseases/Bodily processes/Tissues, Technological platforms/data types, IT support

• Cancer: Dusan Popovic, Xian Mao
• Diabetes: Marc Claesen
• Endometriosis: Yousef el Alamat, Dusan Popovic, Nico Verbeeck
• Mass Spectrometry: Nico Verbeeck, Yousef el Alamat, Xian Mao
• Genomics: Inge Thijs, Dusan Popovic, Maira Rodriguez, Ryo Sakai
• Proteomics: Inge Thijs, Yousef el Alamat, Xian Mao, Ryo Sakai, Pooya Zakeri
• Microarrays: Dusan Popovic, Griet Laenen, Pooya Zakeri
• miRNA: Jaak Simm
• RNA sequencing: Jaak Simm
• NGS: Amin Ardeshirdavani, Jaak Simm
• Metagenomics: Jaak Simm
• Rare genetic disorders: Dusan Popovic, Ryo Sakai
• Object-oriented programming: Arnaud Installe, Amin Ardeshirdavani, Gorana Nikolic, Marc Claesen, Toni Verbeiren, Xian Mao
• Web applications: Amin Ardeshirdavani, Gorana Nikolic, Toni Verbeiren
• Databases: Arnaud Installe, Amin Ardeshirdavani, Gorana Nikolic
• Big data technologies (e.g. Hadoop, MapReduce): Amin Ardeshirdavani, Jaak Simm, Toni Verbeiren
• Distributed computing: Toni Verbeiren
• Glucose level monitoring: Tom Van Herpe
• Mouse brain: Nico Verbeeck, Yousef el Alamat, Xian Mao
• High-performance computing: Jaak Simm
• Bacteria & Fungi: Jaak Simm
• Agent-based systems: Maira Rodriguez
• Clinical data: Arnaud Installe, Dusan Popovic, Yousef el Alamat
• Annotation data: Pooya Zakeri
• Biological networks: Griet Laenen, Pooya Zakeri
• Drug discovery: Griet Laenen, Pooya Zakeri

Power grid

Electric load forecasting

Problem

Energy Production

Energy Demand

How to forecast the demand?

Electric load forecasting

Data sets

• Elia: transmission system operator; hourly data; 10 substations considered
• KMI: Meteorological Institute; weather conditions

Electric load forecasting

• Seasonal effects: What hour? What day? What week?
• Weather effects: Temperature? Sunny?
• Accurate forecast
• Model update: every week!
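The seasonal questions (what hour, what day, what week) map naturally onto calendar features derived from a timestamp. The sketch below is one possible encoding; the feature names are hypothetical and the slides do not specify how the inputs were actually built.

```python
from datetime import datetime

def calendar_features(ts: datetime) -> dict:
    """Calendar inputs for a load model, derived from a single timestamp."""
    return {
        "hour": ts.hour,
        "weekday": ts.weekday(),        # Monday = 0, Sunday = 6
        "week": ts.isocalendar()[1],    # ISO week number
        "is_weekend": ts.weekday() >= 5,
    }

if __name__ == "__main__":
    print(calendar_features(datetime(2013, 5, 6, 14)))
```

Weather inputs such as temperature would be added alongside these calendar features.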

250 transformer substations, every 15 min, 5 years

[Figure panels: 1 post (substation), 1 week; 1 post, four seasons]
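The size of this data set follows from the sampling scheme. A rough count, assuming every substation reports without gaps and ignoring leap days (both assumptions):

```python
SUBSTATIONS = 250
SAMPLES_PER_DAY = 24 * 60 // 15   # one reading every 15 minutes
DAYS = 5 * 365                    # assumption: leap days ignored

def total_samples(substations: int, samples_per_day: int, days: int) -> int:
    """Number of load readings across all substations."""
    return substations * samples_per_day * days

if __name__ == "__main__":
    print(f"{total_samples(SUBSTATIONS, SAMPLES_PER_DAY, DAYS):,} readings")
```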

Electric load forecasting

Seasonalities in the load: day, week, year, holidays

6 posts, 1 year

Seasonalities, calendar holidays!

- Seasonalities = a priori information (regress Monday on Monday!)
- Normalization:
  - remove effects of temperature, cloudiness, ...
  - remove effect of holidays and calendar days (use dummies)
- Calculate ‘eigenprofiles’ = daily shape per post
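One way to read "eigenprofile" is as the dominant daily shape extracted from a matrix of normalized daily load curves, for example the leading right singular vector. That reading is an assumption; the slides do not say how the eigenprofiles were computed. The sketch below uses power iteration on hypothetical data where every day is a scaled copy of one shape.

```python
import math

def dominant_profile(days, iterations=100):
    """Leading right singular vector of the day-by-time load matrix (power iteration)."""
    n = len(days[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        # One multiplication by X^T X: project every day onto v, then recombine.
        scores = [sum(d[j] * v[j] for j in range(n)) for d in days]
        w = [sum(s * d[j] for s, d in zip(scores, days)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical normalized loads: 3 days x 4 time slots, each day a scaled copy of one shape.
shape = [1.0, 2.0, 4.0, 2.0]
days = [[a * s for s in shape] for a in (0.9, 1.0, 1.2)]
profile = dominant_profile(days)
print([round(p, 3) for p in profile])
```

Because the example matrix has rank one, the recovered profile is simply the common shape scaled to unit length.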

• Steam 15.2 T/h: top 99 %, bottom 1 %
• Steam 20 T/h: top 99.75 % (99.8 %), bottom 0.25 % (0.2 %)
• Steam 20 T/h: top 99.8 % (99.8 %), bottom 0.31 % (0.2 %)

Setpoint changes

Ideal rank top: 3 -> 2

Modelling for control

[Figure: time series (samples 100 to 1000, values 900 to 903) and histograms for three cases: no control, quasi steady control, fast dynamic control]

Model Based Predictive Control for Flood Regulation: Demer

[Figure: schematic of the Demer river network labelled with variables v, h and q. Named watercourses and reservoirs: Demer, Zwartewater, Zwartebeek, Vlootgracht, Schulensmeer, Webbekom, Herk, Gete, Begijnebeek, Velpe, Leugebeek, Grote Leigracht, Houwersbeek. Blocks: MPC, Estimator, disturbances.]

Data Assimilation

Data assimilation is the common name given to several numerical techniques that combine the outputs of a numerical model with observational data in order to improve the quality of the model predictions.

Some data assimilation techniques: 3DVAR, 4DVAR, Ensemble Kalman Filter (EnKF) and its variants, Optimal Interpolation (OI), particle filters, etc.

Numerical model: x(k+1) = f(x(k), u(k)) + G w(k)

[Diagram: measurements and the numerical model feed the data assimilation techniques (EnKF, DEnKF, ETKF, OI), together with the parameters of the algorithms, to give a better estimation of x(k).]
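The core idea of combining a model output with a measurement can be shown for a single scalar state that is observed directly. The sketch below is the standard variance-weighted update that underlies Kalman-type and optimal interpolation schemes; it is not the DEnKF or the OI configuration used in this work, and all numbers are hypothetical.

```python
def analysis_step(forecast, forecast_var, observation, observation_var):
    """Blend a model forecast with a measurement, weighting each by its uncertainty."""
    gain = forecast_var / (forecast_var + observation_var)
    analysis = forecast + gain * (observation - forecast)
    analysis_var = (1.0 - gain) * forecast_var
    return analysis, analysis_var

if __name__ == "__main__":
    # Hypothetical values: model says 80, station measures 60, equal error variances.
    estimate, variance = analysis_step(80.0, 100.0, 60.0, 100.0)
    print(estimate, variance)
```

With equal variances the estimate lands halfway between model and measurement, and its variance is half that of either source.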

Data Assimilation

The Deterministic Ensemble Kalman Filter (DEnKF) and the OI technique have been used to improve the O3 estimates of the air-quality model AURORA.

Starting date: May 28th, 2005 at midnight

[Figure: map of the O3 air-quality stations (3 E to 6 E, 50 N to 51 N), marking validation stations and assimilation stations]

[Figure: average of the O3 concentration over the validation stations, concentration [µg/m3] versus time [hours], comparing measurements, Aurora (model), OI and DEnKF]

Average of the O3 concentration field over the 14 day period

[Figure: maps of the O3 concentration field (µg/m3) for Optimal Interpolation, DEnKF and the free run of Aurora; o = validation station, x = assimilation station]

Data Assimilation

The Deterministic Ensemble Kalman Filter (DEnKF) and the OI technique have been used to improve the PM10 estimates of the air-quality model AURORA.

Starting date: January 20th, 2010 at midnight

[Figure: map of the PM10 air-quality stations (3 E to 6 E, 50 N to 51 N), marking validation stations and assimilation stations]

[Figure: average of the PM10 concentration over the validation stations, concentration [µg/m3] versus time [hours], comparing measurements, Aurora, DEnKF and OI]

Average of the PM10 concentration field

[Figure: maps of the PM10 concentration field (µg/m3) for Optimal Interpolation, DEnKF and the free run of Aurora; o = validation station, x = assimilation station]

Average of the PM10 concentration field

[Figure: two maps of the PM10 concentration field (µg/m3)]

Journal Clustering

Find all about a specific topic?

Author Collaboration Clustering

Golub around the world commemoration February 29 2008

Web of Science based literature network for Lennart Ljung

• 138 seed papers + all cited and citing publications
• Result: 4943 nodes, 6216 edges
• Link-based clustering identifies topically homogeneous clusters.
• 13 papers are written by another L. Ljung.

Community detection

Clusters: “Bibliometrics”, “Patent analysis”, “Information retrieval”, “Social aspects”, “Webometrics”

Customer Intelligence

• Score & Target
• Analyze & Model
• Measure response
• Contact

Fraud detection on mobile phone network

Indicators (table columns): Short Duration, Long Duration, High Frequency, International, Same Destination, Off Peak, Call Forwarding, Behaviour Change

Fraud types (table rows, with the number of indicators marked X): Direct call selling (4), PABX fraud (5), Freephone fraud (4), Premium rate fraud (4), Subscription fraud (1), Handset theft (5)

[Figure axes: call frequency, average call duration]

Participatory

IOTA app: population-based assessment of ovarian tumour malignancy

IOTA app available in iTunes app store and on http://homes.esat.kuleuven.be/~sistawww/biomed/iota/

Centres: Leuven, Malmö, Monza, London, Maurepas, Paris, Rome, Naples, Milan, Lund, Lublin, Genk, Ontario (Canada), Bologna, Sardinia, Beijing (China)

# patients: 1066 + 1938

Performance

[Figure: performance of an expert, of IOTA models, of old models and of non-experts]

You share, we care!

Genomic markers for Leukemia

© Armstrong SA et al. Nat Genet. 2002 Jan;30(1):41-7.

12 600 genes, 72 patients:
- 28 Acute Lymphoblastic Leukemia (ALL)
- 24 Acute Myeloid Leukemia (AML)
- 20 Mixed-Lineage Leukemia (MLL)

Genomic Data Fusion

• High-throughput genomics
• Data analysis
• Candidate genes
• Information sources
• Candidate prioritization
• Validation

Endeavour: Aerts et al., Nature Biotechnology, 2006
