Serious Data Mining - SIM-Flanders · iMinds Dept. MEDICAL IT KU Leuven ESAT-STADIUS Serious Data...


iMinds Dept. MEDICAL IT

KU Leuven ESAT-STADIUS

Serious Data Mining

Prof.Dr. Bart De Moor

Bart.DeMoor@iminds.be

Content

Big Data: What, Who

Six issues: Data, Compute Infrastructure, Storage Infrastructure, Analytics, Visualization, Security & Privacy

Machine learning as a commodity

Expertise of ESAT-STADIUS, KU Leuven: Books & Spin-offs, Algorithms, Applications

1 million = 1 000 000
1 billion = 1 000 000 000
1 trillion = 1 000 000 000 000
1 quadrillion = 1 000 000 000 000 000

1 kB = 1 000 bytes
1 MB = 1 000 000 bytes
1 GB = 1 000 000 000 bytes
1 TB = 1 000 000 000 000 bytes
1 PB = 1 000 000 000 000 000 bytes

1 TB = large university library = 212 DVD discs = 1430 CDs = 3 years of music in CD quality


Moore’s law: computing power doubles every 18 months

Next Generation Sequencing

Carlson’s law: complexity/cost evolves exponentially
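To make the doubling claim concrete, here is a minimal sketch of the growth implied by a fixed doubling time. The function name `growth_factor` and the sample horizons are illustrative assumptions, not part of the slides.

```python
def growth_factor(years, doubling_months=18.0):
    """Return the multiplicative growth after `years` for a fixed doubling time."""
    # One doubling per `doubling_months`; the exponent is the number of doublings.
    return 2.0 ** (years * 12.0 / doubling_months)


if __name__ == "__main__":
    for years in (1.5, 3, 15):
        print(f"{years:>4} years -> x{growth_factor(years):,.0f}")
```

With an 18-month doubling time, 15 years correspond to 10 doublings, roughly a thousandfold increase.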


Genome data


• Human genome project (2003)

– 13 year project

– $300 million value with 2002 technology

• Personal genome (2007)

– Genome of James Watson, 2 months

– $1 000 000

• €1000-genome

– Expected 2012-2020

[Figure: cost per base pair and genome cost, 1990 to 2015, logarithmic scale from 1E-07 to 1E+11]

GS-FLX Roche Applied Science 454

Sequencers

Tsunami of medical data

PACS UZ Leuven: 1.6 PetaByte

Genomics core, HiSeq 2000 full-speed exome sequencing: 1 TeraByte / week

1 small animal image: 1 GigaByte

1 CD-ROM: 750 MegaByte

Sequencing all newborns by 2020 (125k births / year): 125 PetaByte / year

Index of 20 million biomedical PubMed records: 23 GigaByte

1 slice mouse brain MSI at 10 μm resolution: 81 GigaByte

Raw NGS data of 1 full genome: 1 TeraByte
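The newborn-sequencing volume follows directly from the other figures on this slide: 125k births per year at about 1 TeraByte of raw NGS data per genome. A small sketch of that arithmetic, using decimal units as elsewhere in the deck; the helper names are illustrative assumptions.

```python
# Decimal (SI) byte units, matching the unit table earlier in the deck.
KB, MB, GB, TB, PB = 10**3, 10**6, 10**9, 10**12, 10**15


def yearly_volume_bytes(genomes_per_year, bytes_per_genome=TB):
    """Raw data volume per year if every genome yields `bytes_per_genome`."""
    return genomes_per_year * bytes_per_genome


def discs_needed(total_bytes, disc_bytes):
    """Number of discs needed to hold `total_bytes` (integer ceiling division)."""
    return -(-total_bytes // disc_bytes)


if __name__ == "__main__":
    volume = yearly_volume_bytes(125_000)
    print(volume / PB, "PB per year")
    print(discs_needed(TB, 750 * MB), "CD-ROMs of 750 MB for one genome")
```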


Data explosion in finance

Growing ~ 30-50% every year, half of this is unstructured!

Data storage became 30% cheaper, yet budgets for data storage are still rising.


How big is big?

Google: ?

Microsoft: 1 million servers (2013)

Amazon: ?



Big Data

Data

Compute infrastructure

Storage

Analytics: Flowchart

Visualization

Security


Big Data Landscape

More and more analytics as a commodity!

Machine Learning as a commodity


Big Data Landscape

Many possible applications!

Energy

Industry

Environment

Social networks

Fraud and predictive analysis

Health

Focus on Serious Big Data


Analytics

Big Data Analytics

Numerical algorithms Rule-based


Main tasks

Prediction: Regression, Classification

Segmentation: Clustering

Anomalies: Outlier

What can we do?

Shopping cart analysis

Fraud detection

Face recognition

Movie recommendation

Just-in-time production

Credit worthiness

Disease spreading

Traffic management

Features:
- Hair color: black, blond
- Hair length: long, short
- Color clothes: orange, blue

Features, clusters, similarity, decision

Clustering/Classification
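The chain from features to clusters to a similarity-based decision can be sketched with a tiny k-means routine. This is an illustrative stand-in, not the method shown on the slide: the two numeric features, the starting centroids and the function names are all assumptions.

```python
def squared_distance(a, b):
    """Similarity measure: squared Euclidean distance between feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Decision step: index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, centroids, iterations=10):
    """Plain k-means with caller-supplied (hypothetical) starting centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


if __name__ == "__main__":
    # Two made-up numeric features per person (for example, encoded hair length and clothes color).
    data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
    centers, labels = kmeans(data, [(0.0, 0.0), (6.0, 6.0)])
    print(centers, labels)
    print("new point goes to cluster", nearest((4.5, 4.0), centers))
```

Once the clusters are found, a new observation is classified by the same similarity measure, which is the "decision" step.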


Forecasting


Deep Learning

● Neural networks.
● New algorithms.
● Multiple layers on top of each other.
● Each layer learns a more complex representation.
● Learn feature hierarchies.
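As a rough illustration of "multiple layers on top of each other", the sketch below runs a forward pass through two stacked layers and keeps the intermediate representation of each. The weights are hand-picked, hypothetical values; no training is shown, and the ReLU activation is an assumption.

```python
def relu(values):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in values]


def dense(weights, bias, inputs):
    """One fully connected layer: weights is a list of rows, one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def forward(layers, inputs):
    """Return the activations of every layer, input first, output last."""
    activations = [list(inputs)]
    for weights, bias in layers:
        activations.append(relu(dense(weights, bias, activations[-1])))
    return activations


if __name__ == "__main__":
    # Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output.
    layers = [
        ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
        ([[1.0, 1.0]], [-1.0]),
    ]
    for depth, activation in enumerate(forward(layers, [2.0, 1.0])):
        print("layer", depth, activation)
```

Each entry in the returned list is the representation computed by one layer from the representation below it.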

Stadius - Books


Stadius - Spin-offs


DATA ANALYTICS

Columns: Supervised learning, Unsupervised learning, Statistical methods, Other analytical techniques

General methodology: Dusan Popovic, Marc Claesen, Jaak Simm, Peter Roelants
Ensemble methods: Dusan Popovic, Marc Claesen, Jaak Simm
Semi-supervised learning: Marc Claesen
Large-scale learning: Marc Claesen, Nico Verbeeck
Hyperparameter optimization: Marc Claesen, Dusan Popovic
Optimization meta-heuristics: Dusan Popovic, Maira Rodriguez
Neural networks: Peter Roelants, Dusan Popovic
Deep learning: Peter Roelants
ICA: Nico Verbeeck
Clustering: Yousef el Alamaat
Matrix factorization: Jaak Simm
Manifold learning: Xian Mao
Decision trees: Dusan Popovic, Jaak Simm
Regularized regression & GLM: Yousef el Alamaat, Arnaud Installe
Wavelet compression: Nico Verbeeck
Visualization: Toni Verbeiren, Ryo Sakai
Multi-task learning: Jaak Simm
Data fusion: Dusan Popovic, Pooya Zakeri, Gorana Nikolic
General methodology: Yousef el Alamaat, Dusan Popovic, Marc Claesen
SVMs & kernel methods: Marc Claesen, Jaak Simm, Pooya Zakeri
Survival analysis: Marc Claesen

Algorithms in Stadius


Big Data Applications

Large-scale science

Bio-Informatics

Data Assimilation

Health

Smart City

Electricity

Internet of Things

Traffic

Finance

Risk Assessment

Stock Trading

Customer Relation Management

Fraud Detection

Churn

Text Mining

Stadius


BIOMEDICAL APPLICATIONS

Columns: Diseases/Bodily processes/Tissues, Technological platforms/data types, IT support

Cancer: Dusan Popovic, Xian Mao
Diabetes: Marc Claesen
Endometriosis: Yousef el Alamat, Dusan Popovic, Nico Verbeek
Mass Spectrometry: Nico Verbeek, Yousef el Alamat, Xian Mao
Genomics: Inge Thijs, Dusan Popovic, Maira Rodriguez, Ryo Sakai
Proteomics: Inge Thijs, Yousef el Alamat, Xian Mao, Ryo Sakai, Pooya Zakeri
Microarrays: Dusan Popovic, Griet Laenen, Pooya Zakeri
miRNA: Jaak Simm
RNA sequencing: Jaak Simm
NGS: Amin Ardeshirdavani, Jaak Simm
Metagenomics: Jaak Simm
Rare genetic disorders: Dusan Popovic, Ryo Sakai
Object-oriented programming: Arnaud Installe, Amin Ardeshirdavani, Gorana Nikolic, Marc Claesen, Toni Verbeiren, Xian Mao
Web applications: Amin Ardeshirdavani, Gorana Nikolic, Toni Verbeiren
Databases: Arnaud Installe, Amin Ardeshirdavani, Gorana Nikolic
Big data technologies (e.g. Hadoop, MapReduce): Amin Ardeshirdavani, Jaak Simm, Toni Verbeiren
Distributed computing: Toni Verbeiren
Glucose level monitoring: Tom Van Herpe
Mouse brain: Nico Verbeek, Yousef el Alamat, Xian Mao
High-performance computing: Jaak Simm
Bacteria & Fungi: Jaak Simm
Agent-based systems: Maira Rodriguez
Clinical data: Arnaud Installe, Dusan Popovic, Yousef el Alamat
Annotation data: Pooya Zakeri
Biological networks: Griet Laenen, Pooya Zakeri
Drug discovery: Griet Laenen, Pooya Zakeri


Power grid


Electric load forecasting: Problem

Energy Production

Energy Demand

How to forecast the demand?


Electric load forecasting

Data sets

Elia: transmission system operator, hourly data, 10 substations considered

KMI: Meteorological Institute, weather conditions

Electric load forecasting

Seasonal Effects

What hour? What day? What week?

Weather Effects

Temperature? Sunny?

Accurate forecast

Model update: Every week!


250 transformer substations

Every 15 min, 5 years

1 post, 1 week

1 post, four seasons

Electric load forecasting


Seasonalities in the load: day, week, year, holidays


6 posts, 1 year

Seasonalities, calendar holidays!

- Seasonalities = a priori information (regress Monday on Monday!)
- Normalization:
  - remove effects of temperature, cloudiness, ...
  - remove effect of holidays, calendar days (use dummies)
- Calculate ‘eigenprofiles’ = daily shape per post
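The "regress Monday on Monday" idea can be illustrated with a much simpler stand-in: average the load per weekday and hour, then normalise each daily curve to its own mean to obtain a daily shape. This is a plain per-weekday average under assumed inputs, not the eigenprofile computation referred to on the slide; the record layout `(weekday, hour, load)` and the function names are hypothetical.

```python
from collections import defaultdict


def weekday_profiles(records):
    """Mean load per hour for every weekday.

    records: iterable of (weekday, hour, load) with weekday 0..6 and hour 0..23.
    Returns {weekday: [mean load for hour 0, ..., hour 23]}.
    """
    sums = defaultdict(lambda: [0.0] * 24)
    counts = defaultdict(lambda: [0] * 24)
    for weekday, hour, load in records:
        sums[weekday][hour] += load
        counts[weekday][hour] += 1
    return {
        day: [s / c if c else float("nan") for s, c in zip(sums[day], counts[day])]
        for day in sums
    }


def daily_shape(profile):
    """Normalise a 24-hour profile by its mean so only the shape remains."""
    mean = sum(profile) / len(profile)
    return [value / mean for value in profile]


if __name__ == "__main__":
    # Two synthetic Mondays (weekday 0) with a slightly different level.
    records = [(0, h, 100.0 + h) for h in range(24)]
    records += [(0, h, 102.0 + h) for h in range(24)]
    monday = weekday_profiles(records)[0]
    print(monday[:3])
    print(daily_shape(monday)[:3])
```

A forecast for next Monday would then start from the Monday profile only, with weather and holiday effects handled separately as the normalisation bullets describe.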



Top 99 %, Bottom 1 %, steam 15.2 T/h

Top 99.75 % (99.8 %), Bottom 0.25 % (0.2 %), steam 20 T/h

Top 99.8 % (99.8 %), Bottom 0.31 % (0.2 %), steam 20 T/h

Setpoint changes

Idealrank top 3 -> 2

Modelling for control

No control, quasi steady state and fast control

[Figure: time series (values between 900 and 903, samples 100 to 1000) and histograms for three cases: no control, quasi-steady control, fast dynamic control]

[Figure: schematic of the Demer river network (Demer, Zwartewater, Zwartebeek, Vlootgracht, Schulensmeer, Webbekom, Herk, Gete, Begijnebeek, Velpe, Leugebeek, Grote Leigracht, Houwersbeek), annotated with q, v and h variables; blocks: MPC, Estimator, disturbances]

Model Based Predictive Control for Flood Regulation: Demer



Data assimilation is the common name given to several numerical techniques that combine the outputs of a numerical model with observational data in order to improve the quality of the model predictions.

Some data assimilation techniques: 3DVAR, 4DVAR, Ensemble Kalman Filter (EnKF) and its variants, Optimal Interpolation (OI), particle filters, etc.

Measurements

Numerical model: $x(k+1) = f\big(x(k), u(k)\big) + G\,w(k)$

Data assimilation techniques: EnKF, DEnKF, ETKF, OI

Parameters of the algorithms

Better estimation of x(k)

Data Assimilation
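The core idea, blending a model forecast with a measurement according to their uncertainties, can be shown with a scalar Kalman-style update. This is a deliberately simplified sketch under stated assumptions (one state variable, a linear model x(k+1) = a x(k) plus noise, direct observation of the state); it is not the DEnKF or the OI implementation used with AURORA, and all names and parameters are hypothetical.

```python
def analysis_update(x_forecast, var_forecast, y, var_obs):
    """Combine a forecast and a measurement, weighting each by its uncertainty."""
    gain = var_forecast / (var_forecast + var_obs)   # 0 = trust model, 1 = trust data
    x_analysis = x_forecast + gain * (y - x_forecast)
    var_analysis = (1.0 - gain) * var_forecast
    return x_analysis, var_analysis


def assimilate(x0, var0, observations, a=1.0, var_model=0.0, var_obs=1.0):
    """Alternate a forecast step (scalar linear model) with an analysis step."""
    x, var = x0, var0
    estimates = []
    for y in observations:
        # Forecast: propagate the state and its variance through the model.
        x, var = a * x, a * a * var + var_model
        # Analysis: correct the forecast with the new measurement.
        x, var = analysis_update(x, var, y, var_obs)
        estimates.append(x)
    return estimates


if __name__ == "__main__":
    print(analysis_update(10.0, 4.0, 14.0, 4.0))
    print(assimilate(0.0, 1.0, [1.0, 1.0]))
```

With equal forecast and observation variances the analysis lands halfway between model and measurement, and its variance is halved.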


[Figure: average of the O3 concentration over the validation stations, concentration [µg/m3] versus time [hours]; curves: Measurements, Aurora (model), OI, DEnKF]

[Figure: map of the O3 air-quality stations (3°E to 6°E, 50°N to 51°N), marking validation stations and assimilation stations]

The Deterministic Ensemble Kalman Filter (DEnKF) and the OI technique have been used to improve the O3 estimates of the air-quality model AURORA.

Starting date: May 28th, 2005 at midnight

Data Assimilation


[Figure: average of the O3 concentration field over the 14 day period [µg/m3]; panels: Optimal Interpolation, DEnKF, Free-run of Aurora; o = validation station, x = assimilation station]

[Figure: average of the PM10 concentration over the validation stations, concentration [µg/m3] versus time [hours]; curves: Measurements, Aurora, DEnKF, OI]

[Figure: map of the PM10 air-quality stations (3°E to 6°E, 50°N to 51°N), marking validation stations and assimilation stations]

The Deterministic Ensemble Kalman Filter (DEnKF) and the OI technique have been used to improve the PM10 estimates of the air-quality model AURORA.

Starting date: January 20th, 2010 at midnight

Data Assimilation


[Figure: average of the PM10 concentration field [µg/m3]; panels: Optimal Interpolation, DEnKF, Free-run of Aurora; o = validation station, x = assimilation station]


Journal Clustering

Find all about specific topic?

Author Collaboration Clustering

Golub around the world commemoration February 29 2008


138 seed papers + all cited and citing publications

Result: 4943 nodes, 6216 edges

Link-based clustering identifies topically homogeneous clusters.

13 papers are written by another L. Ljung.

Web of Science based literature network for Lennart Ljung

“Bibliometrics”

“Patent analysis” “Information retrieval”

“Social aspects”

“Webometrics”

Community detection



Customer Intelligence


Score & Target

Analyze & Model

Measure response

Contact

Indicators: Short Duration, Long Duration, High Frequency, International, Same Destination, Off Peak, Call Forwarding, Behaviour Change

Fraud type and marked indicators:
Direct call selling: X X X X
PABX fraud: X X X X X
Freephone fraud: X X X X
Premium rate fraud: X X X X
Subscription fraud: X
Handset theft: X X X X X

[Figure axes: call frequency, average call duration]

Fraud detection on mobile phone network
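One of the indicators listed, behaviour change, can be approximated by comparing a subscriber's current call frequency and average call duration with their own history. The sketch below uses a simple z-score rule; the threshold of 3 standard deviations, the tuple layout and the function names are illustrative assumptions, not the detection method behind the slide.

```python
from statistics import mean, stdev


def zscore(value, history):
    """Number of sample standard deviations `value` lies from the historical mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return 0.0 if value == mu else float("inf")
    return (value - mu) / sigma


def flag_behaviour_change(history, today, threshold=3.0):
    """Return the names of the features that deviate strongly from the history.

    history: list of (calls_per_day, average_call_duration_minutes) tuples.
    today:   one tuple with the same layout.
    """
    names = ("call frequency", "average call duration")
    flags = []
    for index, name in enumerate(names):
        past_values = [day[index] for day in history]
        if abs(zscore(today[index], past_values)) > threshold:
            flags.append(name)
    return flags


if __name__ == "__main__":
    past = [(10, 2.0), (12, 2.2), (11, 1.9), (9, 2.1), (10, 2.0)]
    print(flag_behaviour_change(past, (40, 2.0)))   # sudden burst of calls
    print(flag_behaviour_change(past, (10, 2.0)))   # ordinary day
```

A per-subscriber baseline is used here because a call pattern that is normal for one customer may be a strong anomaly for another.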



Participatory

IOTA app: population-based assessment of ovarian tumour malignancy

IOTA app available in iTunes app store and on http://homes.esat.kuleuven.be/~sistawww/biomed/iota/

# patients: 1066 + 1938

Locations: Leuven, Malmö, Monza, London, Maurepas, Paris, Rome, Naples, Milan, Lund, Lublin, Genk, Ontario (Canada), Bologna, Sardinia, Beijing (China)

Performance

Performance of an expert Performance of IOTA models

Performance of old models

Performance of non-experts

You share, we care!

© Armstrong SA et al. Nat Genet. 2002 Jan;30(1):41-7.

12 600 genes, 72 patients

- 28 Acute Lymphoblastic Leukemia (ALL)
- 24 Acute Myeloid Leukemia (AML)
- 20 Mixed-Lineage Leukemia (MLL)

Genomic markers for Leukemia


High-throughput genomics

Data analysis Candidate genes

Information sources

Candidate prioritization

Validation

Endeavour: Aerts et al., Nature Biotechnology, 2006

Genomic Data Fusion
