András Benczúr
[email protected]
Head, Informatics Laboratory
Informatics Laboratory, Big Data Research Group
http://datamining.sztaki.hu/
Support from NADINE: New tools and Algorithms for DIrected NEtwork analysis (ICT-2011.9.1 FET Open No 288956)
Institute for Computer Science and Control, Hungarian Academy of Sciences

Computer and Automation Institute Hungarian Academy of ...




About us: Knowledge Discovery in Big Data

Machine Learning – applied lab, emphasis on business understanding and prototyping:

• Search solutions, recommender systems

• Visual Analytics

• Big Data

• Applications in
o Web, Social Media
o Customer data analysis
o Sensor, Mobility, IT logs
• Data driven methods
o Connection with models by …


Partner Lab: Systems and Control

• Jozsef Bokor, Balint Vanek
• EU FP7 Addsafe
o Model based fault detection and isolation (FDI) in aerospace
o Demonstrate applicability of advanced FDI for aircraft flight control in support of the development of the European sustainable transport
o Their FDD algorithms were selected for industrial validation and implemented on the Iron Bird test platform at Airbus Toulouse
• EU FP7 RECONFIGURE
o Aircraft guidance and control (G&C) technologies that facilitate the automated handling of off-nominal/abnormal events
o Their algorithms are being implemented in the OSMA high-fidelity simulation environment of Airbus
• US Naval Research grant
o Sense and Avoid camera system for UAVs


About us: Collaboration

• Collaboration: France
o Dima Shepelyansky, Klaus Frahm, U Toulouse: Network Modeling
o Anirvan Basu, INRIA Rennes: Big Data, Grid’5000 HPC
o Internet Memory, Paris: Web Mining
• Collaboration: EU
o Security FP7 projects with BAES
o EIT ICTLabs Big Data collaboration
• Collaboration: Industry in Hungary
o Ericsson: mobile sensor log analytics
o AEGON: fusing customer data sets; car accident claim fraud detection
o Telekom, Vodafone: search engine


Algorithmic and Modeling techniques for Data Fusion
A personal selection:
1. Time series classification by dynamic time warping similarity kernels
2. L2 very sparse matrix completion with side information
3. Systems for machine learning in data streams


Time series classification

Root-cause analysis: Cause of Mobile Session Release and Drop

0.6% of sessions are dropped


Mobile session drop prediction

LTE Base Station eNodeB CELLTRACE logs

[Figure: timeline of one mobile session from START to END, ending with no drop or drop]
• START: RRC connection setup / Successful handover into the cell
• END: UE context release / Successful handover out of the cell
• Per UE measurement reports (RSRP, neighbor cell RSRP list)
• Per UE traffic report (traffic volumes, protocol events (HARQ, RLC))
• Per radio UE measurement (CQI, SINR)
• Period: 1.28s


Baseline: AdaBoost
[Figure: ROC curve] AUC: 0.9315; FPR: 0.03, TPR: 0.7; FPR: 0.2, TPR: 0.89
• Base classifiers are Decision Stumps: C_1, C_2, …, C_T (attribute-threshold pairs)
• In step m, find best C_m in predefined class using weights w_i
• Error rate ε_m, sent through logit function to get α_m, the importance of the classifier
• Weight of an instance (mobile session):
$$ w_i^{(m+1)} = \frac{w_i^{(m)}}{Z_m} \cdot \begin{cases} \exp(-\alpha_m) & \text{if } C_m(x_i) = y_i \\ \exp(\alpha_m) & \text{if } C_m(x_i) \neq y_i \end{cases} $$
Best attributes selected:
1. Maximum of RLC uplink
2. Mean of RLC uplink
3. HARQNACK downlink Max
4. Mean Change in RLC uplink
5. Mean of SINR PUSCH
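The AdaBoost baseline can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the project: the function names are hypothetical, labels are assumed to be in {-1, +1}, and the importance is computed with the common choice α_m = ½ ln((1 − ε_m)/ε_m), which is one reading of "sent through logit function".

```python
import math

def stump_predict(row, attr, threshold, sign):
    """Decision stump: an attribute-threshold pair with an orientation."""
    return sign if row[attr] > threshold else -sign

def train_adaboost(X, y, rounds):
    """X: list of feature rows, y: labels in {-1, +1}.
    Returns a list of (alpha, attr, threshold, sign) tuples."""
    n = len(X)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        # Step m: find the stump with the lowest weighted error.
        best = None
        for attr in range(len(X[0])):
            for threshold in sorted({row[attr] for row in X}):
                for sign in (1, -1):
                    err = sum(w for w, row, label in zip(weights, X, y)
                              if stump_predict(row, attr, threshold, sign) != label)
                    if best is None or err < best[0]:
                        best = (err, attr, threshold, sign)
        err, attr, threshold, sign = best
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1.0 - err) / err)
        model.append((alpha, attr, threshold, sign))
        # exp(-alpha) for correctly classified instances, exp(+alpha) otherwise.
        weights = [w * math.exp(-alpha * label * stump_predict(row, attr, threshold, sign))
                   for w, row, label in zip(weights, X, y)]
        z = sum(weights)  # normalizer Z_m
        weights = [w / z for w in weights]
    return model

def predict(model, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(row, attr, threshold, sign)
                for alpha, attr, threshold, sign in model)
    return 1 if score >= 0 else -1
```

In the session-drop setting, each row would hold per-session aggregates such as the maximum or mean of RLC uplink, and the label would mark drop versus no drop.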


Dynamic Time Warping
[Figure: two time series compared point by point (i with i) and with an elastic alignment (i with i+2); horizontal axis: time]
Any distance (Euclidean, Manhattan, …) which aligns the i-th point on one time series with the i-th point on the other will produce a poor similarity score.
A non-linear (elastic) alignment produces a more intuitive similarity measure, allowing similar shapes to match even if they are out of phase in the time axis.
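The elastic alignment can be illustrated with the textbook dynamic programming recurrence. The sketch below assumes one-dimensional series and absolute difference as the local cost; the function name is hypothetical and no warping-window constraint is applied.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best alignment cost of the first i points of a
    # with the first j points of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def manhattan(a, b):
    """Rigid point-by-point distance for comparison."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

For the shifted shapes `[0, 0, 1, 2, 1, 0]` and `[0, 1, 2, 1, 0, 0]`, the rigid distance is 4 while the elastic distance is 0, which is the effect described above.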


New method: Similarity Kernel

• Full similarity matrix is too large, and …
• We have six and not just one time series, hence …
• Select a set R of labeled instances, by …
o Random sampling
o Measuring the importance
• Each instance is represented by a 6|R| dimensional vector of distances from instances in R
• Choose appropriate metric
o AdaBoost and other methods perform poorly for the similarity representation
o Use theoretical foundation of the Markov Random Field generated by pairs of sessions
• Proof: Fisher information kernel equal to the linear kernel
• Linear Support Vector Machine over the 6|R| dim representation normalized as given by the theory
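The distance-based representation can be sketched as follows. This is only an illustration under assumptions: a session is taken to be a list of per-channel time series (six in the talk), the distance function is passed in (a DTW distance would be used in practice), and the theory-driven normalization and the linear SVM are not reproduced here. All names are hypothetical.

```python
def similarity_features(session, references, distance):
    """Represent a session by its distances to every reference session.

    session:    list of channels, each a list of numbers
    references: list of sessions with the same channel layout (the set R)
    distance:   function taking two series and returning a number
    Returns a flat vector with len(session) * len(references) entries.
    """
    return [distance(series, reference[channel])
            for reference in references
            for channel, series in enumerate(session)]

def pointwise_distance(a, b):
    """Stand-in distance for the example; DTW would replace it."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

With six channels and |R| reference sessions this yields the 6|R| dimensional vector on which a linear classifier can be trained.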


Results

• New method beats the baseline

• Used machine learning theory to set appropriate normalization

• In a continuation project, we will perform root cause analysis of radio station error logs from a wide class
o In part, unsupervised task

• Fruitful collaboration also resulting in new theory for (multi)time series


L2 Matrix Completion with side info

• Learn low rank models of observations, e.g. to predict risk
o Paired sensor events with detailed reports coming from separate units?
• Predict for unknown matrix elements – VERY SPARSE (<1%) case
o Stochastic Gradient Descent
o Alternating Least Squares
o Restricted Boltzmann Machines
• Generalization: dyadic and multi-adic classification
o Steffen Rendle’s Factorization Machine
[Figure labels: user transaction t gender age factor contact history]


Data Example: RecSys Challenge 2014

A multi-facet matrix completion task: 2014.recsyschallenge.com
user_id 824143586
tweet_id 324077847926931000
movie_id 1045658
rating 8
rating_diff_from_avg 0,8
user_created_at 1347659915
user_followers_count 165
user_friends_count 258
user_favourites_count 33
user_statuses_count 451
tweet_creation_time 1366097554
tweet_hashtag_num 1
tweet_url_num 1
tweet_mention_num 0
tweet_is_retweet 0
tweet_has_retweet 0
movie_avg_rating 41829
movie_rating_num 384348
movie_genre Romance, Comedy
…
engagement ?


Matrix completion simplest algorithm
[Figure (animated): the sparse matrix R of known values is approximated by the product of two factor matrices P and Q; the factor entries are adjusted step by step until the product reproduces the known cells and supplies predictions for the unknown ones]
Source: D. Tikk, Gravity RD
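A minimal sketch of such a factorization, trained by stochastic gradient descent on the known cells only, is given below. It is not the algorithm of the cited source: the learning rate, the L2 regularization weight, the rank k and the toy data are assumptions chosen for illustration, and all names are hypothetical.

```python
import random

def factorize(ratings, n_rows, n_cols, k=2, lr=0.02, reg=0.01,
              epochs=2000, seed=0):
    """ratings: list of (row, col, value) triples for the known cells.
    Returns factor matrices P (n_rows x k) and Q (n_cols x k)."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_rows)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_cols)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(pf * qf for pf, qf in zip(P[u], Q[i]))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on the squared error plus L2 penalty.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict_cell(P, Q, u, i):
    """Predicted value of cell (u, i), known or unknown."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

def rmse(ratings, P, Q):
    """Root mean squared error over the given cells."""
    total = sum((r - predict_cell(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5
```

After training, `predict_cell` can be called for any cell that was not in `ratings`, which is the matrix completion step.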


The Factorization Machine

• General formula:

• Matrix factorization: x=(0…0,1,0…0;0…0,1,0…0)

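• Can be extended with more context:
o Tensors
o Nearest neighbors
o Time series, history
o …
[Figure labels for the terms of the general formula: Global bias; Regression: strength of variable i; Pairwise interaction; Factorization]
[Figure labels for the matrix factorization example: Row, Column; Sensor 1, Sensor 2]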
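The formula itself did not survive in this transcript. As a hedged reconstruction, the second-order factorization machine as published by Rendle is $\hat{y}(x) = w_0 + \sum_i w_i x_i + \sum_{i<j} \langle v_i, v_j \rangle x_i x_j$, which matches the term labels above (global bias, regression, factorized pairwise interaction). The sketch below evaluates it with the usual linear-time rewriting of the pairwise term; parameter names are assumptions and training is not shown.

```python
def fm_predict(x, w0, w, V):
    """Second-order factorization machine prediction.

    x:  feature vector (list of numbers)
    w0: global bias
    w:  per-feature regression weights
    V:  per-feature factor vectors, V[i] has k entries
    """
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(len(V[0])):
        # sum_{i<j} v_if v_jf x_i x_j = ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2) / 2
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise
```

With a one-hot row block followed by a one-hot column block, as in the matrix factorization case x=(0…0,1,0…0;0…0,1,0…0), the pairwise term reduces to the inner product of one row factor and one column factor.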


Advantages over standard methods

• Also works for very sparse (<1% known) matrices
o Mathematical theory of dimensionality reduction for full matrices only
• Can handle regularization terms
o Traditional methods overfit to training data and fail with predictions
• May be used for learning from data streams
o For time sensitive modeling, old events are slowly “forgotten”
• Minimizes L2 not L1 (as in Compressed Sensing)
o More emphasis on large prediction error
• Highly scalable (Netflix data 100M ratings, < 1 hour)
• Source: recommender systems
o Not yet applied to sensor data
o Applicability for dynamic systems (similar to Kalman filters) not yet explored


Performance Example: RecSys Challenge 2014

A multi-facet matrix completion task: 2014.recsyschallenge.com

Method                                        Score
Winner (INRIA)                                0.876
SZTAKI (2nd place), best combination          0.874
Support Vector Machines                       0.868
Gradient Boosted Tree                         0.871
Linear Regression                             0.860
Logistic Regression                           0.860
Matrix Factorization with Side Information    0.862
LibFM: Factorization Machine                  0.841
Learning to Rank (NDCGBoost)                  0.862



“New” computational paradigms

• MapReduce and beyond
o Originates from Google
o Hadoop as mature open source framework
o Hadoop reached its limits with matrix operations, machine learning
o Emerging technologies: Graph frameworks, Stratosphere/Flink
• GPGPU: CUDA, OpenCL, …
o Limitations: low memory, low level I/O support only, costly programming
• Data streams
o Models: low space sketches, database synopses, approximate counting
o Frameworks: Apache Spark, Twitter Storm, Stratosphere/Flink
• Fully distributed cooperative learning
o Based on peer-to-peer and ad hoc networks
o Current research focusing on privacy
o Will have security applications too…


Content

1. Data structures for data streaming
2. New software issues in data streaming systems
   1. Available operations
   2. Combination of batch computed models with real time data
3. Fully distributed learning


Data structures for streams
• Idea: Memory is limited, stream is potentially infinite
o Sliding windows
o Special tricks for sampling
o So-called “sketches” or “synopses” that summarize approx info
• Bloom filters: low space approximate membership testing
• Tug-of-War sketches: low space joins and moments
o Random Hash h(i): {0,1,…,N-1} → {-1,1}
o Define Z_i = h(i)
o Maintain X = Σ_i m_i Z_i
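The Tug-of-War sketch can be illustrated as follows, reading m_i as the number of occurrences of item i in the stream, so that X² estimates the second moment Σ_i m_i². This is a simplified sketch under assumptions: the ±1 hash is derived from SHA-256 with a random salt rather than from a 4-wise independent hash family, and the estimates are averaged instead of using a median of means. Class and method names are hypothetical.

```python
import hashlib
import random

class TugOfWarSketch:
    """Keeps several independent counters X = sum_i m_i * h(i)."""

    def __init__(self, num_counters=64, seed=0):
        rng = random.Random(seed)
        # One salt per counter stands in for one random hash function h.
        self.salts = [rng.getrandbits(32) for _ in range(num_counters)]
        self.counters = [0] * num_counters

    @staticmethod
    def _sign(salt, item):
        """Pseudo-random hash of the item into {-1, +1}."""
        digest = hashlib.sha256(f"{salt}:{item!r}".encode("utf-8")).digest()
        return 1 if digest[0] & 1 else -1

    def add(self, item, count=1):
        """Process `count` occurrences of `item` from the stream."""
        for idx, salt in enumerate(self.salts):
            self.counters[idx] += count * self._sign(salt, item)

    def second_moment(self):
        """Average of X^2 over the counters, an estimate of sum_i m_i^2."""
        return sum(x * x for x in self.counters) / len(self.counters)
```

Memory use depends only on the number of counters, not on the number of distinct items, which is the point of the "low space" claim. Because the sketch is linear, deletions can be handled by adding a negative count.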


Software Technology: Map and Reduce

• All starts here: Google idea to build search index

[Figure: Map and Reduce are second-order functions; each takes a first-order function (user code) and data]
Word count example, by stage:
Input:     data air tlse / stream tlse data / tlse air stream
Splitting: [data air tlse] [stream tlse data] [tlse air stream]
Mapping:   [data,1 air,1 tlse,1] [stream,1 tlse,1 data,1] [tlse,1 air,1 stream,1]
Shuffling: [air,1 air,1] [tlse,1 tlse,1 tlse,1] [data,1 data,1] [stream,1 stream,1]
Reducing:  air,2 tlse,3 data,2 stream,2
Output:    air,2 data,2 stream,2 tlse,3
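The word count stages can be mimicked in a single process to show where the user code plugs in. This is a sketch only: the function names are hypothetical, and a real framework would run the map and reduce calls on different machines.

```python
from collections import defaultdict

def map_phase(line):
    """First-order user function for Map: emit (word, 1) per word."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """First-order user function for Reduce: sum the counts of one word."""
    return key, sum(values)

def word_count(splits):
    """Run Map over every split, shuffle, then Reduce every group."""
    mapped = [pair for line in splits for pair in map_phase(line)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle(mapped).items())
```

Calling `word_count(["data air tlse", "stream tlse data", "tlse air stream"])` reproduces the counts of the example: air 2, data 2, stream 2, tlse 3.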


Parallelization second order functions

• Introduced by Stratosphere/Apache Flink

• Complex workflows automatically optimized
o Model: RDBMS executing SQL commands
[Figure: further second-order functions Cross, Join and CoGroup; each takes a first-order function (user code) and data]


Parallel Matrix Completion may be Difficult

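• Alternating Least Squares single iteration:
o $q_i = (P^T P)^{-1} P^T R_i = (P^T P)^{-1} \sum_{j=1}^{N} P_j^T R_{ij}$
o Partition by i
o Broadcast $P^T P$, just a k×k matrix – but vast communication overhead
• More iterations
[Figure: each partition holds $(P^T P)^{-1}$ together with its own pairs, e.g. $P_a^T, R_{ia}$ and $P_e^T, R_{ie}$]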
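One half-iteration of Alternating Least Squares, following the formula for q_i with P held fixed, can be sketched in plain Python. The sketch makes the same simplification as the formula: every entry of R_i is treated as observed and no regularization is added. The helper names and the small Gauss-Jordan solver are assumptions for illustration.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def als_update_q(R, P):
    """R[i][j] are the observations, P[j] is the k-dimensional factor P_j.
    Returns Q with Q[i] = (P^T P)^{-1} sum_j P_j^T R_ij."""
    k = len(P[0])
    # The k x k Gram matrix P^T P is computed once and shared (broadcast).
    gram = [[sum(p[a] * p[b] for p in P) for b in range(k)] for a in range(k)]
    Q = []
    for R_i in R:  # partition by i: every q_i is computed independently
        rhs = [sum(p[a] * r for p, r in zip(P, R_i)) for a in range(k)]
        Q.append(solve(gram, rhs))
    return Q
```

The loop over `R` is the part that parallelizes by partitioning on i. Note that each partition also needs the factors P_j matching its observations, which is one plausible source of the communication overhead mentioned above.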


Streaming parallelization

• Same operations available over streams

• Workflow optimization

• The example is a matrix completion learning workflow
o Feedback loop for previous model iteration
[Figure: streaming dataflow with feedback; operators: map, join, reduce, join]


The Streaming Lambda Architecture

• Model precomputed by analyzing very large historic collections stored on, e.g., a large distributed file system

• Model needs to be (1) applied to predict over a stream (2) adapted to changes in environment
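The two requirements can be illustrated with a deliberately tiny model. In this sketch the "model" is just a running mean: it is precomputed from a historic batch, applied to predict over the stream, and adapted with an exponential forgetting factor. The class name and the forgetting parameter are assumptions; a real deployment would use a richer model and a streaming framework.

```python
class BatchThenStreamModel:
    """Mean estimate precomputed in batch, then adapted on the stream."""

    def __init__(self, history, forgetting=0.1):
        # Batch layer: one pass over the stored historic collection.
        self.estimate = sum(history) / len(history)
        self.forgetting = forgetting

    def predict(self):
        """(1) Apply the current model to the next stream element."""
        return self.estimate

    def update(self, value):
        """(2) Adapt to the observed value; old events fade gradually."""
        self.estimate += self.forgetting * (value - self.estimate)
```

For example, a model built from the history `[10, 10, 10]` with `forgetting=0.5` predicts 10, then 15 after observing 20 once, and 17.5 after observing 20 twice.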


Fully Distributed Modeling

• Needs no central service – suitable for:
o Ad hoc networks
o Privacy requirements
• Model delta updates are sent to peers
• Results for applicability in:
o Classification
o Matrix completion
[Figure labels: R, P1Q1; R, P2(Q2+Q); Measurement; Q]

Hegedus, I., Jelasity, M., Kocsis, L., & Benczúr, A. A. (2014). Fully distributed robust singular value decomposition. In Peer-to-Peer Computing (P2P) IEEE. Best Paper


Conclusions

• Three main areas
o Multi time series in error prediction
o Matrix completion with very large number of missing values
o Data stream processing
• For data streaming solutions, we have to combine
o Batch pre-computed models updated real time (lambda architecture)
o Very low memory data approximation
o Carefully selected database operations to optimize communication
• Machine learning, prediction, classification made
o Highly time sensitive, streaming
o Fully distributed: each element learns by passing model error to peers


Recent publications

• Pálovics, R., Benczúr, A. A., Kocsis, L., Kiss, T., & Frigó, E. (2014). Exploiting temporal influence in online recommendation. In Proceedings of the 8th ACM Conference on Recommender systems. ACM.

• Pálovics et al., RecSys Challenge 2014: an ensemble of binary classifiers and matrix factorization (2nd place)

• Hegedus, I., Jelasity, M., Kocsis, L., & Benczúr, A. A. (2014). Fully distributed robust singular value decomposition. In Peer-to-Peer Computing (P2P) IEEE. Best Paper

• Erdelyi et al., The classification power of Web features. Internet Mathematics, 2014

• L. Kocsis, A. György, A. N. Bán., BoostingTree: Parallel Selection of Weak Learners in Boosting, with Application to Ranking. Machine Learning, 2013.

• Garzo et al., Real-time streaming mobility analytics. NetMob 2013

• Eom, Frahm, Benczur, Shepelyansky. Time evolution of Wikipedia network ranking. Europ J Phys 2014.

• C. Sidló, A. Garzó, A. Molnár, A.A. Benczúr, Infrastructures and Bound for Distributed Entity Resolution, in Proc. QDB in conj. VLDB 2011.

• Gelly, S., Kocsis, L., Schoenauer, M., Sebag, M., Silver, D., Szepesvári, C., & Teytaud, O. (2012). The grand challenge of computer Go: Monte Carlo tree search and extensions. Communications of the ACM, 55(3), 106-113.

scholar.google.com/citations?user=bPbaq5UAAAAJ