72
Speaker : Alexey Zinoviev Java as a fundamental working tool of the Data Scientist

Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Speaker : Alexey Zinoviev

Java as a fundamental working tool of the Data Scientist

Page 2: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

About

● I am a <graph theory, machine learning, traffic jams prediction,

BigData algorythms> scientist

● But I'm a <Java, JavaScript, Android, NoSQL, Hadoop, Spark>

programmer

Page 3: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering
Page 4: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

One of these fine days...

Page 5: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

We need in Python dev 'cause Data Mining

Page 6: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

You're a programmer, not an analyst

Page 7: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Write your backends!

Page 8: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Let’s talk about it, Java-boy...

Page 9: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Data mining

Mining coal in your data

Page 10: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Hey, man, predict me something!

Page 11: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Man or sofa?

Page 12: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

● Which loan applicants are high-risk?

● How do we detect phone card fraud?

● Which customers do prefer product A over product B?

● What is the revenue prediction for next year?

Typical questions for DM

Page 13: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

What is Data Mining?

13/72

Page 14: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Statistics?

Page 15: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Tag cloud?

Page 16: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Data visualization?

Page 17: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Not OLAP, 100%

Page 18: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

1. Selection2. Pre-processing3. Transformation4. Data Mining5. Interpretation/Evaluation

Magic part of KDD (Knowledge Discovery in Databases)

Page 19: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

1. Share your date with us

2. Our magic manipulations

3. Building an answering machine

4. PROFIT!!!

How it really works

Page 20: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Data

20/72

Page 21: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

● Facebook users, tweets

● Weather

● Sea routes

● Trade transactions

● Goverment

● Medicine (genomic data)

● Telecommuncations (phone call records)

Data examples

Page 22: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

● Relational Databases

● Data warehouses (Historical data)

● Files in CSV or in binary format

● Internet or electronic mails

● Scientific, research (R, Octave,

Matlab)

Data sources

Page 23: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Target Data & Personal Data

23/72

Page 24: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering
Page 25: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

● All your personal data (PD) are being

deeply mined

● The industry of collecting, aggregating,

and brokering PD is “database marketing.”

● 1.1 billion browser cookies, 200 million

mobile profiles, and an average of 1,500

pieces of data per consumer in Acxiom

Pay with your personal data

Page 26: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Preprocessing

26/72

Page 27: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

● Select small pieces● Define default values for missed data● Remove strange signals from data● Merge some tables in one if required

Page 28: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Pattern mining

28/72

Page 29: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Association rule learning

Page 30: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

It is the process of finding model of function that describes and

distinguishes data class to predict the class of objects whose class

label is unknown.

What is Cluster Analysis?

Page 31: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

● Statistical process for estimating the

relationships among variables

● The estimation target is function (it can

be probability distribution)

● Can be linear, polynomial, nonlinear and

etc.

Regression

Page 32: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

● Training set of classified examples (supervised learning)

● Test set of non-classified items● Main goal: find a function (classifier) that

maps input data to a category● Computer vision, drug discovery, speech

recognition, biometric indentification, credit scoring

Classification

Page 33: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Decision trees

Page 34: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Decision trees

Page 35: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

● There are two classes of objects A & B

● Define the class of new object, based on

information about its neighbors

● Changing the boundaries of an new object

area, we form a set of neighbors.

● New object is B becuase majority of the

neighbors is a B.

kNN (k-nearest neighbor)

Page 36: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Skills & Tools

36/72

Page 37: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering
Page 38: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering
Page 39: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Fashion Languages

39/72

Page 40: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering
Page 41: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

● It’s free

● Not full implemented stack of ML

algorythms

● All your matrix are belong to us!

● Single thread model

● Java support

Why not Octave?

Page 42: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering
Page 43: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

● 25% of R packages are written in Java

● Syntax is too sweet

● You should read 1000 lines in docs to

write 1 line of code

● Single thread model for 95%

algorythms

Why not R?

Page 44: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

● Now Python is an idol for young scientists

due to the low barrier to entry

● We are not Python developers

● High-level language

● Have you ever heard about a Jython?

● Long long way to real Highload

production

Why not Python?

Page 45: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

DM libraries in Python

Page 46: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Java ecosystem

46/72

Page 47: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

● Java API for Data mining, JSR 73 and JSR 247

● javax.datamining.supervised defines the supervised

function-related interfaces

● javax.datamining.algorithm contains all mining algorithm

subclass packages

● JDM 2.0 adds Text Mining, Time series and so on..

JDM

Page 48: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

● Connectors to R, Octave, Matlab, Hadoop,

NoSQL/SQL databases

● Source code of all algorythms in Java

● Preprocessing tools: discretization,

normalization, resampling, attribute

selection, transforming and combining

Weka

Page 49: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering
Page 50: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering
Page 51: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Weka + Hadoop

Page 52: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

● It’s codebase of algorythms in pattern

mining field

● It has cool examples and implementation

of 78 algorythms

● Cool performance results in specific area

● Codebase grows very fast

SPMF

Page 53: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

● Driven by Ng et al.’s paper “MapReduce for Machine Learning

on Multicore”

● Next algorythms were adopted: Locally Weighted Linear

Regression(LWLR), Naive Bayes (NB), k-means, Logistic

Regression, Neural Network (NN), Principal Components Analysis

(PCA), Support Vector Machine (SVM) and so on..

● The complexity was reduced in n times for n processors.

Mahout

Page 54: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

● DataModel (File, MySQL, PostgreSQL,

Mongo, Cassandra)

● UserSimilarity

● ItemSimilarity

● UserNeighborhood

● Recommender

Mahout

Page 55: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

● Advanced Implementations of Java’s Collections Framework for

better Performance.

● Very close to Apache Giraph

● New algorythms will build on Spark platform

● Spark shell

● Spring + Mahout demo

● Collaborative Filtering, Classification, Clustering, Dimensionality

Reduction, Miscellaneous are supported

Mahout

Page 56: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering
Page 57: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

● MapReduce in memory

● Up to 50x faster than Hadoop

● Support for Shark (like Hive), MLlib

(Machine learning), GraphX (graph

processing)

● RDD is a basic building block (immutable

distributed collections of objects)

Spark

Page 58: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Spark

Page 59: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Java 7 search example:JavaRDD<String> lines = sc.textFile("hdfs://log.txt").filter( new Function<String, Boolean>() { public Boolean call(String s) { return s.contains("Tomcat"); }});long numErrors = lines.count();

Java 8 search example:

JavaRDD<String> lines = sc.textFile("hdfs://log.txt") .filter(s -> s.contains("Tomcat"));long numErrors = lines.count();

Spark + Java 8

Page 60: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Mahout’s killer

Page 61: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

● Classification and regression. collaborative filtering and

clustering, Dimensionality reduction and Optimization are

supported

● It extends scikit-learn (Python lib) and Mahout and run on

Spark

● Well documented and integrated with many Java solutions

MLlib

Page 62: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Size Classification Tools

LinesSample Data

Analysis and Visualization Whiteboard, bash

KBs - low MBsPrototype Data

Analysis and Visualization Matlab, Octave, R

MBs - low GBsOnline Data

Storage MySQL (DBs)

MBs - low GBsOnline Data

Analysis NumPy, SciPy, Weka, BLAS/LAPACK

GBs - TBs - PBs BigData

Storage HDFS, HBase, Cassandra

GBs - TBs - PBsBig Data

Analysis Hive, Mahout, Hama, Giraph,MLlib

Page 63: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Large graph processing tools

63/72

Page 64: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering
Page 65: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering
Page 66: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Graph Number of vertexes

Number of edges

Volume Data/per day

Web-graph 1,5 * 10^12 1,2 * 10^13 100 PB 300 TB

Facebook (friends graph)

1,1 * 10^9 160 * 10^9 1 PB 15 TB

Road graph of EU

18 * 10^6 42 * 10^6 20 GB 50 MB

Road graph of this city

250 000 460 000 500 MB 100 KB

Page 67: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

● High complexity of graph problem reduction to key-value model

● Iteration algorythms, but multiple chained jobs in M/R with full saving and reading of each state

Think like a vertex…

MapReduce for iterative calculations

Page 68: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering
Page 69: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

С++ API

Page 70: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering
Page 71: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

● “Mahout in Action”, Owen et. al., Manning Pub.

● “Pattern Recognition and Machine Learning”, Christopher Bishop,

Springer Pub.

● “Elements of Statistical Learning: Data Mining, Inference, and

Prediction”, Hastie et. al., Springer Pub.

● “Collective Intelligence in Action” Satnam Alag et. al., Manning

Pub.

Books and papers

Page 72: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering

Your questions?