Apache Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig

1). Makoto YUI (@myui), Research Engineer, Treasure Data
2). Takeshi Yamamuro (@maropu), Research Engineer, NTT

2017/5/16 Apache BigData North America '17, Miami
@ApacheHivemall


Plan of the talk

1. Introduction to Hivemall
2. Hivemall on Spark

Hivemall entered the Apache Incubator on Sept 13, 2016 🎉

Since then, we have invited 2 contributors as new committers (one committer has been voted in as a PPMC member). Currently, we are working toward the first Apache release in Q2 of this year.

hivemall.incubator.apache.org

What is Apache Hivemall

Scalable machine learning library built as a collection of Hive UDFs

• Multi/Cross platform
• Versatile
• Scalable
• Ease-of-use

Hivemall is easy and scalable…

• Ease-of-use: ML made easy for SQL developers
• Scalable: Born to be parallel and scalable

    CREATE TABLE lr_model AS
    SELECT
      feature,               -- reducers perform model averaging in parallel
      avg(weight) as weight
    FROM (
      SELECT logress(features, label, ..) as (feature, weight)
      FROM train
    ) t                      -- map-only task
    GROUP BY feature;        -- shuffled to reducers

This query automatically runs in parallel on Hadoop
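The query above trains one model per map task and lets the reducers average the weights per feature. The following is a minimal Python sketch of that idea only, assuming plain SGD logistic regression on each split. `train_split` and `average_models` are hypothetical names, not Hivemall functions, and the sketch makes no claim about what `logress` does internally.

```python
import math
from collections import defaultdict

def train_split(rows, eta=0.1, epochs=20):
    """Map side: fit one logistic-regression model on a single split with SGD.

    rows is a list of (features, label), with features as {name: value}
    and label in {0, 1}. Returns (feature, weight) pairs, one per feature.
    """
    w = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            z = sum(w[f] * v for f, v in features.items())
            p = 1.0 / (1.0 + math.exp(-z))
            for f, v in features.items():
                w[f] += eta * (label - p) * v
    return list(w.items())

def average_models(emitted):
    """Reduce side: the equivalent of GROUP BY feature with avg(weight)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for feature, weight in emitted:
        sums[feature] += weight
        counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}

# Two splits stand in for two map tasks (toy data, assumed for illustration).
splits = [
    [({"x": 1.0, "bias": 1.0}, 1), ({"x": -1.0, "bias": 1.0}, 0)],
    [({"x": 2.0, "bias": 1.0}, 1), ({"x": -2.0, "bias": 1.0}, 0)],
]
emitted = [pair for split in splits for pair in train_split(split)]
model = average_models(emitted)
```

Because each split is trained independently, the map side needs no coordination; only the small (feature, weight) pairs are shuffled to the reducers.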

Hivemall is Multi/Cross platform..

Hivemall is a multi/cross-platform ML library

• HiveQL
• SparkSQL/DataFrame API
• Pig Latin

Prediction models built by Hive can be used from Spark, and conversely, prediction models built by Spark can be used from Hive

Hivemall's Technology Stack

• Machine Learning: Hivemall, MLlib
• Query Processing: Hive, Pig, SparkSQL
• Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
• Resource Management: Apache YARN, Mesos
• Distributed File System / Cloud Storage: Hadoop HDFS, Amazon S3

Hivemall on Apache Hive

Hivemall on Apache Spark DataFrame

Hivemall on SparkSQL

Hivemall on Apache Pig

Online Prediction by Apache Streaming

Hivemall is a Versatile library..

✓ Not only for Machine Learning
✓ Provides a bunch of generic utility functions

Each organization has its own sets of UDFs for data preprocessing

Don't Repeat Yourself!

Hivemall generic functions

• Array and Map
• Bit and compress
• String and NLP
  > TF/IDF
• Geo Spatial
  > TILE
  > MAP_URL
• Top-k processing

We welcome contributing your generic UDFs to Hivemall!

Map tiling functions

Tile(lat, lon, zoom) = xtile(lon, zoom) + ytile(lat, zoom) * 2^n, where n is the zoom level

[Figure: map tiles at Zoom=10 and Zoom=15]
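The tile formula above can be sketched in Python as follows. The slide does not define `xtile` and `ytile`, so this sketch assumes the common Web Mercator ("slippy map") tile convention and takes n to be the zoom level; treat it as an illustration rather than Hivemall's implementation.

```python
import math

def xtile(lon, zoom):
    """Column index of the tile containing longitude lon (degrees)."""
    n = 1 << zoom
    x = int(math.floor((lon + 180.0) / 360.0 * n))
    return min(max(x, 0), n - 1)  # clamp so lon = 180 stays in range

def ytile(lat, zoom):
    """Row index of the tile containing latitude lat (degrees), Web Mercator."""
    n = 1 << zoom
    lat_rad = math.radians(lat)
    y = int(math.floor((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n))
    return min(max(y, 0), n - 1)  # clamp latitudes near the poles

def tile(lat, lon, zoom):
    """Single tile number: xtile + ytile * 2^zoom."""
    return xtile(lon, zoom) + ytile(lat, zoom) * (1 << zoom)
```

A single integer per tile makes it easy to group or join rows by map cell; a higher zoom gives smaller cells and therefore more distinct tile numbers.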

Top-k query processing

List top-2 students for each class

Input:

student  class  score
1        b      70
2        a      80
3        a      90
4        b      50
5        a      70
6        b      60

Output:

student  class  score
3        a      90
2        a      80
1        b      70
6        b      60

Using RANK OVER():

    SELECT * FROM (
      SELECT *, rank() over (partition by class order by score desc) as rank
      FROM table
    ) t
    WHERE rank <= 2

The RANK OVER() query does not finish in 24 hours :( for 20 million MOOC classes with an average of 1,000 students in each class.

Using EACH_TOP_K:

    SELECT each_top_k(
      2, class, score,
      class, student
    ) as (rank, score, class, student)
    FROM (
      SELECT * FROM table
      DISTRIBUTE BY class SORT BY class
    ) t

EACH_TOP_K finishes in 2 hours :)

Comparison between RANK and EACH_TOP_K

Top-k query processing by RANK OVER() (on each node):

1. partition by class
2. Sort by class, score (SORTING IS HEAVY)
3. rank over() (NEED TO PROCESS ALL)
4. filter rank <= 2

Top-k query processing by EACH_TOP_K (on each node):

1. distributed by class
2. Sort by class
3. each_top_k (OUTPUT only K items)

Each_top_k is very efficient where the number of classes is large

A bounded priority queue is utilized
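The bounded priority queue mentioned above is what lets each_top_k avoid sorting every row by score. A minimal Python sketch of the technique follows, using the student table from the earlier slide; `each_top_k_sketch` is a hypothetical helper and not the Hivemall UDTF itself.

```python
import heapq
from collections import defaultdict

def each_top_k_sketch(rows, k):
    """rows: iterable of (group, score, payload).

    Returns {group: [(rank, score, payload), ...]} with at most k rows per group.
    """
    heaps = defaultdict(list)  # one min-heap of at most k entries per group
    for seq, (group, score, payload) in enumerate(rows):
        entry = (score, seq, payload)  # seq is unique, so payloads are never compared
        heap = heaps[group]
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)  # evict the current smallest entry
    result = {}
    for group, heap in heaps.items():
        ordered = sorted(heap, reverse=True)  # sorts only k items, not the whole group
        result[group] = [(rank, s, p) for rank, (s, _, p) in enumerate(ordered, start=1)]
    return result

# (class, score, student) rows from the example table
students = [("b", 70, 1), ("a", 80, 2), ("a", 90, 3),
            ("b", 50, 4), ("a", 70, 5), ("b", 60, 6)]
top2 = each_top_k_sketch(students, 2)
```

Each incoming row costs O(log k) instead of taking part in a full sort, and memory per group is bounded by k.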

List of Supported Algorithms

Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification

Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression

• SCW is a good first choice. Try RandomForest if SCW does not work.
• Logistic regression is good for getting a probability of a positive class.
• Factorization Machines is good where features are sparse and categorical.

RandomForest in Hivemall

Ensemble of Decision Trees

Training of RandomForest

Prediction of RandomForest

Supported Algorithms for Recommendation

K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)

Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)

The each_top_k function of Hivemall is useful for recommending top-k items

Other Supported Algorithms

Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ Feature Binning
✓ TF-IDF vectorizer
✓ Polynomial Expansion
✓ Amplifier

NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer

Evaluation metrics
✓ AUC, nDCG, logloss, precision/recall@K, etc.

Feature Engineering - Feature Binning

Maps quantitative variables to a fixed number of bins based on quantiles/distribution

Example: Map Ages into 3 bins
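Quantile-based binning as described above can be sketched in a few lines of Python. The ages below are made-up sample values, and `quantile_edges` and `bin_index` are hypothetical helpers rather than Hivemall's binning functions.

```python
import bisect

def quantile_edges(values, num_bins):
    """Cut points chosen so each bin receives roughly the same number of values."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // num_bins] for i in range(1, num_bins)]

def bin_index(value, edges):
    """Index of the bin that value falls into, from 0 to len(edges)."""
    return bisect.bisect_right(edges, value)

ages = [18, 22, 25, 31, 38, 44, 52, 60, 71]  # assumed sample data
edges = quantile_edges(ages, 3)
bins = [bin_index(a, edges) for a in ages]
```

The edges are computed once from the training data and then reused to bin unseen values, so training and prediction see the same bin boundaries.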

Feature Selection - Signal Noise Ratio

Evaluation Metrics

Other Supported Features

Anomaly Detection
✓ Local Outlier Factor (LoF)
✓ ChangeFinder

Clustering / Topic models
✓ Online mini-batch LDA
✓ Online mini-batch PLSA

Change Point Detection
✓ ChangeFinder
✓ Singular Spectrum Transformation

Anomaly/Change-point Detection by ChangeFinder

Efficient algorithm for finding change points and outliers from time series data

J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series," IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.

Change-point detection by Singular Spectrum Transformation

• T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005.
• T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.

Online mini-batch LDA

Other new features in development

✓ XGBoost Integration
✓ Sparse vector support in RandomForests
✓ Docker support (for evaluation)
✓ Field-aware Factorization Machines
✓ Generalized Linear Model
  • Optimizer framework including ADAM
  • L1/L2 regularization

Copyright©2016 NTT corp. All Rights Reserved.

Hivemall on Spark

Who am I

• Takeshi Yamamuro
• NTT corp. in Japan
• OSS activities
  • Apache Hivemall PPMC
  • Apache Spark contributor

What's Spark?

• Distributed data analytics engine, generalizing MapReduce
• 1. Unified Engine
  • supports end-to-end applications, e.g., MLlib and Streaming
• 2. High-level APIs
  • easy-to-use, rich optimization
• 3. Integrate broadly
  • storages, libraries, ...

What's Hivemall on Spark?

• Hivemall wrapper for Spark
  • Wrapper implementations for DataFrame/SQL
  • + some utilities for ease of use in Spark
• The wrapper enables you to...
  • run most Hivemall functions in Spark
  • try examples easily on your laptop
  • improve the performance of some functions in Spark

Why Hivemall on Spark?

• Hivemall already has many fascinating ML algorithms and useful utilities
• + High barriers to adding newer algorithms to MLlib
  https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Current Status

• Most Hivemall functions are supported in Spark v2.0 and v2.1

- To compile for Spark v2.1

    $ git clone https://github.com/apache/incubator-hivemall
    $ cd incubator-hivemall
    $ mvn package -Pspark-2.1 -DskipTests
    $ ls target/*spark*
    target/hivemall-spark-2.1_2.11-XXX-with-dependencies.jar
    ...

- To compile for Spark v2.0

    $ git clone https://github.com/apache/incubator-hivemall
    $ cd incubator-hivemall
    $ mvn package -Pspark-2.0 -DskipTests
    $ ls target/*spark*
    target/hivemall-spark-2.0_2.11-XXX-with-dependencies.jar
    ...

4 Step Example

• 1. Fetch training and test data
• 2. Load these data in Spark
• 3. Build a model
• 4. Do predictions

1. Fetch training and test data

• E2006 tfidf regression dataset
  • http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf

    $ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
    $ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2

2. Load data in Spark

    // Download Spark-v2.1 and launch a spark-shell with Hivemall
    $ <HIVEMALL_HOME>/bin/spark-shell

    // Create DataFrame from the bzip'd libsvm-formatted file
    scala> val trainDf = spark.read.format("libsvm").load("E2006.train.bz2")

    scala> trainDf.printSchema
    root
     |-- label: double (nullable = true)
     |-- features: vector (nullable = true)

2. Load data in Spark (Detailed)

    0.000357499151147113 6066:0.00079327062196048 6069:0.000311377727123504 6070:0.000306754934580457 6071:0.000276992485786437 6072:0.00039663531098024 6074:0.00039663531098024 6075:0.00032548335…

trainDf: Partition1, Partition2, Partition3, ..., PartitionN

The file is loaded in parallel because bzip2 is splittable
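In the libsvm text format, each line holds a label followed by sparse `index:value` pairs, which is how the loader arrives at the `label` and `features` columns shown earlier. A small Python sketch of parsing one such line follows; the sample line is made up for illustration and `parse_libsvm_line` is a hypothetical helper, not part of Spark or Hivemall.

```python
def parse_libsvm_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features

# Assumed sample line, not taken from the E2006 dataset.
label, features = parse_libsvm_line("-3.5 6066:0.5 6069:0.25")
```

Only non-zero entries are stored, which suits high-dimensional tf-idf features where most values are zero.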

3. Build a model - DataFrame

    scala> :paste
    val modelDf = trainDf.train_logregr($"features", $"label")
      .groupBy("feature")
      .agg("weight" -> "avg")

3. Build a model - SQL

    scala> trainDf.createOrReplaceTempView("TrainTable")
    scala> :paste
    val modelDf = sql("""
      | SELECT feature, AVG(weight) AS weight
      | FROM (
      |   SELECT train_logregr(features, label)
      |     AS (feature, weight)
      |   FROM TrainTable
      | )
      | GROUP BY feature
      """.stripMargin)

4. Do predictions - DataFrame

    scala> :paste
    // testDf is assumed to be loaded from E2006.test.bz2 in the same way as trainDf
    val df = testDf.select(rowid(), $"features")
      .explode_vector($"features")
      .cache

    // Do predictions
    df.join(modelDf, df("feature") === modelDf("feature"), "LEFT_OUTER")
      .groupBy("rowid")
      .agg(sigmoid(sum($"weight" * $"value")))
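The prediction step above joins each exploded test feature with its model weight, sums weight * value per row, and applies a sigmoid. A plain Python sketch of the same computation follows, with made-up rows and weights. As a simplification, a feature missing from the model contributes zero here, whereas a SQL outer join would yield NULL for it.

```python
import math
from collections import defaultdict

def predict(exploded_test, model):
    """exploded_test: (rowid, feature, value) rows; model: {feature: weight}."""
    totals = defaultdict(float)
    for rowid, feature, value in exploded_test:
        # Outer-join behaviour: rows survive even when the feature has no weight.
        totals[rowid] += model.get(feature, 0.0) * value
    return {rowid: 1.0 / (1.0 + math.exp(-z)) for rowid, z in totals.items()}

# Assumed toy model and test rows for illustration.
model = {"f1": 2.0, "f2": -1.0}
rows = [("r1", "f1", 1.0), ("r1", "f2", 2.0), ("r2", "f3", 5.0)]
predicted = predict(rows, model)
```

Because the model is just a (feature, weight) table, prediction reduces to a join and an aggregation, which is why the same model can be applied from either Hive or Spark.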

4. Do predictions - SQL

    scala> modelDf.createOrReplaceTempView("ModelTable")
    scala> df.createOrReplaceTempView("TestTable")
    scala> :paste
    sql("""
      | SELECT rowid, sigmoid(sum(value * weight)) AS predicted
      | FROM TestTable t
      | LEFT OUTER JOIN ModelTable m
      | ON t.feature = m.feature
      | GROUP BY rowid
      """.stripMargin)

Add New Features in Spark

• Top-K Join
  • Join two relations and compute top-K for each group

Use Spark vanilla APIs:

    scala> :paste
    val topkDf = leftDf.join(rightDf, "group" :: Nil, "INNER")
      .select(leftDf("group"), (leftDf("x") + rightDf("y")).as("score"))
      .withColumn("rank",
        rank().over(Window.partitionBy($"group").orderBy($"score".desc)))
      .where($"rank" <= topK)

1. Join leftDf with rightDf
2. Sort data in each group
3. Filter top-K data

• Top-K Join: use a fused API implemented in Hivemall

    scala> :paste
    val topkDf = leftDf.top_k_join(
      lit(topK), rightDf, leftDf("group") === rightDf("group"),
      (leftDf("x") + rightDf("y")).as("score")
    )

• How does top_k_join work?

Inputs: leftDf (group, x) and rightDf (group, y)

1. Compute top-K rows by using a K-length priority queue
2. Only join top-K rows
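The two steps above fuse the join with the top-K selection, so the full join result is never sorted. The Python sketch below illustrates that idea on made-up rows, with the score defined as x + y as in the earlier example. `top_k_join_sketch` is a hypothetical helper and does not mirror the operator's actual internals.

```python
import heapq
from collections import defaultdict

def top_k_join_sketch(left, right, k):
    """left: (group, x) rows; right: (group, y) rows.

    Returns {group: [(score, x, y), ...]} holding the k best pairs per group.
    """
    left_by_group, right_by_group = defaultdict(list), defaultdict(list)
    for group, x in left:
        left_by_group[group].append(x)
    for group, y in right:
        right_by_group[group].append(y)
    result = {}
    for group, xs in left_by_group.items():
        ys = right_by_group.get(group)
        if not ys:
            continue  # inner join: groups without a match produce no rows
        candidates = ((x + y, x, y) for x in xs for y in ys)
        # nlargest streams over the pairs and retains only the k best,
        # so the joined pairs are never fully materialized or sorted.
        result[group] = heapq.nlargest(k, candidates)
    return result

left = [("g1", 1), ("g1", 5), ("g2", 3)]
right = [("g1", 10), ("g1", 2), ("g3", 7)]
top1 = top_k_join_sketch(left, right, 1)
```

Compared with the join, sort, and filter pipeline of the vanilla version, only K rows per group leave the operator.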

• Codegen'd top_k_join for fast processing
  • Spark internally generates Java code from a built physical plan, and compiles/executes it

[Figure: Spark Planner (Catalyst) Overview; Codegen is applied to a physical plan]

    scala> topkDf.explain
    == Physical Plan ==
    *ShuffledHashJoinTopK 1, [group#10], [group#27]
    :- Exchange hashpartitioning(group#10, 200)
    :  +- LocalTableScan [group#10, x#11]
    +- Exchange hashpartitioning(group#27, 200)
       +- LocalTableScan [group#27, y#28]

'*' in the head means a codegen'd plan

• Benchmark Result

~13 times faster than vanilla APIs!!

• Support more useful functions
  • Spark only implements naive and basic functions in terms of usability and maintainability
  • e.g.,)
    • flatten - flatten a nested schema into a flat one
    • from_csv/to_csv - interconversion of CSV strings and structured data with schemas
    • ...
• See more in the Hivemall user guide
  • https://hivemall.incubator.apache.org/userguide/spark/misc/misc.html

Improve Some Functions in Spark

• Spark has overheads to call Hive UD*Fs
  • Hivemall heavily depends on them
• ex.1) Compute a sigmoid function

Using a Spark UDF:

    scala> val sigmoidFunc = (d: Double) => 1.0 / (1.0 + Math.exp(-d))
    scala> val sparkUdf = functions.udf(sigmoidFunc)
    scala> df.select(sparkUdf($"value"))

Using the Hive UDF provided by Hivemall:

    scala> val hiveUdf = HivemallOps.sigmoid
    scala> df.select(hiveUdf($"value"))

• ex.2) Compute top-k for each key group

Using rank():

    scala> :paste
    df.withColumn("rank",
        rank().over(Window.partitionBy($"key").orderBy($"score".desc)))
      .where($"rank" <= topK)

• Fixed the overhead issue for each_top_k
  • See pr#353: "Implement EachTopK as a generator expression" in Spark

    scala> df.each_top_k(topK, "key", "score", "value")

~4 times faster than rank!!

Support 3rd Party Library in Spark

• XGBoost support is under development
  • fast implementation of gradient tree boosting
  • widely used in Kaggle competitions
• This integration will enable you to...
  • load built models and predict in parallel
  • build multiple models in parallel for cross-validation

Conclusion and Takeaway

Hivemall is a multi/cross-platform ML library providing a collection of machine learning algorithms as Hive UDFs/UDTFs

• HiveQL
• SparkSQL/DataFrame API
• Pig Latin

We welcome your contributions to Apache Hivemall :)

Any feature requests or questions?