Upload
marton-balassi
View
499
Download
0
Embed Size (px)
Citation preview
Implementing BigPetStore A blueprint for Flink users
Márton [email protected] / @MartonBalassi
Hungarian Academy of Sciences
Outline
• BigPetStore model• Data generator with the DataSet API• ETL with the DataSet & Table API• Matrix factorization with FlinkML• Recommendation with the DataStream API• Summary
BigPetStore
Blueprints for Big Data applicationsConsists of:• Data Generators
• Examples using tools in Big Data ecosystem to process data
• Build system and tests for integrating tools and multiple JVM languages
Part of the Apache BigTop project
BigPetStore model
• Customers visiting pet stores generating transactions, location based
Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in Big Data and Cloud Computing (BDCloud), 2014 IEEE Fourth International Conference on , vol., no., pp.49-55, 3-5 Dec. 2014
Data generation
val env = ExecutionEnvironment.getExecutionEnvironmentval (stores, products, customers) = getData()val startTime = getCurrentMillis()
val transactions = env.fromCollection(customers).flatMap(new TransactionGenerator(products)).withBroadcastSet(stores, ”stores”).map{t => t.setDateTime(t.getDateTime + startTime); t}
transactions.writeAsText(output)
• Use RJ Nowling’s Java generator classes• Write transactions to JSON
ETL with the DataSet API
val env = ExecutionEnvironment.getExecutionEnvironmentval transactions = env.readTextFile(json).map(new FlinkTransaction(_))
val productsWithIndex = transactions.flatMap(_.getProducts).distinct.zipWithUniqueId
val customerAndProductPairs = transactions.flatMap(t => t.getProducts.map(p => (t.getCustomer.getId,
p))).join(productsWithIndex).where(_._2).equalTo(_._2).map(pair => (pair._1._1, pair._2._1)).distinct
customerAndProductPairs.writeAsCsv(output)
• Read the dirty JSON• Output (customer, product) pairs for the
recommender
ETL with the Table API
val env = ExecutionEnvironment.getExecutionEnvironmentval transactions = env.readTextFile(json).map(new FlinkTransaction(_))
val table = transactions.map(toCaseClass(_)).toTable
val storeTransactionCount = table.groupBy('storeId).select('storeId, 'storeName, 'storeId.count as 'count)
val bestStores = table.groupBy('storeId).select('storeId.max as 'max).join(storeTransactionCount).where(”count = max”).select('storeId, 'storeName, 'storeId.count as 'count).toDataSet[StoreCount]
• Read the dirty JSON• SQL style queries
A little Recommeder theory
Item factors
User side information User-Item matrixUser factors
Item side informatio
n
U
I
PQ
R
• R is potentially huge, approximate it with PQ• Prediction is TopK(user’s row Q)
Matrix factorization with FlinkML
val env = ExecutionEnvironment.getExecutionEnvironmentval input = env.readCsvFile[(Int,Int)](inputFile)
.map(pair => (pair._1, pair._2, 1.0))
val model = ALS().setNumfactors(numFactors).setIterations(iterations).setLambda(lambda)
model.fit(input)
val (p, q) = model.factorsOption.getp.writeAsText(pOut)q.writeAsText(qOut)
• Read the (customer, product) pairs• Write P and Q to file
Recommendation with the DataStream API
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.socketTextStream(”localhost”, 9999).map(new GetUserVector()).broadcast().map(new PartialTopK()).keyBy(0).flatMap(new GlobalTopK()).print();
• Get the user’s row for a userID• Compute the distributed TopK of the
user’s row Q
Summary
• Go beyond WordCount with BigPetStore• Feel free to mix the DataSet, DataStream,
FlinkML, Table APIs in your Flink workflows• Data generation, cleaning, ETL, Machine
learning, streaming prediction on top of one engine with under 500 lines of code
• Java and Scala APIs work well together• A Flink pet project is always fun. No pun intended.
Big thanks to
• The BigPetStore folks:Suneel MarthiRonald J. NowlingJay Vyas
• Squirrels helping with the code:Gyula FóraGábor GevayGábor HermannFabian HueskeAljoscha Krettek
• And to the whole Flink community
Check out the code
https://github.com/mbalassi/bigpetstore-flink
Márton [email protected] / @MartonBalassi
Hungarian Academy of Sciences