BigPetStore-Flink: A Comprehensive Blueprint for Apache Flink
Suneel Marthi
Flink Forward 2015, Berlin
About Me
• Senior Principal Engineer, Office of Technology, Red Hat
• Committer and PMC member on Apache Mahout
• Contributor to DeepLearning4J and Oryx 2.0
• Co-Organizer of Washington DC Apache Flink Meetup
• Founder of Boston Apache Flink Meetup
Outline of Talk
• What is BigPetStore?
• Why BigPetStore?
• Synthetic Data
• BigPetStore - MapReduce, Spark
• BigPetStore - Flink
• Future possibilities
What is BigPetStore?
• Blueprints for Big Data applications
• Consists of:
– Data generators
– Examples using tools in the Big Data ecosystem to process data
– Build system and tests for integrating tools and multiple JVM languages
• Part of Apache Bigtop
• Used for:
– Templates for infrastructure (build, integration, testing)
– Educational examples
– Testing
– Demos
– Benchmarking
Why BigPetStore? (1)
As a developer, I want an application blueprint that…
• scales to a size approximating my data domain
• includes idiomatic unit and integration testing
• demonstrates ETL as well as analytics
In other words…
Word count was great for MapReduce, but we need something more to demonstrate the advanced capabilities of newer processing engines.
Why BigPetStore? (2)
PetStores have been around for a while to showcase different technologies, starting with Sun’s Web Petstore in the early days of J2EE.
Everyone knows what a PetStore is, hence it’s intuitive to non-developers.
Vision
• Bigtop Data Generators - a resource for all Apache projects!
• To build more sophisticated blueprints for users and developers
• Useful for smoke testing infrastructure and applications!
Case for Synthetic Data
• Most company data is private and confidential
• Licensing concerns with sharing the data
• Secure data cannot be moved out of production
• Enable more realistic example applications
• Enable more comprehensive testing than regular wordcount or TeraSort
Bigtop Data Generators
• BigPetStore Data Generator
• Bigtop Weatherman
• Bigtop Bazaar
• Locations Library
• Sampler Library
• Name Generator
• Product Generator
BigPetStore-MapReduce (BIGTOP-1270)
• Originally, a MapReduce application for demonstrating MapReduce, Pig, and Mahout.
• Primitive “hierarchical” data generator for generating fake petstore transactions (at any scale).
• Part of ASF Bigtop; used at Red Hat and other companies for testing the Hadoop ecosystem.
New Data Generator for BigPetStore
• Motivation: realistic ML/analytics examples
• Goal: More complex patterns embedded in data
• Mathematical modeling and simulation
– Sampling from PDFs
– (Hidden) Markov Models
– Poisson processes
– Stochastic differential equations
Next Step: A Platform-Independent Data Generator.
Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in 2014 IEEE Fourth International Conference on Big Data and Cloud Computing (BDCloud), pp. 49-55, 3-5 Dec. 2014
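To make one of the listed modeling techniques concrete, here is a minimal Python sketch of sampling event times from a homogeneous Poisson process, for example store visits by a single customer. It is an illustration only, not the BigPetStore generator's code: the function name `sample_poisson_arrivals`, the rate, and the time horizon are all assumptions chosen for the example.

```python
import random


def sample_poisson_arrivals(rate_per_day, horizon_days, rng):
    """Sample event times of a homogeneous Poisson process on [0, horizon_days).

    Inter-arrival times of a Poisson process with rate r are independent
    exponential variables with mean 1/r, so we accumulate exponential gaps
    until the horizon is reached.
    """
    times = []
    t = rng.expovariate(rate_per_day)
    while t < horizon_days:
        times.append(t)
        t += rng.expovariate(rate_per_day)
    return times


if __name__ == "__main__":
    # Hypothetical parameters: one visit every 10 days on average, over a year.
    rng = random.Random(42)
    visits = sample_poisson_arrivals(rate_per_day=0.1, horizon_days=365.0, rng=rng)
    print(len(visits), "simulated visits")
```

Passing the random generator in explicitly keeps runs reproducible, which matters when synthetic data is used for testing.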
BigPetStore Data Model
• Generative model leveraging well-known mathematical modeling techniques to simulate factors influencing customers’ purchasing habits.
• In several cases, real data is used to parameterize the model.
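As a rough illustration of how a generative model can simulate purchasing habits, the sketch below uses a first-order Markov chain over product categories. It is not the model from the BigPetStore paper: the categories, the transition probabilities, and the names `TRANSITIONS`, `next_purchase`, and `simulate_purchases` are invented for this example, and the probabilities are not derived from real data.

```python
import random

# Hypothetical transition probabilities: given the category of the last
# purchase, how likely is each category to be bought next. Rows sum to 1.
TRANSITIONS = {
    "dog food": {"dog food": 0.6, "cat litter": 0.1, "toys": 0.3},
    "cat litter": {"dog food": 0.1, "cat litter": 0.7, "toys": 0.2},
    "toys": {"dog food": 0.4, "cat litter": 0.3, "toys": 0.3},
}


def next_purchase(current, rng):
    """Draw the next category conditioned only on the current one."""
    categories = list(TRANSITIONS[current])
    weights = [TRANSITIONS[current][c] for c in categories]
    return rng.choices(categories, weights=weights, k=1)[0]


def simulate_purchases(start, n, rng):
    """Simulate a sequence of n purchases beginning with `start`."""
    sequence = [start]
    for _ in range(n - 1):
        sequence.append(next_purchase(sequence[-1], rng))
    return sequence


if __name__ == "__main__":
    rng = random.Random(7)
    print(simulate_purchases("dog food", 10, rng))
```

In a model parameterized from real data, the transition table would be estimated from observed purchase histories rather than written by hand.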
BigPetStore-TransactionQueue
• No need for API calls, just use Docker
• Generate load for any app, not just JVM apps
• docker run -t -i smarthi/bigpetstore-transaction-queue
BigPetStore-Spark (BIGTOP-1535)
• RJ Nowling rewrote the BigPetStore data generator components to generate more complex data sets, with patterns varying in many dimensions.
• BigPetStore-Spark was then added to ASF Bigtop, demonstrating that the data generator could be used in a distributed context.
BigPetStore-Flink (BIGTOP-1927 & BIGTOP-1928)
• A Flink application blueprint.
• Generates data at any scale.
• Uses Flink streams to write generated data to disk.
• Uses Flink DataStream transformations to transform data sets for analytics.
Future Endeavors
• How to help users build their own models?
• How to use the Bigtop Data Generators for load testing?
• How to produce synthetic copies from real datasets?
• Better libraries and abstractions to reduce boilerplate
• Research: Investigating Probabilistic Programming Languages, which provide advanced sampling and inference algorithms combined with high-level DSLs for model specifications
Future: BigPetStore - Flink
A BigPetStore Blueprint for:
• Flink Batch
• Flink Table API
• Flink ML algorithms
Resources
Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in 2014 IEEE Fourth International Conference on Big Data and Cloud Computing (BDCloud), pp. 49-55, 3-5 Dec. 2014
https://github.com/apache/bigtop/tree/master/bigtop-data-generators
https://github.com/apache/bigtop/tree/master/bigtop-bigpetstore
Bigtop Data Generators available as a library: http://dl.bintray.com/rnowling/bigpetstore
TL;DR
• Bigtop Data Generators - a resource for all Apache Big Data projects
– Comprehensive blueprints
– Smoke and integration testing
– Load testing
• Flink BigPetStore soon to be part of Apache Bigtop (BIGTOP-1927 & BIGTOP-1928)
• Future Endeavors
– Expand BigPetStore Flink as new Flink features become available
– Make models easier to build
– Easier ways to generate synthetic data from models built on real data