February 16th 2016 [email protected]
Migrating structured data between Hadoop and RDBMS
Who am I?
• Full-stack engineer at Squid Solutions.
• Specialised in Big Data.
• Fun fact: I sleep alone in my tent on top of some of the highest mountains in the world.
What do I do?
• I develop an analytics toolbox.
• No setup. No SQL. No compromise.
• It generates SQL through a REST API.
It is open source! https://github.com/openbouquet
Today's topic
• Do you need scalability?
• Do you need a machine learning toolbox?
Hadoop is the solution.
• But you still need structured data? Our tool provides a solution.
=> We need both!
What does that mean?
• Create a dataset in Bouquet
• Send the dataset to Spark
• Enrich it inside Spark
• Re-inject it into the original database
How do we do it?
[Diagram: user input drives Bouquet, which reads from the relational DB, then creates a dataset and sends it to Spark]
How does it work?
[Architecture diagram: Bouquet, relational DB, Kafka, Spark, HDFS/Tachyon, Hive Metastore]
The user selects the data; Bouquet generates the corresponding SQL code.
How does it work?
[Same architecture diagram]
The data is read from the SQL database.
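A minimal sketch of this read in Scala over plain JDBC; the connection URL, credentials and query are placeholders standing in for whatever SQL Bouquet actually generated:

import java.sql.DriverManager

// Placeholder connection details; in practice they come from the
// Bouquet project configuration.
val conn = DriverManager.getConnection(
  "jdbc:postgresql://dbhost:5432/music", "user", "password")
try {
  // Placeholder for the SQL that Bouquet generated from the user's selection.
  val rs = conn.createStatement().executeQuery(
    "SELECT gender, COUNT(*) AS count FROM artists GROUP BY gender")
  while (rs.next()) {
    // Each row will later be serialised to Avro and produced to Kafka.
    println(s"${rs.getString("gender")} -> ${rs.getLong("count")}")
  }
} finally {
  conn.close()
}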
How does it work?
[Same architecture diagram]
Bouquet creates an Avro schema and sends the data to Kafka.
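How the schema might be derived is sketched below in Scala, mapping JDBC column metadata to Avro types with Avro's SchemaBuilder; the helper and its simplified type mapping are assumptions, not Bouquet's actual code:

import java.sql.{ResultSetMetaData, Types}
import org.apache.avro.SchemaBuilder

// Hypothetical helper: build an Avro record schema from JDBC result set metadata.
def avroSchemaFor(name: String, meta: ResultSetMetaData) = {
  var fields = SchemaBuilder.record(name).fields()
  for (i <- 1 to meta.getColumnCount) {
    // Simplified mapping: integer columns become long, everything else string.
    fields = meta.getColumnType(i) match {
      case Types.BIGINT | Types.INTEGER =>
        fields.name(meta.getColumnName(i)).`type`().longType().noDefault()
      case _ =>
        fields.name(meta.getColumnName(i)).`type`().stringType().noDefault()
    }
  }
  fields.endRecord()
}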
How does it work?
[Same architecture diagram]
The Kafka broker(s) receive the data.
How does it work?
[Same architecture diagram]
The Hive metastore is updated and the HDFS connector writes into HDFS.
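Once the metastore knows about the dataset, Spark can query it like any Hive table. A minimal sketch against the Spark 1.5 API; the table name artist_gender is illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("bouquet-enrich"))
val hive = new HiveContext(sc)

// The HDFS connector registered the dataset in the metastore,
// so it is now visible as a regular Hive table.
val df = hive.sql("SELECT * FROM artist_gender")
df.show()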
Tachyon?
• Used as an in-memory filesystem to replace HDFS.
• Interacts with Spark through the HDFS plugin (see the sketch below).
• Transparent from the user's point of view.
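Since Tachyon speaks the HDFS client API, pointing Spark at it is only a matter of the URI scheme. A sketch reusing the SparkContext from the earlier slide; hostnames, ports and paths are illustrative (19998 is Tachyon's default master port):

// Same code path as HDFS; only the scheme and authority change.
val fromHdfs    = sc.textFile("hdfs://namenode:8020/bouquet/artist_gender")
val fromTachyon = sc.textFile("tachyon://master:19998/bouquet/artist_gender")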
How to keep the data structured?
Use a schema registry (Avro in Kafka). Each schema has a corresponding Kafka topic and a distinct Hive table.
{ "type": "record", "name": "ArtistGender", "fields" : [ {"name": "count", "type": "long"}, {"name": "gender", "type": "String"]} ] }
Challenges
- Auto-creation of a Kafka topic and a Hive table for each dataset coming from Bouquet.
- JDBC reads are too slow for something like Kafka.
- Issues with type conversion: for example, null is not supported in all cases (issue 272 on schema-registry).
- Versions: Kafka 0.9.0, Tachyon 0.7.1, Spark 1.5.2 with Hortonworks 2.3.4 (Dec 2015)
- Hive: setting the warehouse directory.
- Tachyon: setting up the hostname (both sketched below).
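Both settings, sketched for the versions above; the paths and hostname are placeholders, and the Tachyon property is normally set in Tachyon's own configuration (tachyon-env.sh) rather than from application code:

// Hive: point the metastore at the intended warehouse directory
// (hive.metastore.warehouse.dir is the standard Hive property),
// reusing the HiveContext from the earlier sketch.
hive.setConf("hive.metastore.warehouse.dir", "/apps/hive/warehouse")

// Tachyon: clients must resolve the master under the same hostname.
System.setProperty("tachyon.master.hostname", "master")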
Technology choice
• KISS: Kafka + Spark + Tachyon.
• Flexible (Hive, in-memory storage).
• Easily scalable.
Alternatives considered:
• GemFire, SnappyData, Apache Ignite for in-memory storage.
• Storm for streaming.
Status
• Injection DB -> Spark: OK
• Spark usage: OK
• Re-injection: in alpha stage
Re-injection
Two solutions:
• The Spark user notifies Bouquet that the data has changed (using a custom function).
• Bouquet pulls the data from Spark (sketched below).
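For the second solution, a sketch of the pull-and-write-back path with the Spark 1.5 DataFrame API, reusing the HiveContext from earlier; the JDBC URL, credentials and table name are placeholders:

import java.util.Properties

val props = new Properties()
props.put("user", "bouquet")          // placeholder credentials
props.put("password", "secret")

// Read the enriched result back out of Spark/Hive...
val enriched = hive.sql("SELECT * FROM artist_gender_enriched")

// ...and append it into the original relational database over JDBC.
enriched.write
  .mode("append")
  .jdbc("jdbc:postgresql://dbhost:5432/music", "artist_gender_enriched", props)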
We use it for real!
We are collaborating with La Poste, combining Bouquet, Spark and the re-injection mechanism with a geographical visualisation.
In the future
• Notebook integration.
• We have a DSL for the Bouquet API; we may want built-in Spark support.
• Improve scalability (bulk unload and Kafka fine tuning).
QUESTIONS?
OPENBOUQUET.IO