The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for fast, iterative in-memory and streaming analysis. This talk gives an introduction to the Spark stack, explains how Spark delivers lightning-fast results, and shows how it complements Apache Hadoop.
© 2014 MapR Technologies
CHUG Spark: Hello Spark
Mike Emerick, Senior Architect MapR
April 2014
Agenda
• Introductions
• Log File enrichment
• ETL with ML
• Recommendation Engine
• Adhoc SQL Queries
• The Future case
Who is Mike Emerick?
My bio: the highlights.
Architect for MapR for 2.5 years.
“creative hours at Workshop 88.”
Approach to this presentation
1. No API discussion
2. Architecture features and utilization
3. Use cases ... and why Spark?
Spark at 10,000 feet
• Fundamentally, Spark is an MPP (massively parallel processing) engine.
• Can use many storage subsystems (great for development).
• RDDs, accumulators, broadcast variables.
• MapReduce +.
• The Apache Spark site has great resources on architecture and API.
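As a rough illustration of the three building blocks named on this slide, the plain-Python sketch below mimics them without using Spark at all: a list of partitions stands in for an RDD, a read-only dict for a broadcast variable, and a per-task counter summed at the end for an accumulator. Every name here (`partitions`, `country_names`, `process_partition`) is hypothetical, and none of this is the Spark API.

```python
from collections import Counter
from functools import reduce

# Stand-in for an RDD: records split into partitions that can be processed independently.
partitions = [
    ["us", "de", "us"],
    ["fr", "xx", "de"],
]

# Stand-in for a broadcast variable: a small read-only lookup shared by every task.
country_names = {"us": "United States", "de": "Germany", "fr": "France"}


def process_partition(records, lookup):
    """Map step for one partition: returns (per-country counts, number of unknown codes)."""
    counts = Counter()
    unknown = 0  # this task's contribution to the "accumulator"
    for code in records:
        name = lookup.get(code)
        if name is None:
            unknown += 1
        else:
            counts[name] += 1
    return counts, unknown


# Run each partition separately (a cluster would run these tasks in parallel).
results = [process_partition(p, country_names) for p in partitions]

# Reduce step: merge the per-partition counts into one result.
totals = reduce(lambda a, b: a + b, (counts for counts, _ in results), Counter())

# Accumulator: the driver sums the side counters reported by each task.
unknown_total = sum(unknown for _, unknown in results)

print(dict(totals))
print("unknown codes:", unknown_total)
```

The point of the sketch is the division of labor: the lookup table is shipped once and only read, the main result flows through map and reduce, and the count of bad records travels on a side channel.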
Use case: SQL Queries
• “Interactive SQL on Hadoop...”
• How does Spark make this easier?
– Native HiveQL (SQL-92-ish)
– In memory and from disk
– Usually the first thought...
• Spark SQL
Use case: Log file enrichment
• Why enrich my log data?
• This is not Storm; it is batch
– Similar to the HBase async API
• How does Spark make this easier?
– Streaming API
– Sliding windows
– SQL via Hive/Shark (connect to HBase)
– NoSQL connectors (HBase)
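To make the "sliding windows" and enrichment bullets concrete, here is a minimal plain-Python sketch, assuming log records arrive in small batches and that enrichment is a key lookup. The dict `user_region` stands in for an HBase table, and the function names and record layout are invented for illustration; this is not Spark Streaming code.

```python
from collections import Counter, deque

# Hypothetical lookup table; in the slide's setup this role is played by HBase.
user_region = {"u1": "midwest", "u2": "west", "u3": "midwest"}


def enrich(record, lookup):
    """Attach a region to a raw log record (a dict with a 'user' key)."""
    return {**record, "region": lookup.get(record["user"], "unknown")}


def sliding_window_counts(batches, lookup, window_length=3, slide=1):
    """Yield hits per region over the last `window_length` batches, every `slide` batches."""
    window = deque(maxlen=window_length)  # the oldest batch drops out automatically
    for i, batch in enumerate(batches, start=1):
        window.append([enrich(r, lookup) for r in batch])
        if i % slide == 0:
            counts = Counter(r["region"] for b in window for r in b)
            yield dict(counts)


# Each inner list is one micro-batch of raw log records.
batches = [
    [{"user": "u1"}, {"user": "u2"}],
    [{"user": "u3"}],
    [{"user": "u9"}],
    [{"user": "u2"}, {"user": "u2"}],
]

windows = list(sliding_window_counts(batches, user_region))
for w in windows:
    print(w)
```

Each printed dict covers at most the three most recent batches, so the first batch stops contributing once the fourth arrives.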
Use case: SQL mixing with ML
• Why are folks doing this?
• How does Spark make this easier?
– Native machine learning with MLlib
– Access to near-real-time ad hoc SQL queries
– R and SQL in the same place
– Bigger than memory, faster than MapReduce
Use case: Recommendation Engine
• It is a recommendation engine...
• How does Spark make this easier?
– ETL and enrichment
– MLlib makes it easy to import data
– MLlib training in the same cluster
– NoSQL store serves recommendations ad hoc
– Dynamic
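The pipeline on this slide (ETL, training, serving from a key-value store) can be sketched end to end in a few lines of plain Python. This is only an illustration under loud assumptions: simple item co-occurrence counting is swapped in for MLlib training, a dict stands in for the NoSQL store, and the event data and function names are made up.

```python
from collections import defaultdict
from itertools import permutations

# ETL output: (user, item) pairs, e.g. purchases extracted from enriched logs.
events = [
    ("ann", "hadoop-book"), ("ann", "spark-book"),
    ("bob", "hadoop-book"), ("bob", "spark-book"), ("bob", "drill-book"),
    ("cat", "spark-book"), ("cat", "drill-book"),
]


def train(events):
    """'Training' stand-in: count how often two items share a user's history."""
    by_user = defaultdict(set)
    for user, item in events:
        by_user[user].add(item)
    cooccur = defaultdict(lambda: defaultdict(int))
    for items in by_user.values():
        for a, b in permutations(sorted(items), 2):
            cooccur[a][b] += 1
    return by_user, cooccur


def build_serving_table(by_user, cooccur, top_n=2):
    """Precompute top-N unseen items per user; a dict stands in for the NoSQL store."""
    table = {}
    for user, seen in by_user.items():
        scores = defaultdict(int)
        for item in seen:
            for other, count in cooccur[item].items():
                if other not in seen:
                    scores[other] += count
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda i: (-scores[i], i))
        table[user] = ranked[:top_n]
    return table


by_user, cooccur = train(events)
serving = build_serving_table(by_user, cooccur)
print(serving["ann"])
```

Serving is then a single key lookup per user, which is why a NoSQL store fits the last step; rerunning the training and rebuilding the table is what keeps the recommendations dynamic.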
Use cases build in complexity
• Adoption follows a curve of complexity
– Ingestion and query
– Ingestion, enrichment, query
– Ingestion, enrichment, machine learning, query
– Ingestion, enrichment, machine learning, serving recommendations
– .....
• Spark is flattening the curve
• Why?
– One framework
– Less data movement
– Access to preferred language
Future state: ~ in the year 2000
• ADAM – Genomics
• GraphX – Graph is near...
• MLlib – Look for lots of work here
• PySpark – Fastest evolving
• SparkR – Just getting started
• BlinkDB – Approximate queries
• OEM...
Business Services
MapR is hiring in Chicago
Apache Drill Beta this Summer
Happy National Making Day!
Check out W88 for Hadoop classes