Big Data Processing with Spark and AWS EMR @glomex
17.10.2016, Michael Ludwig
Our Architecture
Our Use Cases
Billing Pre-Aggregations
Interactive Big Data
Spark components
Spark 1.6, PySpark, spark-submit, DataFrames, SparkSQL, UDFs, Accumulators
Example: SparkSQL
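A minimal sketch of what a SparkSQL job with a UDF can look like in PySpark. The table name `events`, the columns, and the helper `cents_to_euros` are hypothetical and only illustrate the billing pre-aggregation use case. The code is written against the `SparkSession` API of Spark 2.0+; on Spark 1.6 the entry point would be `SQLContext` and the table would be registered with `registerTempTable`.

```python
def cents_to_euros(cents):
    """Plain Python function; Spark can wrap it as a SQL UDF."""
    return cents / 100.0


# Hypothetical pre-aggregation: revenue per customer per day.
BILLING_QUERY = """
    SELECT customer_id, day, SUM(cents_to_euros(price_cents)) AS revenue_eur
    FROM events
    GROUP BY customer_id, day
"""


def aggregate_billing(spark, rows):
    """Run BILLING_QUERY over rows of (customer_id, day, price_cents)."""
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.sql.types import DoubleType

    events = spark.createDataFrame(rows, ["customer_id", "day", "price_cents"])
    events.createOrReplaceTempView("events")  # make the DataFrame visible to SQL
    spark.udf.register("cents_to_euros", cents_to_euros, DoubleType())
    return spark.sql(BILLING_QUERY)
```

With a session from `SparkSession.builder.appName("billing").getOrCreate()`, calling `aggregate_billing(spark, [("c1", "2016-10-17", 250)]).show()` would print the aggregated rows.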
EMR Cluster Startup
§ AWS Web Console
§ AWS CLI
§ AWS SDKs (Python, Java, JS etc.)
Startup parameters
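As an illustration of the parameters involved, here is a sketch of a cluster startup through the Python SDK (boto3). The release label, instance types, region, and bucket are assumptions chosen for the example, not values from the talk; the role names are the AWS default EMR roles.

```python
def build_cluster_request(name, log_uri, instance_count=3):
    """Assemble keyword arguments for the EMR RunJobFlow API call."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-4.7.2",  # assumed; pick a release that ships your Spark version
        "LogUri": log_uri,            # S3 location for cluster and step logs
        "Applications": [{"Name": "Spark"}, {"Name": "Ganglia"}],
        "Instances": {
            "MasterInstanceType": "m3.xlarge",  # assumed instance types
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": instance_count,    # master plus core nodes
            "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up for interactive use
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "VisibleToAllUsers": True,
    }


def start_cluster(request, region="eu-west-1"):
    """Start the cluster and return its id; needs boto3 and AWS credentials."""
    import boto3

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]
```

Keeping the request in a plain dictionary makes it easy to review or version the startup parameters before any cluster is actually launched.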
Cluster Interaction
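One way to interact with a running cluster is to submit a Spark job as an EMR step. The sketch below assumes boto3 and an EMR 4.x or later release, where `command-runner.jar` is available; the script location, step name, and region are hypothetical.

```python
def build_spark_step(name, script_uri, script_args=()):
    """Describe an EMR step that runs spark-submit on the cluster."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # keep the cluster alive if the job fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri]
            + list(script_args),
        },
    }


def submit_step(cluster_id, step, region="eu-west-1"):
    """Add the step to a running cluster and return the new step id."""
    import boto3

    emr = boto3.client("emr", region_name=region)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]
```

The alternative is to SSH to the master node and call `spark-submit` there directly, which is often quicker during development.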
Monitoring: Spark UI
Monitoring: Ganglia on EMR
Error Troubleshooting
Summary
§ EMR
  § Easy cluster startup and configuration
  § Throw-away, isolated clusters
  § No big upfront investments needed
§ Spark
  § Best framework to get started with big data
  § Big community & fast development
  § Local development easy
EMR Access URLs
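A small sketch that builds the web UI addresses from the master node's public DNS name. The ports and paths are the usual EMR defaults for the YARN ResourceManager, the Spark History Server, and Ganglia; the helper name `emr_urls` is made up for this example. These ports are normally not open to the internet, so reaching them typically requires an SSH tunnel to the master node.

```python
# Default locations of the web interfaces on the EMR master node.
EMR_UI_PATHS = {
    "yarn_resource_manager": ":8088/",
    "spark_history_server": ":18080/",
    "ganglia": "/ganglia/",
}


def emr_urls(master_dns):
    """Map each web interface name to its URL on the given master node."""
    return {
        name: "http://" + master_dns + suffix
        for name, suffix in EMR_UI_PATHS.items()
    }
```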
RDD, DataFrame and Dataset
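To make the difference concrete, the sketch below computes the same per-customer total once with the RDD API and once with the DataFrame API. The input format and the function names are assumptions for illustration. The typed Dataset API exists only in Scala and Java, so PySpark code works with RDDs and DataFrames.

```python
def parse_line(line):
    """Turn 'customer,price' text into a (customer, price) tuple."""
    customer, price = line.split(",")
    return customer, int(price)


def total_with_rdd(sc, lines):
    """RDD API: opaque Python functions applied to arbitrary objects."""
    pairs = sc.parallelize(lines).map(parse_line)
    return dict(pairs.reduceByKey(lambda a, b: a + b).collect())


def total_with_dataframe(spark, lines):
    """DataFrame API: named columns, so Spark can optimise the query plan."""
    from pyspark.sql import functions as F

    rows = [parse_line(line) for line in lines]
    df = spark.createDataFrame(rows, ["customer", "price"])
    result = df.groupBy("customer").agg(F.sum("price").alias("total"))
    return {row["customer"]: row["total"] for row in result.collect()}
```

Here `sc` is a `SparkContext` and `spark` a `SparkSession`; both functions should return the same dictionary for the same input lines.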
In-Memory Computation
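A minimal sketch of in-memory reuse, assuming a `SparkContext` named `sc` and a hypothetical list of event dictionaries. Caching the filtered RDD lets the second action read the partitions from executor memory instead of recomputing the filter.

```python
def is_billable(event):
    """Hypothetical predicate: keep events with a positive price."""
    return event["price"] > 0


def count_and_sum(sc, events):
    """Run two actions over the same cached RDD."""
    billable = sc.parallelize(events).filter(is_billable)
    billable.cache()  # mark the RDD to be kept in memory once computed

    count = billable.count()  # first action: evaluates the filter, fills the cache
    total = billable.map(lambda e: e["price"]).sum()  # second action: reuses the cache
    billable.unpersist()  # release the memory when done
    return count, total
```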
Operations
Sample Transformations
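A short word-count sketch showing a few common RDD transformations, assuming a `SparkContext` named `sc`. The function names and the `min_length` parameter are made up for the example. Transformations such as `flatMap`, `filter`, `map`, and `reduceByKey` are lazy; nothing runs until the `collect` action at the end.

```python
def tokenize(line):
    """Split a line into lower-case words."""
    return line.lower().split()


def word_counts(sc, lines, min_length=3):
    """Count words of at least min_length characters."""
    words = sc.parallelize(lines).flatMap(tokenize)            # one record per word
    long_words = words.filter(lambda w: len(w) >= min_length)  # drop short words
    pairs = long_words.map(lambda w: (w, 1))                   # key-value pairs
    counts = pairs.reduceByKey(lambda a, b: a + b)             # sum per word
    return counts.collect()  # action: triggers the whole pipeline
```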