Alin Blidisel - Spark: Big Data Beyond MapReduce
Apache Spark: Introduction, Examples, Data Analysis and Statistics
Blidisel Alin
Alin Blidisel - Big Data: Beyond MapReduce
WHY SPARK?
(Figure: Hadoop vs. Spark)
SPARK - INTRODUCTION
- was created by Matei Zaharia at UC Berkeley
- was donated to the Apache Software Foundation in 2013; it aims to speed up Hadoop-style computation
- is not a modified version of Hadoop
- uses in-memory cluster computing
- has its own cluster management, so it can run without Hadoop
- is designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries and streaming
SPARK COMPONENTS
FEATURES OF APACHE SPARK
- Lightning-Fast Processing (10 to 100 times faster than Hadoop MapReduce)
- Ease of Use, as it supports multiple languages (Scala, Java, Python, R)
- Support for Sophisticated Analytics
- Real-Time Stream Processing
- Ability to Integrate with Hadoop and Existing Hadoop Data
- Active and Expanding Community (more than 250 developers have contributed to Spark already)
RESILIENT DISTRIBUTED DATASETS (RDDS)
- fault-tolerant collection of elements that can be operated on in parallel (distributed and immutable)
- two ways to create RDDs:
  - parallelizing an existing collection in your driver program
  - referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat
- persistence (MEMORY_ONLY*, MEMORY_AND_DISK*, DISK_ONLY, OFF_HEAP)
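The two creation routes and the persistence setting can be sketched in PySpark roughly as follows. This is a minimal sketch, not code from the slides: `build_rdds` and `sum_of_squares` are hypothetical helper names, the path is a placeholder, and both functions assume the caller passes in an existing `SparkContext` and a `pyspark.StorageLevel` constant.

```python
def build_rdds(sc, path, storage_level):
    """Create RDDs in the two ways listed above and persist one of them.

    sc            -- an existing pyspark.SparkContext (assumed, created by the caller)
    path          -- a file URI readable by Spark (placeholder, e.g. an hdfs:// URI)
    storage_level -- a pyspark.StorageLevel constant, e.g. StorageLevel.MEMORY_AND_DISK
    """
    # 1) Parallelize a collection that lives in the driver program.
    numbers = sc.parallelize([1, 2, 3, 4, 5])

    # 2) Reference a dataset in external storage, one element per line.
    lines = sc.textFile(path)

    # Ask Spark to keep this RDD around once it has been computed.
    lines.persist(storage_level)
    return numbers, lines


def sum_of_squares(rdd):
    """map is a lazy transformation; reduce is the action that runs the job."""
    return rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
```

With a real context, a call might look like `build_rdds(sc, "hdfs:///path/to/file.txt", StorageLevel.MEMORY_AND_DISK)` after `from pyspark import StorageLevel`; for the five numbers above, `sum_of_squares(numbers)` would give 55.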
SPARK CLUSTER MODE OVERVIEW
SPARK USER INTERFACE
EXAMPLE: DATA ANALYSIS
Sample data from a sales transactions CSV file
Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude
1/2/09 6:17,Product1,1200,Mastercard,carolina,Basildon,England,United Kingdom,1/2/09 6:00,1/2/09 6:08,51.5,-1.1166667
1/2/09 4:53,Product1,1200,Visa,Betina,Parkville,MO,United States,1/2/09 4:42,1/2/09 7:49,39.195,-94.68194
1/2/09 13:08,Product1,1200,Mastercard,Federica e Andrea,Astoria,OR,United States,1/1/09 16:21,1/3/09 12:32,46.18806,-123.83
1/3/09 14:44,Product1,1200,Visa,Gouya,Echuca,Victoria,Australia,9/25/05 21:13,1/3/09 14:22,-36.1333333,144.75
1/4/09 12:56,Product2,3600,Visa,Gerd W ,Cahaba Heights,AL,United States,11/15/08 15:47,1/4/09 12:45,33.52056,-86.8025
1/4/09 13:19,Product1,1200,Visa,LAURENCE,Mickleton,NJ,United States,9/24/08 15:19,1/4/09 13:04,39.79,-75.23806
1/4/09 20:11,Product1,1200,Mastercard,Fleur,Peoria,IL,United States,1/3/09 9:38,1/4/09 19:45,40.69361,-89.58889
1/2/09 20:09,Product1,1200,Mastercard,adam,Martin,TN,United States,1/2/09 17:43,1/4/09 20:01,36.34333,-88.85028
1/4/09 13:17,Product1,1200,Mastercard,Renee Elisabeth,Tel Aviv,Tel Aviv,Israel,1/4/09 13:03,1/4/09 22:10,32.0666667,34.7666667
LOAD ORIGINAL CSV FROM HDFS
- Create the Spark Context and define input parameters
- Create an RDD from the CSV file
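One possible way to carry out this step in PySpark is sketched below. It is an illustration under stated assumptions, not the code shown on the slide: `parse_line` and `load_sales_rdd` are hypothetical helper names, and the application name and HDFS path in the usage comment are placeholders.

```python
import csv


def parse_line(line):
    """Split one CSV line into a list of fields, honouring quoted commas."""
    return next(csv.reader([line]))


def load_sales_rdd(sc, path):
    """Return (header, rows): the column names and an RDD of field lists.

    sc   -- an existing pyspark.SparkContext (assumed, created by the caller)
    path -- location of the CSV file, e.g. an hdfs:// URI (placeholder)
    """
    lines = sc.textFile(path)        # one RDD element per line of the file
    header_line = lines.first()      # the first line holds the column names
    rows = (lines
            .filter(lambda l: l != header_line)  # drop the header row
            .map(parse_line))                    # split each line into fields
    return parse_line(header_line), rows


# Hypothetical usage (application name and path are placeholders):
#   from pyspark import SparkContext
#   sc = SparkContext(appName="sales-analysis")
#   header, rows = load_sales_rdd(sc, "hdfs:///path/to/sales.csv")
```

Using the `csv` module instead of `line.split(",")` keeps quoted fields that contain commas in one piece.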
GET RANDOM DATA AND CREATE A DATAFRAME
DETERMINE FIELD TYPES
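Field types can be guessed from a sample of rows by trying progressively looser parsers on each column. The sketch below is plain Python and only one way to do it; the helper names, the four type labels and the timestamp format (read off sample values such as `1/2/09 6:17`) are assumptions, not taken from the slide.

```python
from datetime import datetime

# Assumed from the sample rows, e.g. "1/2/09 6:17" (month/day/two-digit year).
TIMESTAMP_FORMAT = "%m/%d/%y %H:%M"

# Parsers are tried in order, narrowest type first.
CHECKS = [
    ("int", int),
    ("float", float),
    ("timestamp", lambda v: datetime.strptime(v, TIMESTAMP_FORMAT)),
]


def _parses(convert, value):
    """True if `convert` accepts the string without raising ValueError."""
    try:
        convert(value)
        return True
    except ValueError:
        return False


def infer_type(values):
    """Return the first type label whose parser accepts every non-empty value."""
    values = [v.strip() for v in values if v.strip()]
    if not values:
        return "string"
    for name, convert in CHECKS:
        if all(_parses(convert, v) for v in values):
            return name
    return "string"


def infer_schema(header, rows):
    """Map each column name to an inferred type label."""
    schema = {}
    for i, name in enumerate(header):
        column = [row[i] for row in rows if i < len(row)]
        schema[name] = infer_type(column)
    return schema


# Hypothetical Spark usage: infer from a random sample collected to the driver,
#   sample = rows.takeSample(False, 100)
#   schema = infer_schema(header, sample)
```

Because the guess comes from a sample, a column can still contain values elsewhere in the file that do not fit the inferred type.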
CREATE NEW DATAFRAME BASED ON THE NEWLY DETERMINED FIELD TYPES
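Once a type is known for each column, every row of strings can be converted to typed values before the DataFrame is built. The following is a minimal sketch under assumptions: `cast_row`, the type labels and the timestamp format are hypothetical, and the `createDataFrame` call in the comment assumes an existing `SQLContext` named `sqlContext`.

```python
from datetime import datetime

# Assumed from the sample rows, e.g. "1/2/09 6:17".
TIMESTAMP_FORMAT = "%m/%d/%y %H:%M"

# One converter per (hypothetical) type label.
CASTERS = {
    "int": int,
    "float": float,
    "timestamp": lambda v: datetime.strptime(v, TIMESTAMP_FORMAT),
    "string": str,
}


def cast_row(row, type_names):
    """Convert a list of strings to a tuple of typed values.

    Surrounding whitespace is stripped; empty fields become None.
    """
    typed = []
    for value, type_name in zip(row, type_names):
        value = value.strip()
        typed.append(CASTERS[type_name](value) if value else None)
    return tuple(typed)


# Hypothetical Spark usage, given `rows` (an RDD of field lists), `header`
# (a list of column names) and `type_names` (one label per column):
#   typed_rows = rows.map(lambda r: cast_row(r, type_names))
#   df = sqlContext.createDataFrame(typed_rows, header)
```

Passing only column names lets Spark infer the column types from the Python values; an explicit schema would be the stricter alternative.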
SAVE DATA IN PARQUET FORMAT
This is the new updated schema
GENERATE STATISTICS
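As an illustration of the kind of summary this step produces, the sketch below computes count, mean, sample standard deviation, minimum and maximum for one numeric column in plain Python. The function name `describe_column` is hypothetical, and the input is the Price column of the nine sample rows shown earlier.

```python
import statistics


def describe_column(values):
    """Summarise a numeric column, ignoring missing (None) values."""
    values = [v for v in values if v is not None]
    if not values:
        return {"count": 0, "mean": None, "stddev": None, "min": None, "max": None}
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        # Sample standard deviation needs at least two values.
        "stddev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


# Price column of the nine sample transactions.
prices = [1200, 1200, 1200, 1200, 3600, 1200, 1200, 1200, 1200]
summary = describe_column(prices)

# On a Spark DataFrame `df`, a comparable summary is available through
#   df.describe("Price").show()
```

For the sample prices this gives a count of 9, a mean of about 1466.67, a standard deviation of 800, a minimum of 1200 and a maximum of 3600.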
© 2016 Atigeo, Corporation. All rights reserved. Atigeo and the xPatterns logo are trademarks of Atigeo. The information herein is for informational purposes only and represents the current view of Atigeo as of the date of this presentation. Because Atigeo must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Atigeo, and Atigeo cannot guarantee the accuracy of any information provided after the date of this presentation. ATIGEO MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
Thank you!