34
Radu Chilom [email protected] In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta [email protected]

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

  • Upload
    others

  • View
    9

  • Download
    0

Embed Size (px)

Citation preview

Page 1: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

Radu Chilom

[email protected]

In-memory data pipeline and warehouse at scale using Spark, Spark SQL,

Tachyon and Parquet

Ema Iancuta

[email protected]

Page 2: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

• Big data analytics / machine learning

• 6+ years with Hadoop ecosystem

• 2 years with Spark

• http://atigeo.com/

• A research group that focuses on the technical problems that exist in the big data industry and provides open source solutions • http://bigdataresearch.io/

Page 3: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

• Intro

• Use Case

• Data pipeline with Spark

• Spark Job Rest Service

• Spark SQL Rest Service (Jaws)

• Parquet

• Tachyon

• Demo

Agenda

Page 4: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

• Build an in memory data pipeline for millions financial transactions used downstream by data scientists for detecting fraud • Ingestion from S3 to our Tachyon/HDFS

cluster • Data transformation • Data warehouse

Use Case

Page 5: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

• “fast and general engine for large-scale data processing” • Built around the concept of RDD • API for Java/Scala/Python (80 operators)

• powers a stack of high level tools including Spark SQL, MLlib, Spark Streaming.

Apache Spark

Page 6: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

Public S3 Bucket: public-financial-transactions

public-financial-transactions (s3-bucket)

scheme scheme.csv

data input-0.csv

data2

input-1.csv

. . .

. . .

Page 7: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

• Download from S3

1. Ingestion

• Resolving the wildcards means listing files metadata

• Listing the metadata for a large number of files from external sources can take a long time

Page 8: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

Listing the metadata (distributed)

Driver

Worker Worker Worker

folder1 folder2 folder3 folder4 folder5 folder6

folder1 folder2

folder3 folder4

folder5 folder6

file-11 file-12 file-21 file-22 file-23

file-31 file-32 file-41 file-42 file-43 file-44

file-51 file-52 file-61

Page 9: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

Listing the metadata (distributed)

• For fine tuning, specify the number of partitions

Page 10: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

• Unbalanced partitions

Download Files

Page 11: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

Unbalanced partitions

Partition 0

transactions.csv

Partition 1

input.csv data.csv

values.csv buzzwords.csv buzzwords.txt

Page 12: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

Balancing partitions

Partition 0

(0, transactions.csv) (2, data.csv)

(4, buzzwords.csv)

Partition 1

(1, input.csv) (3, values.csv)

(5, buzzwords.txt)

Page 13: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

• Balancing partitions

Keep in mind that repartitioning your data is a fairly expensive operation.

Balancing partitions

Page 14: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

• Data cleaning is the first step in any data science project

• For this use-case: - Remove lines that don't match the structure - Remove “useless” columns - Transform data to be in a consistent format

2. Data Transformation

Page 15: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

• Join

Find Country char code

Numeric Format Alpha 2 Format

276 DE

NameGermany

• Problem with skew in the key distribution

Page 16: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

Metrics for Join

Page 17: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

• Broadcast Country Codes Map

Find Country char code

Page 18: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

Metrics

Page 19: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

Transformation with Join vs Broadcasted Map

(skewed key)

Seco

nds

0

60

120

180

240

300

Rows

1 Million 2 Million 3 Million

Join Broadcasted Map

Page 20: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

• Supports multiple contexts • Launches a new process for each Spark context • Inter-process communication with Akka actors • Easy context creation & job runs • Supports Java and Scala code • Friendly UI

Spark-Job-Resthttps://github.com/Atigeo/spark-job-rest

Page 21: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

• Hive • Apache Pig • Impala • Presto • Stinger (Hive on Tez) • Spark SQL

Build a data warehouse

Page 22: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

Spark SQL

• Support for multiple input formats

• Rich language interfaces

• RDD-aware optimizer

RDD

DataFrame / SchemaRDD

JDBC

HIVE QL SQL

Page 23: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

Creating a data frame

Page 24: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

Perform a simple query:Explore data

> Directly on the data frame

> Registering a temporary table

- select

- filter

- join

- groupBy

- agg

- join

- count

- sort

- where ..etc.

Page 25: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

Creating a data warehouse

https://github.com/Atigeo/xpatterns-spark-parquet

Page 26: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

• TextFile

• SequenceFile

• RCFile (RowColumnar)

• ORCFile (OptimizedRowColumnar)

• Avro

• Parquet

File Formats

> columnar format > good for aggregation queries > only the required columns are read from disk > nested data structures > schema with the data > spark sql supports schema evolution > efficient compression

Page 27: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

Tachyon• memory-centric distributed file system

enabling reliable file sharing at memory-speed across cluster frameworks

• Pluggable underlayer file system: hdfs, S3,…

Page 28: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

Caching in Spark SQL

• Cache data in columnar format • Automatically compression tune

Page 29: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

• spark context might crash

Spark cache vs Tachyon

• GC kicks in

• share data between different applications

Page 30: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

- Highly scalable and resilient data warehouse

- Submit queries concurrently and asynchronously

- Restful alternative to Spark SQL JDBC having a interactive UI

- Since Spark 091 with Shark

- Support for Spark SQL and Hive - MR (and more to come)

https://github.com/Atigeo/jaws-spark-sql-rest

Jaws spark sql rest

Page 31: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

- Akka actors to communicate through instances

- Support cancel queries

- Supports large results retrieval

- Parquet in memory warehouse

- returns persisted logs, results, query history

- provides a metadata browser

- configuration file to fine tune spark

Jaws main features

Page 32: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

https://github.com/big-data-research/in-memory-data-pipeline

Code available at

Page 33: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

‹#›

Q & A

Page 34: In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquetfiles.meetup.com/.../InMemoryDataPipelineAndWarehouse.pdf · 2015-06-15 · In-memory data

© 2013 Atigeo, LLC. All rights reserved. Atigeo and the xPatterns logo are trademarks of Atigeo. The information herein is for informational purposes only and represents the current view of Atigeo as of the date of this presentation. Because Atigeo must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Atigeo, and Atigeo cannot guarantee the accuracy of any information provided after the date of this presentation. ATIGEO MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.