Not Your Father's Database: How to Use Apache® Spark™ Properly in Your Big Data Architecture
About Me
• 2005 Mobile Web & Voice Search
• 2012 Reporting & Analytics
• 2014 Solutions Engineering
Is this your Spark infrastructure?
This system talks like a SQL database… but the performance is very different.
[Diagram: a Spark cluster backed by HDFS]
Just in Time Data Warehouse w/ Spark
[Diagram: Spark querying data in place on HDFS, other data stores, and more…]
Separate Compute vs. Storage
Benefits:
• No need to import your data into Spark to begin processing.
• Dynamically scale Spark clusters to match compute vs. storage needs.
• Choose the best data storage, with the right performance characteristics, for your use case.
Today's Goal
Know when to use other data stores besides file systems.
Use Case: Data Warehousing
Good: General Purpose Processing
Types of data sets to store in file systems:
• Archival data
• Unstructured data
• Social media and other web datasets
• Backup copies of data stores
Types of workloads:
• Batch workloads
• Ad hoc analysis (best practice: use in-memory caching; see the sketch after this list)
• Multi-step pipelines
• Iterative workloads
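A minimal sketch of the in-memory caching best practice, using the Spark 1.x SQLContext API that the rest of this deck uses; the table name is illustrative.

    # Cache a registered table in memory for repeated ad hoc queries.
    sqlContext.cacheTable("my_table")
    sqlContext.sql("select count(*) from my_table").show()  # first query materializes the cache
    # ... further ad hoc queries now read from memory ...
    sqlContext.uncacheTable("my_table")  # release the memory when done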
Benefits:
• Inexpensive storage
• Incredibly flexible processing
• Speed and scale
Bad: Random Access
sqlContext.sql("select * from my_large_table where id=2134823")
Will this command run in Spark?
Yes, but it's not very efficient: Spark may have to go through all of your files to find your row.
Solution: If you frequently access your data randomly, use a database.
• For traditional SQL databases, create an index on your key column.
• Key-value NoSQL stores retrieve the value of a key efficiently out of the box.
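As an illustration, here is a sketch of pushing the point lookup down to an indexed database through Spark's JDBC data source, so the database does an index seek instead of Spark scanning files. The URL, credentials, and the subquery-as-table trick are assumptions for this example.

    # The WHERE clause runs on the database, which can use its index on id.
    row = sqlContext.read.jdbc(
        url="jdbc:postgresql://dbhost:5432/mydb",
        table="(select * from my_large_table where id = 2134823) as t",
        properties={"user": "spark", "password": "..."})
    row.show()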
Bad: Frequent Inserts
sqlContext.sql("insert into TABLE myTable select fields from my2ndTable")
Each insert creates a new file:
• Inserts are reasonably fast.
• But querying will be slow…
Solution:
• Option 1: Use a database to support the inserts.
• Option 2: Routinely compact your Spark SQL table files (a sketch follows).
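A minimal sketch of Option 2, assuming the table's data lives in a Parquet directory; the paths and file count are illustrative.

    # Read the many small files produced by repeated inserts and rewrite
    # them as a few larger files. Writing to a new path and swapping it in
    # afterwards is safer than overwriting the input in place.
    df = sqlContext.read.parquet("/data/myTable")
    df.coalesce(16).write.parquet("/data/myTable_compacted")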
Good: Data Transformation/ETL
Use Spark to splice and dice your data files any way you need (a sketch follows).
File storage is cheap: it is not an "anti-pattern" to store duplicate copies of your data.
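A minimal ETL sketch of the splice-and-dice idea: read raw JSON, keep the fields you need, and write a duplicate, query-friendly columnar copy. Paths and field names are assumptions.

    # Raw logs in; clean, partitioned Parquet out.
    raw = sqlContext.read.json("/raw/events/*.json")
    cleaned = (raw.filter(raw.userId.isNotNull())
                  .select("userId", "eventType", "timestamp"))
    cleaned.write.partitionBy("eventType").parquet("/warehouse/events")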
Bad: Frequent/Incremental Updates
Update statements are not supported yet. Why not?
• Random access: locate the row(s) in the files.
• Delete & insert: delete the old row and insert a new one.
• Update: file formats aren't optimized for updating rows.
Solution: Many databases support efficient update operations.
Use case: up-to-date, live views of your SQL tables.
Pattern: Database Snapshot + Incremental SQL Query.
Tip: Use CLUSTER BY for fast joins, or bucketing with Spark 2.0 (see the sketch below).
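A sketch of the CLUSTER BY tip, assuming a snapshot table and an incremental-updates table joined on an id key; in Spark 2.0 the same idea is available as bucketing on the DataFrameWriter. Table and column names are illustrative.

    # CLUSTER BY distributes and sorts the rows by the join key, which
    # helps the repeated snapshot-plus-increments join.
    sqlContext.sql(
        "select * from increments cluster by id").registerTempTable("increments_by_id")
    live = sqlContext.sql(
        "select * from snapshot s join increments_by_id i on s.id = i.id")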
Good: Connecting BI Tools
Tip: Cache your tables for optimal performance.
[Diagram: BI tools querying Spark over HDFS]
Bad: External Reporting w/ Load
Too many concurrent requests will start to queue up.
Solution: Write out to a DB as a cache to handle the load.
[Diagram: Spark on HDFS writing results to a DB that serves the reports]
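A sketch of that solution: compute the report once in Spark, then write it to an external database over JDBC so concurrent report requests hit the database rather than the Spark cluster. The query, URL, credentials, and table name are assumptions.

    # Compute once in Spark, serve many times from the DB.
    report = sqlContext.sql(
        "select region, count(*) as visits from weblogs group by region")
    report.write.jdbc(
        url="jdbc:postgresql://dbhost:5432/reports",
        table="region_visits",
        mode="overwrite",  # refresh the cached report on each run
        properties={"user": "spark", "password": "..."})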
Use Case: Advanced Analytics and Data Science
Good: Machine Learning & Data Science
Use MLlib, GraphX, and Spark Packages for machine learning and data science.
Benefits:
• Built-in distributed algorithms.
• In-memory capabilities for iterative workloads.
• All-in-one solution: data cleansing, featurization, training, testing, serving, etc.
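A minimal MLlib sketch of a built-in distributed algorithm: k-means clustering over cached feature vectors. The input path and parameters are illustrative.

    from pyspark.mllib.clustering import KMeans
    from pyspark.mllib.linalg import Vectors

    # Parse one feature vector per CSV line.
    points = sc.textFile("/data/features.csv") \
               .map(lambda line: Vectors.dense([float(x) for x in line.split(",")]))
    points.cache()  # k-means is iterative, so in-memory caching pays off
    model = KMeans.train(points, k=10, maxIterations=20)
    print(model.clusterCenters)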
Bad: Searching Content w/ Load
sqlContext.sql("select * from mytable where name like '%xyz%'")
Spark will go through each row to find results.
Use Case: Streaming and Realtime Analytics
Good: Periodic Scheduled Jobs
Schedule your workloads to run on a regular basis:
• Launch a dedicated cluster for important workloads.
• Output your results as reports or store them to files/a database.
• Poor man's streaming: Spark is fast, so push the interval to be frequent (see the sketch after this list).
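A sketch of the poor man's streaming idea: a small batch job, kicked off frequently by an external scheduler such as cron, that summarizes the latest input files. Paths, field names, and the file-selection scheme are assumptions.

    # Each scheduled run processes the most recent batch of input files
    # and appends its summary to a report table.
    new_data = sqlContext.read.json("/incoming/current/*.json")
    summary = new_data.groupBy("eventType").count()
    summary.write.mode("append").parquet("/reports/event_counts")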
Bad: Low-Latency Stream Processing
Spark Streaming can detect new files dropped into a folder to process, but there is a delay while a whole file's worth of data builds up.
Solution: Send data to message queues, not files (see the sketch below).
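A sketch of the message-queue alternative using Spark Streaming's Kafka direct stream from the Spark 1.x line this talk targets; the broker, topic, and batch interval are assumptions.

    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    ssc = StreamingContext(sc, 1)  # 1-second micro-batches
    stream = KafkaUtils.createDirectStream(
        ssc, topics=["events"],
        kafkaParams={"metadata.broker.list": "broker1:9092"})
    # Records arrive as (key, value) pairs with no wait for files to fill up.
    stream.map(lambda kv: kv[1]).count().pprint()
    ssc.start()
    ssc.awaitTermination()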
Thank you
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data Architecture
Spark Summit East 2016