Peeling the Onion
How Data Abstractions Help Build BigData Apps

Andreas Neumann @caskoid

November 2016

Cask, CDAP, Cask Hydrator and Cask Tracker are trademarks or registered trademarks of Cask Data. Apache Spark, Spark, the Spark logo, Apache Hadoop, Hadoop and the Hadoop logo are trademarks or registered trademarks of the Apache Software Foundation. All other trademarks and registered trademarks are the property of their respective owners.

BDAM Data Abstractionscustomers.cask.co/rs/882-OYR-915/images/BDAM 11-9_Data... · 2020. 8. 23. · How Data Abstractions Help Build BigData Apps Andreas Neumann @caskoid ... - Apache




Abstractions are Everywhere


Piet Mondrian, Trafalgar Square, Broadway Boogie Woogie


The Case for Abstractions

Abstraction is a mental process we use when trying to discern what is essential or relevant to a problem.

Tom G. Palmer


Common Abstractions in Computing

- Programming Languages
  - Assembler > C > C++ > Java > Scala > ?
  - Memory management, Concurrency, Closures, …
- Web App Servers
  - CGI-bin > Servlets > JAX-RS
  - Connection Pools, Security, ...
- Relational Databases
  - Primitive types > Semi-structured > ORM
  - Transactions, rollback, isolation


Abstractions in Hadoop

- MapReduce
  - Input/OutputFormat provides some kind of abstraction
  - Intermediate data (mapper output) must be Writable
- HBase
  - Row/column keys and values are byte[]
  - Client must implement encoding of higher level types
  - Transactions: Isolation, Consistency
- Existing data abstractions for Hadoop
  - Apache Hive, Apache Phoenix, …


Layers of Abstractions

[Figure: onion diagram of abstraction layers. Labels: engine; capability injection; data sharing; integrations; encapsulation; access patterns; consistency; isolation; storage format; schema]


Storage Engine Abstraction

- Storage Engine
  - Physical Storage Medium
  - Lowest level of the abstraction stack
- Benefits
  - Application code not “polluted” with low-level storage APIs
  - Portability across storage engines
  - Portability across different versions of the storage engine
  - Testability in environments with different storage engines
  - Reusability of code
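As an illustration of the portability and testability points, here is a minimal Python sketch. The names `KeyValueTable`, `InMemoryTable` and `record_visit` are hypothetical and are not taken from CDAP or any storage engine API; the only assumption is that the application sees a small put/get interface over byte arrays.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class KeyValueTable(ABC):
    """Hypothetical engine-neutral interface that application code depends on."""

    @abstractmethod
    def put(self, key: bytes, value: bytes) -> None:
        ...

    @abstractmethod
    def get(self, key: bytes) -> Optional[bytes]:
        ...


class InMemoryTable(KeyValueTable):
    """Stand-in engine for tests; another engine would provide its own subclass."""

    def __init__(self) -> None:
        self._rows: Dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._rows[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        return self._rows.get(key)


def record_visit(table: KeyValueTable, user: str) -> int:
    """Application logic: bumps a per-user counter without calling any engine API."""
    key = user.encode("utf-8")
    current = table.get(key)
    count = int(current.decode("ascii")) + 1 if current is not None else 1
    table.put(key, str(count).encode("ascii"))
    return count
```

Because `record_visit` only knows the interface, the same function could run against the in-memory table in a unit test and against a different implementation in a cluster.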


Storage Format Abstraction

- Representation of data in the storage engine
  - Serialization of data types to native storage format
  - Mapping complex types to storage format (ORM)
  - Schema representation
  - Provided partially by some storage engines (SQL)
- Benefits
  - Application is not concerned with serialization/deserialization
  - Schema evolution
  - Enforces correct schema and representation
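A rough sketch of what such a layer does, assuming a made-up schema notation (an ordered list of field name and type pairs) and a simple big-endian byte layout. Neither is the format used by CDAP, Avro or any particular engine, and schema evolution is not covered.

```python
import struct
from typing import Any, Dict, List, Tuple

# Hypothetical schema notation: ordered (field name, type) pairs.
Schema = List[Tuple[str, str]]


def encode(schema: Schema, record: Dict[str, Any]) -> bytes:
    """Serialize a record to bytes in schema order; wrong types raise struct.error."""
    out = bytearray()
    for name, kind in schema:
        value = record[name]
        if kind == "long":
            out += struct.pack(">q", value)
        elif kind == "double":
            out += struct.pack(">d", value)
        elif kind == "string":
            raw = value.encode("utf-8")
            out += struct.pack(">I", len(raw)) + raw  # length prefix, then bytes
        else:
            raise ValueError(f"unsupported type {kind!r} for field {name!r}")
    return bytes(out)


def decode(schema: Schema, data: bytes) -> Dict[str, Any]:
    """Rebuild the record by walking the same schema over the byte array."""
    record: Dict[str, Any] = {}
    offset = 0
    for name, kind in schema:
        if kind == "long":
            (record[name],) = struct.unpack_from(">q", data, offset)
            offset += 8
        elif kind == "double":
            (record[name],) = struct.unpack_from(">d", data, offset)
            offset += 8
        elif kind == "string":
            (length,) = struct.unpack_from(">I", data, offset)
            offset += 4
            record[name] = data[offset:offset + length].decode("utf-8")
            offset += length
        else:
            raise ValueError(f"unsupported type {kind!r} for field {name!r}")
    return record
```

The application passes dictionaries in and gets dictionaries back; only this layer knows how a long or a string is laid out in the byte array.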


Consistency Abstractions

- Strong vs. Eventual Consistency
- Transactional (ACID) consistency
  - Protect data from concurrent modification
  - Isolation / visibility guarantees
  - Optimistic Concurrency Control: Handling conflicts
- Benefits
  - Application code not concerned with consistency
  - “Framework level correctness”
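To make the optimistic concurrency control bullet concrete, here is a simplified Python sketch. `VersionedStore`, `Transaction` and `ConflictError` are illustrative names, not a real transaction system's API, and the sketch only detects write-write conflicts at commit time; snapshot reads and rollback of partial state are left out.

```python
from typing import Dict, Optional, Tuple


class ConflictError(Exception):
    """Raised when another transaction committed a write to the same key first."""


class VersionedStore:
    """Hypothetical store that remembers the commit version of every key."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[int, str]] = {}  # key -> (commit version, value)
        self._version = 0

    def begin(self) -> "Transaction":
        return Transaction(self, self._version)


class Transaction:
    def __init__(self, store: VersionedStore, start_version: int) -> None:
        self._store = store
        self._start = start_version
        self._writes: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        # Own buffered writes win; otherwise return the latest committed value.
        if key in self._writes:
            return self._writes[key]
        entry = self._store._data.get(key)
        return entry[1] if entry is not None else None

    def put(self, key: str, value: str) -> None:
        self._writes[key] = value  # buffered until commit

    def commit(self) -> None:
        # Conflict check: was any key we wrote committed after we started?
        for key in self._writes:
            entry = self._store._data.get(key)
            if entry is not None and entry[0] > self._start:
                raise ConflictError(key)
        self._store._version += 1
        for key, value in self._writes.items():
            self._store._data[key] = (self._store._version, value)
```

No locks are taken while the transaction runs. The cost is paid only when two transactions actually collide, at which point the loser can retry.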


Data Sharing Abstractions

- Sharing/Reusing data across programming paradigms
  - Write with Spark Streaming, query with SQL
  - Share data between batch (MapReduce) and realtime streaming
  - Data as a Service (DaaS)
- Benefits
  - No data silos
  - Less redundancy in data access


Data Access Pattern Abstractions

- Encapsulation of common data access patterns
- Examples:
  - Indexed Table
  - TimeSeries
  - Cube
- Benefits
  - Cleaner application code
  - Enforcement of best practices
  - Avoid data corruption
  - Separation of concerns/responsibilities
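The Indexed Table example can be sketched as follows. This is an in-memory toy with a hypothetical `IndexedTable` class, not the CDAP dataset of the same name; it only shows why encapsulating the pattern helps avoid a stale or corrupt index.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set


class IndexedTable:
    """Hypothetical indexed table: every put keeps the secondary index in step."""

    def __init__(self, index_column: str) -> None:
        self._index_column = index_column
        self._rows: Dict[str, Dict[str, str]] = {}
        self._index: Dict[str, Set[str]] = defaultdict(set)

    def put(self, row_key: str, columns: Dict[str, str]) -> None:
        # Remove the old index entry first so the index never points at stale rows.
        old = self._rows.get(row_key)
        if old is not None and self._index_column in old:
            self._index[old[self._index_column]].discard(row_key)
        self._rows[row_key] = dict(columns)
        if self._index_column in columns:
            self._index[columns[self._index_column]].add(row_key)

    def get(self, row_key: str) -> Optional[Dict[str, str]]:
        return self._rows.get(row_key)

    def read_by_index(self, value: str) -> List[str]:
        """Return the row keys whose indexed column equals the given value."""
        return sorted(self._index.get(value, ()))
```

Application code calls `put` and `read_by_index` and never updates the index by hand, which is where the "enforcement of best practices" benefit comes from.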


Capability Injection

- Framework level Enterprise capabilities
  - Metrics
  - Metadata
  - Lineage, Access Audit Trail, Usage stats
  - Access Control
- Benefits
  - Operational Capabilities solved at the framework level
  - Compliance, Governance
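One way to picture capability injection is a wrapper that the framework places around a dataset before handing it to the program. The sketch below is an assumption about the general shape, with hypothetical names (`PlainTable`, `InstrumentedTable`, `inject`); it is not how CDAP implements metrics or audit.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


class PlainTable:
    """The dataset as the developer wrote it: no operational concerns."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._rows[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._rows.get(key)


class InstrumentedTable:
    """Hypothetical framework wrapper: same put/get surface plus metrics and audit."""

    def __init__(self, delegate: PlainTable, dataset: str, program: str) -> None:
        self._delegate = delegate
        self._dataset = dataset
        self._program = program
        self.metrics: Counter = Counter()
        self.audit: List[Tuple[str, str, str]] = []  # (program, operation, dataset)

    def _record(self, operation: str) -> None:
        self.metrics[f"{self._dataset}.{operation}"] += 1
        self.audit.append((self._program, operation, self._dataset))

    def put(self, key: str, value: str) -> None:
        self._record("write")
        self._delegate.put(key, value)

    def get(self, key: str) -> Optional[str]:
        self._record("read")
        return self._delegate.get(key)


def inject(dataset: str, program: str) -> InstrumentedTable:
    """What a framework might do once at program start."""
    return InstrumentedTable(PlainTable(), dataset, program)
```

The program only ever calls `put` and `get`; the read/write trail that lineage and usage statistics could be derived from is collected without any application code.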


The Cost of Abstraction

First you learn the value of abstraction, then you learn the cost of abstraction, then you're ready to engineer.

Kent Beck


Clean Cut Abstractions

[Figure: onion cut into clean layers. Labels: consistency; storage format; engine; encapsulation; data sharing; capability injection]


Abstractions Gone Wrong


Fried Abstractions


What Makes a Good Abstraction

- Minimal Overhead
  - Injection happens once
  - Not in critical path / inner loop
- Not more code
  - Separation from app code
  - Reusability
- Storage Optimization
  - May not expose all the knobs and dials of the storage engine
  - Allows bypassing the abstraction when necessary
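The overhead and bypass points can be sketched together. In this hypothetical example, `Registry`, `Dataset`, `RawEngine` and the `write_buffer_size` knob are all invented for illustration; the point is only where the lookup happens and how an escape hatch might look.

```python
from typing import Dict, Iterable, Tuple


class RawEngine:
    """Stand-in for an engine client that has an engine-specific tuning knob."""

    def __init__(self) -> None:
        self.rows: Dict[str, str] = {}
        self.write_buffer_size = 1

    def put(self, key: str, value: str) -> None:
        self.rows[key] = value


class Dataset:
    """The abstraction: hides the engine, but offers an explicit escape hatch."""

    def __init__(self, engine: RawEngine) -> None:
        self._engine = engine

    def write(self, key: str, value: str) -> None:
        self._engine.put(key, value)

    def unwrap(self) -> RawEngine:
        return self._engine  # bypass for the rare case the abstraction is too narrow


class Registry:
    """Hypothetical framework registry; counts lookups to make overhead visible."""

    def __init__(self) -> None:
        self.lookups = 0
        self._datasets: Dict[str, Dataset] = {}

    def get_dataset(self, name: str) -> Dataset:
        self.lookups += 1
        if name not in self._datasets:
            self._datasets[name] = Dataset(RawEngine())
        return self._datasets[name]


def ingest(registry: Registry, events: Iterable[Tuple[str, str]]) -> int:
    dataset = registry.get_dataset("events")   # injected once, outside the loop
    dataset.unwrap().write_buffer_size = 1024  # engine knob set through the bypass
    written = 0
    for key, value in events:
        dataset.write(key, value)              # inner loop: one plain method call
        written += 1
    return written
```

However many events are ingested, the registry is consulted once, so the abstraction stays out of the critical path.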


• Application Development and Management

• Provides Data and Programming Abstractions

• Provides Integrations

• Data-As-A-Service

• Empower developers

• Simple Access to Powerful Tech

• WYSIWYG Data Pipelines • Streaming • Batch

• Ingestion, Transformation, Blending (complex joins) and Lookup.

• Machine Learning, Aggregation and Reporting

• Connectors for varied sources and sinks

• Easy way to catalog application and pipeline level metadata

• Search across technical, business and operational metadata

• Track Lineage and Provenance

• Data Quality Measure

• Integration with other MDM systems


Data Abstractions In Practice

- Use Case:
  - Ingest from Twitter into a Dataset
  - Run MapReduce over the Dataset to compute frequent #hashtags
  - Service to retrieve the top #hashtags
  - See the lineage for this Dataset
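The computation in this use case can be approximated in a few lines of plain Python. This is not the demo code: the function names are hypothetical, a list of strings stands in for the ingested Dataset, the map and reduce steps run in-process rather than on Hadoop, and lineage is not modeled.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

HASHTAG = re.compile(r"#(\w+)")


def map_tweet(text: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (hashtag, 1) for every hashtag in one tweet, lowercased."""
    for tag in HASHTAG.findall(text):
        yield tag.lower(), 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Reduce step: sum the counts per hashtag."""
    counts: Counter = Counter()
    for tag, n in pairs:
        counts[tag] += n
    return counts


def top_hashtags(counts: Counter, limit: int) -> List[Tuple[str, int]]:
    """What a retrieval service could return: most frequent first, ties by name."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:limit]


def run(tweets: Iterable[str], limit: int) -> List[Tuple[str, int]]:
    pairs = (pair for text in tweets for pair in map_tweet(text))
    return top_hashtags(reduce_counts(pairs), limit)
```

For example, `run(["Loving #Hadoop and #Spark", "#spark streaming rocks"], 1)` returns `[("spark", 2)]`.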


Demo


Conclusion

Brevity is the soul of wit.

William Shakespeare


Thank you

[email protected]

@CaskData

github.com/caskdata/cdap
github.com/caskdata/hydrator-plugins

Questions?