
Real-Time Big Data Processing with DataTorrent RTS
files.meetup.com/18978602/Pune_Apex_Meetup_Slides_Nov3rd_2015.pdf
Apache Apex Unified Batch and Stream Processing for Big Data Milind


Apache Apex Unified Batch and Stream Processing for Big Data

Milind Barve

Nov. 03, 2015


Project History

• Project development started in 2012 at DataTorrent

• Open-sourced in July 2015

• Apache Apex started incubation in August 2015


Project Status

Mentor List

• Ted Dunning: Apache Member, MapR

• Alan Gates: Apache Member, Hortonworks

• Taylor Goetz: Apache Member, Hortonworks

• Justin Mclean: Apache Member, Class Software

• Chris Nauroth: Apache Member, Hortonworks

• Hitesh Shah: Apache Member, Hortonworks

Apex In Apache Incubation Stage


Apache Apex (Incubating) Committer List

Over 50 committers already… and growing…


What we will serve you today …

– Batch & Streaming: two worlds collide?

– Apex Engine: all the nerdy features

– Questions, you still have some?

– Develop your first app on Apex …


Lambda Architecture

[Figure: Lambda Architecture. Labels: master dataset, Batch Layer, Speed Layer, Serving Layer, batch view, real time views, queries.]


Apex Real-time Unified Architecture

[Figure: Apex Real-time Unified Architecture. Labels: master dataset, Aggregate Layer, Incremental Layer, incremental dataset, aggregate query, Aggregate View.]


Apex Real-time Unified Architecture

[Figure: Apex Real-time Unified Architecture, extended. Labels: master dataset, Aggregate Layer, Incremental Layer, incremental dataset, aggregate query, rolling query, Aggregate View, Incremental View.]


Apex Platform Overview Enterprise Edition


Apache Apex-Malhar


Directed Acyclic Graph (DAG)

Application Programming Model

• A Stream is a sequence of data tuples

• An Operator takes one or more input streams, performs computations & emits one or more output streams

• Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library

• An Operator has many instances that run in parallel, and each instance is single-threaded

• Directed Acyclic Graph (DAG) is made up of operators and streams

[Figure: Operators connected by streams of tuples, ending in an output stream.]

Application Programming Model
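
The programming model above can be sketched in a few lines. This is a plain-Python illustration and not the Apex Java API: the `Operator` class and `run_dag` function are hypothetical names, and the DAG is assumed to be a simple linear chain run on one thread.

```python
from typing import Callable, Iterable, List


class Operator:
    """Hypothetical operator: consumes one tuple, emits zero or more tuples."""

    def __init__(self, name: str, fn: Callable[[object], Iterable[object]]):
        self.name = name
        self.fn = fn

    def process(self, tup: object) -> List[object]:
        return list(self.fn(tup))


def run_dag(operators: List[Operator], source: Iterable[object]) -> List[object]:
    """Push every tuple of the source stream through a linear chain of operators."""
    stream = list(source)
    for op in operators:
        next_stream = []
        for tup in stream:
            next_stream.extend(op.process(tup))
        stream = next_stream
    return stream


# A three-operator chain: split lines into words, lowercase them, drop short words.
splitter = Operator("splitter", lambda line: line.split())
lower = Operator("lower", lambda word: [word.lower()])
keep_long = Operator("keep_long", lambda word: [word] if len(word) > 3 else [])

output = run_dag([splitter, lower, keep_long],
                 ["Apex unifies Batch", "and Stream processing"])
print(output)
```

Each operator only sees the tuples on its input stream, which is what lets a real engine run many instances of it in parallel.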


[Figure: Apex Component Overview. A Hadoop edge node hosts the DT RTS Management Server, reached through the CLI and REST API. Hadoop nodes host YARN containers: one runs the Apex App Master, the others run Streaming Containers in which operators (Op1, Op2, Op3) execute on threads (Thread1 … Thread-N). A legend marks the components that are part of the Community Edition.]

Apex Component Overview


Apex Engine

Core Features


• YARN is the resource manager

• HDFS used for storing any persistent state

Native Hadoop Integration


Partitioning & Scaling built-in

• Operators can be statically/dynamically scaled

• Flexible Streams split

• Parallel partitioning

• MxN partitioning

• Unifiers

Partitioning and Scaling Out
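
As a rough illustration of partitioning with a unifier, the sketch below splits one stream across several operator instances and then merges their partial results. It is plain Python, not Apex code: the function names are hypothetical, and round-robin routing is an assumption chosen to make the unifier's job visible.

```python
from collections import Counter
from typing import Dict, List, Tuple

Event = Tuple[str, int]


def partition_round_robin(events: List[Event], num_partitions: int) -> List[List[Event]]:
    """Split one stream into num_partitions sub-streams (assumed round-robin routing)."""
    partitions: List[List[Event]] = [[] for _ in range(num_partitions)]
    for i, event in enumerate(events):
        partitions[i % num_partitions].append(event)
    return partitions


def partial_sum(partition: List[Event]) -> Counter:
    """Work done by one operator instance: sum values per key for its own sub-stream."""
    totals: Counter = Counter()
    for key, value in partition:
        totals[key] += value
    return totals


def unify(partials: List[Counter]) -> Dict[str, int]:
    """Unifier: merge the partial results of all instances into one result."""
    merged: Counter = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts key by key
    return dict(merged)


events = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
partials = [partial_sum(p) for p in partition_round_robin(events, 2)]
result = unify(partials)
print(result)
```

Because the same key can land on different instances here, the result is only correct after the unifier step has combined the partial sums.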


Advanced Windowing support

• Application window

• Sliding window and tumbling window

• Checkpoint window

• No artificial latency

Advanced Windowing Support
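
The difference between tumbling and sliding windows can be shown with a small sketch. It assumes the input already arrives as a sequence of small fixed-size batches (called `streaming_windows` here, a name chosen for illustration) and that an application window spans `size` of them; neither function is part of the Apex API.

```python
from typing import List


def tumbling(streaming_windows: List[List[int]], size: int) -> List[int]:
    """Sum over non-overlapping groups of `size` consecutive batches."""
    results = []
    for start in range(0, len(streaming_windows) - size + 1, size):
        group = streaming_windows[start:start + size]
        results.append(sum(sum(batch) for batch in group))
    return results


def sliding(streaming_windows: List[List[int]], size: int) -> List[int]:
    """Sum over the last `size` batches, re-evaluated after every new batch."""
    results = []
    for end in range(size, len(streaming_windows) + 1):
        group = streaming_windows[end - size:end]
        results.append(sum(sum(batch) for batch in group))
    return results


batches = [[1, 2], [3], [4, 5], [6]]
print(tumbling(batches, 2))
print(sliding(batches, 2))
```

A tumbling window emits one result per group of batches, while a sliding window emits a result at every batch boundary once enough batches have arrived.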


• Supported out of the box

– Application state

– Application master state

– No data loss

• Automatic recovery

• Lunch test

• Buffer server

Stateful Fault Tolerance
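
The interplay of checkpointing, recovery, and buffered replay can be sketched as follows. This is a simplified single-operator model in plain Python with hypothetical names (`CountingOperator`, `Buffer`, `run_with_failure`); it assumes the buffer keeps every published window so that anything after the last checkpoint can be replayed.

```python
import copy
from typing import Dict, List, Tuple


class CountingOperator:
    """Stateful operator: keeps a running total of everything it has seen."""

    def __init__(self) -> None:
        self.total = 0

    def process_window(self, tuples: List[int]) -> None:
        self.total += sum(tuples)


class Buffer:
    """Holds windows published upstream so they can be replayed after a failure."""

    def __init__(self) -> None:
        self.windows: Dict[int, List[int]] = {}

    def publish(self, window_id: int, tuples: List[int]) -> None:
        self.windows[window_id] = list(tuples)

    def replay_after(self, window_id: int) -> List[Tuple[int, List[int]]]:
        return [(wid, self.windows[wid]) for wid in sorted(self.windows) if wid > window_id]


def run_with_failure(windows: List[List[int]], checkpoint_every: int, fail_after: int) -> int:
    buffer = Buffer()
    op = CountingOperator()
    checkpoint = (0, copy.deepcopy(op))  # (last checkpointed window id, saved state)
    for wid, tuples in enumerate(windows, start=1):
        buffer.publish(wid, tuples)
        op.process_window(tuples)
        if wid % checkpoint_every == 0:
            checkpoint = (wid, copy.deepcopy(op))
        if wid == fail_after:
            # Simulated crash: in-memory state is lost, so restore the last
            # checkpoint and replay the windows that came after it.
            saved_wid, saved_op = checkpoint
            op = copy.deepcopy(saved_op)
            for _, replayed in buffer.replay_after(saved_wid):
                op.process_window(replayed)
    return op.total


total = run_with_failure([[1], [2], [3], [4], [5]], checkpoint_every=2, fail_after=3)
print(total)
```

In this model the failure after window 3 rolls the operator back to the checkpoint taken at window 2, and replaying window 3 from the buffer brings the total back in line with a run that never failed.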


• AT_LEAST_ONCE (default):

– Windows are processed at least once

• AT_MOST_ONCE:

– Windows are processed at most once

• During recovery, all downstream operators are fast-forwarded to the window of the latest checkpoint

• EXACTLY_ONCE:

– Windows are processed exactly once

• Checkpoint every window

• Checkpointing becomes blocking

Processing Semantics
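
One way to picture the three modes is to count how often each window's output reaches downstream when a failure hits. The sketch below is a toy model under stated assumptions, not Apex's recovery logic: the failure strikes while window `fail_during` is in flight, at-least-once rolls back to the last checkpoint and replays, at-most-once skips the in-flight window, and exactly-once checkpoints every window. The `simulate` function and its parameters are hypothetical.

```python
from typing import Dict


def simulate(mode: str, num_windows: int, checkpoint_every: int, fail_during: int) -> Dict[int, int]:
    """Return how many times each window's output is emitted downstream."""
    if mode == "EXACTLY_ONCE":
        checkpoint_every = 1  # checkpoint every window, as listed on the slide
    emitted = {w: 0 for w in range(1, num_windows + 1)}
    last_checkpoint = 0
    window = 1
    failed = False
    while window <= num_windows:
        if window == fail_during and not failed:
            failed = True
            if mode == "AT_MOST_ONCE":
                window += 1  # assumed: skip the in-flight window, never replay it
            else:
                window = last_checkpoint + 1  # assumed: roll back and replay
            continue
        emitted[window] += 1
        if window % checkpoint_every == 0:
            last_checkpoint = window
        window += 1
    return emitted


for mode in ("AT_LEAST_ONCE", "AT_MOST_ONCE", "EXACTLY_ONCE"):
    print(mode, simulate(mode, num_windows=6, checkpoint_every=3, fail_during=5))
```

In this model at-least-once emits window 4 twice, at-most-once never emits window 5, and exactly-once emits every window once at the price of a checkpoint per window.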


Data locality

• Stream locality for placement of operators

– Rack local – Distributed deployment

– Node local – Data does not traverse NIC

– Container local – Data doesn’t need to be serialized

– Thread local – Operators run in same thread

Compute Locality
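
The four locality levels can be read as a list of costs that a stream stops paying as its operators move closer together. The sketch below encodes that reading in plain Python. The `Locality` enum and `transfer_cost` function are illustrative names, and it assumes each tighter level keeps the savings of the looser ones.

```python
from enum import Enum
from typing import Dict


class Locality(Enum):
    RACK_LOCAL = "rack"
    NODE_LOCAL = "node"
    CONTAINER_LOCAL = "container"
    THREAD_LOCAL = "thread"


def transfer_cost(locality: Locality) -> Dict[str, bool]:
    """Which costs a stream still pays at a given locality level (assumed model)."""
    return {
        # Only rack-local streams leave the machine and cross the network card.
        "crosses_nic": locality is Locality.RACK_LOCAL,
        # Data crossing a container boundary still has to be serialized.
        "serialized": locality in (Locality.RACK_LOCAL, Locality.NODE_LOCAL),
        # Anything short of thread-local hands tuples from one thread to another.
        "thread_handoff": locality is not Locality.THREAD_LOCAL,
    }


for level in Locality:
    print(level.name, transfer_cost(level))
```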


• Dynamic topology updates

– Properties of operators can be changed

– New operators

• Upcoming

– Update attributes

Dynamic Updates



For more Info …

• Mailing List: [email protected]

• Apache Apex: http://apex.apache.org/

• GitHub

– Apex Core: http://github.com/apache/incubator-apex-core

– Apex Malhar: http://github.com/apache/incubator-apex-malhar

• DataTorrent: http://www.datatorrent.com


Thank You

Please send your questions to [email protected]