Big Data Open Source Software and Projects ABDS in Summary XIX: Layer 14B Data Science Curriculum March 1 2015 Geoffrey Fox [email protected]

Big Data Open Source Software and Projects

ABDS in Summary XIX: Layer 14B Data Science Curriculum

March 1 2015

Geoffrey Fox [email protected] http://www.infomall.org

School of Informatics and Computing, Digital Science Center

Indiana University Bloomington


Functionality of 21 HPC-ABDS Layers

1) Message Protocols
2) Distributed Coordination
3) Security & Privacy
4) Monitoring
5) IaaS Management from HPC to hypervisors
6) DevOps
7) Interoperability
8) File systems
9) Cluster Resource Management
10) Data Transport
11) A) File management
    B) NoSQL
    C) SQL
12) In-memory databases & caches / Object-relational mapping / Extraction Tools
13) Inter-process communication: collectives, point-to-point, publish-subscribe, MPI
14) A) Basic Programming model and runtime, SPMD, MapReduce
    B) Streaming
15) A) High level Programming
    B) Application Hosting Frameworks
16) Application and Analytics
17) Workflow-Orchestration

There are 21 functionalities in total (counting the subparts of layers 11, 14 and 15):
• 4 cross-cutting layers at the top
• 17 in the order of the layered diagram, starting at the bottom

Apache Storm

• https://storm.incubator.apache.org/
• Apache Storm is a distributed real-time computation framework for processing streaming data.
• Storm is used for real-time analytics, online machine learning, distributed RPC, etc.
• Provides scalable, fault-tolerant and guaranteed message processing.
• Trident is a high-level API on top of Storm which provides functions like stream joins, groupings, filters, etc. Trident also has exactly-once processing guarantees.
• The project was originally developed at BackType and then at Twitter (which acquired BackType) for processing Tweets from users, and was donated to the ASF in 2013.
• Storm has been used in very large deployments in Fortune 500 companies like Twitter and Yahoo.
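
The exactly-once guarantee mentioned for Trident can be illustrated with a minimal, framework-free Python sketch. It assumes that tuples arrive in batches tagged with a transaction id, that batches are applied in order, and that a replayed batch carries the same contents as the original. The state store remembers the id of the last batch applied to each key and skips a replay. The names `TransactionalCounter` and `apply_batch` are hypothetical and are not part of the Storm or Trident API.

```python
from collections import Counter


class TransactionalCounter:
    """Word counter whose updates are idempotent per batch transaction id."""

    def __init__(self):
        self.counts = {}      # word -> running count
        self.last_txid = {}   # word -> txid of the last batch applied to it

    def apply_batch(self, txid, words):
        """Apply one batch; a replay of an already-applied txid changes nothing."""
        for word, n in Counter(words).items():
            if self.last_txid.get(word) == txid:
                continue  # this batch was already counted for this word
            self.counts[word] = self.counts.get(word, 0) + n
            self.last_txid[word] = txid


state = TransactionalCounter()
state.apply_batch(1, ["storm", "samza", "storm"])
state.apply_batch(2, ["storm"])
state.apply_batch(2, ["storm"])  # replay after a simulated failure
print(state.counts)  # {'storm': 3, 'samza': 1}
```

Without the stored transaction id, the replayed batch would be counted twice, which corresponds to at-least-once rather than exactly-once processing.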

Apache Samza (LinkedIn)

• http://samza.incubator.apache.org/
• Similar to Apache Storm, Apache Samza is a distributed real-time computation framework for processing streaming data.
• Apache Samza is built on top of Apache Kafka and Apache YARN. Samza uses Kafka as its messaging layer and YARN for managing the cluster of nodes running Samza processes.
• Samza is scalable, fault tolerant and provides guaranteed message processing.
• Samza was originally developed at LinkedIn and was donated to the ASF in 2013.


Apache S4

• http://incubator.apache.org/s4/
• Apache S4 is a distributed real-time computation framework for processing unbounded streams of data.
• Unlike Storm and Samza, S4 provides a key-value based system for processing data.
• The system is scalable, fault tolerant and provides guaranteed message processing.
• S4 was originally developed at Yahoo and was donated to the ASF in 2011.
• S4 is not as popular as Apache Storm.


Granules

• http://granules.cs.colostate.edu/
• Builds on NaradaBrokering (Layer 13); started at Indiana University (now at Colorado State), led by Shrideep Pallickara.
• Supports long-running, stateful, iterative computations: science enhancements to MapReduce with data streaming in.
• Runs on HPC, clouds or distributed resources with C, C++, C#, Java, R and Python.


Databus (LinkedIn)

• Databus http://data.linkedin.com/projects/databus (LinkedIn released the source code in 2013)
• Databus provides a timeline-consistent stream of change capture events for a database.

It enables applications to watch a database, and to view and process updates in near real time.
• Databus provides a complete after-image of every new or changed record, as well as deletes, while maintaining timeline consistency and transactional boundaries.
• The application integration is decoupled from the source database, and each application integration is isolated, which allows for parallel development and rapid innovation.
• Databus has a few key parts:
– a database connector to watch changes and maintain a clock or sequence value
– an in-memory relay that keeps recent changes for efficient retrieval
– a bootstrap service/database that enables long lookback queries (including from the beginning of time)
– a client that provides a simple API to get changes since a point in time
• To use Databus, the consuming application simply maintains a high watermark and periodically requests all changes since that point in time using the Databus client. Each consuming application maintains its own high watermark, which isolates the applications from one another.
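
The high-watermark pattern described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the Databus client API: `Relay`, `Consumer`, `capture`, `changes_since` and `poll` are hypothetical names, the relay is a simple in-memory list, and a delete is represented by an after-image of `None`.

```python
class Relay:
    """In-memory log of change events, each tagged with a sequence number."""

    def __init__(self):
        self._events = []   # list of (sequence, key, after_image)
        self._next_seq = 1

    def capture(self, key, after_image):
        """Record a change; after_image is None for a delete."""
        self._events.append((self._next_seq, key, after_image))
        self._next_seq += 1

    def changes_since(self, watermark):
        """Return all events with a sequence number above the watermark."""
        return [e for e in self._events if e[0] > watermark]


class Consumer:
    """Keeps its own high watermark, independent of other consumers."""

    def __init__(self, relay):
        self.relay = relay
        self.watermark = 0
        self.view = {}      # local copy of the watched table

    def poll(self):
        events = self.relay.changes_since(self.watermark)
        for seq, key, after_image in events:
            if after_image is None:
                self.view.pop(key, None)
            else:
                self.view[key] = after_image
            self.watermark = seq  # advance only after the event is applied
        return len(events)


relay = Relay()
fast, slow = Consumer(relay), Consumer(relay)
relay.capture("user:1", {"name": "Ada"})
fast.poll()
relay.capture("user:1", {"name": "Ada L."})
relay.capture("user:2", {"name": "Bob"})
relay.capture("user:2", None)  # delete
fast.poll()
slow.poll()
print(fast.watermark, slow.watermark, fast.view == slow.view)  # 4 4 True
```

The two consumers poll at different rates yet converge on the same view, because each one tracks only its own position in the change stream.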


Google MillWheel

• http://research.google.com/pubs/pub41378.html
• MillWheel is a distributed real-time computation framework by Google.
• Provides scalable, fault-tolerant and exactly-once message processing guarantees.
• The key data abstraction of MillWheel is the key-value pair; data is processed in a directed acyclic graph whose nodes are the computations.
• The project is not open source and is planned to be made available to the general public through Google Cloud Platform as a SaaS.
• Similar functionality to Apache Storm.
• Part of Google Cloud Dataflow http://googlecloudplatform.blogspot.com/2014/06/sneak-peek-google-cloud-dataflow-a-cloud-native-data-processing-service.html that also has Google Pub-Sub and FlumeJava.
• See Amazon Kinesis http://aws.amazon.com/kinesis/ which combines Pub-Sub and Apache Storm capabilities.
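
The idea of key-value records flowing through a directed acyclic graph of computations can be sketched as follows. This is a single-process toy, assuming synchronous delivery and in-memory per-key state; the `Computation` class and the three node functions are hypothetical and do not reflect the actual MillWheel interface, which is not public.

```python
from collections import defaultdict


class Computation:
    """A DAG node: applies fn to each (key, value) record with per-key state."""

    def __init__(self, fn):
        self.fn = fn
        self.state = defaultdict(int)   # state of this node, one slot per key
        self.downstream = []

    def connect(self, other):
        """Add an edge to another node and return it, so calls can be chained."""
        self.downstream.append(other)
        return other

    def process(self, key, value):
        for out_key, out_value in self.fn(self.state, key, value):
            for node in self.downstream:
                node.process(out_key, out_value)


def split_words(state, key, line):
    # Re-key each record by word so the next node groups by word.
    for word in line.split():
        yield word, 1


def count(state, key, value):
    state[key] += value
    yield key, state[key]


results = {}


def sink(state, key, value):
    results[key] = value  # keep the latest count seen for each word
    return []             # a sink emits nothing further


source = Computation(split_words)
source.connect(Computation(count)).connect(Computation(sink))
source.process("line-1", "to be or not to be")
print(results)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Keying every record is what lets a real system partition the per-key state across machines; the sketch keeps everything in one process for clarity.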


Facebook Puma/Ptail/Scribe/ODS

• The Facebook Insights tool gives content developers an interactive web portal that presents business analytics related to their Social plug-ins with only 30 seconds of latency, handling 20 billion events a day (2012). It uses the 4 subsystems below.
• Scribe: aggregates streaming log data.
– https://github.com/facebookarchive/scribe/wiki
• ODS: a real-time monitoring system built on HBase that stores data produced by Scribe.
– http://cdn.oreillystatic.com/en/assets/1/event/85/Facebook%E2%80%99s%20Large%20Scale%20Monitoring%20System%20Built%20on%20HBase%20Presentation.pdf
– http://cloud.pubs.dbs.uni-leipzig.de/sites/cloud.pubs.dbs.uni-leipzig.de/files/RealtimeHadoopSigmod2011.pdf
– Time series data for real-time monitoring and trends: collects metrics from each server, aggregates them in useful ways, and detects and alerts on anomalies.
• Puma: a real-time stream processing system that batches data in memory.
– http://www.slideshare.net/cloudera/building-realtime-big-data-services-at-facebook-with-hadoop-and-hbase-jonathan-gray-facebook
– http://www.cs.duke.edu/~kmoses/cps516/puma.html
• Data is read from the log files using Ptail, an internal tool built to aggregate data from multiple Scribe stores. It tails the log files (fetches the most recently written data) and pulls the data out.


Azure Stream Analytics I

• Microsoft Azure has several streaming solutions.
• Stream Analytics is a SQL-based implementation for querying streaming data. The inputs to Stream Analytics come from Event Hubs, a service bus/message broker type of offering to which users can send their events and from which they can subscribe to events.
– Service Bus is used as a publish-subscribe broker in Event Hubs.
– Stream Analytics can subscribe to an Event Hub to receive the event streams.
– Microsoft has extended the SQL language to support stream querying.
– http://azure.microsoft.com/en-us/documentation/articles/stream-analytics-get-started/
– http://azure.microsoft.com/en-us/documentation/articles/stream-analytics-real-time-event-processing-reference-architecture/
– http://azure.microsoft.com/en-us/documentation/services/service-bus/
• Azure also offers Storm as a dedicated service.
– http://azure.microsoft.com/en-us/documentation/articles/hdinsight-storm-overview/
• Language support: Storm offers a more diversified set of languages, whereas Azure Stream Analytics supports only a SQL language very similar to that provided with SQL Server.
• Deployment model: Storm runs on dedicated HDInsight clusters, whereas Azure Stream Analytics has built-in multi-tenancy support.
• Data interface: Azure Stream Analytics has first-party click-and-configure support for Event Hubs, Azure Blob Storage and Azure SQL Database. Storm has ingestion from Azure Event Hubs, Azure Service Bus and Apache Kafka, amongst others, as well as data egress to Apache Cassandra, HDFS and Azure SQL Database.
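
Stream querying in a SQL dialect typically means grouping events by time window as well as by column. As a rough illustration of what such a windowed query computes, the Python sketch below counts events per device over fixed, non-overlapping (tumbling) windows. The function name, the event layout and the sample data are assumptions made for this example; this is not Stream Analytics code.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window start, device) over fixed, non-overlapping windows.

    events: iterable of (timestamp_seconds, device_id) pairs.
    """
    counts = Counter()
    for timestamp, device in events:
        # Every timestamp belongs to exactly one window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, device)] += 1
    return dict(counts)


events = [(0, "a"), (3, "a"), (9, "b"), (10, "a"), (14, "b"), (19, "b")]
for (start, device), n in sorted(tumbling_window_counts(events, 10).items()):
    print(f"[{start}, {start + 10}) {device}: {n}")
```

A streaming engine produces the same groups incrementally, emitting each window's result once the window closes instead of waiting for all of the data.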


Azure Stream Analytics II

• http://download.microsoft.com/download/6/2/3/623924DE-B083-4561-9624-C1AB62B5F82B/real-time-event-processing-with-microsoft-azure-stream-analytics.pdf
