The Hadoop Ecosystem Table (https://hadoopecosystemtable.github.io/), captured 6/16/2017
The Hadoop Ecosystem Table

This page is a summary to keep track of Hadoop-related projects, focused on the FLOSS environment.
Distributed Filesystem
Apache HDFS
The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. Hadoop and HDFS were derived from the Google File System (GFS) paper. Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. With ZooKeeper, the HDFS High Availability feature addresses this problem by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby.
1. hadoop.apache.org 2. Google File System (GFS) paper 3. Cloudera: Why HDFS 4. Hortonworks: Why HDFS
Red Hat GlusterFS
GlusterFS is a scale-out network-attached storage file system. GlusterFS was developed originally by Gluster, Inc., then by Red Hat, Inc., after their purchase of Gluster in 2011. In June 2012, Red Hat Storage Server was announced as a commercially supported integration of GlusterFS with Red Hat Enterprise Linux. The Gluster File System is now known as Red Hat Storage Server.
1. www.gluster.org 2. Red Hat Hadoop Plugin
Quantcast File System (QFS)
QFS is an open-source distributed file system software package for large-scale MapReduce or other batch-processing workloads. It was designed as an alternative to Apache Hadoop's HDFS, intended to deliver better performance and cost-efficiency for large-scale processing clusters. It is written in C++ and has fixed-footprint memory management. QFS uses Reed-Solomon error correction to assure reliable access to data. Reed-Solomon coding is widely used in mass storage systems to correct the burst errors associated with media defects. Rather than storing three full copies of each file like HDFS, resulting in the need for three times more storage, QFS only needs 1.5x the raw capacity because it stripes data across nine different disk drives.
1. QFS site 2. GitHub QFS 3. HADOOP-8885
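The 3x versus 1.5x capacity claim above is simple arithmetic over the two redundancy schemes. A minimal sketch, assuming QFS stripes each block as 6 data stripes plus 3 parity stripes across its nine drives (the exact split is an assumption here; the 9-drive striping is from the text):

```python
# Storage overhead: full replication (HDFS default, 3 copies) versus
# Reed-Solomon striping (QFS). Raw bytes stored per logical byte.
def replication_overhead(copies):
    """Under full replication, every logical byte is stored `copies` times."""
    return float(copies)

def reed_solomon_overhead(data_stripes, parity_stripes):
    """Under RS(data, parity) striping, overhead is (data+parity)/data."""
    return (data_stripes + parity_stripes) / data_stripes

hdfs = replication_overhead(3)      # 3.0x raw capacity
qfs = reed_solomon_overhead(6, 3)   # (6 + 3) / 6 = 1.5x raw capacity
print(hdfs, qfs)  # 3.0 1.5
```

The trade-off: erasure coding halves the raw storage bill but makes reconstruction after a disk failure more compute- and network-intensive than simply reading a surviving replica.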
Ceph Filesystem
Ceph is a free-software storage platform designed to present object, block, and file storage from a single distributed computer cluster. Ceph's main goals are to be completely distributed without a single point of failure, scalable to the exabyte level, and freely available. The data is replicated, making it fault tolerant.
1. Ceph Filesystem site 2. Ceph and Hadoop 3. HADOOP-6253
Lustre file system
The Lustre filesystem is a high-performance distributed filesystem intended for larger network and high-availability environments. Traditionally, Lustre is configured to manage remote data storage disk devices within a Storage Area Network (SAN), which is two or more remotely attached disk devices communicating via a Small Computer System Interface (SCSI) protocol. This includes Fibre Channel, Fibre Channel over Ethernet (FCoE), Serial Attached SCSI (SAS) and even iSCSI.
With Hadoop HDFS, the software needs a dedicated cluster of computers on which to run. But people who run high-performance computing clusters for other purposes often don't run HDFS, which leaves them with a lot of computing power, tasks that could almost certainly benefit from a bit of MapReduce, and no way to put that power to work running Hadoop. Intel noticed this and, in version 2.5 of its Hadoop distribution, added support for Lustre: the Intel HPC Distribution for Apache Hadoop Software, a new product that combines Intel Distribution for Apache Hadoop software with Intel Enterprise Edition for Lustre software. This is the only distribution of Apache Hadoop that is integrated with Lustre, the parallel file system used by many of the world's fastest supercomputers.
1. wiki.lustre.org/ 2. Hadoop with Lustre 3. Intel HPC Hadoop
Alluxio
Alluxio, the world's first memory-centric virtual distributed storage system, unifies data access and bridges computation frameworks and underlying storage systems. Applications only need to connect with Alluxio to access data stored in any underlying storage system. Additionally, Alluxio's memory-centric architecture enables data access orders of magnitude faster than existing solutions.
In the big data ecosystem, Alluxio lies between computation frameworks or jobs, such as Apache Spark, Apache MapReduce, or Apache Flink, and various kinds of storage systems, such as Amazon S3, OpenStack Swift, GlusterFS, HDFS, Ceph, or OSS. Alluxio brings significant performance improvement to the stack; for example, Baidu uses Alluxio to improve their data analytics performance by 30 times. Beyond performance, Alluxio bridges new workloads with data stored in traditional storage systems. Users can run Alluxio in its standalone cluster mode, for example on Amazon EC2, or launch Alluxio with Apache Mesos or Apache YARN.
Alluxio is Hadoop compatible. This means that existing Spark and MapReduce programs can run on top of Alluxio without any code changes. The project is open source (Apache License 2.0) and is deployed at multiple companies. It is one of the fastest-growing open source projects. With less than three years of open source history, Alluxio has attracted more than 160 contributors from over 50 institutions, including Alibaba, Alluxio, Baidu, CMU, IBM, Intel, NJU, Red Hat, UC Berkeley, and Yahoo. The project is the storage layer of the Berkeley Data Analytics Stack (BDAS) and also part of the Fedora distribution.
1. Alluxio site
GridGain
GridGain is an open source project licensed under Apache 2.0. One of the main pieces of this platform is the In-Memory Apache Hadoop Accelerator, which aims to accelerate HDFS and MapReduce by bringing both data and computations into memory. This work is done with GGFS, a Hadoop-compliant in-memory file system. For I/O-intensive jobs, GridGain GGFS offers performance close to 100x faster than standard HDFS. Paraphrasing Dmitriy Setrakyan from GridGain Systems talking about GGFS with regard to Tachyon:
GGFS allows read-through and write-through to/from an underlying HDFS or any other Hadoop-compliant file system with zero code change. Essentially, GGFS entirely removes the ETL step from integration.
GGFS has the ability to pick and choose which folders stay in memory, which folders stay on disk, and which folders get synchronized with the underlying (HD)FS, either synchronously or asynchronously.
GridGain is working on adding a native MapReduce component which will provide complete native Hadoop integration without changes in the API, unlike what Spark currently forces you to do. Essentially, GridGain MR+GGFS will allow bringing Hadoop completely or partially in-memory in a plug-and-play fashion without any API changes.
1. GridGain site
XtreemFS
XtreemFS is a general-purpose storage system and covers most storage needs in a single deployment. It is open source, requires no special hardware or kernel modules, and can be mounted on Linux, Windows and OS X. XtreemFS runs distributed and offers resilience through replication. XtreemFS volumes can be accessed through a FUSE component that offers normal file interaction with POSIX-like semantics. Furthermore, an implementation of Hadoop's FileSystem interface is included, which makes XtreemFS available for use with Hadoop, Flink and Spark out of the box. XtreemFS is licensed under the New BSD license. The XtreemFS project is developed by Zuse Institute Berlin. The development of the project has been funded by the European Commission since 2006 under Grant Agreements No. FP6-033576, FP7-ICT-257438, and FP7-318521, as well as the German projects MoSGrid, "First We Take Berlin", FFMK, GeoMultiSens, and BBDC.
1. XtreemFS site 2. Flink on XtreemFS 3. Spark on XtreemFS
Distributed Programming
Apache Ignite
Apache Ignite In-Memory Data Fabric is a distributed in-memory platform for computing and transacting on large-scale data sets in real time. It includes a distributed key-value in-memory store, SQL capabilities, map-reduce and other computations, distributed data structures, continuous queries, messaging and events subsystems, and Hadoop and Spark integration. Ignite is built in Java and provides .NET and C++ APIs.
1. Apache Ignite 2. Apache Ignite documentation
Apache MapReduce
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Apache MapReduce was derived from the Google paper "MapReduce: Simplified Data Processing on Large Clusters". The current Apache MapReduce version is built on the Apache YARN framework. YARN stands for "Yet-Another-Resource-Negotiator". It is a new framework that facilitates writing arbitrary distributed processing frameworks and applications. YARN's execution model is more generic than the earlier MapReduce implementation. YARN can run applications that do not follow the MapReduce model, unlike the original Apache Hadoop MapReduce (also called MR1). Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data processing.
1. Apache MapReduce 2. Google MapReduce paper 3. Writing YARN applications
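The programming model itself fits in a few lines: map emits (key, value) pairs, a shuffle groups them by key, and reduce folds each group. A single-process toy sketch of that model (Hadoop distributes these same phases across a cluster; none of this is Hadoop's actual API):

```python
# Toy word count in the MapReduce style: map -> shuffle -> reduce.
from collections import defaultdict

def map_phase(record):
    """Emit a (word, 1) pair for every word in the input record."""
    for word in record.split():
        yield (word, 1)

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Fold each group of values into a single result per key."""
    return (key, sum(values))

records = ["the quick brown fox", "the lazy dog"]
pairs = [p for r in records for p in map_phase(r)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # 2
```

Because map and reduce are pure functions over independent records and groups, the framework is free to run many instances of each in parallel on different machines, which is the whole point of the model.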
Apache Pig
Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data. Pig runs on Hadoop. It makes use of both the Hadoop Distributed File System, HDFS, and Hadoop's processing system, MapReduce.
Pig uses MapReduce to execute all of its data processing. It compiles the Pig Latin scripts that users write into a series of one or more MapReduce jobs that it then executes. Pig Latin looks different from many of the programming languages you have seen. There are no if statements or for loops in Pig Latin. This is because traditional procedural and object-oriented programming languages describe control flow, and data flow is a side effect of the program. Pig Latin instead focuses on data flow.
1. pig.apache.org/ 2. Pig examples by Alan Gates
JAQL
JAQL is a functional, declarative programming language designed especially for working with large volumes of structured, semi-structured and unstructured data. As its name implies, a primary use of JAQL is to handle data stored as JSON documents, but JAQL can work on various types of data. For example, it can support XML, comma-separated values (CSV) data and flat files. A "SQL within JAQL" capability lets programmers work with structured SQL data while employing a JSON data model that's less restrictive than its Structured Query Language counterparts.
Specifically, Jaql allows you to select, join, group, and filter data that is stored in HDFS, much like a blend of Pig and Hive. Jaql's query language was inspired by many programming and query languages, including Lisp, SQL, XQuery, and Pig.
JAQL was created by workers at IBM Research Labs in 2008 and released to open source. While it continues to be hosted as a project on Google Code, where a downloadable version is available under an Apache 2.0 license, the major development activity around JAQL has remained centered at IBM. The company offers the query language as part of the tools suite associated with InfoSphere BigInsights, its Hadoop platform. Working together with a workflow orchestrator, JAQL is used in BigInsights to exchange data between storage, processing and analytics jobs. It also provides links to external data and services, including relational databases and machine learning data.
1. JAQL in Google Code 2. What is Jaql? by IBM
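To make the select/filter/group idea concrete without introducing JAQL syntax itself, here is a rough stdlib analogue of the kind of query JAQL expresses over JSON records (the record shape and field names are invented for illustration):

```python
# Filter and group-by-aggregate over JSON documents, the operations the
# JAQL description above refers to. This is plain Python, not JAQL.
import json
from collections import Counter

raw = '[{"user": "a", "bytes": 120}, {"user": "b", "bytes": 5}, {"user": "a", "bytes": 80}]'
docs = json.loads(raw)

# filter: keep only records above a size threshold
big = [d for d in docs if d["bytes"] >= 50]

# group + aggregate: total bytes per user among the kept records
totals = Counter()
for d in big:
    totals[d["user"]] += d["bytes"]

print(dict(totals))  # {'a': 200}
```

A declarative language like JAQL lets the engine reorder and distribute these steps; the Python version fixes the execution order by hand.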
Apache Spark
A data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark provides an easier-to-use alternative to Hadoop MapReduce and offers performance up to 10 times faster than previous-generation systems like Hadoop MapReduce for certain applications.
Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does, but with a fast in-memory approach and a clean functional-style API. With its ability to integrate with Hadoop and built-in tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel), and real-time analysis (Spark Streaming), it can be used interactively to quickly process and query big data sets.
To make programming faster, Spark provides clean, concise APIs in Scala, Java and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets. Spark is also the engine behind Shark, a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive.
1. Apache Spark 2. Mirror of Spark on Github 3. RDDs - Paper 4. Spark: Cluster Computing... - Paper 5. Spark Research
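The "clean functional-style API" above centers on chaining lazy transformations that only run when an action is called. A minimal pure-Python imitation of that pattern using generators (deliberately not the real Spark/PySpark API; the class name is invented):

```python
# ToyRDD mimics Spark's lazy transformation chaining: map/filter build up
# a pipeline, and nothing executes until the collect() "action".
class ToyRDD:
    def __init__(self, data):
        self._data = data  # any iterable; nothing is computed yet

    def map(self, fn):
        return ToyRDD(fn(x) for x in self._data)

    def filter(self, pred):
        return ToyRDD(x for x in self._data if pred(x))

    def collect(self):
        # The action: forces the whole lazy pipeline to run.
        return list(self._data)

result = (ToyRDD(range(10))
          .map(lambda x: x * x)
          .filter(lambda x: x % 2 == 0)
          .collect())
print(result)  # [0, 4, 16, 36, 64]
```

Laziness is what lets the real engine see the whole pipeline before running it, so it can fuse stages, keep intermediate data in memory, and recompute lost partitions from lineage instead of checkpointing everything.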
Apache Storm
Storm is a complex event processor (CEP) and distributed computation framework written predominantly in the Clojure programming language. It is a distributed real-time computation system for processing fast, large streams of data. Storm is an architecture based on the master-worker paradigm, so a Storm cluster mainly consists of master and worker nodes, with coordination done by Zookeeper.
Storm makes use of ZeroMQ (0mq, zeromq), an advanced, embeddable networking library. It provides a message queue, but unlike message-oriented middleware (MOM), a 0MQ system can run without a dedicated message broker. The library is designed to have a familiar socket-style API.
Originally created by Nathan Marz and team at BackType, the project was open sourced after being acquired by Twitter. Storm was initially developed and deployed at BackType in 2011. After 7 months of development, BackType was acquired by Twitter in July 2011. Storm was open sourced in September 2011. Hortonworks is developing a Storm-on-YARN version and plans to finish the base-level integration in 2013 Q4. This is the plan from Hortonworks. Yahoo/Hortonworks also plans to move Storm-on-YARN code from github.com/yahoo/storm-yarn to be a subproject of the Apache Storm project in the near future.
Twitter has recently released a Hadoop-Storm hybrid called "Summingbird". Summingbird fuses the two frameworks into one, allowing developers to use Storm for short-term processing and Hadoop for deep data dives: a system that aims to mitigate the tradeoffs between batch processing and stream processing by combining them into a hybrid system.
1. Storm Project 2. Storm-on-YARN
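Storm topologies wire spouts (stream sources) to bolts (processing steps). A synchronous toy version of that dataflow shape, with invented function names rather than Storm's actual API (real Storm runs each component as parallel tasks across worker nodes and streams tuples between them):

```python
# Spout -> bolt -> bolt pipeline, collapsed into ordinary generators.
def sentence_spout():
    """Source: emits a stream of sentences."""
    yield "storm processes streams"
    yield "streams of tuples"

def split_bolt(sentences):
    """First processing step: split each sentence into word tuples."""
    for s in sentences:
        for word in s.split():
            yield word

def count_bolt(words):
    """Second processing step: maintain running counts per word."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

counts = count_bolt(split_bolt(sentence_spout()))
print(counts["streams"])  # 2
```

In the real system each stage is an independently scalable set of tasks, and the grouping between stages (e.g. hashing words to the same counting task) is what makes stateful bolts like the counter correct under parallelism.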
Apache Flink
Apache Flink (formerly called Stratosphere) features powerful programming abstractions in Java and Scala, a high-performance runtime, and automatic program optimization. It has native support for iterations, incremental iterations, and programs consisting of large DAGs of operations.
Flink is a data processing system and an alternative to Hadoop's MapReduce component. It comes with its own runtime rather than building on top of MapReduce. As such, it can work completely independently of the Hadoop ecosystem. However, Flink can also access Hadoop's distributed file system (HDFS) to read and write data, and Hadoop's next-generation resource manager (YARN) to provision cluster resources. Since most Flink users store their data in Hadoop HDFS, Flink already ships the required libraries to access HDFS.
1. Apache Flink incubator page 2. Stratosphere site
Apache Apex
Apache Apex is an enterprise-grade, Apache YARN-based, big-data-in-motion platform that unifies stream processing as well as batch processing. It processes big data in motion in a highly scalable, highly performant, fault-tolerant, stateful, secure, distributed, and easily operable way. It provides a simple API that enables users to write or re-use generic Java code, thereby lowering the expertise needed to write big data applications.
The Apache Apex platform is supplemented by Apache Apex-Malhar, which is a library of operators that implement common business logic functions needed by customers who want to quickly develop applications. These operators provide access to HDFS, S3, NFS, FTP, and other file systems; Kafka, ActiveMQ, RabbitMQ, JMS, and other message systems; and MySQL, Cassandra, MongoDB, Redis, HBase, CouchDB and other databases along with JDBC connectors. The library also includes a host of other common business logic patterns that help users significantly reduce the time it takes to go into production. Ease of integration with all other big data technologies is one of the primary missions of Apache Apex-Malhar.
Apex, available on GitHub, is the core technology upon which DataTorrent's commercial offering, DataTorrent RTS 3, along with other technology such as a data ingestion tool called dtIngest, is based.
1. Apache Apex from DataTorrent 2. Apache Apex main page 3. Apache Apex Proposal
Netflix PigPen
PigPen is map-reduce for Clojure which compiles to Apache Pig. Clojure is a dialect of the Lisp programming language created by Rich Hickey, so it is a functional general-purpose language that runs on the Java Virtual Machine, the Common Language Runtime, and JavaScript engines. In PigPen there are no special user-defined functions (UDFs). Define Clojure functions, anonymously or named, and use them like you would in any Clojure program. This tool is open sourced by Netflix, Inc., the American provider of on-demand Internet streaming media.
1. PigPen on GitHub
AMPLab SIMR
Apache Spark was developed with Apache YARN in mind. However, up to now it has been relatively hard to run Apache Spark on Hadoop MapReduce v1 clusters, i.e. clusters that do not have YARN installed. Typically, users would have to get permission to install Spark/Scala on some subset of the machines, a process that could be time consuming. SIMR allows anyone with access to a Hadoop MapReduce v1 cluster to run Spark out of the box. A user can run Spark directly on top of Hadoop MapReduce v1 without any administrative rights, and without having Spark or Scala installed on any of the nodes.
1. SIMR on GitHub
Facebook Corona
"The next version of Map-Reduce" from Facebook, based on its own fork of Hadoop. The current Hadoop implementation of the MapReduce technique uses a single job tracker, which causes scaling issues for very large data sets. The Apache Hadoop developers have been creating their own next-generation MapReduce, called YARN, which Facebook engineers looked at but discounted because of the highly customised nature of the company's deployment of Hadoop and HDFS. Corona, like YARN, spawns multiple job trackers (one for each job, in Corona's case).
1. Corona on Github
Apache REEF
Apache REEF™ (Retainable Evaluator Execution Framework) is a library for developing portable applications for cluster resource managers such as Apache Hadoop™ YARN or Apache Mesos™. Apache REEF drastically simplifies development on those resource managers through the following features:
Centralized control flow: Apache REEF turns the chaos of a distributed application into events in a single machine, the Job Driver. Events include container allocation, Task launch, completion and failure. For failures, Apache REEF makes every effort to make the actual `Exception` thrown by the Task available to the Driver.
Task runtime: Apache REEF provides a Task runtime called Evaluator. Evaluators are instantiated in every container of a REEF application. Evaluators can keep data in memory in between Tasks, which enables efficient pipelines on REEF.
Support for multiple resource managers: Apache REEF applications are portable to any supported resource manager with minimal effort. Further, new resource managers are easy to support in REEF.
.NET and Java API: Apache REEF is the only API for writing YARN or Mesos applications in .NET. Further, a single REEF application is free to mix and match Tasks written for .NET or Java.
Plugins: Apache REEF allows for plugins (called "Services") to augment its feature set without adding bloat to the core. REEF includes many Services, such as name-based communication between Tasks, MPI-inspired group communications (Broadcast, Reduce, Gather, ...) and data ingress.
1. Apache REEF Website
Apache Twill
Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their business logic. Twill uses a simple thread-based model that Java programmers will find familiar. YARN can be viewed as a compute fabric of a cluster, which means YARN applications like Twill will run on any Hadoop 2 cluster.
YARN is an open source application that allows the Hadoop cluster to turn into a collection of virtual machines. Weave, developed by Continuuity and initially housed on Github, is a complementary open source application that uses a programming model similar to Java threads, making it easy to write distributed applications. In order to remove a conflict with a similarly named project on Apache, called "Weaver," Weave's name changed to Twill when it moved to Apache incubation.
Twill functions as a scaled-out proxy. Twill is a middleware layer in between YARN and any application on YARN. When you develop a Twill app, Twill handles APIs in YARN so that your program resembles a multi-threaded application familiar to Java developers. It is very easy to build multi-processed distributed applications in Twill.
1. Apache Twill Incubator
Damballa Parkour
A library for developing MapReduce programs using the Lisp-like language Clojure. Parkour aims to provide deep Clojure integration for Hadoop. Programs using Parkour are normal Clojure programs, using standard Clojure functions instead of new framework abstractions. Programs using Parkour are also full Hadoop programs, with complete access to absolutely everything possible in raw Java Hadoop MapReduce.
1. Parkour GitHub Project
Apache Hama
An Apache top-level open source project, allowing you to do advanced analytics beyond MapReduce. Many data analysis techniques, such as machine learning and graph algorithms, require iterative computations; this is where the Bulk Synchronous Parallel (BSP) model can be more effective than "plain" MapReduce.
1. Hama site
Datasalt Pangool
A new MapReduce paradigm: a new API for MR jobs, at a higher level than Java.
1. Pangool 2. GitHub Pangool
Apache Tez
Tez is a proposal to develop a generic application framework which can be used to process complex data-processing task DAGs and which runs natively on Apache Hadoop YARN. Tez generalizes the MapReduce paradigm to a more powerful framework based on expressing computations as a dataflow graph. Tez is not meant directly for end-users; rather, it enables developers to build end-user applications with much better performance and flexibility. Hadoop has traditionally been a batch-processing platform for large amounts of data. However, there are a lot of use cases for near-real-time performance of query processing. There are also several workloads, such as machine learning, which do not fit well into the MapReduce paradigm. Tez helps Hadoop address these use cases. The Tez framework constitutes part of the Stinger initiative (a low-latency SQL-type query interface for Hadoop based on Hive).
1. Apache Tez Incubator 2. Hortonworks Apache Tez page
Apache DataFu
DataFu provides a collection of Hadoop MapReduce jobs and functions in higher-level languages based on it to perform data analysis. It provides functions for common statistics tasks (e.g. quantiles, sampling), PageRank, stream sessionization, and set and bag operations. DataFu also provides Hadoop jobs for incremental data processing in MapReduce. DataFu is a collection of Pig UDFs (including PageRank, sessionization, set operations, sampling, and much more) that were originally developed at LinkedIn.
1. DataFu Apache Incubator
Pydoop
Pydoop is a Python MapReduce and HDFS API for Hadoop, built upon the C++ Pipes and the C libhdfs APIs, that allows you to write full-fledged MapReduce applications with HDFS access. Pydoop has several advantages over Hadoop's built-in solutions for Python programming, i.e., Hadoop Streaming and Jython: being a CPython package, it allows you to access all standard library and third-party modules, some of which may not be available otherwise.
1. SF Pydoop site 2. Pydoop GitHub Project
Kangaroo
An open-source project from Conductor for writing MapReduce jobs consuming data from Kafka. The introductory post explains Conductor's use case: loading data from Kafka to HBase by way of a MapReduce job using the HFileOutputFormat. Unlike other solutions, which are limited to a single InputSplit per Kafka partition, Kangaroo can launch multiple consumers at different offsets in the stream of a single partition for increased throughput and parallelism.
1. Kangaroo Introduction 2. Kangaroo GitHub Project
TinkerPop
A graph computing framework written in Java. It provides a core API that graph system vendors can implement. There are various types of graph systems, including in-memory graph libraries, OLTP graph databases, and OLAP graph processors. Once the core interfaces are implemented, the underlying graph system can be queried using the graph traversal language Gremlin and processed with TinkerPop-enabled algorithms. For many, TinkerPop is seen as the JDBC of the graph computing community.
1. Apache Tinkerpop Proposal 2. TinkerPop site
Pachyderm MapReduce
Pachyderm is a completely new MapReduce engine built on top of Docker and CoreOS. In Pachyderm MapReduce (PMR), a job is an HTTP server inside a Docker container (a microservice). You give Pachyderm a Docker image and it will automatically distribute it throughout the cluster next to your data. Data is POSTed to the container over HTTP and the results are stored back in the file system. You can implement the web server in any language you want and pull in any library. Pachyderm also creates a DAG for all the jobs in the system and their dependencies, and it automatically schedules the pipeline such that each job isn't run until its dependencies have completed. Everything in Pachyderm "speaks in diffs", so it knows exactly which data has changed and which subsets of the pipeline need to be rerun. CoreOS is an open source lightweight operating system based on Chrome OS; in fact, CoreOS is a fork of Chrome OS. CoreOS provides only the minimal functionality required for deploying applications inside software containers, together with built-in mechanisms for service discovery and configuration sharing.
1. Pachyderm site 2. Pachyderm introduction article
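The "each job isn't run until its dependencies have completed" scheduling described above is a topological ordering of the job DAG. A minimal sketch using the standard library (job names are invented; this is the scheduling idea, not Pachyderm's implementation):

```python
# Topologically order a pipeline DAG so every job runs after its
# dependencies. graphlib is in the stdlib from Python 3.9.
from graphlib import TopologicalSorter

# job -> set of jobs it depends on
dag = {
    "ingest": set(),
    "clean": {"ingest"},
    "features": {"clean"},
    "report": {"clean", "features"},
}

order = list(TopologicalSorter(dag).static_order())

# Sanity check: every job appears after all of its dependencies.
for job, deps in dag.items():
    assert all(order.index(d) < order.index(job) for d in deps)

print(order[0])  # 'ingest'
```

The "speaks in diffs" part then layers on top of this: when an input changes, only the downstream sub-DAG reachable from the changed job needs to be re-scheduled, not the whole pipeline.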
Apache Beam
Apache Beam is an open source, unified model for defining and executing data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.
The model behind Beam evolved from a number of internal Google data processing projects, including MapReduce, FlumeJava, and Millwheel. This model was originally known as the "Dataflow Model" and first implemented as Google Cloud Dataflow, including a Java SDK on GitHub for writing pipelines and a fully managed service for executing them on Google Cloud Platform.
In January 2016, Google and a number of partners submitted the Dataflow Programming Model and SDKs portion as an Apache Incubator Proposal, under the name Apache Beam (unified Batch + strEAM processing).
1. Apache Beam Proposal 2. DataFlow Beam and Spark Comparison
NoSQL Databases
Column Data Model
Apache HBase
Inspired by Google BigTable. A non-relational, distributed database offering random, real-time read/write operations on column-oriented, very large tables (BDDB: Big Data DataBase). It is the backing system for MapReduce job outputs; it is the Hadoop database, designed for backing Hadoop MapReduce jobs with Apache HBase tables.
1. Apache HBase Home 2. Mirror of HBase on Github
Apache Cassandra
A distributed non-SQL DBMS; it is a BDDB. MR can retrieve data from Cassandra. This BDDB can run without HDFS, or on top of HDFS (the DataStax fork of Cassandra). HBase and its required supporting systems are derived from what is known of the original Google BigTable and Google File System designs (as known from the Google File System paper published in 2003, and the BigTable paper published in 2006). Cassandra, on the other hand, is a more recent open source fork of a standalone database system initially coded by Facebook which, while implementing the BigTable data model, uses a system inspired by Amazon's Dynamo for storing data (in fact, much of the initial development work on Cassandra was performed by two Dynamo engineers recruited to Facebook from Amazon).
1. Apache Cassandra Home 2. Cassandra on GitHub 3. Training Resources 4. Cassandra - Paper
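The Dynamo-inspired storage layer mentioned above places keys on a consistent-hash ring: each key is owned by the first node clockwise from the key's hash position. A minimal sketch of that placement idea (node names are invented; real Cassandra uses its own partitioners and virtual nodes):

```python
# Consistent-hash ring: hash nodes and keys onto one circular space, and
# assign each key to the next node clockwise from the key's position.
import hashlib
from bisect import bisect

RING_SIZE = 2**32

def ring_position(name):
    """Deterministic position on the ring for a node name or key."""
    digest = hashlib.md5(name.encode()).hexdigest()
    return int(digest, 16) % RING_SIZE

nodes = sorted((ring_position(n), n) for n in ["node-a", "node-b", "node-c"])
positions = [p for p, _ in nodes]

def owner(key):
    """First node at or after the key's position, wrapping around the ring."""
    idx = bisect(positions, ring_position(key)) % len(nodes)
    return nodes[idx][1]

print(owner("user:42"))
```

The payoff of this scheme is that adding or removing one node only moves the keys in one arc of the ring, instead of reshuffling every key the way a plain `hash(key) % n_nodes` assignment would.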
Hypertable
A database system inspired by publications on the design of Google's BigTable. The project is based on the experience of engineers who were solving large-scale data-intensive tasks for many years. Hypertable runs on top of a distributed file system such as the Apache Hadoop DFS, GlusterFS, or the Kosmos File System (KFS). It is written almost entirely in C++. Sponsored by Baidu, the Chinese search engine.
TODO
Apache Accumulo
A distributed key/value store: a robust, scalable, high-performance data storage and retrieval system. Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, Zookeeper, and Thrift. Accumulo is software created by the NSA with security features.
1. Apache Accumulo Home
Apache Kudu
A distributed, columnar, relational data store optimized for analytical use cases requiring very fast reads with competitive write speeds.
Relational data model (tables) with strongly typed columns and a fast, online alter-table operation.
Scale-out and sharded, with support for partitioning based on key ranges and/or hashing.
Fault-tolerant and consistent due to its implementation of Raft consensus.
Supported by Apache Impala and Apache Drill, enabling fast SQL reads and writes through those systems.
Integrates with MapReduce and Spark.
Additionally provides "NoSQL" APIs in Java, Python, and C++.
1. Apache Kudu Home 2. Kudu on Github 3. Kudu technical whitepaper (pdf)
Apache Parquet
Columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
1. Apache Parquet Home 2. Apache Parquet on Github
Document Data Model
MongoDB
A document-oriented database system. It is part of the NoSQL family of database systems. Instead of storing data in tables as is done in a "classical" relational database, MongoDB stores structured data as JSON-like documents.
1. Mongodb site
RethinkDB
RethinkDB is built to store JSON documents and scale to multiple machines with very little effort. It has a pleasant query language that supports really useful queries like table joins and group by, and is easy to set up and learn.
1. RethinkDB site
ArangoDB
An open-source database with a flexible data model for documents, graphs, and key-values. Build high-performance applications using a convenient SQL-like query language or JavaScript extensions.
1. ArangoDB site
Stream Data ModelEventStore An open-source, functional database with support for
Complex Event Processing. It provides a persistenceengine for applications using event-sourcing, or forstoring time-series data. Event Store is written in C#,C++ for the server which runs on Mono or the .NETCLR, on Linux or Windows. Applications using EventStore can be written in JavaScript. Event sourcing (ES)is a way of persisting your application's state by storing
1. EventStore site
6/16/2017 The Hadoop Ecosystem Table
https://hadoopecosystemtable.github.io/ 13/29
the history that determines the current state of your application.
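The event-sourcing idea described above can be sketched in a few lines of Python (a minimal illustration, not the EventStore API): the store keeps only the history of events, and the current state is derived by replaying that history.

```python
# Event sourcing sketch: persist the events, derive the state.
events = [
    {"type": "Deposited", "amount": 100},
    {"type": "Withdrawn", "amount": 30},
    {"type": "Deposited", "amount": 5},
]

def apply(state, event):
    """Fold a single event into the running state."""
    if event["type"] == "Deposited":
        return state + event["amount"]
    if event["type"] == "Withdrawn":
        return state - event["amount"]
    return state

def current_state(events):
    """Replay the full history to reconstruct the present state."""
    state = 0
    for event in events:
        state = apply(state, event)
    return state

balance = current_state(events)
```

Because the history is never discarded, past states can be reconstructed by replaying a prefix of the log, which is also what makes event stores useful for time-series data.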
Key-value Data Model
Redis DataBase
Redis is an open-source, networked, in-memory data structures store with optional durability. It is written in ANSI C. In its outer layer, the Redis data model is a dictionary which maps keys to values. One of the main differences between Redis and other structured storage systems is that Redis supports not only strings, but also abstract data types. Sponsored by Redis Labs. It's BSD licensed.
1. Redis site 2. Redis Labs site
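The outer data model described above, a dictionary from string keys to values that may be strings or richer abstract types, can be sketched in pure Python. These are conceptual stand-ins, not the actual Redis client API, though the function names mirror the real RPUSH and SADD commands.

```python
# Sketch of Redis's outer data model: one keyspace, where a value may be
# a plain string OR an abstract data type (list, set, hash, ...).
store = {}

def rpush(key, *values):
    """Append to a list value, in the spirit of Redis RPUSH."""
    store.setdefault(key, []).extend(values)

def sadd(key, *members):
    """Add to a set value, in the spirit of Redis SADD."""
    store.setdefault(key, set()).update(members)

store["greeting"] = "hello"          # plain string value
rpush("queue", "job1", "job2")       # list value under the same keyspace
sadd("visitors", "alice", "bob", "alice")  # set value: duplicates collapse
```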
LinkedIn Voldemort
Distributed data store that is designed as a key-value store, used by LinkedIn for high-scalability storage.
1. Voldemort site
RocksDB
RocksDB is an embeddable persistent key-value store for fast storage. RocksDB can also be the foundation for a client-server database, but its current focus is on embedded workloads.
1. RocksDB site
OpenTSDB
OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase. OpenTSDB was written to address a common need: store, index and serve metrics collected from computer systems (network gear, operating systems, applications) at a large scale, and make this data easily accessible and graphable.
1. OpenTSDB site
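The store-and-serve model described above can be sketched in pure Python (a conceptual illustration in the OpenTSDB spirit, not its API): data points are keyed by metric name plus tags, and queries aggregate over a time range.

```python
from collections import defaultdict

# (metric, sorted-tags) -> [(timestamp, value), ...]
series = defaultdict(list)

def put(metric, timestamp, value, **tags):
    """Record one data point, keyed by metric name plus tag set."""
    series[(metric, tuple(sorted(tags.items())))].append((timestamp, value))

def query(metric, start, end, **tags):
    """Return the values for one series within [start, end]."""
    key = (metric, tuple(sorted(tags.items())))
    return [v for (ts, v) in series[key] if start <= ts <= end]

put("sys.cpu.user", 1000, 42.5, host="web01")
put("sys.cpu.user", 1060, 43.0, host="web01")
put("sys.cpu.user", 1120, 99.9, host="web02")   # different tag, different series
values = query("sys.cpu.user", 1000, 1100, host="web01")
```

OpenTSDB's actual contribution is mapping this logical model onto HBase row keys so that a time-range scan over one series is a contiguous read.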
Graph Data Model
ArangoDB
An open-source database with a flexible data model for documents, graphs, and key-values. Build high-performance applications using a convenient SQL-like query language or JavaScript extensions.
1. ArangoDB site
Neo4j
An open-source graph database written entirely in Java. It is an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables.
1. Neo4j site
TitanDB
TitanDB is a highly scalable graph database optimized for storing and querying large graphs with billions of vertices and edges distributed across a multi-machine cluster. Titan is a transactional database that can support thousands of concurrent users.
1. Titan site
NewSQL Databases
TokuDB
TokuDB is a storage engine for MySQL and MariaDB that is specifically designed for high performance on write-intensive workloads. It achieves this via Fractal Tree indexing. TokuDB is a scalable, ACID- and MVCC-compliant storage engine. TokuDB is one of the technologies that enable Big Data in MySQL.
1. Percona TokuDB site
HandlerSocket
HandlerSocket is a NoSQL plugin for MySQL/MariaDB. It works as a daemon inside the mysqld process, accepting TCP connections and executing requests from clients. HandlerSocket does not support SQL queries. Instead, it supports simple CRUD operations
TODO
on tables. HandlerSocket can be much faster than mysqld/libmysql in some cases because it has lower CPU, disk, and network overhead.
Akiban Server
Akiban Server is an open source database that brings document stores and relational databases together. Developers get powerful document access alongside surprisingly powerful SQL.
TODO
Drizzle
Drizzle is a re-designed version of the MySQL v6.0 codebase, designed around a central concept of having a microkernel architecture. Features such as the query cache and authentication system are now plugins to the database, which follow the general theme of "pluggable storage engines" introduced in MySQL 5.1. It supports PAM, LDAP, and HTTP AUTH for authentication via the plugins it ships with. Via its plugin system it currently supports logging to files, syslog, and remote services such as RabbitMQ and Gearman. Drizzle is an ACID-compliant relational database that supports transactions via an MVCC design.
TODO
Haeinsa
Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase. Use Haeinsa if you need strong ACID semantics on your HBase cluster. It is based on Google's Percolator concept.
1. Haeinsa GitHub site
SenseiDB
Open-source, distributed, realtime, semi-structured database. Some features: full-text search, fast realtime updates, structured and faceted search, BQL (an SQL-like query language), fast key-value lookup, high performance under concurrent heavy update and query volumes, and Hadoop integration.
1. SenseiDB site
Sky
Sky is an open source database used for flexible, high performance analysis of behavioral data. For certain kinds of data, such as clickstream data and log data, it can be several orders of magnitude faster than traditional approaches such as SQL databases or Hadoop.
1. SkyDB site
BayesDB
BayesDB, a Bayesian database table, lets users query the probable implications of their tabular data as easily as an SQL database lets them query the data itself. Using the built-in Bayesian Query Language (BQL), users with no statistics training can solve basic data science problems, such as detecting predictive relationships between variables, inferring missing values, simulating probable observations, and identifying statistically similar database entries.
1. BayesDB site
InfluxDB
InfluxDB is an open source distributed time series database with no external dependencies. It's useful for recording metrics, events, and performing analytics. It has a built-in HTTP API so you don't have to write any server-side code to get up and running. InfluxDB is designed to be scalable, simple to install and manage, and fast to get data in and out. It aims to answer
1. InfluxDB site
queries in real time. That means every data point is indexed as it comes in and is immediately available in queries that should return in under 100 ms.
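Points are written to InfluxDB's HTTP API in its text "line protocol"; a sketch of building such a line in Python follows. The formatter here is a simplified illustration (real line protocol also requires escaping and typed field values), assuming the basic `measurement,tags fields timestamp` shape.

```python
# Sketch of InfluxDB's line protocol: measurement,tag-set field-set timestamp.
# Simplified: no escaping or type suffixes, just the overall shape.
def line_protocol(measurement, tags, fields, timestamp):
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {timestamp}"

point = line_protocol("cpu", {"host": "server01"}, {"value": 0.64},
                      1434055562000000000)
```

A client would POST lines like this to the server's write endpoint; no server-side code is needed beyond that.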
SQL-on-Hadoop
Apache Hive
Data warehouse infrastructure developed by Facebook. Data summarization, query, and analysis. It provides an SQL-like language (not SQL-92 compliant): HiveQL.
1. Apache Hive site 2. Apache Hive GitHub Project
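HiveQL reads much like ordinary SQL. As a runnable stand-in (Python's built-in sqlite3, not Hive itself), the snippet below shows the kind of summarization query HiveQL expresses over HDFS-resident tables; the table and data are hypothetical.

```python
import sqlite3

# sqlite3 stands in for Hive here: the GROUP BY summarization below is
# the same shape a HiveQL query would take over an HDFS-backed table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id INTEGER, url TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [(1, "/home"), (2, "/home"), (1, "/about")])

# Equivalent HiveQL: SELECT url, COUNT(*) FROM page_views GROUP BY url;
rows = conn.execute(
    "SELECT url, COUNT(*) FROM page_views GROUP BY url ORDER BY url"
).fetchall()
```

The difference is in execution, not syntax: Hive compiles such a query into distributed jobs over HDFS data rather than running it in-process.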
Apache HCatalog
HCatalog's table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored. Right now HCatalog is part of Hive. Only old versions are separated for download.
TODO
Apache Trafodion
Apache Trafodion is a webscale SQL-on-Hadoop solution enabling enterprise-class transactional and operational workloads on HBase. Trafodion is a native MPP ANSI SQL database engine that builds on the scalability, elasticity and flexibility of HDFS and HBase, extending these to provide guaranteed transactional integrity for all workloads including multi-column, multi-row, multi-table, and multi-server updates.
1. Apache Trafodion website 2. Apache Trafodion wiki 3. Apache Trafodion GitHub Project
Apache HAWQ
Apache HAWQ is a Hadoop native SQL query engine that combines the key technological advantages of an MPP database evolved from Greenplum Database with the scalability and convenience of Hadoop.
1. Apache HAWQ site 2. HAWQ GitHub Project
Apache Drill
Drill is the open source version of Google's Dremel system, which is available as an infrastructure service called Google BigQuery. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google's internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google's internal Dremel system, is intended to address this need.
1. Apache Incubator Drill
Cloudera Impala
The Apache-licensed Impala project brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation. It is a clone of Google's Dremel (the engine behind Google BigQuery).
1. Cloudera Impala site 2. Impala GitHub Project
Facebook Presto
Facebook has open sourced Presto, a SQL engine it says is on average 10 times faster than Hive for running queries across large data sets stored in Hadoop and elsewhere.
1. Presto site
Datasalt Splout SQL
Splout allows serving an arbitrarily big dataset with high QPS rates and at the same time provides full SQL
TODO
query syntax.
Apache Tajo
Apache Tajo is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load process) on large data sets stored on HDFS (Hadoop Distributed File System) and other data sources. By supporting SQL standards and leveraging advanced database techniques, Tajo allows direct control of distributed execution and data flow across a variety of query evaluation strategies and optimization opportunities. For reference, the Apache Software Foundation announced Tajo as a Top-Level Project in April 2014.
1. Apache Tajo site
Apache Phoenix
Apache Phoenix is a SQL skin over HBase, delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.
1. Apache Phoenixsite
Apache MRQL
MRQL (pronounced "miracle") is a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, Spark, and Flink. MRQL (the MapReduce Query Language) is an SQL-like query language for large-scale data analysis on a cluster of computers. The MRQL query processing system can evaluate MRQL queries in four modes:
in Map-Reduce mode using Apache Hadoop, in BSP (Bulk Synchronous Parallel) mode using Apache Hama, in Spark mode using Apache Spark, and in Flink mode using Apache Flink.
1. Apache Incubator MRQL site
Kylin
Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides an SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
1. Kylin project site
Data Ingestion
Apache Flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms
1. Apache Flume project site
and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
Apache Sqoop
System for bulk data transfer between HDFS and structured datastores such as RDBMSs. Like Flume, but from HDFS to RDBMS.
1. Apache Sqoop project site
Facebook Scribe
Real-time log aggregator. It is an Apache Thrift service.
1. Facebook Scribe GitHub site
Apache Chukwa
Large-scale log aggregation and analytics.
1. Apache Chukwa site
Apache Kafka
Distributed publish-subscribe system for processing large amounts of streaming data. Kafka is a message queue developed by LinkedIn that persists messages to disk in a very performant manner. Because messages are persisted, it has the interesting ability for clients to rewind a stream and consume the messages again. Another upside of the disk persistence is that bulk importing the data into HDFS for offline analysis can be done very quickly and efficiently. Storm, developed by BackType (which was acquired by Twitter a year ago), is more about transforming a stream of messages into new streams.
1. Apache Kafka 2. GitHub source code
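The rewind property highlighted above follows directly from Kafka's append-only log design, and can be sketched in pure Python (a conceptual model of one partition, not the Kafka client API): the broker keeps the log, each consumer keeps only an offset into it.

```python
# Sketch of a Kafka-style partition: an append-only log, with consumers
# tracking their own read offset so they can rewind and replay.
log = []  # the persisted, append-only message log

def produce(message):
    log.append(message)

class Consumer:
    def __init__(self):
        self.offset = 0

    def poll(self):
        """Read everything past our offset, then advance it."""
        msgs = log[self.offset:]
        self.offset = len(log)
        return msgs

    def rewind(self, offset=0):
        """Because messages persist, moving the offset back replays them."""
        self.offset = offset

produce("a"); produce("b"); produce("c")
c = Consumer()
first_read = c.poll()   # all three messages
c.rewind(1)
replayed = c.poll()     # messages from offset 1 are consumed again
```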
Netflix Suro
Suro has its roots in Apache Chukwa, which was initially adopted by Netflix. It is a log aggregator, like Storm or Samza.
TODO
Apache Samza
Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. Developed by LinkedIn (Jay Kreps, http://www.linkedin.com/in/jaykreps).
1. Apache Samza site
Cloudera Morphline
Cloudera Morphlines is a new open source framework that reduces the time and skills necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards.
TODO
HIHO
This project is a framework for connecting disparate data sources with the Apache Hadoop system, making them interoperable. HIHO connects Hadoop with multiple RDBMS and file systems, so that data can be loaded to Hadoop and unloaded from Hadoop.
TODO
Apache NiFi
Apache NiFi is a dataflow system that is currently under incubation at the Apache Software Foundation. NiFi is based on the concepts of flow-based programming and is highly configurable. NiFi uses a component based extension model to rapidly add capabilities to complex dataflows. Out of the box NiFi has several extensions for dealing with file-based dataflows such as FTP, SFTP, and HTTP integration as well as integration with HDFS. One of NiFi's unique
1. Apache NiFi
features is a rich, web-based interface for designing, controlling, and monitoring a dataflow.
Apache ManifoldCF
Apache ManifoldCF provides a framework for connecting source content repositories like file systems, DBs, CMIS, SharePoint, FileNet ... to target repositories or indexes, such as Apache Solr or ElasticSearch. It's a kind of crawler for multi-content repositories, supporting a lot of sources and multi-format conversion for indexing by means of the Apache Tika Content Extractor transformation filter.
1. Apache ManifoldCF
Service Programming
Apache Thrift
A cross-language RPC framework for service creation. It's the service base for Facebook technologies (Facebook being the original Thrift contributor). Thrift provides a framework for developing and accessing remote services. It allows developers to create services that can be consumed by any application written in a language that has Thrift bindings. Thrift manages serialization of data to and from a service, as well as the protocol that describes a method invocation, response, etc. Instead of writing all the RPC code, you can just get straight to your service logic. Thrift uses TCP, and so a given service is bound to a particular port.
1. Apache Thrift
Apache Zookeeper
It's a coordination service that gives you the tools you need to write correct distributed applications. ZooKeeper was developed at Yahoo! Research. Several Hadoop projects are already using ZooKeeper to coordinate the cluster and provide highly-available distributed services. Perhaps most famous of those are Apache HBase, Storm, and Kafka. ZooKeeper is an application library with two principal implementations of the APIs (Java and C) and a service component implemented in Java that runs on an ensemble of dedicated servers. ZooKeeper is for building distributed systems; it simplifies the development process, making it more agile and enabling more robust implementations. Back in 2006, Google published a paper on "Chubby", a distributed lock service which gained wide adoption within their data centers. ZooKeeper, not surprisingly, is a close clone of Chubby designed to fulfill many of the same roles for HDFS and other Hadoop infrastructure.
1. Apache Zookeeper 2. Google Chubby paper
Apache Avro
Apache Avro is a framework for modeling, serializing and making Remote Procedure Calls (RPC). Avro data is described by a schema, and one interesting feature is that the schema is stored in the same file as the data it describes, so files are self-describing. Avro does not require code generation. This framework can compete with other similar tools like Apache Thrift, Google Protocol Buffers, ZeroC ICE, and so on.
1. Apache Avro
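The self-describing-file property mentioned above can be sketched with JSON (a conceptual illustration only, not Avro's actual binary container format): the writer embeds the schema next to the records, so a reader needs no out-of-band schema to interpret the data.

```python
import json

# Conceptual sketch: schema and data travel together in one container,
# so the file describes itself. (Real Avro uses a binary container.)
schema = {"type": "record", "name": "User",
          "fields": [{"name": "name", "type": "string"},
                     {"name": "age",  "type": "int"}]}
records = [{"name": "Ada", "age": 36}, {"name": "Alan", "age": 41}]

# Writer side: bundle schema + records into a single payload.
container = json.dumps({"schema": schema, "data": records})

# Reader side: recover the field layout purely from the file itself.
loaded = json.loads(container)
field_names = [f["name"] for f in loaded["schema"]["fields"]]
```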
Apache Curator
Curator is a set of Java libraries that make using Apache ZooKeeper much easier.
TODO
Apache Karaf
Apache Karaf is an OSGi runtime that runs on top of any OSGi framework and provides you a set of services, a powerful provisioning concept, an extensible shell and more.
TODO
Twitter Elephant Bird
Elephant Bird is a project that provides utilities (libraries) for working with LZOP-compressed data. It also provides a container format that supports working with Protocol Buffers and Thrift in MapReduce, Writables, Pig LoadFuncs, Hive SerDe, and HBase miscellanea. This open source library is massively used in Twitter.
1. Elephant Bird GitHub
Linkedin Norbert
Norbert is a library that provides easy cluster management and workload distribution. With Norbert, you can quickly distribute a simple client/server architecture to create a highly scalable architecture capable of handling heavy traffic. Implemented in Scala, Norbert wraps ZooKeeper and Netty and uses Protocol Buffers for transport to make it easy to build a cluster-aware application. A Java API is provided, and pluggable load balancing strategies are supported, with round robin and consistent hash strategies provided out of the box.
1. Linkedin Project 2. GitHub source code
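The consistent-hash load-balancing strategy mentioned above can be sketched in pure Python (a generic textbook implementation, not Norbert's Scala code): keys and nodes hash onto a ring, a key is served by the next node clockwise, and virtual replicas smooth out the distribution.

```python
import bisect
import hashlib

def _hash(value):
    """Map a string to a point on the hash ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, replicas=100):
        # Each node gets `replicas` virtual points for an even spread.
        self._ring = sorted(
            (_hash(f"{node}:{i}"), node)
            for node in nodes for i in range(replicas))
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        """A key is owned by the first node clockwise from its hash."""
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("some-request-key")  # deterministic node choice
```

The payoff over plain modulo hashing is that adding or removing a node only remaps the keys nearest its points on the ring, instead of reshuffling everything.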
Scheduling & DR
Apache Oozie
Workflow scheduler system for MR jobs using DAGs (Directed Acyclic Graphs). Oozie Coordinator can trigger jobs by time (frequency) and data availability.
1. Apache Oozie 2. GitHub source code
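The DAG-scheduling idea behind Oozie can be sketched with Python's standard-library `graphlib` (a conceptual model with a hypothetical workflow, not Oozie's XML workflow definitions): each action runs only after everything it depends on has completed.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# A workflow as a DAG: each action maps to the set of actions it
# depends on. (Hypothetical job names for illustration.)
workflow = {
    "import":    set(),                    # no prerequisites
    "clean":     {"import"},
    "aggregate": {"clean"},
    "export":    {"aggregate", "clean"},
}

# A valid execution order respects every dependency edge.
order = list(TopologicalSorter(workflow).static_order())
```

Oozie adds the coordinator layer on top of this: the same DAG can be triggered on a schedule or when its input data becomes available.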
LinkedIn Azkaban
Hadoop workflow management. A batch job scheduler can be seen as a combination of the cron and make Unix utilities combined with a friendly UI.
LinkedIn Azkaban
Apache Falcon
Apache Falcon is a data management framework for simplifying data lifecycle management and processing pipelines on Apache Hadoop. It enables users to configure, manage and orchestrate data motion, pipeline processing, disaster recovery, and data retention workflows. Instead of hard-coding complex data lifecycle capabilities, Hadoop applications can now rely on the well-tested Apache Falcon framework for these functions. Falcon's simplification of data management is quite useful to anyone building apps on Hadoop. Data management on Hadoop encompasses data motion, process orchestration, lifecycle management, data discovery, etc., among other concerns that are beyond ETL. Falcon is a new data processing and management platform for Hadoop that solves this problem and creates additional opportunities by building on existing components within the Hadoop ecosystem (e.g. Apache Oozie, Apache Hadoop DistCp, etc.) without reinventing the wheel.
Apache Falcon
Schedoscope
Schedoscope is a new open-source project providing a scheduling framework for painfree agile development, testing, (re)loading, and monitoring of your datahub, lake, or whatever you choose to call your Hadoop data
GitHub source code
warehouse these days. Datasets (including dependencies) are defined using a Scala DSL, which can embed MapReduce jobs, Pig scripts, Hive queries or Oozie workflows to build the dataset. The tool includes a test framework to verify logic and a command line utility to load and reload data.
Machine Learning
Apache Mahout
Machine learning library and math library, on top of MapReduce.
1. Apache Mahout
WEKA
Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. Weka is free software available under the GNU General Public License.
Weka 3
Cloudera Oryx
The Oryx open source project provides simple, real-time, large-scale machine learning / predictive analytics infrastructure. It implements a few classes of algorithms commonly used in business applications: collaborative filtering / recommendation, classification / regression, and clustering.
1. Oryx at GitHub 2. Cloudera forum for Machine Learning
Deeplearning4j
The Deeplearning4j open-source project is the most widely used deep-learning framework for the JVM. DL4J includes deep neural nets such as recurrent neural networks, Long Short-Term Memory networks (LSTMs), convolutional neural networks, various autoencoders and feedforward neural networks such as restricted Boltzmann machines and deep-belief networks. It also has natural language-processing algorithms such as word2vec, doc2vec, GloVe and TF-IDF. All Deeplearning4j networks run distributed on multiple CPUs and GPUs. They work as Hadoop jobs, and integrate with Spark at the slave level for host-thread orchestration. Deeplearning4j's neural networks are applied to use cases such as fraud and anomaly detection, recommender systems, and predictive maintenance.
1. Deeplearning4j Website 2. Gitter Community for Deeplearning4j
MADlib
The MADlib project leverages the data-processing capabilities of an RDBMS to analyze data. The aim of this project is the integration of statistical data analysis into databases. The MADlib project is self-described as Big Data Machine Learning in SQL for Data Scientists. The MADlib software project began as a collaboration between researchers at UC Berkeley and engineers and data scientists at EMC/Greenplum (now Pivotal).
1. MADlibCommunity
H2O
H2O is a statistical, machine learning and math runtime tool for big data analysis. Developed by the predictive analytics company H2O.ai, H2O has established leadership in the ML scene together with R and Databricks' Spark. According to the team, H2O is the world's fastest in-memory platform for machine
1. H2O at GitHub 2. H2O Blog
learning and predictive analytics on big data. It is designed to help users scale machine learning, math, and statistics over large datasets.
In addition to H2O's point-and-click Web UI, its REST API allows easy integration into various clients. This means explorative analysis of data can be done in a typical fashion in R, Python, and Scala, and entire workflows can be written up as automated scripts.
Sparkling Water
Sparkling Water combines two open source technologies: Apache Spark and H2O, a machine learning engine. It makes H2O's library of advanced algorithms, including Deep Learning, GLM, GBM, KMeans, PCA, and Random Forest, accessible from Spark workflows. Spark users are provided with options to select the best features from either platform to meet their machine learning needs. Users can combine Spark's RDD API and Spark MLlib with H2O's machine learning algorithms, or use H2O independently of Spark in the model building process and post-process the results in Spark.
Sparkling Water provides a transparent integration of H2O's framework and data structures into Spark's RDD-based environment by sharing the same execution space as well as providing an RDD-like API for H2O data structures.
1. Sparkling Water at GitHub 2. Sparkling Water Examples
Apache SystemML
Apache SystemML was open sourced by IBM and is closely related to Apache Spark. Think of Apache Spark as the analytics operating system for any application that taps into huge volumes of streaming data. MLlib, the machine learning library for Spark, provides developers with a rich set of machine learning algorithms. SystemML enables developers to translate those algorithms so they can easily digest different kinds of data and run on different kinds of computers.
SystemML allows a developer to write a single machine learning algorithm and automatically scale it up using Spark or Hadoop.
SystemML scales for big data analytics with high performance optimizer technology, and empowers users to write customized machine learning algorithms using a simple domain-specific language (DSL) without learning complicated distributed programming. It is an extensible complement to Spark MLlib.
1. Apache SystemML 2. Apache Proposal
Benchmarking and QA Tools
Apache Hadoop Benchmarking
There are two main JAR files in Apache Hadoop for benchmarking. These JARs contain micro-benchmarks for testing particular parts of the infrastructure; for instance, TestDFSIO analyzes the disk system, TeraSort evaluates MapReduce tasks, WordCount measures cluster performance, etc. The micro-benchmarks are packaged in the tests and examples JAR files, and you can get a list of them, with descriptions, by invoking the JAR file with no arguments. As of the Apache Hadoop 2.2.0 stable version, the following JAR files are available for tests, examples, and benchmarking: hadoop-mapreduce-examples-2.2.0.jar and hadoop-mapreduce-client-jobclient-2.2.0-tests.jar.
1. MAPREDUCE-3561 umbrella ticket to track all the issues related to performance
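The WordCount benchmark mentioned above is the classic two-phase MapReduce computation; its logic can be sketched in pure Python (a single-process illustration, not the distributed Hadoop job): map emits (word, 1) pairs, and reduce sums the counts per word.

```python
from collections import Counter
from itertools import chain

# Single-process sketch of the WordCount MapReduce benchmark's logic.
documents = ["the quick brown fox", "the lazy dog", "the fox"]

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    """Reduce: sum the emitted counts, grouped by word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

word_counts = reduce_phase(
    chain.from_iterable(map_phase(d) for d in documents))
```

In the real benchmark, the map calls run in parallel across the cluster and a shuffle stage groups pairs by word before the reducers sum them, which is exactly what makes it a useful measure of whole-cluster throughput.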
Yahoo Gridmix3
Hadoop cluster benchmarking from the Yahoo engineering team.
TODO
PUMA Benchmarking
Benchmark suite which represents a broad range of MapReduce applications exhibiting application characteristics with high/low computation and high/low shuffle volumes. There are a total of 13 benchmarks, out of which Tera-Sort, Word-Count, and Grep are from the Hadoop distribution. The rest of the benchmarks were developed in-house and are currently not part of the Hadoop distribution. The three benchmarks from the Hadoop distribution are also slightly modified to take the number of reduce tasks as input from the user and generate final time completion statistics of jobs.
1. MAPREDUCE-5116 2. Faraz Ahmad researcher 3. PUMA Docs
Berkeley SWIM Benchmark
The SWIM benchmark (Statistical Workload Injector for MapReduce) is a benchmark representing a real-world big data workload, developed by the University of California at Berkeley in close cooperation with Facebook. This test provides rigorous measurements of the performance of MapReduce systems comprised of real industry workloads.
1. GitHub SWIM
Intel HiBench
HiBench is a Hadoop benchmark suite.
TODO
Apache Yetus
To help maintain consistency over a large and
disconnected set of committers, automated patch testing was added to Hadoop's development process. This automated patch testing (now included as part of Apache Yetus) works as follows: when a patch is uploaded to the bug tracking system, an automated process downloads the patch, performs some static analysis, and runs the unit tests. These results are posted back to the bug tracker and alerts notify interested parties about the state of the patch.
However, the Apache Yetus project addresses much more than traditional patch testing; it is a better approach that includes a massive rewrite of the patch testing facility used in Hadoop.
1. Altiscale Blog Entry 2. Apache Yetus Proposal 3. Apache Yetus Project site
Security
Apache Sentry
Sentry is the next step in enterprise-grade big data security and delivers fine-grained authorization to data stored in Apache Hadoop. An independent security module that integrates with open source SQL query engines Apache Hive and Cloudera Impala, Sentry delivers advanced authorization controls to enable multi-user applications and cross-functional processes for enterprise data sets. Sentry was a Cloudera development.
TODO
Apache Knox Gateway
System that provides a single point of secure access for Apache Hadoop clusters. The goal is to simplify Hadoop security for both users (i.e. who access the cluster data and execute jobs) and operators (i.e. who control access and manage the cluster). The Gateway runs as a server (or cluster of servers) that serves one or more Hadoop clusters.
1. Apache Knox 2. Apache Knox Gateway Hortonworks web
Apache Ranger
Apache Ranger (formerly called Apache Argus or HDP Advanced Security) delivers a comprehensive approach to central security policy administration across the core enterprise security requirements of authentication, authorization, accounting and data protection. It extends baseline features for coordinated enforcement across Hadoop workloads from batch, interactive SQL and real-time, and leverages the extensible architecture to apply policies consistently against additional Hadoop ecosystem components (beyond HDFS, Hive, and HBase) including Storm, Solr, Spark, and more.
1. Apache Ranger 2. Apache Ranger Hortonworks web
Metadata Management
Metascope
Metascope is a metadata management and data discovery tool which serves as an add-on to Schedoscope. Metascope is able to collect technical, operational and business metadata from your Hadoop datahub and makes them easy to search and navigate via a portal.
GitHub source code
System Deployment
Apache Ambari
Intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs. Apache Ambari was donated by the Hortonworks team to the ASF. It's a powerful and nice interface for Hadoop and other typical applications from the Hadoop ecosystem. Apache Ambari is under heavy development, and it will incorporate new features in the near future. For example, Ambari is able to deploy a complete Hadoop system from scratch; however, it is not possible to use this GUI on a Hadoop system that is already running. The ability to provision the operating system could be a good addition; however, it is probably not on the roadmap.
1. Apache Ambari
Cloudera HUE
Web application for interacting with Apache Hadoop.
1. HUE home page
It's not a deployment tool; it is an open-source Web interface that supports Apache Hadoop and its ecosystem, licensed under the Apache v2 license. HUE is used for Hadoop and its ecosystem user operations. For example, HUE offers editors for Hive, Impala, Oozie, and Pig, notebooks for Spark, Solr Search dashboards, and HDFS, YARN, and HBase browsers.
Apache Mesos
Mesos is a cluster manager that provides resource sharing and isolation across cluster applications, as HTCondor, SGE or Torque can do. However, Mesos has a Hadoop-centred design.
TODO
Myriad
Myriad is a Mesos framework designed for scaling YARN clusters on Mesos. Myriad can expand or shrink one or more YARN clusters in response to events, as per configured rules and policies.
1. Myriad Github
Marathon
Marathon is a Mesos framework for long-running services. Given that you have Mesos running as the kernel for your datacenter, Marathon is the init or upstart daemon.
TODO
Brooklyn
Brooklyn is a library that simplifies application deployment and management. For deployment, it is designed to tie in with other tools, giving single-click deploy and adding the concepts of manageable clusters and fabrics. Many common software entities are available out-of-the-box. It integrates with Apache Whirr (and thereby Chef and Puppet) to deploy well-known services such as Hadoop and elasticsearch (or use POBS, plain-old-bash-scripts), and can use PaaS's such as OpenShift, alongside self-built clusters, for maximum flexibility.
TODO
Hortonworks HOYA
HOYA is defined as "running HBase On YARN". The Hoya tool is a Java tool, and is currently CLI driven. It takes in a cluster specification, in terms of the number of regionservers, the location of HBASE_HOME, the ZooKeeper quorum hosts, the configuration that the new HBase cluster instance should use, and so on. So HOYA is for HBase deployment using a tool developed on top of YARN. Once the cluster has been started, the cluster can be made to grow or shrink using the Hoya commands. The cluster can also be stopped and later resumed. Hoya implements the functionality through YARN APIs and HBase's shell scripts. The goal of the prototype was to have minimal code changes and, as of this writing, it has required zero code changes in HBase.
1. Hortonworks Blog
Apache Helix
Apache Helix is a generic cluster management framework used for the automatic management of partitioned, replicated and distributed resources hosted on a cluster of nodes. Originally developed by LinkedIn, it is now an incubator project at Apache. Helix is developed on top of ZooKeeper for coordination tasks.
1. Apache Helix
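The core problem Helix addresses can be illustrated with a toy sketch: spread N partitions, each with R replicas, across the live nodes, and recompute the assignment when a node disappears. This is an illustration of the idea only (all names are made up); Helix itself coordinates such assignments through ZooKeeper and configurable state models.

```python
def assign(partitions, replicas, nodes):
    """Round-robin each partition's replicas across distinct nodes.
    (Toy placement policy, not Helix's actual rebalancer.)"""
    mapping = {}
    for p in range(partitions):
        mapping[p] = [nodes[(p + r) % len(nodes)] for r in range(replicas)]
    return mapping

nodes = ["node1", "node2", "node3"]
before = assign(6, 2, nodes)
# node2 fails: recompute the assignment over the survivors
after = assign(6, 2, [n for n in nodes if n != "node2"])
```

In a real Helix deployment, the controller reacts to the ZooKeeper session loss and drives each affected replica through a state transition (e.g. OFFLINE to SLAVE to MASTER) rather than recomputing a dictionary.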
Apache Bigtop

Bigtop was originally developed and released as an open source packaging infrastructure by Cloudera. Bigtop is used by some vendors to build their own distributions based on Apache Hadoop (CDH, Pivotal HD, Intel's distribution); however, Apache Bigtop performs many more tasks, like continuous integration testing (with Jenkins, Maven, ...), and is useful for packaging (RPM and DEB), deployment with Puppet, and so on. Bigtop also features Vagrant recipes for spinning up "n-node" Hadoop clusters, and the BigPetStore blueprint application, which demonstrates construction of a full-stack Hadoop app with ETL, machine learning, and dataset generation. Apache Bigtop can be considered a community effort with one main focus: put all the bits of the Hadoop ecosystem together as a whole, rather than as individual projects.
1. Apache Bigtop.
Buildoop
Buildoop is an open source project licensed under Apache License 2.0, based on the Apache Bigtop idea. Buildoop is a collaboration project that provides templates and tools to help you create custom Linux-based systems based on the Hadoop ecosystem. The project is built from scratch using the Groovy language, and is not based on a mixture of tools as Bigtop is (Makefile, Gradle, Groovy, Maven); it is arguably easier to program than Bigtop, and the design is focused on the basic ideas behind buildroot and the Yocto Project. The project is in the early stages of development right now.
1. Hadoop EcosystemBuilder.
Deploop
Deploop is a tool for provisioning, managing and monitoring Apache Hadoop clusters, focused on the Lambda Architecture (LA). LA is a generic design based on the concepts of Twitter engineer Nathan Marz, addressing common requirements for big data. The Deploop system is in ongoing development, in the alpha phases of maturity. The system is set up on top of highly scalable technologies like Puppet and MCollective.
1. The HadoopDeploy System.
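The Lambda Architecture that Deploop targets can be sketched in a few lines: a batch layer periodically recomputes a complete view over all historical data, a speed layer keeps a small incremental view over recent events, and queries merge the two at read time. This is a minimal conceptual illustration (the data and names are invented), not Deploop or any specific LA implementation.

```python
from collections import Counter

historical = ["page_a", "page_b", "page_a"]   # already absorbed by the batch layer
recent = ["page_a", "page_c"]                 # not yet in the batch view

batch_view = Counter(historical)   # slow, complete recomputation over all data
speed_view = Counter(recent)       # fast, incremental, covers only recent events

def query(key):
    # Serving layer: merge the batch and speed views at read time.
    return batch_view[key] + speed_view[key]
```

When the next batch run completes, the events in `recent` move into `batch_view` and the speed view is discarded, bounding the error window of the approximate speed layer.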
SequenceIQ Cloudbreak
Cloudbreak is an effective way to start and run multiple instances and versions of Hadoop clusters in the cloud, in Docker containers or on bare metal. It is a cloud- and infrastructure-agnostic and cost-effective Hadoop-as-a-Service platform API. It provides automatic scaling, secure multi-tenancy and full cloud lifecycle management.
Cloudbreak leverages cloud infrastructure platforms to create host instances, uses Docker technology to deploy the requisite containers cloud-agnostically, and uses Apache Ambari (via Ambari Blueprints) to install and manage a Hortonworks cluster. It is a tool within the HDP ecosystem.
1. GitHub project. 2. Cloudbreakintroduction. 3. Cloudbreak inHortonworks.
Apache Eagle

Apache Eagle is an open source analytics solution for identifying security and performance issues instantly on big data platforms, e.g. Hadoop, Spark, etc. It analyzes data activities, YARN applications, JMX metrics, and daemon logs, and provides a state-of-the-art alert engine to identify security breaches and performance issues and surface insights. Big data platforms normally generate huge amounts of operational logs and metrics in real time. Apache Eagle was founded to solve hard problems in securing and tuning the performance of big data platforms by keeping metrics and logs always available and alerting immediately, even under heavy traffic.

1. Apache Eagle GitHub Project. 2. Apache Eagle Web Site.
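The kind of policy evaluation an alert engine like Eagle's performs can be sketched as matching streaming events against declarative rules and emitting alerts. The rule format and metric names below are entirely hypothetical, chosen only for illustration; Eagle's real policies are expressed against its own streaming DSL and schemas.

```python
# Hypothetical rule set: metric name, threshold, severity (illustration only).
rules = [
    {"metric": "hdfs.audit.delete", "threshold": 100, "severity": "WARN"},
    {"metric": "namenode.rpc.latency_ms", "threshold": 500, "severity": "CRIT"},
]

def evaluate(event):
    """Return an alert tuple for every rule the event violates."""
    return [
        (r["severity"], event["metric"], event["value"])
        for r in rules
        if r["metric"] == event["metric"] and event["value"] > r["threshold"]
    ]

alerts = evaluate({"metric": "namenode.rpc.latency_ms", "value": 750})
```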
Applications
Apache Nutch
A highly extensible and scalable open source web crawler software project; a search engine based on Lucene. A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. Web crawlers can copy all the pages they visit for later processing by a search engine that indexes the downloaded pages so that users can search them much more quickly.
TODO
Sphinx Search Server
Sphinx lets you either batch index and search data stored in an SQL database, NoSQL storage, or just files, quickly and easily — or index and search data on the fly, working with Sphinx pretty much as with a database server.
1. Sphinx search website
Apache OODT

OODT was originally developed at NASA Jet Propulsion Laboratory to support capturing, processing and sharing of data for NASA's scientific archives.
TODO
HIPI Library

HIPI is a library for Hadoop's MapReduce framework that provides an API for performing image processing tasks in a distributed computing environment.
TODO
PivotalR
PivotalR is a package that enables users of R, the most popular open source statistical programming language and environment, to interact with the Pivotal (Greenplum) Database as well as Pivotal HD / HAWQ and the open-source database PostgreSQL for Big Data analytics. R is a programming language and data analysis software: you do data analysis in R by writing scripts and functions in the R programming language. R is a complete, interactive, object-oriented language: designed by statisticians, for statisticians. The language provides objects, operators and functions that make the process of exploring, modeling, and visualizing data a natural one.
1. PivotalR onGitHub
Development Frameworks

Jumbune

Jumbune is an open source product that sits on top of any Hadoop distribution and assists in the development and administration of MapReduce solutions. The objective of the product is to assist analytical solution providers to port fault-free applications onto production Hadoop environments.
1. Jumbune 2. Jumbune GitHubProject 3. Jumbune JIRApage
Jumbune supports all active major branches of Apache Hadoop, namely 1.x, 2.x and 0.23.x, and the commercial MapR, HDP 2.x and CDH 5.x distributions of Hadoop. It has the ability to work well with both YARN and non-YARN versions of Hadoop. It has four major modules: MapReduce Debugger, HDFS Data Validator, On-demand Cluster Monitor and MapReduce Job Profiler. Jumbune can be deployed on any remote user machine and uses a lightweight agent on the NameNode of the cluster to relay relevant information to and fro.
Spring XD
Spring XD (Xtreme Data) is an evolution of the Spring Java application development framework to support Big Data applications, by Pivotal. SpringSource was the company created by the founders of the Spring Framework. SpringSource was purchased by VMware, where it was maintained for some time as a separate division within VMware. Later VMware, and its parent company EMC Corporation, formally created a joint venture called Pivotal. Spring XD is more than a development framework library; it is a distributed and extensible system for data ingestion, real-time analytics, batch processing, and data export. It can be considered an alternative to Apache Flume/Sqoop/Oozie in some scenarios. Spring XD is part of Pivotal Spring for Apache Hadoop (SHDP). SHDP, integrated with Spring, Spring Batch and Spring Data, is part of the Spring IO Platform as foundational libraries. Building on top of, and extending, this foundation, the Spring IO Platform provides Spring XD as a big data runtime. SHDP aims to help simplify the development of Hadoop-based applications by providing a consistent configuration and API across a wide range of Hadoop ecosystem projects such as Pig, Hive, and Cascading, in addition to providing extensions to Spring Batch for orchestrating Hadoop-based workflows.
1. Spring XD onGitHub
Cask Data Application Platform
Cask Data Application Platform is an open source application development platform for the Hadoop ecosystem that provides developers with data and application virtualization to accelerate application development, address a range of real-time and batch use cases, and deploy applications into production. Deployment is handled by Cask Coopr, an open source template-based cluster management solution that provisions, manages, and scales clusters for multi-tiered application stacks on public and private clouds. Another component is Tigon, a distributed framework built on Apache Hadoop and Apache HBase for real-time, high-throughput, low-latency data processing and analytics applications.
1. Cask Site
Categorize Pending ...

Apache Fluo

Apache Fluo (incubating) is an open source implementation of Percolator for Apache Accumulo. Fluo makes it possible to incrementally update the results of a large-scale computation, index, or analytic as new data is discovered. Fluo allows processing new data with lower latency than Spark or MapReduce in the case where all data must otherwise be reprocessed when new data arrives.

1. Apache Fluo Site
2. Percolator Paper
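The incremental-update model that Fluo (following Percolator) provides can be sketched with a word-count index: instead of recomputing the index over the full corpus whenever a document arrives, only the delta for the new document is folded in. This is an illustration of the concept, not the Fluo API, which expresses such updates as observers reacting to changed rows in Accumulo.

```python
from collections import Counter

index = Counter()  # the continuously maintained result

def observe_new_document(text):
    """Observer-style incremental update: fold in only the new data."""
    index.update(text.split())

observe_new_document("big data platforms")
observe_new_document("big graphs")
# index now reflects both documents without a full-batch recomputation
```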
Twitter Summingbird
A system that aims to mitigate the tradeoffs between batch processing and stream processing by combining them into a hybrid system. In the case of Twitter, Hadoop handles batch processing, Storm handles stream processing, and the hybrid system is called Summingbird.
TODO
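Summingbird's central idea is to express the aggregation once, as an associative "sum" (a monoid), so the same logic can run over a historical batch (Hadoop) and a live stream (Storm), with the two partial results merged. The sketch below is a pure-Python illustration of that idea, not the Summingbird Scala API.

```python
from collections import Counter

def word_counts(tweets):
    # The single aggregation definition shared by both execution modes.
    out = Counter()
    for t in tweets:
        out.update(t.split())
    return out

batch_result = word_counts(["hello world", "hello storm"])   # batch layer
stream_result = word_counts(["hello hadoop"])                # streaming layer
merged = batch_result + stream_result  # Counter addition is the monoid "+"
```

Because `Counter` addition is associative, the merge gives the same answer regardless of how the input was split between the two layers, which is exactly the property Summingbird relies on.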
Apache Kiji

Build real-time Big Data applications on Apache HBase.

TODO
S4 Yahoo
S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.
TODO
Metamarkets Druid

Real-time analytical data store.

TODO
Concurrent Cascading

An application framework for Java developers to simply develop robust Data Analytics and Data Management applications on Apache Hadoop.
TODO
Concurrent Lingual
An open source project enabling fast and simple Big Data application development on Apache Hadoop; a project that delivers ANSI-standard SQL technology to easily build new applications and integrate existing applications onto Hadoop.
TODO
Concurrent Pattern

Machine learning for Cascading on Apache Hadoop through an API, and standards-based PMML.

TODO
Apache Giraph
Apache Giraph is an iterative graph processing system built for high scalability. For example, it is currently used at Facebook to analyze the social graph formed by users and their connections. Giraph originated as the open-source counterpart to Pregel, the graph processing architecture developed at Google.
TODO
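The vertex-centric, superstep-based model that Giraph inherits from Pregel can be shown with a toy computation: in synchronized rounds, each vertex reads messages from its neighbors, possibly updates its value, and the computation halts when nothing changes. Here every vertex learns the maximum value in its connected component. This is a conceptual sketch only; real Giraph programs are Java `Computation` classes running over a distributed graph.

```python
edges = {1: [2], 2: [1, 3], 3: [2]}   # undirected toy graph (adjacency lists)
values = {1: 3, 2: 6, 3: 2}           # initial vertex values

changed = True
while changed:                         # one loop iteration == one superstep
    changed = False
    # Message phase: each vertex receives its neighbors' current values.
    messages = {v: [values[u] for u in edges[v]] for v in edges}
    # Compute phase: adopt the largest value seen so far.
    for v, msgs in messages.items():
        best = max(msgs, default=values[v])
        if best > values[v]:
            values[v] = best
            changed = True
```

After two supersteps every vertex holds the component maximum (6), and the vote-to-halt condition (no value changed) ends the computation.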
Talend
Talend is an open source software vendor that provides data integration, data management, enterprise application integration and big data software and solutions.
TODO
Akka Toolkit

Akka is an open-source toolkit and runtime simplifying the construction of concurrent applications on the Java platform.
TODO
Eclipse BIRT

BIRT is an open source Eclipse-based reporting system that integrates with your Java/Java EE application to produce compelling reports.
TODO
SpagoBI

SpagoBI is an Open Source Business Intelligence suite, belonging to the free/open source SpagoWorld initiative, founded and supported by Engineering Group. It offers a large range of analytical functions, a highly functional semantic layer often absent in other open source platforms and projects, and a respectable set of advanced data visualization features including geospatial analytics.

TODO
Jedox Palo
The Palo Suite combines all core applications — OLAP Server, Palo Web, Palo ETL Server and Palo for Excel — into one comprehensive and customisable Business Intelligence platform. The platform is completely based on Open Source products, representing a high-end Business Intelligence solution which is available entirely free of any license fees.
TODO
Twitter Finagle
Finagle is an asynchronous network stack for the JVM that you can use to build asynchronous Remote Procedure Call (RPC) clients and servers in Java, Scala, or any JVM-hosted language.
TODO
Intel GraphBuilder

A library which provides tools to construct large-scale graphs on top of Apache Hadoop.

TODO
Apache Tika

A toolkit that detects and extracts metadata and structured text content from various documents using existing parser libraries.
TODO
Apache Zeppelin
Zeppelin is a modern web-based tool for data scientists to collaborate over large-scale data exploration and visualization projects. It is a notebook-style interpreter that enables collaborative analysis sessions to be shared between users. Zeppelin is independent of the execution framework itself. The current version runs on top of Apache Spark, but it has pluggable interpreter APIs to support other data processing systems. More execution frameworks could be added at a later date, e.g. Apache Flink and Crunch, as well as SQL-like backends such as Hive, Tajo and MRQL.
1. Apache Zeppelinsite
Hydrosphere Mist
Hydrosphere Mist is a service for exposing Apache Spark analytics jobs and machine learning models as realtime, batch or reactive web services. It acts as middleware between Apache Spark, the machine learning stack and user-facing applications.
1. Hydrosphere Mistgithub
Published with GitHub Pages by Javi Roman, and contributors