Demystifying the Big Data Ecosystem... - Param Natarajan

When I started reading up on big data I was perplexed by the terminology and the vast ecosystem that surrounds it. This article is a humble attempt to demystify this complex landscape and will give you just enough information to start relating to the terms, terminologies and technologies surrounding the BIG DATA ecosystem.

You will find it useful as a starter kit, whether you are embarking on a new IoT project or trying to build out a data lake for your enterprise. I have stuck with the open source options available for ease of explanation...

1. Let's start simple... Big Data is data so big that it cannot fit in, or be processed on, one box, and it cannot travel from one place to another as a whole.

2. Since it cannot fit into one box, each file has to be split and spread across multiple boxes using a distributed file system called HDFS.

3. HDFS hence contains Data Nodes, which store the splits.

4. The Name Node is a bookkeeper that keeps track of, at a minimum, which split resides on which Data Node. The Name Node is periodically backed up by a Secondary Name Node, as most of its data is kept in memory for performance reasons.

5. HDFS also replicates the data, storing the replicas across nodes and data centre racks to improve reliability against node failures.
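
To make points 2 to 5 concrete, here is a minimal Scala sketch against the Hadoop FileSystem API that asks the Name Node where the blocks of one file live and what its replication factor is. The file path is hypothetical, and the cluster address is assumed to come from the usual core-site.xml/hdfs-site.xml on the classpath.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object WhereAreMyBlocks {
      def main(args: Array[String]): Unit = {
        // Assumes the Hadoop config files on the classpath point fs.defaultFS at the cluster.
        val fs = FileSystem.get(new Configuration())

        val file = new Path("/data/raw/events.log") // hypothetical file in the lake
        val status = fs.getFileStatus(file)
        println(s"replication factor = ${status.getReplication}, block size = ${status.getBlockSize}")

        // The Name Node answers this lookup: every block (split) maps to a set of Data Nodes.
        fs.getFileBlockLocations(status, 0, status.getLen).zipWithIndex.foreach {
          case (block, i) =>
            println(s"block $i: offset=${block.getOffset}, hosts=${block.getHosts.mkString(", ")}")
        }
        fs.close()
      }
    }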

6. The Hadoop 1.0 ecosystem consists of HDFS (Name Node, Data Nodes and Secondary Name Node), a task scheduler called the Job Tracker and a task executor called the Task Tracker.

7. The Task Tracker runs tasks allocated by the Job Tracker in a separate JVM directly on the data node. While scheduling tasks on the Task Tracker, the Job Tracker keeps data locality in mind to avoid network latency.

8. Hadoop 2.0 has alleviated several scale and performance issues that were present in Hadoop 1.0. The Job Tracker in Hadoop 1.0 used to perform scheduling, monitoring and job history tracking; this responsibility is now distributed between the Resource Manager, the Application Master and the Timeline Server. Task Trackers are replaced with Node Managers, which run application resource containers, an upgrade over splitting the cluster capacity into fixed map and reduce slots as in Hadoop 1.0.

9. The Name Node was a single point of failure in Hadoop 1.0. As of Hadoop 2.0 it is highly available using a cluster capability called the Quorum Journal Manager, along with ZooKeeper, which helps manage leader election.

10. Name Nodes can be federated in Hadoop 2.0, with each Name Node managing a portion of the file namespace in a massive cluster, thereby improving scale.

11. The servers that host the Hadoop components are called commodity servers. Commodity doesn't mean cheap, low-end servers; it means a device that is relatively inexpensive, widely available and interchangeable with other hardware of its type.

12. The definition of a commodity server changes year on year, essentially because, at the same price point, the processing power available in data centres continues to increase rapidly. As an example, in 2009 a commodity server meant 8 cores, 16 GB of RAM and 4x1 TB of disk; by 2012 it meant 16+ cores, 48 to 96 GB of RAM and 12x2 TB or 12x3 TB of disk.

13. Since the data cannot travel, tasks have to come to the data that resides on a Data Node to perform processing on each split, convert each split into key-value pairs (map) and then aggregate across the splits based on key (reduce) to perform a meaningful computation. This job is called MapReduce and, as of Hadoop 1.0, involves the Job Tracker (scheduler) and Task Trackers (workers that operate on the data splits). A sketch follows below.
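
The sketch below is the classic word count against the Hadoop MapReduce Java API, written in Scala, to make the map/reduce split-and-aggregate idea concrete. Input and output paths are passed in as arguments; the class names are just illustrative.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

    // Map phase: each task reads one split and emits (word, 1) pairs.
    class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
      private val one = new IntWritable(1)
      private val word = new Text()
      override def map(key: LongWritable, value: Text,
                       ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
        value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
          word.set(w); ctx.write(word, one)
        }
    }

    // Reduce phase: all values for the same key are aggregated by one reducer.
    class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
      override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                          ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
        var sum = 0
        val it = values.iterator()
        while (it.hasNext) sum += it.next().get()
        ctx.write(key, new IntWritable(sum))
      }
    }

    object WordCount {
      def main(args: Array[String]): Unit = {
        val job = Job.getInstance(new Configuration(), "word count")
        job.setJarByClass(classOf[TokenMapper])
        job.setMapperClass(classOf[TokenMapper])
        job.setReducerClass(classOf[SumReducer])
        job.setOutputKeyClass(classOf[Text])
        job.setOutputValueClass(classOf[IntWritable])
        FileInputFormat.addInputPath(job, new Path(args(0)))
        FileOutputFormat.setOutputPath(job, new Path(args(1)))
        System.exit(if (job.waitForCompletion(true)) 0 else 1)
      }
    }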

14. In the past we had to model the data before ingesting it; now, with big data, structured (database), semi-structured (log files) and unstructured (images) data can all become part of your data analysis landscape. The sources of data can range from log files, sensors and clickstreams to databases.

15. Every organization aggregates and analyses this data in a central place called a Data Lake. A data lake is different from a data warehouse in that a data warehouse can only store structured and modelled data, while a data lake can store unstructured and semi-structured data as well as structured data.

16. Data can be stored in the data lake in a plain text format, but it is preferred to store it in a compressed, splittable, binary format to save space and exploit the underlying power of HDFS and MapReduce.

17. Also, given the changing nature of data, these file formats have to support schema evolution. It is also desirable that every data set comes with its own schema, which makes the data set self-describing.

18. Hadoop provides a key-value file format called the Sequence File format for this purpose, but it is limited to Java programs. There are other file formats like Avro, Thrift and Parquet that support both reading and writing files to HDFS in a language-agnostic way.

19. These splittable data formats can be further compressed using a compression codec like Snappy or Bzip2, which allows them to occupy much less space and reduces network bandwidth when shuffling data across the nodes during job execution, thereby improving performance. Please note that compression is a CPU-intensive process and is more suitable for I/O-intensive jobs like MapReduce.
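
As a sketch of points 16 to 19, the Scala snippet below uses Spark to rewrite plain-text CSV landed in the lake as Snappy-compressed Parquet, which is binary, splittable and carries its own schema. The paths and the presence of a header row are assumptions.

    import org.apache.spark.sql.SparkSession

    object LandAsParquet {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("land-as-parquet")
          .getOrCreate()

        // Raw CSV dropped into the lake as plain text (hypothetical layout).
        val raw = spark.read
          .option("header", "true")
          .csv("hdfs:///lake/raw/clicks/2019-08-20/*.csv")

        // Rewrite as a binary, splittable, schema-carrying format with block-level Snappy compression.
        raw.write
          .option("compression", "snappy")
          .parquet("hdfs:///lake/curated/clicks/2019-08-20/")

        spark.stop()
      }
    }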

  • 8/20/2019 Demystifying the Big Data Ecosystem... - Param Natarajan

    5/8

20. Structured and semi-structured batch data that flows into the lake from databases, FTP servers or mainframes can be ingested using tools like Sqoop, which fetches data from the database and loads it in parallel into HDFS without overwhelming the database.

21. Unstructured and semi-structured data can flow into HDFS via streaming tools like Kafka, Storm, Spark Streaming and Flume.

22. Data stored in HDFS in the form of files is not reporting friendly and hence needs to be loaded into other big data stores for CRUD and reporting purposes.

23. For CRUD operations you can use a column-oriented database called HBase that stores its data in HDFS.

24. For SQL and analytical capabilities, connect Hive/Accumulo to HBase or HDFS to perform aggregations and joins.
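
Hive itself is queried with HiveQL; as a hedged Scala equivalent, the sketch below uses Spark SQL with Hive support enabled, so the same metastore tables can be joined and aggregated. The table and column names (orders, customers) are made up for illustration.

    import org.apache.spark.sql.SparkSession

    object HiveStyleReporting {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hive-style-reporting")
          .enableHiveSupport() // talks to the Hive metastore configured on the cluster
          .getOrCreate()

        // Join a fact table against a dimension table and aggregate, the kind of query
        // that would otherwise be written directly in HiveQL.
        val report = spark.sql(
          """SELECT c.region, COUNT(*) AS orders, SUM(o.amount) AS revenue
            |FROM orders o
            |JOIN customers c ON o.customer_id = c.id
            |GROUP BY c.region
            |ORDER BY revenue DESC""".stripMargin)

        report.show(20, truncate = false)
        spark.stop()
      }
    }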

25. Every entity will be represented in 100 different ways within an organisation, so it is best to transform these entities into a canonical structure that can be used by the consumers. For transformation of data that resides in HDFS into a canonical format you can use Pig, which uses Hadoop and MapReduce jobs underneath, or Spark workflows.
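
A hedged Scala sketch of that canonicalisation step using Spark (2.3+ assumed for unionByName): two hypothetical feeds that describe the same customer entity differently are mapped onto one agreed structure. All column names here are invented for illustration; Pig would express the same transformation in Pig Latin.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, to_date, upper}

    object CanonicaliseCustomers {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("canonicalise-customers").getOrCreate()

        // Two hypothetical source feeds that describe the same entity differently.
        val crm = spark.read.parquet("hdfs:///lake/raw/crm/customers")
        val web = spark.read.parquet("hdfs:///lake/raw/web/signups")

        // Map both onto one canonical structure for downstream consumers.
        val canonical =
          crm.select(col("cust_id").as("customer_id"),
                     upper(col("country_cd")).as("country"),
                     to_date(col("created"), "yyyy-MM-dd").as("created_on"))
            .unionByName(
          web.select(col("id").as("customer_id"),
                     upper(col("country")).as("country"),
                     to_date(col("signup_ts"), "yyyy-MM-dd").as("created_on")))

        canonical.write.mode("overwrite").parquet("hdfs:///lake/canonical/customers")
        spark.stop()
      }
    }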

26. Integrate Sqoop, Pig, Hive, MapReduce and HDFS jobs into a single workflow which can be scheduled or run on demand with tools like Oozie.

27. Oozie can schedule jobs based on a recurring time interval and frequency, or based on data availability from the upstream system (this is a powerful feature).

28. For free-text searching on unstructured files, move the data into Elasticsearch or Solr.

29. Spark provides a tightly integrated environment to ingest, transform and load data from a variety of big data sources into a variety of big data stores. It abstracts away the fact that the input file in HDFS is divided into a number of splits distributed across multiple nodes, using an abstraction called RDDs (Resilient Distributed Datasets), and hence is quickly rising in popularity as a one-stop shop.
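
A small Scala sketch of the RDD abstraction: the log files below are physically stored as many splits across Data Nodes, yet the program manipulates them as one distributed collection. The path and the log line layout are assumptions.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch"))

        // One logical dataset, even though HDFS stores it as many splits on many nodes.
        val lines = sc.textFile("hdfs:///lake/raw/app-logs/*.log")
        println(s"partitions (roughly one per split): ${lines.getNumPartitions}")

        // Classic map + reduceByKey: count log lines per severity level,
        // assuming lines start with a level token such as "ERROR Connection refused".
        val bySeverity = lines
          .map(line => (line.split(" ")(0), 1))
          .reduceByKey(_ + _)

        bySeverity.collect().foreach { case (level, n) => println(s"$level -> $n") }
        sc.stop()
      }
    }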

30. Tableau and Kibana can connect to Hive/Accumulo and Elasticsearch to provide dashboarding and visualisation capabilities for big data.

31. Kafka is a high-performance messaging middleware that can buffer huge amounts of data in its topics. It is quite useful when the producer is sending data faster than the consumer can consume it, which is usually the case.
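
A minimal Scala producer using the standard Kafka client, sketching how a fast producer simply appends to a topic while Kafka buffers the backlog for slower consumers. The broker address, topic name and message payloads are hypothetical.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object SensorProducer {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092") // hypothetical broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)

        // The producer appends to the topic; Kafka persists the backlog on disk
        // until slower consumers catch up.
        (1 to 1000).foreach { i =>
          val record = new ProducerRecord[String, String](
            "sensor-readings", s"device-$i", s"""{"temp": ${20 + i % 5}}""")
          producer.send(record)
        }

        producer.flush()
        producer.close()
      }
    }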

32. Flume, on the other hand, is the standard for fetching data from streaming sources like syslogs, directories and HTTP. It comes with a plethora of adapters that allow data to be pulled from various sources, buffered in a messaging channel (file or memory based) and written into sinks like HDFS and HBase in a format that is convenient to you.

33. Processing streams involves data being read from various devices using device adapters. This data is then pushed into the organisation using various protocols via Flume into Kafka; the data is then fetched from Kafka by either Apache Storm or Spark (as DStreams) and processed further.
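
A hedged Scala sketch of point 33 using the spark-streaming-kafka-0-10 integration: records published to a Kafka topic arrive in Spark as a DStream of micro-batches. The broker, topic and group id are assumptions.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    object KafkaToDStream {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("kafka-to-dstream")
        val ssc = new StreamingContext(conf, Seconds(10)) // micro-batches every 10 seconds

        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "broker1:9092", // hypothetical broker
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "stream-processors"
        )

        // Each micro-batch of Kafka records arrives as part of a DStream of consumer records.
        val stream = KafkaUtils.createDirectStream[String, String](
          ssc,
          LocationStrategies.PreferConsistent,
          ConsumerStrategies.Subscribe[String, String](Seq("sensor-readings"), kafkaParams))

        stream.map(_.value()).foreachRDD { rdd =>
          println(s"received ${rdd.count()} events in this micro-batch")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }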

  • 8/20/2019 Demystifying the Big Data Ecosystem... - Param Natarajan

    7/8

34. The outcome of stream processing is a set of actions, which may include interacting with the devices, kicking off some workflows or simply running some Java, Scala or shell programs.

35. Graph databases like Neo4j and Titan can also be populated during the ingestion phase and then later used for correlating and aggregating events during stream processing.

36. Cypher and Gremlin are graph query languages and can be used to query the graphs you created and also to write rules on them.

37. Correlation and aggregation of events can also be done using the window interval concept provided in tools like Spark. A window interval is a set of micro-batches, i.e. a group of trickling events that arrive within a configured time interval. All events within a window can be processed in a single iteration.
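
A minimal Scala sketch of the window concept, assuming a simple socket source for brevity: 10-second micro-batches are grouped into a 60-second window that slides every 20 seconds and is processed in one iteration.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object WindowedCounts {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("windowed-counts")
        val ssc = new StreamingContext(conf, Seconds(10)) // micro-batch interval

        // Hypothetical source: one device id per line on a TCP socket.
        val events = ssc.socketTextStream("localhost", 9999)

        // A 60-second window sliding every 20 seconds: a group of micro-batches
        // correlated and aggregated together in one iteration.
        val countsPerDevice = events
          .map(deviceId => (deviceId, 1))
          .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(20))

        countsPerDevice.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }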

38. Machine learning algorithms can be applied on top of the streams using Spark's built-in libraries to take actions on the streams that trickle in. However, training the model has to be done using a batch job that is periodically kicked off using an Oozie workflow.
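
A hedged Scala sketch of that split with Spark MLlib: the model is trained and persisted by a batch job (the kind an Oozie workflow would kick off periodically), and the streaming job loads it to score each micro-batch. The paths and the LibSVM data layout are assumptions.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
    import org.apache.spark.mllib.util.MLUtils

    // Batch job (e.g. scheduled by Oozie): retrain the model and persist it to HDFS.
    object TrainModelBatch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("train-model"))

        // Hypothetical labelled training data in LibSVM format on HDFS.
        val training = MLUtils.loadLibSVMFile(sc, "hdfs:///lake/training/events.libsvm")
        val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)

        model.save(sc, "hdfs:///models/event-classifier/latest")
        sc.stop()
      }
    }

    // Inside the streaming job, the persisted model is loaded once and applied per micro-batch:
    //   val model = LogisticRegressionModel.load(ssc.sparkContext, "hdfs:///models/event-classifier/latest")
    //   featureStream.foreachRDD(rdd => rdd.map(v => (v, model.predict(v))).foreach(println))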

39. Authentication in the Hadoop ecosystem can be achieved using the Kerberos protocol; all devices in the Hadoop ecosystem need to be part of the trusted Kerberos network.

40. Kerberos can sync up with LDAP if that is the default store for users within the organisation. However, for that to happen, both LDAP and Kerberos have to add each other to their trust stores.

41. Streams of data that flow in from Flume or Sqoop can be encrypted using SSL.

42. Authorization is more complex and happens at every component level. For example, authorization in Hadoop is controlled by a policy file called policy.xml that defines which activities can be performed by users or groups of users.

43. There are some incubating projects like Apache Knox and Apache Sentry that are trying to centralize the management of security policies.

Hope you find it useful! Thanks for reading; suggestions are welcome, and apologies in advance if I left out any important detail in the process.