Demystifying the Big Data Ecosystem... - Param Natarajan

When I started reading up on big data I was perplexed by the terminology and the vast ecosystem that surrounds it. This article is a humble attempt to demystify this complex landscape and will give you just enough information to start relating to the terms, terminologies and technologies surrounding the BIG DATA ecosystem.

You will find it useful as a starter kit, whether you are embarking on a new IoT project or trying to build out a data lake for your enterprise. I have stuck with the open source options available for ease of explanation...

1. Let's start simple... Big Data is data so big that it cannot fit in, or be processed on, one box, and it cannot travel from one place to another as a whole.

2. Since it cannot fit into one box, each file has to be split and spread across multiple boxes using a distributed file system called HDFS.

3. HDFS hence contains Data Nodes, which store the splits.

4. The Name Node is a bookkeeper that keeps track of, at a minimum, which split resides on which Data Node. The Name Node is periodically backed up by a Secondary Name Node, as most of its data is kept in memory for performance reasons.

5. HDFS also replicates the data, storing the replicas across nodes and data centre racks to improve reliability against node failures.
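
To make points 2 to 5 concrete, here is a minimal Scala sketch against the Hadoop FileSystem API that asks the Name Node where the blocks of one file live and what its replication factor is. The file path is hypothetical, and the cluster address is assumed to come from the usual core-site.xml/hdfs-site.xml on the classpath.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object WhereAreMyBlocks {
      def main(args: Array[String]): Unit = {
        // Assumes the Hadoop config files on the classpath point fs.defaultFS at the cluster.
        val fs = FileSystem.get(new Configuration())

        val file = new Path("/data/raw/events.log") // hypothetical file in the lake
        val status = fs.getFileStatus(file)
        println(s"replication factor = ${status.getReplication}, block size = ${status.getBlockSize}")

        // The Name Node answers this lookup: every block (split) maps to a set of Data Nodes.
        fs.getFileBlockLocations(status, 0, status.getLen).zipWithIndex.foreach {
          case (block, i) =>
            println(s"block $i: offset=${block.getOffset}, hosts=${block.getHosts.mkString(", ")}")
        }
        fs.close()
      }
    }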

6. The Hadoop 1.0 ecosystem consists of HDFS (Name Node, Data Nodes and Secondary Name Node), a task scheduler called the Job Tracker and a task executor called the Task Tracker.

7. The Task Tracker runs tasks allocated by the Job Tracker in a separate JVM directly on the data node. While scheduling tasks on the Task Tracker, the Job Tracker keeps data locality in mind to avoid network latency.

8. Hadoop 2.0 has alleviated several scale and performance issues that were present in Hadoop 1.0. The Job Tracker in Hadoop 1.0 used to perform scheduling, monitoring and job history tracking; this responsibility is now distributed between the Resource Manager, the Application Master and the Timeline Server. Task Trackers are replaced with Node Managers, which run application resource containers, an upgrade over splitting the cluster capacity into fixed map and reduce slots as in Hadoop 1.0.

9. The Name Node was a single point of failure in Hadoop 1.0. As of Hadoop 2.0 it is highly available using a cluster capability called the Quorum Journal Manager, along with ZooKeeper, which helps manage leader election.

10. Name Nodes can be federated in Hadoop 2.0, with each Name Node managing a portion of the file namespace in a massive cluster, thereby improving scale.

11. The servers that host the Hadoop components are called commodity servers. Commodity doesn't mean cheap, low-end servers; it means a device that is relatively inexpensive, widely available and interchangeable with other hardware of its type.

12. The definition of a commodity server changes year on year, essentially because, at the same price point, the processing power available in data centres continues to increase rapidly. As an example, in 2009 a commodity server meant 8 cores, 16 GB of RAM and 4x1 TB of disk; by 2012 it meant 16+ cores, 48 to 96 GB of RAM and 12x2 TB or 12x3 TB of disk.

13. Since the data cannot travel, tasks have to come to the data that resides on a Data Node to perform processing on each split, convert each split into key-value pairs (map) and then aggregate across the splits based on key (reduce) to perform a meaningful computation. This job is called MapReduce and, as of Hadoop 1.0, involves the Job Tracker (scheduler) and Task Trackers (workers that operate on the data splits). A sketch follows below.
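
The sketch below is the classic word count against the Hadoop MapReduce Java API, written in Scala, to make the map/reduce split-and-aggregate idea concrete. Input and output paths are passed in as arguments; the class names are just illustrative.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

    // Map phase: each task reads one split and emits (word, 1) pairs.
    class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
      private val one = new IntWritable(1)
      private val word = new Text()
      override def map(key: LongWritable, value: Text,
                       ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
        value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
          word.set(w); ctx.write(word, one)
        }
    }

    // Reduce phase: all values for the same key are aggregated by one reducer.
    class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
      override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                          ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
        var sum = 0
        val it = values.iterator()
        while (it.hasNext) sum += it.next().get()
        ctx.write(key, new IntWritable(sum))
      }
    }

    object WordCount {
      def main(args: Array[String]): Unit = {
        val job = Job.getInstance(new Configuration(), "word count")
        job.setJarByClass(classOf[TokenMapper])
        job.setMapperClass(classOf[TokenMapper])
        job.setReducerClass(classOf[SumReducer])
        job.setOutputKeyClass(classOf[Text])
        job.setOutputValueClass(classOf[IntWritable])
        FileInputFormat.addInputPath(job, new Path(args(0)))
        FileOutputFormat.setOutputPath(job, new Path(args(1)))
        System.exit(if (job.waitForCompletion(true)) 0 else 1)
      }
    }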

14. In the past we had to model the data before ingesting it; now, with big data, structured (database), semi-structured (log files) and unstructured (images) data can all become part of your data analysis landscape. The sources of data can range from log files, sensors and clickstreams to databases.

15. Every organization aggregates and analyses this data in a central place called a Data Lake. A data lake is different from a data warehouse in that a data warehouse can only store structured and modelled data, while a data lake can store unstructured and semi-structured data as well as structured data.

16. Data can be stored in the data lake in a plain text format, but it is preferred to store it in a compressed, splittable, binary format to save space and exploit the underlying power of HDFS and MapReduce.

17. Also, given the changing nature of data, these file formats have to support schema evolution. It is also desirable that every data set comes with its own schema, which makes the data set self-describing.

18. Hadoop provides a key-value file format called the Sequence File format for this purpose, but it is limited to Java programs. There are other file formats like Avro, Thrift and Parquet that support both reading and writing files to HDFS in a language-agnostic way.

19. These splittable data formats can be further compressed using a compression codec like Snappy or Bzip2, which allows them to occupy much less space and reduces network bandwidth when shuffling data across the nodes during job execution, thereby improving performance. Please note that compression is a CPU-intensive process and is more suitable for I/O-intensive jobs like MapReduce.
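
As a sketch of points 16 to 19, the Scala snippet below uses Spark to rewrite plain-text CSV landed in the lake as Snappy-compressed Parquet, which is binary, splittable and carries its own schema. The paths and the presence of a header row are assumptions.

    import org.apache.spark.sql.SparkSession

    object LandAsParquet {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("land-as-parquet")
          .getOrCreate()

        // Raw CSV dropped into the lake as plain text (hypothetical layout).
        val raw = spark.read
          .option("header", "true")
          .csv("hdfs:///lake/raw/clicks/2019-08-20/*.csv")

        // Rewrite as a binary, splittable, schema-carrying format with block-level Snappy compression.
        raw.write
          .option("compression", "snappy")
          .parquet("hdfs:///lake/curated/clicks/2019-08-20/")

        spark.stop()
      }
    }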

  • 8/20/2019 Demystifying the Big Data Ecosystem... - Param Natarajan

    5/8

20. Structured and semi-structured batch data that flows into the lake from databases, FTP servers or mainframes can be ingested using tools like Sqoop, which fetches data from the database and loads it in parallel into HDFS without overwhelming the database.

21. Unstructured and semi-structured data can flow into HDFS via streaming tools like Kafka, Storm, Spark Streaming and Flume.

22. Data stored in HDFS in the form of files is not reporting friendly and hence needs to be loaded into other big data stores for CRUD and reporting purposes.

23. For CRUD operations you can use a column-oriented database called HBase that stores its data in HDFS.

24. For SQL and analytical capabilities, connect Hive/Accumulo to HBase or HDFS to perform aggregations and joins.
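
Hive itself is queried with HiveQL; as a hedged Scala equivalent, the sketch below uses Spark SQL with Hive support enabled, so the same metastore tables can be joined and aggregated. The table and column names (orders, customers) are made up for illustration.

    import org.apache.spark.sql.SparkSession

    object HiveStyleReporting {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hive-style-reporting")
          .enableHiveSupport() // talks to the Hive metastore configured on the cluster
          .getOrCreate()

        // Join a fact table against a dimension table and aggregate, the kind of query
        // that would otherwise be written directly in HiveQL.
        val report = spark.sql(
          """SELECT c.region, COUNT(*) AS orders, SUM(o.amount) AS revenue
            |FROM orders o
            |JOIN customers c ON o.customer_id = c.id
            |GROUP BY c.region
            |ORDER BY revenue DESC""".stripMargin)

        report.show(20, truncate = false)
        spark.stop()
      }
    }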

25. Every entity will be represented in 100 different ways within an organisation, so it is best to transform these entities into a canonical structure that can be used by the consumers. For transformation of data that resides in HDFS into a canonical format you can use Pig, which uses Hadoop and MapReduce jobs underneath, or Spark workflows.
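
A hedged Scala sketch of that canonicalisation step using Spark (2.3+ assumed for unionByName): two hypothetical feeds that describe the same customer entity differently are mapped onto one agreed structure. All column names here are invented for illustration; Pig would express the same transformation in Pig Latin.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, to_date, upper}

    object CanonicaliseCustomers {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("canonicalise-customers").getOrCreate()

        // Two hypothetical source feeds that describe the same entity differently.
        val crm = spark.read.parquet("hdfs:///lake/raw/crm/customers")
        val web = spark.read.parquet("hdfs:///lake/raw/web/signups")

        // Map both onto one canonical structure for downstream consumers.
        val canonical =
          crm.select(col("cust_id").as("customer_id"),
                     upper(col("country_cd")).as("country"),
                     to_date(col("created"), "yyyy-MM-dd").as("created_on"))
            .unionByName(
          web.select(col("id").as("customer_id"),
                     upper(col("country")).as("country"),
                     to_date(col("signup_ts"), "yyyy-MM-dd").as("created_on")))

        canonical.write.mode("overwrite").parquet("hdfs:///lake/canonical/customers")
        spark.stop()
      }
    }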

26. Integrate Sqoop, Pig, Hive, MapReduce and HDFS jobs into a single workflow which can be scheduled or run on demand with tools like Oozie.

27. Oozie can schedule jobs based on a recurring time interval and frequency, or based on data availability from the upstream system (this is a powerful feature).

28. For free-text searching on unstructured files, move the data into Elasticsearch or Solr.

29. Spark provides a tightly integrated environment to ingest, transform and load data from a variety of big data sources into a variety of big data stores. It abstracts away the fact that the input file in HDFS is divided into a number of splits distributed across multiple nodes, using an abstraction called RDDs (Resilient Distributed Datasets), and hence is quickly rising in popularity as a one-stop shop.
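
A small Scala sketch of the RDD abstraction: the log files below are physically stored as many splits across Data Nodes, yet the program manipulates them as one distributed collection. The path and the log line layout are assumptions.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch"))

        // One logical dataset, even though HDFS stores it as many splits on many nodes.
        val lines = sc.textFile("hdfs:///lake/raw/app-logs/*.log")
        println(s"partitions (roughly one per split): ${lines.getNumPartitions}")

        // Classic map + reduceByKey: count log lines per severity level,
        // assuming lines start with a level token such as "ERROR Connection refused".
        val bySeverity = lines
          .map(line => (line.split(" ")(0), 1))
          .reduceByKey(_ + _)

        bySeverity.collect().foreach { case (level, n) => println(s"$level -> $n") }
        sc.stop()
      }
    }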

30. Tableau and Kibana can connect to Hive/Accumulo and Elasticsearch to provide dashboarding and visualisation capabilities for big data.

31. Kafka is a high-performance messaging middleware that can buffer huge amounts of data in its topics. It is quite useful when the producer is sending data faster than the consumer can consume it, which is usually the case.
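
A minimal Scala producer using the standard Kafka client, sketching how a fast producer simply appends to a topic while Kafka buffers the backlog for slower consumers. The broker address, topic name and message payloads are hypothetical.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object SensorProducer {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092") // hypothetical broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)

        // The producer appends to the topic; Kafka persists the backlog on disk
        // until slower consumers catch up.
        (1 to 1000).foreach { i =>
          val record = new ProducerRecord[String, String](
            "sensor-readings", s"device-$i", s"""{"temp": ${20 + i % 5}}""")
          producer.send(record)
        }

        producer.flush()
        producer.close()
      }
    }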

32. Flume, on the other hand, is the standard for fetching data from streaming sources like syslogs, directories and HTTP. It comes with a plethora of adapters that allow data to be pulled from various sources, buffered in a messaging channel (file or memory based) and written into sinks like HDFS and HBase in a format that is convenient to you.

33. Processing streams involves data being read from various devices using device adapters. This data is then pushed into the organisation using various protocols via Flume into Kafka; the data is then fetched from Kafka by either Apache Storm or Spark (as DStreams) and processed further.
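
A hedged Scala sketch of point 33 using the spark-streaming-kafka-0-10 integration: records published to a Kafka topic arrive in Spark as a DStream of micro-batches. The broker, topic and group id are assumptions.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    object KafkaToDStream {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("kafka-to-dstream")
        val ssc = new StreamingContext(conf, Seconds(10)) // micro-batches every 10 seconds

        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "broker1:9092", // hypothetical broker
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "stream-processors"
        )

        // Each micro-batch of Kafka records arrives as part of a DStream of consumer records.
        val stream = KafkaUtils.createDirectStream[String, String](
          ssc,
          LocationStrategies.PreferConsistent,
          ConsumerStrategies.Subscribe[String, String](Seq("sensor-readings"), kafkaParams))

        stream.map(_.value()).foreachRDD { rdd =>
          println(s"received ${rdd.count()} events in this micro-batch")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }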

  • 8/20/2019 Demystifying the Big Data Ecosystem... - Param Natarajan

    7/8

34. The outcome of stream processing is a set of actions, which may include interacting with the devices, kicking off some workflows or simply running some Java, Scala or shell programs.

35. Graph databases like Neo4j and Titan can also be populated during the ingestion phase and then later used for correlating and aggregating events during stream processing.

36. Cypher and Gremlin are graph query languages and can be used to query the graphs you created and also to write rules on them.

37. Correlation and aggregation of events can also be done using the window interval concept provided in tools like Spark. A window interval is a set of micro-batches, i.e. a group of trickling events that arrive within a configured time interval. All events within a window can be processed in a single iteration.
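
A minimal Scala sketch of the window concept, assuming a simple socket source for brevity: 10-second micro-batches are grouped into a 60-second window that slides every 20 seconds and is processed in one iteration.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object WindowedCounts {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("windowed-counts")
        val ssc = new StreamingContext(conf, Seconds(10)) // micro-batch interval

        // Hypothetical source: one device id per line on a TCP socket.
        val events = ssc.socketTextStream("localhost", 9999)

        // A 60-second window sliding every 20 seconds: a group of micro-batches
        // correlated and aggregated together in one iteration.
        val countsPerDevice = events
          .map(deviceId => (deviceId, 1))
          .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(20))

        countsPerDevice.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }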

38. Machine learning algorithms can be applied on top of the streams using Spark's built-in libraries to take actions on the streams that trickle in. However, training the model has to be done using a batch job that is periodically kicked off using an Oozie workflow.
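
A hedged Scala sketch of that split with Spark MLlib: the model is trained and persisted by a batch job (the kind an Oozie workflow would kick off periodically), and the streaming job loads it to score each micro-batch. The paths and the LibSVM data layout are assumptions.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
    import org.apache.spark.mllib.util.MLUtils

    // Batch job (e.g. scheduled by Oozie): retrain the model and persist it to HDFS.
    object TrainModelBatch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("train-model"))

        // Hypothetical labelled training data in LibSVM format on HDFS.
        val training = MLUtils.loadLibSVMFile(sc, "hdfs:///lake/training/events.libsvm")
        val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)

        model.save(sc, "hdfs:///models/event-classifier/latest")
        sc.stop()
      }
    }

    // Inside the streaming job, the persisted model is loaded once and applied per micro-batch:
    //   val model = LogisticRegressionModel.load(ssc.sparkContext, "hdfs:///models/event-classifier/latest")
    //   featureStream.foreachRDD(rdd => rdd.map(v => (v, model.predict(v))).foreach(println))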

39. Authentication in the Hadoop ecosystem can be achieved using the Kerberos protocol; all devices in the Hadoop ecosystem need to be part of the trusted Kerberos network.

40. Kerberos can sync up with LDAP if that is the default store for users within the organisation. However, for that to happen, both LDAP and Kerberos have to add each other to their trust stores.

41. Streams of data that flow in from Flume or Sqoop can be encrypted using SSL.

42. Authorization is more complex and happens at every component level. For example, authorization in Hadoop is controlled by a policy file called policy.xml that defines which activities can be performed by users or groups of users.

43. There are some incubating projects like Apache Knox and Apache Sentry that are trying to centralize the management of security policies.

Hope you find it useful! Thanks for reading; suggestions are welcome, and apologies in advance if I left out any important detail in the process.