Roman Nikitchenko, 09.10.2014
BIG.DATATechnology scope
2www.vitech.com.ua
Any real big data is just about DIGITAL LIFE FOOTPRINT
BIG DATA is not about the data. It is about OUR ABILITY TO HANDLE IT.
Arguments for meetings with management ;-)
But we are always special, aren't we?
What is our stack of big data technologies?
Our stack
Some of our specifics
A couple of buzzwords
YARN
● Linear scalability: 2 times more power costs 2 times more money.
● No natural keys, so load balancing is near perfect.
● No 'special' hardware, so staging is closer to production.
HADOOP magic is here!
What is HADOOP?
● Hadoop is an open source framework for big data: both distributed storage and distributed processing.
● Hadoop is reliable and fault tolerant without relying on hardware for these properties.
● Hadoop has unique horizontal scalability: currently from a single computer up to thousands of cluster nodes.
Why HADOOP?
[Slide graphic: BIG DATA multiplied to the MAX.]
What is HADOOP INDEED?
SIMPLE BUT RELIABLE
● A really big amount of data stored in a reliable manner.
● Storage is simple, recoverable and (relatively) cheap.
● The same goes for processing power.
COMPLEX INSIDE, SIMPLE OUTSIDE
● Complexity is buried inside. Most of the really complex operations are handled by the engine.
● The interface is remote and compatible between versions, so clients are relatively safe against implementation changes.
DECENTRALIZED
● No single point of failure (almost).
● Scalability as close to linear as possible.
● No manual actions needed to recover from failures.
Hadoop historical top view
● HDFS serves as the file system layer.
● MapReduce originally served as the distributed processing framework.
● The native client API is Java, but there are lots of alternatives.
● This is only the initial architecture; it is now more complex.
HDFS top view
● Namenode is the 'management' component. It keeps a 'directory' of which file blocks are stored where.
● The actual work is performed by data nodes.
HDFS is... scalable
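A minimal sketch of why this scales (plain Python with hypothetical names, not the real HDFS API): the namenode only answers "where are the blocks?", so the heavy traffic of actually reading block data goes to the datanodes.

```python
# Hypothetical namenode 'directory': file -> blocks, block -> replica nodes.
block_map = {
    "/logs/day1": ["blk_1", "blk_2"],
    "/logs/day2": ["blk_3"],
}
block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
    "blk_3": ["dn1", "dn4", "dn2"],
}

def locate(path):
    """Namenode's whole job for a read: map a file to (block, replicas) pairs."""
    return [(block, block_locations[block]) for block in block_map[path]]

# The client asks the namenode only for locations...
plan = locate("/logs/day1")
# ...and then streams each block from one of its datanodes directly,
# so adding datanodes adds both storage and read bandwidth.
```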
● Files are stored in large enough blocks. Every block is replicated to several data nodes.
● Replication is tracked by the namenode. Clients only locate blocks using the namenode; the actual load is taken by the datanodes.
● A datanode failure triggers replication recovery. The namenode can be backed by a standby scheme.
HDFS is... reliable
NO BACKUPS
● A 2-step data processing model: transform (map) and then reduce. Really nice for doing things in a distributed manner.
● A large class of jobs can be adapted to it, but not all of them.
MapReduce is...
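The 2-step model above can be sketched with the classic word count, in plain Python rather than the Hadoop API: map each line to (key, 1) pairs, shuffle pairs by key, then reduce each group.

```python
from collections import defaultdict

def map_phase(line):
    # Transform: one input line -> many (word, 1) pairs.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key; in a cluster this is the network shuffle.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key.
    return key, sum(values)

lines = ["big data big", "data hadoop"]
pairs = [p for line in lines for p in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# result == {"big": 2, "data": 2, "hadoop": 1}
```

Each phase touches only local data, which is exactly what makes the model easy to distribute.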
BIG DATA processing: requirements
● Work can be shared according to data placement.
● Work has to be balanced to reflect the resource balance.
DISTRIBUTION: LOAD HAS TO BE SHARED
DATA LOCALITY: TOOLS HAVE TO BE CLOSE TO THE WORK PLACE
● With MapReduce, data is processed on the same nodes it is stored on.
● Distributed storage means distributed processing.
DISTRIBUTION + LOCALITY: TOGETHER THEY GO!
[Slide graphic: YOUR DATA is split into partitions, the work to do is shared, each partition is processed locally, and the pieces form the joined result.]
Data partitioning drives work sharing. Good partitioning means good scalability.
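A conceptual sketch of that idea (plain Python, hypothetical names): records are partitioned, each "node" works only on its own partition, and the partial results are joined.

```python
def partition(records, num_nodes):
    # Hash partitioning: each record lands on exactly one node's partition.
    parts = [[] for _ in range(num_nodes)]
    for record in records:
        parts[hash(record) % num_nodes].append(record)
    return parts

def local_work(part):
    # Stand-in for any per-partition computation done where the data lives.
    return len(part)

parts = partition(["a", "b", "c", "d", "e", "f"], num_nodes=3)
joined = sum(local_work(p) for p in parts)
# Every record is processed exactly once, and the partitions are independent,
# so the local_work calls could run on different nodes in parallel.
```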
● A new component (YARN) forms the resource management layer and completes a real distributed data OS.
● MapReduce is from now on only one among other YARN applications.
Now with resource management
● Better resource balance for heterogeneous clusters and multiple applications.
● Dynamic applications over static services.
● A much wider application model than plain MapReduce. Things like Spark or Tez.
Why YARN is SO important?
The world's first DATA OS
A 10,000-node computer... Recent technology changes are focused on higher scale: better resource usage and control, lower MTTR, higher security, redundancy and fault tolerance.
Hadoop: don't do it yourself
● HortonWorks is 'barely open source'. Innovative, but 'running too fast': most of their key technologies are not so mature yet.
● Cloudera is stable enough but not stale. Hadoop 2.3 with YARN, HBase 0.98.x. Balance. Spark 1.x is a bold move!
● MapR focuses on performance per node, but they are slightly outdated in terms of functionality, and their distribution costs money. For cases where node performance is the top priority.
Choose your destiny! We did.
HBase motivation
● Designed for throughput, not for latency.
● HDFS blocks are expected to be large. There is an issue with lots of small files.
● A write once, read many times ideology.
● MapReduce is not so flexible, and neither is any database built on top of it.
● How about realtime?
But Hadoop is...
HBase motivation
BUT WE OFTEN NEED...
LATENCY, SPEED and all Hadoop properties.
[Slide graphic: the stack: high layer applications on top, YARN resource management in the middle, the distributed file system below.]
[Slide graphic: a table split into regions; each row has a key, column families and columns.]
● Data is placed in tables.
● Tables are split into regions based on row key ranges.
● Every table row is identified by a unique row key.
● Every row consists of columns.
● Columns are grouped into families.
Logical data model
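The logical model can be sketched as nested maps (plain Python with hypothetical keys, not the HBase client API): table -> row key -> family -> column -> value.

```python
# Hypothetical table: row keys map to families, families to columns.
table = {
    "user#001": {                                   # unique row key
        "info": {"name": "Alice", "city": "Kyiv"},  # family "info"
        "stats": {"visits": "42"},                  # family "stats"
    },
    "user#002": {
        "info": {"name": "Bob"},                    # rows can differ in columns
    },
}

# A cell is addressed by (row key, family, column):
value = table["user#001"]["info"]["city"]

# Regions are just contiguous row-key ranges, e.g. ["", "user#002") and
# ["user#002", ""), so row-key range scans map naturally onto regions.
```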
[Slide graphic: a region's column families stored as separate HFiles, each holding (row key, column, value, timestamp) records.]
● Data is stored in HFiles.
● Families are stored on disk in separate files.
● Row keys are indexed in memory.
● A column entry includes key, qualifier, value and timestamp.
● There is no column limit.
● Storage is block based.
● A delete is just another marker record.
● Periodic compaction is required.
Real data model
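Why deletes are markers and compaction is needed can be sketched in plain Python (a conceptual simulation, not HBase internals): the store is append-only, so compaction rewrites it keeping only the newest record per cell and dropping cells whose newest record is a delete marker.

```python
# Append-only cell records: (row, column, value, timestamp);
# value None stands for a delete marker.
cells = [
    ("r1", "c1", "v1", 1),
    ("r1", "c1", "v2", 2),     # newer version of r1/c1
    ("r1", "c2", "old", 1),
    ("r1", "c2", None, 3),     # delete marker shadows r1/c2
]

def compact(cells):
    """Keep only the newest record per (row, column); drop deleted cells."""
    latest = {}
    for row, col, value, ts in cells:
        key = (row, col)
        if key not in latest or ts > latest[key][1]:
            latest[key] = (value, ts)
    return {k: v for k, (v, ts) in latest.items() if v is not None}

compacted = compact(cells)
# Only r1/c1 with its newest value survives.
```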
[Slide graphic: client, Master and ZooKeeper in front of a row of region servers (RS) holding META and DATA regions.]
● Zookeeper coordinates the distributed elements and is the primary contact point for clients.
● The master server keeps metadata and manages data distribution over the region servers.
● Region servers manage the data table regions.
● Clients communicate directly with the region servers for data.
● Clients locate the master through ZooKeeper, then the needed regions through the master.
HBase: infrastructure view
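Region lookup can be sketched as a range search over sorted region start keys (plain Python with hypothetical names, not the HBase client API): once the client knows which server hosts the range containing its row key, it talks to that server directly.

```python
import bisect

# Hypothetical region map: sorted start keys and the server hosting each range.
region_starts = ["", "user#100", "user#200"]
region_servers = ["rs1", "rs2", "rs3"]

def locate_region(row_key):
    """Find the region whose [start, next_start) range contains the key."""
    index = bisect.bisect_right(region_starts, row_key) - 1
    return region_servers[index]

server = locate_region("user#150")
# The client caches this answer and sends reads/writes for that key range
# straight to the region server, keeping master and ZooKeeper off the hot path.
```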
[Slide graphic: the same picture with HDFS added: each rack holds data nodes (DN) co-located with region servers (RS), plus the NameNode, client, Master and ZooKeeper.]
● Zookeeper coordinates the distributed elements and is the primary contact point for clients.
● The master server keeps metadata and manages data distribution over the region servers.
● Region servers manage the data table regions.
● The actual data storage service, including replication, is provided by the HDFS data nodes.
● Clients communicate directly with the region servers for data.
● Clients locate the master through ZooKeeper, then the needed regions through the master.
Together with HDFS
DATA LAKE
Take as much data about your business processes as you can. The more data you have, the more value you can get from it.
… because coordinating distributed systems is a Zoo
Apache ZooKeeper
Apache ZooKeeper
We use this guy:
● As a part of the Hadoop / HBase infrastructure
● To coordinate MapReduce job tasks
Apache Spark
● A better MapReduce, with at least some MapReduce elements reusable.
● Dynamic, faster to start up, and it does not need anything special from the cluster.
● New job models, not only Map and Reduce.
● Results, including the final one, can be passed through memory.
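The "new job models" idea can be sketched in plain Python (a conceptual simulation of lazy, in-memory pipelines with hypothetical names, not the Spark API): a job is a chain of transformations that only executes on collect, with intermediates staying in memory instead of being written to disk between steps as in MapReduce.

```python
class Dataset:
    """Toy lazy dataset: transformations chain up, nothing runs until collect."""

    def __init__(self, items):
        self.items = items            # in a real cluster: partitioned data

    def map(self, fn):
        return Dataset(map(fn, self.items))       # lazy: wraps a generator

    def filter(self, fn):
        return Dataset(filter(fn, self.items))    # also lazy

    def collect(self):
        return list(self.items)       # only here is the whole chain executed

result = (Dataset(range(10))
          .map(lambda x: x * x)
          .filter(lambda x: x % 2 == 0)
          .collect())
# result == [0, 4, 16, 36, 64]
```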
● SOLR indexes documents. What is stored in the SOLR index is not what you index. SOLR is NOT A STORAGE, ONLY AN INDEX.
● But it can index ANYTHING. A search result is a set of document IDs.
● An index update request is analyzed, tokenized, transformed... and the same happens to queries.
SOLR is just about search
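The "index, not storage" point can be sketched as a tiny inverted index in plain Python (hypothetical names; the `analyze` function stands in for SOLR's full analysis chain): only tokens and document IDs are stored, and both updates and queries pass through the same analysis.

```python
from collections import defaultdict

index = defaultdict(set)              # token -> set of document IDs

def analyze(text):
    # Stand-in for SOLR's analyzer/tokenizer/filter chain.
    return text.lower().split()

def add_document(doc_id, text):
    for token in analyze(text):
        index[token].add(doc_id)      # the text itself is NOT stored

def search(query):
    # Queries go through the same analysis; results are document IDs only.
    return set.intersection(*(index[token] for token in analyze(query)))

add_document("doc1", "Big Data with Hadoop")
add_document("doc2", "Big Data with HBase")
ids = search("big hbase")
# ids == {"doc2"}: to get the document itself, you go back to the storage.
```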
● HBase handles online user data change requests.
● The NGData Lily indexer handles the stream of changes and transforms it into SOLR index change requests.
● Indexes are built in SOLR, so HBase data becomes searchable.
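The indexer's role can be sketched as a change-log consumer (a conceptual simulation with hypothetical names, not the Lily or HBase replication API): writes and deletes to the store are recorded as events, and replaying them keeps the index mirroring the store without clients ever touching it.

```python
store, index = {}, {}
change_log = []                       # stands in for the HBase change stream

def put(row, text):
    store[row] = text
    change_log.append(("put", row, text))

def delete(row):
    store.pop(row, None)
    change_log.append(("delete", row, None))

def run_indexer():
    """The 'Lily indexer' role: turn change events into index updates."""
    for op, row, text in change_log:
        if op == "put":
            index[row] = text.lower().split()   # tokenized, SOLR-style
        else:
            index.pop(row, None)
    change_log.clear()

put("r1", "Big Data")
delete("r1")
put("r2", "HBase rocks")
run_indexer()
# After replay the index mirrors the store: only r2 is searchable.
```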
ENTERPRISE DATA HUB
Don't ruin your existing data warehouse. Just extend it with new, centralized big data storage through a data migration solution.
HBase: Data and search integration
[Slide graphic: the client just puts (or deletes) data in HBase regions, with HDFS serving the low level file system. HBase replication, which can be set up down to column family level, feeds the Lily HBase NRT indexer; it translates data changes into SOLR index updates. The SOLR cloud serves search requests (HTTP) and finally provides search; Apache Zookeeper does all the coordination.]
Questions and discussion