Roman Nikitchenko, 09.10.2014
BIG.DATATechnology scope
2www.vitech.com.ua
Any real big data is just about DIGITAL LIFE FOOTPRINT
BIG DATA is not about the data. It is about OUR ABILITY TO HANDLE IT.
Arguments for meetings with management ;-)
But we are always special, aren't we?
What is our stack of big data technologies?
Our stack
Some of our specifics
A couple of buzzwords
YARN
● Linear scalability: 2 times more power costs 2 times more money.
● No natural keys, so load balancing is near perfect.
● No 'special' hardware, so staging is closer to production.
HADOOP magic is here!
What is HADOOP?
● Hadoop is an open source framework for big data: both distributed storage and distributed processing.
● Hadoop is reliable and fault tolerant without relying on hardware for these properties.
● Hadoop has unique horizontal scalability: currently from a single computer up to thousands of cluster nodes.
Why HADOOP?
[Slide graphic: BIG DATA multiplied to the MAX.]
What is HADOOP INDEED?
SIMPLE BUT RELIABLE
● A really big amount of data stored in a reliable manner.
● Storage is simple, recoverable and (relatively) cheap.
● The same goes for processing power.
COMPLEX INSIDE, SIMPLE OUTSIDE
● Complexity is buried inside. Most of the really complex operations are handled by the engine.
● The interface is remote and compatible between versions, so clients are relatively safe against implementation changes.
DECENTRALIZED
● No single point of failure (almost).
● Scalability as close to linear as possible.
● No manual actions needed to recover from failures.
Hadoop historical top view
● HDFS serves as the file system layer.
● MapReduce originally served as the distributed processing framework.
● The native client API is Java, but there are lots of alternatives.
● This is only the initial architecture; it is now more complex.
HDFS top view
● Namenode is the 'management' component. It keeps a 'directory' of which file blocks are stored where.
● The actual work is performed by data nodes.
HDFS is... scalable
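A minimal sketch of why this scales (plain Python with hypothetical names, not the real HDFS API): the namenode only answers "where are the blocks?", so the heavy traffic of actually reading block data goes to the datanodes.

```python
# Hypothetical namenode 'directory': file -> blocks, block -> replica nodes.
block_map = {
    "/logs/day1": ["blk_1", "blk_2"],
    "/logs/day2": ["blk_3"],
}
block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
    "blk_3": ["dn1", "dn4", "dn2"],
}

def locate(path):
    """Namenode's whole job for a read: map a file to (block, replicas) pairs."""
    return [(block, block_locations[block]) for block in block_map[path]]

# The client asks the namenode only for locations...
plan = locate("/logs/day1")
# ...and then streams each block from one of its datanodes directly,
# so adding datanodes adds both storage and read bandwidth.
```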
● Files are stored in large enough blocks. Every block is replicated to several data nodes.
● Replication is tracked by the namenode. Clients only locate blocks using the namenode; the actual load is taken by the datanodes.
● A datanode failure triggers replication recovery. The namenode can be backed by a standby scheme.
HDFS is... reliable
NO BACKUPS
● A 2-step data processing model: transform (map) and then reduce. Really nice for doing things in a distributed manner.
● A large class of jobs can be adapted to it, but not all of them.
MapReduce is...
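The 2-step model above can be sketched with the classic word count, in plain Python rather than the Hadoop API: map each line to (key, 1) pairs, shuffle pairs by key, then reduce each group.

```python
from collections import defaultdict

def map_phase(line):
    # Transform: one input line -> many (word, 1) pairs.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key; in a cluster this is the network shuffle.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key.
    return key, sum(values)

lines = ["big data big", "data hadoop"]
pairs = [p for line in lines for p in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# result == {"big": 2, "data": 2, "hadoop": 1}
```

Each phase touches only local data, which is exactly what makes the model easy to distribute.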
BIG DATA processing: requirements
● Work can be shared according to data placement.
● Work has to be balanced to reflect the resource balance.
DISTRIBUTION: LOAD HAS TO BE SHARED
DATA LOCALITY: TOOLS HAVE TO BE CLOSE TO THE WORK PLACE
● With MapReduce, data is processed on the same nodes it is stored on.
● Distributed storage means distributed processing.
DISTRIBUTION + LOCALITY: TOGETHER THEY GO!
[Slide graphic: YOUR DATA is split into partitions, the work to do is shared, each partition is processed locally, and the pieces form the joined result.]
Data partitioning drives work sharing. Good partitioning means good scalability.
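A conceptual sketch of that idea (plain Python, hypothetical names): records are partitioned, each "node" works only on its own partition, and the partial results are joined.

```python
def partition(records, num_nodes):
    # Hash partitioning: each record lands on exactly one node's partition.
    parts = [[] for _ in range(num_nodes)]
    for record in records:
        parts[hash(record) % num_nodes].append(record)
    return parts

def local_work(part):
    # Stand-in for any per-partition computation done where the data lives.
    return len(part)

parts = partition(["a", "b", "c", "d", "e", "f"], num_nodes=3)
joined = sum(local_work(p) for p in parts)
# Every record is processed exactly once, and the partitions are independent,
# so the local_work calls could run on different nodes in parallel.
```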
● A new component (YARN) forms the resource management layer and completes a real distributed data OS.
● MapReduce is from now on only one among other YARN applications.
Now with resource management
● Better resource balance for heterogeneous clusters and multiple applications.
● Dynamic applications over static services.
● A much wider application model than plain MapReduce. Things like Spark or Tez.
Why YARN is SO important?
The world's first DATA OS
A 10,000-node computer... Recent technology changes are focused on higher scale: better resource usage and control, lower MTTR, higher security, redundancy and fault tolerance.
Hadoop: don't do it yourself
● HortonWorks is 'barely open source'. Innovative, but 'running too fast': most of their key technologies are not so mature yet.
● Cloudera is stable enough but not stale. Hadoop 2.3 with YARN, HBase 0.98.x. Balance. Spark 1.x is a bold move!
● MapR focuses on performance per node, but they are slightly outdated in terms of functionality, and their distribution costs money. For cases where node performance is the top priority.
Choose your destiny! We did.
HBase motivation
● Designed for throughput, not for latency.
● HDFS blocks are expected to be large. There is an issue with lots of small files.
● A write once, read many times ideology.
● MapReduce is not so flexible, and neither is any database built on top of it.
● How about realtime?
But Hadoop is...
HBase motivation
BUT WE OFTEN NEED...
LATENCY, SPEED and all Hadoop properties.
[Slide graphic: the stack: high layer applications on top, YARN resource management in the middle, the distributed file system below.]
[Slide graphic: a table split into regions; each row has a key, column families and columns.]
● Data is placed in tables.
● Tables are split into regions based on row key ranges.
● Every table row is identified by a unique row key.
● Every row consists of columns.
● Columns are grouped into families.
Logical data model
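The logical model can be sketched as nested maps (plain Python with hypothetical keys, not the HBase client API): table -> row key -> family -> column -> value.

```python
# Hypothetical table: row keys map to families, families to columns.
table = {
    "user#001": {                                   # unique row key
        "info": {"name": "Alice", "city": "Kyiv"},  # family "info"
        "stats": {"visits": "42"},                  # family "stats"
    },
    "user#002": {
        "info": {"name": "Bob"},                    # rows can differ in columns
    },
}

# A cell is addressed by (row key, family, column):
value = table["user#001"]["info"]["city"]

# Regions are just contiguous row-key ranges, e.g. ["", "user#002") and
# ["user#002", ""), so row-key range scans map naturally onto regions.
```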
[Slide graphic: a region's column families stored as separate HFiles, each holding (row key, column, value, timestamp) records.]
● Data is stored in HFiles.
● Families are stored on disk in separate files.
● Row keys are indexed in memory.
● A column entry includes key, qualifier, value and timestamp.
● There is no column limit.
● Storage is block based.
● A delete is just another marker record.
● Periodic compaction is required.
Real data model
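Why deletes are markers and compaction is needed can be sketched in plain Python (a conceptual simulation, not HBase internals): the store is append-only, so compaction rewrites it keeping only the newest record per cell and dropping cells whose newest record is a delete marker.

```python
# Append-only cell records: (row, column, value, timestamp);
# value None stands for a delete marker.
cells = [
    ("r1", "c1", "v1", 1),
    ("r1", "c1", "v2", 2),     # newer version of r1/c1
    ("r1", "c2", "old", 1),
    ("r1", "c2", None, 3),     # delete marker shadows r1/c2
]

def compact(cells):
    """Keep only the newest record per (row, column); drop deleted cells."""
    latest = {}
    for row, col, value, ts in cells:
        key = (row, col)
        if key not in latest or ts > latest[key][1]:
            latest[key] = (value, ts)
    return {k: v for k, (v, ts) in latest.items() if v is not None}

compacted = compact(cells)
# Only r1/c1 with its newest value survives.
```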
[Slide graphic: client, Master and ZooKeeper in front of a row of region servers (RS) holding META and DATA regions.]
● Zookeeper coordinates the distributed elements and is the primary contact point for clients.
● The master server keeps metadata and manages data distribution over the region servers.
● Region servers manage the data table regions.
● Clients communicate directly with the region servers for data.
● Clients locate the master through ZooKeeper, then the needed regions through the master.
HBase: infrastructure view
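Region lookup can be sketched as a range search over sorted region start keys (plain Python with hypothetical names, not the HBase client API): once the client knows which server hosts the range containing its row key, it talks to that server directly.

```python
import bisect

# Hypothetical region map: sorted start keys and the server hosting each range.
region_starts = ["", "user#100", "user#200"]
region_servers = ["rs1", "rs2", "rs3"]

def locate_region(row_key):
    """Find the region whose [start, next_start) range contains the key."""
    index = bisect.bisect_right(region_starts, row_key) - 1
    return region_servers[index]

server = locate_region("user#150")
# The client caches this answer and sends reads/writes for that key range
# straight to the region server, keeping master and ZooKeeper off the hot path.
```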
[Slide graphic: the same picture with HDFS added: each rack holds data nodes (DN) co-located with region servers (RS), plus the NameNode, client, Master and ZooKeeper.]
● Zookeeper coordinates the distributed elements and is the primary contact point for clients.
● The master server keeps metadata and manages data distribution over the region servers.
● Region servers manage the data table regions.
● The actual data storage service, including replication, is provided by the HDFS data nodes.
● Clients communicate directly with the region servers for data.
● Clients locate the master through ZooKeeper, then the needed regions through the master.
Together with HDFS
DATA LAKE
Take as much data about your business processes as you can. The more data you have, the more value you can get from it.
… because coordinating distributed systems is a Zoo
Apache ZooKeeper
Apache ZooKeeper
We use this guy:
● As a part of the Hadoop / HBase infrastructure
● To coordinate MapReduce job tasks
Apache Spark
● A better MapReduce, with at least some MapReduce elements reusable.
● Dynamic, faster to start up, and it does not need anything special from the cluster.
● New job models, not only Map and Reduce.
● Results, including the final one, can be passed through memory.
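The "new job models" idea can be sketched in plain Python (a conceptual simulation of lazy, in-memory pipelines with hypothetical names, not the Spark API): a job is a chain of transformations that only executes on collect, with intermediates staying in memory instead of being written to disk between steps as in MapReduce.

```python
class Dataset:
    """Toy lazy dataset: transformations chain up, nothing runs until collect."""

    def __init__(self, items):
        self.items = items            # in a real cluster: partitioned data

    def map(self, fn):
        return Dataset(map(fn, self.items))       # lazy: wraps a generator

    def filter(self, fn):
        return Dataset(filter(fn, self.items))    # also lazy

    def collect(self):
        return list(self.items)       # only here is the whole chain executed

result = (Dataset(range(10))
          .map(lambda x: x * x)
          .filter(lambda x: x % 2 == 0)
          .collect())
# result == [0, 4, 16, 36, 64]
```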
● SOLR indexes documents. What is stored in the SOLR index is not what you index. SOLR is NOT A STORAGE, ONLY AN INDEX.
● But it can index ANYTHING. A search result is a set of document IDs.
● An index update request is analyzed, tokenized, transformed... and the same happens to queries.
SOLR is just about search
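The "index, not storage" point can be sketched as a tiny inverted index in plain Python (hypothetical names; the `analyze` function stands in for SOLR's full analysis chain): only tokens and document IDs are stored, and both updates and queries pass through the same analysis.

```python
from collections import defaultdict

index = defaultdict(set)              # token -> set of document IDs

def analyze(text):
    # Stand-in for SOLR's analyzer/tokenizer/filter chain.
    return text.lower().split()

def add_document(doc_id, text):
    for token in analyze(text):
        index[token].add(doc_id)      # the text itself is NOT stored

def search(query):
    # Queries go through the same analysis; results are document IDs only.
    return set.intersection(*(index[token] for token in analyze(query)))

add_document("doc1", "Big Data with Hadoop")
add_document("doc2", "Big Data with HBase")
ids = search("big hbase")
# ids == {"doc2"}: to get the document itself, you go back to the storage.
```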
● HBase handles online user data change requests.
● The NGData Lily indexer handles the stream of changes and transforms it into SOLR index change requests.
● Indexes are built in SOLR, so HBase data becomes searchable.
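The indexer's role can be sketched as a change-log consumer (a conceptual simulation with hypothetical names, not the Lily or HBase replication API): writes and deletes to the store are recorded as events, and replaying them keeps the index mirroring the store without clients ever touching it.

```python
store, index = {}, {}
change_log = []                       # stands in for the HBase change stream

def put(row, text):
    store[row] = text
    change_log.append(("put", row, text))

def delete(row):
    store.pop(row, None)
    change_log.append(("delete", row, None))

def run_indexer():
    """The 'Lily indexer' role: turn change events into index updates."""
    for op, row, text in change_log:
        if op == "put":
            index[row] = text.lower().split()   # tokenized, SOLR-style
        else:
            index.pop(row, None)
    change_log.clear()

put("r1", "Big Data")
delete("r1")
put("r2", "HBase rocks")
run_indexer()
# After replay the index mirrors the store: only r2 is searchable.
```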
ENTERPRISE DATA HUB
Don't ruin your existing data warehouse. Just extend it with new, centralized big data storage through a data migration solution.
HBase: Data and search integration
[Slide graphic: the client just puts (or deletes) data in HBase regions, with HDFS serving the low level file system. HBase replication, which can be set up down to column family level, feeds the Lily HBase NRT indexer; it translates data changes into SOLR index updates. The SOLR cloud serves search requests (HTTP) and finally provides search; Apache Zookeeper does all the coordination.]
Questions and discussion