Building the Big Data EDW (DAMA Iowa, October 2013)


  • Building A Big Data Data Warehouse

    Integrating Structured and Unstructured Data

    DAMA Iowa, October 2013

    Krish Krishnan, Founder, Sixth Sense Advisors Inc.

  • Discussion Focus

    S Big data and the data warehouse: the new landscape

    S Technology overview: Hadoop, NoSQL, Cassandra, BigQuery, Drill, Redshift, AWS (S3, EC2); programming with MapReduce; understanding analytical requirements, self-service discovery platforms

    S The challenges of data processing: Workloads; data management; infrastructure limitations

    S Next-generation data warehouse: Solution architectures; the three Ss: scalability, sustainability, and stability

    2 @2013 Copyright Sixth Sense Advisors

  • A New Landscape

    3 @2013 Copyright Sixth Sense Advisors

  • A Growing Trend

    @2013 Copyright Sixth Sense Advisors 4

    Requirement   | Expectation                   | Reality
    --------------+-------------------------------+------------------------------
    Speed         | Speed of the Internet         | Speed = Infra + Arch + Design
    Accessibility | Accessibility of a smartphone | BI tool licenses & security
    Usability     | iPad / mobility               | Web-enabled BI tool
    Availability  | Google Search                 | Data & report metadata
    Delivery      | Speed of questions            | Methodology & signoff
    Data          | Access to everything          | Structured data
    Scalability   | Cloud (Amazon)                | Existing infrastructure
    Cost          | Cell phone or free WiFi       | Millions

    Expectations for BI are changing without anyone telling us.

  • State of Data Today

    @2013 Copyright Sixth Sense Advisors 5

  • Data Growth Trends

    @2013 Copyright Sixth Sense Advisors 6

    Facebook has an average of 30 billion pieces of content added every month.

    YouTube receives 24 hours of video every minute.

    15 billion mobile phones are predicted to be in use in 2015.

    A leading retailer in the UK collects 1.5 billion pieces of information to adjust prices and promotions.

    Amazon.com: 30% of sales comes from its recommendation engine.

    A Boeing jet engine produces 20 TB of data per hour for engineers to examine in real time to make improvements.

    CERN's Large Hadron Collider produces 15 PB of data for each cycle of execution.

  • Decision Support = #Fail?

    S Decision support platforms of today are not satisfying the needs of the business user

    S Decisions being driven in the organization are not based on 360 degree views of the organization and its performance

    S Business transformations are not completely successful due to the lack of information presented in the Business Intelligence Architecture

    S Analytics and key performance indicators are not available in a timely manner, and the data that is presented is not sufficient to support business decisions with confidence

    @2013 Copyright Sixth Sense Advisors 7

  • State of the Data Warehouse

    8 @2013 Copyright Sixth Sense Advisors

  • What We Have Built

    @2013 Copyright Sixth Sense Advisors 9

  • Business Thinking

    @2013 Copyright Sixth Sense Advisors 10

    (Word cloud: new data and increasing complexity; increased quality of service; increased agility; digital intelligence; customer-centric and cost-driven thinking; TCO, opportunity cost, and competitive cost; digital and connected; mobile and metrics-driven; big data, social media, and corporate data; smarter consumers; global competition; cost.)

  • CIO Thinking

    @2013 Copyright Sixth Sense Advisors 11

  • Architects Thinking

    Flexibility

    Reliability

    Simplicity

    Scalability

    Modularity

    @2013 Copyright Sixth Sense Advisors 12

  • Users Needs

    @2013 Copyright Sixth Sense Advisors 13

    Users need every kind of data, in all shapes, sizes, and formats

  • Why the Database Alone Cannot Be the Platform: The Limitations of Databases

    14 @2013 Copyright Sixth Sense Advisors

  • The Disappointment

    @2013 Copyright Sixth Sense Advisors 15

    S Distributed: transactional databases, data warehouses, datamarts, analytical databases, CRM databases, SCM databases, ERP databases

    S Redundant

    S Weak Metadata

    S Weak Integration

  • Base graph courtesy of Dr. Richard Hackathorn

    Why The Data Warehouse Fails

    @2013 Copyright Sixth Sense Advisors 16

    (Chart: business value decays with action time. After a business situation occurs, value is lost across three latencies: data latency until the data is ready, analysis latency until the information is available, and decision latency until the decision is made.)

    Lost value = Sum(latencies) + opportunity cost
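
    As a rough, hedged illustration of the formula (all numbers are hypothetical, and each latency is expressed directly as the value it destroys):

        # Hypothetical value lost to each latency for one business event
        # (illustrative units only -- a sketch of the deck's formula, not a model).
        data_latency_loss = 4.0      # business situation occurs -> data is ready
        analysis_latency_loss = 8.0  # data is ready -> information is available
        decision_latency_loss = 12.0 # information available -> decision is made
        opportunity_cost = 5.0       # value forgone while waiting

        lost_value = sum([data_latency_loss,
                          analysis_latency_loss,
                          decision_latency_loss]) + opportunity_cost
        print(lost_value)  # 29.0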

  • Data Warehouse Computing Today

    @2013 Copyright Sixth Sense Advisors 17

    (Diagram: transactional systems feed operational data stores (ODS); data transformation loads the enterprise data warehouse; the warehouse feeds datamarts and analytical databases, which serve reports, dashboards, analytic models, and other applications.)

  • The Bottom Line

    S We have designed, architected, and deployed systems on architectures that were never intended for today's complex processing and compute requirements

    S The real issue is that the architectures designed for the RDBMS platform differ widely in their ability to handle diverse types of workloads

    S To design and manage complex workloads, architects need to understand the underlying platform's capabilities in relation to the type of workload being designed

    @2013 Copyright Sixth Sense Advisors 18

  • Shared Everything Architecture

    S Resources are distributed and shared

    S CPUs are shared across the databases

    S Memory is shared across CPUs and databases

    S The disk architecture is shared across CPUs

    S The big disadvantage is that sharing resources limits scalability

    S Adding resources does not increase scalability and performance linearly; it only increases cost

    @2013 Copyright Sixth Sense Advisors 19

  • Issues

    S Shared Everything architecture cannot scale and handle workloads effectively

    S You cannot achieve 100% linear scalability in a shared architecture environment

    S Compute and store happen in disparate environments

    S Infrastructure limitations create more latencies in the overall system

    S Data governance is a complex subject area that adds to the weakness of the architecture

    @2013 Copyright Sixth Sense Advisors 20

  • BIG Data Example

    @2013 Copyright Sixth Sense Advisors 21

    To: [email protected] Dear Mr. Collins, This email is in reference to my bank account, which has been efficiently handled by your bank for more than five years. There was no problem till date, until last week the situation went out of hand. I deposited one of my high-amount cheques to my bank account no: 65656512, which was to be credited the same day, but due to your staff's carelessness it wasn't done, and because of this negligence my reputation in the market has been tarnished. Furthermore, I had issued one payment cheque to the party, which was shown as bounced due to insufficient balance just because my cheque didn't clear on time. My relationship with your bank has matured with time, and it's a shame to tell you that this kind of service is not acceptable when it is a question of somebody's reputation. I hope you got my point; I am attaching a copy of the same for further rapid procedures, and remit into my account in a day. Yours sincerely, Daniel Carter, Ph: 564-009-2311

  • Big Data Example

    S We will often imply additional information in spoken language by the way we place stress on words.

    S The sentence "I never said she stole my money" demonstrates the importance stress can play in a sentence, and thus the inherent difficulty a natural language processor can have in parsing it.

    S "*I* never said she stole my money" - Someone else said it, but I didn't.
    S "I *never* said she stole my money" - I simply didn't ever say it.
    S "I never *said* she stole my money" - I might have implied it in some way, but I never explicitly said it.
    S "I never said *she* stole my money" - I said someone took it; I didn't say it was she.
    S "I never said she *stole* my money" - I just said she probably borrowed it.
    S "I never said she stole *my* money" - I said she stole someone else's money.
    S "I never said she stole my *money*" - I said she stole something, but not my money.

    S Depending on which word the speaker stresses, this sentence can have several distinct meanings.

    @2013 Copyright Sixth Sense Advisors 22    Example source: Wikipedia

  • The Normal Way Results In

    @2013 Copyright Sixth Sense Advisors 23

  • Impact on Data Warehouse

    @2013 Copyright Sixth Sense Advisors 24

    New data types, new volume, new analytics, new workloads, and new metadata strain the warehouse, resulting in poor performance and failed programs.

    Why can big data fail? Scalability; sharding; ACID.

  • ACID is Not Good All The Time

    S Atomic: all of the work in a transaction completes (commit) or none of it completes

    S Consistent: a transaction transforms the database from one consistent state to another consistent state; consistency is defined in terms of constraints

    S Isolated: the results of any changes made during a transaction are not visible until the transaction has committed

    S Durable: the results of a committed transaction survive failures

    @2013 Copyright Sixth Sense Advisors 25
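
    A minimal sketch of atomicity, using Python's built-in sqlite3 (a generic illustration; it is not tied to any platform in this deck):

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
        conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                         [("alice", 100), ("bob", 0)])
        conn.commit()

        try:
            with conn:  # one transaction: commit on success, rollback on any exception
                conn.execute("UPDATE accounts SET balance = balance - 50 "
                             "WHERE name = 'alice'")
                raise RuntimeError("simulated crash before the matching credit")
        except RuntimeError:
            pass

        # Atomicity: the debit was rolled back, so no money vanished.
        print(dict(conn.execute("SELECT name, balance FROM accounts")))
        # -> {'alice': 100, 'bob': 0}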

  • Where Do we Go?

    @2013 Copyright Sixth Sense Advisors 26

    (Diagram: data, tools, and instructions.)

  • Next Generation Technologies

    Integrating Big Data

    27 @2013 Copyright Sixth Sense Advisors

  • Innovations

    @2013 Copyright Sixth Sense Advisors 28

    Category                     | New Frontiers
    -----------------------------+---------------------------------------------------
    Infrastructure               | Big data and data warehouse appliances; in-memory
                                 | technologies; SSD storage; fast networks; cloud;
                                 | mobile technologies
    Software                     | In-memory databases; Hadoop, Cassandra & NoSQL
                                 | ecosystems; columnar DBMS; improved ETL-Hadoop
                                 | integration (Informatica, Talend)
    Algorithms                   | Mahout
    Pre-configured architectures | IBM, Teradata, Kognitio, EMC, Cloudera,
                                 | HortonWorks, Cirro, Intel, Cisco UCS, Pivotal,
                                 | Oracle, MapR

  • BIG Data - Infrastructure Requirements

    S Scalable platform

    S Database independent

    S Fault tolerant

    S Low cost of acquisition

    S Scalable and Reliable Storage

    S Supported by standard toolsets

    S Datacenter Ready

    29 @2013 Copyright Sixth Sense Advisors

  • Big Data Workload Demands

    30 @2013 Copyright Sixth Sense Advisors

    S Process dynamic data content

    S Process unstructured data

    S Systems that can scale up with high-volume data

    S Systems that can scale out to a high volume of users

    S Perform complex operations within reasonable response time

  • Parallel databases

    S Shared-nothing MPP architecture (a collection of independent machines, each with local hard disk and main memory, connected together on high-speed network)

    S Machines are cheaper, lower-end, commodity hardware

    S Scales well up to a point, tens of nodes

    S Good performance

    S Poor fault tolerance

    S Problems in heterogeneous environments (machines must be equal in performance)

    S Good support for flexible query interface

    @2013 Copyright Sixth Sense Advisors 31

  • Data Warehouse Appliance

    High Availability

    Standard SQL Interface

    Advanced Compression

    MPP

    Leverages existing BI, ETL and OLTP investments

    Hadoop & MapReduce Interface / Embedded

    Minimal disk I/O bottleneck; simultaneously load & query

    Auto Database Management

    @2013 Copyright Sixth Sense Advisors 32

    A Data Warehouse (DW) Appliance is an integrated set of servers, storage, OS, database and interconnect specifically preconfigured and tuned for the rigors of data warehousing.

    DW appliances offer an attractive price / performance value proposition and are frequently a fraction of the cost of traditional data warehouse solutions.

  • Hadoop Evolution

    @2013 Copyright Sixth Sense Advisors 33

  • Hadoop

    @2013 Copyright Sixth Sense Advisors 34

  • Why Hadoop

    S Commodity hardware: built on inexpensive servers; storage servers and their disks are not assumed to be highly reliable and available; modular expansion

    S Metadata/data-oriented design: the NameNode maintains metadata; DataNodes manage data placement and storage

    S Computation happens close to the data: servers have dual goals, data storage and computation; a single store-and-compute cluster vs. separate clusters

    S File-system architecture: the focus is mostly sequential access; single writers; no file-locking features

    35 @2013 Copyright Sixth Sense Advisors

  • Hadoop Architecture

    @2013 Copyright Sixth Sense Advisors 36

  • HDFS

    @2013 Copyright Sixth Sense Advisors 37

    S Hadoop Distributed File System

    S A scalable, fault-tolerant, high-performance distributed file system

    S Asynchronous replication

    S Write-once, read-many (WORM)

    S No RAID required

    S Access from C, Java, Thrift

    S The NameNode holds filesystem metadata

    S Files are broken up and spread over the DataNodes

  • HDFS Splits & Replication

    S Data is organized into files and directories

    S Files are divided into uniform sized blocks and distributed across cluster nodes

    S Blocks are replicated to handle hardware failure

    S Filesystem keeps checksums of data for corruption detection and recovery

    S HDFS exposes block placement so that computation can be migrated to data

    @2013 Copyright Sixth Sense Advisors 38

  • HDFS

    S Data Node: stores data in HDFS; found in multiples; data is replicated across DataNodes

    @2013 Copyright Sixth Sense Advisors 39

    S File size: a typical block size is 64 MB (or even 128 MB); a file is chopped into 64 MB chunks and stored

    S Name Node: the NameNode is the heartbeat of an HDFS file system. It keeps the directory of all files in the file system and tracks data distribution across the cluster; it does not store the data of these files itself. Cluster configuration management; transaction log management

    S Features: HDFS provides a Java API for applications to use; Python access is also used in many applications; a C-language wrapper for the Java API is also available; an HTTP browser can be used to browse the files of an HDFS instance

  • HDFS Features

    Data correctness
    - File creation: the client computes a checksum per 512 bytes; the DataNode stores the checksum
    - File access: the client retrieves the data and checksum from the DataNode; if validation fails, the client tries other replicas

    Data pipeline
    - The client retrieves a list of DataNodes on which to place replicas of a block
    - The client writes the block to the first DataNode
    - The first DataNode forwards the data to the next DataNode in the pipeline
    - When all replicas are written, the client moves on to write the next block in the file

    Rebalancer
    - Usually run when new DataNodes are added
    - The cluster stays online while the Rebalancer is active
    - The Rebalancer is throttled to avoid network congestion
    - Command-line tool

    Block placement
    - First replica on a node in the local rack
    - Second replica on a different rack
    - Third replica on the same rack as the second replica
    - Clients read from the nearest replica

    Heartbeats
    - DataNodes send a heartbeat to the NameNode once every 3 seconds
    - The NameNode uses heartbeats to detect DataNode failure

    Replication engine
    - Chooses new DataNodes for new replicas
    - Balances disk usage
    - Balances communication traffic to DataNodes

    @2013 Copyright Sixth Sense Advisors 40
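
    A toy sketch of the block-placement policy above (rack and node names are hypothetical; in real HDFS the NameNode makes this decision):

        import random

        # Hypothetical cluster topology: rack name -> nodes.
        RACKS = {
            "rack1": ["r1n1", "r1n2", "r1n3"],
            "rack2": ["r2n1", "r2n2", "r2n3"],
        }

        def place_replicas(client_rack):
            """First replica on the local rack, second on a different rack,
            third on the same rack as the second."""
            first = random.choice(RACKS[client_rack])
            remote_rack = random.choice([r for r in RACKS if r != client_rack])
            second = random.choice(RACKS[remote_rack])
            third = random.choice([n for n in RACKS[remote_rack] if n != second])
            return [first, second, third]

        print(place_replicas("rack1"))  # e.g. ['r1n2', 'r2n3', 'r2n1']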

  • HBASE

    S Clone of Google's Bigtable

    S Implemented in Java (clients: Java, C++, Ruby, ...)

    S Column-oriented data store

    S Distributed over many servers

    S Tolerant of machine failure

    S Layered over HDFS

    S Strong consistency

    S Not a relational database (no joins)

    S Sparse data: nulls are stored for free

    S Supports semi-structured and unstructured data

    S Versioned data storage capability

    S Extremely scalable: a goal of billions of rows x millions of columns

    @2013 Copyright Sixth Sense Advisors 41

    S HBase provides storage for the Hadoop distributed computing environment.

    S Data is logically organized into tables, rows, and columns (a minimal client sketch follows below).
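
    A hedged sketch of the table/row/column-family model using the third-party happybase client (it assumes an HBase Thrift gateway on localhost; the table and column names are hypothetical):

        import happybase  # third-party Python client for HBase's Thrift gateway

        conn = happybase.Connection("localhost")  # assumes a running Thrift server
        table = conn.table("users")               # hypothetical table

        # Rows are keyed byte strings; columns live inside column families
        # and are addressed as "family:qualifier".
        table.put(b"user-001", {b"profile:name": b"Daniel",
                                b"profile:city": b"Des Moines"})

        row = table.row(b"user-001")
        print(row[b"profile:name"])  # b'Daniel'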

  • Hive

    S Data summarization and ad hoc query interface on top of Hadoop

    S MapReduce for execution, HDFS for storage

    S Hive Query Language
    S Basic SQL: Select, From, Join, Group By
    S Equi-join, multi-table insert, multi-group-by
    S Batch query (a sample query follows below)

    S MetaStore
    S Table/partition properties
    S Thrift API: current clients in PHP (web interface), Python, and Java (query engine and CLI)
    S Metadata stored in any SQL backend

    @2013 Copyright Sixth Sense Advisors 42

    Image: Cloudera Hive tutorial
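
    A minimal sketch of the HiveQL style listed above, submitted as a batch job through the Hive CLI from Python (the table and columns are hypothetical):

        import subprocess

        # Hypothetical table: page_views(user_id STRING, url STRING, dt STRING)
        query = """
        SELECT url, COUNT(*) AS views
        FROM page_views
        WHERE dt = '2013-10-01'
        GROUP BY url;
        """

        # `hive -e '<query>'` runs a HiveQL string as a batch (MapReduce) job.
        subprocess.run(["hive", "-e", query], check=True)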

  • HBase Hive Integration

    @2013 Copyright Sixth Sense Advisors 43

    (Diagram: a Hive table definition points to an existing HBase table, mapping Hive columns onto HBase columns, optionally under different names.)

  • Pig

    S Pig is a platform for analyzing large data sets, consisting of a high-level language for expressing data analysis programs

    S Pig generates and compiles Map/Reduce programs on the fly

    S Abstracts you from specific details; the focus is on data processing and data flow; built for data manipulation

    S Pig is workflow driven and easy to maintain (a small script follows below)

    @2013 Copyright Sixth Sense Advisors 44
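
    A small word-count sketch in Pig Latin, written to a file and run in batch mode (the input and output paths are hypothetical):

        import subprocess

        # A tiny word-count data flow in Pig Latin.
        pig_script = """
        lines  = LOAD 'input.txt' AS (line:chararray);
        words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
        grpd   = GROUP words BY word;
        counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
        STORE counts INTO 'wordcount_out';
        """

        with open("wordcount.pig", "w") as f:
            f.write(pig_script)

        # Pig compiles the script into MapReduce jobs and runs them.
        subprocess.run(["pig", "wordcount.pig"], check=True)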

  • Sqoop

    S Sqoop is a tool designed to help users of large data import existing relational databases into their Hadoop clusters

    S Automatic data import: SQL to Hadoop

    S Easily imports data from many databases into Hadoop

    S Generates code for use in MapReduce applications

    S Integrates with Hive (a representative invocation follows below)

    @2013 Copyright Sixth Sense Advisors 45
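
    A representative import invocation (the connection string, table, and target directory are hypothetical):

        import subprocess

        # Import one RDBMS table into HDFS; Sqoop generates the MapReduce job.
        subprocess.run([
            "sqoop", "import",
            "--connect", "jdbc:mysql://dbhost/sales",  # hypothetical source database
            "--table", "orders",                       # hypothetical table
            "--target-dir", "/user/etl/orders",        # HDFS output directory
            "--num-mappers", "4",                      # parallel import tasks
        ], check=True)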

  • Zookeeper

    S All servers store a copy of the data

    S A leader is elected at startup

    S Followers service clients; all updates go through the leader

    S Update responses are sent when a majority of servers have persisted the change

    @2013 Copyright Sixth Sense Advisors 46

  • AVRO

    S A data serialization system that provides dynamic integration with scripting languages

    S Avro data: expressive; smaller and faster; dynamic

    S The schema is stored with the data

    S APIs permit reading and creating

    S Includes a file format and a textual encoding (a sketch follows below)

    S Generates JSON metadata automatically

    @2013 Copyright Sixth Sense Advisors 47
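
    A hedged sketch of "schema stored with the data" using the third-party fastavro package (the schema and file name are illustrative):

        import fastavro

        # The schema is written into the file header, so readers need no
        # out-of-band schema definition.
        schema = fastavro.parse_schema({
            "type": "record",
            "name": "User",
            "fields": [
                {"name": "name", "type": "string"},
                {"name": "age", "type": "int"},
            ],
        })

        records = [{"name": "Krish", "age": 40}, {"name": "Dana", "age": 35}]

        with open("users.avro", "wb") as out:
            fastavro.writer(out, schema, records)

        with open("users.avro", "rb") as inp:
            for rec in fastavro.reader(inp):  # schema is read from the file header
                print(rec)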

  • AVRO RPC

    S Leverages versioning support

    S Provides cross-language access to Hadoop services

    @2013 Copyright Sixth Sense Advisors 48

  • Chukwa

    S A data collection system for managing large distributed systems

    S Built on HDFS and MapReduce

    S A toolkit for displaying, monitoring, and analyzing log files

    @2013 Copyright Sixth Sense Advisors 49

  • Flume

    S Flume is:
    S A scalable, configurable, extensible, and manageable distributed data collection service
    S Developed as open source
    S A one-stop solution for data collection in all formats
    S Flexible: reliability guarantees allow careful performance tuning
    S Enables quick iteration on new collection strategies

    @2013 Copyright Sixth Sense Advisors 50

  • Oozie

    S Workflow engine for Hadoop: HTTP and command-line interfaces, plus a web console

    S Used to:
    S Execute and monitor workflows in Hadoop
    S Schedule workflows periodically
    S Trigger execution by data availability

    @2013 Copyright Sixth Sense Advisors 51

  • Hadoop Differentiator

    Schema-on-Write (RDBMS)
    - The schema must be created before any data is loaded
    - An explicit load operation transforms the data to the internal structure of the database
    - New columns must be added explicitly before data for them can be loaded
    - Read is fast
    - Standards / governance

    Schema-on-Read (Hadoop)
    - Data is simply copied to the file store; no special transformation is needed
    - A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns
    - New data can start flowing at any time and appears retroactively once the SerDe is updated to parse it
    - Load is fast
    - Evolving schemas / agility (a toy illustration follows below)

    @2013 Copyright Sixth Sense Advisors 52
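
    A toy illustration of schema-on-read: raw lines land in the file store untouched, and a SerDe-like parser imposes the columns only when the data is read (the file name and record layout are hypothetical):

        import csv

        # "Load": raw text is simply copied; no schema is enforced at write time.
        with open("clicks.txt", "w") as f:
            f.write("2013-10-01,alice,/home\n2013-10-01,bob,/pricing\n")

        # "Read": a SerDe-like parser imposes the columns only at read time.
        COLUMNS = ["dt", "user", "url"]  # change this list and re-read; no reload needed

        def read_records(path):
            with open(path) as f:
                for row in csv.reader(f):
                    yield dict(zip(COLUMNS, row))

        for rec in read_records("clicks.txt"):
            print(rec["user"], rec["url"])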

  • HadoopDB

    S A recent study at Yale University's database research department

    S Hybrid architecture of parallel databases and MapReduce system

    S The idea is to combine the best qualities of both technologies

    S Multiple single-node databases are connected using Hadoop as the task coordinator and network communication layer

    S Queries are distributed across the nodes by MapReduce framework, but as much work as possible is done in the database node

    @2013 Copyright Sixth Sense Advisors 53    Slide courtesy: Dr. Daniel Abadi

  • HadoopDB architecture

    @2013 Copyright Sixth Sense Advisors 54    Slide courtesy: Dr. Daniel Abadi

  • Hadoop Limitations

    S Write-once model

    S A namespace with an extremely large number of files exceeds the NameNode's capacity to maintain it

    S Cannot be mounted by an existing OS; getting data in and out is tedious; a virtual file system can ease the problem

    S HDFS does not implement or support: user quotas, access permissions, hard or soft links, or data-balancing schemes

    S No periodic checkpoints

    @2013 Copyright Sixth Sense Advisors 55

  • Hadoop Tips

    S Hadoop is useful
    S When you must process lots of unstructured data
    S When running batch jobs is acceptable
    S When you have access to lots of cheap hardware

    S Hadoop is not useful
    S For intense calculations with little or no data
    S When your data is not self-contained
    S When you need interactive results

    @2013 Copyright Sixth Sense Advisors 56

    Implementation
    - Think big, start small
    - Build on agile cycles
    - Focus on the data, as you will always develop schema on read

    Available optimizations
    - Input to maps
    - Map-only jobs
    - Combiner
    - Compression
    - Speculation
    - Fault tolerance
    - Buffer size
    - Parallelism (threads)
    - Partitioner
    - Reporter
    - DistributedCache
    - Task child environment settings

  • Hadoop Tips

    S Performance tuning
    S Increase the memory/buffers allocated to the tasks
    S Increase the number of tasks that can run in parallel
    S Increase the number of threads that serve the map outputs
    S Disable unnecessary logging
    S Turn on speculation
    S Run reducers in one wave, as they tend to get expensive
    S Tune the usage of DistributedCache; it can increase efficiency

    S Troubleshooting
    S Are your partitions uniform?
    S Can you combine records at the map side?
    S Are maps reading off a DFS block worth of data?
    S Are you running a single reduce wave (unless the data size per reducer is too big)?
    S Have you tried compressing intermediate and final data?
    S Are there buffer-size issues?
    S Do you see unexplained long tails?
    S Are your CPU cores busy?
    S Is at least one system resource being loaded?

    @2013 Copyright Sixth Sense Advisors 57

  • MapReduce

    S Developed for processing large data sets.

    S Contains Map and Reduce functions.

    S Runs on a large cluster of machines.

    S Goals S Use machines across the data center S Elastic scaling S Finite programming model

    @2013 Copyright Sixth Sense Advisors 58

  • Input | Map() | Copy/Sort | Reduce() | Output

    Map Phase

    Raw data is analyzed and converted to name/value pairs

    Shuffle Phase

    All name/value pairs are sorted and grouped by their keys

    Reduce Phase

    All values associated with a key are processed for results

    MapReduce

    @2013 Copyright Sixth Sense Advisors 59

  • Programming model

    S Input & output: each is a set of key/value pairs

    S The programmer specifies two functions:

    S map(in_key, in_value) -> list(out_key, intermediate_value)
    S Processes an input key/value pair
    S Produces a set of intermediate pairs

    S reduce(out_key, list(intermediate_value)) -> list(out_value)
    S Combines all intermediate values for a particular key
    S Produces a set of merged output values (usually just one)

    @2013 Copyright Sixth Sense Advisors 60

  • Example

    S Page 1: DAMA Conference is good

    S Page 2: There are good ideas presented at DAMA

    S Page 3: I like DAMA because of its variety of topics.

    @2013 Copyright Sixth Sense Advisors 61

  • Map output

    S Worker 1: (DAMA 1), (Conference 1), (is 1), (good 1)

    S Worker 2: (There 1), (are 1), (good 1), (ideas 1), (presented 1), (at 1), (DAMA 1)

    S Worker 3: (I 1), (like 1), (DAMA 1), (because 1), (of 1), (its 1), (variety 1), (of 1), (topics 1)

    @2013 Copyright Sixth Sense Advisors 62

  • Reduce Input

    S Worker 1: (DAMA 1), (DAMA 1), (DAMA 1)

    S Worker 2: (is 1)

    S Worker 3: (good 1), (good 1)

    S Worker 4: (There 1)

    S Worker 5: (ideas 1)

    S Worker 6: (presented 1)

    S Worker 7: (I 1)

    S Worker 8: (like 1)

    S Worker 9: (its 1)

    S Worker 10: (variety 1)

    S Worker 11: (topics 1)

    @2013 Copyright Sixth Sense Advisors 63

  • Reduce Output

    S Worker 1: (DAMA 3)

    S Worker 2: (is 1)

    S Worker 3: (good 2)

    S Worker 4: (There 1)

    S Worker 5: (ideas 1)

    S Worker 6: (presented 1)

    S Worker 7: (I 1)

    S Worker 8: (like 1)

    S Worker 9: (its 1)

    S Worker 10: (variety 1)

    S Worker 11: (topics 1)

    @2013 Copyright Sixth Sense Advisors 64
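
    A self-contained sketch of the same word count, simulating the map, shuffle/sort, and reduce phases in-process (a real MapReduce framework distributes these steps across a cluster):

        from collections import defaultdict

        PAGES = [
            "DAMA Conference is good",
            "There are good ideas presented at DAMA",
            "I like DAMA because of its variety of topics",
        ]

        def map_fn(_, text):          # (in_key, in_value) -> list of (out_key, 1)
            return [(word, 1) for word in text.split()]

        def reduce_fn(word, counts):  # (out_key, [values]) -> merged output value
            return word, sum(counts)

        # Map phase: every page is processed independently.
        intermediate = [pair for i, page in enumerate(PAGES)
                        for pair in map_fn(i, page)]

        # Shuffle phase: group all intermediate values by key.
        groups = defaultdict(list)
        for word, count in intermediate:
            groups[word].append(count)

        # Reduce phase: one call per distinct key.
        result = dict(reduce_fn(w, c) for w, c in groups.items())
        print(result["DAMA"], result["good"])  # 3 2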

  • MapReduce Strengths

    S Tunable: fine-grained map and reduce tasks; improved load balancing; faster recovery from failed tasks

    S Good fault tolerance: can scale to thousands of nodes; supports heterogeneous environments; automatic re-execution on failure

    S Localized execution: with large data, eliminates the bandwidth problem by scheduling execution close to the location of the data when possible

    S MapReduce + HDFS is a very effective solution for scaling in a distributed geographical environment

    65 @2013 Copyright Sixth Sense Advisors

  • NoSQL

    S Stands for Not Only SQL

    S Based on CAP Theorem

    S Usually do not require a fixed table schema nor do they use the concept of joins

    S All NoSQL offerings relax one or more of the ACID properties

    S NoSQL databases come in a variety of flavors:
    S XML (myXMLDB, Tamino, Sedna)
    S Wide column (Cassandra, HBase, Bigtable)
    S Key/value (Redis, Memcached with BerkeleyDB)
    S Graph (neo4j, InfoGrid)
    S Document store (CouchDB, MongoDB)

    @2013 Copyright Sixth Sense Advisors 66

  • NoSQL

    @2013 Copyright Sixth Sense Advisors 67

    (Chart: NoSQL systems arranged by data size versus data complexity, including Amazon Dynamo, Google Bigtable, Cassandra, HBase, Voldemort, Lotus Notes, and graph databases.)

  • Approaches to CAP

    68

    S Eric Brewer stated at PODC in 2000 that in a distributed system you have to give up one of the following:
    S Consistency of data
    S Availability
    S Partition tolerance

    S BASE: no ACID; use a single version of the DB and reconcile later

    S Defer transaction commit until partitions are fixed and replication can run

    S Eventual consistency (e.g., Amazon Dynamo): eventually, all copies of an object converge

    S Restrict transactions (e.g., sharded MySQL):
    S 1-machine transactions: objects in a transaction are on the same machine
    S 1-object transactions: a transaction can only read/write one object

    S Object timelines (PNUTS)

    @2013 Copyright Sixth Sense Advisors

  • Consistency Model

    S If copies are asynchronously updated, what can we say about stale copies?
    S ACID guarantees require synchronous updates
    S Eventual consistency: copies can drift apart but will eventually converge if the system is allowed to quiesce
    S To what value will copies converge?
    S Do systems ever quiesce?

    S Is there any middle ground?

    @2013 Copyright Sixth Sense Advisors 69

  • Consistency Techniques

    S Per-record mastering
    S Each record is assigned a master region (which may differ between records)
    S Updates to the record are forwarded to the master region
    S Ensures consistent ordering of updates

    S Tablet-level mastering
    S Each tablet is assigned a master region
    S Inserts and deletes of records are forwarded to the master region
    S The master region decides tablet splits

    S These details are hidden from the application (except for the latency impact!)

    @2013 Copyright Sixth Sense Advisors 70

  • HBASE

    71 @2013 Copyright Sixth Sense Advisors

  • Architecture

    @2013 Copyright Sixth Sense Advisors 72

    (Diagram: clients reach the HBaseMaster and, via a Java client or the REST API, multiple HRegionServers, each backed by its own disk.)

  • HRegion Server

    S Records are partitioned by column family into HStores

    S Each HStore contains many MapFiles

    S All writes to an HStore are applied to a single memcache

    S Reads consult the MapFiles and the memcache

    S Memcaches are flushed as MapFiles (HDFS files) when full

    S Compactions limit the number of MapFiles

    @2013 Copyright Sixth Sense Advisors 73

    (Diagram: writes go to the HRegionServer's memcache; the memcache is flushed to MapFiles on disk; reads consult both.)

  • Pros and Cons

    S Pros
    S Log-based storage for high write throughput
    S Elastic scaling
    S Easy load balancing
    S Column storage for OLAP workloads

    S Cons
    S Writes are not immediately persisted to disk
    S Reads cross multiple disk and memory locations
    S No geo-replication
    S Latency/bottleneck of the HBaseMaster when using REST

    @2013 Copyright Sixth Sense Advisors 74

  • CASSANDRA

    @2013 Copyright Sixth Sense Advisors 75

  • Architecture

    S Facebook's storage system

    S BigTable data model

    S Dynamo partitioning and consistency model

    S Peer-to-peer architecture

    @2013 Copyright Sixth Sense Advisors 76

    (Diagram: clients connect to a ring of peer Cassandra nodes, each with its own disk.)

  • Routing

    S Consistent hashing, like Dynamo or Chord
    S Server position = hash(server id)
    S Content position = hash(content id)
    S Each server is responsible for all content in a hash interval (a sketch follows below)

    @2013 Copyright Sixth Sense Advisors 77

    (Diagram: servers placed on a hash ring, each owning the hash interval it is responsible for.)
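
    A compact sketch of consistent hashing as described above (node names are hypothetical; md5 is used only to get a stable hash):

        import hashlib
        from bisect import bisect

        def h(key):
            """Stable hash of a string onto a 32-bit ring."""
            return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

        SERVERS = ["node-a", "node-b", "node-c"]   # hypothetical server ids
        ring = sorted((h(s), s) for s in SERVERS)  # server position = hash(server id)

        def lookup(content_id):
            """Content position = hash(content id); the owner is the next
            server clockwise on the ring."""
            pos = h(content_id)
            idx = bisect([p for p, _ in ring], pos) % len(ring)
            return ring[idx][1]

        print(lookup("user-42"))  # always routes to the same node
        # Adding or removing one node remaps only the keys in its hash interval.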

  • Cassandra Server

    S Writes go to a log and an in-memory table (memtable)

    S Periodically the memory table is merged with the disk table (SSTable)

    @2013 Copyright Sixth Sense Advisors 78

    (Diagram: an update is appended to the log and applied to the memtable in RAM; later the memtable is flushed to an SSTable file on disk.)

  • Pros and Cons

    S Pros
    S Elastic scalability
    S Easy management (peer-to-peer configuration)
    S The BigTable model is nice: flexible schema; column groups for partitioning, versioning, etc.
    S Eventual consistency is scalable

    S Cons
    S Eventual consistency is hard to program against
    S No built-in support for geo-replication
    S Load balancing?
    S System complexity: P2P systems are complex and have complex corner cases

    @2013 Copyright Sixth Sense Advisors 79

  • Cassandra Tips

    Tunable memtable size: you can have a large memtable flushed less frequently, or a small memtable flushed frequently.

    The tradeoff is throughput versus recovery time: a larger memtable requires fewer flushes but takes a long time to recover after a failure (with a 1 GB memtable, 45 minutes to 1 hour to restart).

    You can turn off log flushing, at the risk of losing durability.

    Replication is still synchronous with the write: updates are durable if they propagated to other servers that don't fail.

    @2013 Copyright Sixth Sense Advisors 80

  • NoSQL

    @2013 Copyright Sixth Sense Advisors 81

    Best practices
    - Design for data collection
    - Plan the data store: organize by type and semantics; partition for performance
    - Access and query are run-time dependent
    - Horizontal scaling
    - Memory caching

    Access and query
    - RESTful interfaces (HTTP as an access API)
    - Query languages other than SQL: SPARQL (query language for the Semantic Web), Gremlin (the graph traversal language), Sones Graph Query Language
    - Data manipulation / query APIs: the Google BigTable DataStore API, the Neo4j Traversal API

    Serialization formats
    - JSON, Thrift, Protocol Buffers, RDF

  • Textual ETL Engine

    Forest Rim Technology's Textual ETL Engine (TETLE) is an integration tool for turning text into a structure of data that can be analyzed by standard analytical tools.

    @2013 Copyright Sixth Sense Advisors 82

    Textual ETL Engine provides a robust user interface to define rules (patterns / keywords) to process unstructured or semi-structured data.

    The rules engine encapsulates all the complexity and lets the user define simple phrases and keywords.

    Easy to implement and easy to realize ROI.

    Advantages
    - Simple to use
    - No MapReduce or coding required for text analysis and mining
    - Extensible via taxonomy integration
    - Works on standard and new databases
    - Produces a highly columnar key-value store, ready for metadata integration

    Disadvantages
    - Not integrated with Hadoop as a rules interface
    - Currently uses Sqoop for metadata interchange with Hadoop or NoSQL interfaces
    - The current GA release does not handle distributed processing outside the Windows platform

  • Amazon RedShift

    S The industry's first large-scale data warehouse as a service

    S Designed and architected for petabyte-scale deployment

    S Goal 1: Reduce I/O
    S Direct-attached storage
    S Large data block sizes
    S Columnar storage

    S Goal 2: Optimize hardware
    S Optimized for I/O-intensive workloads
    S High disk density
    S Runs on a fast (HPC) network

    S Goal 3: Extreme parallelism, for increased speed and efficiency in loading, querying, backup, and restore

    @2013 Copyright Sixth Sense Advisors 83

  • RedShift Architecture

    (Diagram: SQL clients and BI tools connect to the leader node.)

    Picture: Amazon presentation on RedShift (Internet)

    @2013 Copyright Sixth Sense Advisors 84

  • Deployment Options

    S Can be hosted with RDBMS on-site and RedShift on the Cloud

    85 @2013 Copyright Sixth Sense Advisors

  • Deployment Options

    S Can be used as Live Archive on the Cloud

    86 @2013 Copyright Sixth Sense Advisors

  • Deployment Options

    S Can be used as ETL for Big Data on the Cloud

    87 @2013 Copyright Sixth Sense Advisors

  • Big Data Technologies

    S Apache Software Foundation: Hadoop, HBase, Zookeeper, Oozie, Avro, Pig, Sqoop, Flume, Cassandra

    S Cloudera

    S HortonWorks

    S MongoDB

    S IBM BigInsights

    S EMC Pivotal

    S Teradata Aster Big Data Appliance

    S Oracle Big Data Appliance

    S Intel Hadoop Distribution

    S MapR

    S Datastax

    S Rainstor

    S QueryIO

    @2013 Copyright Sixth Sense Advisors

  • Workloads, Architectures, Computing

    89 @2013 Copyright Sixth Sense Advisors

  • Workload

    @2013 Copyright Sixth Sense Advisors 90

    S Defined as the usage of resources (CPU, disk, and memory) by every query: ETL, ELT, BI, and analytics

    S Often misunderstood as a Database capability

    S Mostly touted by vendors as a differentiator for their platform

  • Workload

    S Loading: continuous (near real-time), batch, micro-batch

    S Queries: tactical, ad hoc, analytical, dashboard

    @2013 Copyright Sixth Sense Advisors 91

    MIXED Workload

  • What Are You Trying to Do?

    @2013 Copyright Sixth Sense Advisors 92

    (Diagram: data workloads split into OLTP (random access to a few records), OLAP (scan access to a large number of records), and combined (some OLTP and OLAP tasks); characterized further as read-heavy vs. write-heavy, by rows vs. by columns, and unstructured.)

  • Data Engineering vs. Analysis/Warehousing

    S Very different workloads and requirements

    S Warehoused data for analysis includes: data from the serving system, click log streams, syndicated feeds

    S Trend toward scalable stores with semi-structured data and map-reduce

    S The result of analysis is stored in the data warehouse

    @2013 Copyright Sixth Sense Advisors 93

  • Workload Isolation

    S Assigning the appropriate systems and processes to manage workloads

    S Creates an interchangeable infrastructure

    S Provides for better scalability

    S Will create a heterogeneous configuration; it can be deployed on a homogeneous platform if desired

    @2013 Copyright Sixth Sense Advisors 94

  • Workload Isolation

    @2013 Copyright Sixth Sense Advisors 95

    Semi-Structured Data

  • Workload Isolation

    @2013 Copyright Sixth Sense Advisors 96

    Semi-Structured Data

  • Workload Isolation

    @2013 Copyright Sixth Sense Advisors 97

    Semi-Structured Data

  • Metadata

    S The key to the castle in integrating Big Data is metadata

    S Whatever the tool, technology and technique, if you do not know your metadata, your integration will fail

    S Semantic technologies and architectures will be the way to process and integrate the Big Data.

    S Business domain experts can identify large-data patterns by associating relationships with small metadata.

    @2013 Copyright Sixth Sense Advisors 98

  • The Big Data - Data Warehouse

    99 @2013 Copyright Sixth Sense Advisors

  • Multi-Tiered Workload

    @2013 Copyright Sixth Sense Advisors 100

    Data class:  Unstructured data (file-based) | Semi-structured data (file / digital) | Structured data (digital)
    Platforms:   Hadoop / NoSQL                 | Hadoop / NoSQL                        | RDBMS

    Applications
    - Social analytics, behavior analytics, recommendation engines, sentiment analytics, fraud detection (Hadoop / NoSQL and RDBMS)
    - CRM, SalesForce, marketing (RDBMS)
    - Data mining (Hadoop / NoSQL and RDBMS)

    System characteristics
    - Unstructured: volume large; concurrency low; consolidation app-specific; availability high; updated near real time to monthly
    - Semi-structured: volume large; concurrency medium; consolidation/integration variable; availability medium; updated near real time
    - Structured: volume large; concurrency high; consolidation/integration high; availability high; updated intra-day & daily

  • Reference Architecture

    @2013 Copyright Sixth Sense Advisors 101

  • Which Tool

    Application             | Hadoop | NoSQL | Textual ETL
    ------------------------+--------+-------+------------
    Machine learning        |   x    |   x   |
    Sentiments              |   x    |   x   |     x
    Text processing         |   x    |   x   |     x
    Image processing        |   x    |   x   |
    Video analytics         |   x    |   x   |
    Log parsing             |   x    |   x   |     x
    Collaborative filtering |   x    |   x   |     x
    Context search          |        |       |     x
    Email & content         |        |       |     x

    @2013 Copyright Sixth Sense Advisors 102

  • Challenges

    S Resource availability

    S MapReduce is hard to implement

    S Speech to text: conversation context is often missing; quality of recording; accent issues

    S Visual data tagging: images; text embedded within images

    S Metadata is not available

    S Data is not trusted

    S Content management platform capabilities

    S Ontology ambiguity

    S Taxonomy integration

    @2013 Copyright Sixth Sense Advisors 103

  • Thank You

    @2013 Copyright Sixth Sense Advisors 104

    Krish Krishnan [email protected] Twitter Handle: @datagenius