Emerging Trends in Big Data TU-20008 Peter Linnell Big Data Team @ SUSE Apache Bigtop PMC [email protected] [email protected]

Emerging Trends in Big Data - SUSECON · Apache Hadoop is an open source software ecosystem, built around the core Hadoop technology. ... Emerging Trends in Big Data Streaming –




A little bit about me

● Scribus Founder and Core Team Member since 2001

● Ex-Cloudera “Kitchen Team baking Hadoop”

● OpenSUSE Community member since 2006

● OpenSUSE Board Member

● Apache Bigtop Founder and PMC

● Packager and contributor for many Open Source apps

● Day Job – SUSE Systems Engineer in Silicon Valley

● High Performance Computing / Big Data Fan


Dilbert on Big Data


Hype Cycle


Linux is the Foundation for Big Data

Scale

Low Cost

Commodity Hardware

No Lock In

“Coopetition”


Big Data – The Jargon List

Hadoop – Core Hadoop is a Data Operating System

Apache Hadoop is an open source software ecosystem, built around the core Hadoop technology.

NoSQL – A way of storing data, often in memory, so that it can be searched quickly.

Data has a temperature: Cold Data – stored nearby

Hot / Fast – in memory or intelligent caching

Live Data – Accessible to Big Data Tools

Dead Data = Offline Data

ACID - Atomicity, Consistency, Isolation, Durability

Sharding – see Wikipedia – it is too complicated :-)
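Since the jargon list defers sharding to Wikipedia, a minimal sketch may help: sharding splits a data set across several servers by mapping each record key to one shard. The helper name `shard_for`, the shard count, and the example keys below are illustrative assumptions, not taken from any particular product.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a record key to a shard index using a stable hash (hypothetical helper)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Distribute some example keys over 4 shards.
shards = {i: [] for i in range(4)}
for user in ["alice", "bob", "carol", "dave", "erin"]:
    shards[shard_for(user, 4)].append(user)
```

The same key always lands on the same shard, so a lookup only has to contact one server. Changing the shard count moves most keys, which is one reason real systems use more elaborate schemes.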


Big Data Challenges

Existing data workflows are siloed

Data is siloed – Formats, proprietary applications

Sensitive Data Concerns

Regulatory Blockages

Budget Constraints

Planning Lead Times


Big Data Challenges

● Data scrubbing is the step that is rarely mentioned, yet it can be one of the biggest challenges.

● Big Data likes memory aka storage.

● Jobs can run longer than some typical mainframe or batch “jobs”.

● Hadoop turns the computing notion of bringing data to processing power on its head. You bring the compute power to where the data resides.
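The idea of bringing compute to the data can be sketched as a scheduling preference: run a task on a node that already holds the block, and only ship the block elsewhere as a fallback. This is a simplified illustration, not Hadoop's actual scheduler; `pick_node`, the node names, and the block names are assumptions, and the sketch assumes at least one node is free.

```python
def pick_node(block_replicas, free_nodes):
    """Prefer a free node that already stores the block; otherwise use any free node."""
    for node in block_replicas:
        if node in free_nodes:
            return node, True            # data-local: no network copy needed
    return sorted(free_nodes)[0], False  # remote: the block must be shipped

# Hypothetical block placement and currently idle nodes.
replicas = {
    "block-1": ["node-a", "node-b", "node-c"],
    "block-2": ["node-d", "node-e", "node-f"],
}
free = {"node-b", "node-x"}

local_choice = pick_node(replicas["block-1"], free)
remote_choice = pick_node(replicas["block-2"], free)
```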


Examples of Big Data volumes

• Scientific measurements (e.g. particle collision results from the Large Hadron Collider at CERN)

• Financial data like stock information, share-price statistical data, stock related press coverage, etc.

• Medical data: genome databases, patient files in hospitals, information about pharmaceuticals

• Indexed web or social media content

• Environmental Records - Weather

• Webserver Access-logs

• Sales data


Five main use cases for Big Data

• Transparency: insights into ongoing business operations

• Decision-testing: What happened (will happen) when (if) we made (make) this decision?

• Individualization in real time: tailoring offerings and services to customer wishes in real time in order to increase customer satisfaction and reduce customer churn

• Intelligent process control and automation

• Innovative data-driven business models

From “Big Data in Action” - http://en.sap.info/big-data-in-action/82754?source=email-en-sapinfo-newsletter-20121204


How to distinguish between several kinds of Big Data?

• Amount of data: large (n terabytes) or very large (n petabytes) or gigantic (n exabytes)?

• Structured data (e.g. relational, column separated) or unstructured data (e.g. documents, webpages)?

• How complex is the data model?

• Transactional or non-transactional?

• Is full data integrity (ACID) required?

• Usage patterns: Just lots of “reads” or also many “inserts”, “updates” and “deletions”?

• Usage performance: Realtime, short delays, long delays?

• Combination of several questions from above


Hadoop vs SQL (RDBMS)

Hadoop:

• No predefined schema

• Fast Loading

• Simpler Data Structures

• Flexible and Agile

SQL (RDBMS):

• Schema defined in advance

• Data transformed

• Fast Reading

• Standards/Governance

The real innovation is the capability to explore original raw data


When to pick Hadoop vs RDBMS

Hadoop:

• Scalability is important

• Structured or Unstructured

• Complex Data Process

RDBMS:

• Speed is important

• ACID Transactions

• Interactive Analytics

A sports car is faster, but a truck can carry more.


Apache Hadoop Strengths

Huge data volumes

Unstructured data

Reliable

Scalable

Lowest cost

Open source

No hardware lock in

Batch processing


Apache Hadoop Weaknesses

Not very efficient at small scale

Real time is challenging at the moment (WIP)

Requires skilled engineers and operations

Less mature than SQL

Weakly defined user roles in data access model (WIP)


What About NoSQL/NewSQL?

Can be a cost effective replacement or supplement for traditional proprietary databases.

There are several, e.g. MongoDB, Accumulo and Cassandra, each trying to solve different problems. Each has strengths and weaknesses to evaluate.


Linux Challenges

Scalability – We're hitting the limit of physics with current technology.

The need for better fault tolerance in the O/S. Now helped by live kernel patching in Linux 4.1.

The future will bring us exascale challenges. Think 3-7 years down the road. 10^18

Java scalability ?

Stutter affects Hadoop


Emerging Trends in Big Data

Streaming – accessing data in near real time for capture and analysis.

“Fast Data” - in memory or intelligent caching. E.g. Spark, SAP HANA, HP Haven.

Connectors are becoming ubiquitous

Machine learning is becoming more accessible.

Despite lesser performance, Cloud is becoming a more usable option for production.


Evaluation Thoughts

Is Big Data a solution in search of a problem?

Evaluate the need for real time data vs. near real time.

Do we have the right questions to ask?

How can Big Data workflows be integrated with our existing infrastructure?

What other agencies might have useful data?

Pilot Pilot Pilot...




SUSE Big Data Partner Ecosystem

• Integrated solutions

  ‒ SAP HANA

  ‒ Teradata Aster Big Analytics Appliance

• Hadoop Distributions

  ‒ Intel

  ‒ Cloudera

  ‒ Hortonworks

  ‒ WANdisco

• Database

  ‒ Intersystems CACHÉ


Bigtop

• Packaging, QA testing and integration stack for Apache Hadoop components

• Made up of engineers from most of the Hadoop distros: Cloudera, Hortonworks and WANdisco, along with SUSE and independent contributors

• Almost unique among other Apache projects in that it integrates other projects as its goal

• All major Hadoop distros base their product on Bigtop


Why SUSE for Big Data ?

• SUSE has more than a decade of leadership in HPC/Supercomputing for Linux: an estimated 50% of the Top 500. Titan, the biggest, runs SLES.

• SLES12 has the most modern optimized kernel for Big Data workloads.

• We have Tier 1 support and relationships with all major open source Hadoop Distributors.

• Competition sees Big Data as an opportunity to sell proprietary solutions.

• We care about this market.


Why SUSE for Big Data ?

• Capable of supporting 64 TB, yes TB, of RAM on one system.

• SLES12 has the most modern optimized kernel for Big Data workloads.

• Excellent deployment and management tools.

• Competition sees Big Data as an opportunity to sell proprietary solutions.

• We care about this market.


SUSE & Hortonworks

Joint Flyer

Partner Site

Modern Data Architecture


SUSE Big Data Lab

Big Data Cluster in Provo, UT for:

• Benchmarking

• Software certification

• Integration / test

• Reference architectures

• Demo system

• Remotely accessible


Learn More

Visit our web site www.suse.com/solutions/platform.html#big_data

Read our whitepapers: "Deploying Hadoop on SLES" and "Deploy and Manage Hadoop with SUSE Manager"

Contact us [email protected]


Questions ?

[email protected]


Corporate Headquarters
Maxfeldstrasse 5
90409 Nuremberg
Germany

+49 911 740 53 0 (Worldwide)
www.suse.com

Join us on:
www.opensuse.org


Appendix


Hadoop Core Components


Typical Hadoop Distribution


How Hadoop Works at Its Core

[Figure: HDFS architecture diagram. Labels: Namenode; Datanodes on Rack 1 and Rack 2; Clients; Read, Write, Replication; Metadata ops; Block ops; Blocks; Metadata (name, replicas, …): /home/foo/data, 3, ...]


Hadoop is only one part, but an important part

• The compute layer of big data

• Supports the running of applications on large clusters of commodity hardware.

• Provides a distributed file system (HDFS) that stores data on the compute nodes.

• Enables applications to work with thousands of computers and petabytes of data.

• Lots of momentum – IBM, Microsoft, Oracle, SAP, EMC, HP, Teradata, have built solutions on Hadoop or at least connectors to Hadoop

• Ecosystem of Hadoop players: Intel, Cloudera, HortonWorks, WANdisco, MapR, Greenplum

• Apache support


NameNode

• The NameNode (NN) stores all metadata

• Information about file locations in HDFS

• Information about file ownership and permissions

• Names of the individual blocks

• Location of the blocks

• Metadata is stored on disk and read when the NameNode daemon starts


NameNode (cont)

• The on-disk metadata file is named fsimage

• Block locations are not stored in fsimage

• Changes to the metadata are made in RAM

• Changes are also written to a log file on disk called edits

• Each Hadoop cluster has a single NameNode

• The Secondary NameNode is not a fail-over NameNode

• The NameNode is a single point of failure (SPOF)


Secondary NameNode (master)

• The Secondary NameNode (2NN) is not a fail-over NameNode!

• It performs memory-intensive administrative functions for the NameNode.

• Secondary NameNode periodically combines a prior file system snapshot and editlog into a new snapshot

• New snapshot is transmitted back to the NameNode

• Secondary NameNode should run on a separate machine in a large installation

• It requires as much RAM as the NameNode
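The checkpoint step, combining a prior snapshot with the edit log into a new snapshot, can be sketched as replaying logged operations onto a copy of the snapshot. This is a toy model under stated assumptions: the real fsimage and edits files are not Python dictionaries, and the operation names `create` and `delete` and the function `apply_edits` are invented for illustration.

```python
def apply_edits(fsimage: dict, edits: list) -> dict:
    """Replay logged edits onto a copy of the snapshot and return the new snapshot."""
    new_image = dict(fsimage)
    for op, path, value in edits:
        if op == "create":
            new_image[path] = value
        elif op == "delete":
            new_image.pop(path, None)
    return new_image

# Prior snapshot and the edits logged since it was taken (hypothetical content).
fsimage = {"/home/foo/data": {"replicas": 3}}
edits = [
    ("create", "/home/foo/logs", {"replicas": 3}),
    ("delete", "/home/foo/data", None),
]

checkpoint = apply_edits(fsimage, edits)
```

Once the new snapshot has been sent back to the NameNode, the old edit log is no longer needed for recovery, which keeps NameNode restarts short.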


DataNode

• DataNode (slave)

• JobTracker (master) / exactly one per cluster

• TaskTracker (slave) / one or more per cluster


Running Jobs

• A client submits a job to the JobTracker

• JobTracker assigns a job ID

• Client calculates the input splits for the job

• Client adds job code and configuration to HDFS

• The JobTracker creates a Map task for each input split

• TaskTrackers send periodic “heartbeats” to JobTracker

• These heartbeats also signal readiness to run tasks

• JobTracker then assigns tasks to these TaskTrackers


Running Jobs

• The TaskTracker then forks a new JVM to run the task

• This isolates the TaskTracker from bugs or faulty code

• A single instance of task execution is called a task attempt

• Status info periodically sent back to JobTracker

• Each block is stored on multiple different nodes for redundancy

• Default is three replicas
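The job flow described on the "Running Jobs" slides (input splits, one Map task per split, then a reduce step) can be imitated in a single process. The word-count job below is a minimal sketch, not Hadoop code: `map_task`, `reduce_task`, `run_job`, and the choice of two lines per split are assumptions made for illustration, and no JobTracker or TaskTracker is involved.

```python
from collections import defaultdict

def map_task(split):
    """Map phase: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]

def reduce_task(word, counts):
    """Reduce phase: sum the counts collected for one word."""
    return word, sum(counts)

def run_job(lines, split_size=2):
    # The client computes input splits; here a split is a fixed number of lines.
    splits = [lines[i:i + split_size] for i in range(0, len(lines), split_size)]
    # One map task per split (in Hadoop these would run on TaskTrackers).
    mapped = [pair for split in splits for pair in map_task(split)]
    # Shuffle: group intermediate values by key before reducing.
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)
    return dict(reduce_task(word, counts) for word, counts in grouped.items())

result = run_job(["big data", "big clusters", "data data"])
```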


Anatomy of a File Write

1. Client connects to the NameNode

2. NameNode places an entry for the file in its metadata, returns the block name and list of DataNodes to the client

3. Client connects to the first DataNode and starts sending data

4. As data is received by the first DataNode, it connects to the second and starts sending data

5. Second DataNode similarly connects to the third

6. Ack packets from the pipeline are sent back to the client

7. Client reports to the NameNode when the block is written
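The write steps above can be simulated in memory to show how one block ends up on every DataNode in the pipeline. This is only a sketch: `write_block`, the node names, and the block name are hypothetical, and real HDFS streams packets between nodes concurrently rather than looping over them.

```python
def write_block(block, pipeline, storage):
    """Send a block down a DataNode pipeline and collect acks (in-memory simulation)."""
    acks = []
    for node in pipeline:
        # Each node stores the block and forwards it to the next node in line.
        storage.setdefault(node, []).append(block)
        acks.append(node)
    # Acks travel back up the pipeline, so the last node's ack comes first.
    return list(reversed(acks))

storage = {}
pipeline = ["datanode-1", "datanode-2", "datanode-3"]  # list returned by the NameNode
acks = write_block("blk_0001", pipeline, storage)

# The client reports success to the NameNode only when every replica is acknowledged.
block_written = len(acks) == len(pipeline)
```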


Hadoop Core Operations – Review

[Figure: HDFS architecture diagram. Labels: Namenode; Datanodes on Rack 1 and Rack 2; Clients; Read, Write, Replication; Metadata ops; Block ops; Blocks; Metadata (name, replicas, …): /home/foo/data, 3, ...]


Expanding on Core Hadoop


Hive, HBase and Sqoop

Hive

‒ High level abstraction on top of MapReduce

‒ Allows users to query data using HiveQL, a language very similar to standard SQL

HBase

‒ A distributed, sparse, column oriented data store

Sqoop

‒ The Hadoop ingestion engine – the basis of connectors like Teradata, Informatica, DB2 and many others.


Oozie

• Workflow scheduler system to manage Apache Hadoop jobs

• Workflow jobs are Directed Acyclic Graphs (DAGs) of actions

• Coordinator jobs are recurrent Workflow jobs triggered by time (frequency) and data availability

• Integrated with the rest of the Hadoop stack

  ‒ Supports several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp)

  ‒ Also supports system specific jobs (such as Java programs and shell scripts)
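To illustrate what a DAG of actions means for execution order, the sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to derive a valid run order from action dependencies. Oozie itself defines workflows in XML, so this is an analogy only, and the action names are invented for the example.

```python
from graphlib import TopologicalSorter

# Each action maps to the set of actions that must finish before it starts.
workflow = {
    "sqoop-import": set(),
    "pig-clean": {"sqoop-import"},
    "hive-report": {"pig-clean"},
    "distcp-archive": {"pig-clean"},
}

# static_order() yields every action after all of its prerequisites.
order = list(TopologicalSorter(workflow).static_order())
```

Because the graph is acyclic, such an order always exists; a cycle would make `TopologicalSorter` raise `graphlib.CycleError`.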


Flume

• Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data

• Simple and flexible architecture based on streaming data flows

• Robust and fault tolerant with tunable reliability mechanisms and many fail-over and recovery mechanisms

• Uses a simple extensible data model that allows for online analytic application


Mahout

• The Apache Mahout™ machine learning library's goal is to build scalable machine learning libraries

• Currently Mahout supports mainly three use cases:

  ‒ Recommendation mining takes users' behavior and from that tries to find items users might like

  ‒ Clustering, for example, takes text documents and groups them into groups of topically related documents

  ‒ Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabeled documents to the (hopefully) correct category


Whirr™

• Set of libraries for launching Hadoop instances on clouds

• A cloud-neutral way to run services

  ‒ You don't have to worry about the idiosyncrasies of each provider.

• A common service API

  ‒ The details of provisioning are particular to the service.

• Smart defaults for services

  ‒ You can get a properly configured system running quickly, while still being able to override settings as needed


Giraph

• Iterative graph processing system built for high scalability

• Currently used at Facebook to analyze the social graph formed by users and their connections


Apache Pig

• Platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs

• Language layer currently consists of a textual language called Pig Latin, which has the following key properties:

‒ Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.

‒ Extensibility. Users can create their own functions to do special-purpose processing.


Ambari

• Project goal is to develop software that simplifies Hadoop cluster management

• Provisioning a Hadoop Cluster

• Managing a Hadoop Cluster

• Monitoring a Hadoop Cluster

  ‒ Ambari leverages well known technology like Ganglia and Nagios under the covers.

• Provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs


HUE – Hadoop User Experience

• Graphical front end to Hadoop tools for launching, editing and monitoring jobs

• Provides shortcuts to various command line shells for working directly with components

• Can be integrated with authentication services like Kerberos or Active Directory


R Statistical Language

● Statistical Language – Open Source Licensed

● Similar to Octave or MATLAB

● Not currently packaged for SLES or openSUSE


Shark/Spark

• Spark is a real time query framework developed at Berkeley AMP.

• Spark was initially developed for two applications where placing data in memory helps: iterative algorithms, which are common in machine learning, and interactive data mining.

• Shark uses Spark to process real time queries in Hive.

• Up to 100x faster than MapReduce in some cases.

• Going into most Hadoop distros now or soon.


Zookeeper

• An orchestration stack.

• Centralized service for:

  ‒ Maintaining configuration information

  ‒ Naming

  ‒ Providing distributed synchronization

  ‒ Delivering group services.


NoSQL

Cassandra

• Enterprise provider is Datastax

• Keyspace -> container for column families

• High Performance, Highly Scalable, Available - No SPOF

• Replication by hashing data between nodes

• Query by Column - Requires index

• SQL-Like

• Native support for Apache Hadoop

• Flexible Schema -> Change at runtime.

• No transactions, no JOINs
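"Replication by hashing data between nodes" can be pictured as a hash ring: each node owns a position on the ring, a key is hashed to a position, and the next few nodes clockwise hold its replicas. The sketch below is a generic consistent-hashing illustration and does not reproduce Cassandra's actual partitioner; the `Ring` class, the `token` helper, and the node names are assumptions.

```python
import bisect
import hashlib

def token(value: str) -> int:
    """Hash a string to a position on the ring (hypothetical helper)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")

class Ring:
    def __init__(self, nodes, replicas=3):
        self.replicas = min(replicas, len(nodes))
        # Sorted (position, node) pairs form the ring.
        self.tokens = sorted((token(node), node) for node in nodes)

    def nodes_for(self, key):
        """Return the nodes holding the replicas of a key, walking clockwise."""
        start = bisect.bisect_right(self.tokens, (token(key), ""))
        count = len(self.tokens)
        return [self.tokens[(start + i) % count][1] for i in range(self.replicas)]

ring = Ring(["node-a", "node-b", "node-c", "node-d"], replicas=3)
owners = ring.nodes_for("user:42")
```

Because every node can compute the owners of a key on its own, no central lookup service is needed, which fits the "No SPOF" point on the slide.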


NoSQL (cont)

Accumulo

• Like HBase, a BigTable clone. Join-less.

• Runs on top of Hadoop. MapReduce with Hadoop.

• Used for scanning large two-dimensional tables

• Accumulo, HBase and Cassandra are part of the Hadoop ecosystem. HBase supported by the Hadoop provider.

• Hugely scalable NoSQL database developed at NSA.

• Only NoSQL DB with cell level locking and security.


NoSQL (cont)

MongoDB

• Enterprise provider MongoDB Inc, was known as 10gen

• Non-Relational DataStore for JSON Documents

• {"name":"Alejandro"}

• {"name":"Alejandro", "Age": 31, "likes":["soccer","Golf", "Beach"]}

• Schemaless, container vs table, document vs row

• Does not support JOINs or transactions (across multiple documents).

• Not as fast as memcached, not as functional as an RDBMS. Sits in the middle.
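The schemaless document model can be tried out with plain JSON and Python's standard library, without a database. In the sketch below two documents with different fields sit in the same collection; the `find` helper is a hypothetical stand-in for a query and is not the MongoDB driver API.

```python
import json

# Two documents in one collection; the second has fields the first lacks.
people = [
    json.loads('{"name": "Alejandro"}'),
    json.loads('{"name": "Alejandro", "Age": 31, "likes": ["soccer", "Golf", "Beach"]}'),
]

def find(collection, **criteria):
    """Return documents whose fields equal all given criteria (hypothetical helper)."""
    return [doc for doc in collection
            if all(doc.get(field) == value for field, value in criteria.items())]

by_name = find(people, name="Alejandro")  # matches both documents
by_age = find(people, Age=31)             # matches only the richer document
```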


NoSQL (cont - MongoDB)

• Provides the "mongo" shell - JavaScript interpreter, tools and drivers for easy access to API.

• Supports replication and sharding.

• Supports an aggregation framework, mapReduce, Hadoop plugin.

• Document size Max 16MB -> GridFS to store big data + metadata.


Web UI Ports for Users

Daemon                  Default Port  Configuration parameter
NameNode                50070         dfs.http.address
DataNode                50075         dfs.datanode.http.address
Secondary NameNode      50090         dfs.secondary.http.address
Backup/Checkpoint Node  50105         dfs.backup.http.address
JobTracker              50030         mapred.job.tracker.http.address
TaskTracker             50060         mapred.task.tracker.http.address
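For scripting against these web UIs, the default ports listed above can be kept in a small lookup table. The sketch assumes the defaults have not been overridden in the cluster configuration; the function name `web_ui_url` and the host name are illustrative.

```python
# Default web UI ports as listed on this slide.
WEB_UI_PORTS = {
    "NameNode": 50070,
    "DataNode": 50075,
    "Secondary NameNode": 50090,
    "Backup/Checkpoint Node": 50105,
    "JobTracker": 50030,
    "TaskTracker": 50060,
}

def web_ui_url(host: str, daemon: str) -> str:
    """Build the default web UI URL for a daemon on a given host."""
    return f"http://{host}:{WEB_UI_PORTS[daemon]}/"

namenode_url = web_ui_url("master.example.com", "NameNode")
```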


http://bigdatauniversity.com/

https://ccp.cloudera.com/display/DOC/Documentation

http://thecloudtutorial.com/hadoop-tutorial.html

http://www.saphana.com/community/learn

http://developer.yahoo.com/hadoop/tutorial/

http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/


Resources

• SUSE Big Data website

  ‒ https://www.suse.com/solutions/platform.html#big_data

• SUSE Big Data Flyer

  ‒ http://www.novell.com/docrep/2013/03/suse_linux_enterprise_foundation_for_big_data_solution.pdf

• SUSE Big Data Contacts

  ‒ Business: Frank Rego [email protected]

  ‒ Technical: Peter Linnell [email protected]



Unpublished Work of SUSE. All Rights Reserved.

This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.

General Disclaimer

This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.