Gizillions Thingsternet Living online Big Data… · Create Tables in Hive • Hive is a batch query tool, but also the keeper of table structures • Remember: structure is stored

Page 1

A World of Data

“Thingsternet”

Compete by asking bigger questions

Living online

Big Data

“Gizillions” of mobile transactions

Page 2

$

$$$...

Page 3

???

Page 4

SLA

Page 5
Page 6

Yaaaay – Hadoop to Save the Daaaay!!

• But it’s not always easy to tame an elephant…

Page 7
Page 8

Introducing “DataCo”

“We don’t really have a big data problem…”

Diagram: CUSTOMERS, WEB CLIENT, WEB SHOP BACKEND, WEB SHOP DATABASE

~100GB

Product and Customer Transaction Data

Page 9

> 6 months?

Introducing “DataCo”

Diagram: CUSTOMERS, WEB CLIENT, WEB SHOP BACKEND, WEB SHOP DATABASE

• Mobile App Data
• Web App Click Stream Data
• IT/Ops and InfoSec Data
• Product and Customer Transaction Data

Page 10

Active Archive / Self Serve Ad-hoc BI

• Top sold products last 6, 12, and 18 months?

Diagram: Hive, Impala, SQL, HDFS

Page 11

Using Sqoop to Ingest Data from MySQL

• Sqoop is a bi-directional structured data ingest tool

• Simple UI in Hue, more commonly used from the shell

$ sqoop import-all-tables -m 12 \
    --connect jdbc:mysql://my.sql.host:3306/retail_db \
    --username=dataco_dba --password='yow!2014' \
    --compression-codec=snappy --as-avrodatafile \
    --warehouse-dir=/user/hive/warehouse

$ sqoop import -m 12 \
    --connect jdbc:mysql://my.sql.host:3306/retail_db \
    --username=dataco_dba --password='yow!2014' \
    --table my_cool_table --hive-import --as-parquetfile
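
When the import is driven from a script rather than typed at the shell, the same call can be assembled as an argument list. The following is a minimal sketch that uses only the flags shown on this slide; the helper name `sqoop_import_all_tables` is made up for illustration and is not part of Sqoop.

```python
import shlex


def sqoop_import_all_tables(jdbc_url, username, password, warehouse_dir, mappers=12):
    """Build the argument list for the `sqoop import-all-tables` call on this slide."""
    return [
        "sqoop", "import-all-tables",
        "-m", str(mappers),
        "--connect", jdbc_url,
        f"--username={username}",
        f"--password={password}",
        "--compression-codec=snappy",
        "--as-avrodatafile",
        f"--warehouse-dir={warehouse_dir}",
    ]


argv = sqoop_import_all_tables(
    "jdbc:mysql://my.sql.host:3306/retail_db",
    "dataco_dba",
    "yow!2014",
    "/user/hive/warehouse",
)

# shlex.join quotes any argument containing shell-special characters
# (here the "!" in the password), so the printed line can be pasted into a shell.
print(shlex.join(argv))
```

Passing `argv` to `subprocess.run(argv, check=True)` would run the import without going through a shell at all, which sidesteps quoting issues entirely.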

Page 12

Create Tables in Hive

• Hive is a batch query tool, but also the keeper of table structures

• Remember: structure is stored _separate_ from data

hive> CREATE EXTERNAL TABLE products
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    > STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    > LOCATION 'hdfs:///user/hive/warehouse/products'
    > TBLPROPERTIES ('avro.schema.url'='hdfs://namenode_dataco/user/examples/products.avsc');
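
Note that the statement lists no columns: with the Avro SerDe, the column structure comes from the schema file named in `avro.schema.url`. The sketch below illustrates that idea. The schema contents and field names are hypothetical, since the real `products.avsc` is not shown on the slide.

```python
import json

# Hypothetical contents of products.avsc; an Avro schema file is plain JSON.
PRODUCTS_AVSC = """
{
  "type": "record",
  "name": "products",
  "fields": [
    {"name": "product_id", "type": "int"},
    {"name": "product_name", "type": "string"},
    {"name": "product_price", "type": "float"}
  ]
}
"""


def column_names(avsc_text):
    """Return the column names a table backed by this schema would expose."""
    schema = json.loads(avsc_text)
    return [field["name"] for field in schema["fields"]]


print(column_names(PRODUCTS_AVSC))
```

Because the structure lives in the schema file and the table definition, the data files under the table's `LOCATION` can be replaced or appended to without redefining the table.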

Page 13

Use Impala via Hue to Query
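
The slide itself is a screenshot, so the query is not in the text. As a rough sketch of the kind of query behind "top sold products last 6, 12, and 18 months", the example below runs a similar aggregation in SQLite as a stand-in for Impala. The `order_items` table, its columns, and the sample rows are assumptions for illustration, and Impala's date functions differ from SQLite's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER, product_name TEXT);
    CREATE TABLE order_items (product_id INTEGER, quantity INTEGER, order_date TEXT);
    INSERT INTO products VALUES (1, 'tent'), (2, 'kayak'), (3, 'boots');
    INSERT INTO order_items VALUES
        (1, 5, '2014-11-01'), (2, 1, '2014-10-15'),
        (3, 9, '2014-01-20'), (2, 7, '2013-09-05');
""")

TOP_SOLD = """
    SELECT p.product_name, SUM(oi.quantity) AS units
    FROM order_items oi
    JOIN products p ON p.product_id = oi.product_id
    WHERE oi.order_date >= date(:as_of, :window)
    GROUP BY p.product_name
    ORDER BY units DESC
    LIMIT 10
"""


def top_sold(months, as_of="2014-12-01"):
    """Top products by units sold in the `months` before the `as_of` date."""
    params = {"as_of": as_of, "window": f"-{months} months"}
    return conn.execute(TOP_SOLD, params).fetchall()


for months in (6, 12, 18):
    print(months, top_sold(months))
```

Widening the window changes the ranking in this toy data, which is the point of keeping more than a few months of history queryable.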

Page 14

$

$$$...

Page 15

Correlate Multi-type Data Sets

• Top viewed products last 6, 12, and 18 months?

Diagram: Hive, Impala, SQL, Flume, HDFS

Page 16

Ingest Data Using Flume

• Pub/sub ingest framework

• Flexible multi-level (mini-transformation) pipeline

Diagram (FLUME AGENT):

• FLUME SOURCE: continuously generated events, e.g. syslog, tweets
• Optional Logic
• FLUME SINK: Flume Agent, HDFS, HBase, Solr, or other destination
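
To make the source, optional logic, and sink roles concrete, here is a toy in-process sketch of one agent. It is not Flume code: the function names are invented, and a plain queue stands in for the channel that buffers events between source and sink.

```python
from queue import Queue


def source(lines):
    """Stand-in for a Flume source: wraps each incoming line as an event."""
    for line in lines:
        yield {"body": line}


def drop_blank(events):
    """Optional logic: a mini-transformation that filters out empty events."""
    for event in events:
        if event["body"].strip():
            yield event


def run_agent(lines, sink):
    """Wire source -> optional logic -> channel -> sink."""
    channel = Queue()  # buffers events between the source side and the sink side
    for event in drop_blank(source(lines)):
        channel.put(event)
    while not channel.empty():
        sink(channel.get())


delivered = []
run_agent(["GET /a", "", "GET /b"], delivered.append)
print(delivered)
```

In a real deployment the sink would write to HDFS, HBase, Solr, or another agent, which is how multi-level pipelines are chained.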

Page 17

Create Hive Tables over Log Data

• New use case, new data

• Create new tables over semi-structured log data

CREATE EXTERNAL TABLE intermediate_access_logs (
    ip STRING, date STRING, method STRING, url STRING, http_version STRING,
    code1 STRING, code2 STRING, dash STRING, user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    "input.regex" = "([^ ]*) - - \\[([^\\]]*)\\] \"([^\ ]*) ([^\ ]*) ([^\ ]*)\" (\\d*) (\\d*) \"([^\"]*)\" \"([^\"]*)\"",
    "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s")
LOCATION '/user/hive/warehouse/original_access_logs';

CREATE EXTERNAL TABLE tokenized_access_logs (
    ip STRING, date STRING, method STRING, url STRING, http_version STRING,
    code1 STRING, code2 STRING, dash STRING, user_agent STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/tokenized_access_logs';

ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;

INSERT OVERWRITE TABLE tokenized_access_logs SELECT * FROM intermediate_access_logs;

exit;
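
What the two tables do to a single log line can be tried outside Hive. This sketch applies the same pattern as `input.regex` (with the Hive string escaping removed) and joins the nine captured groups with commas, as the delimited table stores them. The sample log line is made up.

```python
import re

# Same pattern as "input.regex" in the table definition, unescaped for Python.
ACCESS_LOG = re.compile(
    r'([^ ]*) - - \[([^\]]*)\] "([^ ]*) ([^ ]*) ([^ ]*)" (\d*) (\d*) "([^"]*)" "([^"]*)"'
)


def tokenize(line):
    """Return the nine captured fields as one comma-delimited row, or None if no match."""
    match = ACCESS_LOG.fullmatch(line)
    if match is None:
        return None
    # Plain join, no quoting: a comma inside a field would split it downstream.
    return ",".join(match.groups())


# Made-up log line for illustration.
sample = (
    '10.0.0.1 - - [14/Jun/2014:10:30:13 -0400] "GET /product/42 HTTP/1.1" '
    '200 1234 "-" "Mozilla/5.0"'
)
print(tokenize(sample))
```

The groups line up with the column order `ip, date, method, url, http_version, code1, code2, dash, user_agent`. Lines that do not match the pattern yield no usable fields, so it is worth testing the pattern on real samples before loading.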

Page 18

Use Impala and Hue to Query

Missing!!!
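
The query on this slide is only visible in the screenshot. A minimal sketch of the underlying idea, counting views per product URL from tokenized log rows, might look like this. The rows and the `/product/` URL prefix are assumptions for illustration.

```python
from collections import Counter

# Made-up rows in the comma-delimited layout of tokenized_access_logs.
rows = [
    "10.0.0.1,14/Jun/2014:10:30:13 -0400,GET,/product/42,HTTP/1.1,200,1234,-,Mozilla/5.0",
    "10.0.0.2,14/Jun/2014:10:31:02 -0400,GET,/product/42,HTTP/1.1,200,1234,-,Mozilla/5.0",
    "10.0.0.3,14/Jun/2014:10:32:45 -0400,GET,/product/7,HTTP/1.1,200,987,-,Mozilla/5.0",
    "10.0.0.1,14/Jun/2014:10:33:10 -0400,GET,/cart,HTTP/1.1,200,321,-,Mozilla/5.0",
]


def top_viewed(rows, prefix="/product/", n=10):
    """Count hits per URL (the fourth column) and return the n most viewed product URLs."""
    urls = (row.split(",")[3] for row in rows)
    return Counter(url for url in urls if url.startswith(prefix)).most_common(n)


print(top_viewed(rows))
```

This is the Python equivalent of a `GROUP BY url ... ORDER BY count DESC` query; comparing its result with the top sold list is what surfaces products that are viewed often but rarely bought.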

Page 19

$

$$$...

Page 20

!!!

Page 21

Multi-Use-Case Data Hub

• Why are sales dropping over the last 3 days?

Diagram: Flume, Solr, Search Queries, HDFS

Page 22

Create your Index

• Create an empty Solr index configuration directory

• Edit the Solr Schema file to have the fields you want to search over

$ solrctl --zk <ALL YOUR ZK IPs>/solr instancedir --generate live_logs_dir

<field name="_version_" type="long" indexed="true" stored="true" multiValued="false" />

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />

<field name="ip" type="text_general" indexed="true" stored="true"/>

<field name="request_date" type="date" indexed="true" stored="true"/>

Page 23

Create your Index cont.

• Upload your configuration for a collection to ZooKeeper

• Tell Solr to start serving up a collection and start indexing data for it

$ solrctl --zk <ALL YOUR ZK IPs>/solr instancedir --create live_logs ./live_logs_dir

$ solrctl --zk <ALL YOUR ZK IPs>/solr collection --create live_logs -s 4

Page 24

Flume and Morphline Pipeline

Page 25

Flume with Morphlines Configured

• Configure Flume to use your Morphlines and post parsed data to Solr

….

# Describe solrSink
agent1.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent1.sinks.solrSink.channel = memoryChannel
agent1.sinks.solrSink.batchSize = 1000
agent1.sinks.solrSink.batchDurationMillis = 1000
agent1.sinks.solrSink.morphlineFile = /opt/examples/flume/conf/morphline.conf
agent1.sinks.solrSink.morphlineId = morphline
agent1.sinks.solrSink.threadCount = 1

…..
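
The morphline itself lives in `morphline.conf` and is not shown in the deck. Conceptually, it turns each raw log event into a document with the fields declared in the Solr schema. The Python sketch below imitates that step for the `id`, `ip`, and `request_date` fields only. The hashing scheme for `id` and the sample line format are assumptions, not taken from the talk.

```python
import hashlib
import re
from datetime import datetime, timezone

# Only the leading part of the access-log line is needed for ip and date.
LOG_PREFIX = re.compile(r'([^ ]*) - - \[([^\]]*)\] ')


def to_solr_doc(line):
    """Parse one access-log line into a dict with id, ip and request_date fields."""
    match = LOG_PREFIX.match(line)
    if match is None:
        return None
    ip, raw_date = match.groups()
    when = datetime.strptime(raw_date, "%d/%b/%Y:%H:%M:%S %z").astimezone(timezone.utc)
    return {
        # Hypothetical id scheme: a hash of the raw line.
        "id": hashlib.sha1(line.encode("utf-8")).hexdigest(),
        "ip": ip,
        # Solr date fields expect UTC timestamps in ISO 8601 form.
        "request_date": when.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


doc = to_solr_doc('10.0.0.1 - - [14/Jun/2014:10:30:13 -0400] "GET / HTTP/1.1" 200 1 "-" "x"')
print(doc)
```

In the real pipeline this parsing runs inside the `MorphlineSolrSink`, so events are indexed as they arrive rather than in a separate batch step.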

Page 26

Dynamic Search UI in Hue

Page 27

Shared Storage!!

Page 28
Page 29

How Do We Improve Healthcare?

Challenges
• Only 3 days of monitoring data capacity
• No ability to correlate large research data sets
• No ability to ad-hoc study environment impact

Solution
• 50GB monitor data per week
• 2TB capacity
• Sqoop, Solr, Impala, HDFS

Benefits
• Ad-hoc and faster insight
• Reduced asthma related ICU visits
• Total license fees < 3 processor licenses for EDW

Page 30

How Do We Feed The World?

Global Warming Changes Conditions

How do we improve quality and resistance of crops and seeds in a variety of global and rapidly changing environments?

Page 31

How Do We Feed The World?

Challenges
• Time to market for each new product: 5-10 years
• 1,000+ scientists working in silos
• Data processing bottlenecks slow development

Solution
• PB-scale
• HBase, HDFS, Solr, MapReduce, Sqoop, Impala, …

Benefits
• Streamlined processes
• Time to results reduced from years to months!!!

Page 32

Challenges
• 100-200 B events/month
• Real-time multi-type event correlation complex
• No way to do ad-hoc game analytics

Solution
• ~20 nodes
• 256GB RAM servers
• Flume, Solr, Impala, HDFS

Benefits
• Ad-hoc insight on feature trends
• Significant TTR reduction
• ROI realized in the 1st week

Page 33

Learn More?

• Stop by the Cloudera booth today!

• Play on your own: cloudera.com/live

• Get training: http://cloudera.com/content/cloudera/en/training.html

• Join the Community: [email protected]

• Connect with me: @EvaAndreasson

Page 34

Hope You Enjoyed This Talk!

Don’t forget to VOTE!!!

Page 35
Page 36

Bonus Track…

Page 37

My Advice for the Road…

Page 38

Try Something Simple First…

Page 39

Decide what to Cook!

Page 40

Collect All Ingredients

Page 41

Use the Right Tool for the Right Task

Page 42

Prepare All Ingredients

Page 43

Don’t Forget the Importance of Visualization!

Page 44
Page 45

Challenges
• Tons of information locked away in medical records & scientific studies
• Different sources & systems can’t “talk” to each other

Solution
• Integration & storage of multi-structured experimental data
• Data access & exploration via Impala, R, HBase, Solr, Hive

Benefits
• Faster, cheaper genome sequencing
• Searchable index of variant call data for biologists to explore

Page 46

Using Sqoop to Ingest Data from MySQL

• View your imported “tables”

• View all Avro files constituting a table

$ hadoop fs -ls /user/hive/warehouse/

$ hadoop fs -ls /user/hive/warehouse/mytablename/

Page 47

Hadoop - A New Approach to Data Management

Schema on Read

Distributed Storage

Distributed Processing

Active Archive

Cost-Efficient Offload

Flexible Analytics

Page 48

Hadoop: Storage & Batch Processing

The Birth of the Data Lake

Page 49

A Rapidly Growing Ecosystem

• 2006: Core Hadoop
• 2007: Core Hadoop
• 2008: Core Hadoop, HBase, ZooKeeper, Mahout
• 2009: Core Hadoop, HBase, ZooKeeper, Mahout, Pig, Hive
• 2010: Core Hadoop, HBase, ZooKeeper, Mahout, Pig, Hive, Flume, Avro, Sqoop
• 2011: Core Hadoop, HBase, ZooKeeper, Mahout, Pig, Hive, Flume, Avro, Sqoop, Bigtop, Oozie
• 2012: Core Hadoop, HBase, ZooKeeper, Mahout, Pig, Hive, Flume, Avro, Sqoop, Bigtop, Oozie, Hue, Impala, Parquet
• 2013: Core Hadoop, HBase, ZooKeeper, Mahout, Pig, Hive, Flume, Avro, Sqoop, Bigtop, Oozie, Hue, Impala, Parquet, Solr, Sentry
• 2014: Core Hadoop, HBase, ZooKeeper, Mahout, Pig, Hive, Flume, Avro, Sqoop, Bigtop, Oozie, Hue, Impala, Parquet, Solr, Sentry, Spark, Kafka

Page 50

The Rise of an Enterprise Data Hub

Applications

Page 51

2005-2007 – Hadoop

Diagram: HDFS, MapReduce

Page 52

2008 – HBase, ZooKeeper, Mahout

Diagram: HDFS, MapReduce, HBase, ZooKeeper, Mahout

Page 53

2009 – Hive, Pig

Diagram: HDFS, MapReduce, HBase, ZooKeeper, Mahout, Hive, Pig

Page 54

2010 – Flume, Sqoop, Avro

Diagram: HDFS, MapReduce, HBase, ZooKeeper, Mahout, Hive, Pig, Flume, DB, Avro

Page 55

2011 – Oozie, Hue

Diagram: HDFS, MapReduce, HBase, ZooKeeper, Mahout, Hive, Pig, Flume, DB, Avro, Oozie, Hue

Page 56

2012 – YARN, Impala, Parquet

Diagram: HDFS, MapReduce, HBase, ZooKeeper, Mahout, Hive, Pig, Flume, DB, Avro, Oozie, Hue, Parquet, Impala, YARN

Page 57

2013 – Solr, Sentry

Diagram: HDFS, MapReduce, HBase, ZooKeeper, Mahout, Hive, Pig, Flume, DB, Avro, Oozie, Hue, Parquet, Impala, Solr, YARN, Sentry

Page 58

2014 – Spark, Kafka

Diagram: HDFS, MapReduce, HBase, ZooKeeper, Mahout, Hive, Pig, Flume, DB, Avro, Oozie, Hue, Parquet, Impala, Solr, YARN, Sentry, Spark, Kafka

Page 59

The Hadoop Ecosystem – Explained!

• Interactive SQL
• Distributed File System (Scalable Storage)
• Event-based data ingest
• Batch Processing
• Key-Value Store
• SQL Query
• Proc. Oriented Query
• Machine Learning
• Process Mgmt
• Workflow Mgmt
• GUI
• Resource Management and Scheduling
• Free-Text Search
• Real Time Processing
• Access Control
• DB

Page 60

Common Use Cases

• Threat detection

• Active archive / accessible global knowledge base

• Data accuracy

• Streamlined cross-data type aggregation

• Richer customer profiling / ecommerce experience

• Interactive market segmenting / customer identification

• Expedited data modeling

• ….

Page 61

The Right Tool For the Right Task

Tool   | Workload    | Use Case                                                   | Result Ordering
Hive   | Batch       | SQL, Analytics & Joins                                     | Structured
Pig    | Batch       | Proc. Oriented SQL, Analytics & Joins                      | Structured
Impala | Interactive | SQL, Analytics & Joins                                     | Structured
Solr   | Interactive | Fuzzy, Phonetic, Polygon, GEO-spatial                      | Relevance-based
HBase  | Real Time   | Random key-lookups over sparsely populated columnar data   | Scan-order
Spark  | NRT         | Advanced analytics & ML                                    | Sorted

Page 62

When to use what?

• Real Time Query (e.g. Impala)
  • I want to do BI reports or interactive analytical aggregations but not wait hours for the response

• Batch Query (e.g. Pig, Hive)
  • I have nightly batch query jobs as part of a workflow

• Real Time Search (e.g. SolrCloud)
  • I have unstructured data I want to run free-text search over
  • My SQL queries are getting more and more complex as they need to contain 15+ “like” conditions

• Real time key lookups (e.g. HBase)
  • I want random access to sparsely populated table-like data
  • I want to compare user profiles or behavior in real time

Page 63

When to use what?

• Spark
  • I want to implement analytics algorithms over my data, and my data sets fit into memory
  • I have real time streaming data I want to analyze in real time

• MapReduce
  • I want to do fail-safe large ETL processing workloads
  • My data does not fit into memory and I want to batch process it with my custom logic – no real time needs
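
The "when to use what" guidance can be read as a lookup from need to tool category. The sketch below encodes it that way. The need labels (`interactive_sql` and so on) are shorthand invented for this example, and the mapping is a simplification of the bullets rather than a rule from the talk.

```python
# Each rule pairs a need (hypothetical shorthand label) with the tool
# category the slides suggest for it.
RULES = [
    ("interactive_sql", "Real Time Query (e.g. Impala)"),
    ("nightly_batch_sql", "Batch Query (e.g. Pig, Hive)"),
    ("free_text_search", "Real Time Search (e.g. SolrCloud)"),
    ("random_key_lookup", "Real time key lookups (e.g. HBase)"),
    ("in_memory_analytics", "Spark"),
    ("streaming_analysis", "Spark"),
    ("large_fail_safe_etl", "MapReduce"),
]


def suggest(needs):
    """Return the suggested tool categories for a set of needs, without duplicates."""
    suggestions = []
    for need, tool in RULES:
        if need in needs and tool not in suggestions:
            suggestions.append(tool)
    return suggestions


print(suggest({"free_text_search", "large_fail_safe_etl"}))
print(suggest({"in_memory_analytics", "streaming_analysis"}))
```

A workload with several needs usually maps to several tools over the same shared storage, which is the point of the data hub approach described earlier in the deck.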

Page 64