A World of Data
“Thingsternet”
Compete by asking bigger questions
Living online
Big Data
“Gizillions” of mobile transactions
[Slide graphics: costs climbing from $ to $$$…, open questions (???), SLA pressure]
Yaaaay – Hadoop to Save the Daaaay!!
• But it’s not always easy to tame an elephant…
Introducing “DataCo”
[Architecture diagram: CUSTOMERS → WEB CLIENT → WEB SHOP BACKEND → WEB SHOP DATABASE, holding ~100GB of product and customer transaction data]
“We don’t really have a big data problem…”
> 6 months?
Introducing “DataCo”
[Architecture diagram: CUSTOMERS → WEB CLIENT → WEB SHOP BACKEND → WEB SHOP DATABASE, now joined by mobile app data, web app click-stream data, and IT/Ops and InfoSec data alongside the product and customer transaction data]
Active Archive / Self-Serve Ad-hoc BI
• Top sold products last 6, 12, and 18 months?
[Diagram: SQL via Hive and Impala over data in HDFS]
Using Sqoop to Ingest Data from MySQL
• Sqoop is a bi-directional structured data ingest tool
• Simple UI in Hue; more commonly used from the shell
$ sqoop import-all-tables -m 12 --connect \
    jdbc:mysql://my.sql.host:3306/retail_db --username=dataco_dba \
    --password=yow!2014 --compression-codec=snappy --as-avrodatafile \
    --warehouse-dir=/user/hive/warehouse

$ sqoop import -m 12 --connect jdbc:mysql://my.sql.host:3306/retail_db \
    --username=dataco_dba --password=yow!2014 \
    --table my_cool_table --hive-import --as-parquetfile
Create Tables in Hive
• Hive is a batch query tool, but also the keeper of table structures
• Remember: structure is stored _separately_ from the data
hive> CREATE EXTERNAL TABLE products
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    > STORED AS
    >   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    >   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    > LOCATION 'hdfs:///user/hive/warehouse/products'
    > TBLPROPERTIES ('avro.schema.url'='hdfs://namenode_dataco/user/examples/products.avsc');
Use Impala via Hue to Query
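• A sketch of what such a top-sold-products query might look like in Impala (table and column names such as order_items and order_item_product_id are assumptions based on the retail_db schema imported above, not taken from the deck):

-- Hypothetical example: count how often each product was sold
SELECT p.product_id, p.product_name, COUNT(*) AS times_sold
FROM order_items oi
JOIN products p ON oi.order_item_product_id = p.product_id
GROUP BY p.product_id, p.product_name
ORDER BY times_sold DESC
LIMIT 10;
-- To restrict to the last 6/12/18 months, join to an orders table and filter on the order date.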
Correlate Multi-type Data Sets
• Top viewed products last 6, 12, and 18 months?
[Diagram: SQL via Hive and Impala over data in HDFS, with Flume feeding in new data]
Ingest Data Using Flume
• Pub/sub ingest framework
• Flexible multi-level (mini-transformation) pipeline
[Diagram: FLUME SOURCE (continuously generated events, e.g. syslog, tweets) → FLUME AGENT (optional logic) → FLUME SINK (another Flume agent, HDFS, HBase, Solr, or other destination)]
Create Hive Tables over Log Data
• New use case, new data
• Create new tables over semi-structured log data
CREATE EXTERNAL TABLE intermediate_access_logs (
  ip STRING, date STRING, method STRING, url STRING, http_version STRING,
  code1 STRING, code2 STRING, dash STRING, user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) - - \\[([^\\]]*)\\] \"([^\ ]*) ([^\ ]*) ([^\ ]*)\" (\\d*) (\\d*) \"([^\"]*)\" \"([^\"]*)\"",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s")
LOCATION '/user/hive/warehouse/original_access_logs';

CREATE EXTERNAL TABLE tokenized_access_logs (
  ip STRING, date STRING, method STRING, url STRING, http_version STRING,
  code1 STRING, code2 STRING, dash STRING, user_agent STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/tokenized_access_logs';

ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;

INSERT OVERWRITE TABLE tokenized_access_logs
SELECT * FROM intermediate_access_logs;

exit;
Use Impala and Hue to Query
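• A sketch of the “top viewed products” query over the log tables created above (assuming product page views can be identified by their URL; the LIKE pattern is illustrative):

-- Count page views per URL in the tokenized click-stream logs
SELECT url, COUNT(*) AS views
FROM tokenized_access_logs
WHERE url LIKE '%/product/%'   -- hypothetical URL pattern for product pages
GROUP BY url
ORDER BY views DESC
LIMIT 10;
-- Comparing this list against the top-sold products can expose items that are viewed often but rarely sold.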
Missing!!!285716349
Multi-Use-Case Data Hub
• Why is sales dropping over the last 3 days?
[Diagram: free-text search queries via Solr over data in HDFS, with Flume for ingest]
Create your Index
• Create an empty Solr index configuration directory
• Edit the Solr Schema file to have the fields you want to search over
$ solrctl --zk <ALL YOUR ZK IPs>/solr instancedir --generate live_logs_dir
…
<field name="_version_" type="long" indexed="true" stored="true" multiValued="false" />
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="ip" type="text_general" indexed="true" stored="true"/>
<field name="request_date" type="date" indexed="true" stored="true"/>
…
Create your Index cont.
• Upload your configuration for a collection to ZooKeeper
• Tell Solr to start serving up a collection and start indexing data for it
$ solrctl --zk <ALL YOUR ZK IPs>/solr instancedir --create live_logs ./live_logs_dir
$ solrctl --zk <ALL YOUR ZK IPs>/solr collection --create live_logs -s 4
Flume and Morphline Pipeline
Flume with Morphlines Configured
• Configure Flume to use your Morphlines and post parsed data to Solr
….
# Describe solrSink
agent1.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent1.sinks.solrSink.channel = memoryChannel
agent1.sinks.solrSink.batchSize = 1000
agent1.sinks.solrSink.batchDurationMillis = 1000
agent1.sinks.solrSink.morphlineFile = /opt/examples/flume/conf/morphline.conf
agent1.sinks.solrSink.morphlineId = morphline
agent1.sinks.solrSink.threadCount = 1
…..
Dynamic Search UI in Hue
Shared Storage!!
How Do We Improve Healthcare?
Challenges
• Capacity for only 3 days of monitoring data
• No ability to correlate large research data sets
• No ability to do ad-hoc studies of environmental impact
Solution
• 50GB of monitor data per week
• 2TB capacity
• Sqoop, Solr, Impala, HDFS
Benefits
• Ad-hoc and faster insight
• Reduced asthma-related ICU visits
• Total license fees < 3 processor licenses for EDW
How Do We Feed The World?
Global Warming Changes Conditions
How do we improve quality and resistance of crops and seeds in a variety of global and rapidly changing environments?
Challenges
• Time to market for each new product: 5-10 years
• 1,000+ scientists working in silos
• Data processing bottlenecks slow development
Solution
• PB-scale
• HBase, HDFS, Solr, MapReduce, Sqoop, Impala, …
Benefits
• Streamlined processes
• Time to results reduced from years to months!!!
Challenges
• 100-200 B events/month
• Real-time multi-type event correlation is complex
• No way to do ad-hoc game analytics
Solution
• ~20 nodes
• 256GB RAM servers
• Flume, Solr, Impala, HDFS
Benefits
• Ad-hoc insight on feature trends
• Significant TTR reduction
• ROI realized in the 1st week
Learn More?
• Stop by the Cloudera booth today!
• Play on your own: cloudera.com/live
• Get training: http://cloudera.com/content/cloudera/en/training.html
• Join the Community: [email protected]
• Connect with me: @EvaAndreasson
Hope You Enjoyed This Talk!
Don’t forget to VOTE!!!
Bonus Track…
My Advice for the Road…
Try Something Simple First…
Decide what to Cook!
Collect All Ingredients
Use the Right Tool for the Right Task
Prepare All Ingredients
Don’t Forget the Importance of Visualization!
Challenges
• Tons of information locked away in medical records & scientific studies
• Different sources & systems can’t “talk” to each other
Solution
• Integration & storage of multi-structured experimental data
• Data access & exploration via Impala, R, HBase, Solr, Hive
Benefits
• Faster, cheaper genome sequencing
• Searchable index of variant call data for biologists to explore
Using Sqoop to Ingest Data from MySQL
• View your imported “tables”
• View all Avro files constituting a table
$ hadoop fs -ls /user/hive/warehouse/
$ hadoop fs -ls /user/hive/warehouse/mytablename/
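• A quick sanity check from the Hive or Impala shell (the products table comes from the Sqoop import above; row counts will vary):

hive> SHOW TABLES;
hive> SELECT COUNT(*) FROM products;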
Hadoop - A New Approach to Data Management
Schema on Read
Distributed Storage
Distributed Processing
Active Archive
Cost-Efficient Offload
Flexible Analytics
Hadoop: Storage & Batch Processing
The Birth of the Data Lake
A Rapidly Growing Ecosystem
2006: Core Hadoop
2007: Core Hadoop
2008: Core Hadoop, HBase, ZooKeeper, Mahout
2009: adds Pig, Hive
2010: adds Flume, Avro, Sqoop
2011: adds Bigtop, Oozie
2012: adds Hue, Impala, Parquet
2013: adds Solr, Sentry
2014: adds Spark, Kafka
The Rise of an Enterprise Data Hub
[Stack diagrams, one per year, showing applications over HDFS as the platform grows:]
2005-2007 – Hadoop: HDFS, MapReduce
2008 – HBase, ZooKeeper, Mahout
2009 – Hive, Pig
2010 – Flume, Sqoop, Avro
2011 – Oozie, Hue
2012 – YARN, Impala, Parquet
2013 – Solr, Sentry
2014 – Spark, Kafka
The Hadoop Ecosystem – Explained!
[Diagram mapping each component to its role: distributed file system (scalable storage), batch processing, interactive SQL, SQL, procedure-oriented query, key-value store, machine learning, event-based data ingest, DB ingest, process/workflow management, GUI, resource management and scheduling, free-text search, real-time processing, access control]
Common Use Cases
• Threat detection
• Active archive / accessible global knowledge base
• Data accuracy
• Streamlined cross-data type aggregation
• Richer customer profiling / ecommerce experience
• Interactive market segmenting / customer identification
• Expedited data modeling
• ….
The Right Tool For the Right Task
Tool   | Workload    | Use Case                                                 | Result Ordering
Hive   | Batch       | SQL, Analytics & Joins                                   | Structured
Pig    | Batch       | Proc.-Oriented SQL, Analytics & Joins                    | Structured
Impala | Interactive | SQL, Analytics & Joins                                   | Structured
Solr   | Interactive | Fuzzy, Phonetic, Polygon, Geo-spatial                    | Relevance-based
HBase  | Real Time   | Random key lookups over sparsely populated columnar data | Scan-order
Spark  | NRT         | Advanced analytics & ML                                  | Sorted
When to use what?
• Real Time Query (e.g. Impala)
  • I want to do BI reports or interactive analytical aggregations, but not wait hours for the response
• Batch Query (e.g. Pig, Hive)
  • I have nightly batch query jobs as part of a workflow
• Real Time Search (e.g. SolrCloud)
  • I have unstructured data I want to free-text search over
  • My SQL queries are getting more and more complex as they need to contain 15+ “like” conditions
• Real time key lookups (e.g. HBase)
  • I want random access to sparsely populated table-like data
  • I want to compare user profiles or behavior in real time
When to use what?
• Spark
  • I want to implement analytics algorithms over my data, and my data sets fit into memory
  • I have real-time streaming data I want to analyze in real time
• MapReduce
  • I want to do fail-safe, large ETL processing workloads
  • My data does not fit into memory and I want to batch process it with my custom logic – no real-time needs