Chris Huang Sr. Manager, Core Tech 2014/9/24 Real-time Big Data Applications with Hadoop Ecosystem 06/07/2022 1 Confidential | Copyright 2013 TrendMicro Inc.


Chris Huang

Sr. Manager, Core Tech

2014/9/24

Real-time Big Data Applications with Hadoop Ecosystem


About – Chris Huang

• Chris Huang
  – SPN Solution Developer Manager
  – SPN Hadoop Architect
  – Hadoop.TW Active Member

• Believes Cloud, Service, Software, Big Data are critical factors for Taiwan’s future economic development


Conference Talks


Hot Keywords in Hadoop Community

Real-time
• Impala, Stinger

Computing Framework
• YARN, Tez

In Memory
• Spark

Streaming
• Kafka, Storm


Big Data Applications

• Operational
  – Real-time
  – Near Real-time

• Analytical
  – Batch
  – Interactive
  – Near Real-time
  – Streaming


An Online Music Example

• Operational
  – Recent N login times (listen duration)
  – Recent N albums/artists the user browses
  – Recent N keywords the user searches
  – Recent N songs/albums/artists the user listens to (buys)
  – Recent N months of the user’s purchase amount

• Analytical
  – Recommend the right song/album/artist to the right user at the right time
  – Correlate similar songs/albums/artists (CDDB or user behavior)
  – Know seasonal music trends (X’mas, Valentine’s Day, New Year)
  – Know regional music trends
  – Calculate regional leaderboard
  – Connect user with social network
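
The operational items above are all "recent N" lookups per user. A minimal Python sketch of that access pattern, assuming an in-memory structure stands in for the real serving store (the `RecentN` class and its method names are hypothetical, not from the talk):

```python
from collections import defaultdict, deque


class RecentN:
    """Keep only the N most recent events per user (in-memory stand-in)."""

    def __init__(self, n):
        self.n = n
        # deque(maxlen=n) drops the oldest item automatically once it is full
        self._events = defaultdict(lambda: deque(maxlen=n))

    def record(self, user, event):
        self._events[user].append(event)

    def recent(self, user):
        # Newest first, as an operational view would usually display it
        return list(reversed(self._events[user]))


searches = RecentN(3)
for keyword in ["jazz", "blues", "rock", "pop"]:
    searches.record("alice", keyword)
recent_for_alice = searches.recent("alice")
```

The bounded deque keeps each write cheap and each read small, which is what makes this kind of query suitable for real-time serving.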


An Online Banking Example

• Operational
  – Recent N login times / frequency
  – Recent N items purchased by credit card
  – Recent N months of balance amount
  – Recent N transfer in/out amounts
  – Recent N investment events
  – Recent N months of investment balance

• Analytical
  – Know more about the user’s profile (assets/debts/shopping habits/family)
  – Recommend the right product to the right user (investment, credit card, loan)
  – Know seasonal trends (tax month/year end/back to school/X’mas)
  – Know regional investment product leaderboard (by different age)
  – Recommend products by similar user profile


Building Your Big Data Applications

• Think about your data
  – Entity or Event?

• Think about your use case
  – Operational or Analytic?

• Think about your data user
  – External or Internal?


Think About Your Data

Slides from “Apache HBase Application Archetypes”, HBaseCon 2014

You can replace HBase with similar alternatives, but the concepts are the same


Think About Your Use Case


Operational Use Case 1

[Architecture diagram. Labels: MR / Spark, HDFS; annotations: batch, real-time]


HBase: No Secondary Index (yet)

• Search index building (row key)
• Use Solr to make text data searchable
  – Snapshot & clone table
  – Index column qualifier text
  – Record row key in Solr document
  – Use HBase client to fetch data
• Usually less than a few seconds
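
The lookup pattern above (index the text, record the row key in each index document, then fetch the row by key) can be sketched as follows. This is a minimal in-memory illustration: the dictionaries stand in for the HBase table and the Solr index, and every name here is hypothetical rather than a real client call.

```python
from collections import defaultdict

# Stand-in for an HBase table: row key -> {column qualifier: value}
table = {
    "user1#20140720": {"title": "hadoop real time talk", "size": "12"},
    "user2#20140721": {"title": "hbase secondary index", "size": "7"},
    "user3#20140722": {"title": "real time search with solr", "size": "9"},
}


def build_index(rows, qualifier):
    """Index the text of one column qualifier; each posting records the row key."""
    index = defaultdict(set)
    for row_key, columns in rows.items():
        for term in columns.get(qualifier, "").lower().split():
            index[term].add(row_key)
    return index


def search(index, rows, query):
    """Ask the index for matching row keys, then fetch the full rows by key."""
    terms = query.lower().split()
    if not terms:
        return {}
    # A row matches only if it contains every query term
    row_keys = set.intersection(*(index.get(t, set()) for t in terms))
    return {key: rows[key] for key in sorted(row_keys)}


title_index = build_index(table, "title")
hits = search(title_index, table, "real time")
```

The index only needs to hold the searchable text and the row key; everything else stays in the table and is fetched with a point lookup.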


Operational Use Case 2 (SPN)

[Architecture diagram. Labels: Feed App, Flume, HDFS, MapReduce, Pig, Solr Client (Index Query), Get, Scan; annotations: low latency, high throughput, real-time, batch]


Operational Use Case 3 (Mixed)

[Architecture diagram. Labels: Feed App, Flume, HDFS, MapReduce, Pig, MR / Spark, Bulk Import, HBase Replication, Solr, Solr Client (Index Query), Get, Scan, HBase Client (Gets, Short scan), HBase Client (Put, Incr, Append); annotations: low latency, high throughput, real-time, batch]


HBase or HDFS?

• Depends on what your data is
  – Entity or Event?

• Depends on your workload
  – Low latency?
  – Random read/write?
  – Short/full scan?
  – Sequential read/write?
  – Update?
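
One way to read the workload questions above is as a rough checklist. The sketch below encodes a common rule of thumb (HBase for low-latency random access, short scans, and updates; plain HDFS files for sequential reads/writes and full scans). The function, its parameter names, and the simple vote counting are assumptions for illustration, not a rule stated in the talk.

```python
def suggest_store(low_latency=False, random_access=False, short_scan=False,
                  updates=False, sequential_io=False, full_scan=False):
    """Return a rough storage suggestion from workload traits (heuristic only)."""
    hbase_votes = sum([low_latency, random_access, short_scan, updates])
    hdfs_votes = sum([sequential_io, full_scan])
    if hbase_votes and hdfs_votes:
        # Both kinds of access are needed, as in the mixed use case
        return "mixed: HBase for serving, HDFS for batch"
    if hbase_votes:
        return "HBase"
    if hdfs_votes:
        return "HDFS"
    return "undecided"


profile_lookup = suggest_store(low_latency=True, random_access=True, updates=True)
log_archive = suggest_store(sequential_io=True, full_scan=True)
```

Real decisions also depend on data volume, schema, and operational cost, so treat the result as a starting point for discussion only.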


Wait… Batch for Operational?


Yes, Why not?


Operational: Batch + Real-time

• Bridge the gap between batch and now
• 80/20 rule
  – HDFS/MapReduce/Spark solves 80% easily
  – The remaining 20% takes 80% of the effort
• Go as close as possible, but don’t overdo it!


What is Real-time?

• Real-time is NOT always “faster than batch”
  – If you have really BIG DATA
• Most of the time, we want Timely Information
• Minimize the gap between scheduled batch jobs

[Timeline: three consecutive hourly jobs. How to get a result at 1:33?]
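
One common way to answer the 1:33 question is to add a small real-time delta to the last finished batch result. A minimal sketch, assuming hourly batch totals and a list of timestamped events; the data, the `count_at` function, and the minute-of-day timestamps are made up for illustration:

```python
# Cumulative totals produced by the hourly batch jobs, keyed by the minute
# of day they cover up to. Illustrative numbers only.
batch_totals = {0: 0, 60: 1200}  # the job covering 00:00-01:00 counted 1200 events

# Events that arrived after the last batch run: (minute of day, count)
recent_events = [(61, 5), (75, 12), (93, 7), (95, 4)]


def count_at(minute, batch_totals, recent_events):
    """Combine the latest batch total with events seen since that batch."""
    last_batch = max(m for m in batch_totals if m <= minute)
    delta = sum(c for m, c in recent_events if last_batch < m <= minute)
    return batch_totals[last_batch] + delta


# 1:33 is minute 93 of the day
result_at_1_33 = count_at(93, batch_totals, recent_events)
```

The batch job still does the heavy lifting; the real-time part only has to cover the minutes since the last run, which keeps it small.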


Analytical Use Case

• Batch/streaming compute
• Near real-time/interactive delivery


Near Real-time Interactive


Recommendation System


The Online Music Example

• Operational
  – Recent N login times (listen duration)
  – Recent N albums/artists the user browses
  – Recent N keywords the user searches
  – Recent N songs/albums/artists the user listens to (buys)
  – Recent N months of the user’s purchase amount

• Analytical
  – Recommend the right song/album/artist to the right user at the right time
  – Correlate similar songs/albums/artists (CDDB or user behavior)
  – Know seasonal music trends (X’mas, Valentine’s Day, New Year)
  – Know regional music trends
  – Calculate regional leaderboard
  – Connect user with social network

Do you really want an analytical result (recommendation) EVERY 50 milliseconds?


Analytical Use Case 1

[Architecture diagram. Labels: HDFS, Solr Client (Index Query); annotations: batch, real-time]


Analytical Use Case 2 (SPN)

“A Graph Service for Global Web Entities Traversal and Reputation Evaluation Based on HBase”, HBaseCon 2014


You Need an Interactive Analytic Engine


Stinger


Impala Architecture

[Architecture diagram. Master nodes: NN, JT, HM (Active) and NN, JT, HM (Standby). Four worker nodes, each running Datanode, Tasktracker, Regionserver, and impala daemon. Shared services: State store, Catalog, Hive Metastore.]


Apache Pig (MapReduce)

• Do hourly count on Akamai log

    A = load 'date://2014/07/20/00' using AkamaiRCLoader();
    B = foreach (group A all) generate COUNT_STAR(A);
    dump B;

  – Output: … 0% complete, 100% complete, (194202349)
  – Elapsed: 4mins, 28sec

Too Slow for Interactive


Using Impala

• No memory cache

    > select count(*) from akafast where day=20140720 and hour=0
    194202349

• With OS cache

• Do a further query:

    select count(*) from akafast where day=20140720 and hour=00 and c='US';
    41118019

Query times (in slide order): 96.46s, 9.07s, 6.57s

Makes Sense Now
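
The timings can be compared directly. A small worked calculation, assuming 96.46s is the run with no memory cache and 9.07s is the same query with OS cache (that pairing is inferred from the slide order), against the 4 min 28 s Pig run of the same count. The 6.57s follow-up is a different query, so it is left out of the comparison.

```python
pig_seconds = 4 * 60 + 28  # 4 mins, 28 sec for the Pig (MapReduce) count

# Same count(*) query in Impala; the label-to-timing pairing is an assumption
impala_seconds = {
    "no memory cache": 96.46,
    "with OS cache": 9.07,
}

# Speedup of each Impala run relative to the Pig job
speedups = {name: pig_seconds / secs for name, secs in impala_seconds.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster than Pig")
```

Under that assumption the cold run is roughly 2.8 times faster than Pig and the cached run roughly 29.5 times faster, which is the difference between waiting minutes and waiting seconds for an interactive query.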


Don’t Connect Analytic Engine with Operational Use Case


Analytical Use Case 3

[Architecture diagram. Labels: Feed App, Flume, HDFS, Impala/Stinger, MR / Spark, Bulk Import, HBase Client (Gets, Short scan), HBase Client (Put, Incr, Append), Customer, Analyst; annotations: low latency, high throughput, real-time, interactive, batch]


Streaming Use Cases


TME – Trend Message Exchange

http://trendmicro.github.io/tme/


Streaming Operational Use Case

[Architecture diagram. Labels: Kafka/Storm, HDFS, HBase Client (Put, Incr, Append), HBase Client (Gets, Short scan), Solr Client (Index Query); annotations: low latency, high throughput, real-time, streaming]
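
The write side of this use case applies each event as it arrives, using the three write types named in the diagram (Put, Incr, Append), so that reads always see current values. A minimal sketch, assuming a plain Python list stands in for the Kafka/Storm stream and a hypothetical `TinyStore` class imitates those write types in memory; neither is a real client API.

```python
from collections import defaultdict


class TinyStore:
    """In-memory stand-in for a table with put / incr / append style writes."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row, column, value):
        self.rows[row][column] = value  # overwrite the cell

    def incr(self, row, column, amount=1):
        self.rows[row][column] = self.rows[row].get(column, 0) + amount
        return self.rows[row][column]

    def append(self, row, column, value):
        self.rows[row][column] = self.rows[row].get(column, "") + value

    def get(self, row):
        # Return a copy so callers cannot modify the stored row
        return dict(self.rows.get(row, {}))


def process(stream, store):
    """Apply every event immediately instead of waiting for a batch job."""
    for user, song in stream:
        store.put(user, "last_song", song)
        store.incr(user, "play_count")
        store.append(user, "history", song + ";")


store = TinyStore()
process([("alice", "song-a"), ("bob", "song-b"), ("alice", "song-c")], store)
alice = store.get("alice")
```

Because every event updates the row in place, a get for one user stays a cheap point lookup no matter how many events have been processed.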


Streaming Analytical Use Case

[Architecture diagram. Labels: Feed App, Flume, Kafka/Storm, HDFS, Impala/Stinger, HBase Client (Put, Incr, Append), HBase Client (Gets, Short scan), Customer, Analyst; annotations: low latency, high throughput, real-time, interactive, streaming]


Think About Your Data User


Data User

• External
  – Customer
  – Partner

• Internal
  – Business report user
  – Data researcher
  – Data analyst
  – Algorithm developer

• They want instant response
• They don’t know (and don’t care) if the recommendation was computed 1 hour ago or 50 ms ago

• Interactive or near real-time is enough
• Sometimes they can even wait for batch (make data small and analyze)
• Of course, everyone wants results faster, but it depends on your investment $$


No Silver Bullet for Real-time or Big Data Applications


Q&A
