
Page 1: How to collect Big Data into Hadoop

Sadayuki Furuhashi
fluentd.org

How to collect Big Data into Hadoop

Big Data processing to collect Big Data

Page 2: How to collect Big Data into Hadoop

Self-introduction

> Sadayuki Furuhashi

> Treasure Data, Inc.
Founder & Software Architect

> Open source projects
MessagePack - efficient serializer (original author)

Fluentd - event collector (original author)

Page 4: How to collect Big Data into Hadoop

Today’s topic

Page 5: How to collect Big Data into Hadoop

Big Data → Report & Monitor

Page 6: How to collect Big Data into Hadoop

Big Data → Collect → Store → Process → Visualize → Report & Monitor

Page 7: How to collect Big Data into Hadoop

Collect → Store → Process → Visualize

Store & Process: Cloudera, Hortonworks, MapR
Visualize: Tableau, Excel, R

easier & shorter time

Page 8: How to collect Big Data into Hadoop

Collect → Store → Process → Visualize

Store & Process: Cloudera, Hortonworks, MapR
Visualize: Tableau, Excel, R

These tools make Store, Process and Visualize easier and shorter. How do we shorten the Collect step?

Page 9: How to collect Big Data into Hadoop

Problems to collect data

Page 10: How to collect Big Data into Hadoop

Poor man’s data collection

1. Copy files from servers using rsync

2. Create a RegExp to parse the files

3. Parse the files and generate a 10GB CSV file

4. Put it into HDFS

Page 11: How to collect Big Data into Hadoop

Problems to collect “big data”

> Includes broken values
> needs error handling & retrying

> Time-series data are changing and unclear
> parse logs before storing

> Takes time to read/write
> tools have to be optimized and parallelized

> Takes time for trial & error
> causes network traffic spikes

Page 12: How to collect Big Data into Hadoop

Problem of poor man’s data collection

> Wastes time to implement error handling
> Wastes time to maintain a parser
> Wastes time to debug the tool
> Not reliable
> Not efficient

Page 13: How to collect Big Data into Hadoop

Basic theories to collect big data

Page 14: How to collect Big Data into Hadoop

Divide & Conquer

(diagram: split the data into small chunks so an error affects only a single chunk)

Page 15: How to collect Big Data into Hadoop

Divide & Conquer & Retry

(diagram: chunks that hit an error are retried individually)

Page 16: How to collect Big Data into Hadoop

Streaming

(diagram: stream data continuously in small chunks; don't handle big files on the collection side, do the heavy work after the data has been stored)

Page 17: How to collect Big Data into Hadoop

Apache Flume and Fluentd

Page 18: How to collect Big Data into Hadoop

Apache Flume

Page 19: How to collect Big Data into Hadoop

(diagram: Agents on each server collect access logs, app logs, system logs, etc. and forward them to Collectors)

Page 20: How to collect Big Data into Hadoop

Apache Flume - network topology

(diagram: Flume OG uses Agents, Collectors and a central Master, with separate send and ack paths; Flume NG drops the Master and Agents send/ack directly with Collectors)

Page 21: How to collect Big Data into Hadoop

Apache Flume - pipeline

Flume OG: Source → Sink (each a plugin point)
Flume NG: Source → Channel → Sink

Page 22: How to collect Big Data into Hadoop

Apache Flume - configuration

(diagram: Flume NG topology of Agents and Collectors; a Master node manages all configuration (optional))

Page 23: How to collect Big Data into Hadoop

Apache Flume - configuration

# source
host1.sources = avro-source1
host1.sources.avro-source1.type = avro
host1.sources.avro-source1.bind = 0.0.0.0
host1.sources.avro-source1.port = 41414
host1.sources.avro-source1.channels = ch1

# channel
host1.channels = ch_avro_log
host1.channels.ch_avro_log.type = memory

# sink
host1.sinks = log-sink1
host1.sinks.log-sink1.type = logger
host1.sinks.log-sink1.channel = ch1

Page 24: How to collect Big Data into Hadoop

Fluentd

Page 25: How to collect Big Data into Hadoop

Fluentd - network topology

(diagram: fluentd nodes forward events to downstream fluentd nodes with send/ack; compare Flume NG, where Agents send/ack to Collectors)

Page 26: How to collect Big Data into Hadoop

Fluentd - pipeline

Fluentd: Input → Buffer → Output (each a plugin point)
Flume NG: Source → Channel → Sink

Page 27: How to collect Big Data into Hadoop

Fluentd - configuration

(diagram: fluentd nodes only, no central node)

Use chef, puppet, etc. for configuration (they do things better)

No central node - keep things simple

Page 28: How to collect Big Data into Hadoop

Fluentd - configuration

<source>
  type forward
  port 24224
</source>

<match **>
  type file
  path /var/log/logs
</match>

Page 29: How to collect Big Data into Hadoop

Fluentd - configuration

<source>
  type forward
  port 24224
</source>

<match **>
  type file
  path /var/log/logs
</match>

# source

host1.sources = avro-source1

host1.sources.avro-source1.type = avro

host1.sources.avro-source1.bind = 0.0.0.0

host1.sources.avro-source1.port = 41414

host1.sources.avro-source1.channels = ch1

# channel

host1.channels = ch_avro_log

host1.channels.ch_avro_log.type = memory

# sink

host1.sinks = log-sink1

host1.sinks.log-sink1.type = logger

host1.sinks.log-sink1.channel = ch1

Page 30: How to collect Big Data into Hadoop

Fluentd - Users

Page 31: How to collect Big Data into Hadoop

Fluentd - plugin distribution platform

$ fluent-gem search -rd fluent-plugin

$ fluent-gem install fluent-plugin-mongo

Page 32: How to collect Big Data into Hadoop

Fluentd - plugin distribution platform

$ fluent-gem search -rd fluent-plugin

$ fluent-gem install fluent-plugin-mongo

94 plugins!

Page 33: How to collect Big Data into Hadoop

Concept of Fluentd

Customization is essential
> small core + many plugins

Fluentd core helps to implement plugins
> common features are already implemented

Page 34: How to collect Big Data into Hadoop

Fluentd core: Divide & Conquer, Retrying, Parallelize, Error handling, Message routing

Plugins: read / receive data, write / send data

Page 35: How to collect Big Data into Hadoop

Fluentd plugins

Page 36: How to collect Big Data into Hadoop

in_tail

(diagram: apache writes access.log; fluentd tails it)

✓ read a log file
✓ custom regexp
✓ custom parser in Ruby
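A minimal in_tail sketch in Fluentd's config syntax (the path, pos_file, tag and the built-in apache format below are illustrative assumptions, not from the slides):

<source>
  type tail
  path /var/log/apache2/access.log
  # pos_file remembers the read position across restarts
  pos_file /var/log/fluentd/access.log.pos
  tag apache.access
  format apache
</source>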

Page 37: How to collect Big Data into Hadoop

out_mongo

(diagram: apache writes access.log; fluentd reads it with in_tail, buffers events, and writes them to MongoDB)

Page 38: How to collect Big Data into Hadoop

out_mongo

(diagram: apache → in_tail → buffer → MongoDB)

✓ retry automatically
✓ exponential retry wait
✓ persistent on a file
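The retry and buffering behavior above comes from Fluentd's buffered-output core, so it is tuned in the <match> section. A sketch, assuming fluent-plugin-mongo's connection options and the standard buffer parameters (all values illustrative):

<match apache.access>
  type mongo
  host 127.0.0.1
  port 27017
  database apache
  collection access

  # persist buffer chunks on disk and retry with exponential backoff
  buffer_type file
  buffer_path /var/log/fluentd/buffer/mongo
  flush_interval 10s
  retry_wait 1s
  retry_limit 17
</match>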

Page 39: How to collect Big Data into Hadoop

out_s3

(diagram: apache → in_tail → buffer → Amazon S3)

✓ retry automatically
✓ exponential retry wait
✓ persistent on a file
✓ slice files based on time

2013-01-01/01/access.log.gz
2013-01-01/02/access.log.gz
2013-01-01/03/access.log.gz
...
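A hedged out_s3 sketch; the bucket, paths and credentials are placeholders, and the parameter names follow fluent-plugin-s3's commonly documented options:

<match apache.access>
  type s3
  aws_key_id YOUR_AWS_KEY_ID
  aws_sec_key YOUR_AWS_SECRET_KEY
  s3_bucket my-log-archive
  path logs/
  buffer_path /var/log/fluentd/buffer/s3

  # slice uploaded files by hour, e.g. logs/2013-01-01/01/...
  time_slice_format %Y-%m-%d/%H
  time_slice_wait 10m
</match>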

Page 40: How to collect Big Data into Hadoop

out_hdfs

(diagram: apache → in_tail → buffer → HDFS)

✓ retry automatically
✓ exponential retry wait
✓ persistent on a file
✓ slice files based on time
✓ custom text formatter

2013-01-01/01/access.log.gz
2013-01-01/02/access.log.gz
2013-01-01/03/access.log.gz
...
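The HDFS output here is typically fluent-plugin-webhdfs; a minimal sketch assuming a WebHDFS-enabled NameNode (host, port and path are placeholders):

<match apache.access>
  type webhdfs
  host namenode.example.com
  port 50070
  # strftime placeholders slice the output path by time
  path /log/access/%Y%m%d_%H/access.log

  buffer_type file
  buffer_path /var/log/fluentd/buffer/webhdfs
  flush_interval 10s
</match>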

Page 41: How to collect Big Data into Hadoop

out_hdfs

(diagram: apache → in_tail → buffer → multiple downstream fluentd nodes → HDFS)

✓ retry automatically
✓ exponential retry wait
✓ persistent on a file
✓ slice files based on time
✓ automatic fail-over
✓ load balancing

2013-01-01/01/access.log.gz
2013-01-01/02/access.log.gz
2013-01-01/03/access.log.gz
...
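Fail-over and load balancing between fluentd nodes are handled by out_forward. A sketch with two active servers and a standby (hostnames and paths are placeholders):

<match apache.access>
  type forward

  # events are load-balanced across these servers;
  # if one fails, traffic shifts to the remaining ones
  <server>
    host fluentd1.example.com
    port 24224
  </server>
  <server>
    host fluentd2.example.com
    port 24224
  </server>

  # used only when the servers above are unavailable
  <server>
    host backup.example.com
    port 24224
    standby
  </server>

  # spill to local disk if nothing is reachable
  <secondary>
    type file
    path /var/log/fluentd/forward-failed
  </secondary>
</match>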

Page 42: How to collect Big Data into Hadoop

Fluentd examples

Page 43: How to collect Big Data into Hadoop

Fluentd at Treasure Data - REST API logs

(diagram: Rails apps on API servers log through fluent-logger-ruby into a local fluentd via in_forward; each fluentd forwards with out_forward to a fluentd on the watch server)

Page 44: How to collect Big Data into Hadoop

Fluentd at Treasure Data - backend logs

(diagram: Rails apps on API servers and Ruby apps on worker servers log through fluent-logger-ruby into local fluentd processes via in_forward; they forward with out_forward to a fluentd on the watch server)

Page 45: How to collect Big Data into Hadoop

Fluentd at Treasure Data - monitoring

(diagram: Rails apps on API servers and Ruby apps on worker servers log through fluent-logger-ruby + in_forward into local fluentd processes; a queue (PerfectQueue) sits between the API and worker tiers; the fluentd processes forward with out_forward to the watch server's fluentd, which also collects monitoring data from a script via in_exec)

Page 46: How to collect Big Data into Hadoop

Fluentd at Treasure Data - Hadoop logs

(diagram: the fluentd on the watch server runs a script via in_exec; the script calls the Hadoop JobTracker's thrift API)

✓ resource consumption statistics for each user
✓ capacity monitoring
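in_exec runs an external command periodically and turns its output into events. A sketch, assuming a hypothetical jobtracker_stats.rb script that prints tab-separated values (the script name and keys are invented for illustration):

<source>
  type exec
  # hypothetical script that calls the JobTracker thrift API and prints TSV
  command ruby /opt/td/jobtracker_stats.rb
  format tsv
  keys user,running_jobs,used_slots
  tag hadoop.jobtracker
  run_interval 1m
</source>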

Page 47: How to collect Big Data into Hadoop

Fluentd at Treasure Data - store & analyze

(diagram: the fluentd on the watch server sends metrics to Librato Metrics for realtime analysis via out_metricsense (✓ streaming aggregation), and stores data into Treasure Data for historical analysis via out_tdlog)
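out_tdlog (fluent-plugin-td) buffers events and uploads them to Treasure Data. A minimal sketch with a placeholder API key; the td.database.table tag convention selects the destination table:

<match td.*.*>
  type tdlog
  apikey YOUR_TD_API_KEY
  auto_create_table
  buffer_type file
  buffer_path /var/log/fluentd/buffer/td
</match>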

Page 48: How to collect Big Data into Hadoop
Page 49: How to collect Big Data into Hadoop
Page 50: How to collect Big Data into Hadoop

Plugin development

Page 51: How to collect Big Data into Hadoop

class SomeInput < Fluent::Input
  Fluent::Plugin.register_input('myin', self)

  config_param :tag, :string

  def start
    Thread.new {
      while true
        # emit an event (tag, time, record) into Fluentd
        time = Engine.now
        record = {"user"=>1, "size"=>1}
        Engine.emit(@tag, time, record)
      end
    }
  end

  def shutdown
    ...
  end
end

<source>
  type myin
  tag myapp.api.heartbeat
</source>

Page 52: How to collect Big Data into Hadoop

class SomeOutput < Fluent::BufferedOutput
  Fluent::Plugin.register_output('myout', self)

  config_param :myparam, :string

  # serialize each event into the buffer chunk
  def format(tag, time, record)
    [tag, time, record].to_json + "\n"
  end

  # flush a buffered chunk to the destination
  def write(chunk)
    puts chunk.read
  end
end

<match **>
  type myout
  myparam foobar
</match>

Page 53: How to collect Big Data into Hadoop

class MyTailInput < Fluent::TailInput
  Fluent::Plugin.register_input('mytail', self)

  def configure_parser(conf)
    ...
  end

  # parse one line of the tailed file into (time, record)
  def parse_line(line)
    array = line.split("\t")
    time = Engine.now
    record = {"user"=>array[0], "item"=>array[1]}
    return time, record
  end
end

<source>
  type mytail
</source>

Page 54: How to collect Big Data into Hadoop
Page 55: How to collect Big Data into Hadoop

Fluentd v11

Error stream

Streaming processing

Better DSL

Multiprocess