Apache Flume
Loading Big Data into a Hadoop Cluster using Flume

Swapnil Dubey Big Data Hacker

GoDataDriven

Agenda

➢ What is Apache Flume?
➢ Problem statement
➢ Use case: collecting web server logs
➢ Overview/architecture of Flume
➢ Demos

What is Flume?

Flume is a system for the collection and aggregation of streaming data, typically log data.

Advantages over other solutions:
➢ Scalable, reliable, customizable
➢ Declarative and dynamic configuration
➢ Contextual routing
➢ Feature rich and fully extensible
➢ Open source

Problem Statement

Problem Statement

[Diagram: application servers generating log files and a Hadoop cluster as the destination, with no defined path between them]

Problem Statement

➢ Data collection is ad hoc
➢ How to get data into Hadoop?
➢ Streaming data

Problem Statement

[Diagram: a Flume agent collects logs from the application servers and performs the HDFS write into the Hadoop cluster]

Collecting web server logs.

➢ Collecting web logs using:
  - A single Flume agent
  - Multiple Flume agents
➢ Typical converging flow
  - Converging flow characteristics: load balancing, multiplexing, failover
  - Large converging flows
  - Event volume

Problem Statement: Single Flume Agent

[Diagram: a single Flume agent collecting logs from all application servers and performing the HDFS write]

Problem Statement: Multiple Flume Agents - 1

[Diagram: one Flume agent per application server, each agent performing its own HDFS write into the Hadoop cluster]

Problem Statement: Multiple Flume Agents - 2

[Diagram: one Flume agent per application server, all forwarding to a single aggregating Flume agent that performs the HDFS write]

Overview/Architecture of Flume

Components of Flume

[Diagram: a client generates events that flow into a Flume agent]

Core Concepts

➢ Events
➢ Client
➢ Agents
  - Source, Channel, Sink
  - Interceptor
  - Channel Selector
  - Sink Processor

Core Concept: Event

An Event is the basic unit of data transported by Flume from source to destination.
➢ The payload is opaque to Flume.
➢ Events are accompanied by optional headers.

Headers:
- Headers are a collection of unique key-value pairs.
- Headers are used for contextual routing.


Core Concept: Client

An entity that generates events and delivers them to one or more agents.

➢ Example: the Flume log4j appender
➢ Decouples Flume from the system where event data is generated.
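A minimal log4j.properties sketch of the Flume log4j appender (hostname and port are illustrative; it assumes the flume-ng-log4jappender jar is on the application classpath and an agent with a matching Avro source is listening on that port):

# Send application log events straight to a Flume agent's Avro source
log4j.rootLogger = INFO, flume
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = flume-agent.example.com
log4j.appender.flume.Port = 41414
log4j.appender.flume.layout = org.apache.log4j.PatternLayout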


Core Concepts: Agent

A container for hosting sources, channels, sinks and the other components through which events flow.

Core Concepts: Source

A component that receives events and places them onto one or more channels.
➢ Different types of sources:
  - Specialized sources for integrating with well-known systems, e.g. Syslog, Netcat
  - Auto-generating sources: Exec, SEQ
  - IPC sources for agent-to-agent communication: Avro, Thrift

➢ Requires at least one Channel to function.
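For example, a sketch of an exec source tailing a web server access log (the log path and component names are assumptions; note that an exec source gives no delivery guarantees if the agent dies):

# Exec source: run tail -F and turn each output line into a Flume event
agent.sources.tail-collect.type = exec
agent.sources.tail-collect.command = tail -F /var/log/httpd/access_log
agent.sources.tail-collect.channels = memoryChannel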

Core Concept: Channel

A component that buffers incoming events until they are consumed by sinks.

➢ Different channel types: Memory, File, Database
➢ Channels are fully transactional.
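A sketch of a durable file channel (directories are illustrative); unlike the memory channel, its events survive an agent restart:

# File channel: events are persisted on local disk until a sink takes them
agent.channels.fileChannel.type = file
agent.channels.fileChannel.checkpointDir = /var/lib/flume/checkpoint
agent.channels.fileChannel.dataDirs = /var/lib/flume/data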

Core Concepts: Sink

A component that takes events from a channel and transmits them to the next-hop destination.

Different types of sinks:
  - Terminal sinks: HDFS, HBase
  - Auto-consuming sinks: Null Sink
  - IPC sinks for agent-to-agent communication: Avro, Thrift
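A sketch of an Avro sink used for agent-to-agent forwarding (hostname, port and names are assumptions; the next-hop agent would run an Avro source on the same port):

# Avro sink: forward events to the next Flume agent in the pipeline
agent.sinks.avro-forward.type = avro
agent.sinks.avro-forward.hostname = collector.example.com
agent.sinks.avro-forward.port = 4545
agent.sinks.avro-forward.channel = fileChannel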

Core Concepts: Interceptor

Interceptors are applied to sources in a predetermined fashion to enable adding information and filtering of events.

➢ Built-in interceptors: allow adding headers such as timestamps, static markers, etc.

➢ Custom Interceptors: Create headers by inspecting the Event.
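A sketch of built-in interceptors attached to the netcat source from the demo configuration later (the datacenter key/value is an illustrative static marker):

# Timestamp interceptor adds a 'timestamp' header; static interceptor adds a fixed marker
agent.sources.netcat-collect.interceptors = ts dc
agent.sources.netcat-collect.interceptors.ts.type = timestamp
agent.sources.netcat-collect.interceptors.dc.type = static
agent.sources.netcat-collect.interceptors.dc.key = datacenter
agent.sources.netcat-collect.interceptors.dc.value = AMS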

Channel Selector

It facilitates selection of one or more Channels, based on preset criteria.

➢ Built-in channel selectors:
  - Replicating: for duplicating events
  - Multiplexing: for routing based on headers

➢ Custom selectors can be written for dynamic criteria.
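A sketch of a multiplexing selector that routes on the datacenter header set by the static interceptor above (channel names and header values are illustrative):

# Route each event to a channel based on the value of its 'datacenter' header
agent.sources.netcat-collect.selector.type = multiplexing
agent.sources.netcat-collect.selector.header = datacenter
agent.sources.netcat-collect.selector.mapping.AMS = mchannel1
agent.sources.netcat-collect.selector.mapping.NYC = mchannel2
agent.sources.netcat-collect.selector.default = mchannel1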

Sink Processor

Sink Processor is responsible for invoking one sink from a specified group of Sinks.

➢ Built-in sink processors:
  - Load Balancing Sink Processor
  - Failover Sink Processor
  - Default Sink Processor
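A sketch of a failover sink processor over two HDFS sinks (the group, sink names and priorities are illustrative; the higher-priority sink is used until it fails, then traffic falls back to the other):

# Failover: send to hdfs-write; fall back to hdfs-backup on failure
agent.sinkgroups = hdfs-group
agent.sinkgroups.hdfs-group.sinks = hdfs-write hdfs-backup
agent.sinkgroups.hdfs-group.processor.type = failover
agent.sinkgroups.hdfs-group.processor.priority.hdfs-write = 10
agent.sinkgroups.hdfs-group.processor.priority.hdfs-backup = 5
agent.sinkgroups.hdfs-group.processor.maxpenalty = 10000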

Data Ingest

[Diagram: data ingest inside an agent - clients send events to a Source; the Channel Processor runs them through Interceptors (which may filter events) and a Channel Selector, which decides which Channel(s) the events are put on]

Data Drain

➢ Event Removal from Channel is transactional.

[Diagram: data drain inside an agent - the Sink Runner invokes the Sink Processor, which selects and invokes a Sink; the Sink takes events from the Channel and sends them to the next hop]

Agent Pipeline

* Credits: http://archive.apachecon.com/na2013/presentations/27-Wednesday/Big_Data/11:45-Mastering_Sqoop_for_Data_Transfer_for_Big_Data-Arvind_Prabhakar/Arvind%20Prabhakar%20-%20Planning%20and%20Deploying%20Apache%20Flume.pdf

Assured Delivery

Agents use transactional exchange to guarantee delivery across hops.

[Diagram: the upstream agent's Sink and the downstream agent's Source each wrap the exchange in a channel transaction - start transaction, take/put events, send events, end transaction - so events leave the upstream Channel only after they are safely stored downstream]

Setting up a simple agent for HDFS

agent.sources = netcat-collect
agent.sinks = hdfs-write
agent.channels = memoryChannel

agent.sources.netcat-collect.type = netcat
agent.sources.netcat-collect.bind = 127.0.0.1
agent.sources.netcat-collect.port = 11111

agent.sinks.hdfs-write.type = hdfs
agent.sinks.hdfs-write.hdfs.path = hdfs://namenode_address:8020/path/to/flume_test
agent.sinks.hdfs-write.hdfs.rollInterval = 30
agent.sinks.hdfs-write.hdfs.writeFormat = Text
agent.sinks.hdfs-write.hdfs.fileType = DataStream

agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 10000
agent.sources.netcat-collect.channels = memoryChannel
agent.sinks.hdfs-write.channel = memoryChannel
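To try this out, save the configuration as example.conf (file name is illustrative), start the agent with the standard flume-ng launcher, and feed it some lines over netcat:

flume-ng agent --conf ./conf --conf-file example.conf --name agent -Dflume.root.logger=INFO,console
echo "hello flume" | nc 127.0.0.1 11111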

Advanced Features

Fan-In and Fan-Out

hdfs-agent.channels = mchannel1 mchannel2
hdfs-agent.sources.netcat-collect.selector.type = replicating
hdfs-agent.sources.netcat-collect.channels = mchannel1 mchannel2

Interceptors

hdfs-agent.sources.netcat-collect.interceptors = filt_int
hdfs-agent.sources.netcat-collect.interceptors.filt_int.type = regex_filter
hdfs-agent.sources.netcat-collect.interceptors.filt_int.regex = ^echo.*
hdfs-agent.sources.netcat-collect.interceptors.filt_int.excludeEvents = true

Got Big Data & Analytics work? Contact [email protected]

We are hiring!!