
Apache Flume: Distributed Log Collection for Hadoop Second Edition

Table of Contents

Apache Flume: Distributed Log Collection for Hadoop Second Edition

Credits

About the Author

About the Reviewers

www.PacktPub.com

Support files, eBooks, discount offers, and more

Why subscribe?

Free access for Packt account holders

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. Overview and Architecture

Flume 0.9

Flume 1.X (Flume-NG)

The problem with HDFS and streaming data/logs

Sources, channels, and sinks

Flume events

Interceptors, channel selectors, and sink processors

Tiered data collection (multiple flows and/or agents)

The Kite SDK

Summary


2. A Quick Start Guide to Flume

Downloading Flume

Flume in Hadoop distributions

An overview of the Flume configuration file

Starting up with “Hello, World!”

Summary

3. Channels

The memory channel

The file channel

Spillable Memory Channel

Summary

4. Sinks and Sink Processors

HDFS sink

Path and filename

File rotation

Compression codecs

Event Serializers

Text output

Text with headers

Apache Avro

User-provided Avro schema

File type

SequenceFile

DataStream

CompressedStream

Timeouts and workers

Sink groups

Load balancing

Failover

MorphlineSolrSink

Morphline configuration files


Typical SolrSink configuration

Sink configuration

ElasticSearchSink

LogStash Serializer

Dynamic Serializer

Summary

5. Sources and Channel Selectors

The problem with using tail

The Exec source

Spooling Directory Source

Syslog sources

The syslog UDP source

The syslog TCP source

The multiport syslog TCP source

JMS source

Channel selectors

Replicating

Multiplexing

Summary

6. Interceptors, ETL, and Routing

Interceptors

Timestamp

Host

Static

Regular expression filtering

Regular expression extractor

Morphline interceptor

Custom interceptors

The plugins directory

Tiering flows

The Avro source/sink


Compressing Avro

SSL Avro flows

The Thrift source/sink

Using command-line Avro

The Log4J appender

The Log4J load-balancing appender

The embedded agent

Configuration and startup

Sending data

Shutdown

Routing

Summary

7. Putting It All Together

Web logs to searchable UI

Setting up the web server

Configuring log rotation to the spool directory

Setting up the target – Elasticsearch

Setting up Flume on collector/relay

Setting up Flume on the client

Creating more search fields with an interceptor

Setting up a better user interface – Kibana

Archiving to HDFS

Summary

8. Monitoring Flume

Monitoring the agent process

Monit

Nagios

Monitoring performance metrics

Ganglia

Internal HTTP server

Custom monitoring hooks


Summary

9. There Is No Spoon – the Realities of Real-time Distributed Data Collection

Transport time versus log time

Time zones are evil

Capacity planning

Considerations for multiple data centers

Compliance and data expiry

Summary

Index


Apache Flume: Distributed Log Collection for Hadoop Second Edition


Apache Flume: Distributed Log Collection for Hadoop Second Edition

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2013

Second edition: February 2015

Production reference: 1190215

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78439-217-8

www.packtpub.com


Credits

Author

Steve Hoffman

Reviewers

Sachin Handiekar

Michael Keane

Stefan Will

Commissioning Editor

Dipika Gaonkar

Acquisition Editor

Reshma Raman

Content Development Editor

Neetu Ann Mathew

Technical Editor

Menza Mathew

Copy Editors

Vikrant Phadke

Stuti Srivastava

Project Coordinator

Mary Alex

Proofreader

Simran Bhogal

Safis Editing

Indexer

Rekha Nair

Graphics

Sheetal Aute

Abhinash Sahu

Production Coordinator

Komal Ramchandani

Cover Work

Komal Ramchandani


About the Author

Steve Hoffman has 32 years of experience in software development, ranging from embedded software development to the design and implementation of large-scale, service-oriented, object-oriented systems. For the last 5 years, he has focused on infrastructure as code, including automated Hadoop and HBase implementations and data ingestion using Apache Flume. Steve holds a BS in computer engineering from the University of Illinois at Urbana-Champaign and an MS in computer science from DePaul University. He is currently a senior principal engineer at Orbitz Worldwide (http://orbitz.com/).

More information on Steve can be found at http://bit.ly/bacoboy and on Twitter at @bacoboy.

This is the first update to Steve's first book, Apache Flume: Distributed Log Collection for Hadoop, Packt Publishing.

I'd again like to dedicate this updated book to my loving and supportive wife, Tracy. She puts up with a lot, and that is very much appreciated. I couldn't ask for a better friend daily by my side.

My terrific children, Rachel and Noah, are a constant reminder that hard work does pay off and that great things can come from chaos.

I also want to give a big thanks to my parents, Alan and Karen, for molding me into the somewhat satisfactory human I've become. Their dedication to family and education above all else guides me daily as I attempt to help my own children find their happiness in the world.


About the Reviewers

Sachin Handiekar is a senior software developer with over 5 years of experience in Java EE development. He graduated in computer science from the University of Greenwich, London, and currently works for a global consulting company, developing enterprise applications using various open source technologies, such as Apache Camel, ServiceMix, ActiveMQ, and ZooKeeper.

Sachin has a lot of interest in open source projects. He has contributed code to Apache Camel and developed plugins for Spring Social, which can be found at GitHub (https://github.com/sachin-handiekar).

He also actively writes about enterprise application development on his blog (http://sachinhandiekar.com).

Michael Keane has a BS in computer science from the University of Illinois at Urbana-Champaign. He has worked as a software engineer, coding almost exclusively in Java since JDK 1.1. He has also worked in the mission-critical domains of medical device software, e-commerce, transportation, navigation, and advertising. He is currently a development leader for Conversant, where he maintains Flume flows of nearly 100 billion log lines per day.

Michael is a father of three, and besides work, he spends most of his time with his family and coaching youth softball.

Stefan Will is a computer scientist with a degree in machine learning and pattern recognition from the University of Bonn, Germany. For over a decade, he has worked for several start-ups in Silicon Valley and Raleigh, North Carolina, in the area of search and analytics. Presently, he leads the development of the search backend and real-time analytics platform at Zendesk, a provider of customer service software.


www.PacktPub.com


Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.


Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser


Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.


Preface

Hadoop is a great open source tool for shifting tons of unstructured data into something manageable so that your business can gain better insight into your customers' needs. It's cheap (mostly free), scales horizontally as long as you have space and power in your data center, and can handle problems that would crush your traditional data warehouse. That said, a little-known secret is that your Hadoop cluster requires you to feed it data. Otherwise, you just have a very expensive heat generator! You will quickly realize (once you get past the "playing around" phase with Hadoop) that you will need a tool to automatically feed data into your cluster. In the past, you had to come up with a solution for this problem, but no more! Flume was started as a project out of Cloudera, when its integration engineers had to keep writing tools over and over again for their customers to automatically import data. Today, the project lives with the Apache Foundation, is under active development, and boasts of users who have been using it in their production environments for years.

In this book, I hope to get you up and running quickly with an architectural overview of Flume and a quick-start guide. After that, we'll dive deep into the details of many of the more useful Flume components, including the very important file channel for the persistence of in-flight data records and the HDFS Sink for buffering and writing data into HDFS (the Hadoop File System). Since Flume comes with a wide variety of modules, chances are that the only tool you'll need to get started is a text editor for the configuration file.

By the time you reach the end of this book, you should know enough to build a highly available, fault-tolerant, streaming data pipeline that feeds your Hadoop cluster.


What this book covers

Chapter 1, Overview and Architecture, introduces Flume and the problem space that it's trying to address (specifically with regards to Hadoop). An architectural overview of the various components to be covered in later chapters is given.

Chapter 2, A Quick Start Guide to Flume, serves to get you up and running quickly. It includes downloading Flume, creating a “Hello, World!” configuration, and running it.

Chapter 3, Channels, covers the two major channels most people will use and the configuration options available for each of them.

Chapter 4, Sinks and Sink Processors, goes into great detail on using the HDFS Flume output, including compression options and options for formatting the data. Failover options are also covered so that you can create a more robust data pipeline.

Chapter 5, Sources and Channel Selectors, introduces several of the Flume input mechanisms and their configuration options. Also covered is switching between different channels based on data content, which allows the creation of complex data flows.

Chapter 6, Interceptors, ETL, and Routing, explains how to transform data in-flight as well as extract information from the payload to use with Channel Selectors to make routing decisions. Then this chapter covers tiering Flume agents using Avro serialization, as well as using the Flume command line as a standalone Avro client for testing and importing data manually.

Chapter 7, Putting It All Together, walks you through the details of an end-to-end use case from the web server logs to a searchable UI, backed by Elasticsearch as well as archival storage in HDFS.

Chapter 8, Monitoring Flume, discusses various options available for monitoring Flume both internally and externally, including Monit, Nagios, Ganglia, and custom hooks.

Chapter 9, There Is No Spoon – the Realities of Real-time Distributed Data Collection, is a collection of miscellaneous things to consider that are outside the scope of just configuring and using Flume.


What you need for this book

You'll need a computer with a Java Virtual Machine installed, since Flume is written in Java. If you don't have Java on your computer, you can download it from http://java.com/.

You will also need an Internet connection so that you can download Flume to run the Quick Start example.

This book covers Apache Flume 1.5.2.


Who this book is for

This book is for people responsible for implementing the automatic movement of data from various systems to a Hadoop cluster. If it is your job to load data into Hadoop on a regular basis, this book should help you to code yourself out of manual monkey work or from writing a custom tool you'll be supporting for as long as you work at your company.

Only basic knowledge of Hadoop and HDFS is required. Some custom implementations are covered, should your needs necessitate them. For this level of implementation, you will need to know how to program in Java.

Finally, you'll need your favorite text editor, since most of this book covers how to configure various Flume components via an agent's text configuration file.


Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and explanations of their meanings.

Code words in text are shown as follows: “If you want to use this feature, you set the useDualCheckpoints property to true and specify a location for that second checkpoint directory with the backupCheckpointDir property.”

A block of code is set as follows:

agent.sinks.k1.hdfs.path=/logs/apache/access

agent.sinks.k1.hdfs.filePrefix=access

agent.sinks.k1.hdfs.fileSuffix=.log

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

agent.sources.s1.command=uptime

agent.sources.s1.restart=true

agent.sources.s1.restartThrottle=60000

Any command-line input or output is written as follows:

$ tar -zxf apache-flume-1.5.2.tar.gz
$ cd apache-flume-1.5.2

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: “Flume was first introduced in Cloudera's CDH3 distribution in 2011.”

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.


Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.


Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.


Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.


Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.


Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.


Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.


Chapter 1. Overview and Architecture

If you are reading this book, chances are you are swimming in oceans of data. Creating mountains of data has become very easy, thanks to Facebook, Twitter, Amazon, digital cameras and camera phones, YouTube, Google, and just about anything else you can think of being connected to the Internet. As a provider of a website, 10 years ago, your application logs were only used to help you troubleshoot your website. Today, this same data can provide a valuable insight into your business and customers if you know how to pan gold out of your river of data.

Furthermore, as you are reading this book, you are also aware that Hadoop was created to solve (partially) the problem of sifting through mountains of data. Of course, this only works if you can reliably load your Hadoop cluster with data for your data scientists to pick apart.

Getting data into and out of Hadoop (in this case, the Hadoop File System, or HDFS) isn't hard; it is just a simple command, such as:

% hadoop fs -put data.csv .

This works great when you have all your data neatly packaged and ready to upload.

However, your website is creating data all the time. How often should you batch load data to HDFS? Daily? Hourly? Whatever processing period you choose, eventually somebody always asks “can you get me the data sooner?” What you really need is a solution that can deal with streaming logs/data.

Turns out you aren't alone in this need. Cloudera, a provider of professional services for Hadoop as well as their own distribution of Hadoop, saw this need over and over when working with their customers. Flume was created to fill this need and create a standard, simple, robust, flexible, and extensible tool for data ingestion into Hadoop.


Flume 0.9

Flume was first introduced in Cloudera's CDH3 distribution in 2011. It consisted of a federation of worker daemons (agents) configured from a centralized master (or masters) via Zookeeper (a federated configuration and coordination system). From the master, you could check the agent status in a web UI as well as push out configuration centrally from the UI or via a command-line shell (both really communicating via Zookeeper to the worker agents).

Data could be sent in one of three modes: Best effort (BE), Disk Failover (DFO), and End-to-End (E2E). The masters were used for the E2E mode acknowledgements, and multimaster configuration never really matured, so you usually only had one master, making it a central point of failure for E2E data flows. The BE mode is just what it sounds like: the agent would try to send the data, but if it couldn't, the data would be discarded. This mode is good for things such as metrics, where gaps can easily be tolerated, as new data is just a second away. The DFO mode stores undeliverable data to the local disk (or sometimes, a local database) and would keep retrying until the data could be delivered to the next recipient in your data flow. This is handy for those planned (or unplanned) outages, as long as you have sufficient local disk space to buffer the load.

In June, 2011, Cloudera moved control of the Flume project to the Apache Foundation. It came out of the incubator status a year later in 2012. During the incubation year, work had already begun to refactor Flume under the Star-Trek-themed tag, Flume-NG (Flume the Next Generation).


Flume 1.X (Flume-NG)

There were many reasons why Flume was refactored. If you are interested in the details, you can read about them at https://issues.apache.org/jira/browse/FLUME-728. What started as a refactoring branch eventually became the main line of development as Flume 1.X.

The most obvious change in Flume 1.X is that the centralized configuration master(s) and Zookeeper are gone. The configuration in Flume 0.9 was overly verbose, and mistakes were easy to make. Furthermore, centralized configuration was really outside the scope of Flume's goals. Centralized configuration was replaced with a simple on-disk configuration file (although the configuration provider is pluggable so that it can be replaced). These configuration files are easily distributed using tools such as cf-engine, Chef, and Puppet. If you are using a Cloudera distribution, take a look at Cloudera Manager to manage your configurations. About two years ago, they created a free version with no node limit, so it may be an attractive option for you. Just be sure you don't manage these configurations manually, or you'll be editing these files manually forever.

Another major difference in Flume 1.X is that the reading of input data and the writing of output data are now handled by different worker threads (called Runners). In Flume 0.9, the input thread also did the writing to the output (except for failover retries). If the output writer was slow (rather than just failing outright), it would block Flume's ability to ingest data. This new asynchronous design leaves the input thread blissfully unaware of any downstream problem.

The first edition of this book covered all the versions of Flume up till Version 1.3.1. This second edition will cover up to Version 1.5.2 (the current version at the time of writing this).


The problem with HDFS and streaming data/logs

HDFS isn't a real filesystem, at least not in the traditional sense, and many of the things we take for granted with normal filesystems don't apply here, such as being able to mount it. This makes getting your streaming data into Hadoop a little more complicated.

In a regular POSIX-style filesystem, if you open a file and write data, it still exists on the disk before the file is closed. That is, if another program opens the same file and starts reading, it will get the data already flushed by the writer to the disk. Furthermore, if this writing process is interrupted, any portion that made it to disk is usable (it may be incomplete, but it exists).

In HDFS, the file exists only as a directory entry; it shows zero length until the file is closed. This means that if data is written to a file for an extended period without closing it, a network disconnect with the client will leave you with nothing but an empty file for all your efforts. This may lead you to the conclusion that it would be wise to write small files so that you can close them as soon as possible.

The problem is that Hadoop doesn't like lots of tiny files. As the HDFS filesystem metadata is kept in memory on the NameNode, the more files you create, the more RAM you'll need to use. From a MapReduce perspective, tiny files lead to poor efficiency. Usually, each Mapper is assigned a single block of a file as the input (unless you have used certain compression codecs). If you have lots of tiny files, the cost of starting the worker processes can be disproportionally high compared to the data it is processing. This kind of block fragmentation also results in more Mapper tasks, increasing the overall job runtimes.

These factors need to be weighed when determining the rotation period to use when writing to HDFS. If the plan is to keep the data around for a short time, then you can lean toward the smaller file size. However, if you plan on keeping the data for a very long time, you can either target larger files or do some periodic cleanup to compact smaller files into fewer, larger files to make them more MapReduce friendly. After all, you only ingest the data once, but you might run a MapReduce job on that data hundreds or thousands of times.
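
To make the trade-off concrete, here is a rough sketch of the HDFS sink properties that control file rotation (the sink name k1, the channel name c1, and the specific values are only illustrative; these settings are covered in detail in Chapter 4, Sinks and Sink Processors):

agent.sinks.k1.type=hdfs
agent.sinks.k1.channel=c1
agent.sinks.k1.hdfs.path=/logs/apache/access
# Roll to a new file every 10 minutes or at roughly one 128 MB block,
# whichever comes first; 0 disables count-based rolling
agent.sinks.k1.hdfs.rollInterval=600
agent.sinks.k1.hdfs.rollSize=134217728
agent.sinks.k1.hdfs.rollCount=0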


Sources, channels, and sinks

The Flume agent's architecture can be viewed in this simple diagram. Inputs are called sources and outputs are called sinks. Channels provide the glue between sources and sinks. All of these run inside a daemon called an agent.

Note

Keep in mind:

A source writes events to one or more channels.
A channel is the holding area as events are passed from a source to a sink.
A sink receives events from one channel only.
An agent can have many channels.
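
As a rough sketch of how these pieces map onto an agent's properties file (the names agent, s1, c1, and k1 are arbitrary placeholders; a complete, runnable example follows in Chapter 2, A Quick Start Guide to Flume):

# One agent named "agent" with one source, one channel, and one sink
agent.sources=s1
agent.channels=c1
agent.sinks=k1
# The source appends events to channel c1; the sink drains events from c1
agent.sources.s1.channels=c1
agent.sinks.k1.channel=c1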


Flume events

The basic payload of data transported by Flume is called an event. An event is composed of zero or more headers and a body.

The headers are key/value pairs that can be used to make routing decisions or carry other structured information (such as the timestamp of the event or the hostname of the server from which the event originated). You can think of it as serving the same function as HTTP headers—a way to pass additional information that is distinct from the body.

The body is an array of bytes that contains the actual payload. If your input is comprised of tailed log files, the array is most likely a UTF-8-encoded string containing a line of text.

Flume may add additional headers automatically (like when a source adds the hostname where the data is sourced or creating an event's timestamp), but the body is mostly untouched unless you edit it en route using interceptors.


Interceptors, channel selectors, and sink processors

An interceptor is a point in your data flow where you can inspect and alter Flume events. You can chain zero or more interceptors after a source creates an event. If you are familiar with the AOP Spring Framework, think MethodInterceptor. In Java Servlets, it's similar to ServletFilter. Here's an example of what using four chained interceptors on a source might look like:

Channel selectors are responsible for how data moves from a source to one or more channels. Flume comes packaged with two channel selectors that cover most use cases you might have, although you can write your own if need be. A replicating channel selector (the default) simply puts a copy of the event into each channel, assuming you have configured more than one. In contrast, a multiplexing channel selector can write to different channels depending on some header information. Combined with some interceptor logic, this duo forms the foundation for routing input to different channels.
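
For example, here is a rough sketch of a multiplexing channel selector keyed on a hypothetical datacenter header (the header name, values, and channel names are made up for illustration; channel selectors are covered in Chapter 5, Sources and Channel Selectors):

agent.sources.s1.channels=c1 c2
agent.sources.s1.selector.type=multiplexing
agent.sources.s1.selector.header=datacenter
# Events whose datacenter header is "east" go to c1 and "west" goes to c2;
# anything else falls back to the default channel
agent.sources.s1.selector.mapping.east=c1
agent.sources.s1.selector.mapping.west=c2
agent.sources.s1.selector.default=c1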

Finally, a sink processor is the mechanism by which you can create failover paths for your sinks or load balance events across multiple sinks from a channel.
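
A minimal sketch of a sink group using the failover processor might look like this (the sink names and priorities are illustrative only; sink groups are covered in Chapter 4, Sinks and Sink Processors):

agent.sinkgroups=sg1
agent.sinkgroups.sg1.sinks=k1 k2
agent.sinkgroups.sg1.processor.type=failover
# The higher priority sink is preferred; k2 only receives events while k1 is failing
agent.sinkgroups.sg1.processor.priority.k1=10
agent.sinkgroups.sg1.processor.priority.k2=5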


Tiered data collection (multiple flows and/or agents)

You can chain your Flume agents depending on your particular use case. For example, you may want to insert an agent in a tiered fashion to limit the number of clients trying to connect directly to your Hadoop cluster. More likely, your source machines don't have sufficient disk space to deal with a prolonged outage or maintenance window, so you create a tier with lots of disk space between your sources and your Hadoop cluster.

In the following diagram, you can see that there are two places where data is created (on the left-hand side) and two final destinations for the data (the HDFS and ElasticSearch cloud bubbles on the right-hand side). To make things more interesting, let's say one of the machines generates two kinds of data (let's call them square and triangle data). You can see that in the lower-left agent, we use a multiplexing channel selector to split the two kinds of data into different channels. The square channel is then routed to the agent in the upper-right corner (along with the data coming from the upper-left agent). The combined volume of events is written together in HDFS in Datacenter 1. Meanwhile, the triangle data is sent to the agent that writes to ElasticSearch in Datacenter 2. Keep in mind that data transformations can occur after any source. How all of these components can be used to build complicated data workflows will become clear as we proceed.
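
The hop between tiers is typically an Avro sink on the upstream agent pointed at an Avro source on the downstream agent. A rough sketch of one such hop might look like this (the agent names, the host collector.example.com, and port 4141 are placeholders; tiering is covered in Chapter 6, Interceptors, ETL, and Routing):

# Upstream (client) agent: forward events over Avro RPC
client.sinks.k1.type=avro
client.sinks.k1.channel=c1
client.sinks.k1.hostname=collector.example.com
client.sinks.k1.port=4141

# Downstream (collector) agent: accept Avro RPC from upstream agents
collector.sources.r1.type=avro
collector.sources.r1.channels=c1
collector.sources.r1.bind=0.0.0.0
collector.sources.r1.port=4141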


The Kite SDK

One of the new technologies incorporated in Flume, starting with Version 1.4, is something called a Morphline. You can think of a Morphline as a series of commands chained together to form a data transformation pipe.

If you are a fan of pipelining Unix commands, this will be very familiar to you. The commands themselves are intended to be small, single-purpose functions that when chained together create powerful logic. In many ways, using a Morphline command chain can be identical in functionality to the interceptor paradigm just mentioned. There is a Morphline interceptor we will cover in Chapter 6, Interceptors, ETL, and Routing, which you can use instead of, or in addition to, the included Java-based interceptors.

Note

To get an idea of how useful these commands can be, take a look at the handy grok command and its included extensible regular expression library at https://github.com/kite-sdk/kite/blob/master/kite-morphlines/kite-morphlines-core/src/test/resources/grok-dictionaries/grok-patterns

Many of the custom Java interceptors that I've written in the past were to modify the body (data) and can easily be replaced with an out-of-the-box Morphline command chain. You can get familiar with the Morphline commands by checking out their reference guide at http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html

Flume Version 1.4 also includes a Morphline-backed sink used primarily to feed data into Solr. We'll see more of this in Chapter 4, Sinks and Sink Processors, Morphline Solr Search Sink.

Morphlines are just one component of the Kite SDK included in Flume. Starting with Version 1.5, Flume has added experimental support for KiteData, which is an effort to create a standard library for datasets in Hadoop. It looks very promising, but it is outside the scope of this book.

Note

Please see the project home page for more information, as it will certainly become more prominent in the Hadoop ecosystem as the technology matures. You can read all about the Kite SDK at http://kitesdk.org.


Summary

In this chapter, we discussed the problem that Flume is attempting to solve: getting data into your Hadoop cluster for data processing in an easily configured, reliable way. We also discussed the Flume agent and its logical components, including events, sources, channel selectors, channels, sink processors, and sinks. Finally, we briefly discussed Morphlines as a powerful new ETL (Extract, Transform, Load) library, starting with Version 1.4 of Flume.

The next chapter will cover these in more detail, specifically, the most commonly used implementations of each. Like all good open source projects, almost all of these components are extensible if the bundled ones don't do what you need them to do.


Chapter 2. A Quick Start Guide to Flume

As we covered some of the basics in the previous chapter, this chapter will help you get started with Flume. So, let's start with the first step: downloading and configuring Flume.


Downloading Flume

Let's download Flume from http://flume.apache.org/. Look for the download link in the side navigation. You'll see two compressed .tar archives available along with the checksum and GPG signature files used to verify the archives. Instructions to verify the download are on the website, so I won't cover them here. Checking the checksum file contents against the actual checksum verifies that the download was not corrupted. Checking the signature file validates that all the files you are downloading (including the checksum and signature) came from Apache and not some nefarious location. Do you really need to verify your downloads? In general, it is a good idea and it is recommended by Apache that you do so. If you choose not to, I won't tell.

The binary distribution archive has bin in the name, and the source archive is marked with src. The source archive contains just the Flume source code. The binary distribution is much larger because it contains not only the Flume source and the compiled Flume components (jars, javadocs, and so on), but also all the dependent Java libraries. The binary package contains the same Maven POM file as the source archive, so you can always recompile the code even if you start with the binary distribution.

Go ahead, download and verify the binary distribution to save us some time in getting started.


Flume in Hadoop distributions

Flume is available with some Hadoop distributions. The distributions supposedly provide bundles of Hadoop's core components and satellite projects (such as Flume) in a way that ensures things such as version compatibility and additional bug fixes are taken into account. These distributions aren't better or worse; they're just different.

There are benefits to using a distribution. Someone else has already done the work of pulling together all the version-compatible components. Today, this is less of an issue since the Apache BigTop project started (http://bigtop.apache.org/). Nevertheless, having prebuilt standard OS packages, such as RPMs and DEBs, eases installation as well as provides startup/shutdown scripts. Each distribution has different levels of free and paid options, including paid professional services if you really get into a situation you just can't handle.

There are downsides, of course. The version of Flume bundled in a distribution will often lag quite a bit behind the Apache releases. If there is a new or bleeding-edge feature you are interested in using, you'll either be waiting for your distribution's provider to backport it for you, or you'll be stuck patching it yourself. Furthermore, while the distribution providers do a fair amount of testing, as with any general-purpose platform, you will most likely encounter something that their testing didn't cover, in which case, you are still on the hook to come up with a workaround or dive into the code, fix it, and hopefully, submit that patch back to the open source community (where, at a future point, it'll make it into an update of your distribution or the next version).

So, things move slower in a Hadoop distribution world. You can see that as good or bad. Usually, large companies don't like the instability of bleeding-edge technology or making changes often, as change can be the most common cause of unplanned outages. You'd be hard pressed to find such a company using the bleeding-edge Linux kernel rather than something like Red Hat Enterprise Linux (RHEL), CentOS, Ubuntu LTS, or any of the other distributions whose target is stability and compatibility. If you are a startup building the next Internet fad, you might need that bleeding-edge feature to get a leg up on the established competition.

If you are considering a distribution, do the research and see what you are getting (or not getting) with each. Remember that each of these offerings is hoping that you'll eventually want and/or need their Enterprise offering, which usually doesn't come cheap. Do your homework.

Note

Here's a short, nondefinitive list of some of the more established players. For more information, refer to the following links:

Cloudera: http://cloudera.com/
Hortonworks: http://hortonworks.com/
MapR: http://mapr.com/


An overview of the Flume configuration file

Now that we've downloaded Flume, let's spend some time going over how to configure an agent.

A Flume agent's default configuration provider uses a simple Java property file of key/value pairs that you pass as an argument to the agent upon startup. As you can configure more than one agent in a single file, you will need to additionally pass an agent identifier (called a name) so that it knows which configurations to use. In my examples where I'm only specifying one agent, I'm going to use the name agent.

Note

By default, the configuration property file is monitored for changes every 30 seconds. If a change is detected, Flume will attempt to reconfigure itself. In practice, many of the configuration settings cannot be changed after the agent has started. Save yourself some trouble and pass the undocumented --no-reload-conf argument when starting the agent (except in development situations perhaps).

If you use the Cloudera distribution, the passing of this flag is currently not possible. I've opened a ticket to fix that at https://issues.cloudera.org/browse/DISTRO-648. If this is important to you, please vote it up.
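
Assuming you start the agent with the flume-ng script shown later in this chapter, passing the flag would look something like this (the hw.conf file is just the example configuration used below):

$ ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf --no-reload-conf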

Each agent is configured, starting with three parameters:

agent.sources=<list of sources>
agent.channels=<list of channels>
agent.sinks=<list of sinks>

Tip

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Each source, channel, and sink also has a unique name within the context of that agent. For example, if I'm going to transport my Apache access logs, I might define a channel named access. The configurations for this channel would all start with the agent.channels.access prefix. Each configuration item has a type property that tells Flume what kind of source, channel, or sink it is. In this case, we are going to use an in-memory channel whose type is memory. The complete configuration for the channel named access in the agent named agent would be:

agent.channels.access.type=memory

Any arguments to a source, channel, or sink are added as additional properties using the same prefix. The memory channel has a capacity parameter to indicate the maximum number of Flume events it can hold. Let's say we didn't want to use the default value of 100; our configuration would now look like this:

agent.channels.access.type=memory

agent.channels.access.capacity=200

Finally, we need to add the access channel name to the agent.channels property so that the agent knows to load it:

agent.channels=access

Let’slookatacompleteexampleusingthecanonicalHello,World!example.


Starting up with “Hello, World!”

No technical book would be complete without a Hello, World! example. Here is the configuration file we'll be using:

agent.sources=s1

agent.channels=c1

agent.sinks=k1

agent.sources.s1.type=netcat

agent.sources.s1.channels=c1

agent.sources.s1.bind=0.0.0.0

agent.sources.s1.port=12345

agent.channels.c1.type=memory

agent.sinks.k1.type=logger

agent.sinks.k1.channel=c1

Here, I've defined one agent (called agent) who has a source named s1, a channel named c1, and a sink named k1.

The s1 source's type is netcat, which simply opens a socket listening for events (one line of text per event). It requires two parameters: a bind IP and a port number. In this example, we are using 0.0.0.0 for a bind address (the Java convention to specify listen on any address) and port 12345. The source configuration also has a parameter called channels (plural), which is the name of the channel(s) the source will append events to, in this case, c1. It is plural, because you can configure a source to write to more than one channel; we just aren't doing that in this simple example.

The channel named c1 is a memory channel with a default configuration.

The sink named k1 is of the logger type. This is a sink that is mostly used for debugging and testing. It will log all events at the INFO level using Log4j, which it receives from the configured channel, in this case, c1. Here, the channel keyword is singular because a sink can only be fed data from one channel.

Using this configuration, let's run the agent and connect to it using the Linux netcat utility to send an event.

First, explode the .tar archive of the binary distribution we downloaded earlier:

$ tar -zxf apache-flume-1.5.2-bin.tar.gz
$ cd apache-flume-1.5.2-bin

Next, let's briefly look at the help. Run the flume-ng command with the help command:

$ ./bin/flume-ng help

Usage: ./bin/flume-ng <command> [options]...

commands:
  help                      display this help text
  agent                     run a Flume agent
  avro-client               run an avro Flume client
  version                   show Flume version info

global options:
  --conf,-c <conf>          use configs in <conf> directory
  --classpath,-C <cp>       append to the classpath
  --dryrun,-d               do not actually start Flume, just print the command
  --plugins-path <dirs>     colon-separated list of plugins.d directories. See the
                            plugins.d section in the user guide for more details.
                            Default: $FLUME_HOME/plugins.d
  -Dproperty=value          sets a Java system property value
  -Xproperty=value          sets a Java -X option

agent options:
  --conf-file,-f <file>     specify a config file (required)
  --name,-n <name>          the name of this agent (required)
  --help,-h                 display help text

avro-client options:
  --rpcProps,-P <file>      RPC client properties file with server connection params
  --host,-H <host>          hostname to which events will be sent
  --port,-p <port>          port of the avro source
  --dirname <dir>           directory to stream to avro source
  --filename,-F <file>      text file to stream to avro source (default: std input)
  --headerFile,-R <file>    File containing event headers as key/value pairs on each new line
  --help,-h                 display help text

  Either --rpcProps or both --host and --port must be specified.

Note that if <conf> directory is specified, then it is always included
first in the classpath.

As you can see, there are two ways with which you can invoke the command (other than the simple help and version commands). We will be using the agent command. The use of avro-client will be covered later.

The agent command has two required parameters: a configuration file to use and the agent name (in case your configuration contains multiple agents).

Let's take our sample configuration and open an editor (vi in my case, but use whatever you like):

$ vi conf/hw.conf

Next, place the contents of the preceding configuration into the editor, save, and exit back to the shell.

Now you can start the agent:

$ ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console


The -Dflume.root.logger property overrides the root logger in conf/log4j.properties to use the console appender.

If we didn't override the root logger, everything would still work, but the output would go to the log/flume.log file instead of being based on the contents of the default configuration file. Of course, you can edit the conf/log4j.properties file and change the flume.root.logger property (or anything else you like). To change just the path or filename, you can set the flume.log.dir and flume.log.file properties in the configuration file or pass additional flags on the command line as follows:

$ ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console -Dflume.log.dir=/tmp -Dflume.log.file=flume-agent.log

You might ask why you need to specify the -c parameter, as the -f parameter contains the complete relative path to the configuration. The reason for this is that the Log4j configuration file should be included on the classpath.

If you left the -c parameter off the command, you'll see this error:

Warning: No configuration directory set! Use --conf <dir> to override.
log4j:WARN No appenders could be found for logger (org.apache.flume.lifecycle.LifecycleSupervisor).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

But you didn't do that, so you should see these key log lines:

2014-10-05 15:39:06,109 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration.validateConfiguration(FlumeConfiguration.java:140)] Post-validation flume configuration contains configuration for agents: [agent]

This line tells you that your agent starts with the name agent.

Usually, you'd look for this line only to be sure you started the right configuration when you have multiple configurations defined in your configuration file.

2014-10-05 15:39:06,076 (conf-file-poller-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:133)] Reloading configuration file:conf/hw.conf

This is another sanity check to make sure you are loading the correct file, in this case, our hw.conf file.

2014-10-05 15:39:06,221 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:138)] Starting new configuration:{ sourceRunners:{s1=EventDrivenSourceRunner: { source:org.apache.flume.source.NetcatSource{name:s1,state:IDLE} }} sinkRunners:{k1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@442fbe47 counterGroup:{ name:null counters:{} } }} channels:{c1=org.apache.flume.channel.MemoryChannel{name: c1}} }


Once all the configurations have been parsed, you will see this message, which shows you everything that was configured. You can see s1, c1, and k1, and which Java classes are actually doing the work. As you probably guessed, netcat is a convenience for org.apache.flume.source.NetcatSource. We could have used the class name if we wanted. In fact, if I had my own custom source written, I would use its class name for the source's type parameter. You cannot define your own short names without patching the Flume distribution.

2014-10-05 15:39:06,427 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:164)] Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/0.0.0.0:12345]

Here, we see that our source is now listening on port 12345 for the input. So, let's send some data to it.

Finally, open a second terminal. We'll use the nc command (you can use Telnet or anything else similar) to send the Hello World string and press the Return (Enter) key to mark the end of the event:

% nc localhost 12345
Hello World

OK

The OK message came from the agent after we pressed the Return key, signifying that it accepted the line of text as a single Flume event. If you look at the agent log, you will see the following:

2014-10-05 15:44:11,215 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 48 65 6C 6C 6F 20 57 6F 72 6C 64                Hello World }

This log message shows you that the Flume event contains no headers (NetcatSource doesn't add any itself). The body is shown in hexadecimal along with a string representation (for us humans to read, in this case, our Hello World message).

If I send the following line and then press the Enter key, you'll get an OK message:

The quick brown fox jumped over the lazy dog.

You'll see this in the agent's log:

2014-10-05 15:44:57,232 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 54 68 65 20 71 75 69 63 6B 20 62 72 6F 77 6E 20 The quick brown }

The event appears to have been truncated. The logger sink, by design, limits the body content to 16 bytes to keep your screen from being filled with more than what you'd need in a debugging context. If you need to see the full contents for debugging, you should use a different sink, perhaps the file_roll sink, which would write to the local filesystem.
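
If you want to try that, a rough sketch of swapping the logger sink for a file_roll sink might look like this (the output directory is an arbitrary example; create it before starting the agent):

agent.sinks.k1.type=file_roll
agent.sinks.k1.channel=c1
# Write events as text to this local directory, rolling to a new file
# every 30 seconds (the default)
agent.sinks.k1.sink.directory=/tmp/flume-events
agent.sinks.k1.sink.rollInterval=30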


Summary

In this chapter, we covered how to download the Flume binary distribution. We created a simple configuration file that included one source writing to one channel, feeding one sink. The source listened on a socket for network clients to connect to and to send it event data. These events were written to an in-memory channel and then fed to a Log4j sink to become the output. We then connected to our listening agent using the Linux netcat utility and sent some string events to our Flume agent's source. Finally, we verified that our Log4j-based sink wrote the events out.

In the next chapter, we'll take a detailed look at the two major channel types you'll most likely use in your data processing workflows: the memory channel and the file channel.

We will also take a look at a new experimental channel, introduced in Version 1.5 of Flume, called the Spillable Memory Channel, which attempts to be a hybrid of the other two.

For each type, we'll discuss all the configuration knobs available to you, when and why you might want to deviate from the defaults, and most importantly, why to use one over the other.


Chapter3.ChannelsInFlume,achannelistheconstructusedbetweensourcesandsinks.Itprovidesabufferforyourin-flighteventsaftertheyarereadfromsourcesuntiltheycanbewrittentosinksinyourdataprocessingpipelines.

Theprimarytypeswe’llcoverhereareamemory-backed/nondurablechannelandalocal-filesystem-backed/durablechannel.StartingwithFlume1.5,anexperimentalhybridmemoryandfilechannelcalledtheSpillableMemoryChannelisintroduced.Thedurablefilechannelflushesallchangestodiskbeforeacknowledgingthereceiptoftheeventtothesender.Thisisconsiderablyslowerthanusingthenondurablememorychannel,butitprovidesrecoverabilityintheeventofsystemorFlumeagentrestarts.Conversely,thememorychannelismuchfaster,butfailureresultsindatalossandithasmuchlowerstoragecapacitywhencomparedtothemultiterabytedisksbackingthefilechannel.ThisiswhytheSpillableMemoryChannelwascreated.Intheory,yougetthebenefitsofmemoryspeeduntilthememoryfillsupduetoflowbackpressure.Atthispoint,thediskwillbeusedtostoretheevents—andwiththatcomesmuchlargercapacity.Therearetrade-offshereaswell,asperformanceisnowvariabledependingonhowtheentireflowisperforming.Ultimately,thechannelyouchoosedependsonyourspecificusecases,failurescenarios,andrisktolerance.

That said, regardless of what channel you choose, if your rate of ingest from the sources into the channel is greater than the rate at which the sink can write data, you will exceed the capacity of the channel and throw a ChannelException. What your source does or doesn't do with that ChannelException is source-specific, but in some cases, data loss is possible, so you'll want to avoid filling channels by sizing things properly. In fact, you always want your sink to be able to write faster than your source input. Otherwise, you might get into a situation where once your sink falls behind, you can never catch up. If your data volume tracks with the site usage, you can have higher volumes during the day and lower volumes at night, giving your channels time to drain. In practice, you'll want to try and keep the channel depth (the number of events currently in the channel) as low as possible because time spent in the channel translates to a time delay before reaching the final destination.


The memory channel

A memory channel, as expected, is a channel where in-flight events are stored in memory. As memory is (usually) orders of magnitude faster than the disk, events can be ingested much more quickly, resulting in reduced hardware needs. The downside of using this channel is that an agent failure (hardware problem, power outage, JVM crash, Flume restart, and so on) results in the loss of data. Depending on your use case, this might be perfectly fine. System metrics usually fall into this category, as a few lost data points isn't the end of the world. However, if your events represent purchases on your website, then a memory channel would be a poor choice.

To use the memory channel, set the type parameter on your named channel to memory.

agent.channels.c1.type=memory

This defines a memory channel named c1 for the agent named agent.

Here is a table of configuration parameters you can adjust from the default values:

Key Required Type Default

type Yes String memory

capacity No int 100

transactionCapacity No int 100

byteCapacityBufferPercentage No int(percent) 20%

byteCapacity No long (bytes) 80% of JVM heap

keep-alive No int 3(seconds)

The default capacity of this channel is 100 events. This can be adjusted by setting the capacity property as follows:

agent.channels.c1.capacity=200

Remember that if you increase this value, you will most likely have to increase your Java heap space using the -Xmx, and optionally -Xms, parameters.

Another capacity-related setting you can set is transactionCapacity. This is the maximum number of events that can be written, also called a put, by a source's ChannelProcessor, the component responsible for moving data from the source to the channel, in a single transaction. This is also the number of events that can be read, also called a take, in a single transaction by the SinkProcessor, which is the component responsible for moving data from the channel to the sink. You might want to set this higher in order to decrease the overhead of the transaction wrapper, which might speed things up. The downside to increasing this is that a source would have to roll back more data in the event of a failure.


Tip
Flume only provides transactional guarantees for each channel in each individual agent. In a multiagent, multichannel configuration, duplicates and out-of-order delivery are likely but should not be considered the norm. If you are getting duplicates in nonfailure conditions, it means that you need to continue tuning your Flume configurations.

If you are using a sink that writes someplace that benefits from larger batches of work (such as HDFS), you might want to set this higher. Like many things, the only way to be sure is to run performance tests with different values. This blog post from Flume committer Mike Percy should give you some good starting points: http://bit.ly/flumePerfPt1.

The byteCapacityBufferPercentage and byteCapacity parameters were introduced in https://issues.apache.org/jira/browse/FLUME-1535 as a means to size the memory channel capacity using the number of bytes used rather than the number of events, as well as trying to avoid OutOfMemoryErrors. If your events have a large variance in size, you might be tempted to use these settings to adjust the capacity, but be warned that calculations are estimated from the event's body only. If you have any headers, which you will, your actual memory usage will be higher than the configured values.

Finally, the keep-alive parameter is the time the thread writing data into the channel will wait when the channel is full, before giving up. As data is being drained from the channel at the same time, if space opens up before the timeout expires, the data will be written to the channel rather than throwing an exception back to the source. You might be tempted to set this value very high, but remember that waiting for a write to a channel will block the data flowing into your source, which might cause data to back up in an upstream agent. Eventually, this might result in events being dropped. You need to size for periodic spikes in traffic as well as temporary planned (and unplanned) maintenance.

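Putting these knobs together, a sketch of a tuned memory channel might look like the following (the specific numbers are illustrative assumptions, not recommendations; derive your own from testing and available heap):

agent.channels.c1.type=memory
agent.channels.c1.capacity=10000
agent.channels.c1.transactionCapacity=1000
agent.channels.c1.byteCapacity=536870912
agent.channels.c1.byteCapacityBufferPercentage=20
agent.channels.c1.keep-alive=3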

The file channel

A file channel is a channel that stores events to the local filesystem of the agent. Though it's slower than the memory channel, it provides a durable storage path that can survive most issues and should be used in use cases where a gap in your data flow is undesirable.

This durability is provided by a combination of a Write Ahead Log (WAL) and one or more file storage directories. The WAL is used to track all input and output from the channel in an atomically safe way. This way, if the agent is restarted, the WAL can be replayed to make sure all the events that came into the channel (puts) have been written out (takes) before the stored data can be purged from the local filesystem.

Additionally, the file channel supports the encryption of data written to the filesystem if your data handling policy requires that all data on the disk (even temporarily) be encrypted. I won't cover this here, but should you need it, there is an example in the Flume User Guide (http://flume.apache.org/FlumeUserGuide.html). Keep in mind that using encryption will reduce the throughput of your file channel.

To use the file channel, set the type parameter on your named channel to file.

agent.channels.c1.type=file

This defines a file channel named c1 for the agent named agent.

Here is a table of configuration parameters you can adjust from the default values:

Key Required Type Default

type Yes String file

checkpointDir No String ~/.flume/file-channel/checkpoint

useDualCheckpoints No boolean false

backupCheckpointDir No String No default, but must be different from checkpointDir

dataDirs No String (comma-separated list) ~/.flume/file-channel/data

capacity No int 1000000

keep-alive No int 3(seconds)

transactionCapacity No int 10000

checkpointInterval No long 30000(milliseconds)

maxFileSize No long 2146435071(bytes)

minimumRequiredSpace No long 524288000(bytes)


To specify the location where the Flume agent should hold data, set the checkpointDir and dataDirs properties:

agent.channels.c1.checkpointDir=/flume/c1/checkpoint

agent.channels.c1.dataDirs=/flume/c1/data

Technically, these properties are not required and have sensible default values for development. However, if you have more than one file channel configured in your agent, only the first channel will start. For production deployments and development work with multiple file channels, you should use distinct directory paths for each file channel storage area and consider placing different channels on different disks to avoid IO contention. Additionally, if you are sizing a large machine, consider using some form of RAID that contains striping (RAID 10, 50, or 60) to achieve higher disk performance rather than buying more expensive 10K or 15K drives or SSDs. If you don't have RAID striping but have multiple disks, set dataDirs to a comma-separated list of each storage location. Using multiple disks will spread the disk traffic almost as well as striped RAID, but without the computational overhead associated with RAID 50/60 as well as the 50 percent space waste associated with RAID 10. You'll want to test your system to see whether the RAID overhead is worth the speed difference. As hard drive failures are a reality, you might prefer certain RAID configurations to single disks in order to protect yourself from the data loss associated with single drive failures. RAID 6 across the maximum number of disks can provide the highest performance for the minimal amount of data protection when combined with redundant and reliable power sources (such as an uninterruptable power supply or UPS).

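For instance, a sketch of a file channel spread over two separate physical disks might look like this (the /flume1 and /flume2 mount points are assumptions for illustration only):

agent.channels.c1.type=file
agent.channels.c1.checkpointDir=/flume1/c1/checkpoint
agent.channels.c1.dataDirs=/flume1/c1/data,/flume2/c1/data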
Using the JDBC channel is a bad idea as it would introduce a bottleneck and single point of failure in what should be designed as a highly distributed system. NFS storage should be avoided for the same reason.

Tip
Be sure to set the HADOOP_PREFIX and JAVA_HOME environment variables when using the file channel. While we seemingly haven't used anything Hadoop-specific (such as writing to HDFS), the file channel uses Hadoop Writables as an on-disk serialization format. If Flume can't find the Hadoop libraries, you might see this in your startup, so check your environment variables: java.lang.NoClassDefFoundError: org/apache/hadoop/io/Writable

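A sketch of exporting these before starting the agent might look like this (the paths shown are assumptions; use wherever Java and Hadoop are actually installed on your host):

export JAVA_HOME=/usr/java/default
export HADOOP_PREFIX=/usr/lib/hadoop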
Starting with Flume 1.4, the file channel supports a secondary checkpoint directory. In situations where a failure occurs while writing the checkpoint data, that information could become unusable, and a full replay of the logs is necessary in order to recover the state of the channel. As only one checkpoint is updated at a time before flipping to the other, one should always be in a consistent state, thus shortening restart times. Without valid checkpoint information, the Flume agent can't know what has been sent and what has not been sent in dataDirs. As files in the data directories might contain large amounts of data already sent but not yet deleted, a lack of checkpoint information would result in a large number of records being resent as duplicates.


Incoming data from sources is written and acknowledged at the end of the most current file in the data directory. The files in the checkpoint directory keep track of this data when it's taken by a sink.

If you want to use this feature, set the useDualCheckpoints property to true and specify a location for that second checkpoint directory with the backupCheckpointDir property. For performance reasons, it is always preferred that this be on a different disk from the other directories used by the file channel:

agent.channels.c1.useDualCheckpoints=true

agent.channels.c1.backupCheckpointDir=/flume/c1/checkpoint2

The default file channel capacity is one million events regardless of the size of the event contents. If the channel capacity is reached, a source will no longer be able to ingest the data. This default should be fine for low volume cases. You'll want to size this higher if your ingestion is so heavy that you can't tolerate normal planned or unplanned outages. For instance, there are many configuration changes you can make in Hadoop that require a cluster restart. If you have Flume writing important data into Hadoop, the file channel should be sized to tolerate the time it takes to restart Hadoop (and maybe add a comfort buffer for the unexpected). If your cluster or other systems are unreliable, you can set this higher still to handle even larger amounts of downtime. At some point, you'll run into the fact that your disk space is a finite resource, so you will have to pick some upper limit (or buy bigger disks).

The keep-alive parameter is similar to memory channels. It is the maximum time the source will wait when trying to write into a full channel before giving up. If space becomes available before the timeout, the write is successful; otherwise, a ChannelException is thrown back to the source.

The transactionCapacity property is the maximum number of events allowed in a single transaction. This might become important for certain sources that batch together events and pass them to the channel in a single call. Most likely, you won't need to change this from the default. Setting this higher allocates additional resources internally, so you shouldn't increase it unless you run into performance issues.

The checkpointInterval property is the number of milliseconds between performing a checkpoint (which also rolls the log files written to dataDirs). If you do not set this, 30 seconds will be used.

The log files also roll based on the volume of data written to them, using the maxFileSize property. You can lower this value for low traffic channels if you want to try and save some disk space. Let's say your maximum file size is 50,000 bytes but your channel only writes 500 bytes a day; it would take 100 days to fill a single log. Let's say that you were on day 100 and 2,000 bytes came in all at once. Some data would be written to the old file and a new file would be started with the overflow. After the roll, Flume tries to remove any log files that aren't needed anymore. As the full log has unprocessed records, it cannot be removed yet. The next chance to clean up that old log file might not come for another 100 days. It probably doesn't matter if that old 50,000 byte file sticks around longer, but as the default is around 2 GB, you could have twice that (4 GB) disk space used per channel. Depending on how much disk you have available and the number of channels configured in your agent, this might or might not be a problem. If your machines have plenty of storage space, the default should be fine.

Finally, the minimumRequiredSpace property is the amount of space you do not want to use for writing logs. The default configuration will throw an exception if you attempt to use the last 500 MB of the disk associated with the dataDir path. This limit applies across all channels, so if you have three file channels configured, the upper limit is still 500 MB and not 1.5 GB. You can set this value as low as 1 MB, but generally speaking, bad things tend to happen when you push disk utilization towards 100 percent.

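Pulling a few of these settings together, a sketch of a larger file channel sized for a long downstream outage might look like this (the values are illustrative assumptions and should be derived from your own ingest rate and disk budget):

agent.channels.c1.type=file
agent.channels.c1.checkpointDir=/flume/c1/checkpoint
agent.channels.c1.dataDirs=/flume/c1/data
agent.channels.c1.capacity=5000000
agent.channels.c1.transactionCapacity=10000
agent.channels.c1.checkpointInterval=30000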

Spillable Memory Channel

Introduced in Flume 1.5, the Spillable Memory Channel is a channel that acts like a memory channel until it is full. At that point, it acts like a file channel that is configured with a much larger capacity than its memory counterpart, but runs at the speed of your disks (which means orders of magnitude slower).

Note
The Spillable Memory Channel is still considered experimental. Use it at your own risk!

I have mixed feelings about this new channel type. On the surface, it seems like a good idea, but in practice, I can see problems. Specifically, having a variable channel speed that changes depending on how downstream entities in your data pipe behave makes for difficult capacity planning. As a memory channel is used under good conditions, this implies that the data contained in it can be lost. So why would I go through extra trouble to save some of it to the disk? The data is either very important for me to spool it to disk with a file-backed channel, or it's less important and can be lost, so I can get away with less hardware and use a faster memory-backed channel. If I really need memory speed but with the capacity of a hard drive, Solid State Drive (SSD) prices have come down enough in recent years for a file channel on SSD to now be a viable option for you rather than using this hybrid channel type. I do not use this channel myself for these reasons.

To use this channel configuration, set the type parameter on your named channel to spillablememory:

agent.channels.c1.type=spillablememory

This defines a Spillable Memory Channel named c1 for the agent named agent.

Here is a table of configuration parameters you can adjust from the default values:

Key Required Type Default

type Yes String spillablememory

memoryCapacity No int 10000

overflowCapacity No int 100000000

overflowTimeout No int 3(seconds)

overflowDeactivationThreshold No int 5(percent)

byteCapacityBufferPercentage No int 20(percent)

byteCapacity No long (bytes) 80% of JVM heap

checkpointDir No String ~/.flume/file-channel/checkpoint

dataDirs No String (comma-separated list) ~/.flume/file-channel/data

useDualCheckpoints No boolean false

backupCheckpointDir No String No default, but must be different from checkpointDir

transactionCapacity No int 10000

checkpointInterval No long 30000(milliseconds)

maxFileSize No long 2146435071(bytes)

minimumRequiredSpace No long 524288000(bytes)

As you can see, many of the fields match against the memory channel and file channel's properties, so there should be no surprises here. Let's start with the memory side.

The memoryCapacity property determines the maximum number of events held in the memory (this was just called capacity for the memory channel but was renamed here to avoid ambiguity). Also, the default value, if unspecified, is 10,000 records instead of 100. If you wanted to double the default capacity, the configuration might look something like this:

agent.channels.c1.memoryCapacity=20000

As mentioned previously, you will most likely need to increase the Java heap space allocated using the -Xmx and -Xms parameters. Like most Java programs, more memory usually helps the garbage collector run more efficiently, especially as an application such as Flume generates a lot of short-lived objects. Be sure to do your research and pick an appropriate JVM garbage collector based on your available hardware.

Tip
You can set memoryCapacity to zero, which effectively turns it into a file channel. Don't do this. Just use the file channel and be done with it.

The overflowCapacity property determines the maximum number of events that can be written to the disk before an error is thrown back to the source feeding it. The default value is a generous 100,000,000 events (far larger than the file channel's default of 1,000,000). Most servers nowadays have multiterabyte disks, so space should not be a problem, but do the math against your average event size to be sure you don't fill your disks by accident. If your channels are filling this much, you are probably dealing with another issue downstream, and the last thing you need is your data buffer layer filling up completely. For example, if you had a large 1 megabyte event payload, 100 million of these add up to 100 terabytes, which is probably bigger than the disk space on an average server. A 1 kilobyte payload would only take 100 gigabytes, which is probably fine. Just do the math ahead of time so you are not surprised.

Tip
You can set overflowCapacity to zero, which effectively turns it into a memory channel. Don't do this. Just use the memory channel and be done with it.

The transactionCapacity property adjusts the batch size of events written to a channel in a single transaction. If this is set too low, it will lower the throughput of the agent on high volume flows because of the transaction overhead. For a high volume channel, you will probably need to set this higher, but the only way to be sure is to test your particular workflow. See Chapter 8, Monitoring Flume, to learn how to accomplish this.

Finally, the byteCapacityBufferPercentage and byteCapacity parameters are identical in functionality and defaults to the memory channel, so I won't waste your time repeating it here.

What is important is the overflowTimeout property. This is the number of seconds after which the memory part of the channel fills before data starts getting written to the disk-backed portion of the channel. If you want writes to start occurring immediately, you can set this to zero. You might wonder why you need to wait before starting to write to the disk portion of the channel. This is where the undocumented overflowDeactivationThreshold property comes into play. This is the amount of time that space has to be available in the memory path before it can switch back from disk writing. I believe this is an attempt to prevent flapping back and forth between the two. Of course, there really are no ordering guarantees in Flume, so I don't know why you would choose to append to the disk buffer if a spot is available in faster memory. Perhaps they are trying to avoid some kind of starvation condition, although the code appears to attempt to remove events in the order of arrival even when using both memory and disk queues. Perhaps it will be explained to us should it ever come out of experimental status.

Note
The overflowDeactivationThreshold property is stated to be for internal use only, so adjust it at your own peril. If you are considering it, be sure to get familiar with the source code so that you understand the implications of altering the default.

The rest of the properties on this channel are identical in name and functionality to its file channel counterpart, so please refer to the previous section.

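As a sketch, a Spillable Memory Channel that keeps a modest memory buffer and spills to a local disk path might be configured like this (the sizes and paths are assumptions for illustration only):

agent.channels.c1.type=spillablememory
agent.channels.c1.memoryCapacity=20000
agent.channels.c1.overflowCapacity=2000000
agent.channels.c1.overflowTimeout=3
agent.channels.c1.checkpointDir=/flume/c1/checkpoint
agent.channels.c1.dataDirs=/flume/c1/data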

Summary

In this chapter, we covered the two channel types you are most likely to use in your data processing pipelines.

The memory channel offers speed at the cost of data loss in the event of failure. Alternatively, the file channel provides a more reliable transport in that it can tolerate agent failures and restarts, at a performance cost.

You will need to decide which channel is appropriate for your use cases. When trying to decide whether a memory channel is appropriate, ask yourself what the monetary cost is if you lose some data. Weigh that against the additional costs of more hardware to cover the difference in performance when deciding if you need a durable channel after all. Another consideration is whether or not the data can be resent. Not all data you might ingest into Hadoop will come from streaming application logs. If you receive "daily downloads" of data, you can get away with using a memory channel because if you encounter a problem, you can always rerun the import.

Finally, we covered the experimental Spillable Memory Channel. Personally, I think its creation is a bad idea, but like most things in computer science, everybody has an opinion on what is good or bad. I feel that the added complexity and nondeterministic performance make for difficult capacity planning, as you should always size things for the worst-case scenario. If your data is critical enough for you to overflow to the disk rather than discard the events, then you aren't going to be okay with losing even the small amount held in memory.

In the next chapter, we'll look at sinks, specifically, the HDFS sink to write events to HDFS, the ElasticSearch sink to write events to ElasticSearch, and the Morphline Solr sink to write events to Solr. We will also cover Event Serializers, which specify how Flume events are translated into output that's more suitable for the sink. Finally, we will cover sink processors and how to set up load balancing and failure paths in a tiered configuration for more robust data transport.


Chapter 4. Sinks and Sink Processors

By now, you should have a pretty good idea where the sink fits into the Flume architecture. In this chapter, we will first learn about the most-used sink with Hadoop, the HDFS sink. We will then cover two of the newer sinks that support common Near Real Time (NRT) log processing: the ElasticSearchSink and the MorphlineSolrSink. As you'd expect, the first writes data into Elasticsearch and the latter to Solr. The general architecture of Flume supports many other sinks we won't have space to cover in this book. Some come bundled with Flume and can write to HBase, IRC, and, as we saw in Chapter 2, A Quick Start Guide to Flume, a log4j and file sink. Other sinks are available on the Internet and can be used to write data to MongoDB, Cassandra, RabbitMQ, Redis, and just about any other data store you can think of. If you can't find a sink that suits your needs, you can write one easily by extending the org.apache.flume.sink.AbstractSink class.


HDFS sink

The job of the HDFS sink is to continuously open a file in HDFS, stream data into it, and at some point, close that file and start a new one. As we discussed in Chapter 1, Overview and Architecture, the time between file rotations must be balanced with how quickly files are closed in HDFS, thus making the data visible for processing. As we've discussed, having lots of tiny files for input will make your MapReduce jobs inefficient.

To use the HDFS sink, set the type parameter on your named sink to hdfs.

agent.sinks.k1.type=hdfs

This defines an HDFS sink named k1 for the agent named agent. There are some additional parameters you must specify, starting with the path in HDFS you want to write the data to:

agent.sinks.k1.hdfs.path=/path/in/hdfs

This HDFS path, like most file paths in Hadoop, can be specified in three different ways: absolute, absolute with server name, and relative. These are all equivalent (assuming your Flume agent is run as the flume user):

absolute /Users/flume/mydata

absolute with server hdfs://namenode/Users/flume/mydata

relative mydata

I prefer to configure any server I'm installing Flume on with a working hadoop command line by setting the fs.default.name property in Hadoop's core-site.xml file. I don't keep persistent data in HDFS user directories but prefer to use absolute paths with some meaningful path name (for example, /logs/apache/access). The only time I would specify a NameNode specifically is if the target was a different Hadoop cluster entirely. This allows you to move configurations you've already tested in one environment into another without unintended consequences, such as your production server writing data to your staging Hadoop cluster because somebody forgot to edit the target in the configuration. I consider externalizing environment specifics a good best practice to avoid situations such as these.

One final required parameter for the HDFS sink, actually any sink, is the channel that it will be doing take operations from. For this, set the channel parameter with the channel name to read from:

agent.sinks.k1.channel=c1

This tells the k1 sink to read events from the c1 channel.

Here is a mostly complete table of configuration parameters you can adjust from the default values:

Key Required Type Default


type Yes String hdfs

channel Yes String

hdfs.path Yes String

hdfs.filePrefix No String FlumeData

hdfs.fileSuffix No String

hdfs.minBlockReplicas No int See the dfs.replication property in your inherited Hadoop configuration; usually 3.

hdfs.maxOpenFiles No long 5000

hdfs.closeTries No int 0 (0 = try forever, otherwise a count)

hdfs.retryInterval No int 180 seconds (0 = don't retry)

hdfs.round No boolean false

hdfs.roundValue No int 1

hdfs.roundUnit No String (second, minute, or hour) second

hdfs.timeZone No String Local time

hdfs.useLocalTimeStamp No boolean False

hdfs.inUsePrefix No String Blank

hdfs.inUseSuffix No String .tmp

hdfs.rollInterval No long (seconds) 30 seconds (0 = disable)

hdfs.rollSize No long (bytes) 1024 bytes (0 = disable)

hdfs.rollCount No long 10 (0 = disable)

hdfs.batchSize No long 100

hdfs.codeC No String

Remember to always check the Flume User Guide for the version you are using at http://flume.apache.org/, as things might change between the release of this book and the version you are actually using.


Path and filename

Each time Flume starts a new file at hdfs.path in HDFS to write data into, the filename is composed of the hdfs.filePrefix, a period character, the epoch timestamp at which the file was started, and optionally, a file suffix specified by the hdfs.fileSuffix property (if set), for example:

agent.sinks.k1.hdfs.path=/logs/apache/access

The preceding command would result in a file such as /logs/apache/access/FlumeData.1362945258.

However, in the following configuration, your filenames would be more like /logs/apache/access/access.1362945258.log:

agent.sinks.k1.hdfs.path=/logs/apache/access

agent.sinks.k1.hdfs.filePrefix=access

agent.sinks.k1.hdfs.fileSuffix=.log

Over time, the hdfs.path directory will get very full, so you will want to add some kind of time element into the path to partition the files into subdirectories. Flume supports various time-based escape sequences, such as %Y to specify a four-digit year. I like to use sequences in the year/month/day/hour form (so that they are sorted oldest to newest), so I often use this for a path:

agent.sinks.k1.hdfs.path=/logs/apache/access/%Y/%m/%d/%H

This says I want a path like /logs/apache/access/2013/03/10/18/.

Note
For a complete list of time-based escape sequences, see the Flume User Guide.

Another handy escape sequence mechanism is the ability to use Flume header values in your path. For instance, if there was a header with a key of logType, I could split Apache access and error logs into different directories while using the same channel, by escaping the header's key as follows:

agent.sinks.k1.hdfs.path=/logs/apache/%{logType}/%Y/%m/%d/%H

The preceding line of code would result in access logs going to /logs/apache/access/2013/03/10/18/, and error logs going to /logs/apache/error/2013/03/10/18/. However, if I preferred both log types in the same directory path, I could have used logType in my hdfs.filePrefix instead, as follows:

agent.sinks.k1.hdfs.path=/logs/apache/%Y/%m/%d/%H

agent.sinks.k1.hdfs.filePrefix=%{logType}

Obviously, it is possible for Flume to write to multiple files at once. The hdfs.maxOpenFiles property sets the upper limit for how many can be open at once, with a default of 5000. If you should exceed this limit, the oldest file that's still open is closed. Remember that every open file incurs overhead both at the OS level and in HDFS (NameNode and DataNode connections).


Another set of properties you might find useful allows for rounding down event times at an hour, minute, or second granularity while still maintaining these elements in file paths. Let's say you had a path specification as follows:

agent.sinks.k1.hdfs.path=/logs/apache/%Y/%m/%d/%H%M

However, if you wanted only four subdirectories per hour (at 00, 15, 30, and 45 past the hour, each containing 15 minutes of data), you could accomplish this by setting the following:

agent.sinks.k1.hdfs.round=true

agent.sinks.k1.hdfs.roundValue=15

agent.sinks.k1.hdfs.roundUnit=minute

This would result in logs between 01:15:00 and 01:29:59 on March 10, 2013 being written to files contained in /logs/apache/2013/03/10/0115/. Logs from 01:30:00 to 01:44:59 would be written in files contained in /logs/apache/2013/03/10/0130/.

The hdfs.timeZone property is used to specify the time zone that you want time interpreted for your escape sequences. The default is your computer's local time. If your local time is affected by daylight savings time adjustments, you will have twice as much data when %H == 02 (in the fall) and no data when %H == 02 (in the spring). I think it is a bad idea to introduce time zones into things that are meant for computers to read. I believe time zones are a concern for humans alone and computers should only converse in universal time. For this reason, I set this property on my Flume agents to make the time zone issue just go away:

-Duser.timezone=UTC

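One way to pass this flag is through the agent's JVM options, for example in conf/flume-env.sh if you launch the agent with the standard flume-ng script (treat this as a sketch; the exact file and variable depend on how you start your agents):

export JAVA_OPTS="$JAVA_OPTS -Duser.timezone=UTC"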
If you don't agree, you are free to use the default (local time) or set hdfs.timeZone to whatever you like. The value you pass is used in a call to java.util.TimeZone.getTimeZone(…), so check the Javadocs for acceptable values to be used here.

The other time-related property is the hdfs.useLocalTimeStamp boolean property. By default, its value is false, which tells the sink to use the event's timestamp header when calculating date-based escape sequences in file paths, as shown previously. If you set the property to true, the current system time will be used instead, effectively telling Flume to use the transport arrival time rather than the original event time. You would not set this in cases where HDFS was the final target for the streamed events. This way, delayed events will still be placed correctly (where users would normally look for them) regardless of their arrival time. However, there may be a use case where events are temporarily written to Hadoop and processed in batches, on some interval (perhaps daily). In this case, the transport time would be preferred, so your postprocessing job doesn't need to scan older folders for delayed data.

Remember that files in HDFS are broken into file blocks that are replicated across the DataNodes. The default number of replicas is usually three (as set in the Hadoop base configuration). You can override this value up or down for this sink with the hdfs.minBlockReplicas property. For example, if I have a data stream that I feel only needs two replicas instead of three, I can override this as follows:

agent.sinks.k1.hdfs.minBlockReplicas=2

Note
Don't set the minimum replica count higher than the number of DataNodes you have, otherwise you'll create a degraded-state HDFS. You also don't want to set it so high that a downed box for maintenance would trigger this situation. Personally, I've never set this higher than the default of three, but I have set it lower on less important data in order to save space.

Finally, while files are being written to HDFS, a .tmp extension is added. When the file is closed, the extension is removed. You can change the extension used by setting the hdfs.inUseSuffix property, but I've never had a reason to do so:

agent.sinks.k1.hdfs.inUseSuffix=flumeiswriting

This allows you to see which files are being written to simply by looking at a directory listing in HDFS. As you typically specify a directory for input in your MapReduce job (or because you are using Hive), the temporary files will often be picked up as empty or garbled input by mistake. To avoid having your temporary files picked up before being closed, set the prefix to either a dot or an underscore character as follows:

agent.sinks.k1.hdfs.inUsePrefix=_

That said, there are occasions where files were not closed properly due to some HDFS glitch, so you might see files with the in-use prefix/suffix that haven't been used in some time. A few new properties were added in Version 1.5 to change the default behavior of closing files. The first is the hdfs.closeTries property. The default of zero actually means "try forever", so it is a little confusing. Setting it to 4 means try 4 times before giving up. You can adjust the interval between retries by setting the hdfs.retryInterval property. Setting it too low could swamp your NameNode with too many requests, so be careful if you lower this from the default of 3 minutes. Of course, if you are opening files too quickly, you might need to lower this just to keep from going over the hdfs.maxOpenFiles setting, which was covered previously. If you actually didn't want any retries, you can set hdfs.retryInterval to zero seconds (again, not to be confused with closeTries=0, which means try forever). Hopefully, in a future version, they will use the more commonly used convention of a negative number (usually, -1) when infinite is desired.

Page 95: Apache Flume: Distributed Log Collection - javaarmjavaarm.com/.../Apache.Flume-Distributed.Log.Collection.for.Hadoop... · Apache Flume: Distributed Log Collection for Hadoop Second

File rotation

By default, Flume will rotate actively written-to files every 30 seconds, 10 events, or 1024 bytes. This is done by setting the hdfs.rollInterval, hdfs.rollCount, and hdfs.rollSize properties, respectively. One or more of these can be set to zero to disable this particular rolling mechanism. For instance, if you only wanted a time-based roll of 1 minute, you would set the following:

agent.sinks.k1.hdfs.rollInterval=60

agent.sinks.k1.hdfs.rollCount=0

agent.sinks.k1.hdfs.rollSize=0

If your output contains any amount of header information, the HDFS size per file can be larger than what you expect, because the hdfs.rollSize rotation scheme only counts the event body length. Clearly, you might not want to disable all three mechanisms for rotation at the same time, or you will have one directory in HDFS overflowing with files.

Finally, a related parameter is hdfs.batchSize. This is the number of events that the sink will read per transaction from the channel. If you have a large volume of data in your channel, you might see a performance increase by setting this higher than the default of 100, which decreases the transaction overhead per event.

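For instance, a sketch of a sink tuned for fewer, larger files on a busy channel might look like this (the numbers are illustrative assumptions; test against your own volume before settling on anything):

agent.sinks.k1.hdfs.rollInterval=300
agent.sinks.k1.hdfs.rollSize=134217728
agent.sinks.k1.hdfs.rollCount=0
agent.sinks.k1.hdfs.batchSize=1000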
Now that we've discussed the way files are managed and rolled in HDFS, let's look into how the event contents get written.


Compression codecs

Codecs (coder/decoders) are used to compress and decompress data using various compression algorithms. Flume supports gzip, bzip2, lzo, and snappy, although you might have to install lzo yourself, especially if you are using a distribution such as CDH, due to licensing issues.

If you want to specify compression for your data, set the hdfs.codeC property if you want the HDFS sink to write compressed files. The property is also used as the file suffix for the files written to HDFS. For example, if you specify the following, all files that are written will have a .gzip extension, so you don't need to specify the hdfs.fileSuffix property in this case:

agent.sinks.k1.hdfs.codeC=gzip

The codec you choose to use will require some research on your part. There are arguments for using gzip or bzip2 for their higher compression ratios at the cost of longer compression times, especially if your data is written once but will be read hundreds or thousands of times. On the other hand, using snappy or lzo results in faster compression performance but results in a lower compression ratio. Keep in mind that the splittability of the file, especially if you are using plain text files, will greatly affect the performance of your MapReduce jobs. Go pick up a copy of Hadoop Beginner's Guide, Garry Turkington, Packt Publishing (http://amzn.to/14Dh6TA) or Hadoop: The Definitive Guide, Tom White, O'Reilly (http://amzn.to/16OsfIf) if you aren't sure what I'm talking about.


Event Serializers

An Event Serializer is the mechanism by which a Flume event is converted into another format for output. It is similar in function to the Layout class in log4j. By default, the text serializer, which outputs just the Flume event body, is used. There is another serializer, header_and_text, which outputs both the headers and the body. Finally, there is an avro_event serializer that can be used to create an Avro representation of the event. If you write your own, you'd use the implementation's fully qualified class name as the serializer property value.


Text output

As mentioned previously, the default serializer is the text serializer. This will output only the Flume event body, with the headers discarded. Each event has a newline character appended unless you override this default behavior by setting the serializer.appendNewLine property to false.

Key Required Type Default

Serializer No String text

serializer.appendNewLine No boolean true


Text with headers

The text_with_headers serializer allows you to save the Flume event headers rather than discard them. The output format consists of the headers, followed by a space, then the body payload, and finally, terminated by an optionally disabled newline character, for instance:

{key1=value1, key2=value2} body text here

Key Required Type Default

serializer No String text_with_headers

serializer.appendNewLine No boolean true

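As a sketch, enabling this serializer on the k1 sink used in the earlier examples (and, optionally, suppressing the trailing newline) might look like this:

agent.sinks.k1.serializer=text_with_headers
agent.sinks.k1.serializer.appendNewLine=false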

Apache Avro

The Apache Avro project (http://avro.apache.org/) provides a serialization format that is similar in functionality to Google Protocol Buffers but is more Hadoop friendly, as the container is based on Hadoop's SequenceFile and has some MapReduce integration. The format is also self-describing using JSON, making for a good long-term data storage format, as your data format might evolve over time. If your data has a lot of structure and you want to avoid turning it into Strings only to then parse them in your MapReduce job, you should read more about Avro to see whether you want to use it as a storage format in HDFS.

The avro_event serializer creates Avro data based on the Flume event schema. It has no formatting parameters, as Avro dictates the format of the data, and the structure of the Flume event dictates the schema used:

Key Required Type Default

serializer No String avro_event

serializer.compressionCodec No String (gzip, bzip2, lzo, or snappy)

serializer.syncIntervalBytes No int (bytes) 2048000 (bytes)

If you want your data compressed before being written to the Avro container, you should set the serializer.compressionCodec property to the file extension of an installed codec. The serializer.syncIntervalBytes property determines the size of the data buffer used before flushing the data to HDFS, and therefore, this setting can affect your compression ratio when using a codec. Here is an example using snappy compression on Avro data using a 4 MB buffer:

agent.sinks.k1.serializer=avro_event

agent.sinks.k1.serializer.compressionCodec=snappy

agent.sinks.k1.serializer.syncIntervalBytes=4194304

agent.sinks.k1.hdfs.fileSuffix=.avro

For Avro files to work in an Avro MapReduce job, they must end in .avro or they will be ignored as input. For this reason, you need to explicitly set the hdfs.fileSuffix property. Furthermore, you would not set the hdfs.codeC property on an Avro file.


User-provided Avro schema

If you want to use a different schema from the Flume event schema used with the avro_event type, starting in Version 1.4, the closely named AvroEventSerializer will let you do this. Keep in mind that using this implementation, only the event's body is serialized and headers are not passed on.

Set the serializer type to the fully qualified org.apache.flume.sink.hdfs.AvroEventSerializer class name:

agent.sinks.k1.serializer=org.apache.flume.sink.hdfs.AvroEventSerializer

Unlike the other serializers that take additional parameters in the Flume configuration file, this one requires that you pass the schema information via a Flume header. This is a byproduct of one of the Avro-aware sources we'll see in Chapter 6, Interceptors, ETL, and Routing, where schema information is sent from the source to the final destination via the event header. You can fake this if you are using a source that doesn't set these by using a static header interceptor. We'll talk more about interceptors in Chapter 6, Interceptors, ETL, and Routing, so flip back to this part later on.

To specify the schema directly in the Flume configuration file, use the flume.avro.schema.literal header as shown in this example (using a map of strings schema):

agent.sinks.k1.serializer=org.apache.flume.sink.hdfs.AvroEventSerializer

agent.sinks.k1.interceptors=i1

agent.sinks.k1.interceptors.i1.type=static

agent.sinks.k1.interceptors.i1.key=flume.avro.schema.literal

agent.sinks.k1.interceptors.i1.value="{\"type\":\"map\",\"values\":\"string\"}"

If you prefer to put the schema file in HDFS, use the flume.avro.schema.url header instead, as shown in this example:

agent.sinks.k1.serializer=org.apache.flume.sink.hdfs.AvroEventSerializer

agent.sinks.k1.interceptors=i1

agent.sinks.k1.interceptors.i1.type=static

agent.sinks.k1.interceptors.i1.key=flume.avro.schema.url

agent.sinks.k1.interceptors.i1.value=hdfs://path/to/schema.avsc

Actually, in this second form, you can pass any URL, including a file:// URL, but this would indicate a file local to where you are running the Flume agent, which might create additional setup work for your administrators. This is also true of configuration served up by an HTTP web server or farm. Rather than creating additional setup dependencies, just use the dependency you cannot remove, which is HDFS, using an hdfs:// URL.

Be sure to set either the flume.avro.schema.literal header or the flume.avro.schema.url header, but not both.


File type

By default, the HDFS sink writes data to HDFS as Hadoop's SequenceFile. This is a common Hadoop wrapper that consists of a key and value field separated by binary field and record delimiters. Usually, text files on a computer make assumptions like a newline character terminates each record. So, what do you do if your data contains a newline character, such as some XML? Using a sequence file can solve this problem because it uses nonprintable characters for delimiters. Sequence files are also splittable, which makes for better locality and parallelism when running MapReduce jobs on your data, especially on large files.

SequenceFile

When using a SequenceFile file type, you need to specify how you want the key and value to be written on the record in the SequenceFile. The key on each record will always be a LongWritable type and will contain the current timestamp, or if the timestamp event header is set, it will be used instead. By default, the format of the value is an org.apache.hadoop.io.BytesWritable type, which corresponds to the byte[] Flume body:

Key Required Type Default

hdfs.fileType No String SequenceFile

hdfs.writeFormat No String writable

However, if you want the payload interpreted as a String, you can override the hdfs.writeFormat property, so org.apache.hadoop.io.Text will be used as the value field:

Key Required Type Default

hdfs.fileType No String SequenceFile

hdfs.writeFormat No String text

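As a sketch, writing String-valued sequence files would look something like this (the file type is shown explicitly for clarity even though SequenceFile is already the default):

agent.sinks.k1.hdfs.fileType=SequenceFile
agent.sinks.k1.hdfs.writeFormat=text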
DataStream

If you do not want to output a SequenceFile file because your data doesn't have a natural key, you can use a DataStream to output only the uncompressed value. Simply override the hdfs.fileType property:

agent.sinks.k1.hdfs.fileType=DataStream

This is the file type you would use with Avro serialization, as any compression should have been done in the Event Serializer. To serialize gzip-compressed Avro files, you would set these properties:

agent.sinks.k1.serializer=avro_event

agent.sinks.k1.serializer.compressionCodec=gzip


agent.sinks.k1.hdfs.fileType=DataStream

agent.sinks.k1.hdfs.fileSuffix=.avro

CompressedStream

A CompressedStream is similar to a DataStream, except that the data is compressed when it's written. You can think of this as running the gzip utility on an uncompressed file, but all in one step. This differs from a compressed Avro file, whose contents are compressed and then written into an uncompressed Avro wrapper:

agent.sinks.k1.hdfs.fileType=CompressedStream

Remember that only certain compressed formats are splittable in MapReduce, should you decide to use CompressedStream. The compression algorithm selection doesn't have a Flume configuration but is dictated by the zlib.compress.strategy and zlib.compress.level properties in core Hadoop instead.


Timeouts and workers

Finally, there are two miscellaneous properties related to timeouts and two for worker pools that you can change:

Key Required Type Default

hdfs.callTimeout No long (milliseconds) 10000

hdfs.idleTimeout No int (seconds) 0 (0 = disable)

hdfs.threadsPoolSize No int 10

hdfs.rollTimerPoolSize No int 1

The hdfs.callTimeout property is the amount of time the HDFS sink will wait for HDFS operations to return a success (or failure) before giving up. If your Hadoop cluster is particularly slow (for instance, a development or virtual cluster), you might need to set this value higher in order to avoid errors. Keep in mind that your channel will overflow if you cannot sustain higher write throughput than the input rate of your channel.

The hdfs.idleTimeout property, if set to a nonzero value, is the time Flume will wait to automatically close an idle file. I have never used this, as hdfs.rollInterval handles the closing of files for each roll period, and if the channel is idle, it will not open a new file. This setting seems to have been created as an alternative roll mechanism to the size, time, and event count mechanisms that have already been discussed. You might want as much data written to a file as possible and only close it when there really is no more data. In this case, you can use hdfs.idleTimeout to accomplish this rotation scheme if you also set hdfs.rollInterval, hdfs.rollSize, and hdfs.rollCount to zero.

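A sketch of that idle-only rotation scheme might look like this (the 60-second idle timeout is an illustrative assumption):

agent.sinks.k1.hdfs.rollInterval=0
agent.sinks.k1.hdfs.rollSize=0
agent.sinks.k1.hdfs.rollCount=0
agent.sinks.k1.hdfs.idleTimeout=60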
The first property you can set to adjust the number of workers is hdfs.threadsPoolSize, and it defaults to 10. This is the maximum number of files that can be written to at the same time. If you are using event headers to determine file paths and names, you might have more than 10 files open at once, but be careful when increasing this value too much so as not to overwhelm HDFS.

The last property related to worker pools is hdfs.rollTimerPoolSize. This is the number of workers processing timeouts set by the hdfs.idleTimeout property. The amount of work to close the files is pretty small, so increasing this value from the default of one worker is unlikely to be necessary. If you do not use a rotation based on hdfs.idleTimeout, you can ignore the hdfs.rollTimerPoolSize property, as it is not used.


Sink groups

In order to remove single points of failure in your data processing pipeline, Flume has the ability to send events to different sinks using either load balancing or failover. In order to do this, we need to introduce a new concept called a sink group. A sink group is used to create a logical grouping of sinks. The behavior of this grouping is dictated by something called the sink processor, which determines how events are routed.

There is a default sink processor that contains a single sink, which is used whenever you have a sink that isn't part of any sink group. Our Hello, World! example in Chapter 2, A Quick Start Guide to Flume, used the default sink processor. No special configuration is required for single sinks.

In order for Flume to know about the sink groups, there is a new top-level agent property called sinkgroups. Similar to sources, channels, and sinks, you prefix the property with the agent name:

agent.sinkgroups=sg1

Here, we have defined a sink group called sg1 for the agent named agent.

For each named sink group, you need to specify the sinks it contains using the sinks property, consisting of a space-delimited list of sink names:

agent.sinkgroups.sg1.sinks=k1 k2

This defines that the k1 and k2 sinks are part of the sg1 sink group for the agent named agent.

Often, sink groups are used in conjunction with the tiered movement of data to route around failures. However, they can also be used to write to different Hadoop clusters, as even a well-maintained cluster has periodic maintenance.


Load balancing

Continuing the preceding example, let's say you want to load balance traffic to k1 and k2 evenly. There are some additional properties you need to specify, as listed in this table:

Key Type Default

processor.type String load_balance

processor.selector String (round_robin, random) round_robin

processor.backoff boolean false

When you set processor.type to load_balance, round robin selection will be used, unless otherwise specified by the processor.selector property. This can be set to either round_robin or random. You can also specify your own load balancing selector mechanism, which we won't cover here. Consult the Flume documentation if you need this custom control.

The processor.backoff property specifies whether an exponential backoff should be used when retrying a sink that threw an exception. The default is false, which means that after a thrown exception, the sink will be tried again the next time its turn is up based on round robin or random selection. If set to true, then the wait time for each failure is doubled, starting at 1 second up to a limit of around 18 hours (2^16 seconds).

Note
In an earlier version of Flume, the default in the code for processor.backoff was stated as false, but the documentation stated it as true. This error has been fixed; however, it may save you a headache to specify what you want for property settings rather than relying on the defaults.

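Putting the pieces together, a sketch of a randomly selected, backoff-enabled load balancing group over the k1 and k2 sinks might look like this:

agent.sinkgroups=sg1
agent.sinkgroups.sg1.sinks=k1 k2
agent.sinkgroups.sg1.processor.type=load_balance
agent.sinkgroups.sg1.processor.selector=random
agent.sinkgroups.sg1.processor.backoff=true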

Failover

If you would rather try one sink and, if that one fails, try another, then you want to set processor.type to failover. Next, you'll need to set additional properties to specify the order by setting the processor.priority property, followed by the sink name:

Key Type Default

processor.type String failover

processor.priority.NAME int

processor.maxpenalty int (milliseconds) 30000

Let's look at the following example:

agent.sinkgroups.sg1.sinks=k1 k2 k3

agent.sinkgroups.sg1.processor.type=failover

agent.sinkgroups.sg1.processor.priority.k1=10

agent.sinkgroups.sg1.processor.priority.k2=20

agent.sinkgroups.sg1.processor.priority.k3=20

Lower priority numbers come first, and in the case of a tie, order is arbitrary. You can use any numbering system that makes sense to you (by ones, fives, tens, whatever). In this example, the k1 sink will be tried first, and if an exception is thrown, either k2 or k3 will be tried next. If k3 was selected first for trial and it failed, k2 will still be tried. If all sinks in the sink group fail, the transaction with the channel is rolled back.

Finally, processor.maxpenalty sets an upper limit to an exponential backoff for failed sinks in the group. After the first failure, it will be 1 second before it can be used again. Each subsequent failure doubles the wait time until processor.maxpenalty is reached.


MorphlineSolrSink

HDFS is not the only useful place to send your logs and data. Solr is a popular real-time search platform used to index large amounts of data, so full text searching can be performed almost instantaneously. Hadoop's horizontal scalability creates an interesting problem for Solr, as there is now more data than a single instance can handle. For this reason, a horizontally scalable version of Solr was created, called SolrCloud. Cloudera's Search product is also based on SolrCloud, so it should be no surprise that Flume developers created a new sink specifically to write streaming data into Solr.

Like most streaming data flows, you not only transport the data, but you also often reformat it into a form more consumable to the target of the flow. Typically, this is done in a Flume-only workflow by applying one or more interceptors just prior to the sink writing the data to the target system. This sink uses the Morphline engine to transform the data, instead of interceptors.

Internally, each Flume event is converted into a Morphline record and passed to the first command in the Morphline command chain. A record can be thought of as a set of key/value pairs with string keys and arbitrary object values. Each of the Flume headers is passed as a record field with the same header keys. A special record field key, _attachment_body, is used for the Flume event body. Keep in mind that the body is still a byte array (Java byte[]) at this point and must be specifically processed in the Morphline command chain.

Each command processes the record in turn, passing the output to the input of the next command in line, with the final command responsible for terminating the flow. In many ways, it is similar in functionality to Flume's Interceptor functionality, which we'll see in Chapter 6, Interceptors, ETL, and Routing. In the case of writing to Solr, we use the loadSolr command to convert the Morphline record into a SolrDocument and write to the Solr cluster. Here is what this simplified flow might look like in a picture form:


Morphline configuration files

Morphline configuration files use the HOCON format, which is similar to JSON but has a less strict syntax, making them less error-prone when used for configuration files over JSON.

Note
HOCON is an acronym for Human Optimized Configuration Object Notation. You can read more about HOCON on this GitHub page: https://github.com/typesafehub/config/blob/master/HOCON.md

The configuration file contains a single key with the morphlines value. The value is an array of Morphline configurations. Each individual entry is comprised of three keys:

id

importCommands

commands

If your configuration contains multiple Morphlines, the value of id must be provided to the Flume sink by way of the morphlineId property. The value of importCommands specifies the Java classes to import when the Morphline is evaluated. The double star indicates that all paths and classes from that point in the package hierarchy should be included. All classes that implement com.cloudera.cdk.morphline.api.CommandBuilder are interrogated for their names via the getNames() method. These names are the command names you use in the next section. Don't worry; you don't need to sift through the source code to find them, as they have a well-documented reference guide online. Finally, the commands key references a list of command dictionaries. Each command dictionary has a single key consisting of the name of the Morphline command followed by its specific properties.

Note
For a list of Morphline commands and associated configuration properties, see the reference guide at http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html

Here is what a skeleton configuration file might look like:

morphlines:[

{

id:transform_my_data

importCommands:[

"com.cloudera.**",

"org.apache.solr.**"

]

commands:[

{

COMMAND_NAME1:{

property1:value1

property2:value2

}


}

{COMMAND_NAME2:{

property1:value1

}

}

]

}

]


Typical SolrSink configuration

Here is the preceding skeleton configuration applied to our Solr use case. This is not meant to be complete, but it is sufficient to discuss the flow in the preceding diagram:

morphlines: [
  {
    id: solr_flow
    importCommands: [
      "com.cloudera.**",
      "org.apache.solr.**"
    ]
    commands: [
      {
        readLine: {
          charset: UTF-8
        }
      }
      {
        grok: {
          GROK_PROPERTIES_HERE
        }
      }
      {
        loadSolr: {
          solrLocator: {
            collection: my_collection
            zkHost: "solr.example.com:2181/solr"
          }
        }
      }
    ]
  }
]

You can see the same boilerplate configuration where we define a single Morphline with the solr_flow identifier. The command sequence starts with the readLine command. This simply reads the event body from the _attachment_body field and converts the byte[] to a String using the configured encoding (in this case, UTF-8). The resulting String value is set to the field with the key message. The next command in the sequence, the grok command, uses regular expressions to extract additional fields to make a more interesting SolrDocument. I couldn't possibly do this command justice by trying to explain everything you can do with it. For that, please see the Kite SDK documentation.

Note: See the reference guide for a complete list of Morphline commands, their properties, and usage information at http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html

Suffice to say, grok lets me take a web server log line such as this:

10.4.240.176 - - [14/Mar/2014:12:02:17 -0500] "POST http://mysite.com/do_stuff.php HTTP/1.1" 500 834


Then, it lets me turn it into more structured data like this:

{
  ip: 10.4.240.176
  timestamp: 1413306137
  method: POST
  url: http://mysite.com/do_stuff.php
  protocol: HTTP/1.1
  status_code: 500
  length: 834
}

If you wanted to search for all the times this page threw a 500 status code, having these fields broken out makes the task easy for Solr.

Finally, we call the loadSolr command to insert the record into our Solr cluster. The solrLocator property indicates the target Solr cluster (by way of its ZooKeeper server(s)) and the data collection to write these documents into.


Sink configuration
Now that you have a basic idea of how to create a Morphline configuration file, let's apply this to the actual sink configuration.

The following table details the sink's parameters and default values:

Key                  Required  Type    Default
type                 Yes       String  org.apache.flume.sink.solr.morphline.MorphlineSolrSink
channel              Yes       String
morphlineFile        Yes       String
morphlineId          No        String  Required if the Morphline configuration file contains more than one Morphline
batchSize            No        int     1000
batchDurationMillis  No        long    1000 (milliseconds)
handlerClass         No        String  org.apache.flume.sink.solr.morphline.MorphlineHandlerImpl

The MorphlineSolrSink does not have a short type alias, so set the type parameter on your named sink to org.apache.flume.sink.solr.morphline.MorphlineSolrSink:

agent.sinks.k1.type=org.apache.flume.sink.solr.morphline.MorphlineSolrSink

This defines a MorphlineSolrSink named k1 for the agent named agent.

The next required parameter is the channel property. This specifies which channel to read events from for processing:

agent.sinks.k1.channel=c1

This tells the k1 sink to read events from the c1 channel.

The only other required parameter is the relative or absolute path to the Morphline configuration file. This cannot be a path in HDFS; it must be accessible on the server the Flume agent is running on (local disk, NFS disk, and so on).

To specify the configuration file path, set the morphlineFile property:

agent.sinks.k1.morphlineFile=/path/on/local/system/morphline.conf

As a Morphline configuration file can contain multiple Morphlines, you must specify the identifier if more than one exists, using the morphlineId property:

agent.sinks.k1.morphlineId=transform_my_data

The next two properties are fairly common among sinks. They specify how many events to remove at a time for processing, also known as a batch. The batchSize property defaults to 1000 events, but you might need to set this higher if you aren't consuming events from the channel faster than they are being inserted. Clearly, you can only increase this so much, as the thing you are writing to (in this case, Solr) will have some record consumption limit. Only through testing will you be able to stress your systems to see where the limits are.

The related batchDurationMillis property specifies the maximum time to wait before the sink proceeds with processing when fewer than batchSize events have been read. The default value is 1 second and is specified in milliseconds in the configuration properties. In a situation with a light data flow (using the defaults, less than 1000 records per second), setting batchDurationMillis higher can make things worse. For instance, if you are using a memory channel with this sink, your Flume agent could be sitting there holding data it could already write to the sink's target while waiting for more to arrive; if a crash happens in the meantime, that data is lost. That said, your downstream entity might perform better on larger batches, which might push both these configuration values higher, so there is no universally correct answer. Start with the defaults if you are unsure, and use hard data that you'll collect using the techniques in Chapter 8, Monitoring Flume, to adjust based on facts and not guesses.

Finally, you should never need to touch the handlerClass property unless you plan to write an alternate implementation of the Morphline processing class. As there is only one Morphline engine implementation to date, I'm not really sure why this is a documented property in Flume. I'm just mentioning it for completeness.
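To tie the preceding properties together, here is a sketch of what a complete MorphlineSolrSink definition might look like, reusing the file path and Morphline identifier from the earlier examples. Adjust these to your own environment; the batch values shown are simply the defaults:

agent.sinks.k1.type=org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.k1.channel=c1
agent.sinks.k1.morphlineFile=/path/on/local/system/morphline.conf
agent.sinks.k1.morphlineId=solr_flow
agent.sinks.k1.batchSize=1000
agent.sinks.k1.batchDurationMillis=1000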


ElasticSearchSink
Another common target to stream data into for Near Real Time (NRT) searching is Elasticsearch. Elasticsearch is also a clustered searching platform based on Lucene, like Solr. It is often used along with the Logstash project (to create structured logs) and the Kibana project (a web UI for searches). This trio is often referred to by the acronym ELK (Elasticsearch/Logstash/Kibana).

Note: Here are the project home pages for the ELK stack; they can give you a much better overview than I can in a few short pages:

Elasticsearch: http://elasticsearch.org/
Logstash: http://logstash.net/
Kibana: http://www.elasticsearch.org/overview/kibana/

In Elasticsearch, data is grouped into indices. You can think of these as being equivalent to databases in a single MySQL installation. The indices are composed of types (similar to tables in databases), which are made up of documents. A document is like a single row in a database, so each Flume event will become a single document in Elasticsearch. Documents have one or more fields (just like columns in a database).

This is by no means a complete introduction to Elasticsearch, but it should be enough to get you started, assuming you already have an Elasticsearch cluster at your disposal. As events get mapped to documents by the sink's serializer, the actual sink configuration needs only a few configuration items: where the cluster is located, which index to write to, and what type the record is.

This table summarizes the settings for ElasticSearchSink:

Key          Required  Type    Default
type         Yes       String  org.apache.flume.sink.elasticsearch.ElasticSearchSink
hostNames    Yes       String  A comma-separated list of Elasticsearch nodes to connect to. If a port is specified, use a colon after the name. The default port is 9300.
clusterName  No        String  elasticsearch
indexName    No        String  flume
indexType    No        String  log
ttl          No        String  Defaults to never expire. Specify the number and unit (5m = 5 minutes).
batchSize    No        int     100

With this information in mind, let's start by setting the sink's type property:

agent.sinks.k1.type=org.apache.flume.sink.elasticsearch.ElasticSearchSink


Next, we need to set the list of servers and ports to establish connectivity, using the hostNames property. This is a comma-separated list of hostname:port pairs. If you are using the default port of 9300, you can just specify the server name or IP, for example:

agent.sinks.k1.hostNames=es1.example.com,es2.example.com:12345

Now that we can communicate with the Elasticsearch servers, we need to tell them which cluster, index, and type to write our documents to. The cluster is specified using the clusterName property. This corresponds with the cluster.name property in Elasticsearch's elasticsearch.yml configuration file. It needs to be specified, as an Elasticsearch node can participate in more than one cluster. Here is how I would specify a nondefault cluster name called production:

agent.sinks.k1.clusterName=production

The indexName property is really a prefix used to create a daily index. This keeps any single index from becoming too large over time. If you use the default index name, the index for October 30, 2014 will be named flume-2014-10-30.

Lastly, the indexType property specifies the Elasticsearch type. If unspecified, the default value of log will be used.

By default, data written into Elasticsearch will never expire. If you want the data to automatically expire, you can specify a time-to-live value on the records with the ttl property. Values are either a plain number, interpreted as milliseconds, or a number with units. The units are given in this table:

Unit string    Definition    Example
ms             Milliseconds  5ms = 5 milliseconds
not specified  Milliseconds  10000 = 10 seconds
m              Minutes       10m = 10 minutes
h              Hours         1h = 1 hour
d              Days          7d = 7 days
w              Weeks         4w = 4 weeks

Keep in mind that you also need to enable the TTL feature on the Elasticsearch cluster, as it is disabled by default. See the Elasticsearch documentation for how to do this.

Finally, like the HDFS sink, the batchSize property is the number of events per transaction that the sink will read from the channel. If you have a large volume of data in your channel, you should see a performance increase by setting this higher than the default of 100, due to the reduced overhead per transaction.
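Before moving on to the serializers, here is a sketch of a complete ElasticSearchSink definition combining the properties above. The server names, cluster name, ttl, and batch size are illustrative placeholders rather than recommendations:

agent.sinks.k1.type=org.apache.flume.sink.elasticsearch.ElasticSearchSink
agent.sinks.k1.channel=c1
agent.sinks.k1.hostNames=es1.example.com,es2.example.com:12345
agent.sinks.k1.clusterName=production
agent.sinks.k1.indexName=flume
agent.sinks.k1.indexType=log
agent.sinks.k1.ttl=7d
agent.sinks.k1.batchSize=500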

Thesink’sserializerdoestheworkoftransformingtheFlumeeventtotheElasticsearchdocument.TherearetwoElasticsearchserializersthatcomepackagedwithFlume,neither

Page 122: Apache Flume: Distributed Log Collection - javaarmjavaarm.com/.../Apache.Flume-Distributed.Log.Collection.for.Hadoop... · Apache Flume: Distributed Log Collection for Hadoop Second

hasadditionalconfigurationpropertiessincetheymostlyuseexistingheaderstodictatefieldmappings.

We’llseemoreofthissinkinactioninChapter7,PuttingItAllTogether.


LogStash Serializer
The default serializer, if not specified, is ElasticSearchLogStashEventSerializer:

agent.sinks.k1.serializer=org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer

It writes data in the same format that Logstash uses in conjunction with Kibana. Here is a table of the commonly used fields and their associated mappings from Flume events:

Elasticsearch field  Taken from the Flume header  Notes
@timestamp           timestamp                    From the header, if present
@source              source                       From the header, if present
@source_host         source_host                  From the header, if present
@source_path         source_path                  From the header, if present
@type                type                         From the header, if present
@host                host                         From the header, if present
@fields              all headers                  A dictionary of all Flume headers, including the ones that might have been mapped to the other fields, such as host
@message             Flume body

While you might think the document's @type field will be automatically set to the sink's indexType configuration property, you'd be incorrect. If you had only one type of log, it would be wasteful to write this over and over again for every document. However, if you have more than one log type going through your Flume channel, you can designate its type in Elasticsearch using the Static interceptor, which we'll see in Chapter 6, Interceptors, ETL, and Routing, to set the type (or @type) Flume header on the event.
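For instance, a hypothetical agent with two sources, one carrying Apache access logs and one carrying application logs, might tag each flow with its own type header like this. The interceptor mechanics are covered in Chapter 6, Interceptors, ETL, and Routing; the source names and values here are made up for illustration:

agent.sources.s1.interceptors=t1
agent.sources.s1.interceptors.t1.type=static
agent.sources.s1.interceptors.t1.key=type
agent.sources.s1.interceptors.t1.value=apache_access
agent.sources.s2.interceptors=t1
agent.sources.s2.interceptors.t1.type=static
agent.sources.s2.interceptors.t1.key=type
agent.sources.s2.interceptors.t1.value=app_log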


Dynamic Serializer
Another serializer is the ElasticSearchDynamicSerializer serializer. If you use this serializer, the event's body is written to a field called body. All other Flume header keys are used as field names. Clearly, you want to avoid having a Flume header key called body, as this will conflict with the actual event's body when it is transformed into the Elasticsearch document. To use this serializer, specify the fully qualified class name, as shown in this example:

agent.sinks.k1.serializer=org.apache.flume.sink.elasticsearch.ElasticSearchDynamicSerializer

For completeness, here is a table that shows you the breakdown of how the Flume headers and body get mapped to Elasticsearch fields:

Flume entity  Elasticsearch field
All headers   Same as the Flume header keys
Body          body

As the version of Elasticsearch can be different for each user, Flume doesn't package the Elasticsearch client and corresponding Lucene libraries. Find out from your administrator which versions should be included on the Flume classpath, or check out the Maven pom.xml file on GitHub for the corresponding version tag or branch at https://github.com/elasticsearch/elasticsearch/blob/master/pom.xml. Make sure the library versions used by Flume match those of Elasticsearch, or you might see serialization errors.

Note: As Solr and Elasticsearch have similar capabilities, check out Kelvin Tan's appropriately named side-by-side detailed feature breakdown web page. It should help get you started with what is most appropriate for your specific use case: http://solr-vs-elasticsearch.com/


Summary
In this chapter, we covered the HDFS sink in depth, which writes streaming data into HDFS. We covered how Flume can separate data into different HDFS paths based on time or the contents of Flume headers. Several file-rolling techniques were also discussed, including time rotation, event count rotation, size rotation, and rotation on idle only.

Compression was discussed as a means to reduce storage requirements in HDFS, and should be used when possible. Besides storage savings, it is often faster to read a compressed file and decompress it in memory than it is to read an uncompressed file. This will result in performance improvements in MapReduce jobs run on this data. The splittability of compressed data was also covered as a factor in deciding when and which compression algorithm to use.

Event Serializers were introduced as the mechanism by which Flume events are converted into an external storage format, including text (body only), text and headers (headers and body), and Avro serialization (with optional compression).

Next, various file formats, including SequenceFiles (Hadoop key/value files), DataStreams (uncompressed data files, like Avro containers), and CompressedDataStreams, were discussed.

Next, we covered sink groups as a means to route events to different sinks using load balancing or failover paths, which can be used to eliminate single points of failure in routing data to its destination.

Finally, we covered two new sinks added in Flume 1.4 to write data to Apache Solr and Elasticsearch in a Near Real Time (NRT) way. For years, MapReduce jobs have served us well, and will continue to do so, but sometimes they still aren't fast enough to search large datasets quickly and look at things from different angles without reprocessing data. Kite SDK Morphlines were also introduced as a way to prepare data for writing to Solr. We will revisit Morphlines again in Chapter 6, Interceptors, ETL, and Routing, when we look at a Morphline-powered interceptor.

In the next chapter, we will discuss various input mechanisms (sources) that will feed your configured channels, which were covered back in Chapter 3, Channels.


Chapter 5. Sources and Channel Selectors
Now that we have covered channels and sinks, we will cover some of the more common ways to get data into your Flume agents. As discussed in Chapter 1, Overview and Architecture, the source is the input point for the Flume agent. There are many sources available with the Flume distribution, as well as many open source options. Like most open source software, if you can't find what you need, you can always write your own by extending the org.apache.flume.source.AbstractSource class. Since the primary focus of this book is ingesting files of logs into Hadoop, we'll cover a few of the more appropriate sources to accomplish this.


The problem with using tail
If you have used any of the Flume 0.9 releases, you'll notice that the TailSource is no longer a part of Flume. TailSource provided a mechanism to "tail" (http://en.wikipedia.org/wiki/Tail_(Unix)) any file on the system and create Flume events for each line of the file. It could also handle file rotations, so many used the filesystem as a handoff point between the application creating the data (for instance, log4j) and the mechanism responsible for moving those files someplace else (for instance, syslog).

As is the case with both channels and sinks, events are added and removed from a channel as part of a transaction. When you are tailing a file, there is no way to participate properly in a transaction. If a failure to write successfully to a channel occurred, or if the channel was simply full (a more likely event than failure), the data couldn't be "put back" as rollback semantics dictate.

Furthermore, if the rate of data written to a file exceeds the rate at which Flume can read the data, it is possible to lose one or more log files of input outright. For example, say you were tailing /var/log/app.log. When that file reaches a certain size, it is rotated, or renamed, to /var/log/app.log.1, and a new file called /var/log/app.log is created. Let's say you had a favorable review in the press and your application logs are much higher than usual. Flume may still be reading from the rotated file (/var/log/app.log.1) when another rotation occurs, moving /var/log/app.log to /var/log/app.log.1. The file Flume is reading is now renamed to /var/log/app.log.2. When Flume finishes with this file, it will move to what it thinks is the next file (/var/log/app.log), thus skipping the file that now resides at /var/log/app.log.1. This kind of data loss would go completely unnoticed and is something we want to avoid if possible.


For these reasons, it was decided to remove the tail functionality from Flume when it was refactored. There are some workarounds for the lack of TailSource, but it should be noted that no workaround can eliminate the possibility of data loss under load that occurs under these conditions.


The Exec source
The Exec source provides a mechanism to run a command outside Flume and then turn the output into Flume events. To use the Exec source, set the type property to exec:

agent.sources.s1.type=exec

All sources in Flume are required to specify the list of channels to write events to using the channels (plural) property. This is a space-separated list of one or more channel names:

agent.sources.s1.channels=c1

The only other required parameter is the command property, which tells Flume what command to pass to the operating system. Here is an example of the use of this property:

agent.sources=s1

agent.sources.s1.channels=c1

agent.sources.s1.type=exec

agent.sources.s1.command=tail -F /var/log/app.log

Here, I have configured a single source s1 for an agent named agent. The source, an Exec source, will tail the /var/log/app.log file and follow any rotations that outside applications may perform on that log file. All events are written to the c1 channel. This is an example of one of the workarounds for the lack of TailSource in Flume 1.x. It is not my preferred workaround to replace tail, but just a simple example of the exec source type. I will show my preferred method in Chapter 7, Putting It All Together.

Note: Should you use the tail -F command in conjunction with the Exec source, it is probable that the forked process will not shut down 100 percent of the time when the Flume agent shuts down or restarts. This will leave orphaned tail processes that will never exit. The tail -F command, by definition, has no end. Even if you delete the file being tailed (at least in Linux), the running tail process will keep the file handle open indefinitely. This keeps the file's space from actually being reclaimed until the tail process exits, which won't happen. I think you are beginning to see why Flume developers don't like tailing files.

If you go this route, be sure to periodically scan the process tables for tail -F processes whose parent PID is 1. These are effectively dead processes and need to be killed manually.

Here is a list of other properties you can use with the Exec source:

Key              Required  Type                 Default
type             Yes       String               exec
channels         Yes       String               Space-separated list of channels
command          Yes       String
shell            No        String               Shell command
restart          No        boolean              false
restartThrottle  No        long (milliseconds)  10000 (milliseconds)
logStdErr        No        boolean              false
batchSize        No        int                  20
batchTimeout     No        long                 3000 (milliseconds)

Not every command keeps running, either because it fails (for example, when the channel it is writing to is full) or because it is designed to exit immediately. In this example, we want to record the system load via the Linux uptime command, which prints some system information to stdout and exits:

agent.sources.s1.command=uptime

This command will immediately exit, so you can use the restart and restartThrottle properties to run it periodically:

agent.sources.s1.command=uptime

agent.sources.s1.restart=true

agent.sources.s1.restartThrottle=60000

This will produce one event per minute. In the tail example, should the channel fill up and cause the Exec source to fail, you can use these properties to restart the Exec source. In this case, setting the restart property will start tailing from the beginning of the current file, thus producing duplicates. Depending on how long the restartThrottle property is, you may have missed some data due to a file rotation outside Flume. Furthermore, the channel may still be unable to accept data, in which case the source will fail again. Setting this value too low means giving the channel less time to drain, and unlike some of the sinks we saw, there is no option for exponential backoff.

If you need to use shell-specific features such as wildcard expansion, you can set the shell property, as in this example:

agent.sources.s1.command=grep -i apache lib/*.jar | wc -l

agent.sources.s1.shell=/bin/bash -c

agent.sources.s1.restart=true

agent.sources.s1.restartThrottle=60000

This example will find the number of times the case-insensitive string apache is found in all the JAR files in the lib directory. Once per minute, that count will be sent as a Flume event payload.

While the command output written to stdout becomes the Flume event body, errors are sometimes written to stderr. If you want these lines included in the Flume agent's system logs, set the logStdErr property to true. Otherwise, they will be silently ignored, which is the default behavior.


Finally, you can specify the number of events to write per transaction by changing the batchSize property. You may need to set this value higher than the default of 20 if your input data is large and you realize that you cannot write to your channel fast enough. Using a higher batch size reduces the overall average transaction overhead per event. Testing with different values and monitoring the channel's put rate is the only way to know this for sure. The related batchTimeout property sets the maximum time to wait, when fewer than the batch size's number of records have been seen, before flushing a partial batch to the channel. The default setting for this is 3 seconds (specified in milliseconds).
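As a sketch of how these tuning knobs combine with the earlier tail example (the values shown are illustrative rather than recommendations), the full source definition might read:

agent.sources=s1
agent.sources.s1.type=exec
agent.sources.s1.channels=c1
agent.sources.s1.command=tail -F /var/log/app.log
agent.sources.s1.logStdErr=true
agent.sources.s1.batchSize=100
agent.sources.s1.batchTimeout=3000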


Spooling Directory Source
In an effort to avoid all the assumptions inherent in tailing a file, a new source was devised to keep track of which files have been converted into Flume events and which still need to be processed. The Spooling Directory Source is given a directory to watch for new files appearing. It is assumed that files copied to this directory are complete. Otherwise, the source might try and send a partial file. It also assumes that filenames never change. Otherwise, on restarting, the source would forget which files have been sent and which have not. The filename condition can be met in log4j by using DailyRollingFileAppender rather than RollingFileAppender. However, the currently open file would need to be written to one directory and copied to the spool directory after being closed. None of the log4j appenders shipped have this capability.

That said, if you are using the Linux logrotate program in your environment, this might be of interest. You can move completed files to a separate directory using a postrotate script. The final flow might look something like this:

To create a Spooling Directory Source, set the type property to spooldir. You must specify the directory to watch by setting the spoolDir property:

agent.sources=s1

agent.sources.s1.channels=c1

agent.sources.s1.type=spooldir

agent.sources.s1.spoolDir=/path/to/files

Here is a summary of the properties for the Spooling Directory Source:

Key                  Required  Type                                  Default
type                 Yes       String                                spooldir
channels             Yes       String                                Space-separated list of channels
spoolDir             Yes       String                                Path to the directory to spool
fileSuffix           No        String                                .COMPLETED
deletePolicy         No        String (never or immediate)           never
fileHeader           No        boolean                               false
fileHeaderKey        No        String                                file
basenameHeader       No        boolean                               false
basenameHeaderKey    No        String                                basename
ignorePattern        No        String                                ^$
trackerDir           No        String                                ${spoolDir}/.flumespool
consumeOrder         No        String (oldest, youngest, or random)  oldest
batchSize            No        int                                   10
bufferMaxLines       No        int                                   100
maxBufferLineLength  No        int                                   5000
maxBackoff           No        int                                   4000 (milliseconds)

When a file has been completely transmitted, it will be renamed with a .COMPLETED extension, unless overridden by setting the fileSuffix property, like this:

agent.sources.s1.fileSuffix=.DONE

Starting with Flume 1.4, a new property, deletePolicy, was created to remove completed files from the filesystem rather than just marking them as done. In a production environment, this is critical because your spool disk will fill up over time. Currently, you can only set this to immediate deletion or leave the files forever. If you want delayed deletion, you'll need to implement your own periodic (cron) job, perhaps using the find command to find files in the spool directory with the .COMPLETED file suffix and a modification time older than some value, for example:

find /path/to/spool/dir -type f -name "*.COMPLETED" -mtime +7 -exec rm {} \;

This will find all files completed more than seven days ago and delete them.

If you want the absolute file path attached to each event, set the fileHeader property to true. This will create a header with the file key, unless it is set to something else using the fileHeaderKey property. For example, the following would add a {sourceFile=/path/to/files/foo.1234.log} header if the event was read from the /path/to/files/foo.1234.log file:


agent.sources.s1.fileHeader=true

agent.sources.s1.fileHeaderKey=sourceFile

The related property, basenameHeader, if set to true, will add a header with the basename key, which contains just the filename. The basenameHeaderKey property allows you to change the key's value, as shown here:

agent.sources.s1.basenameHeader=true

agent.sources.s1.basenameHeaderKey=justTheName

This configuration would add the {justTheName=foo.1234.log} header if the event was read from the same file located at /path/to/files/foo.1234.log.

If there are certain file patterns that you do not want this source to read as input, you can pass a regular expression using the ignorePattern property. Personally, I won't copy any files I don't want transferred to Flume into the spoolDir in the first place. If this situation cannot be avoided, use the ignorePattern property to pass a regular expression that matches filenames that should not be transferred as data. Furthermore, subdirectories and files that start with a "." (period) character are ignored, so you can avoid costly regular expression processing by using this convention instead.

While Flume is sending data, it keeps track of how far it has gotten in each file by keeping a metadata file in the directory specified by the trackerDir property. By default, this file will be .flumespool under spoolDir. Should you want a location other than inside spoolDir, you can specify an absolute file path.

Files in the directory are processed in an "oldest first" manner, as calculated by looking at the modification times of the files. You can change this behavior by setting the consumeOrder property. If you set this property to youngest, the newest files will be processed first. This may be desired if the data is time sensitive. If you'd rather give equal precedence to all files, you can set this property to a value of random.

The batchSize property allows you to tune the number of events per transaction for writes to the channel. Increasing this may provide better throughput at the cost of larger transactions (and possibly larger rollbacks). The bufferMaxLines property is used to set the size of the memory buffer used in reading files, by multiplying it with maxBufferLineLength. If your data is very short, you might consider increasing bufferMaxLines while reducing the maxBufferLineLength property. In this case, it will result in better throughput without increasing your memory overhead. That said, if you have events longer than 5000 characters, you'll want to set maxBufferLineLength higher.

If there is a problem writing data to the channel, a ChannelException is thrown back to the source, where it will retry after an initial wait time of 250 ms. Each failed attempt will double this time, up to a maximum of 4 seconds. To set this maximum time higher, set the maxBackoff property. For instance, if I wanted a maximum of 5 minutes, I could set it in milliseconds like this:

agent.sources.s1.maxBackoff=300000

Finally,you’llwanttoensurethatwhatevermechanismiswritingnewfilestoyourspoolingdirectorycreatesuniquefilenames,suchasaddingatimestamp(andpossibly

Page 139: Apache Flume: Distributed Log Collection - javaarmjavaarm.com/.../Apache.Flume-Distributed.Log.Collection.for.Hadoop... · Apache Flume: Distributed Log Collection for Hadoop Second

more).Reusingafilenamewillconfusethesource,andyourdatamaynotbeprocessed.

As always, remember that restarts and errors will create duplicates due to retransmission of files that were partially sent but not marked as complete, or because the metadata is incomplete.
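To wrap up this source, here is a sketch combining many of the properties covered above into a single definition. The paths, suffix, and ignore pattern are placeholders you would replace with your own values:

agent.sources=s1
agent.sources.s1.type=spooldir
agent.sources.s1.channels=c1
agent.sources.s1.spoolDir=/path/to/files
agent.sources.s1.fileSuffix=.DONE
agent.sources.s1.deletePolicy=immediate
agent.sources.s1.basenameHeader=true
agent.sources.s1.ignorePattern=^.*\.tmp$
agent.sources.s1.consumeOrder=oldest
agent.sources.s1.batchSize=100
agent.sources.s1.maxBackoff=300000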


Syslog sources
Syslog has been around for decades and is often used as an operating-system-level mechanism to capture and move logs around systems. In many ways, there are overlaps with some of the functionality Flume provides. There is even a Hadoop module for rsyslog, one of the more modern variants of syslog (http://www.rsyslog.com/doc/rsyslog_conf_modules.html/omhdfs.html). Generally, I don't like solutions that couple technologies that may version independently. If you use this rsyslog/Hadoop integration, you would be required to update the version of Hadoop you compiled into rsyslog at the same time you upgraded your Hadoop cluster to a new major version. This may be logistically difficult if you have a large number of servers and/or environments. Backward compatibility in Hadoop wire protocols is something that is being actively worked on in the Hadoop community, but currently, it isn't the norm. We'll talk more about this in Chapter 8, Monitoring Flume, when we discuss tiering data flows.

Syslog has an older UDP transport as well as a newer TCP protocol that can handle data larger than a single UDP packet can transmit (about 64 KB) and deal with network-related congestion events that might require the data to be retransmitted.

Finally, there are some undocumented properties of the syslog sources that allow us to add more regular-expression pattern matching for messages that do not conform to the RFC standards. I won't be discussing these additional settings, but you should be aware of them if you run into frequent parsing errors. In that case, take a look at the source for org.apache.flume.source.SyslogUtils for implementation details to find the cause.

More details on syslog terms (such as a facility) and standard formats can be found in RFC 3164 at http://tools.ietf.org/html/rfc3164.


The syslog UDP source
The UDP version of syslog is usually safe to use when you are receiving data from the server's local syslog process, provided the data is small enough (less than about 64 KB).

Note: The implementation for this source has chosen 2,500 bytes as the maximum payload size, regardless of what your network can actually handle. So, if your payload will be larger than this, use one of the TCP sources instead.

To create a Syslog UDP source, set the type property to syslogudp. You must set the port to listen on using the port property. The optional host property specifies the bind address. If no host is specified, all IPs for the server will be used, which is the same as specifying 0.0.0.0. In this example, we will only listen for local UDP connections on port 5140:

agent.sources=s1

agent.sources.s1.channels=c1

agent.sources.s1.type=syslogudp

agent.sources.s1.host=localhost

agent.sources.s1.port=5140

If you want syslog to forward a tailed file, you can add a line like this to your syslog configuration file:

*.err;*.alert;*.crit;*.emerg;kern.*    @localhost:5140

This will send all error priority, critical priority, emergency priority, and kernel messages of any priority into your Flume source. The single @ symbol designates that the UDP protocol should be used.

Here is a summary of the properties of the Syslog UDP source:

Key         Required  Type     Default
type        Yes       String   syslogudp
channels    Yes       String   Space-separated list of channels
port        Yes       int
host        No        String   0.0.0.0
keepFields  No        boolean  false

The keepFields property tells the source to include the syslog fields as part of the body. By default, these are simply removed, as they become Flume header values.

The Flume headers created by the Syslog UDP source are summarized here:

Header key           Description
Facility             This is the syslog facility. See the syslog documentation.
Priority             This is the syslog priority. See the syslog documentation.
timestamp            This is the time of the syslog event, translated into an epoch timestamp. It is omitted if not parsed from one of the standard RFC formats.
hostname             This is the parsed hostname in the syslog message. It is omitted if not parsed.
flume.syslog.status  There was a problem parsing the syslog message's headers. This value is set to Invalid if the payload didn't conform to the RFCs, and set to Incomplete if the message was longer than the eventSize value (for UDP, this is set internally to 2,500 bytes). It is omitted if everything is fine.


The syslog TCP source
As previously mentioned, the Syslog TCP source provides an endpoint for messages over TCP, allowing for a larger payload size and TCP retry semantics that should be used for any reliable inter-server communications.

To create a Syslog TCP source, set the type property to syslogtcp. You must still set the bind address and port to listen on:

agent.sources=s1

agent.sources.s1.type=syslogtcp

agent.sources.s1.host=0.0.0.0

agent.sources.s1.port=12345

If your syslog implementation supports syslog over TCP, the configuration is usually the same, except that a double @ symbol is used to indicate TCP transport. Here is the same example using TCP, where I am forwarding the values to a Flume agent that is running on a different server named flume-1:

*.err;*.alert;*.crit;*.emerg;kern.*    @@flume-1:12345

There are some optional properties for the Syslog TCP source, as listed here:

Key         Required  Type         Default
type        Yes       String       syslogtcp
channels    Yes       String       Space-separated list of channels
port        Yes       int
host        No        String       0.0.0.0
keepFields  No        boolean      false
eventSize   No        int (bytes)  2500 bytes

The keepFields property tells the source to include the syslog fields as part of the body. By default, these are simply removed, as they become Flume header values.

The Flume headers created by the Syslog TCP source are summarized here:

Header key           Description
Facility             This is the syslog facility. See the syslog documentation.
Priority             This is the syslog priority. See the syslog documentation.
timestamp            This is the time of the syslog event, translated into an epoch timestamp. It is omitted if not parsed from one of the standard RFC formats.
hostname             This is the parsed hostname in the syslog message. It is omitted if not parsed.
flume.syslog.status  There was a problem parsing the syslog message's headers. This is set to Invalid if the payload didn't conform to the RFCs, and set to Incomplete if the message was longer than the configured eventSize. It is omitted if everything is fine.


The multiport syslog TCP source
The multiport Syslog TCP source is nearly identical in functionality to the Syslog TCP source, except that it can listen on multiple ports for input. You may need to use this capability if you are unable to change which port syslog will use in its forwarding rules (it may not be your server at all). It is more likely that you will use this to read multiple formats using one source, to write to different channels. We'll cover that in a moment in the Channel selectors section. Under the hood, a high-performance asynchronous TCP library called Mina (https://mina.apache.org/) is used, which often provides better throughput on multicore servers even when consuming only a single TCP port.

To configure this source, set the type property to multiport_syslogtcp:

agent.sources.s1.type=multiport_syslogtcp

Like the other syslog sources, you need to specify the port, but in this case it is a space-separated list of ports. You can use this even if you have only one port specified. The property for this is ports (plural):

agent.sources.s1.type=multiport_syslogtcp

agent.sources.s1.channels=c1

agent.sources.s1.ports=33333 44444

agent.sources.s1.host=0.0.0.0

This code configures the multiport Syslog TCP source named s1 to listen for any incoming connections on ports 33333 and 44444 and send them to channel c1.

In order to tell which event came from which port, you can set the optional portHeader property to the name of the key whose value will be the port number. Let's add this property to the configuration:

agent.sources.s1.portHeader=port

Then, any events received from port 33333 would have a header key/value of {"port"="33333"}. As you saw in Chapter 4, Sinks and Sink Processors, you can now use this value (or any header) as a part of your HDFS sink file path convention, like this:

agent.sinks.k1.hdfs.path=/logs/%{hostname}/%{port}/%Y/%m/%d/%H

Here is a complete table of the properties:

Key                 Required  Type         Default
type                Yes       String       multiport_syslogtcp
channels            Yes       String       Space-separated list of channels
ports               Yes       int          Space-separated list of port numbers
host                No        String       0.0.0.0
keepFields          No        boolean      false
eventSize           No        int          2500 (bytes)
portHeader          No        String
batchSize           No        int          100
readBufferSize      No        int (bytes)  1024
numProcessors       No        int          Automatically detected
charset.default     No        String       UTF-8
charset.port.PORT#  No        String

This TCP source has some additional tunable options over the standard TCP syslog source. The first is the batchSize property. This is the number of events processed per transaction with the channel. There is also the readBufferSize property. It specifies the internal buffer size used by the internal Mina library. Finally, the numProcessors property is used to size the worker thread pool in Mina. Before you tune these parameters, you may want to familiarize yourself with Mina (http://mina.apache.org/) and look at the source code before deviating from the defaults.

Finally, you can specify the default and per-port character encoding to use when converting between Strings and bytes:

agent.sources.s1.charset.default=UTF-16

agent.sources.s1.charset.port.33333=UTF-8

This sample configuration shows that all ports will be interpreted using UTF-16 encoding, except for port 33333 traffic, which will use UTF-8.

Asyou’vealreadyseenintheothersyslogsources,thekeepFieldspropertytellsthesourcetoincludethesyslogfieldsaspartofthebody.Bydefault,thesearesimplyremoved,astheybecomeFlumeheadervalues.

The Flume headers created by this source are summarized here:

Header key           Description
Facility             This is the syslog facility. See the syslog documentation.
Priority             This is the syslog priority. See the syslog documentation.
timestamp            This is the time of the syslog event, translated into an epoch timestamp. It is omitted if not parsed from one of the standard RFC formats.
hostname             This is the parsed hostname in the syslog message. It is omitted if not parsed.
flume.syslog.status  There was a problem parsing the syslog message's headers. This is set to Invalid if the payload didn't conform to the RFCs, set to Incomplete if the message was longer than the configured eventSize, and omitted if everything is fine.


JMS source
Sometimes, data can originate from asynchronous message queues. For these cases, you can use Flume's JMS source to create events read from a JMS Queue or Topic. While it is theoretically possible to use any Java Message Service (JMS) implementation, Flume has only been tested with ActiveMQ, so be sure to test thoroughly if you use a different provider.

Like the previously covered ElasticSearch sink in Chapter 4, Sinks and Sink Processors, Flume does not come packaged with the JMS implementation you'll be using, as the versions need to match up, so you'll need to include the necessary implementation JAR files on the Flume agent's classpath. The preferred method is to use the --plugins-dir parameter mentioned in Chapter 2, A Quick Start Guide to Flume, which we'll cover in more detail in the next chapter.

Note: ActiveMQ is just one provider of the Java Message Service API. For more information, see the project home page at http://activemq.apache.org.

To configure this source, set the type property to jms:

agent.sources.s1.type=jms

The first three properties are used to establish a connection with the JMS server. They are initialContextFactory, connectionFactory, and providerURL. The initialContextFactory property for ActiveMQ will be org.apache.activemq.jndi.ActiveMQInitialContextFactory. The connectionFactory property will be the registered JNDI name, which defaults to ConnectionFactory if unspecified. Finally, providerURL is the connection String passed to the connection factory to actually establish a network connection. It is usually a URL-like String consisting of the server name and port information. If you aren't familiar with your JMS configuration, ask somebody who is what values to use in your environment.
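For example, connecting to a hypothetical ActiveMQ broker listening on its default port of 61616 on a server named activemq.example.com might use values like these (the broker URL is an assumption for illustration; use whatever your environment actually requires):

agent.sources.s1.initialContextFactory=org.apache.activemq.jndi.ActiveMQInitialContextFactory
agent.sources.s1.connectionFactory=ConnectionFactory
agent.sources.s1.providerURL=tcp://activemq.example.com:61616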

This table summarizes these settings and others we'll discuss in a moment:

Key                    Required  Type    Default
type                   Yes       String  jms
channels               Yes       String  Space-separated list of channels
initialContextFactory  Yes       String
connectionFactory      No        String  ConnectionFactory
providerURL            Yes       String
userName               No        String  Username for authentication
passwordFile           No        String  Path to a file containing the password for authentication
destinationName        Yes       String
destinationType        Yes       String
messageSelector        No        String
errorThreshold         No        int     10
pollTimeout            No        long    1000 (milliseconds)
batchSize              No        int     100

If your JMS server requires authentication, pass the userName and passwordFile properties:

agent.sources.s1.userName=jms_bot_user

agent.sources.s1.passwordFile=/path/to/password.txt

Putting the password in a file you reference, rather than directly in the Flume configuration, allows you to keep the permissions on the configuration open for inspection, while storing the more sensitive password data in a separate file that is accessible only to the Flume agent (restrictions are commonly provided by the operating system's permissions system).

The destinationName property decides which destination to read messages from on our connection. Since JMS supports both queues and topics, you need to set the destinationType property to queue or topic, respectively. Here's what the properties might look like for a queue:

agent.sources.s1.destinationName=my_cool_data

agent.sources.s1.destinationType=queue

For a topic, it will look as follows:

agent.sources.s1.destinationName=restart_events

agent.sources.s1.destinationType=topic

The difference between a queue and a topic is basically the number of entities that will read each message from the named destination. If you are reading from a queue, the message will be removed from the queue once it is written to Flume as an event. A topic message, once read, is still available for other entities to read.

Should you need only a subset of the messages published to a topic or queue, the optional messageSelector String property provides this capability. For example, to create Flume events only from messages with a field called Age whose value is larger than 10, I can specify a messageSelector filter:

agent.sources.s1.messageSelector="Age > 10"

Note: Describing the selector capabilities and syntax in detail is far beyond the scope of this book. See http://docs.oracle.com/javaee/1.4/api/javax/jms/Message.html or pick up a book on JMS to become more familiar with JMS message selectors.

Should there be a problem in communicating with the JMS server, the connection will reset itself after the number of failures specified by the errorThreshold property. The default value of 10 is reasonable and will most likely not need to be changed.

Next, the JMS source has a property used to adjust the value of batchSize, which defaults to 100. By now, you should be fairly familiar with adjusting batch sizes on other sources and sinks. For high-volume flows, you'll want to set this higher to consume data in larger chunks from your JMS server to get higher throughput. Setting this too high for low-volume flows could delay processing. As always, testing is the only sure way to adjust this properly. The related pollTimeout property specifies how long to wait for new messages to appear before attempting to read a batch. The default, specified in milliseconds, is 1 second. If no messages are read before the poll timeout, the source will go into an exponential backoff mode before attempting another read. This could delay the processing of messages until it wakes up again. Chances are that you won't need to change this value from the default.

When the message gets converted into a Flume event, the message properties become Flume headers. The message payload becomes the Flume body, but since JMS uses serialized Java objects, we need to tell the JMS source how to interpret the payload. We do this by setting the converter.type property, which defaults to the only implementation that is packaged with Flume, using the DEFAULT String (implemented by the DefaultJMSMessageConverter class). It can deserialize JMS BytesMessages, TextMessages, and ObjectMessages (Java objects that implement the java.io.DataOutput interface). It does not handle StreamMessages or MapMessages, so if you need to process them, you'll need to implement your own converter type. To do this, you'll implement the org.apache.flume.source.jms.JMSMessageConverter interface, and use its fully qualified class name as the converter.type property value.

For completeness, here are these properties in tabular form:

Key                Required  Type    Default
converter.type     No        String  DEFAULT
converter.charset  No        String  UTF-8


Channel selectors
As we discussed in Chapter 1, Overview and Architecture, a source can write to one or more channels. This is why the property is plural (channels instead of channel). There are two ways multiple channels can be handled. The event can be written to all of the channels or to just one channel, based on some Flume header value. The internal mechanism for this in Flume is called a channel selector.

The channel selector for a source can be specified using the selector.type property. All selector-specific properties begin with the usual source prefix: the agent name, the keyword sources, and the source name:

agent.sources.s1.selector.type=replicating


Replicating
If you do not specify a selector for a source, replicating is the default. The replicating selector writes the same event to all channels in the source's channels list:

agent.sources.s1.channels=c1 c2 c3

agent.sources.s1.selector.type=replicating

In this example, every event will be written to all three channels: c1, c2, and c3.

There is an optional property on this selector, called optional. It is a space-separated list of channels that are optional. Consider this modified example:

agent.sources.s1.channels=c1 c2 c3

agent.sources.s1.selector.type=replicating

agent.sources.s1.selector.optional=c2 c3

Now, any failure to write to channels c2 or c3 will not cause the transaction to fail, and any data written to c1 will be committed. In the earlier example with no optional channels, any single channel failure would roll back the transaction for all channels.


Multiplexing
If you want to send different events to different channels, you should use a multiplexing channel selector by setting the value of selector.type to multiplexing. You also need to tell the channel selector which header to use by setting the selector.header property:

agent.sources.s1.selector.type=multiplexing

agent.sources.s1.selector.header=port

Let’sassumeweusedtheMultiportSyslogTCPsourcetolistenonfourports—11111,22222,33333,and44444—withaportHeadersettingofport:

agent.sources.s1.selector.default=c2

agent.sources.s1.selector.mapping.11111=c1 c2

agent.sources.s1.selector.mapping.44444=c2

agent.sources.s1.selector.optional.44444=c3

This configuration will result in the traffic of ports 22222 and 33333 going to the c2 channel only. The traffic of port 11111 will go to the c1 and c2 channels. A failure on either channel would result in nothing being added to either channel. The traffic of port 44444 will go to channels c2 and c3. However, a failure to write to c3 will still commit the transaction to c2, and c3 will not be attempted again with that event.
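For reference, here is what the whole example looks like when the source, its ports, and the selector settings above are gathered in one place. This simply combines the earlier fragments; the channel definitions themselves are assumed to exist elsewhere in the configuration:

agent.sources=s1
agent.sources.s1.type=multiport_syslogtcp
agent.sources.s1.channels=c1 c2 c3
agent.sources.s1.ports=11111 22222 33333 44444
agent.sources.s1.portHeader=port
agent.sources.s1.selector.type=multiplexing
agent.sources.s1.selector.header=port
agent.sources.s1.selector.default=c2
agent.sources.s1.selector.mapping.11111=c1 c2
agent.sources.s1.selector.mapping.44444=c2
agent.sources.s1.selector.optional.44444=c3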


Summary
In this chapter, we covered in depth the various sources that we can use to insert log data into Flume, including the Exec source, the Spooling Directory Source, the Syslog sources (UDP, TCP, and multiport TCP), and the JMS source.

We discussed how to replicate the old Flume 0.9 TailSource functionality and the problems with using tail semantics in general.

We also covered channel selectors and sending events to one or more channels, specifically the replicating and multiplexing channel selectors.

Optional channels were also discussed as a way to fail a put transaction for only some of the channels when more than one channel is used.

In the next chapter, we'll introduce interceptors, which allow in-flight inspection and transformation of events. Used in conjunction with channel selectors, interceptors provide the final piece needed to create complex data flows with Flume. Additionally, we will cover RPC mechanisms (source/sink pairs) between Flume agents using both Avro and Thrift, which can be used to create complex data flows.


Chapter 6. Interceptors, ETL, and Routing
The final piece of functionality required in your data processing pipeline is the ability to inspect and transform events in flight. This can be accomplished using interceptors. Interceptors, as we discussed in Chapter 1, Overview and Architecture, can be inserted after a source creates an event, but before writing to the channel occurs.


Interceptors
An interceptor's functionality can be summed up with this method:

public Event intercept(Event event);

A Flume event is passed to it, and it returns a Flume event. It may do nothing, in which case the same unaltered event is returned. Often, it alters the event in some useful way. If null is returned, the event is dropped.

To add interceptors to a source, simply add the interceptors property to the named source, for example:

agent.sources.s1.interceptors=i1 i2 i3

This defines three interceptors, i1, i2, and i3, on the s1 source for the agent named agent.

Note: Interceptors are run in the order in which they are listed. In the preceding example, i2 will receive the output from i1. Then, i3 will receive the output from i2. Finally, the channel selector receives the output from i3.

Now that we have defined the interceptors by name, we need to specify their types as follows:

agent.sources.s1.interceptors.i1.type=TYPE1

agent.sources.s1.interceptors.i1.additionalProperty1=VALUE

agent.sources.s1.interceptors.i2.type=TYPE2

agent.sources.s1.interceptors.i3.type=TYPE3

Let’slookatsomeoftheinterceptorsthatcomebundledwithFlumetogetabetterideaofhowtoconfigurethem.


Timestamp
The Timestamp interceptor, as its name suggests, adds a header with the timestamp key to the Flume event if one doesn't already exist. To use it, set the type property to timestamp.

If the event already contains a timestamp header, it will be overwritten with the current time, unless configured to preserve the original value by setting the preserveExisting property to true.

Here is a table summarizing the properties of the Timestamp interceptor:

Key               Required  Type     Default
type              Yes       String   timestamp
preserveExisting  No        boolean  false

Here is what a total configuration for a source might look like if we only want it to add a timestamp header if none exists:

agent.sources.s1.interceptors=i1

agent.sources.s1.interceptors.i1.type=timestamp

agent.sources.s1.interceptors.i1.preserveExisting=true

Recall this HDFS sink path from Chapter 4, Sinks and Sink Processors, utilizing the event date:

agent.sinks.k1.hdfs.path=/logs/apache/%Y/%m/%d/%H

The timestamp header is what determines this path. If it is missing, you can be sure Flume will not know where to create the files, and you will not get the result you are looking for.


Host
Similar in simplicity to the Timestamp interceptor, the Host interceptor will add a header to the event containing the IP address of the current Flume agent. To use it, set the type property to host:

agent.sources.s1.interceptors=i1

agent.sources.s1.interceptors.i1.type=host

The key for this header will be host unless you specify something else using the hostHeader property. Like before, an existing header will be overwritten, unless you set the preserveExisting property to true. Finally, if you want a reverse DNS lookup of the hostname to be used instead of the IP as the value, set the useIP property to false. Remember that reverse lookups will add processing time to your data flow.

Here is a table summarizing the properties of the Host interceptor:

Key               Required  Type     Default
type              Yes       String   host
hostHeader        No        String   host
preserveExisting  No        boolean  false
useIP             No        boolean  true

Here is what a total configuration for a source might look like if we only want it to add a relayHost header containing the DNS hostname of this agent to every event:

agent.sources.s1.interceptors=i1

agent.sources.s1.interceptors.i1.type=host

agent.sources.s1.interceptors.i1.hostHeader=relayHost

agent.sources.s1.interceptors.i1.useIP=false

This interceptor might be useful if you wanted to record the path your events took through your data flow, for instance. Chances are you are more interested in the origin of the event rather than the path it took, which is why I have yet to use this.


Static
The Static interceptor is used to insert a single key/value header into each Flume event processed. If more than one key/value is desired, you simply add additional Static interceptors. Unlike the interceptors we've looked at so far, the default behavior is to preserve existing headers with the same key. As always, my recommendation is to always specify what you want and not rely on the defaults.

I do not know why the key and value properties are not required, as the defaults are not terribly useful.

HereisatablesummarizingthepropertiesoftheStaticinterceptor:

Key Required Type Default

type Yes String static

key No String key

value No String value

preserveExisting No boolean true

Finally,let’slookatanexampleconfigurationthatinsertstwonewheaders,providedtheydon’talreadyexistintheevent:

agent.sources.s1.interceptors=posenv

agent.sources.s1.interceptors.pos.type=static

agent.sources.s1.interceptors.pos.key=pointOfSale

agent.sources.s1.interceptors.pos.value=US

agent.sources.s1.interceptors.env.type=static

agent.sources.s1.interceptors.env.key=environment

agent.sources.s1.interceptors.env.value=staging


Regular expression filtering

If you want to filter events based on the content of the body, the regular expression filtering interceptor is your friend. Based on a regular expression you provide, it will either filter out the matching events or keep only the matching events. Start by setting the interceptor type to regex_filter. The pattern you want to match is specified using Java-style regular expression syntax. See the javadocs for usage details at http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html. The pattern string is set in the regex property. Be sure to escape backslashes in Java Strings. For instance, the \d+ pattern would need to be written with two backslashes: \\d+. If you wanted to match a backslash, the documentation says to type two backslashes, but each needs to be escaped, resulting in four, that is, \\\\. You will see the use of escaped backslashes throughout this chapter. Finally, you need to tell the interceptor whether you want to exclude matching records by setting the excludeEvents property to true. The default (false) indicates that you want to keep only events that match the pattern.

Here is a table summarizing the properties of the regular expression filtering interceptor:

Key           Required Type    Default
type          Yes      String  regex_filter
regex         No       String  .*
excludeEvents No       boolean false

In this example, any events containing the NullPointerException string will be dropped:

agent.sources.s1.interceptors=npe

agent.sources.s1.interceptors.npe.type=regex_filter

agent.sources.s1.interceptors.npe.regex=NullPointerException

agent.sources.s1.interceptors.npe.excludeEvents=true


Regular expression extractor

Sometimes, you'll want to extract bits of your event body into Flume headers so that you can perform routing via Channel Selectors. You can use the regular expression extractor interceptor to perform this function. Start by setting the interceptor type to regex_extractor:

agent.sources.s1.interceptors=e1

agent.sources.s1.interceptors.e1.type=regex_extractor

Like the regular expression filtering interceptor, the regular expression extractor interceptor uses Java-style regular expression syntax. In order to extract one or more fields, you start by specifying the regex property with group matching parentheses. Let's assume we are looking for error numbers in our events in the Error: N form, where N is a number:

agent.sources.s1.interceptors=e1

agent.sources.s1.interceptors.e1.type=regex_extractor

agent.sources.s1.interceptors.e1.regex=Error:\\s(\\d+)

As you can see, I put capture parentheses around the number, which may be one or more digits. Now that I've matched my desired pattern, I need to tell Flume what to do with my match. Here, we need to introduce serializers, which provide a pluggable mechanism for how to interpret each match. In this example, I've only got one match, so my space-separated list of serializer names has only one entry:

agent.sources.s1.interceptors=e1

agent.sources.s1.interceptors.e1.type=regex_extractor

agent.sources.s1.interceptors.e1.regex=Error:\\s(\\d+)

agent.sources.s1.interceptors.e1.serializers=ser1

agent.sources.s1.interceptors.e1.serializers.ser1.type=default

agent.sources.s1.interceptors.e1.serializers.ser1.name=error_no

The name property specifies the event header key to use, where the value is the matching text from the regular expression. The default type (also the default if not specified) is a simple pass-through serializer. Consider the following event body:

NullPointerException: A problem occurred. Error: 123. TxnID: 5X2T9E.

The following header would be added to the event:

{"error_no":"123"}

If I wanted to add the TxnID value as a header, I'd simply add another matching pattern group and serializer:

agent.sources.s1.interceptors=e1

agent.sources.s1.interceptors.e1.type=regex_extractor

agent.sources.s1.interceptors.e1.regex=Error:\\s(\\d+).*TxnID:\\s(\\w+)

agent.sources.s1.interceptors.e1.serializers=ser1 ser2

agent.sources.s1.interceptors.e1.serializers.ser1.type=default

agent.sources.s1.interceptors.e1.serializers.ser1.name=error_no

agent.sources.s1.interceptors.e1.serializers.ser2.type=default


agent.sources.s1.interceptors.e1.serializers.ser2.name=txnid

Then, I would create these headers for the preceding input:

{"error_no":"123","txnid":"5X2T9E"}

However, take a look at what would happen if the fields were reversed as follows:

NullPointerException: A problem occurred. TxnID: 5X2T9E. Error: 123.

I would wind up with only a header for txnid. A better way to handle this kind of ordering would be to use multiple interceptors so that the order doesn't matter:

agent.sources.s1.interceptors=e1 e2

agent.sources.s1.interceptors.e1.type=regex_extractor

agent.sources.s1.interceptors.e1.regex=Error:\\s(\\d+)

agent.sources.s1.interceptors.e1.serializers=ser1

agent.sources.s1.interceptors.e1.serializers.ser1.type=default

agent.sources.s1.interceptors.e1.serializers.ser1.name=error_no

agent.sources.s1.interceptors.e2.type=regex_extractor

agent.sources.s1.interceptors.e2.regex=TxnID:\\s(\\w+)

agent.sources.s1.interceptors.e2.serializers=ser1

agent.sources.s1.interceptors.e2.serializers.ser1.type=default

agent.sources.s1.interceptors.e2.serializers.ser1.name=txnid

The only other serializer implementation that ships with Flume, other than the pass-through, is specified with the fully qualified class name org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer. This serializer is used to convert times into milliseconds. You need to specify a pattern property based on org.joda.time.format.DateTimeFormat patterns.

For instance, let's say you were ingesting Apache Web Server access logs, for example:

192.168.1.42 - - [29/Mar/2013:15:27:09 -0600] "GET /index.html HTTP/1.1" 200 1037

The complete regular expression for this might look like this (in the form of a Java String, with backslashes and quotes escaped with an extra backslash):

^([\\d.]+) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+)

The time pattern matched corresponds to this org.joda.time.format.DateTimeFormat pattern:

dd/MMM/yyyy:HH:mm:ss Z

Take a look at what would happen if we make our configuration something like this:

agent.sources.s1.interceptors=e1

agent.sources.s1.interceptors.e1.type=regex_extractor

agent.sources.s1.interceptors.e1.regex=^([\\d.]+) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+)

agent.sources.s1.interceptors.e1.serializers=ip dt url sc bc

agent.sources.s1.interceptors.e1.serializers.ip.name=ip_address

agent.sources.s1.interceptors.e1.serializers.dt.type=org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer

agent.sources.s1.interceptors.e1.serializers.dt.pattern=dd/MMM/yyyy:HH:mm:ss Z

agent.sources.s1.interceptors.e1.serializers.dt.name=timestamp

agent.sources.s1.interceptors.e1.serializers.url.name=http_request

agent.sources.s1.interceptors.e1.serializers.sc.name=status_code

agent.sources.s1.interceptors.e1.serializers.bc.name=bytes_xfered

This would create the following headers for the preceding sample:

{"ip_address":"192.168.1.42", "timestamp":"1364588829", "http_request":"GET /index.html HTTP/1.1", "status_code":"200", "bytes_xfered":"1037"}

The body content is unaffected. You'll also notice that I didn't specify default for the other serializers, as that is the default type.

Note: There is no overwrite checking in this interceptor type. For instance, using the timestamp key will overwrite the event's previous time value if there was one.

You can implement your own serializers for this interceptor by implementing the org.apache.flume.interceptor.RegexExtractorInterceptorSerializer interface. However, if your goal is to move data from the body of an event to the header, you'll probably want to implement a custom interceptor so that you can alter the body contents in addition to setting the header value; otherwise, the data will be effectively duplicated.
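If you do go the custom serializer route, the shape of the implementation mirrors the bundled pass-through and millis serializers. The following is only a sketch: it assumes the interface exposes a single serialize(String) method plus the usual configure hooks, so check it against the RegexExtractorInterceptorSerializer source in your Flume version before relying on it. This hypothetical example simply uppercases whatever the capture group matched:

import org.apache.flume.Context;
import org.apache.flume.conf.ComponentConfiguration;
import org.apache.flume.interceptor.RegexExtractorInterceptorSerializer;

// Sketch only: assumes the serialize(String)/configure() shape used by the
// bundled pass-through serializer. Verify against your Flume version.
public class UpperCaseSerializer implements RegexExtractorInterceptorSerializer {
  public String serialize(String value) {
    // The returned String becomes the header value for this capture group
    return value == null ? null : value.toUpperCase();
  }
  public void configure(Context context) { }              // no properties needed
  public void configure(ComponentConfiguration conf) { }  // no properties needed
}

You would then reference it by its fully qualified class name in the serializers.NAME.type property.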

To summarize, let's review the properties for this interceptor:

Key                   Required Type                                      Default
type                  Yes      String                                    regex_extractor
regex                 Yes      String
serializers           Yes      Space-separated list of serializer names
serializers.NAME.name Yes      String
serializers.NAME.type No       default or FQCN of implementation         default
serializers.NAME.PROP No       Serializer-specific properties


Morphline interceptor

As we saw in Chapter 4, Sinks and Sink Processors, a powerful library of transformations backs the MorphlineSolrSink from the Kite SDK project. It should come as no surprise that you can also use these libraries in many places where you'd otherwise be forced to write a custom interceptor. Similar in configuration to its sink counterpart, you only need to specify the Morphline configuration file and, optionally, the Morphline unique identifier (if the configuration specifies more than one Morphline). Here is a summary table of the Flume interceptor configuration:

Key           Required Type   Default
type          Yes      String org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
morphlineFile Yes      String
morphlineId   No       String Picks the first Morphline if not specified; leaving it unspecified is an error when more than one Morphline exists.

Your Flume configuration might look something like this:

agent.sources.s1.interceptors=i1 m1

agent.sources.s1.interceptors.i1.type=timestamp

agent.sources.s1.interceptors.m1.type=org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder

agent.sources.s1.interceptors.m1.morphlineFile=/path/to/morph.conf

agent.sources.s1.interceptors.m1.morphlineId=goMorphy

In this example, we have specified the s1 source on the agent named agent, which contains two interceptors, i1 and m1, processed in that order. The first is a standard Timestamp interceptor that inserts a timestamp header if none exists. The second will send the event through the Morphline processor specified by the Morphline configuration file, using the Morphline corresponding to the goMorphy ID.

Events processed by an interceptor must only output one event for every input event. If you need to output multiple records from a single Flume event, you must do this in the MorphlineSolrSink. With interceptors, it is strictly one event in and one event out.

The first command will most likely be the readLine command (or readBlob, readCSV, and so on) to convert the event into a Morphline Record. From there, you run any other Morphline commands you like, with the final Record at the end of the Morphline chain being converted back into a Flume event. We know from Chapter 4, Sinks and Sink Processors, that the _attachment_body special Record key should be byte[] (the same as the event's body). This means that the last command needs to do the conversion (such as toByteArray or writeAvroToByteArray). All other Record keys are converted to String values and set as headers with the same keys.

Note: Take a look at the reference guide for a complete list of Morphline commands, their properties, and usage information at http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html


The Morphline configuration file syntax has already been discussed in detail in the Morphline configuration files section of Chapter 4, Sinks and Sink Processors.
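To make this concrete, here is a rough sketch of what the /path/to/morph.conf file referenced above might contain for the goMorphy Morphline: it reads the event body as a single line and then converts the final Record back into a byte array body. The exact command parameters should be checked against the Kite reference guide; treat this as an outline rather than a tested configuration:

morphlines : [
  {
    id : goMorphy
    importCommands : ["org.kitesdk.**"]
    commands : [
      # Convert the Flume event body into a Morphline Record
      { readLine { charset : UTF-8 } }
      # ... any other transformation commands would go here ...
      # Convert the final Record back into a Flume event body
      { toByteArray { field : _attachment_body } }
    ]
  }
]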


Custom interceptors

If there is one piece of custom code you will add to your Flume implementation, it will most likely be a custom interceptor. As mentioned earlier, you implement the org.apache.flume.interceptor.Interceptor interface and the associated org.apache.flume.interceptor.Interceptor.Builder interface.

Let's say I needed to URL-decode my event body. The code would look something like this:

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class URLDecode implements Interceptor {

  public void initialize() {}

  public Event intercept(Event event) {
    try {
      byte[] decoded = URLDecoder.decode(new String(event.getBody()), "UTF-8")
          .getBytes("UTF-8");
      event.setBody(decoded);
    } catch (UnsupportedEncodingException e) {
      // Shouldn't happen. Fall through to the unaltered event.
    }
    return event;
  }

  public List<Event> intercept(List<Event> events) {
    for (Event event : events) {
      intercept(event);
    }
    return events;
  }

  public void close() {}

  public static class Builder implements Interceptor.Builder {
    public Interceptor build() {
      return new URLDecode();
    }

    public void configure(Context context) {}
  }
}

Then, to configure my new interceptor, use the fully qualified class name of the Builder class as the type:

agent.sources.s1.interceptors=i1

agent.sources.s1.interceptors.i1.type=com.example.URLDecode$Builder

For more examples of how to pass and validate properties, look at any of the existing interceptor implementations in the Flume source code.

Keep in mind that any heavy processing in your custom interceptor can affect the overall throughput, so be mindful of object churn or computationally intensive processing in your implementations.
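To compile a custom interceptor like the preceding URLDecode class outside the Flume source tree, you will need the Flume core classes on your build classpath. With Maven, that dependency might look like the following sketch; the version shown is just an example, so match it to the Flume release you are actually running (the provided scope keeps the JAR out of your artifact, since the agent already supplies it at runtime):

<dependency>
  <groupId>org.apache.flume</groupId>
  <artifactId>flume-ng-core</artifactId>
  <version>1.5.2</version>
  <scope>provided</scope>
</dependency>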


The plugins directory

Custom code (sources, interceptors, and so on) can always be installed alongside the Flume core classes in the $FLUME_HOME/lib directory. You could also specify additional CLASSPATH paths on startup by way of the flume-env.sh shell script (which is often sourced at startup time when using a packaged distribution of Flume). Starting with Flume 1.4, there is a command-line option to specify a directory that contains the custom code. By default, if not specified, the $FLUME_HOME/plugins.d directory is used, but you can override this using the --plugins-path command-line parameter.

Within this directory, each piece of custom code is separated into a subdirectory whose name is of your choosing—pick something that makes it easy to keep track of things. In this directory, you can include up to three subdirectories: lib, libext, and native.

The lib directory should contain the JAR file for your custom component. Any dependencies it uses should be added to the libext subdirectory. Both of these paths are added to the Java CLASSPATH variable at startup.

Tip: The Flume documentation implies that this directory separation by component allows conflicting Java libraries to coexist. In truth, this is not possible unless the underlying implementation makes use of different classloaders within the JVM. In this case, the Flume startup code simply appends all of these paths to the startup CLASSPATH variable, so the order in which the subdirectories are processed will determine the precedence. You cannot even be guaranteed that the subdirectories will be processed in lexicographic order, as the underlying bash shell for loop gives no such guarantees. In practice, you should always try to avoid conflicting dependencies. Code that depends too much on ordering tends to be buggy, especially if your classpath reorders itself from server installation to server installation.

The third directory, native, is where you put any native libraries associated with your custom component. This path, if it exists, gets added to LD_LIBRARY_PATH at startup so that the JVM can locate these native components.

So, if I had three custom components, my directory structure might look something like this:

$FLUME_HOME/plugins.d/base64-enc/lib/base64interceptor.jar

$FLUME_HOME/plugins.d/base64-enc/libext/base64-2.0.0.jar

$FLUME_HOME/plugins.d/base64-enc/native/libFoo.so

$FLUME_HOME/plugins.d/uuencode/lib/uuEncodingInterceptor.jar

$FLUME_HOME/plugins.d/my-avro-serializer/myAvroSerializer.jar

$FLUME_HOME/plugins.d/my-avro-serializer/native/libAvro.so

Keep in mind that all these paths get mashed together at startup, so conflicting versions of libraries should be avoided. This structure is an organizational mechanism for ease of deployment by humans. Personally, I have not used it yet, as I use Chef (or Puppet) to install custom components on my Flume agents directly into $FLUME_HOME/lib, but if you prefer to use this mechanism, it is available.
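As a concrete example, deploying the hypothetical URLDecode interceptor from the previous section via the plugins directory might look like this; the JAR names and the agent/configuration file names are assumptions for illustration only:

$ mkdir -p $FLUME_HOME/plugins.d/url-decoder/lib
$ mkdir -p $FLUME_HOME/plugins.d/url-decoder/libext
$ cp url-decoder.jar $FLUME_HOME/plugins.d/url-decoder/lib/
$ cp some-dependency.jar $FLUME_HOME/plugins.d/url-decoder/libext/
$ $FLUME_HOME/bin/flume-ng agent -n agent -c $FLUME_HOME/conf \
    -f $FLUME_HOME/conf/agent.conf --plugins-path $FLUME_HOME/plugins.d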


Tiering flows

In Chapter 1, Overview and Architecture, we talked about tiering your data flows. There are several reasons for wanting to do this. You may want to limit the number of Flume agents that directly connect to your Hadoop cluster, to limit the number of parallel requests. You may also lack sufficient disk space on your application servers to store a significant amount of data while you are performing maintenance on your Hadoop cluster. Whatever your reason or use case, the most common mechanism to chain Flume agents is to use the Avro source/sink pair.


The Avro source/sink

We covered Avro a bit in Chapter 4, Sinks and Sink Processors, when we discussed how to use it as an on-disk serialization format for files stored in HDFS. Here, we'll put it to use for communication between Flume agents. A typical configuration might look something like this:

To use the Avro source, you specify the type property with a value of avro. You need to provide a bind address and a port number to listen on:

collector.sources=av1

collector.sources.av1.type=avro

collector.sources.av1.bind=0.0.0.0

collector.sources.av1.port=42424

collector.sources.av1.channels=ch1

collector.channels=ch1

collector.channels.ch1.type=memory

collector.sinks=k1

collector.sinks.k1.type=hdfs

collector.sinks.k1.channel=ch1

collector.sinks.k1.hdfs.path=/path/in/hdfs


Here, we have configured the agent in the middle, which listens on port 42424, uses a memory channel, and writes to HDFS. I've used the memory channel for brevity in this example configuration. Also, note that I've given this agent a different name, collector, just to avoid confusion.

The agents on the top and bottom sides feeding the collector tier might have a configuration similar to this. I have left the sources off this configuration for brevity:

client.channels=ch1

client.channels.ch1.type=memory

client.sinks=k1

client.sinks.k1.type=avro

client.sinks.k1.channel=ch1

client.sinks.k1.hostname=collector.example.com

client.sinks.k1.port=42424

The hostname, collector.example.com, has nothing to do with the agent name on this machine; it is the hostname (or you can use an IP) of the target machine with the receiving Avro source. This configuration, named client, would be applied to both agents on the top and bottom sides, assuming both had similar source configurations.

As I don't like single points of failure, I would configure two collector agents with the preceding configuration and instead set each client agent to round robin between the two, using a sink group. Again, I've left off the sources for brevity:

client.channels=ch1

client.channels.ch1.type=memory

client.sinks=k1 k2

client.sinks.k1.type=avro

client.sinks.k1.channel=ch1

client.sinks.k1.hostname=collectorA.example.com

client.sinks.k1.port=42424

client.sinks.k2.type=avro

client.sinks.k2.channel=ch1

client.sinks.k2.hostname=collectorB.example.com

client.sinks.k2.port=42424

client.sinkgroups=g1

client.sinkgroups.g1.sinks=k1 k2

client.sinkgroups.g1.processor.type=load_balance

client.sinkgroups.g1.processor.selector=round_robin

client.sinkgroups.g1.processor.backoff=true

There are four additional properties associated with the Avro sink that you may need to adjust from their sensible defaults.

The first is the batch-size property, which defaults to 100. Under heavy loads, you may see better throughput by setting this higher, for example:

client.sinks.k1.batch-size=1024

The next two properties control network connection timeouts. The connect-timeout property, which defaults to 20 seconds (specified in milliseconds), is the amount of time allowed to establish a connection with an Avro source (the receiver). The related request-timeout property, which also defaults to 20 seconds (specified in milliseconds), is the amount of time allowed for a sent message to be acknowledged. If I wanted to increase these values to 1 minute, I could add these additional properties:

client.sinks.k1.connect-timeout=60000

client.sinks.k1.request-timeout=60000

Finally, the reset-connection-interval property can be set to force connections to re-establish themselves after some time period. This can be useful when your sink is connecting through a VIP (Virtual IP Address) or hardware load balancer to keep things balanced, as the services offered behind the VIP may change over time due to failures or changes in capacity. By default, the connections will not be reset except in cases of failure. If you wanted to change this so that the connections reset themselves every hour, for example, you can specify this by setting this property with the number of seconds, as follows:

client.sinks.k1.reset-connection-interval=3600

Compressing Avro

Communication between the Avro source and sink can be compressed by setting the compression-type property to deflate. On the Avro sink, you can additionally set the compression-level property to a number between 1 and 9, with the default being 6. Typically, there are diminishing returns at higher compression levels, but testing may prove that a nondefault value works better for you. Clearly, you need to weigh the additional CPU cost against the benefit of performing a higher level of compression. Typically, the default is fine.

Compressed communications are especially important when you are dealing with high-latency networks, such as when sending data between two data centers. Another common use case for compression is where you are charged for the bandwidth consumed, such as with most public cloud services. In these cases, you will probably choose to spend CPU cycles compressing and decompressing your flow rather than sending highly compressible data uncompressed.

Tip: It is very important that if you set the compression-type property in a source/sink pair, you set this property at both ends. Otherwise, a sink could be sending data the source can't consume.

Continuing the preceding example, to add compression, you would add these additional property fields on both agents:

collector.sources.av1.compression-type=deflate

client.sinks.k1.compression-type=deflate

client.sinks.k1.compression-level=7

client.sinks.k2.compression-type=deflate

client.sinks.k2.compression-level=7

SSL Avro flows

Sometimes, communication is of a sensitive nature or may traverse untrusted network paths. You can encrypt your communications between an Avro sink and an Avro source by setting the ssl property to true. Additionally, you will need to pass SSL certificate information to the source (the receiver) and, optionally, the sink (the sender).

I won't claim to be an expert in SSL and SSL certificates, but the general idea is that a certificate can either be self-signed or signed by a trusted third party such as VeriSign (called a certificate authority or CA). Generally, because of the cost associated with getting a verified certificate, people usually only do this for certificates a customer might see in a web browser. Some organizations have an internal CA who can sign certificates. For the purpose of this example, we'll be using self-signed certificates. This basically means that we'll generate a certificate but not have it signed by any authority. For this, we'll use the keytool utility that comes with all Java installations (the reference can be found at http://docs.oracle.com/javase/7/docs/technotes/tools/solaris/keytool.html). Here, I create a Java KeyStore (JKS) file that contains a 2048-bit key. When prompted for a password, I'll use the string password. Make sure you use a real password in your nontest environments:

% keytool -genkey -alias flumey -keyalg RSA -keystore keystore.jks -keysize 2048
Enter keystore password: password
Re-enter new password: password
What is your first and last name?
  [Unknown]: Steve Hoffman
What is the name of your organizational unit?
  [Unknown]: Operations
What is the name of your organization?
  [Unknown]: Me, Myself and I
What is the name of your City or Locality?
  [Unknown]: Chicago
What is the name of your State or Province?
  [Unknown]: Illinois
What is the two-letter country code for this unit?
  [Unknown]: US
Is CN=Steve Hoffman, OU=Operations, O="Me, Myself and I", L=Chicago, ST=Illinois, C=US correct?
  [no]: yes
Enter key password for <flumey>
  (RETURN if same as keystore password):

I can verify the contents of the JKS by running the list subcommand:

% keytool -list -keystore keystore.jks
Enter keystore password: password

Keystore type: JKS
Keystore provider: SUN

Your keystore contains 1 entry

flumey, Nov 11, 2014, PrivateKeyEntry,
Certificate fingerprint (SHA1): 5C:BC:3C:7F:7A:E7:77:EB:B5:54:FA:E2:8B:DD:D3:66:36:86:DE:E4


Now that I have a key in my keystore file, I can set the additional properties on the receiving source:

collector.sources.av1.ssl=true

collector.sources.av1.keystore=/path/to/keystore.jks

collector.sources.av1.keystore-password=password

As I am using a self-signed certificate, the sink won't be sure that communications can be trusted, so I have two options. The first is to tell it to just trust all certificates. Clearly, you would only do this on a private network that has some reasonable assurances about its security. To ignore the dubious origin of my certificate, I can set the trust-all-certs property to true as follows:

client.sinks.k1.ssl=true

client.sinks.k1.trust-all-certs=true

If you didn't set this, you'd see something like this in the logs:

org.apache.flume.EventDeliveryException: Failed to send events…
Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem…
Caused by: sun.security.validator.ValidatorException: No trusted certificate found…

For communications over the public Internet or in multitenant cloud environments, more care should be taken. In this case, you can provide a truststore to the sink. A truststore is similar to a keystore, except that it contains certificate authorities you are telling Flume can be trusted. For well-known certificate authorities, such as VeriSign and others, you don't need to specify a truststore, as their identities are already included in your Java distribution (at least for the major ones). You will need to have your certificate signed by a certificate authority and then add the signed certificate file to the source's keystore file. You will also need to include the signing CA's certificate in the keystore so that the certificate chain can be fully resolved.

Once you have added the key, the signed certificate, and the signing CA's certificate to the source, you need to configure the sink (the sender) to trust these authorities so that it will pass the validation step during the SSL handshake. To specify the truststore, set the truststore and truststore-password properties as follows:

client.sinks.k1.ssl=true

client.sinks.k1.truststore=/path/to/truststore.jks

client.sinks.k1.truststore-password=password
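For the self-signed certificate created earlier with keytool, one way to build such a truststore for the sending agent is to export the certificate from the source's keystore and import it into a new truststore file; this is just a sketch of the idea (prompted passwords omitted):

% keytool -exportcert -alias flumey -keystore keystore.jks -file flumey.cer
% keytool -importcert -alias flumey -file flumey.cer -keystore truststore.jks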

Chances are there is somebody in your organization responsible for obtaining third-party certificates, or someone who can issue an organizational CA-signed certificate to you, so I'm not going to go into detail about how to create your own certificate authority. There is plenty of information on the Internet if you choose to take this route. Remember that certificates have an expiration date (usually a year), so you'll need to repeat this process every year, or communications will abruptly stop when certificates or certificate authorities expire.


The Thrift source/sink

Another source/sink pair you can use to tier your data flows is based on Thrift (http://thrift.apache.org/). Unlike Avro data, which is self-describing, Thrift uses an external schema, which can be compiled into just about every programming language on the planet. The configuration is almost identical to the Avro example already covered. The preceding collector configuration, using Thrift, would now look something like this:

collector.sources=th1

collector.sources.th1.type=thrift

collector.sources.th1.bind=0.0.0.0

collector.sources.th1.port=42324

collector.sources.th1.channels=ch1

collector.channels=ch1

collector.channels.ch1.type=memory

collector.sinks=k1

collector.sinks.k1.type=hdfs

collector.sinks.k1.channel=ch1

collector.sinks.k1.hdfs.path=/path/in/hdfs

There is one additional property that sets the maximum number of worker threads in the underlying thread pool. If unset, a value of zero is assumed, which makes the thread pool unbounded, so you should probably set this in your production configurations. Testing should provide you with a reasonable upper limit for your environment. To set the thread pool size to 10, for example, you would add the threads property:

collector.sources.th1.threads=10

The "client" agents would use the corresponding Thrift sink as follows:

client.channels=ch1

client.channels.ch1.type=memory

client.sinks=k1

client.sinks.k1.type=thrift

client.sinks.k1.channel=ch1

client.sinks.k1.hostname=collector.example.com

client.sinks.k1.port=42324

Like its Avro counterpart, there are additional settings for the batch size, connection timeouts, and connection reset intervals that you may want to adjust based on your testing results. Refer to The Avro source/sink section earlier in this chapter for details.


Using command-line Avro

The Avro source can also be used in conjunction with one of the command-line options you may have noticed back in Chapter 2, A Quick Start Guide to Flume. Rather than running flume-ng with the agent parameter, you can pass the avro-client parameter to send one or more files to an Avro source. These are the options specific to avro-client from the help text:

avro-client options:
  --dirname <dir>         directory to stream to avro source
  --host,-H <host>        hostname to which events will be sent (required)
  --port,-p <port>        port of the avro source (required)
  --filename,-F <file>    text file to stream to avro source [default: std input]
  --headerFile,-R <file>  File containing headers as key/value pairs on each new line
  --help,-h               display help text

This variation is very useful for testing, resending data manually due to errors, or importing older data stored elsewhere.

Just like an Avro sink, you have to specify the hostname and port you will be sending data to. You can send a single file with the --filename option or all the files in a directory with the --dirname option. If you specify neither of these, stdin will be used. Here is how you might send a file named foo.log to the Flume agent we previously configured:

$ ./flume-ng avro-client --filename foo.log --host collector.example.com --port 42424

Each line of the input will be converted into a single Flume event.

Optionally, you can specify a file containing key/value pairs to set Flume header values. The file uses Java property file syntax. Suppose I had a file named headers.properties containing:

pointOfSale=US

environment=staging

Then, including the --headerFile option would set these two headers on every event created:

$ ./flume-ng avro-client --filename foo.log --headerFile headers.properties --host collector.example.com --port 42424


The Log4J appender

As we discussed in Chapter 5, Sources and Channel Selectors, there are issues that may arise from using a filesystem file as a source. One way to avoid this problem is to use the Flume Log4J Appender in your Java application(s). Under the hood, it uses the same Avro communication that the Avro sink uses, so you need only configure it to send data to an Avro source.

The Appender has two properties, which are shown here in XML:

<appender name="FLUME"
  class="org.apache.flume.clients.log4jappender.Log4jAppender">
  <param name="Hostname" value="collector.example.com"/>
  <param name="Port" value="42424"/>
</appender>

The format of the body will be dictated by the Appender's configured layout (not shown). The Log4J fields that get mapped to Flume headers are summarized in this table:

Flume header key                     Log4J logging event field
flume.client.log4j.logger.name       event.getLoggerName()
flume.client.log4j.log.level         event.getLevel() as a number. See org.apache.log4j.Level for mappings.
flume.client.log4j.timestamp         event.getTimeStamp()
flume.client.log4j.message.encoding  N/A—always UTF8
flume.client.log4j.logger.other      Only present if there was a problem mapping one of the above fields, so normally this won't be seen.

Refer to http://logging.apache.org/log4j/1.2/ for more details on using Log4J.

You will need to include the flume-ng-sdk JAR in the classpath of your Java application at runtime to use Flume's Log4J Appender.
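If your application configures Log4J with a properties file rather than XML, the equivalent setup might look something like this sketch; the root logger level and the layout pattern are assumptions you should adjust to your application:

# Sketch of a log4j.properties entry for the Flume appender
log4j.rootLogger = INFO, flume
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = collector.example.com
log4j.appender.flume.Port = 42424
log4j.appender.flume.layout = org.apache.log4j.PatternLayout
log4j.appender.flume.layout.ConversionPattern = %m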

Keep in mind that if there is a problem sending data to the Avro source, the appender will throw an exception and the log message will be dropped, as there is no place to put it. Keeping it in memory could quickly overload your JVM heap, which is usually considered worse than dropping the data record.


The Log4J load-balancing appender

I'm sure you noticed that the preceding Log4jAppender only accepts a single hostname/port in its configuration. If you wanted to spread the load across multiple collector agents, either for additional capacity or for fault tolerance, you can use the LoadBalancingLog4jAppender. This appender has a single required property named Hosts, which is a space-separated list of hostname and port number pairs separated by a colon, as follows:

<appender name="FLUME"
  class="org.apache.flume.clients.log4jappender.LoadBalancingLog4jAppender">
  <param name="Hosts" value="server1:42424 server2:42424"/>
</appender>

There is an optional property, Selector, which specifies the method by which you want to load balance. Valid values are RANDOM and ROUND_ROBIN. If not specified, the default is ROUND_ROBIN. You can implement your own selector, but that is outside the scope of this book. If you are interested, go have a look at the well-documented source code for the LoadBalancingLog4jAppender class.

Finally, there is another optional property to override the maximum time for exponential backoff when a server cannot be contacted. Initially, if a server cannot be contacted, 1 second will need to pass before that server is tried again. Each time the server is unavailable, the retry time doubles, up to a default maximum of 30 seconds. If we wanted to increase this maximum to 2 minutes, we can specify a MaxBackoff property in milliseconds as follows:

<appender name="FLUME"
  class="org.apache.flume.clients.log4jappender.LoadBalancingLog4jAppender">
  <param name="Hosts" value="server1:42424 server2:42424"/>
  <param name="Selector" value="RANDOM"/>
  <param name="MaxBackoff" value="120000"/>
</appender>

In this example, we have also overridden the default ROUND_ROBIN selector to use random selection.


The embedded agent

If you are writing a Java program that creates data, you may choose to send the data directly as structured data using a special mode of Flume called the Embedded Agent. It is basically a simple single-source/single-channel Flume agent that you run inside your JVM.

There are benefits and drawbacks to this approach. On the positive side, you don't need to monitor an additional process on your servers to relay data. The embedded channel also allows the data producer to continue executing its code immediately after queuing the event to the channel. The SinkRunner thread handles taking events from the channel and sending them to the configured sinks. Even if you didn't use embedded Flume to perform this handoff from the calling thread, you would most likely use some kind of synchronized queue (such as a BlockingQueue) to isolate the sending of the data from the main execution thread. Using embedded Flume provides the same functionality without you having to worry about whether you've written your multithreaded code correctly.

The major drawback to embedding Flume in your application is added memory pressure on your JVM's garbage collector. If you are using an in-memory channel, any unsent events are held in the heap and get in the way of cheap garbage collection of short-lived objects. However, this is why you set a channel size: to keep the maximum memory footprint to a known quantity. Furthermore, any configuration changes will require an application restart, which can be problematic if your overall system doesn't have sufficient redundancy built in to tolerate restarts.


Configuration and startup

Assuming you are not dissuaded (and you shouldn't be), you will first need to include the flume-ng-embedded-agent library (and dependencies) in your Java project. Depending on which build system you are using (Maven, Ivy, Gradle, and so on), the exact format will differ, so I'll just show you the Maven configuration here. You can look up the alternative formats at http://mvnrepository.com/:

<dependency>

<groupId>org.apache.flume</groupId>

<artifactId>flume-ng-embedded-agent</artifactId>

<version>1.5.2</version>

</dependency>

Start by creating an EmbeddedAgent object in your Java code by calling the constructor (and passing a String name—used only in error messages):

EmbeddedAgent toHadoop = new EmbeddedAgent("myData");

Next, you have to set properties for the channel, sinks, and sink processor via the configure() method, as shown in this example. Most of this configuration should look very familiar to you at this point:

Map<String, String> config = new HashMap<String, String>();
config.put("channel.type", "memory");
config.put("channel.capacity", "75");
config.put("sinks", "s1 s2");
config.put("sink.s1.type", "avro");
config.put("sink.s1.hostname", "foo.example.com");
config.put("sink.s1.port", "12345");
config.put("sink.s1.compression-type", "deflate");
config.put("sink.s2.type", "avro");
config.put("sink.s2.hostname", "bar.example.com");
config.put("sink.s2.port", "12345");
config.put("sink.s2.compression-type", "deflate");
config.put("processor.type", "failover");
config.put("processor.priority.s1", "10");
config.put("processor.priority.s2", "20");
toHadoop.configure(config);

Here, we define a memory channel with a capacity of 75 events, along with two Avro sinks in an active/standby configuration (first to foo.example.com, and if that fails, to bar.example.com) using Avro serialization with compression. Refer to Chapter 3, Channels, for specific settings for memory- or file-backed channel properties.

The sink processor only comes into play if you have more than one sink defined in your sinks property. Unlike the optional sink groups covered in Chapter 4, Sinks and Sink Processors, you need to specify a sink list even if it contains just one sink. Clearly, there is no behavioral difference between failover and load balancing when there is only one sink. Refer to each specific sink's properties in Chapter 4, Sinks and Sink Processors, for specific configuration parameters.

Finally, before you start using this agent, you need to call the start() method to instantiate everything based on your properties and start all the background processing threads, as follows:

toHadoop.start();

The class is now ready to start receiving and forwarding data.

You'll want to keep a reference to this object around for cleanup later, as well as to pass it to other objects that will be sending data, since the underlying channel provides thread safety. Personally, I use the Spring Framework's dependency injection to configure and pass references at startup rather than doing it programmatically, as I've shown in this example. Refer to the Spring website for more information (http://spring.io/), as proper use of Spring is a whole book unto itself, and it is not the only dependency injection framework available to you.


Sending data

Data can be sent either as single events or in batches. Batches can sometimes be more efficient in high-volume data streams.

To send a single event, just call the put() method, as shown in this example:

Event e = EventBuilder.withBody("Hello Hadoop", Charset.forName("UTF-8"));

toHadoop.put(e);

Here, I'm using one of the many methods available on the org.apache.flume.event.EventBuilder class. Here is a complete list of methods you can use to construct events with this helper class:

EventBuilder.withBody(byte[] body, Map<String, String> headers);
EventBuilder.withBody(byte[] body);
EventBuilder.withBody(String body, Charset charset, Map<String, String> headers);
EventBuilder.withBody(String body, Charset charset);

In order to send several events in one batch, you can call the putAll() method with a List of events, as shown in this example, which also includes a time header:

Map<String, String> headers = new HashMap<String, String>();
headers.put("timestamp", Long.toString(System.currentTimeMillis()));
List<Event> events = new ArrayList<Event>();
events.add(EventBuilder.withBody("First".getBytes("UTF-8"), headers));
events.add(EventBuilder.withBody("Second".getBytes("UTF-8"), headers));
toHadoop.putAll(events);

As interceptors are not supported, you will need to add any headers you want on the events programmatically in Java before you call put() or putAll().


Shutdown

Finally, as there are background threads waiting for data to arrive on your embedded channel, most likely holding persistent connections to the configured destinations, you'll want Flume to do its cleanup when it is time to shut down your application. Simply call the stop() method on the configured EmbeddedAgent as follows:

toHadoop.stop();


Routing

The routing of data to different destinations based on content should be fairly straightforward now that you've been introduced to all the various mechanisms in Flume.

The first step is to get the data you want to switch on into a Flume header by means of a source-side interceptor, if the header isn't already available. The second step is to use a Multiplexing Channel Selector on that header value to switch the data to an alternate channel.

For instance, let's say you wanted to capture all exceptions to HDFS. In this configuration, you can see events coming in on the s1 source via Avro on port 42424. The event is tested to see whether the body contains the text Exception. If it does, the interceptor creates an exception header key (with the value Exception). This header is used to switch these events to channel c1, and ultimately, HDFS. If the event didn't match the pattern, it would not have the exception header and would get passed to the c2 channel via the default selector, where it would be forwarded via Avro serialization to port 12345 on the server foo.example.com:

agent.sources=s1

agent.sources.s1.type=avro

agent.sources.s1.bind=0.0.0.0

agent.sources.s1.port=42424

agent.sources.s1.interceptors=i1

agent.sources.s1.interceptors.i1.type=regex_extractor

agent.sources.s1.interceptors.i1.regex=(Exception)

agent.sources.s1.interceptors.i1.serializers=ex

agent.sources.s1.interceptors.i1.serializers.ex.name=exception

agent.sources.s1.selector.type=multiplexing

agent.sources.s1.selector.header=exception

agent.sources.s1.selector.mapping.Exception=c1

agent.sources.s1.selector.default=c2

agent.channels=c1 c2

agent.channels.c1.type=memory

agent.channels.c2.type=memory

agent.sinks=k1 k2

agent.sinks.k1.type=hdfs

agent.sinks.k1.channel=c1

agent.sinks.k1.hdfs.path=/logs/exceptions/%Y/%m/%d/%H

agent.sinks.k2.type=avro

agent.sinks.k2.channel=c2

agent.sinks.k2.hostname=foo.example.com

agent.sinks.k2.port=12345


Summary

In this chapter, we covered the various interceptors shipped with Flume, including:

Timestamp: This is used to add a timestamp header, possibly overwriting an existing one.
Host: This is used to add the Flume agent hostname or IP as a header in the event.
Static: This is used to add static String headers.
Regular expression filtering: This is used to include or exclude events based on a matched regular expression.
Regular expression extractor: This is used to create headers from matched regular expressions. It's useful for routing with Channel Selectors.
Morphline: This is used to delegate transformation to a Morphline command chain.
Custom: This is used to create any custom transformations you need that you can't find elsewhere.

We also covered tiering data flows using the Avro source and sink. Optional compression and SSL for Avro flows were covered as well. Finally, Thrift sources and sinks were briefly covered, as some environments may already have Thrift data flows to integrate with.

Next, we introduced two Log4J Appenders, a single-path version and a load-balancing version, for direct integration with Java applications.

The Embedded Flume Agent was covered for those wishing to directly integrate basic Flume functionality into their Java applications.

Finally, we gave you an example of using interceptors in conjunction with a Channel Selector to provide routing decision logic.

In the next chapter, we will dive into an example that stitches together everything we have covered so far.


Chapter 7. Putting It All Together

Now that we've walked through all the components and configurations, let's put together a working end-to-end configuration. This example is by no means exhaustive, nor does it cover every possible scenario you might need, but I think it should cover a couple of common use cases I've seen over and over:

Finding errors by searching logs across multiple servers in near real time
Streaming data to HDFS for long-term batch processing

In the first situation, your systems may be impaired, and you have multiple places where you need to search for problems. Bringing all of those logs to a single place that you can search means getting your systems restored quickly. In the second scenario, you are interested in capturing data over the long term for analytics and machine learning.


Web logs to searchable UI

Let's simulate a web application by setting up a simple web server whose logs we want to stream into a searchable application. In this case, we'll be using a Kibana UI to perform ad hoc queries against Elasticsearch.

For this example, I'll start three servers in Amazon's Elastic Compute Cloud (EC2), as shown in this diagram:

Each server has a public IP (starting with 54) and a private IP (starting with 172). For interserver communication, I'll be using the private IPs in my configurations. My personal interaction with the web server (to simulate traffic) and with Kibana and Elasticsearch will require the public IPs, since I'm sitting in my house and not in Amazon's data centers.

Note: Pay careful attention to the IP addresses in the shell prompts, as we will be jumping from machine to machine, and I don't want you to get lost. For instance, on the Collector box, the prompt will contain its private IP: [ec2-user@ip-172-31-26-205 ~]$

If you try this out yourself in EC2, you will get different IP assignments, so adjust the configuration files and URLs referenced as needed. For this example, I'll be using Amazon's Linux AMI for the operating system and the t1.micro server size. Also, be sure to adjust the security groups to allow network traffic between these servers (and yourself). When in doubt about connectivity, use utilities such as telnet or nc to test connections between servers and ports.
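For example, from the web server you could quickly check that a given port on the collector's private IP is reachable with nc; the port shown here is just a placeholder for whatever source port you end up configuring on the collector:

[ec2-user@ip-172-31-18-146 ~]$ nc -zv 172.31.26.205 42424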


Tip: While going through this exercise, I made mistakes in my security groups more than once, so perform these connectivity tests whenever you hit connection exceptions.


Setting up the web server

Let's start by setting up the web server. For this, we'll install the popular Nginx web server (http://nginx.org). Since web server logs are pretty much standardized nowadays, I could have chosen any web server, but the point here is that this application writes its logs (by default) to a file on disk. Since I've warned you against trying to use the tail program to stream them, we'll be using a combination of logrotate and the Spooling Directory Source to ingest the data. First, let's log in to the server. The default account with sudo on an Amazon Linux server is ec2-user. You'll need to pass this in your ssh command:

% ssh 54.69.112.61 -l ec2-user
The authenticity of host '54.69.112.61 (54.69.112.61)' can't be established.
RSA key fingerprint is dc:ee:7a:1a:05:e1:36:cd:a4:81:03:97:48:8d:b3:cc.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '54.69.112.61' (RSA) to the list of known hosts.

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2014.09-release-notes/
18 package(s) needed for security, out of 41 available
Run "sudo yum update" to apply all updates.

Now that you are logged in to the server, let's install Nginx (I'm not going to show all of the output, to save paper):

[ec2-user@ip-172-31-18-146 ~]$ sudo yum -y install nginx
Loaded plugins: priorities, update-motd, upgrade-helper
Resolving Dependencies
--> Running transaction check
---> Package nginx.x86_64 1:1.6.2-1.22.amzn1 will be installed
[SNIP]
Installed:
  nginx.x86_64 1:1.6.2-1.22.amzn1

Dependency Installed:
  GeoIP.x86_64 0:1.4.8-1.5.amzn1            gd.x86_64 0:2.0.35-11.10.amzn1
  gperftools-libs.x86_64 0:2.0-11.5.amzn1   libXpm.x86_64 0:3.5.10-2.9.amzn1
  libunwind.x86_64 0:1.1-2.1.amzn1

Complete!

Next, let's start the Nginx web server:

[ec2-user@ip-172-31-18-146 ~]$ sudo /etc/init.d/nginx start
Starting nginx:                                            [  OK  ]

At this point, the web server should be running, so let's use our computer's web browser to go to the public IP, http://54.69.112.61/ in this case. If it is working, you should see the default welcome page, like this:


To help generate sample input, you can use a load generator program such as Apache benchmark (http://httpd.apache.org/docs/2.4/programs/ab.html) or wrk (https://github.com/wg/wrk). Here is the command I used to generate data for 20 minutes at a time:

% wrk -c2 -d20m http://54.69.112.61
Running 20m test @ http://54.69.112.61
  2 threads and 2 connections

Configuring log rotation to the spool directory

By default, the server's access logs are written to /var/log/nginx/access.log. The problem we need to overcome here is moving the file to a separate directory while it is still held open by the nginx process. This is where logrotate comes into the picture. Let's first create a spool directory that will become the input path for the Spooling Directory Source later on. For the purpose of this example, I'll just create the spool directory in the home directory of ec2-user. In a production environment, you would place the spool directory somewhere else:

[ec2-user@ip-172-31-18-146 ~]$ pwd
/home/ec2-user
[ec2-user@ip-172-31-18-146 ~]$ mkdir spool

Let's also create a logrotate script in the home directory of ec2-user. Open an editor and paste these contents (shown after the cat command below) into a file called accessRotate.conf:

[ec2-user@ip-172-31-18-146 ~]$ cat accessRotate.conf

/var/log/nginx/access.log {
  missingok
  notifempty
  rotate 0
  copytruncate
  sharedscripts
  olddir /home/ec2-user/spool
  postrotate
    chown ec2-user:ec2-user /home/ec2-user/spool/access.log.1
    ts=$(date +%s)
    mv /home/ec2-user/spool/access.log.1 /home/ec2-user/spool/access.log.$ts
  endscript
}

Let's go over this so that you understand what it is doing. This is by no means a complete introduction to the logrotate utility. Read the online documentation at http://linuxconfig.org/logrotate for more details.

When logrotate runs, it will copy /var/log/nginx/access.log to the /home/ec2-user/spool directory, but only if it has a nonzero length (meaning there is data to send). The rotate 0 directive tells logrotate not to remove any files from the target directory (since the Flume source will do that after it has successfully transmitted each file). We'll make use of the copytruncate feature because it keeps the file open as far as the nginx process is concerned, while resetting it to zero length. This avoids the need to signal nginx that a rotation has just occurred. The destination of the rotated file in the spool directory is /home/ec2-user/spool/access.log.1. Next, in the postrotate script section, we change the ownership of the file from the root user to ec2-user so that the Flume agent can read it and, later, delete it. Finally, we rename the file with a unique timestamp so that when the next rotation occurs, the access.log.1 file will not exist. This also means that we'll need to add an exclusion rule to the Flume source later to keep it from reading the access.log.1 file until the copying process has completed. Once renamed, the file is ready to be consumed by Flume.

Now let's try running it in debug mode, which is a dry run, so that we can see what it would do. Normally, logrotate has a daily rotation setting that prevents it from doing anything unless enough time has passed:

[ec2-user@ip-172-31-18-146 ~]$ sudo /usr/sbin/logrotate -d /home/ec2-user/accessRotate.conf
reading config file /home/ec2-user/accessRotate.conf
reading config info for /var/log/nginx/access.log
olddir is now /home/ec2-user/spool
Handling 1 logs
rotating pattern: /var/log/nginx/access.log 1048576 bytes (no old logs will be kept)
olddir is /home/ec2-user/spool, empty log files are not rotated, old logs are removed
considering log /var/log/nginx/access.log
  log does not need rotating
not running postrotate script, since no logs were rotated

Therefore, we will also include the -f (force) flag to override the preceding behavior:

[ec2-user@ip-172-31-18-146 ~]$ sudo /usr/sbin/logrotate -df /home/ec2-user/accessRotate.conf
reading config file /home/ec2-user/accessRotate.conf
reading config info for /var/log/nginx/access.log
olddir is now /home/ec2-user/spool
Handling 1 logs
rotating pattern: /var/log/nginx/access.log forced from command line (no old logs will be kept)
olddir is /home/ec2-user/spool, empty log files are not rotated, old logs are removed
considering log /var/log/nginx/access.log
  log needs rotating
rotating log /var/log/nginx/access.log, log->rotateCount is 0
dateext suffix '-20141207'
glob pattern '-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'
renaming /home/ec2-user/spool/access.log.1 to /home/ec2-user/spool/access.log.2 (rotatecount 1, logstart 1, i 1),
renaming /home/ec2-user/spool/access.log.0 to /home/ec2-user/spool/access.log.1 (rotatecount 1, logstart 1, i 0),
copying /var/log/nginx/access.log to /home/ec2-user/spool/access.log.1
truncating /var/log/nginx/access.log
running postrotate script
running script with arg /var/log/nginx/access.log: "
    chown ec2-user:ec2-user /home/ec2-user/spool/access.log.1
    ts=$(date +%s)
    mv /home/ec2-user/spool/access.log.1 /home/ec2-user/spool/access.log.$ts
"
removing old log /home/ec2-user/spool/access.log.2

As you can see, logrotate will copy the log to the spool directory with the .1 extension as expected, followed by our script block at the end to change permissions and rename the file with a unique timestamp.

Now let's run it again, but this time for real, without the debug flag, and then we will list the source and target directories to see what happened:

[ec2-user@ip-172-31-18-146 ~]$ sudo /usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf
[ec2-user@ip-172-31-18-146 ~]$ ls -l spool/
total 188
-rw-r--r-- 1 ec2-user ec2-user 189344 Dec 7 17:27 access.log.1417973241
[ec2-user@ip-172-31-18-146 ~]$ ls -l /var/log/nginx/
total 4
-rw-r--r-- 1 root root   0 Dec 7 17:27 access.log
-rw-r--r-- 1 root root 520 Dec 7 16:44 error.log
[ec2-user@ip-172-31-18-146 ~]$


You can see that the access log is now empty and the old contents have been copied to the spool directory, with correct permissions and a filename ending in a unique timestamp.

Let's also verify that an empty access log will result in no action if logrotate runs again without any new data. You'll need to include the debug flag to see what the process is thinking:

[ec2-user@ip-172-31-18-146 ~]$ sudo /usr/sbin/logrotate -df /home/ec2-user/accessRotate.conf
reading config file accessRotate.conf
reading config info for /var/log/nginx/access.log
olddir is now /home/ec2-user/spool
Handling 1 logs
rotating pattern: /var/log/nginx/access.log forced from command line (1 rotations)
olddir is /home/ec2-user/spool, empty log files are not rotated, old logs are removed
considering log /var/log/nginx/access.log
  log does not need rotating

Now that we have a working rotation script, we need something to run it periodically (not in debug mode) at our chosen interval. For this, we will use the cron daemon (http://en.wikipedia.org/wiki/Cron) by creating a file in the /etc/cron.d directory:

[ec2-user@ip-172-31-18-146 ~]$ cat /etc/cron.d/rotateLogsToSpool
# Move files to spool directory every 5 minutes
*/5 * * * * root /usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf

Here,I’veindicatedafive-minuteintervalandtorunastherootuser.Sincetherootuserownstheoriginallogfile,weneedelevatedaccesstoreassignownershiptoec2-usersothatitcanberemovedlaterbyFlume.

Onceyou’vesavedthiscronconfigurationfile,youshouldseeourscriptevery5minutes(aswellasotherprocesses),byinspectingthecrondaemon’slogfile:

[ec2-user@ip-172-31-18-146 ~]$ sudo tail /var/log/cron
Dec  7 17:45:01 ip-172-31-18-146 CROND[22904]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)
Dec  7 17:50:01 ip-172-31-18-146 CROND[22929]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)
Dec  7 17:54:01 ip-172-31-18-146 anacron[2426]: Job 'cron.weekly' started
Dec  7 17:54:01 ip-172-31-18-146 anacron[2426]: Job 'cron.weekly' terminated
Dec  7 17:55:01 ip-172-31-18-146 CROND[22955]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)
Dec  7 18:00:01 ip-172-31-18-146 CROND[23046]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)
Dec  7 18:01:01 ip-172-31-18-146 CROND[23060]: (root) CMD (run-parts /etc/cron.hourly)
Dec  7 18:01:01 ip-172-31-18-146 run-parts(/etc/cron.hourly)[23060]: starting 0anacron
Dec  7 18:01:01 ip-172-31-18-146 run-parts(/etc/cron.hourly)[23069]: finished 0anacron
Dec  7 18:05:01 ip-172-31-18-146 CROND[23117]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)

An inspection of the spool directory also shows new files created at our arbitrarily chosen 5-minute interval:

[ec2-user@ip-172-31-18-146 ~]$ ls -l spool/
total 1072
-rw-r--r-- 1 ec2-user ec2-user 145029 Dec  7 18:02 access.log.1417975373
-rw-r--r-- 1 ec2-user ec2-user 164169 Dec  7 18:05 access.log.1417975501
-rw-r--r-- 1 ec2-user ec2-user 166779 Dec  7 18:10 access.log.1417975801
-rw-r--r-- 1 ec2-user ec2-user 613785 Dec  7 18:15 access.log.1417976101

Keep in mind that the Nginx RPM package also installed a rotation configuration at /etc/logrotate.d/nginx to perform daily rotations. If you are going to use this example in production, you'll want to remove that configuration, since we don't want it clashing with our more frequently running cron script. I'll leave handling error.log as an exercise for you: you'll either want to send it someplace (and have Flume remove it) or rotate it periodically, as shown below, so that your disk doesn't fill up over time.
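
If you just want to keep error.log from filling the disk, a second logrotate stanza along these lines would do the job. This is only a sketch: the size threshold and retention count are arbitrary values picked for illustration, not something dictated by this setup:

/var/log/nginx/error.log {
    size 1M          # rotate once the file grows past 1 MB (arbitrary threshold)
    rotate 4         # keep only the last four rotated copies
    compress         # gzip the old copies to save space
    missingok        # don't complain if the file doesn't exist yet
    notifempty       # skip rotation when there is nothing in the file
}

Dropping a file like this into /etc/logrotate.d/ lets the distribution's normal daily logrotate run handle it, since nothing downstream needs to consume these rotated copies.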

Page 203: Apache Flume: Distributed Log Collection - javaarmjavaarm.com/.../Apache.Flume-Distributed.Log.Collection.for.Hadoop... · Apache Flume: Distributed Log Collection for Hadoop Second

Setting up the target – Elasticsearch

Now let's move to the other server in our diagram and set up Elasticsearch, our destination for searchable data. I'm going to be using the instructions found at http://www.elasticsearch.org/overview/elkdownloads/. First, let's download and install the RPM package:

[ec2-user@ip-172-31-26-120 ~]$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.1.noarch.rpm
--2014-12-07 18:25:35-- https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.1.noarch.rpm
Resolving download.elasticsearch.org (download.elasticsearch.org)... 54.225.133.195, 54.243.77.158, 107.22.222.16, ...
Connecting to download.elasticsearch.org (download.elasticsearch.org)|54.225.133.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26326154 (25M) [application/x-redhat-package-manager]
Saving to: 'elasticsearch-1.4.1.noarch.rpm'
100%[======================================>] 26,326,154  9.92MB/s   in 2.5s
2014-12-07 18:25:38 (9.92 MB/s) - 'elasticsearch-1.4.1.noarch.rpm' saved [26326154/26326154]

[ec2-user@ip-172-31-26-120 ~]$ sudo rpm -ivh elasticsearch-1.4.1.noarch.rpm
Preparing...                          ################################# [100%]
Updating / installing...
   1:elasticsearch-1.4.1-1            ################################# [100%]
### NOT starting on installation, please execute the following statements
to configure elasticsearch to start automatically using chkconfig
 sudo /sbin/chkconfig --add elasticsearch
### You can start elasticsearch by executing
 sudo service elasticsearch start

The installation is kind enough to tell me how to configure the service to start automatically on system boot-up, and how to launch the service right now, so let's do that:

[ec2-user@ip-172-31-26-120 ~]$ sudo /sbin/chkconfig --add elasticsearch
[ec2-user@ip-172-31-26-120 ~]$ sudo service elasticsearch start
Starting elasticsearch: [OK]

Let’sperformaquicktestwithcurltoverifythatitisrunningonthedefaultport,whichis9200:

[ec2-user@ip-172-31-26-120 ~]$ curl http://localhost:9200/
{
  "status" : 200,
  "name" : "Siege",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.4.1",
    "build_hash" : "89d3241d670db65f994242c8e8383b169779e2d4",
    "build_timestamp" : "2014-11-26T15:49:29Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.2"
  },
  "tagline" : "You Know, for Search"
}

Since indexes are created as data comes in, and we haven't sent any data yet, we expect to see no indexes created:

[ec2-user@ip-172-31-26-120 ~]$ curl http://localhost:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size

All we see is the header with no indexes listed, so let's head over to the collector server and set up the Flume relay that will write data to Elasticsearch.


Setting up Flume on the collector/relay

The Flume configuration on the collector will accept compressed Avro coming in and write to Elasticsearch going out. Go ahead and log in to the collector server (172.31.26.205).

Let’sstartbydownloadingtheFlumebinaryfromtheApachewebsite.Followthedownloadlinkandselectamirror:

[ec2-user@ip-172-31-26-205~]$wget

http://apache.arvixe.com/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz

--2014-12-0719:50:30--http://apache.arvixe.com/flume/1.5.2/apache-flume-

1.5.2-bin.tar.gz

Resolvingapache.arvixe.com(apache.arvixe.com)...198.58.87.82

Connectingtoapache.arvixe.com(apache.arvixe.com)|198.58.87.82|:80…

connected.

HTTPrequestsent,awaitingresponse…200OK

Length:25323459(24M)[application/x-gzip]

Savingto:'apache-flume-1.5.2-bin.tar.gz'

100%[==========================>]25,323,4598.38MB/sin2.9s

2014-12-0719:50:33(8.38MB/s)-'apache-flume-1.5.2-bin.tar.gz'saved

[25323459/25323459]

Next, expand the archive and change directories:

[ec2-user@ip-172-31-26-205 ~]$ tar -zxf apache-flume-1.5.2-bin.tar.gz
[ec2-user@ip-172-31-26-205 ~]$ cd apache-flume-1.5.2-bin
[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$

Create the configuration file, called collector.conf, in Flume's configuration directory using your favorite editor:

[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ cat conf/collector.conf
collector.sources = av
collector.channels = m1
collector.sinks = es

collector.sources.av.type = avro
collector.sources.av.bind = 0.0.0.0
collector.sources.av.port = 12345
collector.sources.av.compression-type = deflate
collector.sources.av.channels = m1

collector.channels.m1.type = memory
collector.channels.m1.capacity = 10000

collector.sinks.es.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
collector.sinks.es.channel = m1
collector.sinks.es.hostNames = 172.31.26.120

Here, you can see that we are using a simple memory channel configured with a capacity of 10,000 events.

The source is configured to accept compressed Avro on port 12345 and pass it to our memory channel.

Finally, the sink is configured to write to Elasticsearch on the server we just set up, at the private IP 172.31.26.120. We are using the default settings, which means it'll write to an index named flume-YYYY-MM-DD with the log type.
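
If the defaults don't suit you, the ElasticSearchSink exposes properties to override them. The following lines are only a sketch; the index and type names are invented for illustration (we keep the defaults in this example), so check the Flume User Guide for your version before relying on them:

collector.sinks.es.indexName = weblogs       # index prefix; the date suffix is still appended
collector.sinks.es.indexType = access_log    # document type assigned to each event
collector.sinks.es.batchSize = 100           # events written per transaction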

Let’stryrunningtheFlumeagent:

[[email protected]]$./bin/flume-ngagent-n

collector-cconf-fconf/collector.conf-Dflume.root.logger=INFO,console

You’llseeanexceptioninthelog,includingsomethinglikethis:

2014-12-0720:00:13,184(conf-file-poller-0)[ERROR-

org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatche

rRunnable.run(PollingPropertiesFileConfigurationProvider.java:145)]Failed

tostartagentbecausedependencieswerenotfoundinclasspath.Error

follows.

java.lang.NoClassDefFoundError:org/elasticsearch/common/io/BytesStream

TheFlumeagentcan’tfindtheElasticsearchclasses.RememberthatthesearenotpackagedwithFlume,asthelibrariesneedtobecompatiblewiththeversionofElasticsearchyouarerunning.LookingbackattheElasticsearchserver,wecangetanideaofwhatweneed.Rememberthatthislistincludesmanyruntimeserverdependencies,soitisprobablymorethanwhatyou’llneedforafunctionalElasticsearchclient:

[ec2-user@ip-172-31-26-120 ~]$ rpm -qil elasticsearch | grep jar
/usr/share/elasticsearch/lib/elasticsearch-1.4.1.jar
/usr/share/elasticsearch/lib/groovy-all-2.3.2.jar
/usr/share/elasticsearch/lib/jna-4.1.0.jar
/usr/share/elasticsearch/lib/jts-1.13.jar
/usr/share/elasticsearch/lib/log4j-1.2.17.jar
/usr/share/elasticsearch/lib/lucene-analyzers-common-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-core-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-expressions-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-grouping-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-highlighter-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-join-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-memory-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-misc-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-queries-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-queryparser-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-sandbox-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-spatial-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-suggest-4.10.2.jar
/usr/share/elasticsearch/lib/sigar/sigar-1.6.4.jar
/usr/share/elasticsearch/lib/spatial4j-0.4.1.jar

Really, you only need the elasticsearch JAR file and its dependencies, but we are going to be lazy and just download the RPM again on the collector machine and copy the JAR files to Flume:

[ec2-user@ip-172-31-26-205 ~]$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.1.noarch.rpm
--2014-12-07 20:03:38-- https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.1.noarch.rpm
Resolving download.elasticsearch.org (download.elasticsearch.org)... 54.225.133.195, 54.243.77.158, 107.22.222.16, ...
Connecting to download.elasticsearch.org (download.elasticsearch.org)|54.225.133.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26326154 (25M) [application/x-redhat-package-manager]
Saving to: 'elasticsearch-1.4.1.noarch.rpm'
100%[======================================>] 26,326,154  9.70MB/s   in 2.6s
2014-12-07 20:03:41 (9.70 MB/s) - 'elasticsearch-1.4.1.noarch.rpm' saved [26326154/26326154]

[ec2-user@ip-172-31-26-205 ~]$ sudo rpm -ivh elasticsearch-1.4.1.noarch.rpm
Preparing...                          ################################# [100%]
Updating / installing...
   1:elasticsearch-1.4.1-1            ################################# [100%]
### NOT starting on installation, please execute the following statements
to configure elasticsearch to start automatically using chkconfig
 sudo /sbin/chkconfig --add elasticsearch
### You can start elasticsearch by executing
 sudo service elasticsearch start

This time, we will not configure the service to start. Instead, we'll copy the JAR files we need into Flume's plugins directory architecture, which we learned about in the previous chapter:

[ec2-user@ip-172-31-26-205 ~]$ cd apache-flume-1.5.2-bin
[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ mkdir -p plugins.d/elasticsearch/libext
[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ cp /usr/share/elasticsearch/lib/*.jar plugins.d/elasticsearch/libext/

Now try running the Flume agent again:

[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ ./bin/flume-ng agent -n collector -c conf -f conf/collector.conf -Dflume.root.logger=INFO,console

No exceptions this time around, but still no data. Let's go back to the web server machine and set up the final Flume agent.


Setting up Flume on the client

On the first server, the web server, let's download the Flume binaries again and expand the package:

[ec2-user@ip-172-31-18-146 ~]$ wget http://apache.arvixe.com/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz
--2014-12-07 20:12:53-- http://apache.arvixe.com/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz
Resolving apache.arvixe.com (apache.arvixe.com)... 198.58.87.82
Connecting to apache.arvixe.com (apache.arvixe.com)|198.58.87.82|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25323459 (24M) [application/x-gzip]
Saving to: 'apache-flume-1.5.2-bin.tar.gz'
100%[=========================>] 25,323,459  8.13MB/s   in 3.0s
2014-12-07 20:12:56 (8.13 MB/s) - 'apache-flume-1.5.2-bin.tar.gz' saved [25323459/25323459]

[ec2-user@ip-172-31-18-146 ~]$ tar -zxf apache-flume-1.5.2-bin.tar.gz
[ec2-user@ip-172-31-18-146 ~]$ cd apache-flume-1.5.2-bin

This time, our Flume configuration takes the spool directory we set up earlier with logrotate as its input, and it needs to write compressed Avro out to the collector server. Create the client.conf file in Flume's configuration directory using your favorite editor:

[ec2-user@ip-172-31-18-146 apache-flume-1.5.2-bin]$ cat conf/client.conf
client.sources = sd
client.channels = m1
client.sinks = av

client.sources.sd.type = spooldir
client.sources.sd.spoolDir = /home/ec2-user/spool
client.sources.sd.deletePolicy = immediate
client.sources.sd.ignorePattern = access.log.1$
client.sources.sd.channels = m1

client.channels.m1.type = memory
client.channels.m1.capacity = 10000

client.sinks.av.type = avro
client.sinks.av.hostname = 172.31.26.205
client.sinks.av.port = 12345
client.sinks.av.compression-type = deflate
client.sinks.av.channel = m1

Again, for simplicity, we are using a memory channel with a 10,000-record capacity.

For the source, we configure the Spooling Directory Source with /home/ec2-user/spool as the input. Additionally, we configure the deletion policy to remove the files after sending is complete. This is also where we set up the exclusion rule for the access.log.1 filename pattern mentioned earlier. Note the dollar sign at the end of the pattern, denoting the end of the line. Without it, the exclusion pattern would also exclude valid files, such as access.log.1417975373.

Finally, an Avro sink is configured to point at the collector's private IP and port 12345. Additionally, we set the compression so that it matches the receiving Avro source's settings.

Nowlet’stryrunningtheagent:

[[email protected]]$./bin/flume-ngagent-n

client-cconf-fconf/client.conf-Dflume.root.logger=INFO,console

No exceptions! More importantly, I see the log files in the spool directory being processed and deleted:

2014-12-07 20:59:04,041 (pool-4-thread-1) [INFO - org.apache.flume.client.avro.ReliableSpoolingFileEventReader.deleteCurrentFile(ReliableSpoolingFileEventReader.java:390)] Preparing to delete file /home/ec2-user/spool/access.log.1417976401
2014-12-07 20:59:05,319 (pool-4-thread-1) [INFO - org.apache.flume.client.avro.ReliableSpoolingFileEventReader.deleteCurrentFile(ReliableSpoolingFileEventReader.java:390)] Preparing to delete file /home/ec2-user/spool/access.log.1417976701
2014-12-07 20:59:06,245 (pool-4-thread-1) [INFO - org.apache.flume.client.avro.ReliableSpoolingFileEventReader.deleteCurrentFile(ReliableSpoolingFileEventReader.java:390)] Preparing to delete file /home/ec2-user/spool/access.log.1417977001
2014-12-07 20:59:06,245 (pool-4-thread-1) [INFO - org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:254)] Spooling Directory Source runner has shutdown.
2014-12-07 20:59:06,746 (pool-4-thread-1) [INFO - org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:254)] Spooling Directory Source runner has shutdown.

Once all the files have been processed, the last lines are repeated every 500 milliseconds. This is a known bug in Flume (https://issues.apache.org/jira/browse/FLUME-2385). It has already been fixed and is slated for the 1.6.0 release, so be sure to set up log rotation on your Flume agent's own log and clean up this mess before you run out of disk space.
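
As a stopgap, you can point logrotate at the agent's own log as well. The stanza below is only a sketch: it assumes you run the agent with Flume's stock log4j settings, which write to flume.log under a logs directory inside the installation (we logged to the console in this example, so adjust the path to wherever your agent actually logs):

/home/ec2-user/apache-flume-1.5.2-bin/logs/flume.log {
    size 100M        # rotate once the log passes 100 MB (arbitrary threshold)
    rotate 3         # keep three old copies
    compress
    copytruncate     # truncate in place so the running agent keeps its file handle
    missingok
}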

On the collector box, we see writes occurring to Elasticsearch:

2014-12-07 22:10:03,694 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.elasticsearch.client.ElasticSearchTransportClient.execute(ElasticSearchTransportClient.java:181)] Sending bulk to elasticsearch cluster

Querying the Elasticsearch REST API, we can see an index with records:

[ec2-user@ip-172-31-26-120 ~]$ curl http://localhost:9200/_cat/indices?v
health status index            pri rep docs.count docs.deleted store.size pri.store.size
yellow open   flume-2014-12-07   5   1      32102            0      1.3mb          1.3mb

Let’sreadthefirst5records:


[ec2-user@ip-172-31-26-120 elasticsearch]$ curl -XGET 'http://localhost:9200/flume-2014-12-07/_search?pretty=true&q=*.*&size=5'
{
  "took" : 26,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 32102,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomjGVVbObD75ecNqJg",
      "_score" : 1.0,
      "_source" : {"@message":"207.222.127.224 - - [07/Dec/2014:18:01:47 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@fields":{}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomjGVWbObD75ecNqJj",
      "_score" : 1.0,
      "_source" : {"@message":"207.222.127.224 - - [07/Dec/2014:18:01:47 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@fields":{}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomjGVWbObD75ecNqJo",
      "_score" : 1.0,
      "_source" : {"@message":"207.222.127.224 - - [07/Dec/2014:18:01:47 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@fields":{}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomjGVWbObD75ecNqJt",
      "_score" : 1.0,
      "_source" : {"@message":"207.222.127.224 - - [07/Dec/2014:18:01:47 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@fields":{}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomjGVWbObD75ecNqJy",
      "_score" : 1.0,
      "_source" : {"@message":"207.222.127.224 - - [07/Dec/2014:18:01:47 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@fields":{}}
    } ]
  }
}

As you can see, the log lines are in there under the @message field, and they can now be searched. However, we can do better. Let's break that message down into searchable fields.


Creating more search fields with an interceptor

Let's borrow some code from what we covered earlier in this book to extract some Flume headers from this common log format, knowing that all Flume headers become fields in Elasticsearch. Since we are creating fields to search on in Elasticsearch, I'm going to add them to the collector's configuration rather than to the web server's Flume agent.

Change the agent configuration on the collector to include a Regular Expression Extractor interceptor:

[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ cat conf/collector.conf
collector.sources = av
collector.channels = m1
collector.sinks = es

collector.sources.av.type = avro
collector.sources.av.bind = 0.0.0.0
collector.sources.av.port = 12345
collector.sources.av.compression-type = deflate
collector.sources.av.channels = m1
collector.sources.av.interceptors = e1
collector.sources.av.interceptors.e1.type = regex_extractor
collector.sources.av.interceptors.e1.regex = ^([\\d.]+) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+)
collector.sources.av.interceptors.e1.serializers = ip dt url sc bc
collector.sources.av.interceptors.e1.serializers.ip.name = source
collector.sources.av.interceptors.e1.serializers.dt.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
collector.sources.av.interceptors.e1.serializers.dt.pattern = dd/MMM/yyyy:HH:mm:ss Z
collector.sources.av.interceptors.e1.serializers.dt.name = timestamp
collector.sources.av.interceptors.e1.serializers.url.name = http_request
collector.sources.av.interceptors.e1.serializers.sc.name = status_code
collector.sources.av.interceptors.e1.serializers.bc.name = bytes_xfered

collector.channels.m1.type = memory
collector.channels.m1.capacity = 10000

collector.sinks.es.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
collector.sinks.es.channel = m1
collector.sinks.es.hostNames = 172.31.26.120

To follow the Logstash convention, I've renamed the hostname header to source. Now, to make the new format easy to find, I'm going to delete the existing index:

[ec2-user@ip-172-31-26-120 ~]$ curl -XDELETE 'http://localhost:9200/flume-2014-12-07/'
{"acknowledged":true}
[ec2-user@ip-172-31-26-120 ~]$ curl -XGET 'http://localhost:9200/flume-2014-12-07/_search?pretty=true&q=*.*&size=5'
{
  "error" : "IndexMissingException[[flume-2014-12-07] missing]",
  "status" : 404
}

Next, I create more traffic on my web server and wait for it to appear at the other end. Then I query some records to see what they look like:

[ec2-user@ip-172-31-26-120 ~]$ curl -XGET 'http://localhost:9200/flume-2014-12-07/_search?pretty=true&q=*.*&size=5'
{
  "took" : 95,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 12083,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_H",
      "_score" : 1.0,
      "_source" : {"@message":"207.222.127.224 - - [07/Dec/2014:18:13:15 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:15.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975995000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_M",
      "_score" : 1.0,
      "_source" : {"@message":"207.222.127.224 - - [07/Dec/2014:18:13:16 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:16.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975996000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_R",
      "_score" : 1.0,
      "_source" : {"@message":"207.222.127.224 - - [07/Dec/2014:18:13:16 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:16.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975996000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_W",
      "_score" : 1.0,
      "_source" : {"@message":"207.222.127.224 - - [07/Dec/2014:18:13:16 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:16.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975996000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_a",
      "_score" : 1.0,
      "_source" : {"@message":"207.222.127.224 - - [07/Dec/2014:18:13:16 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:16.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975996000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    } ]
  }
}

As you can see, we now have additional fields that we can search by. Additionally, you can see that Elasticsearch has taken our millisecond-based timestamp field and created its own @timestamp field in ISO 8601 format.

Now we can run some more interesting queries, such as finding out how many successful (status 200) pages we saw:

[ec2-user@ip-172-31-26-120 elasticsearch]$ curl -XGET 'http://localhost:9200/flume-2014-12-07/_count' -d '{"query":{"term":{"status_code":"200"}}}'
{"count":46018,"_shards":{"total":5,"successful":5,"failed":0}}


Setting up a better user interface – Kibana

While it appears as if we are done, there is one more thing we should do. The data is all there, but you need to be an expert in querying Elasticsearch to make good use of it. After all, if the data is difficult to consume and gets ignored, then why bother collecting it at all? What you really need is a nice, searchable web interface that people with nontechnical backgrounds can use. For this, we are going to set up Kibana. In a nutshell, Kibana is a web application that runs as a dynamic HTML page in your browser, making calls to Elasticsearch for data when necessary. The result is an interactive web interface that doesn't require you to learn the details of the Elasticsearch query API. This is not the only option available to you; it is just what I'm using in this example. Let's download Kibana 3 from the Elasticsearch website and install it on the same server (although you could easily serve it from another HTTP server):

[ec2-user@ip-172-31-26-120 ~]$ wget https://download.elasticsearch.org/kibana/kibana/kibana-3.1.2.tar.gz
--2014-12-07 18:31:16-- https://download.elasticsearch.org/kibana/kibana/kibana-3.1.2.tar.gz
Resolving download.elasticsearch.org (download.elasticsearch.org)... 54.225.133.195, 54.243.77.158, 107.22.222.16, ...
Connecting to download.elasticsearch.org (download.elasticsearch.org)|54.225.133.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1074306 (1.0M) [application/octet-stream]
Saving to: 'kibana-3.1.2.tar.gz'
100%[======================================>] 1,074,306   1.33MB/s   in 0.8s
2014-12-07 18:31:17 (1.33 MB/s) - 'kibana-3.1.2.tar.gz' saved [1074306/1074306]

[ec2-user@ip-172-31-26-120 ~]$ tar -zxf kibana-3.1.2.tar.gz
[ec2-user@ip-172-31-26-120 ~]$ cd kibana-3.1.2

Note: At the time of writing this book, a newer version of Kibana (Kibana 4) is in beta. These instructions may be outdated by the time this book is released, but rest assured that somewhere in the Kibana setup, you can edit a configuration file to point it at your Elasticsearch server. I have not tried the newer version yet, but there is a good overview of it on the Elasticsearch blog at http://www.elasticsearch.org/blog/kibana-4-beta-3-now-more-filtery.

Open the config.js file and edit the line with the elasticsearch key. This needs to be the URL of the public name or IP of the Elasticsearch API. For our example, this line should look as shown here:

elasticsearch: 'http://ec2-54-148-230-252.us-west-2.compute.amazonaws.com:9200',


Now we need to serve this directory to a web browser. Let's download and install Nginx again:

[ec2-user@ip-172-31-26-120 ~]$ sudo yum install nginx

By default, the root directory is /usr/share/nginx/html. We could change this in the Nginx configuration, but to make things easy, let's just create a symbolic link to point to the right location. First, move the original path out of the way by renaming it:

[ec2-user@ip-172-31-26-120 ~]$ sudo mv /usr/share/nginx/html /usr/share/nginx/html.dist

Next, link the configured Kibana directory as the new web root:

[ec2-user@ip-172-31-26-120 ~]$ sudo ln -s ~/kibana-3.1.2 /usr/share/nginx/html

Finally, start the web server:

[ec2-user@ip-172-31-26-120 ~]$ sudo /etc/init.d/nginx start
Starting nginx: [OK]

From your computer, go to http://54.148.230.252/. If you see this page, it means you may have made a mistake in your Kibana configuration:

This error page means your web browser can't connect to Elasticsearch. You may need to clear your browser cache if you fixed the configuration, as web pages are typically cached locally for some period of time. If you got it right, the screen should look like this:


Go ahead and select Sample Dashboard, the first option. You should see something like what is shown in the next screenshot. This includes some of the data we ingested, record counts in the center, and the filtering fields in the left-hand margin.


I’mnotgoingtoclaimtobeaKibanaexpert(I’mfarfromthat),soI’llleavefurthercustomizationofthistoyou.Usethisasabasetogobackandmakeadditionalmodificationstothedatatomakeiteasiertoconsume,search,orfilter.Todothat,you’llprobablyneedtogetmorefamiliarwithhowElasticsearchworks,butthat’sokaybecauseknowledgeisn’tabadthing.Acopiousamountofdocumentationiswaitingforyouathttp://www.elasticsearch.org/guide/en/kibana/current/index.html.

At this point, we have completed an end-to-end implementation: data from a web server is streamed in near real time to a web-based tool for searching. Since both the source and target formats were dictated by others, we used an interceptor to transform the data en route. This use case is very good for short-term troubleshooting, but it's clearly not very "Hadoopy", since we have yet to use core Hadoop.


Archiving to HDFS

When people speak of Hadoop, they usually refer to storing lots of data for a long time, usually in HDFS, so that more interesting data science or machine learning can be done later. Let's extend our use case by splitting the data flow at the collector to store an extra copy in HDFS for later use.

So, back in Amazon AWS, I start a fourth server to run Hadoop. If you plan on doing all your work in Hadoop, you'll probably want to write this data to S3, but for this example, let's stick with HDFS. Now our server diagram looks like this:

IusedCloudera’sone-lineinstallationinstructionstospeedupthesetup.It’sinstructionscanbefoundathttp://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_qs_mrv1_pseudo.html

Since the Amazon AMI is compatible with Enterprise Linux 6, I selected the EL6 RPM repository and imported the corresponding GPG key:

[ec2-user@ip-172-31-7-175 ~]$ sudo rpm -ivh http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm
Retrieving http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm
Preparing...                          ################################# [100%]
Updating / installing...
   1:cloudera-cdh-5-0                 ################################# [100%]
[ec2-user@ip-172-31-7-175 ~]$ sudo rpm --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera

Next, I installed the pseudo-distributed configuration to run a single-node Hadoop cluster:

[ec2-user@ip-172-31-7-175 ~]$ sudo yum install -y hadoop-0.20-conf-pseudo

This might take a while, as it downloads all of the Cloudera Hadoop distribution dependencies.

Since this configuration is for single-node use, we need to adjust the fs.defaultFS property in /etc/hadoop/conf/core-site.xml to advertise our private IP instead of localhost. If we don't do this, the namenode process will bind to 127.0.0.1, and other servers, such as our collector's Flume agent, will not be able to contact it:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://172.31.7.175:8020</value>
</property>

Next, we format the new HDFS volume and start the HDFS daemons (since that is all we need for this example):

[ec2-user@ip-172-31-7-175 ~]$ sudo -u hdfs hdfs namenode -format
[ec2-user@ip-172-31-7-175 ~]$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
starting datanode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-datanode-ip-172-31-7-175.out
Started Hadoop datanode (hadoop-hdfs-datanode): [OK]
starting namenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-namenode-ip-172-31-7-175.out
Started Hadoop namenode: [OK]
starting secondarynamenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-secondarynamenode-ip-172-31-7-175.out
Started Hadoop secondarynamenode: [OK]
[ec2-user@ip-172-31-7-175 ~]$ hadoop fs -df
Filesystem                  Size   Used   Available  Use%
hdfs://172.31.7.175:8020  8318783488  24576  6661824512    0%

Now that HDFS is running, let's go back to the collector box configuration to create a second channel and an HDFS sink by adding these lines:

collector.channels.h1.type = memory
collector.channels.h1.capacity = 10000

collector.sinks.hadoop.type = hdfs
collector.sinks.hadoop.channel = h1
collector.sinks.hadoop.hdfs.path = hdfs://172.31.7.175/access_logs/%Y/%m/%d/%H
collector.sinks.hadoop.hdfs.filePrefix = access
collector.sinks.hadoop.hdfs.rollInterval = 60
collector.sinks.hadoop.hdfs.rollSize = 0
collector.sinks.hadoop.hdfs.rollCount = 0

Then we modify the top-level channels and sinks keys:

collector.channels = m1 h1
collector.sinks = es hadoop

Asyoucansee,I’vegonewithasimplememorychannelagain,butfeelfreetousea

Page 222: Apache Flume: Distributed Log Collection - javaarmjavaarm.com/.../Apache.Flume-Distributed.Log.Collection.for.Hadoop... · Apache Flume: Distributed Log Collection for Hadoop Second

durablefilechannelifyouneedit.FortheHDFSconfiguration,I’llbeusingadatedfilepathfromthe/access_logsrootdirectorywitha60-secondrotationregardlessofsize.Wearenotalteringthesourcejustyet,sodon’tworry.

If we attempt to start the collector now, we see this exception:

java.lang.NoClassDefFoundError: org/apache/hadoop/io/SequenceFile$CompressionType

Remember that for Apache Flume to speak to HDFS, we need compatible HDFS classes and dependencies for the version of Hadoop we are speaking to. Let's get some help from our friends at Cloudera and install the Hadoop client RPM (output removed to save paper):

[ec2-user@ip-172-31-26-205 ~]$ sudo rpm -ivh http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm
[ec2-user@ip-172-31-26-205 ~]$ sudo rpm --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
[ec2-user@ip-172-31-26-205 ~]$ sudo yum install hadoop-client

Test whether the RPM works by creating the destination directory for our data. We'll also set permissions to match the account we'll be running the Flume agent under (ec2-user in this case):

[ec2-user@ip-172-31-26-205 ~]$ sudo -u hdfs hadoop fs -mkdir hdfs://172.31.7.175/access_logs
[ec2-user@ip-172-31-26-205 ~]$ sudo -u hdfs hadoop fs -chown ec2-user:ec2-user hdfs://172.31.7.175/access_logs
[ec2-user@ip-172-31-26-205 ~]$ hadoop fs -ls hdfs://172.31.7.175/
Found 1 items
drwxr-xr-x   - ec2-user ec2-user          0 2014-12-11 04:35 hdfs://172.31.7.175/access_logs

Now, if we run the collector Flume agent, we see no exceptions due to missing HDFS classes. The Flume startup script detects that we have a local Hadoop installation and appends its classpath to its own.

Finally, we can go back to the Flume configuration file and split the incoming data at the source by listing both channels on the source as destination channels:

collector.sources.av.channels = m1 h1

By default, a replicating channel selector is used. This is what we want, so no further configuration is needed. Save the configuration file and restart the Flume agent.
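
For reference, spelling out that default explicitly would look like the first line below. The second line is only a sketch of how you could mark the HDFS channel as optional so that a full h1 doesn't stall the Elasticsearch path; we are not using it in this example, and it means events can be silently dropped from that channel:

collector.sources.av.selector.type = replicating
# Hypothetical: treat h1 as best-effort so a full channel doesn't block the source
collector.sources.av.selector.optional = h1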

When data is flowing, you should see the expected HDFS activity in the Flume collector agent's logs:

2014-12-11 05:05:04,141 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:261)] Creating hdfs://172.31.7.175/access_logs/2014/12/11/05/access.1418274302083.tmp


You should also see data appearing in HDFS:

[ec2-user@ip-172-31-7-175 ~]$ hadoop fs -ls -R /access_logs
drwxr-xr-x   - ec2-user ec2-user          0 2014-12-11 05:05 /access_logs/2014
drwxr-xr-x   - ec2-user ec2-user          0 2014-12-11 05:05 /access_logs/2014/12
drwxr-xr-x   - ec2-user ec2-user          0 2014-12-11 05:05 /access_logs/2014/12/11
drwxr-xr-x   - ec2-user ec2-user          0 2014-12-11 05:06 /access_logs/2014/12/11/05
-rw-r--r--   3 ec2-user ec2-user      10486 2014-12-11 05:05 /access_logs/2014/12/11/05/access.1418274302082
-rw-r--r--   3 ec2-user ec2-user      10486 2014-12-11 05:05 /access_logs/2014/12/11/05/access.1418274302083
-rw-r--r--   3 ec2-user ec2-user       6429 2014-12-11 05:06 /access_logs/2014/12/11/05/access.1418274302084

If you run this command from a server other than the Hadoop node, you'll need to specify the full HDFS URI (hdfs://172.31.7.175/access_logs) or set the fs.defaultFS property in Hadoop's core-site.xml configuration file.

Note: In retrospect, I probably should have configured the fs.defaultFS property on any installed Hadoop client that will use HDFS to save myself a lot of typing. Live and learn!

Like the preceding Elasticsearch implementation, you can now use this example as a base end-to-end configuration for HDFS. The next step will be to go back and modify the format, compression, and so on, on the HDFS sink to better match what you'll do with the data later.
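
For example, if plain text files turn out to be too bulky, a compressed variant of this sink could look like the following sketch (the gzip choice is arbitrary; pick a codec your downstream jobs can read, and see Chapter 4, Sinks and Sink Processors, for the full set of options):

collector.sinks.hadoop.hdfs.fileType = CompressedStream
collector.sinks.hadoop.hdfs.codeC = gzip    # compression codec applied to each rolled file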


Summary

In this chapter, we iteratively assembled an end-to-end data flow. We started by setting up an Nginx web server to create access logs. We also configured cron to execute a logrotate configuration periodically to safely rotate old logs into a spooling directory.

Next, we installed and configured a single-node Elasticsearch server and tested some insertions and deletions. Then we configured a Flume client to read input from our spooling directory filled with web logs and relay it to a Flume collector using compressed Avro serialization. The collector then relayed the incoming data to our Elasticsearch server.

Once we saw data flowing from one end to the other, we set up a single-node HDFS server and modified our collector configuration to split the input data feed and relay a copy of each message to HDFS, simulating archival storage. Finally, we set up a Kibana UI in front of our Elasticsearch instance to provide an easy search function for nontechnical consumers.

In the next chapter, we will cover monitoring Flume data flows using Ganglia.


Chapter 8. Monitoring Flume

The user guide for Flume states:

"Monitoring in Flume is still a work in progress. Changes can happen very often. Several Flume components report metrics to the JMX platform MBean server. These metrics can be queried using Jconsole."

While JMX is fine for casual browsing of metric values, the number of eyeballs looking at Jconsole doesn't scale when you have hundreds or even thousands of servers sending data all over the place. What you need is a way to watch everything at once. However, what are the important things to look for? That is a very difficult question, but I'll try to cover several of the items that are important as we cover the monitoring options in this chapter.


Monitoring the agent process

The most obvious type of monitoring you'll want to perform is Flume agent process monitoring, that is, making sure the agent is still running. There are many products that do this kind of process monitoring, so there is no way we can cover them all. If you work at a company of any reasonable size, chances are there is already a system in place for this. If this is the case, do not go off and build your own. The last thing operations wants is yet another screen to watch 24/7.


Monit

If you do not already have something in place, one freemium option is Monit (http://mmonit.com/monit/). The developers of Monit have a paid version that provides more bells and whistles you may want to consider. Even in the free form, it can provide you with a way to check whether the Flume agent is running, restart it if it isn't, and send you an e-mail when this happens so that you can look into why it died.

Monit does much more, but this functionality is what we will cover here. If you are smart, and I know you are, you will add checks on disk, CPU, and memory usage as a minimum, in addition to what we cover in this chapter.
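
A minimal Monit stanza for a Flume agent might look like the sketch below. The pidfile path and init script name are assumptions for illustration only; they depend entirely on how you package and launch your agents:

check process flume-agent with pidfile /var/run/flume/flume-agent.pid
  start program = "/etc/init.d/flume-ng-agent start"    # hypothetical init script
  stop program  = "/etc/init.d/flume-ng-agent stop"
  if 5 restarts within 5 cycles then alert               # nag me if it keeps dying

You'd also need Monit's global mail settings (set mailserver and set alert) configured for the e-mail notifications to actually go out.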


Nagios

Another option for Flume agent process monitoring is Nagios (http://www.nagios.org/). Like Monit, you can configure it to watch your Flume agents and alert you via a web UI, e-mail, or an SNMP trap. That said, it doesn't have restart capabilities. The community is quite strong, and there are many plugins available for other applications.

My company uses this to check the availability of Hadoop web UIs. While not a complete picture of health, it does provide more information to the overall monitoring of our Hadoop ecosystem.

Again, if you already have tools in place at your company, see whether you can reuse them before bringing in another tool.


Monitoring performance metrics

Now that we have covered some options for process monitoring, how do you know whether your application is actually doing the work you think it is? On many occasions, I've seen a stuck syslog-ng process that appeared to be running, but it just wasn't sending any data. I'm not picking on syslog-ng specifically; all software does this when conditions it was not designed for occur.

When talking about Flume data flows, you need to monitor the following:

- Data entering sources is within expected rates
- Data isn't overflowing your channels
- Data is exiting sinks at expected rates

Flume has a pluggable monitoring framework, but as mentioned at the beginning of the chapter, it is still very much a work in progress. This does not mean you shouldn't use it, as that would be foolish. It means you'll want to plan extra testing and integration time any time you upgrade.

Note: While not covered in the Flume documentation, it is common to enable JMX in your Flume JVM (http://bit.ly/javajmx) and use the Nagios JMX plugin (http://bit.ly/nagiosjmx) to alert you about performance abnormalities in your Flume agents.
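
Enabling JMX is just a matter of passing the standard JVM flags to the agent, for example through JAVA_OPTS in conf/flume-env.sh. The port and the disabled authentication below are illustrative choices only; lock this down properly on any real network:

export JAVA_OPTS="$JAVA_OPTS -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=5445 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"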


Ganglia

One of the available monitoring options for watching Flume's internal metrics is Ganglia integration. Ganglia (http://ganglia.sourceforge.net/) is an open source monitoring tool that is used to collect metrics and display graphs, and it can be tiered to handle very large installations. To send your Flume metrics to your Ganglia cluster, you need to pass some properties to your agent at startup time:

Java property                   Value                       Description
flume.monitoring.type           ganglia                     Set to ganglia
flume.monitoring.hosts          host1:port1,host2:port2     A comma-separated list of host:port pairs for your gmond process(es)
flume.monitoring.pollFrequency  60                          The number of seconds between sending of data (the default is 60 seconds)
flume.monitoring.isGanglia3     false                       Set to true if using the older Ganglia 3 protocol. The default is to send data using the v3.1 protocol.

Look at each instance of gmond within the same network broadcast domain (as reachability is based on multicast packets), and find the udp_recv_channel block in gmond.conf. Let's say I had two nearby servers with these two corresponding configuration blocks:

udp_recv_channel {
  mcast_join = 239.2.14.22
  port = 8649
  bind = 239.2.14.22
  retry_bind = true
}

udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
  retry_bind = true
}

In this case, the IP and port are 239.2.14.22/8649 for the first server and 239.2.11.71/8649 for the second, leading to these startup properties:

-Dflume.monitoring.type=ganglia
-Dflume.monitoring.hosts=239.2.14.22:8649,239.2.11.71:8649

Here, I am using the defaults for the poll interval, and I'm also using the newer Ganglia wire protocol.

Note: While receiving data via TCP is supported in Ganglia, the current Flume/Ganglia integration only supports sending data using multicast UDP. If you have a large or complicated network setup, you'll want to get educated by your network engineers if things don't work as you expect.


Internal HTTP server

You can configure the Flume agent to start an HTTP server that will output JSON that can be queried by outside mechanisms. Unlike the Ganglia integration, an external entity has to call into the Flume agent to poll the data. In theory, you could use Nagios to poll this JSON data and alert on certain conditions, but I have personally never tried it. Of course, this setup is very useful in development and testing, especially if you are writing custom Flume components and want to be sure they are generating useful metrics. Here is a summary of the Java properties you'll need to set at the startup of the Flume agent:

Java property           Value   Description
flume.monitoring.type   http    The value is set to http
flume.monitoring.port   PORT    The port number to bind the HTTP server to

The URL for metrics will be http://SERVER_OR_IP_OF_AGENT:PORT/metrics.

Let’slookatthefollowingFlumeconfiguration:

agent.sources=s1

agent.channels=c1

agent.sinks=k1

agent.sources.s1.type=avro

agent.sources.s1.bind=0.0.0.0

agent.sources.s1.port=12345

agent.sources.s1.channels=c1

agent.channels.c1.type=memory

agent.sinks.k1.type=avro

agent.sinks.k1.hostname=192.168.33.33

agent.sinks.k1.port=9999

agent.sinks.k1.channel=c1

StarttheFlumeagentwiththeseproperties:

-Dflume.monitoring.type=http

-Dflume.monitoring.port=44444

Now, when you go to http://SERVER_OR_IP:44444/metrics, you might see something like this:

{
  "SOURCE.s1": {
    "OpenConnectionCount": "0",
    "AppendBatchAcceptedCount": "0",
    "AppendBatchReceivedCount": "0",
    "Type": "SOURCE",
    "EventAcceptedCount": "0",
    "AppendReceivedCount": "0",
    "StopTime": "0",
    "EventReceivedCount": "0",
    "StartTime": "1365128622891",
    "AppendAcceptedCount": "0"},
  "CHANNEL.c1": {
    "EventPutSuccessCount": "0",
    "ChannelFillPercentage": "0.0",
    "Type": "CHANNEL",
    "StopTime": "0",
    "EventPutAttemptCount": "0",
    "ChannelSize": "0",
    "StartTime": "1365128621890",
    "EventTakeSuccessCount": "0",
    "ChannelCapacity": "100",
    "EventTakeAttemptCount": "0"},
  "SINK.k1": {
    "BatchCompleteCount": "0",
    "ConnectionFailedCount": "4",
    "EventDrainAttemptCount": "0",
    "ConnectionCreatedCount": "0",
    "BatchEmptyCount": "0",
    "Type": "SINK",
    "ConnectionClosedCount": "0",
    "EventDrainSuccessCount": "0",
    "StopTime": "0",
    "StartTime": "1365128622325",
    "BatchUnderflowCount": "0"}
}

As you can see, each source, sink, and channel is broken out separately with its corresponding metrics. Each type of source, channel, and sink provides its own set of metric keys, although there is some commonality, so be sure to check what looks interesting. For instance, this Avro source has OpenConnectionCount, that is, the number of connected clients (which are most likely sending data in). This may help you decide whether you have the expected number of clients feeding that source or, perhaps, too many clients, in which case you need to start tiering your agents.

Generally speaking, the channel's ChannelSize or ChannelFillPercentage metrics will give you a good idea of whether data is coming in faster than it is going out. They will also tell you whether you have the channel set large enough for maintenance or outages, given your data volume.

Looking at the sink, EventDrainSuccessCount versus EventDrainAttemptCount will tell you how often output is successful compared to how often it was tried. In this example, I configured an Avro sink pointing at a nonexistent target. As you can see, the ConnectionFailedCount metric is growing, which is a good indicator of persistent connection problems. Even a growing ConnectionCreatedCount metric can indicate that connections are dropping and reopening too often.

Really, there are no hard and fast rules besides watching ChannelSize/ChannelFillPercentage. Each use case will have its own performance profile, so start small, set up your monitoring, and learn as you go.
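
If you want something quick and dirty while you decide on a proper monitoring system, you can poll the metrics endpoint from a shell script. This is only a sketch: the host, port, channel name, and the 90 percent threshold are made-up values for illustration, and it assumes curl, python, and bc are available on the box:

#!/bin/bash
# Poll the Flume agent's HTTP metrics and warn when a channel is nearly full.
METRICS_URL="http://localhost:44444/metrics"    # hypothetical agent host and port
FILL=$(curl -s "$METRICS_URL" | python -c 'import json,sys; print(json.load(sys.stdin)["CHANNEL.c1"]["ChannelFillPercentage"])')
# 90 is an arbitrary alert threshold; bc handles the floating-point comparison
if [ "$(echo "$FILL > 90" | bc -l)" -eq 1 ]; then
    echo "WARNING: channel c1 is ${FILL}% full"
fi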


Custom monitoring hooks

If you already have a monitoring system, you may want to make the extra effort to develop a custom monitoring reporting mechanism. You may think this is as simple as implementing the org.apache.flume.instrumentation.MonitorService interface. You do need to do this, but looking at the interface, you will only see a start() and a stop() method. Unlike the more obvious interceptor paradigm, the agent expects that your MonitorService implementation will start/stop a thread to send data on the expected or configured interval if it is the type that sends data to a receiving service. If you are going to operate a service, such as the HTTP service, then start/stop would be used to start and stop your listening service. The metrics themselves are published internally to JMX by the various sources, sinks, channels, and interceptors, using object names that start with org.apache.flume. Your implementation will need to read these from MBeanServer.

Note: The best advice I can give you, should you decide to implement your own, is to look at the source of the two existing implementations (included in the source download referenced in Chapter 2, A Quick Start Guide to Flume) and do what they do. To use your monitoring hook, set the flume.monitoring.type property to the fully qualified class name of your implementation class. Expect to have to rework any custom hooks with new Flume versions until the framework matures and stabilizes.


Summary

In this chapter, we covered monitoring Flume agents, both at the process level and in terms of internal metrics (whether the agent is actually doing work).

Monit and Nagios were introduced as open source options for process watching.

Next, we covered the Flume agent's internal monitoring metrics with the Ganglia and JSON-over-HTTP implementations that ship with Apache Flume.

Finally, we covered how to integrate a custom monitoring implementation if you need to directly integrate with some other tool that isn't supported by Flume by default.

In our final chapter, we will discuss some general considerations for your Flume deployment.


Chapter 9. There Is No Spoon – the Realities of Real-time Distributed Data Collection

In this last chapter, I thought we should cover some of the less concrete, random thoughts I have around data collection into Hadoop. There's no hard science behind some of this, and you should feel perfectly alright to disagree with me.

While Hadoop is a great tool for consuming vast quantities of data, I often think of a picture of the logjam that occurred in 1886 on the St. Croix River in Minnesota (http://www.nps.gov/sacn/historyculture/stories.htm). When dealing with too much data, you want to make sure you don't jam your river. Be sure to take the previous chapter on monitoring seriously and not just as nice-to-have information.


Transport time versus log time

I had a situation where data was being placed using date patterns in the filename and/or the path in HDFS that didn't match the contents of the directories. The expectation was that the data in the 2014/12/29 directory path contained all the data for December 29, 2014. However, the reality was that the date was being pulled from the transport. It turns out that the version of syslog we were using was rewriting the header, including the date portion, causing the data to take on the transport time rather than reflecting the original time of the record. Usually, the offsets were tiny, just a second or two, so nobody really took notice. However, one day, one of the relay servers died, and when the data that had got stuck on the upstream servers was finally sent, it had the current time. In this case, it was shifted by a couple of days, causing a significant data cleanup effort.

Be sure this isn't happening to you if you are placing data by date. Check the date edge cases to see that they are what you expect, and make sure you test your outage scenarios before they happen for real in production.

As I mentioned previously, retransmits due to planned or unplanned maintenance (or even a tiny network hiccup) will most likely cause duplicate and out-of-order events to arrive, so be sure to account for this when processing raw data. There are no single-delivery or ordering guarantees in Flume. If you need that, use a transactional database or a distributed transaction log such as Apache Kafka (http://kafka.apache.org/) instead. Of course, if you are going to use Kafka, you would probably only use Flume for the final leg of your data path, with your source consuming events from Kafka (https://github.com/baniuyao/flume-ng-kafka-source).

Note: Remember that you can always work around duplicates in your data at query time as long as you can uniquely identify your events from one another. If you cannot distinguish events easily, you can add a Universally Unique Identifier (UUID) (http://en.wikipedia.org/wiki/Universally_unique_identifier) header using the bundled interceptor, UUIDInterceptor (configuration details are in the Flume User Guide).
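
A sketch of wiring that interceptor into a source is shown below. The agent and source names are placeholders, and the builder class name is the one used by recent 1.x releases; double-check both it and the header property against the user guide for your version:

a1.sources.r1.interceptors = uuid
a1.sources.r1.interceptors.uuid.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
a1.sources.r1.interceptors.uuid.headerName = eventId    # hypothetical header name to stamp on each event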


Time zones are evil

In case you missed my bias against using local time in Chapter 4, Sinks and Sink Processors, I'll repeat it here a little stronger: time zones are evil. Evil like Dr. Evil (http://en.wikipedia.org/wiki/Dr._Evil), and let's not forget about his Mini-Me counterpart (http://en.wikipedia.org/wiki/Mini-Me): Daylight Savings Time.

We live in a global world now. You are pulling data from all over the place into your Hadoop cluster. You may even have multiple data centers in different parts of the country (or the world). The last thing you want to be doing while trying to analyze your data is to deal with skewed data. Daylight Savings Time changes at least somewhere on Earth a dozen times a year. Just look at the history: ftp://ftp.iana.org/tz/releases/. Save yourself the headache and just normalize the data to UTC. If you want to convert it to "local time" on its way to human eyeballs, feel free. However, while it lives in your cluster, keep it normalized to UTC.

Note: Consider adopting UTC everywhere via this Java startup parameter (if you can't set it system-wide): -Duser.timezone=UTC
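
If you'd rather not add the flag to every launch command, one place to put it (assuming you start agents through the standard scripts) is conf/flume-env.sh:

export JAVA_OPTS="$JAVA_OPTS -Duser.timezone=UTC"    # force the agent's JVM to UTC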

Also, use the ISO 8601 (http://en.wikipedia.org/wiki/ISO_8601) time standard where possible, and be sure to include time zone information (even if it is UTC). Every modern tool on the planet supports this format, and it will save you pain down the road.

I live in Chicago, and our computers at work use Central Time, which adjusts for daylight savings. In our Hadoop cluster, we like to keep data in a YYYY/MM/DD/HH directory layout. Twice a year, some things break slightly. In the fall, we have twice as much data in our 2 a.m. directory. In the spring, there is no 2 a.m. directory. Madness!


Capacity planning

Regardless of how much data you think you have, things will change over time. New projects will pop up and data creation rates for your existing projects will change (up or down). Data volume will usually ebb and flow with the traffic of the day. Finally, the number of servers feeding your Hadoop cluster will change over time.

There are many schools of thought on how much extra storage capacity you should keep in your Hadoop cluster (we use the totally unscientific value of 20 percent, which means that we usually plan for 80 percent full when ordering additional hardware, but don't start to panic until we hit the 85 to 90 percent utilization number). Generally, you want to keep enough extra space so that the failure and/or maintenance of a server or two won't cause the HDFS block replication to consume all the remaining space.

You may also need to set up multiple flows inside a single agent. The source and sink processors are currently single-threaded, so there is a limit to what tuning batch sizes can accomplish under heavy data volumes. Be very careful in situations where you split your data flow at the source, using a replicating channel selector to write to multiple channels/sinks. If one of the paths' channels fills up, an exception is thrown back to the source. Unless that full channel is marked as optional (in which case its copy of the data is simply dropped), the source will stop consuming new data. This effectively jams the agent for all other channels attached to that source. You may not want to mark the channel as optional and drop the data, because the data is important. Unfortunately, this is the only fan-out mechanism provided in Flume to send to multiple destinations, so make sure you catch issues quickly so that all your data flows are not impaired due to a cascade backup of events.
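
For reference, here is a minimal sketch of that fan-out, assuming an agent named agent, a source named s1, and two channels named c1 and c2 (all placeholders), where the c2 path is the less important one and is marked optional so a backup there won't stall the whole source:

# replicate every event to both channels
agent.sources.s1.channels = c1 c2
agent.sources.s1.selector.type = replicating
# if c2 fills up, drop its copy of the event instead of blocking the source
agent.sources.s1.selector.optional = c2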

As for the number of Flume agents feeding Hadoop, this too should be adjusted based on real numbers. Watch the channel size to see how well the writes are keeping up under normal loads. Adjust the maximum channel capacity to handle whatever amount of overhead makes you feel good. You can always purchase way more hardware than you need, but even a prolonged outage may overflow even the most conservative estimates. This is when you have to pick and choose which data is more important to you and adjust your channel capacities to reflect that. This way, if you exceed your limits, the least important data will be the first to be dropped.
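
As an illustration only (the channel names and capacities below are made-up placeholders, not recommendations), giving your critical feed a deeper buffer than a best-effort feed might look like this:

# deep buffer for the data you cannot afford to lose during an outage
agent.channels.criticalChannel.type = file
agent.channels.criticalChannel.capacity = 2000000
# shallower buffer for lower-value data; it overflows (and drops) first
agent.channels.bestEffortChannel.type = file
agent.channels.bestEffortChannel.capacity = 100000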

Chances are your company doesn't have an infinite amount of money and, at some point, the value of the data versus the cost of continuing to expand your cluster will start to be questioned. This is why setting limits on the volume of data collected is very important. This is just one aspect of your data retention policy, where cost is the driving factor. In a moment, we'll discuss some of the compliance aspects of this policy. Suffice it to say, any project sending data into Hadoop should be able to say what the value of that data is and what the loss would be if the older data were deleted. This is the only way the people writing the checks can make an informed decision.


Considerations for multiple data centers

If you run your business out of multiple data centers and have a large volume of data collected, you may want to consider setting up a Hadoop cluster in each data center rather than sending all your collected data back to a single data center. There may be regulatory implications regarding data crossing certain geographic boundaries. Chances are there is somebody in your company who knows much more about compliance than you or I, so seek them out before you start copying data across borders. Of course, not collating your data will make it more difficult to analyze, as you can't just run one MapReduce job against all the data. Instead, you would have to run parallel jobs and then combine the results in a second pass. Adjusting your data processing procedures is better than potentially breaking the law. Be sure to do your homework.

Pulling all your data into a single cluster may also be more than your networking can handle. Depending on how your data centers are connected to each other, you simply may not be able to transmit the desired volume of data. If you use public cloud services, there are surely data transfer costs between data centers. Finally, consider that a complete cluster failure or corruption may wipe out everything, as most clusters are usually too big to back up everything except high-value data. Having some of the old data in this case is sometimes better than having nothing. With multiple Hadoop clusters, you have the ability to use a Failover Sink Processor to forward data to a different cluster if you don't want to wait to send to the local one.
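
A minimal sketch of such a failover group, assuming an agent named agent with two already-defined sinks named localCluster and remoteCluster (placeholder names); the higher-priority sink is used as long as it stays healthy:

agent.sinkgroups = g1
agent.sinkgroups.g1.sinks = localCluster remoteCluster
agent.sinkgroups.g1.processor.type = failover
# prefer the local cluster; fail over to the remote one only when it is down
agent.sinkgroups.g1.processor.priority.localCluster = 10
agent.sinkgroups.g1.processor.priority.remoteCluster = 5
# wait up to 10 seconds before retrying a failed sink
agent.sinkgroups.g1.processor.maxpenalty = 10000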

If you do choose to send all your data to a single destination, consider adding a large disk capacity machine as a relay server for the data center. This way, if there is a communication issue or extended cluster maintenance, you can let data pile up on a machine that's different from the ones trying to service your customers. This is sound advice even in a single data center situation.


Compliance and data expiry

Remember that the data your company is collecting from your customers should be considered sensitive information. You may be bound by additional regulatory limitations on accessing data such as:

- Personally identifiable information (PII): how you handle and safeguard your customers' identities (http://en.wikipedia.org/wiki/Personally_identifiable_information)
- Payment Card Industry Data Security Standard (PCI DSS): how you safeguard credit card information (http://en.wikipedia.org/wiki/PCI_DSS)
- Service Organization Control (SOC-2): how you control access to information/systems (http://www.aicpa.org/InterestAreas/FRC/AssuranceAdvisoryServices/Pages/AICPASOC2Report.aspx)
- Statements on Standards for Attestation Engagements (SSAE-16): how you manage changes (http://www.aicpa.org/Research/Standards/AuditAttest/DownloadableDocuments/AT-00801.pdf)
- Sarbanes Oxley (SOX): http://en.wikipedia.org/wiki/Sarbanes%E2%80%93Oxley_Act

This is by no means a definitive list, so be sure to seek out your company's compliance experts for what does and doesn't apply to your situation. If you aren't properly handling access to this data in your cluster, the government will lean on you, or worse, you won't have customers anymore if they feel you aren't protecting their personal information. Consider scrambling, trimming, or obfuscating your data of personal information. Chances are the business insight you are looking for falls more into the category of "how many people who search for 'hammer' actually buy one?" rather than "how many customers are named Bob?" As you saw in Chapter 6, Interceptors, ETL, and Routing, it would be very easy to write an interceptor to obfuscate PII as you move it around.
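
If a custom interceptor feels like overkill, a sketch of the same idea using the bundled search-and-replace interceptor might look like the following; the agent and source names (agent and s1) and the pattern are illustrative placeholders, and you should verify the interceptor's availability and properties in the Flume User Guide for your release:

agent.sources.s1.interceptors = mask
agent.sources.s1.interceptors.mask.type = search_replace
# crudely redact anything that looks like a 13-16 digit card number in the event body
agent.sources.s1.interceptors.mask.searchPattern = [0-9]{13,16}
agent.sources.s1.interceptors.mask.replaceString = REDACTED

Scrubbing at collection time means the raw PII never lands in HDFS at all, which is usually an easier story to tell an auditor than scrubbing after the fact.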

Your company probably has a document retention policy that includes the data you are putting into Hadoop. Make sure you remove data that your policy says you aren't supposed to be keeping around anymore. The last thing you want is a visit from the lawyers.


Summary

In this chapter, we covered several real-world considerations you need to think about when planning your Flume implementation, including:

- Transport time does not always match event time
- The mayhem Daylight Savings Time introduces into certain time-based logic
- Capacity planning considerations
- Items to consider when you have more than one data center
- Data compliance
- Data retention and expiration

I hope you enjoyed this book. Hopefully, you will be able to apply much of this information directly in your application/Hadoop integration efforts.

Thanks, this was fun!


Index

A

ActiveMQ: URL / JMS source
agent / Sources, channels, and sinks
agent command / Starting up with "Hello, World!"
agent identifier (name) / An overview of the Flume configuration file
agent process
  monitoring / Monitoring the agent process
agent process, monitoring
  Monit / Monit
  Nagios / Nagios
AOP Spring Framework / Interceptors, channel selectors, and sink processors
Apache Avro
  about / Apache Avro
  URL / Apache Avro
Apache benchmark: URL / Setting up the web server
Apache BigTop project: URL / Flume in Hadoop distributions
Apache Flume: URL / The memory channel
Apache Kafka: URL / Transport time versus log time
Avro source/sink
  used, for tiering data flows / The Avro source/sink, The Thrift source/sink
  compressed Avro / Compressing Avro
avro_event serializer / Apache Avro

B

basenameHeaderKey property / Spooling Directory Source
batchDurationMillis property / Sink configuration
batchSize property / Sink configuration
batchTimeout property / The Exec source
Best effort (BE) / Flume 0.9
byteCapacity
  URL / The memory channel
byteCapacityBufferPercentage
  URL / The memory channel

C

capacity planning / Capacity planning
CDH3 distribution / Flume 0.9
CDH5
  URL / Archiving to HDFS
centralized master (masters) / Flume 0.9
cf-engine tool / Flume 1.X (Flume-NG)
ChannelProcessor function / The memory channel
channels / Sources, channels, and sinks
channel selector
  about / Channel selectors
  replicating / Replicating
  multiplexing / Multiplexing
channel selectors / Interceptors, channel selectors, and sink processors
checkpointInterval property / The file channel
Chef tool / Flume 1.X (Flume-NG)
Cloudera
  URL / Flume in Hadoop distributions
Cloudera distribution
  issue solution, URL / An overview of the Flume configuration file
codecs
  about / Compression codecs
command-line Avro
  about / Using command-line Avro
compliance / Compliance and data expiry
CompressedStream / CompressedStream
consumeOrder property / Spooling Directory Source
cron daemon
  URL / Configuring log rotation to the spool directory
custom interceptors
  about / Custom interceptors
  plugins directory / The plugins directory
custom monitoring reporting mechanism: developing / Custom monitoring hooks

D

-Dflume.root.logger property / Starting up with "Hello, World!"
data/logs
  streaming / The problem with HDFS and streaming data/logs
data expiry / Compliance and data expiry
data flows
  tiering / Tiering flows
data flows, tiering
  Avro source/sink, using / The Avro source/sink
  SSL Avro / SSL Avro flows
  Thrift source/sink, using / The Thrift source/sink
  command-line Avro / Using command-line Avro
  Log4J appender / The Log4J appender
  Log4J load-balancing appender / The Log4J load-balancing appender
DataStream / DataStream
destinationName property / JMS source
Disk Failover (DFO) / Flume 0.9

E

Elastic Compute Cluster (EC2) / Web logs to searchable UI
ElasticSearch
  about / ElasticSearchSink
  URL / ElasticSearchSink, Setting up the target – Elasticsearch, Setting up a better user interface – Kibana
  versus Apache Solr / DynamicSerializer
  setting up / Setting up the target – Elasticsearch
ElasticSearchDynamicSerializer serializer / DynamicSerializer
ElasticSearchLogStashEventSerializer / LogStashSerializer
ElasticSearchSink
  about / ElasticSearchSink
  settings / ElasticSearchSink
  ElasticSearchLogStashEventSerializer / LogStashSerializer
  ElasticSearchDynamicSerializer serializer / DynamicSerializer
Embedded Agent
  about / The embedded agent
  configuration / Configuration and startup
  startup / Configuration and startup
  alternative formats, URL / Configuration and startup
  data, sending / Sending data
  shutdown / Shutdown
End-to-End (E2E) / Flume 0.9
Event Serializer
  about / Event Serializers
  text output / Text output
  text_with_headers serializer / Text with headers
  Apache Avro / Apache Avro
  user-provided Avro schema / User-provided Avro schema
  file type / File type
  timeouts / Timeouts and workers
  workers / Timeouts and workers
Exec source
  about / The Exec source
  properties / The Exec source

F

file channel
  about / The file channel
  configuration parameters / The file channel
file type
  about / File type
  SequenceFile / SequenceFile
  DataStream / DataStream
  CompressedStream / CompressedStream
Flume
  about / Flume 0.9
  events / Flume events
  URL / Downloading Flume
  downloading / Downloading Flume
  in Hadoop distributions / Flume in Hadoop distributions
  configuration file / An overview of the Flume configuration file
  setting up, on collector/relay / Setting up Flume on collector/relay
  setting up, on client / Setting up Flume on the client
  agent process, monitoring / Monitoring the agent process
Flume-NG (Flume the Next Generation) / Flume 0.9
flume-ng-kafka-source
  URL / Transport time versus log time
Flume 0.9 / Flume 0.9
Flume 1.X (Flume-NG) / Flume 1.X (Flume-NG)
  URL / Flume 1.X (Flume-NG)
Flume configuration file
  overview / An overview of the Flume configuration file
Flume events
  about / Flume events
  interceptors / Interceptors, channel selectors, and sink processors
  channel selectors / Interceptors, channel selectors, and sink processors
  sink processors / Interceptors, channel selectors, and sink processors
Flume JVM: URL / Monitoring performance metrics
Flume User Guide: URL / The file channel, HDFS sink

G

Ganglia
  about / Ganglia
  URL / Ganglia
grok command: URL / The Kite SDK

H

Hadoop distributions
  Flume / Flume in Hadoop distributions
  benefits / Flume in Hadoop distributions
  limitations / Flume in Hadoop distributions
HDFS
  issue / The problem with HDFS and streaming data/logs
  archiving to / Archiving to HDFS
hdfs.batchSize parameter / File rotation
hdfs.maxOpenFiles property / Path and filename
hdfs.timeZone property / Path and filename
hdfs.useLocalTimeStamp boolean property / Path and filename
HDFS sink
  about / HDFS sink
  configuration parameters / HDFS sink
  path / Path and filename
  filename / Path and filename
  file rotation / File rotation
  compression codecs / Compression codecs
"Hello, World!" example: file configuration / Starting up with "Hello, World!"
help command / Starting up with "Hello, World!"
Hortonworks
  URL / Flume in Hadoop distributions
Host interceptor
  about / Host
  properties / Host
Human Optimized Configuration Object Notation (HOCON): URL / Morphline configuration files

I

indexName property / ElasticSearchSink
interceptor / Interceptors, channel selectors, and sink processors
  used, for creating search fields / Creating more search fields with an interceptor
interceptors
  about / Interceptors
  adding / Interceptors
  Timestamp / Timestamp
  Host / Host
  Static / Static
  regular expression filtering / Regular expression filtering
  regular expression / Regular expression extractor
  Morphline / Morphline interceptor
  custom / Custom interceptors
internal HTTP server: using / Internal HTTP server
ISO 8601: URL / Time zones are evil

J

Java KeyStore (JKS) / SSL Avro flows
Java properties
  flume.monitoring.type / Ganglia
  flume.monitoring.hosts / Ganglia
  flume.monitoring.isGanglia3 / Ganglia
JMS message selectors: URL / JMS source
JMS source
  about / JMS source
  configuring / JMS source
  settings / JMS source

K

keep-alive parameter / The file channel
keepFields property / The syslog UDP source, The syslog TCP source
Kibana
  URL / ElasticSearchSink, Setting up a better user interface – Kibana
  setting up / Setting up a better user interface – Kibana
Kite SDK
  about / The Kite SDK
  URL / The Kite SDK

L

Log4J appender
  about / The Log4J appender
  URL / The Log4J appender
Log4J load-balancing appender: about / The Log4J load-balancing appender
logrotate utility: URL / Configuring log rotation to the spool directory
Logstash: URL / ElasticSearchSink
log time: versus transport time / Transport time versus log time

M

MapR
  URL / Flume in Hadoop distributions
memoryCapacity property / Spillable Memory Channel
memory channel
  about / The memory channel
  configuration parameters / The memory channel
metrics: URL / Internal HTTP server
Mini-Me counterpart / Time zones are evil
minimumRequiredSpace property / The file channel
Monit
  URL / Monit
  about / Monit
Morphline
  configuration file / Morphline configuration files
  URL / Morphline configuration files, Typical SolrSink configuration, Morphline interceptor
Morphline commands: URL / The Kite SDK
morphlineId property / Sink configuration
Morphline interceptor
  about / Morphline interceptor
MorphlineSolrSink
  about / MorphlineSolrSink
  Morphline configuration file / Morphline configuration files
  SolrSink configuration / Typical SolrSink configuration
  sink configuration / Sink configuration
multiple data centers: considerations / Considerations for multiple data centers
Multiport Syslog TCP source
  about / The multiport syslog TCP source
  properties / The multiport syslog TCP source
  Flume headers / The multiport syslog TCP source

N

Nagios
  URL / Nagios
  about / Nagios
Nagios JMX plugin: URL / Monitoring performance metrics
nc command / Starting up with "Hello, World!"
Nginx web server
  URL / Setting up the web server

O

overflowCapacity property / Spillable Memory Channel
overflowDeactivationThreshold property / Spillable Memory Channel
overflowTimeout property / Spillable Memory Channel

P

Payment Card Industry Data Security Standard (PCI DSS)
  URL / Compliance and data expiry
performance metrics
  monitoring / Monitoring performance metrics
performance metrics, monitoring
  Ganglia / Ganglia
  internal HTTP server / Internal HTTP server
  custom monitoring hooks / Custom monitoring hooks
Personally identifiable information (PII): URL / Compliance and data expiry
plugins directory
  about / The plugins directory
  $FLUME_HOME/plugins.d directory / The plugins directory
  lib directory / The plugins directory
  native directory / The plugins directory
pollTimeout property / JMS source
pom.xml file
  URL / DynamicSerializer
POSIX-style filesystem / The problem with HDFS and streaming data/logs
processor.backoff property / Load balancing
Puppet tool / Flume 1.X (Flume-NG)

R

Red Hat Enterprise Linux (RHEL) / Flume in Hadoop distributions
regular expression extractor interceptor
  about / Regular expression extractor
  properties / Regular expression extractor
regular expression filtering interceptor
  about / Regular expression filtering
  URL / Regular expression filtering
routing / Routing
Runners / Flume 1.X (Flume-NG)

S

Sarbanes Oxley (SOX)
  URL / Compliance and data expiry
SequenceFile file type / SequenceFile
serializer.syncIntervalBytes property / Apache Avro
serializers / Regular expression extractor
sink group
  about / Sink groups
  load balancing / Load balancing
  failover / Failover
Sink Processor: about / Sink groups
sink processors / Interceptors, channel selectors, and sink processors
sinks / Sources, channels, and sinks
Solr / MorphlineSolrSink
SolrCloud / MorphlineSolrSink
SolrSink
  configuration / Typical SolrSink configuration
sources / Sources, channels, and sinks
Spillable Memory Channel
  about / Spillable Memory Channel
  configuration parameters / Spillable Memory Channel
spool directory: log rotation, configuring / Configuring log rotation to the spool directory
Spooling Directory Source
  about / Spooling Directory Source
  creating / Spooling Directory Source
  properties / Spooling Directory Source
Spring: URL / Configuration and startup
start() method / Custom monitoring hooks
Statements on Standards for Attestation Engagements (SSAE-16)
  URL / Compliance and data expiry
Static interceptor
  about / Static
  properties / Static
syslog sources
  about / Syslog sources
  URL / Syslog sources
  UDP source / The syslog UDP source
  TCP source / The syslog TCP source
  Multiport Syslog TCP source / The multiport syslog TCP source
Syslog TCP source
  about / The syslog TCP source
  creating / The syslog TCP source
  Flume headers / The syslog TCP source
syslog terms: URL / Syslog sources
Syslog UDP source
  about / The syslog UDP source
  properties / The syslog UDP source
  Flume headers / The syslog UDP source

T

tail
  URL / The problem with using tail
  about / The problem with using tail
  issues / The problem with using tail
tail -F command / The Exec source
take / The memory channel
text_with_headers serializer / Text with headers
Thrift
  URL / The Thrift source/sink
Thrift source/sink
  used, for tiering data flows / The Thrift source/sink
tiered data collection / Tiered data collection (multiple flows and/or agents)
Timestamp interceptor
  about / Timestamp
  properties / Timestamp
timestamp key / Regular expression extractor
time zones / Time zones are evil
transactionCapacity property / The file channel, Spillable Memory Channel
transport time
  versus log time / Transport time versus log time

U

Universally Unique Identifier (UUID)
  URL / Transport time versus log time
user-provided Avro schema / User-provided Avro schema

V

VeriSign / SSL Avro flows

W

web application
  simulating / Web logs to searchable UI
web server
  setting up / Setting up the web server
  log rotation, configuring to spool directory / Configuring log rotation to the spool directory
worker daemons (agents) / Flume 0.9
Write Ahead Log (WAL) / The file channel
wrk
  URL / Setting up the web server

Z

Zookeeper / Flume 0.9