Motivation
The Evolution of Massive-Scale Data Processing
Tyler Akidau, Staff Software Engineer @ Google
https://goo.gl/5k0xaL
We’re not the newest!
APACHE FLINK LONDON MEETUP
3rd March 2016 | Bonhill House, London
What we’ll cover today
• Hand-wavey bit
• Practical bit
• Textbook bit
Part 1: The hand-wavey bit
• Aim:
  - Make sure we all have same basic understanding of what Flink is
  - Introduce key concepts
    - Not exhaustive
    - Not explaining much!
WTF is Flink?
Flink basics…
• Apache Foundation top-level open source project…
• …for distributed data processing…
• …with a “streaming first” architecture…
• …running on the JVM.
Or: A ‘free’ way to process a lot of data (especially streaming data) on ‘commodity’ hardware, with a code base that is continually improving. Useful for reporting, analytics, log processing, machine learning, etc.
Some key terms
• DataStream: A possibly unbounded immutable collection of data items of the same type
• DataSet: An abstract representation of a finite immutable collection of data of the same type that may contain duplicates
• Source: Can be file-based, socket-based, collection-based, Custom (e.g. Kafka)
• Sink: Consumes DataSets/DataStreams and forwards them to files, sockets, external systems, or prints them
• Operator: Represents an operation (or a data processing step) in the ‘JobGraph’ – includes properties like the actual code and desired parallelism.
Application architecture
Flink ‘skeleton’ program structure
DataStream
1. Obtain a StreamExecutionEnvironment
2. Connect to data stream sources
3. Specify transformations on the data streams
4. Specify output for the processed data
5. Execute the program [env.execute()]

DataSet
1. Obtain an ExecutionEnvironment
2. Load/create the initial data
3. Specify transformations on the data
4. Specify where to put results
5. Execute the program [env.execute(), print(), collect()]
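The five steps can be mimicked in plain Python to show the one thing about the skeleton that tends to surprise people: steps 1 to 4 only describe the job, and nothing runs until step 5. This is a toy sketch, not Flink code; `ToyEnvironment`, `ToyStream` and all their methods are hypothetical names invented for the illustration.

```python
class ToyStream:
    """A lazily evaluated chain of transformations (hypothetical, not a Flink class)."""

    def __init__(self, env, source):
        self.env = env
        self.source = source
        self.steps = []  # recorded transformations, applied only on execute()

    def map(self, fn):
        self.steps.append(("map", fn))
        return self

    def filter(self, predicate):
        self.steps.append(("filter", predicate))
        return self

    def add_sink(self, sink):
        # Register the finished pipeline with the environment; still nothing runs.
        self.env.jobs.append((self, sink))
        return self


class ToyEnvironment:
    """Collects pipelines and runs them on execute() (hypothetical, not a Flink class)."""

    def __init__(self):
        self.jobs = []

    def from_collection(self, items):
        return ToyStream(self, list(items))

    def execute(self):
        for stream, sink in self.jobs:
            for item in stream.source:
                keep = True
                for kind, fn in stream.steps:
                    if kind == "map":
                        item = fn(item)
                    elif not fn(item):  # a filter step rejected the item
                        keep = False
                        break
                if keep:
                    sink(item)


env = ToyEnvironment()                                   # 1. obtain an environment
lines = env.from_collection(["a b", "", "c d e"])        # 2. connect a source
word_counts = (lines.filter(lambda s: s != "")           # 3. specify transformations
                    .map(lambda s: len(s.split())))
results = []
word_counts.add_sink(results.append)                     # 4. specify output
before_execute = list(results)                           # nothing has run yet
env.execute()                                            # 5. execute the program
```

`before_execute` is captured between steps 4 and 5 to make the laziness visible: the sink only receives data once `execute()` is called.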
(in future meetups Guest Speakers will give us the juicy details…)
Key Flink features
High Performance
Support for out-of-order events
Low latency
Exactly-once semantics
Flexible streaming windows
One runtime for stream & batch / ecosystem
Back pressure
Delta iterate operators
According to the Apache Flink site (http://flink.apache.org/)
High performance / Low latency
High throughput
Low latency
Flow control and back pressure
• Back pressure bottleneck: ‘pressure’ building up because data is arriving faster than it can be processed.
  - Temporary process slow-down (e.g. GC on JVM)
  - Temporary traffic spike
• “Flink achieves the maximum throughput allowed by the slowest part of the pipeline”
  - Not a configurable ‘feature’
  - Inherent in architecture (buffer-based)
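The ‘buffer-based’ point can be illustrated with a tiny discrete-time model: a fast producer and a slow consumer joined by a bounded buffer. This is only a sketch of the general idea, not Flink's actual network stack; the function name, the rates and the buffer size are all made up for the example.

```python
def simulate(ticks, produce_rate, consume_rate, buffer_capacity):
    """Model a producer and a slower consumer joined by a bounded buffer.

    Returns (items produced, items consumed, largest buffer occupancy seen).
    """
    buffered = 0
    produced = 0
    consumed = 0
    max_buffered = 0
    for _ in range(ticks):
        # The producer can only emit into free buffer slots; when the buffer
        # is full it is held back. That hold-back is the back pressure.
        free_slots = buffer_capacity - buffered
        emitted = min(produce_rate, free_slots)
        buffered += emitted
        produced += emitted
        max_buffered = max(max_buffered, buffered)

        # The consumer drains at its own, slower pace.
        taken = min(consume_rate, buffered)
        buffered -= taken
        consumed += taken
    return produced, consumed, max_buffered


produced, consumed, max_buffered = simulate(
    ticks=100, produce_rate=5, consume_rate=2, buffer_capacity=4
)
```

After the first tick fills the buffer, the producer is throttled to the consumer's 2 items per tick, so the whole pipeline settles at the speed of its slowest part and the buffer never grows past its capacity.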
Exactly-once semantics for state
• In the event of failure “Pick up where you left off”.
  - Means you need to remember where you left off (data and state)
• 3 levels of risk appetite:
  - L1 – Accept misses (“At most once”)
  - L2 – Accept duplicates (“At least once”)
  - L3 – Don’t accept either (“Exactly once”)
• Checkpointing / snapshots
  - Dependent on stream source – e.g. Kafka
  - Orchestration is tricky (see next slide)
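The three levels can be made concrete with a deliberately simplified model: one operator keeping a running sum, one replayable source addressed by offset, periodic checkpoints of (offset, state), and a single simulated crash. It is a sketch of the idea only, not how Flink implements recovery; `run`, `checkpoint_every` and `fail_after` are hypothetical names.

```python
def run(events, mode, checkpoint_every=3, fail_after=5):
    """Sum `events` with one simulated crash and return the final sum."""
    state = 0              # operator state: the running sum
    checkpoint = (0, 0)    # (source offset, state) at the last snapshot
    offset = 0
    crashed = False
    while offset < len(events):
        state += events[offset]
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, state)
        if offset == fail_after and not crashed:
            crashed = True
            saved_offset, saved_state = checkpoint
            if mode == "at-most-once":
                # State rolls back but the source does not rewind: misses.
                state = saved_state
            elif mode == "at-least-once":
                # Source rewinds but state does not roll back: duplicates.
                offset = saved_offset
            elif mode == "exactly-once":
                # Offset and state are restored from the same snapshot.
                offset, state = saved_offset, saved_state
            else:
                raise ValueError(f"unknown mode: {mode}")
    return state


events = [1, 2, 3, 4, 5, 6, 7, 8]           # true sum is 36
at_most = run(events, "at-most-once")       # events 4 and 5 are lost
at_least = run(events, "at-least-once")     # events 4 and 5 are counted twice
exactly = run(events, "exactly-once")       # matches the true sum
```

The exactly-once branch only works because the source can be rewound to a saved offset, which is why the guarantee depends on the stream source.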
Checkpointing orchestration
Support for Out-of-Order events
• Real life: messages will be delayed
• Every event is time-stamped
• It’s harder than it sounds (‘kinds’ of time, windows, watermarks, etc)
t1 t2 t3 t5 t6 t7 t4 t8
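One common way to cope with a sequence like the one above, where t4 turns up after t7, is to buffer events and release them only once a watermark has passed them. The sketch below assumes a simple rule, watermark = largest timestamp seen minus an allowed delay; that rule, the `reorder` function and the `max_delay` parameter are assumptions for illustration, not a description of Flink's own watermark generators.

```python
import heapq


def reorder(timestamps, max_delay):
    """Re-sort a slightly out-of-order stream using a watermark.

    Returns (events emitted in timestamp order, events dropped as too late).
    """
    pending = []                 # min-heap of buffered timestamps
    in_order, late = [], []
    max_seen = float("-inf")
    for ts in timestamps:
        watermark = max_seen - max_delay
        if ts <= watermark:
            # The watermark has already passed this timestamp: too late.
            late.append(ts)
            continue
        heapq.heappush(pending, ts)
        max_seen = max(max_seen, ts)
        watermark = max_seen - max_delay
        # Everything at or below the watermark is now safe to emit in order.
        while pending and pending[0] <= watermark:
            in_order.append(heapq.heappop(pending))
    # End of stream: flush whatever is still buffered.
    while pending:
        in_order.append(heapq.heappop(pending))
    return in_order, late


arrivals = [1, 2, 3, 5, 6, 7, 4, 8]            # t4 arrives after t7
patient_order, patient_late = reorder(arrivals, max_delay=4)
strict_order, strict_late = reorder(arrivals, max_delay=2)
```

With `max_delay=4` the late t4 is still slotted back into place; with `max_delay=2` the watermark has already moved past it and it is reported as late. The trade-off is latency: a larger allowed delay tolerates more disorder but holds results back longer.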
Highly flexible streaming windows
A window marks the start and end of the part of the data stream that is being processed.
• Different ways to define the window, including:
  - Time (from 9:00:00 to 9:00:04)
  - Count (from item 12 to item 18)
  - Session (from first ‘keyed event’ until we don’t see same key for X time – analogous to cookie session)
  - More complex logic driven by the data, and more complex windows depending on what is needed
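The first three window types can be sketched in a few lines of plain Python. These helpers are hypothetical and only show how items get assigned to windows; they assume half-open time intervals [start, end), integer timestamps, and a single key for the session case.

```python
from collections import defaultdict


def tumbling_time_windows(events, size):
    """Group (timestamp, value) pairs into fixed windows of `size` time units."""
    windows = defaultdict(list)
    for ts, value in events:
        start = (ts // size) * size
        windows[(start, start + size)].append(value)
    return dict(windows)


def count_windows(values, size):
    """Group values into consecutive windows of `size` items (the last may be short)."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def session_windows(timestamps, gap):
    """Group one key's timestamps into sessions separated by more than `gap` of silence."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)      # still within the current session
        else:
            sessions.append([ts])        # silence exceeded the gap: new session
    return sessions


time_result = tumbling_time_windows(
    [(0, "a"), (3, "b"), (4, "c"), (9, "d")], size=4
)
count_result = count_windows(["a", "b", "c", "d", "e"], size=2)
session_result = session_windows([1, 2, 3, 10, 11, 30], gap=5)
```

Time and count windows have boundaries that are known up front, whereas a session window's end is only known once the key has been quiet for the gap, which is part of why windowing is driven by the data as well as by the clock.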
(Delta) iterate operators
Iterate operator
Delta iterate operator
Work on ‘hot’
Don’t touch ‘cold’
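The ‘hot’ versus ‘cold’ distinction shows up nicely in a connected-components toy: a bulk iteration revisits every vertex each round, while a delta iteration keeps a workset of vertices whose label just changed and leaves the rest alone. The sketch below is plain Python with hypothetical function names and a made-up graph; it mirrors the idea, not Flink's iterate operators themselves.

```python
def build_neighbours(vertices, edges):
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return neighbours


def components_bulk(vertices, edges):
    """Bulk iteration: every round visits every vertex until nothing changes."""
    neighbours = build_neighbours(vertices, edges)
    labels = {v: v for v in vertices}
    visits = 0
    changed = True
    while changed:
        changed = False
        for v in vertices:
            visits += 1
            for n in neighbours[v]:
                if labels[v] < labels[n]:
                    labels[n] = labels[v]
                    changed = True
    return labels, visits


def components_delta(vertices, edges):
    """Delta iteration: each round visits only the 'hot' vertices in the workset."""
    neighbours = build_neighbours(vertices, edges)
    labels = {v: v for v in vertices}    # the solution set
    workset = set(vertices)              # initially every vertex is 'hot'
    visits = 0
    while workset:
        next_workset = set()
        for v in sorted(workset):        # sorted only to keep the run deterministic
            visits += 1
            for n in neighbours[v]:
                if labels[v] < labels[n]:
                    labels[n] = labels[v]
                    next_workset.add(n)  # n changed, so it is 'hot' next round
        workset = next_workset           # 'cold' vertices are not touched again
    return labels, visits


vertices = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (4, 5)]
bulk_labels, bulk_visits = components_bulk(vertices, edges)
delta_labels, delta_visits = components_delta(vertices, edges)
```

Both versions label each vertex with the smallest vertex id in its component, but the delta version does so with fewer vertex visits; the gap widens on large graphs where most vertices settle early.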
One runtime / library ecosystem
NB:
• Libraries in beta
• APIs in Java, Scala, [Python]
• Flink CEP too?