Great Ideas… Simple Solutions
Data Ingestion Platform (DiP)
Neeraj Sabharwal
@allaboutbdata
About me
Xavient Corporate Overview
• Head of Cloud, Data & Analytics @ Xavient
• Spent a couple of years @ Hortonworks
• Over a decade in the Cloud & Data domain
• Started career as an Oracle DBA
Disclosure – More memes coming up…
Agenda
Platform
Data Access
Hybrid Cloud
Data Ingestion Platform (DiP)
Before we start…
**Near real time is OK, as I am easygoing, but no more waiting hours or days for data
Problem
UI/API | Platform | Data Access
No… near real-time access
Cloud
Shifting gears – Let's get technical
Streaming Blueprint
Data Collection | Messaging Tier | Streaming Engine | Analysis Tier | In-memory Data Store | Data Access
Messaging Bus
• Open-source message broker
• Unified, high-throughput, low-latency platform for handling real-time data feeds
• Massively scalable pub/sub message queue architected as a distributed transaction log
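The "distributed transaction log" idea can be sketched in a few lines. This is only an illustration, not Kafka's API (the deck's technology stack lists Apache Kafka as the messaging system): `MiniLog`, `append` and `poll` are hypothetical names, and the sketch models a single in-memory partition. It shows that records are appended once and that each consumer group keeps its own read offset, so several streaming engines can read the same feed independently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for one log partition; not Kafka's API. */
final class MiniLog {
    private final List<String> records = new ArrayList<>();
    // Next offset to read, tracked separately per consumer group.
    private final Map<String, Integer> committed = new HashMap<>();

    /** Appends a record to the end of the log and returns its offset. */
    int append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    /** Returns the records this group has not read yet and advances its offset. */
    List<String> poll(String group) {
        int from = committed.getOrDefault(group, 0);
        List<String> batch = new ArrayList<>(records.subList(from, records.size()));
        committed.put(group, records.size());
        return batch;
    }

    public static void main(String[] args) {
        MiniLog log = new MiniLog();
        log.append("click:home");
        log.append("click:cart");
        System.out.println(log.poll("storm")); // both records
        System.out.println(log.poll("spark")); // both records again: separate offset
        log.append("click:pay");
        System.out.println(log.poll("storm")); // only the newly appended record
    }
}
```

Because readers only move an offset and never remove records, adding another consumer does not affect the ones already reading.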
Emotions
Streaming engines
Storm - Distributed real-time computation system for processing large volumes of high-velocity data
Flink - Streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams
Apex - Enterprise-grade unified stream and batch processing engine
Spark Streaming - Apache Spark's language-integrated API for stream processing, letting you write streaming jobs the same way you write batch jobs. It supports Java, Scala and Python
CTM
Platform (DiP)
Features
Easy-to-use UI
Multiple streaming engines
Supports XML, JSON and TSV data formats
Manual data entry via UI
Upload files for batch processing
Hybrid Cloud
Batch and real-time views of data
Data visualization and analytics
YARN features Data Ingestion Platform
Use Cases – Any Data
Sentiment Analysis
Log Analysis
Clickstream Analysis
Analyze Machine and Sensor Data
Social Media and Customer Sentiment
UI
https://techblog.xavient.com/
What was in the previous slide? Is that for real?
No more memes… Enough now :)
DiP Technology Stack
Messaging System: Apache Kafka
Target System: HDFS, NoSQL, Apache Hive
Reporting System: Apache Phoenix, Apache Zeppelin
Source System: Web Client
Streaming APIs: Apache Apex, Apache Flink, Apache Spark and Apache Storm
Programming Language: Java
IDE: Eclipse
Build Tool: Apache Maven
Operating System: CentOS 7
DiP High-Level Architecture
DiP using Storm
Apache Storm features
• Multiple processing paradigms - real-time, interactive and batch processes
• Reliable - each unit of data (tuple) will be processed at least once or exactly once
• Fast and scalable - parallel calculations are run across a cluster of machines
• Fault-tolerant - workers automatically restart if they die
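The difference between "at least once" and "exactly once" in the reliability bullet can be made concrete with a small model. The sketch below does not use Storm's API; `AtLeastOnceSketch`, `deliver` and `crashBeforeAck` are hypothetical names, and tuple ids are simply list positions. It assumes a tuple whose processing crashes before the ack is replayed, which is what produces duplicates, and that a sink which remembers the ids it has written can filter those duplicates out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of ack-and-replay delivery; it does not use Storm's API. */
final class AtLeastOnceSketch {

    /**
     * Delivers every input tuple until it is acked and returns the number of deliveries.
     * Tuples whose id is in crashBeforeAck fail once after writing, so they are replayed.
     * With dedupe enabled, the sink ignores ids it has already written.
     */
    static int deliver(List<String> input, Set<Integer> crashBeforeAck,
                       boolean dedupe, List<String> sink) {
        Deque<Integer> pending = new ArrayDeque<>();
        for (int id = 0; id < input.size(); id++) {
            pending.add(id);
        }
        Set<Integer> crashedOnce = new HashSet<>(); // ids that already failed once
        Set<Integer> written = new HashSet<>();     // ids the sink has already stored
        int deliveries = 0;
        while (!pending.isEmpty()) {
            int id = pending.poll();
            deliveries++;
            if (!dedupe || written.add(id)) {
                sink.add(input.get(id)); // side effect happens before the ack
            }
            boolean crashed = crashBeforeAck.contains(id) && crashedOnce.add(id);
            if (crashed) {
                pending.add(id); // no ack received: replay the tuple later
            }
        }
        return deliveries;
    }

    public static void main(String[] args) {
        List<String> input = List.of("a", "b", "c");
        List<String> plain = new ArrayList<>();
        List<String> deduped = new ArrayList<>();
        deliver(input, Set.of(1), false, plain);
        deliver(input, Set.of(1), true, deduped);
        System.out.println(plain);   // "b" appears twice: at-least-once
        System.out.println(deduped); // each value once: duplicates filtered by id
    }
}
```

In both runs the tuple is delivered twice; only the idempotent sink turns that into an effectively-once result.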
DiP using Spark Streaming
Spark Streaming features
• Multiple processing paradigms - batch and interactive
• Ease of use - contains high-level operators written in Java, Scala and Python
• Fault tolerance - lost work and operator state can be recovered with no extra code
• Code reusability - the same code can be used for batch processing, to join streams against historical data, or to run ad-hoc queries on stream state
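The code reusability point can be illustrated without Spark itself. In this sketch, which uses plain Java collections rather than Spark's API, `SharedLogicSketch`, `wordCounts` and `runStreaming` are hypothetical names. It assumes the stream arrives as a sequence of small batches, and it shows one transformation being applied unchanged to a full batch and to each small batch, with the per-batch results merged into running state.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of sharing logic between batch and streaming paths. */
final class SharedLogicSketch {

    /** Business logic written once: counts whitespace-separated words. */
    static Map<String, Integer> wordCounts(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Streaming path: reuses wordCounts per micro-batch and merges into running state. */
    static Map<String, Integer> runStreaming(List<List<String>> microBatches) {
        Map<String, Integer> state = new TreeMap<>();
        for (List<String> batch : microBatches) {
            wordCounts(batch).forEach((word, n) -> state.merge(word, n, Integer::sum));
        }
        return state;
    }

    public static void main(String[] args) {
        List<String> all = List.of("kafka storm", "storm spark", "kafka");
        System.out.println(wordCounts(all));
        System.out.println(runStreaming(List.of(all.subList(0, 1), all.subList(1, 3))));
    }
}
```

Both calls in `main` produce the same counts, which is the property that lets one code base serve the batch and real-time views.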
DiP using Apex
Apache Apex features
• Modular - Malhar, a library of operators, comes bundled with Apex for quick development cycles
• Supports both stream and batch processing
• Supports operator exchange at runtime
• Supports fault tolerance and dynamic scaling
DiP using Flink
Apache Flink features
Multiple processing paradigms - distributed, stream and batch processing. Several APIs for creating applications are supported:
• DataStream API for unbounded streams, embedded in Java and Scala
• DataSet API for static data, embedded in Java, Scala and Python
• Table API with a SQL-like expression language, embedded in Java and Scala
Fault tolerance for distributed computations over data streams
DiP-Druid Architecture (High Level)
Credit: https://imply.io/docs/latest/
https://techblog.xavient.com/kafka-druid-integration-with-ingestion-dip-real-time-data
Data Access
Apache Zeppelin / Custom UI
• Data stored on HDFS as Hive external tables
• Data stored on HBase as Phoenix views
Custom UI “Co-Dev”
• Integrated with Elasticsearch
• Enterprise security and SSO
• Recommendation model based on user profile, tags and activity
• Chat
• Blog/Droplet features
• Task creation and follow-up
• Notifications
• Smartphone app
Get involved
https://github.com/XavientInformationSystems/Data-Ingestion-Platform
Co-Dev: Reach out if you want to customize the platform or choose the right streaming engine based on latency, use case and custom UI/reporting needs.
Hybrid Cloud
Hadoop and Cloud
Apache Falcon
DiP Hadoop
On-prem Cloud
Apache Falcon is a data management tool for overseeing data pipelines in Hadoop clusters. It can be used to replicate data from one cluster to another.
Hadoop
Kafka Mirroring
The Kafka mirroring feature creates a replica of an existing cluster, for example to replicate an active datacenter into a passive one. Kafka provides the MirrorMaker tool for mirroring a source cluster into a target cluster.
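Conceptually, mirroring is a consumer on the source cluster feeding a producer on the target cluster. The sketch below models that loop with two in-memory lists instead of real clusters; `MirrorSketch` and `mirrorOnce` are hypothetical names, and it is not MirrorMaker's implementation or configuration. It only shows how tracking the next source offset lets repeated runs copy just the records that arrived since the previous run.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of one-way mirroring between two logs; not MirrorMaker itself. */
final class MirrorSketch {
    private final List<String> source;
    private final List<String> target;
    private int nextOffset = 0; // next source offset that still has to be copied

    MirrorSketch(List<String> source, List<String> target) {
        this.source = source;
        this.target = target;
    }

    /** Copies records appended to the source since the last call; returns the count. */
    int mirrorOnce() {
        int copied = 0;
        while (nextOffset < source.size()) {
            target.add(source.get(nextOffset)); // "consume" from source, "produce" to target
            nextOffset++;
            copied++;
        }
        return copied;
    }

    public static void main(String[] args) {
        List<String> onPrem = new ArrayList<>(List.of("e1", "e2"));
        List<String> cloud = new ArrayList<>();
        MirrorSketch mirror = new MirrorSketch(onPrem, cloud);
        System.out.println(mirror.mirrorOnce()); // copies the two existing records
        onPrem.add("e3");
        System.out.println(mirror.mirrorOnce()); // copies only the new record
        System.out.println(cloud);
    }
}
```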
Kafka Mirroring – Hybrid Cloud Environment
Cassandra
DiP
Cassandra
Cassandra
On-prem Cloud
• RDBMS migration
• DSE Advanced Replication
• Kafka
WIP
• Integration with Kafka Connect and Kafka Streams
• Data munging, validation
• Machine learning
• Search – Elastic, Solr
Thanks!@[email protected]