Enterprise Metadata Integration Mirko Kämpf | Cloudera GraphConnect 2017 – London

Enterprise Metadata Integration


© Cloudera, Inc. All rights reserved.

Enterprise Metadata Integration

Mirko Kämpf | Cloudera

GraphConnect 2017 – London


Who is speaking?

Solutions Architect @ Cloudera

- time series analysis, network analysis, data enrichment pipelines
- personal interest: QA systems and semantic search

Data Science Activities

The Detection of Emerging Trends Using Wikipedia Traffic Data and Context Networks (PLOS ONE, 2015)

Hadoop.TS (IJCA, 2013)

Fluctuations in Wikipedia Access-Rate and Edit-Event Data (Physica A, 2012)


Our Approach: Multilayer Metadata Integration …

• Status dashboards are provided per topic/use case.
• Each dashboard offers facts from multiple layers:
- (L1) Cluster-specific metadata
- (L2) Hadoop-specific ops metadata (only)
- (L3) Application-specific ops metadata
- (L4) Quality metrics and derived facts

• Current project status:
• The graph database Neo4j and Cypher allow context exploration.
• Cluster-spanning metadata exploration is possible.
• Exposure of inherent but sometimes hidden facts becomes as easy as writing an email.

Integration of facts to gain business knowledge


Agenda

EMI - Enterprise Metadata Integration
• Idea & Vision
• Material
• Skills/Methods
• Tools


How To Become Data Driven?

Treat “data as a resource” for your business.
Think in terms of dataset life cycles.


People have been mining … for centuries!

http://www.montanregion-erzgebirge.de/welterbe-erleben/montanregion-fuer-bergbauspezialisten/geschichtliches.html

gold & diamonds, ore & coal, minerals, oil …

The outcome drives the whole economy.


People have been using computers … for decades!

1938: Z1, the world’s first freely programmable device, created by Konrad Zuse.

2015: The U.S. Department of Energy uses an Intel supercomputer at Argonne National Laboratory.

http://www.intel.com/content/dam/www/public/us/en/images/photography-business/RWD/aurora-aerial-reflection-floor-rwd.png

http://www.horst-zuse.homepage.t-online.de/z1.html


DATA

MINING

Blog: About Learning Data Mining & Data Analysis
http://codecondo.com/9-free-books-for-learning-data-mining-data-analysis/


If data is the new oil …

… metadata are the nuggets and brilliants of our age.

Screenshot taken from: https://www.quora.com/Who-should-get-credit-for-the-quote-data-is-the-new-oil


Diamonds: are beautiful even as raw material.

Brilliant: is the result of an expert’s work. You have to cut and grind it!

Even more exciting in combination with other materials and skills …

Process optimization

Requires knowledge gathering and transfer.


• Idea & Vision
• Material
• Skills/Methods
• Tools

Success Factors:

http://www.burkhard-beyer.net/Reportage_Goldschmied.html


Tools and processes evolve … success criteria have been stable.


Let’s Think Data Driven!

• Build a long-term strategy!

Not the fancy toolset but rather your data is what matters most!

• After initial success, you should carefully control the speed of expansion.
• Maximize accessibility of data!

Example: Google’s goal was to make the data of the internet accessible. You should become your own Google!


Dataset Profiles / Flow Descriptors

• Our material is data & metadata:

- Data about data: descriptive data, Dublin Core metadata model, …
- Derived data: statistics extracted from processes, documents, …
- Results of ML/AI procedures: extracted structure and learned models
- Outcome of crowd-based operations: Wikipedia with its inherent structure, communication logs, access and edit history.


Knowledge Extraction for Better Data Science


Science:

According to Wikipedia:

Science is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe.

https://en.wikipedia.org/wiki/Science


Data Science:

My observation:

Data Science is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the market and business context.

https://en.wikipedia.org/wiki/Infographic#/media/File:Gartner_Hype_Cycle_for_Emerging_Technologies.gif


Details

Look into nature …


Context

Look into nature …


Result: Visualization of Facts

• An image shows what the text says.
> Multi-channel communication

• Data science benefits from such an approach.
> Today we still use infographics

Difference: The biologist who created the image on the left observed by eye.

Today, data scientists look more into data than into nature.


Process: Knowledge Extraction is a Natural Process

• Combine multiple sources

• Repeat observation

• Incorporate context to explain differences/variation

• Cross-check to identify anomalies


Process: Knowledge Extraction is a Natural Process

Knowledge

Facts

Data


How did we implement EMDM?

- Hadoop based: for scalability.

- Open graph data model: for flexibility and connectivity.

- Data centric: following the Big Data paradigm.


Big Data Processing: e.g., with Hadoop


Big Graph Processing on Hadoop: e.g., with Giraph


Project name should stand for:

Graphs, Hadoop, and the ecosystem …


Data Science Process Model (DSPM)

• DSPM defines core artifacts for knowledge management
• Describes analysis/transformation context
• Allows repeatable execution
• Process properties become measurable
• Supports comparison of results from multiple procedures

• All those facts are essential ingredients to business optimization.
• But: Logging & tracking should never block creativity!
• Remember: Scientists often act like artists.
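The DSPM bullets above can be illustrated with a minimal Python sketch. It is not the DSPM implementation from the talk: `RunRecord`, `RunLog`, and the two example procedures are hypothetical names, chosen only to show how the execution context can be recorded, a process property (duration) measured, and the results of several procedures compared without changing the analysis code itself.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class RunRecord:
    """Hypothetical artifact describing one execution of an analysis step."""
    procedure: str
    parameters: dict[str, Any]
    result: Any = None
    duration_seconds: float = 0.0


@dataclass
class RunLog:
    """Collects run records so that executions stay comparable and repeatable."""
    records: list[RunRecord] = field(default_factory=list)

    def run(self, procedure: Callable[..., Any], **parameters: Any) -> Any:
        """Execute `procedure`, record its context and duration, return its result."""
        start = time.perf_counter()
        result = procedure(**parameters)
        elapsed = time.perf_counter() - start
        self.records.append(
            RunRecord(procedure.__name__, dict(parameters), result, elapsed)
        )
        return result


def mean(values: list[float]) -> float:
    return sum(values) / len(values)


def median(values: list[float]) -> float:
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


log = RunLog()
data = [1.0, 2.0, 2.0, 3.0, 42.0]  # illustrative input only
log.run(mean, values=data)
log.run(median, values=data)

# Compare the results of both procedures on the same input.
comparison = {record.procedure: record.result for record in log.records}
```

The tracking lives in a wrapper around the call, so the procedures stay plain functions and logging does not get in the way of experimentation.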


Toolbox and Management Methods


Data Science Process Model (DSPM)

Representation of domain knowledge (in our case it is data science in general)

Human Interaction

Ontology

Toolbox and Management Methods

Ability to solve a problem using IT and data

Technology Aspects - represent and interact with facts & data

Data Governance

Certified QM


Semantic Logging

• Property with name: (K,V) is a key-value pair
• Property of a thing: S => (K,V), so (S,P,O) is a triple; K becomes P, V becomes O

• Many of those triples in one common context with name G: G => (S,P,O) is called a quad or named graph

We have to hide these technical details from users!

Obvious facts have to be connected to the knowledge graph as directly as possible.
• Log4j is the logging standard we build on.
• Using structured data instead of plain strings allows easy parsing (e.g., Apache log format).
• Triple representation avoids specific parsing and makes log data part of the linked data graph.
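The step from key-value pairs to triples and quads can be sketched in a few lines of Python. This is an illustration under assumptions, not the Log4j-based implementation mentioned above: the `Quad` type, the helper functions, and all `urn:example:` identifiers are hypothetical, and the output is only a simplified N-Quads-like line format.

```python
from typing import NamedTuple


class Quad(NamedTuple):
    graph: str      # G: name of the shared context (named graph)
    subject: str    # S: the thing the property belongs to
    predicate: str  # P: the property key K
    obj: str        # O: the property value V


def to_quads(graph: str, subject: str, properties: dict[str, str]) -> list[Quad]:
    """Turn the (K, V) properties of one subject into quads in context `graph`."""
    return [Quad(graph, subject, key, value) for key, value in properties.items()]


def format_quad(quad: Quad) -> str:
    """Render a quad as one log line in a simplified N-Quads-like form."""
    return f'<{quad.subject}> <{quad.predicate}> "{quad.obj}" <{quad.graph}> .'


# Hypothetical log event: all identifiers below are illustrative only.
quads = to_quads(
    graph="urn:example:run-42",
    subject="urn:example:job/wordcount",
    properties={
        "urn:example:status": "SUCCEEDED",
        "urn:example:durationSeconds": "118",
    },
)
lines = [format_quad(quad) for quad in quads]
```

Each resulting line carries its own subject, predicate, object, and context, so a consumer can load it into a graph store without a format-specific parser.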


Etosha Toolbox

Data extractors, data transformers,

Ontology-based orchestration,

People and machines contribute facts,

Iterative approach with closed feedback loops,

Scalable environment …

CONCEPT


Multi-layer metadata capturing

Operational metrics
Metrics about fast & static data
Business metrics

Contextualized presentation
Ad-hoc queries for exploration
Graph analytics

> Knowledge exposure

> Self-service DS and BI can speak the same language.

INITIAL IMPLEMENTATION


Results: Better Collaboration for (Hadoop) Knowledge Workers

• Our achievements:
• The open graph model is language-, OS-, and hardware-independent.
• Merging of knowledge partitions enables cluster-spanning metadata exploration.
• Query beans expose facts from multiple stores to web-based interfaces.

• Next steps:
• Improve implicit triplification (query the Solr index and get RDF data).
• Standardize the process and integrate with existing ontologies.
• Grow a community … and enter the Apache Incubator.
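Why merging knowledge partitions enables cluster-spanning exploration can be shown with a small Python sketch. It assumes each cluster contributes its facts as a set of (subject, predicate, object) triples; the dataset names, predicates, and helper functions are hypothetical and stand in for a real graph store such as Neo4j.

```python
Triple = tuple[str, str, str]

# Hypothetical knowledge partitions, one per cluster.
cluster_a: set[Triple] = {
    ("dataset:clicks", "storedOn", "cluster:A"),
    ("dataset:clicks", "owner", "team:web"),
}
cluster_b: set[Triple] = {
    ("dataset:orders", "storedOn", "cluster:B"),
    ("dataset:orders", "derivedFrom", "dataset:clicks"),
}


def merge(*partitions: set[Triple]) -> set[Triple]:
    """Union the partitions; shared identifiers link facts across clusters."""
    return set().union(*partitions)


def objects(graph: set[Triple], subject: str, predicate: str) -> list[str]:
    """Return all objects of triples matching the subject and predicate, sorted."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


merged = merge(cluster_a, cluster_b)

# Question: on which cluster does the source of dataset:orders live?
sources = objects(merged, "dataset:orders", "derivedFrom")
clusters = [c for source in sources for c in objects(merged, source, "storedOn")]
```

Neither partition can answer the question alone: cluster B knows the lineage, cluster A knows the location. Only the merged graph connects the two facts.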


Results: Access Facts & Context of Critical Processes

DEMO: https://www.youtube.com/watch?v=ZE7Gcanv90s&feature=youtu.be


Thank you!

Many thanks to the Cloudera team which supported this work.