Graph Processing with Apache TinkerPop

Graph Processing with Apache TinkerPop (incubating)

Jason Plurad
Software Engineer, IBM | Committer, Apache TinkerPop

• Project Update
• Graph Landscape
• A Graph Problem
• Hands-On Graph

http://tinkerpop.apache.org

About Me
• Twitter @pluradj
• GitHub @pluradj
• Open channels
  – TinkerPop mailing lists
  – Titan mailing list
  – StackOverflow

(Apache) TinkerPop (incubating)
• 2009: Inception
• 2012: TinkerPop 2
• 2015: Apache Incubator
• 2016: Top Level Project?
  – TLP VOTE passed!
  – Waiting on board meeting to establish TLP

Podling Releases

• 3.0
  – Major refactor, Java 8 lambda expressions, Gremlin Server, OLAP graph computers
• 3.1
  – Hadoop 2 support, persisted RDDs
• 3.2
  – OLAP job chaining, OLAP graph filters, performance improvements

Common graph data domains
• Social Network Analysis
• Configuration Management Database
• Master Data Management
• Recommendation Engines
• Knowledge Graphs
• Internet of Things

Property Graph and Gremlin
• Structure
  – Vertex
  – Edge
  – Properties
• Gremlin
  – Domain specific language (DSL) for graph
  – Data flow: forward and backward
  – Traversal steps
  – Bindings for non-JVM languages

Apache TinkerPop Graph Computing Framework

Graph Landscape
• Graph database vs graph processor
  – OLTP vs OLAP
  – Neighborhood vs whole graph
• Multi-model: not the only store in your app
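
As a rough illustration of the neighborhood-versus-whole-graph distinction, the sketch below uses a tiny in-memory adjacency map. The data and function names are hypothetical, and this is plain JavaScript rather than the TinkerPop API. An OLTP-style query starts at one vertex and touches only its neighbors, while an OLAP-style computation visits every vertex.

```javascript
// Hypothetical toy graph: package name -> packages it depends on.
const dependsOn = {
  app: ['left-pad', 'lodash'],
  cli: ['left-pad'],
  'left-pad': [],
  lodash: [],
};

// OLTP-style: start at one vertex and look only at its neighborhood.
function neighborhood(graph, start) {
  return graph[start] || [];
}

// OLAP-style: scan the whole graph to compute a global statistic,
// here the in-degree (number of direct dependents) of every vertex.
function inDegrees(graph) {
  const counts = {};
  for (const name of Object.keys(graph)) counts[name] = 0;
  for (const deps of Object.values(graph)) {
    for (const dep of deps) counts[dep] = (counts[dep] || 0) + 1;
  }
  return counts;
}

console.log(neighborhood(dependsOn, 'app'));
console.log(inDegrees(dependsOn));
```

The first call reads a single adjacency list; the second has to walk every edge, which is the kind of work a graph processor is meant to parallelize.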

IBM Graph (Beta)

• Managed Graph-as-a-Service (OLTP)
• Focus on your data, not install and operations
• #sleepMore

http://ibm.biz/IBMGraph

What is this?

    module.exports = xxxxxxx;
    function xxxxxxx (str, len, ch) {
      str = String(str);
      var i = -1;
      if (!ch && ch !== 0) ch = ' ';
      len = len - str.length;
      while (++i < len) {
        str = ch + str;
      }
      return str;
    }

A Graph Problem: Dependency Management

• On March 22, 2016 npm broke the Internet
• Left-pad was unpublished
  – 11 lines of code
  – WTFPL license
  – Hundreds of breaking builds per minute
  – http://blog.npmjs.org/post/141577284765/kik-left-pad-and-npm
• Are we safe with Apache?

Questions for the graph
• Which dependencies are at risk?
• Which ones should be refactored to avoid?
• Risk factors
  – Unsuitable license
  – Single developer
  – Too little code / Too much code
  – Changes too frequently / Code is stagnant
  – Nobody else is using it
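
To make the risk factors concrete, here is a minimal sketch that flags a package summary against a few of them. The field names, the license allow-list, and the thresholds are all assumptions made up for illustration; they are not the npm registry schema or a policy from the talk.

```javascript
// Hypothetical package summary; field names are assumptions for
// illustration, not the npm registry schema.
function riskFactors(pkg) {
  const risks = [];
  const allowedLicenses = ['MIT', 'Apache-2.0', 'BSD-3-Clause']; // assumed policy
  if (!allowedLicenses.includes(pkg.license)) risks.push('unsuitable license');
  if (pkg.maintainers.length <= 1) risks.push('single developer');
  if (pkg.linesOfCode < 20) risks.push('too little code'); // assumed threshold
  if (pkg.dependents === 0) risks.push('nobody else is using it');
  return risks;
}

// License and line count come from the left-pad slide; the maintainer
// entry and dependents count are placeholders.
const leftPad = {
  name: 'left-pad',
  license: 'WTFPL',
  maintainers: ['someone'],
  linesOfCode: 11,
  dependents: 1000,
};

console.log(riskFactors(leftPad));
```

In a graph, most of these inputs would come from traversals (for example, counting incoming dependency edges) rather than from fields stored on the package itself.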

Let’s go for a ride!

Titan (Aurelius)
• Pick a graph database for OLTP…
  – Apache license but not in ASF
• Code has stagnated in the open
  – DataStax Enterprise (DSE) Graph
  – Wide open opportunities
    • Genesis Graph is up next!
    • Apache S2Graph (incubating)
    • Apache Flink (Gelly)
    • Apache Solr (Graph Query)

Apache Spark or Apache Giraph
• Pick a graph processor for OLAP…
  – Spark is the new hotness
  – Giraph is better suited for gigantic graphs
• By using Apache TinkerPop and Gremlin, we can use either one seamlessly

Vagrant and VirtualBox
• Developers don’t always get keys to the cloud
• Virtual machines to the rescue
  – Host: 16 GB RAM or more
  – 3-4 VMs with 3 GB RAM
• Prove out your graph algorithms on a small data set before wasting time on a big data set

Apache Ambari
• Simple install for Apache Hadoop and related Apache big data packages
  – HDFS, YARN, MapReduce, HBase, Spark, etc.
• Management and monitoring dashboard
• Enables integration of other software

Getting the data
• NPM registry runs on Apache CouchDB
• Replication in Apache CouchDB is awesome
  – https://skimdb.npmjs.com/registry

Transform the data
• Apache CouchDB is a document store
• Dependencies are graph data
• Other things can be too
  – Users
  – Keywords
  – License
• Graph model depends on the questions you want to ask of the graph
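
The document-to-graph step can be sketched as below. The input is a simplified, assumed document shape (real registry documents carry many more fields), and the output record layout is invented for illustration. Only the edge labels dependency, devDependency, and license are taken from the NPM Graph Schema slide; users and keywords could be handled the same way.

```javascript
// Simplified, assumed shape of a package document.
const doc = {
  name: 'my-app',
  license: 'MIT',
  dependencies: { 'left-pad': '^1.0.0' },
  devDependencies: { mocha: '^2.4.5' },
};

// Turn one document into vertex and edge records. The { label, id } and
// { label, from, to } layouts are assumptions for illustration.
function toGraphElements(d) {
  const vertices = [{ label: 'Package', id: d.name }];
  const edges = [];
  for (const dep of Object.keys(d.dependencies || {})) {
    vertices.push({ label: 'Package', id: dep });
    edges.push({ label: 'dependency', from: d.name, to: dep });
  }
  for (const dep of Object.keys(d.devDependencies || {})) {
    vertices.push({ label: 'Package', id: dep });
    edges.push({ label: 'devDependency', from: d.name, to: dep });
  }
  if (d.license) {
    vertices.push({ label: 'License', id: d.license });
    edges.push({ label: 'license', from: d.name, to: d.license });
  }
  return { vertices, edges };
}

console.log(JSON.stringify(toGraphElements(doc), null, 2));
```

A real loader would also deduplicate vertices across documents, since many packages point at the same dependency or license.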

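NPM Graph Schema
[Schema diagram: vertex labels with approximate counts, connected by edge labels]
• Vertex labels: Document (250K), Package (1.5M), Keyword (81K), License (2K), Person (125K)
• Edge labels: license, dependency, devDependency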

Hands-On: Gremlin Console

https://asciinema.org/a/21qk1rn9yt6tt7sour9w9ynxn

The Graph Computer

Anatomy of a VertexProgram
• Vertex-centric graph logic
• Parallel execution (BSP)
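
The vertex-centric, bulk synchronous parallel (BSP) idea can be simulated in a few lines. The sketch below is plain single-threaded JavaScript over a hypothetical toy graph, not the TinkerPop VertexProgram API: in each superstep every vertex sends messages along its out-edges, and the new state is computed only after all messages are delivered. It uses a basic PageRank update and assumes every vertex has at least one outgoing edge.

```javascript
// Toy graph: vertex -> outgoing neighbors (hypothetical data).
const out = { a: ['b', 'c'], b: ['c'], c: ['a'] };

// Run a fixed number of supersteps. Messages sent during a superstep
// become visible only after the barrier that ends it.
// Assumes every vertex has at least one out-edge.
function pageRank(graph, supersteps, damping = 0.85) {
  const ids = Object.keys(graph);
  const n = ids.length;
  let rank = {};
  for (const id of ids) rank[id] = 1 / n;

  for (let step = 0; step < supersteps; step++) {
    // Send phase: each vertex splits its rank among its out-neighbors.
    const inbox = {};
    for (const id of ids) inbox[id] = 0;
    for (const id of ids) {
      const share = rank[id] / graph[id].length;
      for (const target of graph[id]) inbox[target] += share;
    }
    // Barrier, then compute phase: each vertex updates from its inbox only.
    const next = {};
    for (const id of ids) next[id] = (1 - damping) / n + damping * inbox[id];
    rank = next;
  }
  return rank;
}

console.log(pageRank(out, 20));
```

Because each vertex reads only its own inbox, the compute phase has no shared state between vertices, which is what lets a graph computer run it in parallel across workers.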

Out of the box VertexPrograms
• Traversal
• BulkLoader
• BulkDumper
• PageRank
• PeerPressure

Hands-On: Graph Program

OLAP Traversal Sources

    > graph = GraphFactory.open('conf/npmgraph-olap.properties')
    > g = graph.traversal().withComputer(SparkGraphComputer)
    > g = graph.traversal().withComputer(GiraphGraphComputer)

Graph Statistics via TraversalVertexProgram

    > g.V().count()                          // vertex count
    > g.E().count()                          // edge count
    > g.V().label().groupCount()             // vertex label distribution
    > g.E().label().groupCount()             // edge label distribution
    > g.V().properties().key().groupCount()  // vertex property distribution

Next stop? More data!
• Graphs are for connecting data!
• Consume data from GitHub
  – User data
  – Static code analysis
  – Code usage analysis
• Consume data from Twitter
  – Trending news
  – Security alerts

Summary

• Apache TinkerPop is for graph computing
• OLTP vs OLAP is an important distinction
  – Gremlin allows you to seamlessly bridge the two
• Graph thinking is different from relational thinking
  – Is the future multi-model?
• Many opportunities to innovate in this space

Acknowledgements
• Marko Rodriguez
  – Gremlin language, Gremlin OLAP
• Ketrina Yim
  – Illustrator, creator of Gremlin and friends
• Stephen Mallette
  – TinkerPop release manager, Gremlin applications
• Daniel Kuppitz
  – Gremlin language guru
• David Robinson
  – Big data, multi-model architect/developer

Questions?

Thank you!
