Using Graph Databases For Insights Into Connected Data
Gagan Agrawal
Xebia
2
Agenda
High-level view of the graph space
Comparison with RDBMS and other NoSQL stores
Data modeling
Cypher: graph query language
Graphs in the real world
Graph database internals
3
What is a Graph?
4
Graph
5
What is a Graph?
A collection of vertices and edges.
A set of nodes and the relationships that connect them.
A graph represents:
  Entities as NODES
  The ways those entities relate to the world as RELATIONSHIPS
Lets us model all kinds of scenarios:
  Road systems
  Medical histories
  Supply chain management
  Data centers
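As a rough illustration of nodes and relationships, here is a minimal Python sketch of a road system. The `Node` and `Relationship` classes, the `ROAD_TO` type, and the junction names are all hypothetical and exist only for this example; they are not the API of any graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity, e.g. a road junction (hypothetical structure)."""
    name: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed connection between two nodes."""
    start: Node
    type: str
    end: Node


# A tiny road-system graph: junctions are nodes, roads are relationships.
a = Node("Junction A")
b = Node("Junction B")
c = Node("Junction C")
roads = [
    Relationship(a, "ROAD_TO", b),
    Relationship(b, "ROAD_TO", c),
]


def neighbours(node, relationships):
    """Return the names of nodes reachable from `node` in one hop."""
    return [r.end.name for r in relationships if r.start is node]
```

Calling `neighbours(a, roads)` walks the relationship list and collects the junctions that have a road leading out of `a`.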
6
Example – Twitter's Data
7
Example – Twitter's Data
8
High Level view of Graph Space
Graph Databases - Technologies used primarily for transactional online graph persistence (OLTP).
Graph Compute Engines - Technologies used primarily for offline graph analytics (OLAP).
9
Graph Databases
Online database management system with Create, Read, Update, Delete (CRUD) methods that expose a graph data model.
Built for use with transactional (OLTP) systems.
Used for richly connected data.
Querying is performed through traversals.
Can perform millions of traversal steps per second.
A traversal step resembles a join in an RDBMS.
10
Graph Database Properties
The Underlying Storage : Native / Non-Native
The Processing Engine : Native / Non-Native
11
Graph DB – The Underlying Storage
Native Graph Storage – Optimized and designed for storing and managing graphs.
Non-Native Graph Storage – Serializes the graph data into a relational database, an object-oriented database, or some other general-purpose data store.
12
Native Graph Storage
13
Graph DB – The Processing Engine
Index-free adjacency – Connected nodes physically point to each other in the database.
14
Non-Native : Index Look-Up
15
Native : Index Free Adjacency
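The difference between index look-up and index-free adjacency can be sketched in a few lines of Python. This is only an analogy under stated assumptions: a `dict` stands in for the global index (a real database index is typically a tree structure whose lookup cost grows with the dataset), and the `Node` class and the names `alice`, `bob`, `carol` are hypothetical.

```python
# Non-native style: relationships live in a global index keyed by node id,
# so every hop costs a lookup in a shared structure.
relationship_index = {
    "alice": ["bob"],
    "bob": ["carol"],
    "carol": [],
}


def hop_with_index(node_id):
    """One hop: consult the shared index to find the neighbours."""
    return relationship_index[node_id]


# Native style (index-free adjacency): each node holds direct references
# to its neighbours, so a hop just follows a pointer.
class Node:
    def __init__(self, name):
        self.name = name
        self.neighbours = []  # direct object references, no global index


alice, bob, carol = Node("alice"), Node("bob"), Node("carol")
alice.neighbours.append(bob)
bob.neighbours.append(carol)


def two_hops(start):
    """Names of nodes exactly two pointer-hops away from `start`."""
    return [n2.name for n1 in start.neighbours for n2 in n1.neighbours]
```

In the second form the cost of a hop depends on how many neighbours a node has, not on how many nodes the whole graph contains.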
16
Graph Databases
17
Power of Graph Databases
Performance
Flexibility
Agility
18
Comparison
Relational Databases
NoSQL Databases
Graph Databases
19
Relational Databases Lack Relationships
Initially designed to codify paper forms and tabular structures.
Deal poorly with relationships.
The rise in connectedness translates into more joins.
Lower performance.
Difficult to adapt to changing business needs.
20
RDBMS
21
RDBMS
What products did a customer buy?
Which customers bought this product?
Which customers bought this product who also bought that product?
22
RDBMS
23
Query to find friends-of-friends
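The query from the slide is not preserved in this transcript. As a stand-in, the sketch below shows what a friends-of-friends query could look like in SQL, run through Python's built-in `sqlite3` module. The `person` and `friend` tables, their columns, and the sample names are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE friend (person_id INTEGER, friend_id INTEGER);
    INSERT INTO person VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol'), (4, 'Dave');
    INSERT INTO friend VALUES (1, 2), (2, 3), (2, 4);
""")

# Each extra level of depth adds another self-join on the friend table.
rows = conn.execute("""
    SELECT DISTINCT p2.name
    FROM person p1
    JOIN friend f1 ON f1.person_id = p1.id
    JOIN friend f2 ON f2.person_id = f1.friend_id
    JOIN person p2 ON p2.id = f2.friend_id
    WHERE p1.name = 'Alice' AND p2.id <> p1.id
    ORDER BY p2.name
""").fetchall()

friends_of_friends = [name for (name,) in rows]
```

Going from depth 2 to depth 3 would require a third join on `friend`, which is the pattern the slides point to when they say that more connectedness means more joins.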
24
NoSQL Databases Also Lack Relationships
NoSQL databases (e.g. key-value, document, or column-oriented stores) store sets of disconnected values/documents/columns.
This makes it difficult to use them for connected data and graphs.
One solution is to embed an aggregate's identifier inside a field belonging to another aggregate.
  Effectively introduces foreign keys.
  Requires joining aggregates at the application level.
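A minimal sketch of such an application-level join, assuming two hypothetical in-memory "collections" (`orders` and `products`) in place of a real document store:

```python
# Two hypothetical document "collections", keyed by id.
orders = {
    "order-1": {"customer_id": "cust-1", "product_ids": ["prod-1", "prod-2"]},
}
products = {
    "prod-1": {"name": "Keyboard"},
    "prod-2": {"name": "Mouse"},
}


def product_names_for_order(order_id):
    """Resolve the embedded product ids by hand; the store does not do it."""
    order = orders[order_id]
    return [products[pid]["name"] for pid in order["product_ids"]]
```

The embedded `product_ids` behave like foreign keys, but following them is the application's job rather than the database's.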
25
NoSQL DB
26
NoSQL DB
Relationships between aggregates aren't first-class citizens in the data model.
Foreign aggregate "links" are not reflexive.
Asking the database "Who has bought a particular product?" is an expensive operation.
  Need to use some external compute infrastructure, e.g. Hadoop, for such processing.
Do not maintain consistency of connected data.
Do not support index-free adjacency.
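To see why the reverse question is expensive, consider a sketch in which links point only from an order to its products. The `all_orders` data and the `customers_who_bought` helper are hypothetical; the `scanned` counter is there only to make the cost visible.

```python
# Hypothetical orders; each one links forward to the products it contains.
all_orders = {
    "order-1": {"customer": "alice", "product_ids": ["prod-1"]},
    "order-2": {"customer": "bob", "product_ids": ["prod-2"]},
    "order-3": {"customer": "carol", "product_ids": ["prod-1", "prod-2"]},
}


def customers_who_bought(product_id):
    """Links point only from order to product, so the reverse question
    has to examine every order."""
    scanned = 0
    buyers = []
    for order in all_orders.values():
        scanned += 1
        if product_id in order["product_ids"]:
            buyers.append(order["customer"])
    return sorted(buyers), scanned
```

Every order is examined no matter how few customers actually bought the product, so the work grows with the size of the whole collection.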
27
NoSQL DB
28
Graph DB Embraces Relationships
29
Graph DB
Find friends-of-friends in a social network, to a maximum depth of 5.
Total records: 1,000,000
Each with approximately 50 friends
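A depth-limited traversal of this kind can be sketched as a breadth-first search over adjacency lists. The `friends` data below is a hypothetical four-person network, far smaller than the one in the experiment, and the code is an illustration rather than how any particular graph database implements traversal.

```python
from collections import deque

# Hypothetical adjacency lists: person -> direct friends.
friends = {
    "alice": ["bob"],
    "bob": ["alice", "carol"],
    "carol": ["bob", "dave"],
    "dave": ["carol"],
}


def friends_within(start, max_depth):
    """Breadth-first traversal returning {person: depth} up to max_depth hops."""
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if depth_of[person] == max_depth:
            continue  # do not expand beyond the depth limit
        for friend in friends[person]:
            if friend not in depth_of:
                depth_of[friend] = depth_of[person] + 1
                queue.append(friend)
    del depth_of[start]  # the starting person is not their own friend
    return depth_of
```

Raising `max_depth` from 2 to 5 changes only the argument, whereas the relational version needs one more join per level.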
30
Graph DB
31
NoSQL Comparison
32
Data Modeling with Graph
33
Data Modeling
“Whiteboard” friendly.
The typical whiteboard view of a problem is a GRAPH.
What we sketch in our creative and analytical modes maps closely to the data model inside the database.
34
The Property Graph Model
35
Cypher : Graph Query Language
Pattern-matching query language
Humane language
Expressive
Declarative: say what you want, not how
Borrows from well-known query languages
Aggregation, ordering, limit
Updates the graph
36
Cypher
Cypher Representation :
(c)-[:KNOWS]->(b)-[:KNOWS]->(a), (c)-[:KNOWS]->(a)
(c)-[:KNOWS]->(b)-[:KNOWS]->(a)<-[:KNOWS]-(c)
37
Cypher
START c=node:user(name='Michael')
MATCH (c)-[:KNOWS]->(b)-[:KNOWS]->(a),
      (c)-[:KNOWS]->(a)
RETURN a, b
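To make the pattern concrete, the Python sketch below finds the same triangle shape by hand: c KNOWS b, b KNOWS a, and c KNOWS a. The `knows` pairs and every name other than Michael are made up for this example, and the nested loops are only one naive way to evaluate the pattern, not how Cypher executes it.

```python
# Hypothetical KNOWS relationships as (start, end) pairs.
knows = {
    ("Michael", "Anna"),
    ("Anna", "Peter"),
    ("Michael", "Peter"),
    ("Anna", "Lisa"),
}


def match_triangle(c):
    """Find (a, b) such that c KNOWS b, b KNOWS a, and c KNOWS a."""
    results = []
    for start, b in knows:
        if start != c:
            continue
        for mid, a in knows:
            if mid == b and (c, a) in knows:
                results.append((a, b))
    return sorted(results)
```

Here `match_triangle("Michael")` returns the `(a, b)` pairs that the `RETURN a, b` clause would produce for this data. Lisa is excluded because Michael does not know her directly.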
38
Other Cypher Clauses
WHERE
  Provides criteria for filtering pattern-matching results.
CREATE and CREATE UNIQUE
  Create nodes and relationships.
DELETE
  Removes nodes, relationships and properties.
SET
  Sets property values.
39
Other Cypher Clauses
FOREACH
  Performs an updating action for each element in a list.
UNION
  Merges results from two or more queries.
WITH
  Chains subsequent query parts and forwards results from one to the next. Similar to piping commands in UNIX.
40
Comparison of Relational and Graph Modeling
41
Systems Management Domain
42
Entity Relationship Diagram
43
Tables and Relationships
44
Graph Representation
45
Query to find faulty Equipment
46
Matched Paths
47
Fine Grained vs Generic Relationships
DELIVERY_ADDRESS
VS
ADDRESS{type : 'delivery'}
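The trade-off between the two styles can be sketched with plain tuples. Everything here is hypothetical, including the tuple layout, the `BILLING_ADDRESS` type, and the sample ids. The point is only to show which questions each style answers with a type check alone.

```python
# Hypothetical relationships stored as (start, type, properties, end).
fine_grained = [
    ("order-1", "DELIVERY_ADDRESS", {}, "addr-1"),
    ("order-1", "BILLING_ADDRESS", {}, "addr-2"),
]
generic = [
    ("order-1", "ADDRESS", {"type": "delivery"}, "addr-1"),
    ("order-1", "ADDRESS", {"type": "billing"}, "addr-2"),
]


def delivery_fine(rels):
    # The relationship type alone identifies the delivery address.
    return [end for _, rel_type, _, end in rels if rel_type == "DELIVERY_ADDRESS"]


def delivery_generic(rels):
    # The type matches every address; a property check narrows it down.
    return [end for _, rel_type, props, end in rels
            if rel_type == "ADDRESS" and props.get("type") == "delivery"]


def all_addresses_generic(rels):
    # The generic form makes "all addresses" a single-type query.
    return [end for _, rel_type, _, end in rels if rel_type == "ADDRESS"]
```

Fine-grained types keep specific lookups cheap, while the generic type makes "every address" easy to ask for at the cost of a property check for the specific case.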
48
49
50
Graphs in the Real World
51
Common Use Cases
Social
Recommendations
Geo
Logistics Networks: for package routing, finding the shortest path
Financial Transaction Graphs: for fraud detection
Master Data Management
Bioinformatics: Era7, to relate a complex web of information that includes genes, proteins and enzymes
Authorization and Access Control: Adobe Creative Cloud, Telenor
52
Graph Database Internals
53
Non-Functional Characteristics
Transactions
  Fully ACID
Recoverability
Availability
Scalability
54
Scalability
Capacity (Graph Size)
Latency (Response Time)
Read and Write Throughput
55
Capacity
The 1.9 release of Neo4j can support single graphs with tens of billions of nodes, relationships and properties.
The Neo4j team has publicly expressed the intention to support 100B+ nodes/relationships/properties in a single graph as part of its 2013 roadmap.
56
Latency
RDBMS – more data in tables/indexes results in longer join operations.
Graph DB doesn't suffer the same latency problem.
  An index is used to find the starting node.
  Traversal uses a combination of pointer chasing and pattern matching to search the data.
Performance does not depend on the total size of the dataset.
  It depends only on the data being queried.
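One way to picture this claim is to count how many nodes a traversal actually touches in a small graph and in a much larger one. The sketch below is a toy model under stated assumptions: `build_graph` and `nodes_touched` are hypothetical helpers, and the extra nodes are deliberately unconnected to the queried neighbourhood.

```python
def build_graph(extra_nodes):
    """A small friend cluster plus `extra_nodes` unrelated, unconnected nodes."""
    graph = {"alice": ["bob", "carol"], "bob": ["dave"], "carol": [], "dave": []}
    for i in range(extra_nodes):
        graph[f"other-{i}"] = []
    return graph


def nodes_touched(graph, start, depth):
    """Count the nodes visited by a traversal of at most `depth` hops."""
    frontier, seen = [start], {start}
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return len(seen)


small = build_graph(extra_nodes=10)
large = build_graph(extra_nodes=100_000)
```

The traversal from `alice` visits the same handful of nodes in both graphs, even though `large` holds far more data overall.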
57
Throughput
Constant performance irrespective of graph size.
58
Who uses Neo4j?
59
Resources
60
Thank You