Introduction to Graph Database Management System

1) Introduction to Graph Database Management System

Graph DBMSs are database management systems for graph databases, a type of datastore in which the relationships between things are as important as the things themselves. Examples of datasets that are a natural fit for graph databases:

A computer network
Friend links on a social network
The World Wide Web

In graph databases, a thing (a person, a website, a host) is referred to as a “node,” while a relationship between two things (a friendship, a connection between hosts, an href link) is referred to as an “edge.”

In most types of databases, the records stored in the database are nodes, and edges (relationships) are derived from a field on a node. In a SQL database, for example, we might have a table called “people” that includes a field “friend_id.” friend_id is a reference to another record in the people table.

The weakness of reference fields becomes apparent as soon as we want to model many-to-many relationships or store data about the relationship itself. A person can have many friends, and we might want to track the date the friendship link was created, or whether the two people are married.

The solution to this in a SQL database is a join table. In the people/friends example, your join table might be called “friendships”. But this method has some weaknesses. One is that it can greatly increase the number of tables in your database, and may make it hard to tell standard tables (nodes) apart from join tables (edges), which makes it more difficult for new developers to comprehend the database architecture. But the biggest weakness is that queries against relationship data, whether stored in a join table or in a reference field, are extremely unwieldy. In a SQL database, such queries typically require recursive joins, which tend to produce long, incomprehensible SQL statements and unpredictable performance.
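To make the join-table approach concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the created_on column and the sample names are assumptions made for illustration; they are not taken from any particular schema. The recursive query shows how quickly even a simple "who is reachable from this person" question grows in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Join table: one row per friendship edge, plus data about the edge.
    CREATE TABLE friendships (
        person_id  INTEGER NOT NULL REFERENCES people(id),
        friend_id  INTEGER NOT NULL REFERENCES people(id),
        created_on TEXT NOT NULL,
        PRIMARY KEY (person_id, friend_id)
    );
    INSERT INTO people VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy'), (4, 'Di');
    INSERT INTO friendships VALUES
        (1, 2, '2020-01-01'), (2, 3, '2021-06-15'), (3, 4, '2022-03-09');
""")

# Everyone reachable from the start person by following friendship rows
# in one direction, limited to max_depth hops.
REACHABLE_SQL = """
    WITH RECURSIVE reachable(id, depth) AS (
        SELECT friend_id, 1 FROM friendships WHERE person_id = :start
        UNION
        SELECT f.friend_id, r.depth + 1
        FROM friendships AS f JOIN reachable AS r ON f.person_id = r.id
        WHERE r.depth < :max_depth
    )
    SELECT p.name, MIN(r.depth) AS depth
    FROM reachable AS r JOIN people AS p ON p.id = r.id
    GROUP BY p.id ORDER BY depth, p.name
"""
rows = conn.execute(REACHABLE_SQL, {"start": 1, "max_depth": 3}).fetchall()
print(rows)
```

Note that the query already needs a recursive common table expression, a depth guard and a second join just to recover the names, and it still follows friendships in one direction only.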

A graph database is designed to represent this type of information, so it models the data more naturally. It is also designed to query it: you can walk the data in a convenient and performant manner.
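As a rough illustration of what "walking the data" means, the sketch below keeps nodes and edges as first-class objects in plain Python and finds friends within a given number of hops with a breadth-first walk. The node names, the edge properties (since, married) and the within_hops helper are hypothetical; a real graph DBMS would provide its own storage and traversal API.

```python
from collections import deque

# Nodes keyed by id; each edge carries its own properties.
nodes = {1: {"name": "Ann"}, 2: {"name": "Bob"}, 3: {"name": "Cy"}, 4: {"name": "Di"}}
edges = [
    (1, 2, {"since": "2020-01-01", "married": False}),
    (2, 3, {"since": "2021-06-15", "married": True}),
    (3, 4, {"since": "2022-03-09", "married": False}),
]

# Adjacency list: friendship is treated as undirected here.
adjacency = {node_id: [] for node_id in nodes}
for a, b, props in edges:
    adjacency[a].append((b, props))
    adjacency[b].append((a, props))


def within_hops(start, max_hops):
    """Breadth-first walk returning {node_id: hops} for nodes within max_hops."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if hops[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour, _props in adjacency[current]:
            if neighbour not in hops:
                hops[neighbour] = hops[current] + 1
                queue.append(neighbour)
    del hops[start]
    return hops


friends_of_friends = within_hops(1, 2)
print({nodes[n]["name"]: h for n, h in friends_of_friends.items()})
```

The traversal only touches the neighbours of nodes it has already reached, which is the property that makes relationship-heavy queries cheap to express in a graph model.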

Image Source: http://www.linkeddatatools.com/introducing-rdf

2) Query Processing in the GDBMS

Organizations publishing data through a queryable, public SPARQL endpoint will, in most circumstances, not want just anyone writing to and changing their (valuable) data. At some level, though, update support does seem like a missing capability, particularly if you are using triple data stores locally rather than publicly across the web and want to make changes using a query language (as you would with a SQL database). How could this be done?

At the moment, new specifications such as SPARUL (SPARQL/Update) and SPARQL+ are being developed to address this problem; however, a solid contender to fill this gap in functionality is yet to emerge.

As RDF data becomes more widely adopted, however, expect a winning contender to emerge and be implemented by Semantic Web frameworks. The Resource Description Framework (RDF) can be viewed from at least two perspectives: (1) from a logical perspective, as a minimal fragment of logic that includes all relevant features needed in a representation language for metadata, or, as the W3C recommendation [Hay04] says, "RDF is an assertional language intended to be used to express propositions using precise formal vocabularies"; and (2) from a database perspective, as an extension of the data models used in the database community, in particular graph database models. The former point of view has been an active area of research. This does not come as a surprise, given that RDF emerged as a language to represent metadata on the Web, distilling the experience of the community of knowledge representation and Web researchers and developers.

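RDF data stores can be queried using their own query language, SPARQL (SPARQL Protocol and RDF Query Language, pronounced "sparkle"), much as relational databases are queried with SQL. SPARQL is, however, a little more sophisticated: instead of selecting rows from tables, it matches graph patterns against the stored triples.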

    PREFIX sch-ont: <http://education.data.gov.uk/def/school/>
    SELECT ?name WHERE {
      ?school a sch-ont:School .
      ?school sch-ont:establishmentName ?name .
      ?school sch-ont:districtAdministrative <http://statistics.data.gov.uk/id/local-authority-district/00AA> .
    }
    ORDER BY ?name

A SPARQL query uses a SELECT clause to determine which variables from the matched data are returned, and a WHERE clause to define the graph patterns to match against the query data set. Each graph pattern in a SPARQL WHERE clause consists of a subject, predicate and object triple to match in the data. Let's explore this further by taking a closer look at the example above.

In SPARQL, variable names are prefixed with the question mark ("?") symbol. In a query graph pattern, they match any node - whether resource or literal.

Notice that the ?name variable from the SELECT clause also appears in the WHERE clause, as the object of the second search pattern. Note the ?school variable too. Because a variable rather than a specific URI is given as the subject, any matching subject URI is accepted for this part of the query pattern, and the result is bound to that variable name.

Hence, in the above SPARQL query, ?name returns the names of all the schools which match the three search patterns given in the query. If we wanted, we could make the query more specific by adding further match criteria. Or we could make it broader, for example by removing the last search pattern, which requires the school to match the district administrative value "00AA".

Lastly, note that the ?school variable appears as the subject of all three search patterns, so a subject is bound to it only if that same subject satisfies all three. But since ?school is not listed in the SELECT clause of this query, it is bound but not returned in the result set.
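The binding behaviour described above can be imitated in a few lines of Python. This is only a sketch of conjunctive triple-pattern matching, not how a SPARQL engine is actually implemented; the triples, the shortened "ex:" and "district:" names and the select helper are all made up for illustration.

```python
# Hypothetical triples; prefixed names stand in for full URIs.
TRIPLES = [
    ("ex:s1", "rdf:type", "sch-ont:School"),
    ("ex:s1", "sch-ont:establishmentName", "Oak Primary"),
    ("ex:s1", "sch-ont:districtAdministrative", "district:00AA"),
    ("ex:s2", "rdf:type", "sch-ont:School"),
    ("ex:s2", "sch-ont:establishmentName", "Elm Academy"),
    ("ex:s2", "sch-ont:districtAdministrative", "district:00AB"),
    ("ex:s3", "rdf:type", "sch-ont:School"),
    ("ex:s3", "sch-ont:establishmentName", "Ash High"),
    ("ex:s3", "sch-ont:districtAdministrative", "district:00AA"),
]


def match_pattern(pattern, triple, binding):
    """Extend binding so that pattern equals triple, or return None on conflict."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):  # a variable: bind it or check the old binding
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:  # a constant: must match exactly
            return None
    return extended


def select(variables, patterns, triples):
    """Evaluate a conjunction of triple patterns and project the given variables."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    # Sorting the projected rows plays the role of ORDER BY here.
    return sorted(tuple(b[v] for v in variables) for b in bindings)


WHERE = [
    ("?school", "rdf:type", "sch-ont:School"),
    ("?school", "sch-ont:establishmentName", "?name"),
    ("?school", "sch-ont:districtAdministrative", "district:00AA"),
]
names = select(["?name"], WHERE, TRIPLES)
print(names)
```

Because ?school must take the same value in all three patterns, the school in district 00AB is dropped, and because only ?name is projected, the school identifiers never appear in the result.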

3) Index Structures supported by the GDBMS

Graph indexing plays a critical role in processing graph containment queries on large graph databases, which have gained increasing popularity in bioinformatics, Web analysis, pattern recognition and other applications involving graph structures. Indexing methods are introduced to avoid unnecessary traversals of the database during the evaluation of a path expression. One solution to the graph containment query is to index paths in the graph databases; this is often referred to as the path-based indexing approach, of which GraphGrep is a well-known example. Given a graph database G and a query graph, a graph containment query is answered in two phases. The first, filtering phase selects a set of candidate graphs from G in which the number of occurrences of each indexed path feature is at least that in the query. The second, verification phase checks each graph in the candidate set derived from the first phase, rather than every graph in G, by subgraph isomorphism testing. False positives are discarded and the true answer set is returned.
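The two phases can be sketched as follows. This is a toy version written for illustration, not the algorithm of any specific system: path features are label sequences of simple paths with at most two edges, verification is a brute-force subgraph isomorphism test that is only usable on tiny graphs, and all function names and sample graphs are assumptions.

```python
from collections import Counter
from itertools import permutations


def make_graph(labels, edges):
    """labels: {node: label}; edges: iterable of undirected (u, v) pairs."""
    adj = {n: set() for n in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return {"labels": labels, "adj": adj}


def path_features(graph, max_edges=2):
    """Count label sequences of simple paths with up to max_edges edges."""
    counts = Counter()

    def extend(path):
        counts[tuple(graph["labels"][n] for n in path)] += 1
        if len(path) <= max_edges:  # the path has len(path) - 1 edges so far
            for nxt in graph["adj"][path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for node in graph["labels"]:
        extend([node])
    return counts


def filter_candidates(index, query):
    """Phase 1: keep graphs whose path counts dominate those of the query."""
    wanted = path_features(query)
    return [name for name, feats in index.items()
            if all(feats[f] >= c for f, c in wanted.items())]


def contains(graph, query):
    """Phase 2: naive subgraph isomorphism test over injective node mappings."""
    q_nodes = list(query["labels"])
    for image in permutations(graph["labels"], len(q_nodes)):
        m = dict(zip(q_nodes, image))
        labels_ok = all(query["labels"][n] == graph["labels"][m[n]] for n in q_nodes)
        edges_ok = all(m[v] in graph["adj"][m[u]]
                       for u in q_nodes for v in query["adj"][u])
        if labels_ok and edges_ok:
            return True
    return False


carbon = {0: "C", 1: "C", 2: "C", 3: "C"}
database = {
    "triangle": make_graph({0: "C", 1: "C", 2: "C"}, [(0, 1), (1, 2), (0, 2)]),
    "square": make_graph(carbon, [(0, 1), (1, 2), (2, 3), (3, 0)]),
    "edge": make_graph({0: "C", 1: "O"}, [(0, 1)]),
}
index = {name: path_features(g) for name, g in database.items()}

query = make_graph({0: "C", 1: "C", 2: "C"}, [(0, 1), (1, 2), (0, 2)])
candidates = filter_candidates(index, query)
answers = [name for name in candidates if contains(database[name], query)]
print(candidates, answers)
```

In this example the square passes the filter because it has at least as many short paths as the triangle query, yet it contains no triangle; it is exactly the kind of false positive that the verification phase discards.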

In contrast to the path-based indexing approach, another mechanism uses graphs as the basic indexing features; it is often referred to as the graph-based indexing approach. A distinguished example of this approach is gIndex. gIndex uses a graph mining procedure to discover frequent graphs in G, from which the index is constructed. To scale down the exponential number of frequent graphs, gIndex selects only the discriminative ones as indexing features. gIndex has several advantages over the path-based GraphGrep. First, the structural information of the graphs is well preserved, which is critical to filtering false positives in the verification phase. Second, the number of discriminative frequent graph features is much smaller than the number of path features, so the index is compact and fits easily in main memory. Third, discriminative frequent graphs are relatively stable under database updates, which makes incremental index maintenance feasible. The disadvantages of gIndex are also clear. First, index construction is a time-consuming graph mining procedure in which computationally expensive (sub)graph isomorphism tests are unavoidable; the construction cost can be very high when |G| is large or when the graphs in G are large and diverse. Second, gIndex assumes that the discriminative frequent graphs discovered from G are also the ones most likely to appear in query graphs, an assumption that need not hold for every query workload.

Another method of indexing graph databases is used in order to facilitate subgraph isomorphism and similarity queries. The index is comprised of two major data structures. The primary structure is a directed acyclic graph which contains a node for each of the unique, induced subgraphs of the database graphs. The secondary structure is a hash table which cross-indexes each subgraph for fast isomorphic lookup. In order to create a hash key independent of isomorphism, a code-based canonical representation of adjacency matrices is used, which is further refined to improve computation speed. Experiments show that for subgraph isomorphism queries, this method outperforms the traditional methods by more than an order of magnitude.
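The idea of an isomorphism-independent hash key can be shown with a deliberately simple canonical code. The sketch below tries every node ordering and keeps the smallest upper-triangle adjacency bit string; this brute-force choice is an assumption made for illustration and is far slower than the refined code-based representation the method above relies on. It handles unlabeled graphs only.

```python
from itertools import permutations


def canonical_code(n, edges):
    """Smallest upper-triangle adjacency bit string over all node orderings."""
    edge_set = {frozenset(e) for e in edges}
    best = None
    for order in permutations(range(n)):
        code = "".join(
            "1" if frozenset((order[i], order[j])) in edge_set else "0"
            for i in range(n) for j in range(i + 1, n)
        )
        if best is None or code < best:
            best = code
    return best


# Two differently numbered 3-node paths and one triangle.
path_a = canonical_code(3, [(0, 1), (1, 2)])
path_b = canonical_code(3, [(2, 0), (0, 1)])
triangle = canonical_code(3, [(0, 1), (1, 2), (0, 2)])

# Hash table keyed by canonical code: isomorphic subgraphs share one bucket.
buckets = {}
for graph_id, code in [("g1", path_a), ("g2", path_b), ("g3", triangle)]:
    buckets.setdefault(code, []).append(graph_id)
print(buckets)
```

Isomorphic graphs produce the same code regardless of how their nodes are numbered, so a lookup for a query subgraph is a single dictionary access once its code has been computed.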

4) Query Optimization using Indexes

SQL Server can use indexes to perform seek and scan operations. Indexes can speed up the execution of a query by quickly finding records without performing table scans, or by delivering all the columns requested by the query without accessing the base table (that is, by covering the query).

Part of the Query Optimizer’s job is to determine whether an index can be used to evaluate a predicate in a query. This is basically a comparison between an index key and a constant or variable. In addition, the Query Optimizer needs to determine whether the index covers the query; that is, whether the index contains all the columns required by the query (such an index is referred to as a “covering index”). It needs to confirm this because a non-clustered index usually contains only a subset of the columns of the table.

SQL Server can also consider using more than one index, joining them to cover all the columns required by the query (index intersection). If it is not possible to cover all of the columns required by the query, the Query Optimizer may need to access the base table, which could be a clustered index or a heap, to obtain the remaining columns. This is called a bookmark lookup operation (which could be a Key Lookup or an RID Lookup). However, since a bookmark lookup requires random I/O, which is a very expensive operation, combining an index seek with a bookmark lookup is only effective for a relatively small number of records.
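The difference between a covering index and an index that still needs a lookup into the base table can be observed in any cost-based engine. The sketch below uses SQLite through Python's sqlite3 module as a stand-in, because it runs without a server; SQL Server's plans and terminology (Key Lookup, RID Lookup) differ, and the table, index and column names here are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, city TEXT, bio TEXT);
    CREATE INDEX idx_people_city_name ON people (city, name);
""")


def plan(sql):
    """Return the detail text of SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# Both requested columns live in the index, so the table is never touched.
covered = plan("SELECT name FROM people WHERE city = 'Oslo'")

# bio is not in the index, so each match needs an extra fetch from the table.
lookup = plan("SELECT bio FROM people WHERE city = 'Oslo'")

print(covered)
print(lookup)
```

SQLite labels the first plan as using a covering index and the second as using the same index without the covering label, which corresponds to the seek-plus-lookup pattern described above.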

Although one or more indexes can be used, that does not mean they will finally be selected in an execution plan, as this is always a cost-based decision. So, after creating an index, we need to verify that the index is, in fact, used in a plan (and, of course, that the query is performing better, which is probably the primary reason for defining the index). An index that is not being used by any query just takes up valuable disk space and may negatively impact the performance of update operations without providing any benefit. It is also possible that an index which was useful when it was originally created is no longer used by any query. This could be a result of changes in the database schema, the data, or even the query itself. Periodically reviewing which indexes are actually used helps avoid this frustrating situation.
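Checking whether an index actually appears in a plan can be automated. The following sketch again uses SQLite as a stand-in for illustration; in SQL Server the same check would be done by inspecting the execution plan. The index_used helper and the orders table are hypothetical.

```python
import sqlite3


def index_used(conn, sql, index_name):
    """Return True if SQLite's chosen plan for sql mentions index_name."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return any(index_name in row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
    CREATE INDEX idx_orders_customer ON orders (customer);
""")

# The predicate is on the indexed column, so the index can be used for a seek.
used = index_used(
    conn, "SELECT * FROM orders WHERE customer = 'acme'", "idx_orders_customer")

# The predicate is on a column the index does not contain: full scan instead.
unused = index_used(
    conn, "SELECT * FROM orders WHERE total > 100", "idx_orders_customer")

print(used, unused)
```

Running such a check against the queries an application really issues is one way to spot indexes that cost space and update time without ever being chosen.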

5) Conclusion:

The reason to use a graph database is that the data being stored by the system and the operations the system is doing with the data are exactly the weak spot of relational databases and are exactly the strong spot of graph databases. The system needed to store collections of objects that lack a fixed schema and are linked together by relationships. To reason about the data, the system needed to do a lot of operations that would be a couple of traversals in a graph database, but that would be quite complex queries in SQL.

The main advantages of the graph model are rapid development time and flexibility. We could quickly add new functionality without impacting existing deployments. Flexibility also helped when we were designing a new feature, saving us from trying to squeeze new data into a rigid data model.

If you're building a product for enterprise customers and your data fits into the relational model, use a relational database if you can. If your application doesn't fit the relational model but it does fit the graph model, use a graph database. If it only fits something else, use that. Whatever you choose, try not to build the database engine yourself unless you really like building database engines.

Defining an index structure for a graph database can improve its performance and speed up the retrieval of records. Which index to use (a path index, gIndex, a tree index, or Tree + delta) depends on the situation, and these can also be used in combination.

The discussion of query optimization explained how we can define the keys of graph indexes so that they are likely to be considered for seek operations, which can improve the performance of graph queries by finding records more quickly.