VIIQ: AutoSuggestion Enabled Visual Interface for Interactive Graph Query Formulation

Presented by: Zohreh, Fall 2015


Page 1:

VIIQ: AutoSuggestion Enabled Visual Interface for Interactive Graph Query Formulation

Presented by: Zohreh

Fall 2015

Page 2:

Authors

Nandish Jayaram

University of Texas at Arlington

Sidharth Goyal

University of Texas at Arlington

Chengkai Li

University of Texas at Arlington

Page 3:

Introduction

An unprecedented proliferation of heterogeneous graphs

with thousands of node/edge types

Complex relationships in schema-less data

Query graphs are used to:

specify the query intent for such graphs

Formulating these query graphs is a daunting task

It requires users to know a vocabulary comprising many labels and types

Page 4:

Introduction

Graph query systems allow users to construct query graphs

through a visual interface

The focus of these systems is query processing

Their query formulation components are limited to a graphical platform

To add nodes and edges with ease using mouse and keyboard actions

Little help is offered in easily choosing the labels of the various components in a query graph

Page 5:

Introduction

Every time a new query component is added,

users are inundated with possibly hundreds of options

for the new component’s label, sorted alphabetically.

It is a daunting task to browse through all the options

to select the appropriate label to add

Page 6:

Related Work

There are other querying paradigms that help users query graph data

Declarative languages like SPARQL are used to exactly specify query intent

But present a usability barrier

Keyword search, approximate graph queries, and query-by-example simplify query formulation

Cannot be used to specify users’ exact query intent

Existing systems help users specify queries either easily or exactly,

But not both

Page 7:

VIIQ (Visual Interface for Interactive graph Query formulation),

helps users easily construct various query graph components

VIIQ automatically suggests new edges and nodes to add

to a partially constructed query graph

Users can also add nodes or edges manually,

whose labels are ranked and presented according to how likely they are to be of interest to the user

VIIQ is the first visual query formulation system

that actively makes ranked suggestions

Page 8:

Contribution

VIIQ supports two modes of operation, passive and active

By default VIIQ operates in passive mode

Based on the partially constructed query graph, the system automatically recommends the top-k new edges relevant to the user’s query intent.
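To make this concrete, here is a minimal sketch of such a top-k edge suggestion, assuming a hypothetical precomputed table of (source label, edge label, destination label) pattern frequencies from the data graph; it illustrates the idea only and is not VIIQ's actual ranking algorithm:

```python
from collections import Counter

def suggest_edges(partial_labels, pattern_counts, k=5):
    """Rank candidate edges to suggest for a partial query graph.

    pattern_counts: Counter mapping (src_label, edge_label, dst_label)
    triples to their frequency in the data graph (assumed precomputed).
    A candidate scores by how often it co-occurs with labels already
    present in the partial query.
    """
    scores = Counter()
    for (src, edge, dst), count in pattern_counts.items():
        if src in partial_labels or dst in partial_labels:
            scores[(src, edge, dst)] += count
    return [pattern for pattern, _ in scores.most_common(k)]

# Hypothetical pattern statistics for a movie knowledge graph.
patterns = Counter({
    ("Person", "actedIn", "Film"): 300,
    ("Person", "directed", "Film"): 120,
    ("Film", "hasGenre", "Genre"): 80,
})
print(suggest_edges({"Film"}, patterns, k=2))
# [('Person', 'actedIn', 'Film'), ('Person', 'directed', 'Film')]
```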

Page 9:

Fig. 3 shows a snapshot of a partially constructed query graph, with nodes and edges suggested in passive mode.

The nodes in grey and the edges incident on them are the new automatic suggestions made by the system.

Page 10:

Contribution

The active mode is triggered when the user adds new nodes or edges to the partial query graph

For a newly added node, the suggested labels are displayed hierarchically

In a pop-up box

For a newly added edge, the suggested edge labels are ranked

based on the likelihood of their relevance to the user’s query intent

Page 11:

On Uncertain Graphs Modeling and Queries

Presented by: Zohreh

Fall 2015

Page 12:

Authors

Arijit Khan (Systems Group, ETH Zurich, Switzerland)

Lei Chen (The Hong Kong University of Science and Technology)

Page 13:

Introduction

The availability of network data has increased dramatically

Uncertainty is evident in graph data due to a variety of reasons

such as noisy measurements, inconsistent, incorrect, and possibly ambiguous information sources

In these cases, data is represented as an uncertain graph

Nodes, edges, and attributes are accompanied by a probability of existence

Page 14:

MODELING OF UNCERTAIN GRAPHS

Uncertainty Models: Independent Probabilities

Components in the graph are independent of one another

Interprets uncertain graphs according to the well-known possible-world semantics

For example, an uncertain graph with m edges yields 2^m possible deterministic graphs (possible worlds)
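To make the possible-world semantics concrete, here is a minimal sketch (with made-up edge probabilities) that enumerates all 2^m worlds of a tiny uncertain graph:

```python
from itertools import product

# Each edge exists independently with the given (made-up) probability.
edges = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.3}

# Enumerate all 2^m possible worlds; a world's probability is the product
# of p(e) for its present edges and 1 - p(e) for its absent ones.
worlds = []
for mask in product([True, False], repeat=len(edges)):
    prob = 1.0
    present = []
    for (edge, p), keep in zip(edges.items(), mask):
        prob *= p if keep else 1 - p
        if keep:
            present.append(edge)
    worlds.append((present, prob))

print(len(worlds))  # 8 = 2^3 deterministic graphs
assert abs(sum(p for _, p in worlds) - 1.0) < 1e-9  # probabilities sum to 1
```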

Page 15:

Correlated Probabilities

The independent model ignores the correlations among various graph components

For example, in a traffic network:

If a road is congested at a certain point in time, the roads connected to it are also likely to be congested

There are a few works that model such correlations with conditional probabilities

Page 16:

Challenges: semantics and computation

From the perspective of the semantics:

There is no uniform model of uncertain graphs;

The assignment and interpretation of the probabilities are application specific.

How should the shortest path between two nodes in an uncertain graph be defined?

The definition could depend on the application and the specific uncertainty semantic

Page 17:

Challenges : computation perspective

While many graph algorithms, such as subgraph isomorphism, are intrinsically hard problems,

even the simplest graph queries, such as reachability and shortest path, become #P-complete

and much more expensive over uncertain graphs

Therefore, exact computation is almost infeasible

with today’s large-scale graph data

The focus nowadays is on designing approximation algorithms

With efficient sampling, indexing, and filtering strategies
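As an illustration of the sampling-based direction (a generic Monte Carlo sketch, not a specific algorithm from the literature), the probability that one node reaches another can be estimated by repeatedly materializing a possible world and running an ordinary graph search on it:

```python
import random

def reachable(adj, src, dst):
    """Plain DFS reachability on one sampled deterministic graph."""
    stack, seen = [src], {src}
    while stack:
        u = stack.pop()
        if u == dst:
            return True
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False

def estimate_reachability(edges, src, dst, samples=10_000):
    """Estimate P(src reaches dst) over the possible worlds by sampling."""
    hits = 0
    for _ in range(samples):
        adj = {}
        for (u, v), p in edges.items():
            if random.random() < p:  # the edge survives in this world
                adj.setdefault(u, []).append(v)
        hits += reachable(adj, src, dst)
    return hits / samples

edges = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.3}
print(estimate_reachability(edges, "a", "c"))  # exact value is 0.615
```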

Page 18:

MAJOR OPEN PROBLEMS

Exact computation is infeasible over large-scale uncertain graphs

It is important to identify the relevant application areas and trade-offs, e.g., efficiency vs. effectiveness

To re-define the semantics of many classical graph operations, e.g., centrality measures and graph partitioning

Page 19:

A Demonstration of TripleProv: Tracking and Querying Provenance over

Web Data

Presented by: Ashkan Malekloo

Fall 2015

Page 20:

Paper Info

Type: Demonstration paper

Authors: Marcin Wylot, Philippe Cudré-Mauroux, Paul Groth

Publication: VLDB 2015

Page 21:

Introduction

Heterogeneity of RDF data

Ease of integration

Examples:

One may want to analyze which sources were instrumental in providing results

How data sources were combined

Filtering the results

Page 22:

Introduction

Find me all the titles of articles about “Obama” but derive the answer only from sources attributed to “US News”.
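As a toy illustration of what such a provenance-scoped query means (plain Python over made-up quadruples; TripleProv's actual storage model and query interface differ):

```python
# Toy quadruples: (subject, predicate, object, source).
quads = [
    ("article1", "about", "Obama", "US News"),
    ("article1", "title", "Obama visits Paris", "US News"),
    ("article2", "about", "Obama", "Some Blog"),
    ("article2", "title", "Unverified rumor", "Some Blog"),
]

def titles_about(topic, source):
    """Answer the query while deriving it only from the given source."""
    trusted = [q for q in quads if q[3] == source]   # provenance predicate
    about = {s for s, p, o, _ in trusted if p == "about" and o == topic}
    return [o for s, p, o, _ in trusted if p == "title" and s in about]

print(titles_about("Obama", "US News"))  # ['Obama visits Paris']
```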

Page 23:

Introduction

No current triple store is able to automatically derive provenance data for the results it produces or to tailor queries with provenance data.

Storing Quadruples

Named Graphs

Page 24:

TripleProv

A new RDF database system supporting the transparent and automatic derivation of detailed provenance information for arbitrary queries and the execution of queries incorporating provenance predicates

It is based on a native RDF store

Enables tracing provenance at two different granularity levels

Page 25:

SAASFEE: Scalable Scientific Workflow Execution Engine

Marc Bux, Jörgen Brandt, Carsten Lipka,

Kamal Hakimzadeh, Jim Dowling, Ulf Leser

Demonstration Paper

2015 VLDB

Presented by: Omar Alqahtani

Fall 2015

Page 26:

Motivation

Scientific data is analyzed by complex pipelines composed of highly specialized, domain-dependent tools.

Scientific workflow management systems (SWfMSs) facilitate the design, implementation, execution, optimization, monitoring, and exchange of such heterogeneous pipelines.

Page 27:

Existing SWfMSs

Roughly divided into three groups:

Taverna, Kepler, Galaxy

Askalon, Pegasus

YARN, MESOS

No existing platform is capable of:

Embracing the ever-evolving research tools,

Scaling to very large data sets, and

Executing arbitrarily complex workflows.

Page 28:

SAASFEE

It is a SWfMS which runs arbitrarily complex workflows on Hadoop YARN.

SAASFEE workflows are specified in Cuneiform.

Cuneiform workflows are executed on Hi-WAY.

Page 29:

Capabilities

The ability to execute iterative workflows,

An adaptive task scheduler,

Re-executable provenance traces,

Compatibility with selected other workflow systems

Page 30:

Tutorial: SQL-on-Hadoop Systems

Presented by:

Ranjan

Fall 2015

Page 31:

Why SQL?

Data exploration

Structured data

organization of the data in tables

optimized data access

Declarative data processing

No need to have developer skills

Portable – universal language

SQL drivers supported

No need for a Hadoop client installation

Easier integration with the current systems

Page 32:

Hadoop overview

Page 33:

Several factors complicate query optimization further in the Hadoop setting:

First, in the world of Hadoop and HDFS, complex data types such as arrays, maps, and structs, as well as JSON data, are more prevalent.

Second, the users utilize UDFs (user-defined-functions) very widely to express their business logic, which is sometimes very awkward to express in SQL itself.

Third, often times there is little control over HDFS. Files can be added or modified outside the tight control of a query engine, making statistics maintenance a challenge.

Page 34:

Hive

The first SQL-on-Hadoop offering; it provides an SQL-like query language, called HiveQL, and uses the MapReduce run-time to execute queries.

Hadapt

Hadapt, which spun out of the HadoopDB research project, was the first commercial SQL-on-Hadoop offering. Hadapt and HadoopDB replaced the file-oriented HDFS storage formats with DBMS-oriented storage, including column stores.

Page 35:

Spark

Spark is a fast, general-purpose cluster computing engine that is compatible with Hadoop data and tries to address the shortcomings of MapReduce. Systems that use Spark as their run-time for SQL processing: Shark, Hive on Spark, and Spark SQL.

Cloudera Impala

A fully integrated MPP SQL query engine. Impala reads at almost disk bandwidth and is typically able to saturate all available disks.

Page 36:

IBM Big SQL

It leverages IBM’s state-of-the-art relational database technology to process standard SQL queries over HDFS data, supporting all common Hadoop file formats without introducing any proprietary formats.

Apache Drill

Provides SQL-like declarative processing over self-describing, semi-structured data.

Its focus is on analyzing data without imposing a fixed schema or creating tables in a catalog like Hive MetaStore.

Splice Machine

Splice Machine provides SQL support over HBase data using Apache Derby, targeting both operational and analytical workloads.

Page 37:

Phoenix

Phoenix provides SQL querying over HBase via an embeddable JDBC driver built for high-performance read/write operations.

Page 38:

Collaborative Data Analytics with DataHub

Authors: Anant Bhardwaj, Amol Deshpande, Aaron J. Elmore, David Karger, Sam Madden, Aditya Parameswaran, Harihar Subramanyam, Eugene Wu, Rebecca Zhang.

Type: Demonstration paper

Presented by: Dardan Xhymshiti

Fall 2015

Page 39:

Major problem

Organizations and companies collect data from various sources like:

Financial transactions,

Server logs,

Sensor data etc.

Teams and individuals inside the company want to extract knowledge from these datasets using home-grown tools, company tools, and different programming languages, making modifications to the data (normalization, cleaning) and then exchanging the datasets back and forth.

Problem: collaborative data analysis is hard due to the heterogeneity of tools, the diversity in the skill sets of individuals and teams, and difficulties in storing, retrieving, and versioning the exchanged datasets.

Page 40:

Major motivation

The authors motivate their work with two examples.

Example 1: Expert analysis

Members of a web advertising team want to extract knowledge from unstructured ad-click data. They write a script to extract the task-relevant information from the data and store it as a separate dataset that will be shared across the team.

Problems:

Different team members may be more comfortable with a particular tool (R, Python, Awk) and use these tools to clean, normalize, and summarize the dataset.

More proficient members use multiple languages for different purposes:

• Modeling in R.

• Visualization in JavaScript

• String extraction in Awk, etc.

Pages 41-42:

The team members manage dataset versions by recording them in files with names like table_v1, table_v1.1, ….

Versioning is difficult to manage once there are a hundred dataset versions. The final result…:


Page 43:

Example 2: Novice analysis

The coach and players of a football team want to study, query, and visualize their performance over the last season.

They will probably use a tool like Excel to store their data set, which has limited support for querying, cleaning, analysis, or versioning.

Query example: the coach wants to find all the games where a star player was absent.

Most of the team players are not proficient with data analysis tools, such as SQL or scripting languages.

Solution of the problem: point-and-click apps. These apps offer:

Easy ways to load, query, visualize, and share results with other users without much effort.

Page 44:

These teams are unable to perform collaborative data analysis because of the lack of:

1. Flexible data sharing and versioning support

2. Point-and-click apps to help novice users do collaborative data analysis

3. Support for a number of data analysis languages and tools.

A tool for collaborative analysis can be used, for example, by geneticists who want to share and collaborate on genome data with other research groups.

Page 45:

Major Contribution

To address these problems, the paper presents DataHub, a unified data management and collaboration platform for hosting, sharing, combining, and collaboratively analyzing datasets.

DataHub has three main components:

1. Flexible data storage, sharing, and versioning capabilities.

a) Keeps track of all versions of a dataset.

b) Enables collaborative analysis, while allowing these datasets to be stored and retrieved at various stages of analysis.

2. App ecosystem for easy querying, cleaning, and visualization.

a) Distill: a data-cleaning-by-example tool.

b) DataQ: a query builder tool that allows users to build SQL queries by direct manipulation in a graphical user interface, which is suitable for non-technical users.

c) Dviz: Data visualization tool.

Page 46:

3. Language-agnostic hooks for external data analysis.

For team members who are proficient in different languages and libraries, such as Python, R, Scala, and Octave, DataHub enables collaborative data analysis by using Apache Thrift to translate between these languages and the datasets in DataHub.

Page 47:

Gorilla: A Fast, Scalable, In-Memory Time Series Database

Authors: Tomas Pelkonen, Scott Franklin, Justin Teller, Paul Cavallaro, Qi Huang, Justin Meza, Kaushik Veeraraghavan. (Facebook, Inc)

Presented by: Dardan Xhymshiti

Fall 2015

Page 48:

Major problem

Large-scale internet services (e.g., Facebook) must be highly available and responsive in the face of unexpected failures.

These large-scale services comprise thousands of systems running on many thousands of machines located in different geographical areas.

These services also have a global audience of users.

Problems arise if good failure-monitoring systems do not exist.

Page 49:

Major motivation

Motivated by the problems above, the authors set out to design a solution that best ensures the availability and responsiveness of large-scale internet services.

Page 50:

Major contribution

The authors present an in-memory time series database (TSDB) called Gorilla, which continuously ingests measurement data points (CPU load, error rate, latency) from distributed machines, stores them in the TSDB, and runs queries on top of it.

Challenge: High data insertion rate, total data quantity, real-time aggregation and reliability.

Rather than being stored as individual data points, the measurements are aggregated and then stored.

Pages 51-57:

Gorilla TSDB constraints:

Writes dominate: Gorilla must always be able to take in tens of millions of data points each second.

State transitions: we want to identify the issues that arise when changes happen to the system, such as a new software release, a network cut, or the side effect of a configuration change.

High availability: if a failure causes disconnections between datacenters, systems operating at these datacenters must still be able to write data to local TSDB machines.

Fault tolerance: the writes are replicated to multiple regions, so the data survives a datacenter failure.

Page 58:

Traditional ACID guarantees are not a core requirement for a TSDB.

The writes must succeed at all times, even in the face of disasters.

Recent data points are of higher value than older data points (knowing if a particular system is broken right now is more valuable to an operations engineer than knowing if it was broken an hour ago).

Challenge: the speed of query processing, writes, and reads. Solution: replace the disk-based database with an in-memory database.

Page 59:

Facts

In Spring 2015, Facebook’s monitoring system generated 12 million data points per second.

12 million data points per second ≈ 1 trillion data points per day

Problem: 1 trillion data points × 16 bytes = 16 TB of RAM (too resource-intensive). Solution: using XOR-based floating-point compression, a data point shrinks from 16 bytes to an average of 1.37 bytes (a 12x reduction in size).
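A minimal sketch of the idea behind the XOR scheme (Gorilla's full encoder also delta-encodes timestamps and packs control bits, which this sketch omits): consecutive float64 values are XORed as 64-bit words, and identical or similar values yield XORs that are zero or have long runs of leading and trailing zero bits, which can be stored in very few bits:

```python
import struct

def float_to_bits(x):
    """Reinterpret a float64 as an unsigned 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_deltas(values):
    """XOR each value's bit pattern with its predecessor's."""
    prev = float_to_bits(values[0])
    out = []
    for v in values[1:]:
        cur = float_to_bits(v)
        out.append(prev ^ cur)
        prev = cur
    return out

# Monitoring series change slowly, so most XORs are 0 or nearly 0.
series = [12.0, 12.0, 12.0, 24.0, 12.0]
for d in xor_deltas(series):
    lead = 64 - d.bit_length() if d else 64          # leading zero bits
    trail = (d & -d).bit_length() - 1 if d else 64   # trailing zero bits
    print(f"xor={d:#018x} leading={lead} trailing={trail}")
```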

Page 60:

Compressed Spatial Hierarchical Bitmap (cSHB) Indexes for Efficiently Processing Spatial Range Query Workloads

Presented by: Shahab Helmi

Fall 2015

Page 61:

Paper Info

Authors:

Publication:

VLDB 2015

Type:

Research Paper

Page 62:

Motivation: Bitmap-Based Indexing

Bitmap indexes have been shown to be highly effective in answering queries in data warehouses and column-oriented data stores. Why?

1. Efficient implementations of the bitwise logical operations: “AND”, “OR”, and “NOT”;

2. They provide significant opportunities for compression, enabling either reduced I/O or even complete in-memory maintenance of large index structures.

3. Query processors can operate directly on compressed bitmaps.
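A minimal sketch of point 1, using Python integers as bit vectors (toy data; a production index would use a compressed representation such as WAH rather than raw words):

```python
# Bit i of each bitmap says whether object i satisfies the predicate.
region_a = 0b10110010  # objects 1, 4, 5, 7 fall in region A
region_b = 0b01110100  # objects 2, 4, 5, 6 fall in region B

in_both   = region_a & region_b   # objects in A AND B
in_either = region_a | region_b   # objects in A OR B
not_in_a  = ~region_a & 0xFF      # NOT A, within an 8-object universe

print(f"{in_both:08b} {in_either:08b} {not_in_a:08b}")
# 00110000 11110110 01001101
```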

Page 63:

Motivation: Quad-Tree Indexing

A quad-tree is a data structure used to divide a 2D region into more manageable parts. It is like a binary tree, except that each internal node has four children instead of two.
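A minimal point quad-tree sketch matching this definition (illustrative only; splitting a leaf once it exceeds a small capacity is one common variant, not necessarily the paper's):

```python
class QuadTree:
    """Each internal node splits its square cell into four quadrants."""

    def __init__(self, x, y, size, capacity=4):
        self.x, self.y, self.size = x, y, size     # lower-left corner, side
        self.capacity = capacity
        self.points, self.children = [], None

    def insert(self, px, py):
        if self.children is None:                  # still a leaf
            self.points.append((px, py))
            if len(self.points) > self.capacity:
                self._split()
            return
        half = self.size / 2
        i = (px >= self.x + half) + 2 * (py >= self.y + half)
        self.children[i].insert(px, py)

    def _split(self):
        half = self.size / 2
        self.children = [
            QuadTree(self.x + dx * half, self.y + dy * half, half, self.capacity)
            for dy in (0, 1) for dx in (0, 1)
        ]
        pts, self.points = self.points, []
        for p in pts:                              # re-insert into children
            self.insert(*p)

qt = QuadTree(0, 0, 8)
for p in [(1, 1), (2, 1), (1, 2), (6, 6), (2, 2), (3, 3)]:
    qt.insert(*p)
```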

Page 64:

Introduction (1)

The key principle behind most indexing mechanisms is to ensure that data objects closer to each other in the data space are also closer to each other on the storage medium.

Total order in 1D space: easy

Total order in nD: complicated!

Common solution? Partition the space hierarchically (as in R-trees and KD-trees) in such a way that:

Nearby points fall into the same partition.

Point pairs that are far from each other fall into different partitions.

Alternative?

Mapping the multi-dimensional data to 1D and applying indexing and partitioning on the 1D data.

Page 65:

Introduction (2)

Alternative?

Mapping the multi-dimensional data to 1D and applying indexing and partitioning on the 1D data such that:

Data objects closer to each other in the original space are also closer to each other on the 1D space.

Data objects further away from each other in the original space are also further away from each other on the 1D space.

How? Fractal-based space-filling curves. In particular, the Peano-Hilbert curve and the Z-order curve have been shown to be very effective in helping cluster nearby objects in the space.
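A minimal sketch of the Z-order mapping via bit interleaving (Morton codes), assuming small non-negative integer coordinates for illustration:

```python
def z_order(x, y, bits=16):
    """Interleave the bits of x and y; points close in 2D tend to get
    close values in the resulting 1D order."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x bits at even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y bits at odd positions
    return z

for x, y in [(2, 2), (2, 3), (3, 3), (7, 0)]:
    print((x, y), z_order(x, y, bits=3))
# (2, 2) -> 12, (2, 3) -> 14, (3, 3) -> 15, (7, 0) -> 21
```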

Page 66:

Contribution

It is shown that bitmap-based indexing is also an effective solution for managing spatial data sets.

The paper proposes compressed spatial hierarchical bitmap (cSHB) indexes to support spatial range queries,

converting the given 2D space into a 1D space using Z-order traversal.

For spatial query processing:

A cost model was developed.

The best nodes for query processing are chosen according to this cost model.

Page 67:

Contribution (2)

The example query region contains the following 1D ranges: 000010, 000011, 001000, 001001, 001010, 001011

Page 68:

Related Work

Multi-Dimensional Space Partitioning

Quad-tree, BD-tree, G-Tree, and KD-tree.

R-tree and its variants (R*-tree, R+-tree, Hilbert R-tree, and others).

Space Filling Curve based Indexing

Peano-Hilbert curve: better mapping but costly.

Z-order curve: efficient (used in this paper).

Bitmap Indexes

Page 69:

Experimental Results

Datasets:

100 million synthetically generated data points ranging from <−180, −90> to <180, 90>.

A clustered data set from Gowalla, which contains the locations of check-ins made by users.

A clustered data set from OpenStreetMap (OSM).

Page 70:

Experimental Results (2)