VIIQ: Auto-Suggestion Enabled Visual Interface for Interactive Graph Query Formulation
Presented by: Zohreh
Fall 2015
Authors
Nandish Jayaram
University of Texas at Arlington
Sidharth Goyal
University of Texas at Arlington
Chengkai Li
University of Texas at Arlington
Introduction
An unprecedented proliferation of heterogeneous graphs
with thousands of node/edge types
Complex relationships in schema-less data
Query graphs are used to:
specify the query intent for such graphs
Formulating these query graphs is a daunting task
It requires users to know a vocabulary comprised of many labels and types
Introduction
Graph query systems allow users to construct query graphs
through a visual interface
The focus of these systems is query processing
Their query formulation components are limited to a graphical platform
for adding nodes and edges with ease using mouse and keyboard actions
Little help is offered in easily choosing the labels of
the various components in a query graph
Introduction
Every time a new query component is added
Users are inundated with possibly hundreds of options
For the new component’s label, sorted alphabetically.
It is a daunting task to browse through all the options
To select the appropriate label to add
Related work
There are other querying paradigms that help users query graph data
Declarative languages like SPARQL are used to exactly specify query intent
But present a usability barrier
Simplify query formulation
Keyword search, approximate graph query and query-by-example
Cannot be used to specify users’ exact query intent
Existing systems help users specify queries either easily or exactly,
But not both
VIIQ (Visual Interface for Interactive graph Query formulation)
helps users easily construct various query graph components
VIIQ automatically suggests new edges and nodes to add
To a partially constructed query graph
Users can also add nodes or edges manually,
whose labels are ranked and
presented based on how likely they are to be of interest to the user
VIIQ is the first visual query formulation system
that actively makes ranked suggestions
Contribution
VIIQ supports two modes of operation, passive and active
By default VIIQ operates in passive mode
Based on the partially constructed query graph
the system automatically recommends top-k new edges
relevant to the user’s query intent
Fig. 3 shows the snapshot of a partially constructed query graph, with nodes and edges suggested in passive mode.
The nodes in grey and the edges incident on them are the new automatic suggestions made by the system.
Contribution
The active mode is triggered when
the user adds new nodes or edges to the partial query graph
For a newly added node, the suggested labels are displayed hierarchically
In a pop-up box
For a newly added edge, the suggested edge labels are ranked
based on the likelihood of their relevance to the user’s query intent
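The slides do not spell out VIIQ's ranking model; a minimal frequency-based sketch of the general idea (all node and edge names below are invented, not from the paper) could rank candidate labels for a new edge by how often they connect the two node types in the underlying data graph:

```python
from collections import Counter

# Hypothetical corpus of edges observed in the underlying data graph:
# (source type, edge label, target type). All names are illustrative.
observed_edges = [
    ("Person", "actedIn", "Film"),
    ("Person", "actedIn", "Film"),
    ("Person", "directed", "Film"),
    ("Person", "bornIn", "City"),
]

def rank_edge_labels(source_type, target_type, corpus):
    """Rank candidate labels for a new edge between two node types by
    how often they occur in the corpus, a crude proxy for relevance."""
    counts = Counter(label for (s, label, t) in corpus
                     if s == source_type and t == target_type)
    return [label for label, _ in counts.most_common()]

suggestions = rank_edge_labels("Person", "Film", observed_edges)
```

A real system would combine such frequency statistics with the structure of the partial query graph, but the ranked-list output is the same shape as VIIQ's pop-up suggestions.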
On Uncertain Graphs Modeling and Queries
Presented by: Zohreh
Fall 2015
Authors
Arijit Khan (Systems Group, ETH Zurich, Switzerland)
Lei Chen (The Hong Kong University of Science and Technology)
Introduction
The availability of network data has increased dramatically
Uncertainty is evident in graph data due to a variety of reasons
such as noisy measurements, inconsistent, incorrect, and possibly ambiguous information sources
In these cases, data is represented as an uncertain graph
Nodes, edges, and attributes are accompanied with a probability of existence
MODELING OF UNCERTAIN GRAPHS
Uncertainty Models Independent Probabilities:
Components in the graph are independent from one another
Interprets uncertain graphs according to the well-known possible-world semantics
For example, an uncertain graph with m edges yields 2^m possible deterministic graphs
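The possible-world semantics under the independence assumption can be sketched in a few lines of Python; the tiny three-edge graph below is illustrative, not taken from the paper:

```python
from itertools import product

# Toy uncertain graph: edge -> independent existence probability.
edges = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.3}

def possible_worlds(edges):
    """Enumerate all 2^m deterministic graphs and their probabilities
    under the possible-world semantics."""
    worlds = []
    for keep_flags in product([False, True], repeat=len(edges)):
        prob, world = 1.0, set()
        for (edge, p), keep in zip(edges.items(), keep_flags):
            if keep:
                world.add(edge)
                prob *= p
            else:
                prob *= 1 - p
        worlds.append((frozenset(world), prob))
    return worlds

worlds = possible_worlds(edges)
# m = 3 edges yield 2**3 = 8 worlds; their probabilities sum to 1.
```

The exponential blow-up is visible directly: each additional uncertain edge doubles the number of worlds, which is why exact query answering does not scale.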
Correlated Probabilities
The independence model ignores the correlations among various graph components
For example, in a traffic network:
if a road is crowded at a certain point in time, nearby roads are likely to be crowded as well
There are a few works that model such correlations with conditional probabilities
Challenges: semantics and computation
From the perspective of the semantics:
There is no uniform model of uncertain graphs;
Assignment and interpretation of the probabilities
application specific.
How should the shortest path between two nodes in an uncertain graph be defined?
The definition could depend on the application and the specific uncertainty semantic
Challenges : computation perspective
While many graph algorithms, such as subgraph isomorphism, are intrinsically hard problems,
even the simplest graph queries, such as reachability and shortest path, become #P-complete
and far more expensive over uncertain graphs
Therefore, exact computation is almost infeasible
with today’s large-scale graph data
The focus nowadays is on designing approximation algorithms
With efficient sampling, indexing, and filtering strategies
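The sampling strategy mentioned above can be illustrated with a minimal Monte Carlo sketch (the graph and numbers are made up, and this is only the naive baseline, not the optimized algorithms the survey covers): estimate the probability that one node reaches another by sampling possible worlds instead of enumerating them.

```python
import random

# Toy uncertain graph: edge -> existence probability (made-up numbers).
edges = {("a", "b"): 0.8, ("b", "c"): 0.8, ("a", "c"): 0.1}

def reaches(world, src, dst):
    """Depth-first search over one sampled deterministic world."""
    stack, seen = [src], {src}
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for (u, v) in world:
            if u == node and v not in seen:
                seen.add(v)
                stack.append(v)
    return False

def estimate_reachability(edges, src, dst, samples=20000, seed=7):
    """Estimate P(src reaches dst) by sampling possible worlds,
    side-stepping the #P-hard exact computation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        world = [e for e, p in edges.items() if rng.random() < p]
        hits += reaches(world, src, dst)
    return hits / samples

p_hat = estimate_reachability(edges, "a", "c")
# Exact value here is 1 - 0.9 * (1 - 0.64) = 0.676; p_hat approximates it.
```

The cost is linear in the number of samples rather than exponential in the number of edges, which is the trade-off between efficiency and effectiveness the slides allude to.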
MAJOR OPEN PROBLEMS
Exact computation is infeasible
over large-scale uncertain graphs,
It is important to identify the application areas
e.g. efficiency vs. effectiveness
To re-define the semantics of many classical graph operations
e.g., centrality measure and graph partitioning
A Demonstration of TripleProv: Tracking and Querying Provenance over Web Data
Presented by: Ashkan Malekloo
Fall 2015
Type: Demonstration paper
Authors: Marcin Wylot, Philippe Cudré-Mauroux, Paul Groth
VLDB 2015
Introduction
Heterogeneity of RDF data
Ease of integration
Examples:
one may want to analyze which sources were instrumental in providing results
How data sources were combined
Filtering the result
Introduction
Find me all the titles of articles about "Obama", but derive the answer only from sources attributed to "US News".
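TripleProv's actual query syntax is not shown in these slides, but the effect of such a provenance-scoped query can be sketched over quadruples (a triple plus the source it came from, as in named graphs); the data and helper below are purely illustrative:

```python
# Quadruples (subject, predicate, object, source): triples annotated
# with the named graph / source they came from. Data is made up.
quads = [
    ("article1", "title", "Obama visits Berlin", "US News"),
    ("article2", "title", "Obama signs bill", "Daily Blog"),
    ("article3", "title", "Elections 2016", "US News"),
]

def titles_about(keyword, source, quads):
    """Titles mentioning `keyword`, derived only from triples
    attributed to `source` (the provenance predicate)."""
    return [obj for (_, pred, obj, src) in quads
            if pred == "title" and keyword in obj and src == source]

results = titles_about("Obama", "US News", quads)
```

The triple from "Daily Blog" matches the keyword but is filtered out by the provenance predicate, which is exactly the tailoring the example query asks for.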
Introduction
No current triple store is able to automatically derive provenance data for the results it produces or to tailor queries with provenance data.
Storing Quadruples
Named Graphs
TripleProv
A new RDF database system supporting (1) the transparent and automatic derivation of detailed provenance information for arbitrary queries, and (2) the execution of queries incorporating provenance predicates
It is based on a native RDF store
Enables tracing provenance at two different granularity levels
SAASFEE: Scalable Scientific Workflow Execution Engine
Marc Bux, Jörgen Brandt, Carsten Lipka
Kamal Hakimzadeh, Jim Dowling, Ulf Leser
Demonstration Paper
VLDB 2015
Presented by: Omar Alqahtani
Fall 2015
Motivation
Scientific data is analyzed by complex pipelines composed of highly specialized, domain-dependent tools.
SWfMSs facilitate the design, implementation, execution, optimization, monitoring, and exchange of such heterogeneous pipelines.
Existing SWfMSs
Roughly divided into three groups:
Taverna, Kepler, Galaxy
Askalon, Pegasus
YARN, MESOS
No platform is capable of:
Embracing the ever-evolving research tools
Scaling to very large data sets
Executing arbitrarily complex workflows
SAASFEE
It is a SWfMS which runs arbitrarily complex workflows on Hadoop YARN.
SAASFEE workflows are specified in Cuneiform.
Cuneiform workflows are executed on Hi-WAY.
Capabilities
The ability to execute iterative workflows,
An adaptive task scheduler,
Re-executable provenance traces,
Compatibility with selected other workflow systems
Tutorial: SQL-on-Hadoop Systems
Presented by:
Ranjan
Fall 2015
Why SQL? Data exploration
Structured data
organization of the data in tables
optimized data access
Declarative data processing
No need to have developer skills
Portable – universal language
SQL drivers supported
No need of Hadoop client installation
Easier integration with the current systems
Hadoop overview
These factors complicate the query optimization further in the Hadoop system
First, in the world of Hadoop and HDFS data, complex data types, such as arrays, maps, structs, as well as JSON data are more prevalent.
Second, the users utilize UDFs (user-defined-functions) very widely to express their business logic, which is sometimes very awkward to express in SQL itself.
Third, often times there is little control over HDFS. Files can be added or modified outside the tight control of a query engine, making statistics maintenance a challenge.
Hive
The first SQL-on-Hadoop offering; it provided an SQL-like query language, called HiveQL, and used the MapReduce run-time to execute queries.
Hadapt
Hadapt, which spun out of the HadoopDB research project, was the first commercial SQL-on-Hadoop offering. Hadapt and HadoopDB replaced the file-oriented HDFS storage formats with DBMS-oriented storage, including column stores.
Spark
Spark is a fast, general-purpose cluster computing engine that is compatible with Hadoop data and tries to address the shortcomings of MapReduce. Systems that use Spark as their run-time for SQL processing: Shark, Hive on Spark, and Spark SQL.
Cloudera Impala
A fully-integrated MPP SQL query engine.
Impala reads at almost disk bandwidth and is typically able to saturate all available disks.
IBM Big SQL
It leverages IBM's state-of-the-art relational database technology to process standard SQL queries over HDFS data, supporting all common Hadoop file formats without introducing any proprietary formats.
Apache Drill
Provides SQL-like declarative processing over self-describing, semi-structured data.
Its focus is on analyzing data without imposing a fixed schema or creating tables in a catalog like Hive MetaStore.
Splice Machine
Splice Machine provides SQL support over HBase data using Apache Derby, targeting both operational as well as analytical workloads.
Phoenix
Phoenix provides SQL querying over HBase via an embeddable JDBC driver built for high-performance read/write operations.
Collaborative Data Analytics with DataHub
Authors: Anant Bhardwaj, Amol Deshpande, Aaron J. Elmore, David Karger, Sam Madden, Aditya Parameswaran, Harihar Subramanyam, Eugene Wu, Rebecca Zhang.
Type: Demonstration paper
Presented by: Dardan Xhymshiti
Fall 2015
Major problem
Organizations and companies collect data from various sources like:
Financial transactions,
Server logs,
Sensor data etc.
Teams and individuals inside the company want to use these datasets to extract knowledge, using their home-grown tools, company tools, and different programming languages; they modify the datasets (normalization, cleaning) and then exchange them back and forth.
Problem: collaborative data analysis. Heterogeneity of tools, diversity in the skill sets of individuals and teams, and difficulties in storing, retrieving, and versioning the exchanged datasets.
Major motivation
The authors motivate their work by providing two examples.
Example 1: Expert analysis:
Members of an web advertising team want to extract knowledge from an unstructured ad-click data. They write a script for extracting the task-relevant information from the data, and store it as a separate dataset which will be shared across the team.
Problems:
Different team members may be more comfortable with a particular tool: R, Python, Awk, and use these tool to clean, normalize and summarize the dataset.
More proficient members use multiple languages for different purposes:
• Modeling in R.
• Visualization in JavaScript
• String extraction in Awk, etc.
The team members manage dataset versions by recording them in file names such as table_v1, table_v1.1, …
Versioning is difficult to manage in the case of hundreds of dataset versions. The final result…:
Example 2: Novice analysis:
The coach and players of a football team want to study, query, and visualize their performance over the last season.
They would probably use a tool like Excel to store their data set, which has limited support for querying, cleaning, analysis, and versioning.
Query example: The coach wants to find all the games where a star player was absent.
Most of the team players are not proficient with data analysis tools, such as SQL or scripting languages.
Solution of the problem: point-and-click apps. These apps offer:
Easy loading, querying, visualizing, and sharing of results with other users without much effort.
These teams are unable to perform collaborative data analysis because of the lack of:
1. Flexible data sharing and versioning support
2. Point-and-click apps to help novice users do collaborative data analysis
3. Support for a number of data analysis languages and tools.
A tool for collaborative analysis can be used, for example, by geneticists who want to share and collaborate on genome data with other research groups.
Major Contribution
To address these problems the paper presents DataHub, a unified data management and collaboration platform for hosting, sharing, combining, and collaboratively analyzing datasets.
DataHub has three main components:
1. Flexible data storage, sharing, and versioning capabilities.
a) Keeps track of all versions of a dataset.
b) Enables collaborative analysis, while at the same time allows storing and retrieving these datasets at various stages of analysis.
2. App ecosystem for easy querying, cleaning, and visualization.
a) Distill: data cleaning by example tool.
b) DataQ: a query builder tool that allows users to build SQL queries by direct manipulation in a graphical user interface; the interface is suitable for non-technical users.
c) Dviz: Data visualization tool.
3. Language-agnostic hooks for external data analysis.
For team members who are proficient in different languages and libraries, such as Python, R, Scala, and Octave, DataHub enables collaborative data analysis by using Apache Thrift to translate between these languages and datasets in DataHub.
Gorilla: A Fast, Scalable, In-Memory Time Series Database
Authors: Tomas Pelkonen, Scott Franklin, Justin Teller, Paul Cavallaro, Qi Huang, Justin Meza, Kaushik Veeraraghavan. (Facebook, Inc)
Presented by: Dardan Xhymshiti
Fall 2015
Major problem
Large-scale internet services (e.g., Facebook) must be highly available and responsive even in the case of unexpected failures.
These large-scale services run on many thousands of machines located in different geographical areas.
These services also have a global audience of users.
Problems arise if good failure monitoring systems do not exist.
Major motivation
The authors are motivated by the previous problems to come up with a solution that best ensures the availability and responsiveness of large-scale internet services.
Major contribution
The authors present an in-memory Time Series Database (TSDB) called Gorilla, which every second ingests measurement data points (CPU load, error rate, latency) from distributed machines, stores them in the TSDB, and answers queries on top of it.
Challenge: high data insertion rate, total data quantity, real-time aggregation, and reliability.
Rather than storing measurements as individual data points, they are aggregated and then stored.
Gorilla TSDB constraints:
Writes dominate
Always be able to ingest tens of millions of data points each second.
State transitions
We want to identify the issues that arise when new changes happen to the system:
a new software release,
a network cut,
or a side effect of a configuration change.
High availability
If a failure causes disconnections between datacenters, systems operating at those datacenters must be able to write data to local TSDB machines.
Fault tolerance
The writes are replicated to multiple regions so that, in case of a datacenter failure, the data survives.
Traditional ACID guarantees are not a core requirement for a TSDB.
Writes must succeed at all times, even in the face of disasters.
Recent data points are of higher value than older ones (knowing that a particular system is broken right now is more valuable to an operations engineer than knowing that it was broken an hour ago).
Challenge: speed of query processing, writes, and reads.
Solution: replacing the disk-based database with an in-memory database.
Facts
In Spring 2015, Facebook's monitoring systems generated 12 million data points per second.
12 million data points per second ≈ 1 trillion data points per day
Problem: 1 trillion data points × 16 bytes = 16 TB of RAM (too resource-intensive).
Solution: using XOR-based floating point compression, a data point was compressed from 16 bytes to an average of 1.37 bytes (a 12× reduction in size).
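The core observation behind XOR-based compression can be sketched as follows. This is a simplified illustration of the idea only, not Gorilla's full encoding (which additionally packs the XOR results using leading-zero and meaningful-bit headers); the sample values are made up.

```python
import struct

def float_to_bits(x):
    """Reinterpret a double's 8 bytes as a 64-bit unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

# Consecutive samples in a monitoring time series are usually close in
# value, so XORing each point with its predecessor produces mostly
# zero bits, which can then be encoded compactly.
series = [12.0, 12.0, 12.25, 12.25]
prev = float_to_bits(series[0])
xors = []
for x in series[1:]:
    cur = float_to_bits(x)
    xors.append(prev ^ cur)
    prev = cur
# An unchanged value XORs to 0 and can be stored with a single bit.
```

Because monitoring values change slowly and are often identical between samples, most deltas collapse to zero or near-zero bit patterns, which is what makes the 1.37-bytes-per-point average possible.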
Compressed Spatial Hierarchical Bitmap (cSHB) Indexes for Efficiently Processing Spatial Range Query Workloads
Presented by: Shahab Helmi
Fall 2015
Paper Info
Authors:
Publication:
VLDB 2015
Type:
Research Paper
Motivation: Bitmap-Based Indexing
Bitmap indexes have been shown to be highly effective in answering queries in data warehouses and column-oriented data stores. Why?
1. Efficient implementations of the bitwise logical operations: “AND”, “OR”, and “NOT”;
2. Provide significant opportunities for compression, enabling either reduced I/O or, even, complete in-memory maintenance of large index structures.
3. Query processors can operate directly on compressed bitmaps.
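Point 1 above can be illustrated with plain Python integers standing in for uncompressed bitmaps (one bit per row; the column data is made up):

```python
# Plain Python ints as uncompressed bitmaps: bit i is set when row i
# satisfies the predicate. The column data below is made up.
rows = ["red", "blue", "red", "green", "blue"]

def bitmap(pred, values):
    """Build a bitmap whose i-th bit is set when pred(values[i]) holds."""
    bits = 0
    for i, v in enumerate(values):
        if pred(v):
            bits |= 1 << i
    return bits

is_red = bitmap(lambda v: v == "red", rows)    # rows 0, 2 -> 0b00101
is_blue = bitmap(lambda v: v == "blue", rows)  # rows 1, 4 -> 0b10010

# Combined predicates become single bitwise operations:
red_or_blue = is_red | is_blue                 # OR  -> rows 0, 1, 2, 4
mask = (1 << len(rows)) - 1
not_red = mask & ~is_red                       # NOT -> rows 1, 3, 4
```

A query over several predicates thus touches whole machine words at a time instead of individual rows, which is the efficiency point 1 refers to.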
Motivation: Quad-Tree Indexing
A quad-tree is a data structure used to divide a 2D region into more manageable parts. It extends the idea of a binary tree, but each internal node has four children instead of two.
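The basic subdivision step can be sketched in a few lines (an illustrative helper, not the paper's implementation):

```python
def children(x0, y0, x1, y1):
    """Split an axis-aligned region into its four quadrants:
    SW, SE, NW, NE. This is the basic quad-tree construction step."""
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    return [
        (x0, y0, mx, my),  # south-west
        (mx, y0, x1, my),  # south-east
        (x0, my, mx, y1),  # north-west
        (mx, my, x1, y1),  # north-east
    ]

# Recursively splitting a world-coordinate region yields the spatial
# hierarchy that a space-filling curve can later linearize.
quadrants = children(-180, -90, 180, 90)
```

Applying `children` recursively to each quadrant until a depth or occupancy threshold is reached produces the full tree.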
Introduction (1)
The key principle behind most indexing mechanisms is to ensure that data objects closer to each other in the data space are also closer to each other on the storage medium.
Total order in 1D space: easy
Total order in nD: complicated!
Common solution? Partitioning the space hierarchically (as in R-trees and KD-trees) in such a way that:
Nearby points fall into the same partition.
Point pairs that are far from each other fall into different partitions.
Alternative?
Mapping the multi-dimensional data to 1D and applying indexing and partitioning to the 1D data.
Introduction (2)
Alternative?
Mapping the multi-dimensional data to 1D and applying indexing and partitioning to the 1D data such that:
Data objects closer to each other in the original space are also closer to each other in the 1D space.
Data objects further away from each other in the original space are also further away from each other in the 1D space.
How? Fractal-based space-filling curves. In particular, the Peano-Hilbert curve and the Z-order curve have been shown to be very effective in clustering nearby objects in the space.
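A minimal Z-order (Morton) encoding sketch, interleaving the bits of two coordinates (an illustration of the curve itself, not the paper's code):

```python
def interleave(x, y, bits=16):
    """Morton (Z-order) code: interleave the bits of x and y so that
    2D points that are close together tend to map to nearby 1D codes."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x occupies even bit slots
        z |= ((y >> i) & 1) << (2 * i + 1)   # y occupies odd bit slots
    return z

# Visiting a 2x2 grid in increasing code order traces the Z shape:
# (0,0) -> 0, (1,0) -> 1, (0,1) -> 2, (1,1) -> 3.
codes = {(x, y): interleave(x, y) for x in range(4) for y in range(4)}
```

Sorting points by this code is what lets a 1D index structure, such as a bitmap over code ranges, serve 2D range queries.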
Contribution
It is shown that bitmap-based indexing is also an effective solution for managing spatial data sets.
Proposed compressed spatial hierarchical bitmap (cSHB) indexes to support spatial range queries,
converting the given 2D space into a 1D space using a Z-order traversal.
For spatial query processing:
A cost model was developed.
The best nodes for query processing are chosen according to the cost model.
Contribution (2)
Contains the following 1D ranges: (000010, 000011, 001000, 001001, 001010, 001011)
Related Work
Multi-Dimensional Space Partitioning
Quad-tree, BD-tree, G-Tree, and KD-tree.
R-tree and its variants (R*-tree, R+-tree, Hilbert R-tree, and others).
Space Filling Curve based Indexing
Peano-Hilbert curve: better mapping but costly.
Z-order curve: efficient (used in this paper).
Bitmap Indexes
Experimental Results
Datasets:
100 million synthetically generated data points ranging from <−180, −90> to <180, 90>.
A clustered data set from Gowalla, which contains the locations of check-ins made by users.
A clustered data set from OpenStreetMap (OSM).
Experimental Results (2)