Using Wide Table to Manage Web Data: A Survey
Bin Yang (byang@fudan.edu.cn)
Outline
Why use the Wide Table
Data Model
Physical Implementation
Distributed Deployment
Query Execution
Other Issues
Why use the Wide Table? Several Scenarios
New generation of e-commerce applications: Rakesh Agrawal, VLDB 2001, ICDE 2003; Jennifer L. Beckmann, ICDE 2006; Eric Chu, SIGMOD 2007
Data publishing and annotating services: del.icio.us, Flickr; UTab, CIDR 2007
Novel web applications: Google Base (lets users structure their uploaded content); eBay (marketplace); Craigslist (community portals)
Search engine crawler data: Bigtable
Medical information systems
Distributed workload management systems: Condor
Semantic Web data storage: Daniel J. Abadi, VLDB 2007, Best Paper
Characteristics of the Wide Table
Data
Very wide schema with many attributes
Sparsity
Schema evolves constantly and quickly
Query
More flexible than SQL
More structured than keyword search
Conventional Database: Horizontal Representation
N-ary horizontal representation, optimized for dense and slowly evolving data
Problems
Large number of columns. Current database systems restrict the number of columns in one table, e.g. 1012 in DB2 and 1000 in Oracle.
Sparsity. Many NULLs occupy storage, increase the size of indexes, and can even affect sort results.
Schema evolution. Frequently altering the schema is expensive.
Query performance. If only a few columns are used in a query, the query still incurs a large performance penalty.
Debate
Is such a wide table a real need, or just inappropriate design?
The distribution of non-NULL values is unknown at schema design time
Schema evolution
Reconstruction cost
Data Model
n-ary Horizontal Representation
2-ary Binary Representation
3-ary Vertical Representation
Hybrid Representation
n-ary Horizontal Representation
A basic example: see Figure 1. There are as many columns as attributes; when a new attribute is introduced, the schema has to be altered.
Another approach: attributes are divided into "dense" and "sparse". Dense attributes are stored in a horizontal table; sparse attributes are stored in a plain text object.
2-ary Binary Representation
DSM (Decomposed Storage Model), SIGMOD 1985: decomposes the horizontal table into as many 2-ary tables as there are columns.
Number of columns: just two.
Sparsity: NULLs need not be stored. A left outer join reconstructs the horizontal view.
Schema evolution: equivalent to adding or deleting a DSM table. On the other hand, too many tables make the DBMS hard to manage.
Queries involving only a few attributes perform better.
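To make the DSM idea concrete, here is a minimal sketch (not code from any of the surveyed systems; the attribute names are invented) of decomposing a sparse horizontal table into binary tables and reconstructing it with a left outer join:

```python
def decompose(rows):
    """Split {surrogate: {attr: value}} rows into one {surrogate: value}
    table per attribute, storing non-NULL values only (DSM-style)."""
    tables = {}
    for oid, attrs in rows.items():
        for attr, value in attrs.items():
            if value is not None:          # sparsity: NULLs are not stored
                tables.setdefault(attr, {})[oid] = value
    return tables

def reconstruct(tables, oids):
    """Left-outer-join the binary tables back into horizontal rows;
    missing attributes come back as None (NULL)."""
    attrs = sorted(tables)
    return {oid: {a: tables[a].get(oid) for a in attrs} for oid in oids}

horizontal = {
    1: {"price": 10, "color": "red", "weight": None},
    2: {"price": 20, "color": None, "weight": 5},
}
binary = decompose(horizontal)
assert binary["price"] == {1: 10, 2: 20}
assert 1 not in binary["weight"]           # the NULL was never stored
```

One caveat the sketch makes visible: an attribute that is NULL everywhere leaves no binary table at all, so the reconstructed horizontal view only covers attributes that appear somewhere.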
3-ary Vertical Representation
Number of columns: just three.
Sparsity: NULLs need not be stored.
Schema evolution: equivalent to adding or deleting rows.
Queries over vertical tables are much more complicated.
Unlike the Horizontal Representation, the Binary and Vertical Representations decouple the logical view of entities from their physical storage.
Because most current applications and development tools are designed for the horizontal format, reconstructing a horizontal table from either the Binary or the Vertical Representation is necessary.
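As a concrete illustration of this reconstruction step, here is a small sketch (invented names, not code from any surveyed system) of a 3-ary vertical table of (oid, attribute, value) triples and the pivot back to horizontal rows:

```python
def to_vertical(rows):
    """Horizontal {oid: {attr: value}} -> list of (oid, attr, value)
    triples; NULLs are simply not emitted."""
    return [(oid, a, v) for oid, attrs in rows.items()
            for a, v in attrs.items() if v is not None]

def to_horizontal(triples, schema):
    """Pivot the triples back into horizontal rows over a known attribute
    list; attributes absent from the triples become None (NULL)."""
    rows = {}
    for oid, attr, value in triples:
        rows.setdefault(oid, dict.fromkeys(schema))[attr] = value
    return rows

triples = to_vertical({1: {"price": 10, "color": None}, 2: {"color": "red"}})
rows = to_horizontal(triples, ["price", "color"])
assert rows[1] == {"price": 10, "color": None}   # NULL reintroduced by pivot
```

Note that the pivot needs the schema as an input: the vertical table itself records only non-NULL facts, which is exactly why querying it directly is more complicated.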
Hybrid Representation
Used in Bigtable and HBase. Columns are divided into several column families. The row key plus timestamp plays the role of the surrogate in DSM. There are fewer tables than in DSM, and a vertical representation is used within each table.
Physical Implementation
Row-Oriented Storage (stores data row by row): Positional Format, Interpreted Format
Column-Oriented Storage (stores data column by column)
Positional Format
Always has a header: relation-id (schema), tuple-id, tuple length, NULL-bitmap, timestamp.
Fixed-length example: attributes of 8, 255, 1, 10, and 4 bytes respectively.
Variable-length example: A and B are fixed length; C, D, and E are variable length; offset:length pointers are kept in the header. Variable-length attributes always follow the fixed-length attributes.
A fixed-length attribute is pre-allocated whether or not it is NULL.
Schema evolution is expensive for both fixed- and variable-length records.
Dealing with NULL in Positional Format Storage
NULL-bitmap: the record header uses it to indicate which attributes are NULL. One bit per attribute; the full size of a fixed-length NULL attribute is still wasted, while a variable-length NULL attribute is omitted.
Bitmap only: no pre-allocation for fixed-length attributes; locating an attribute is more complicated than with pre-allocation (used by PostgreSQL).
A special value: its size should be small.
Length-data pairs: length is 0 (the Interpreted Format).
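The NULL-bitmap idea can be sketched in a few lines (an assumed bit layout for illustration, not the exact record format of any system described here):

```python
def make_null_bitmap(values):
    """Pack one bit per attribute into bytes; bit i set means attribute i
    is NULL."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)

def is_null(bitmap, i):
    """Test whether attribute i was NULL, reading only the header bitmap."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))

header = make_null_bitmap([42, None, "abc", None])
assert is_null(header, 1) and is_null(header, 3)
assert not is_null(header, 0) and not is_null(header, 2)
```

This is the cheap part; the cost trade-off the slide describes is about whether the NULL attribute's space in the record body is still pre-allocated (fixed length) or skipped (variable length, or bitmap-only schemes).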
Interpreted Format
Column-Oriented Storage
Row-oriented storage is write-optimized: all fields of a single record are pushed out to disk together. Column-oriented storage is read-optimized, and was first applied in data warehouse systems.
Advantages
Supports efficient queries over wide schemas, with improved bandwidth utilization and cache locality
Deals efficiently with sparse columns and improves data compression
Trades CPU cycles for disk bandwidth; queries run on the compressed representation, decompressing only when presenting data to the user
Disadvantages
Increased cost of inserts
Increased record reconstruction cost
Typical examples of column-oriented systems: C-Store (http://db.csail.mit.edu/projects/cstore/), MonetDB (http://monetdb.cwi.nl/projects/monetdb/Home/index.html)
Compression Methods
Run-Length Encoding (RLE), a list of non-NULL values, offsets, bitmaps, position ranges
Challenge: a "storage wizard" to automatically decide positional or interpreted, horizontal or vertical, according to density and frequency of access
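Run-length encoding, the first method listed above, can be sketched as follows (a toy illustration; real column stores operate on packed binary data, not Python lists):

```python
def rle_encode(column):
    """Collapse a (typically sorted) column into (value, run_length) pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [v for v, n in runs for _ in range(n)]

col = ["red", "red", "red", "blue", "blue"]
assert rle_encode(col) == [("red", 3), ("blue", 2)]
assert rle_decode(rle_encode(col)) == col
```

A sorted column compresses especially well under RLE, which is one reason a system like C-Store keeps replicas of projections in different sort orders.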
Distributed Deployment
Goal: more storage capacity, high availability, and high performance. Many nodes, each with private disk and private memory (shared-nothing architecture).
GFS (Google File System)
One master: a single point of failure, mitigated by a shadow master; similar in spirit to Napster's centralized directory
Many chunkservers: data is transferred to and from chunkservers without going through the master
Each chunk has 3 replicas, for availability and performance
Bigtable: based on GFS. Data is partitioned horizontally according to the row key; each partition is called a tablet.
C-Store: implements only projections. Each projection is partitioned horizontally, and different projections may share columns. Projections may have replicas, and different replicas can have different sort orders.
K-safe: the system can tolerate K node failures.
Query Execution
Query on Binary Representation
Query on Vertical Representation
Query on Interpreted Representation
Partial and Sparse Indexes
Keyword Search
Partitions, Hidden Schema and Virtual Relation
Ranking
Query on Binary Representation
Suitable for queries involving a small number of attributes. Requires transformation between binary and horizontal formats (B2H). Only non-NULL values are stored.
Physical layout: one copy is clustered on the surrogate, suitable for B2H reconstruction; the other is clustered on the attribute value, suitable for specific queries.
Query on Vertical Representation
Requires transformation between vertical and horizontal formats (V2H).
Physical layout: one copy is clustered on the object identifier; the other is clustered on the attribute name, suitable for V2H reconstruction.
Query on Interpreted Representation
No need to reconstruct the horizontal representation. The EXTRACTION operation gets the offset of each attribute; it is expensive, so it is executed in batches.
Partial Index
In a horizontal table, columns involved in frequent queries are usually indexed. In a vertical table, all three columns are indexed, so the entire table is indexed and the index size is much bigger. Such large indexes adversely impact the performance of the vertical representation, because each update has to modify the index.
Partial index: only the rows of interest need to be indexed. Proposed by Stonebraker, SIGMOD 1989. A predicate is attached to the index, and only the tuples for which it evaluates to true are indexed. Challenge: how to identify the rows of real interest.
Format: CREATE index-type INDEX ON relation-name(column-name) WHERE predicate
Example: CREATE B-tree INDEX ON EMP(salary) WHERE salary < 500
Sparse Index
In an interpreted table, only the non-NULL values are indexed, so index size is proportional to the number of non-NULL values. For insertions and deletions, only the indexes on the attributes that are non-NULL need to be updated.
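A minimal sketch of both index kinds (an illustrative in-memory layout, not any system's actual implementation): a partial index keyed by a predicate, with the sparse index as the special case whose predicate is "value is not NULL":

```python
def build_partial_index(rows, attr, predicate):
    """Index only the tuples whose attribute value satisfies the predicate,
    mirroring CREATE ... INDEX ... WHERE predicate."""
    index = {}
    for oid, attrs in rows.items():
        v = attrs.get(attr)          # None models a NULL / absent attribute
        if predicate(v):
            index.setdefault(v, []).append(oid)
    return index

rows = {1: {"salary": 400}, 2: {"salary": 600}, 3: {}}   # row 3: salary NULL
# Partial index analogous to WHERE salary < 500
low_paid = build_partial_index(rows, "salary",
                               lambda v: v is not None and v < 500)
assert low_paid == {400: [1]}
# Sparse index: the predicate is simply "non-NULL", so the index size
# tracks the number of non-NULL values.
sparse = build_partial_index(rows, "salary", lambda v: v is not None)
assert sparse == {400: [1], 600: [2]}
```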
Keyword Search
Often the most suitable access method: ordinary users don't know the exact attribute names (there are "too many attributes") and may not be able to write a SQL query.
Inverted indexes as in classical IR: one inverted index on data values (traditional IR) and one on attribute names (UTab).
Problem: too many records may contain a specific keyword. Term occurrences follow a Zipf-like distribution, which is acceptable to most users; useful statistics are the number of attributes containing the term and the number of rows containing the term.
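The two inverted indexes described above can be sketched like this (toy whitespace tokenization; the attribute and value strings are invented):

```python
def build_inverted_indexes(rows):
    """Build one inverted index over data values (classical IR) and one
    over attribute names (as UTab does)."""
    by_value, by_attr = {}, {}
    for oid, attrs in rows.items():
        for name, value in attrs.items():
            for token in name.lower().split():
                by_attr.setdefault(token, set()).add(oid)
            for token in str(value).lower().split():
                by_value.setdefault(token, set()).add(oid)
    return by_value, by_attr

rows = {
    1: {"book title": "wide table survey"},
    2: {"title": "web data"},
}
by_value, by_attr = build_inverted_indexes(rows)
assert by_attr["title"] == {1, 2}   # keyword matched against attribute names
assert by_value["data"] == {2}      # keyword matched against data values
```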
Imprecision: a user may want to find the keyword in some specific attribute, which calls for more structured keyword search.
An Example of Structured Keyword Search
Fuzzy attributes: name-based schema-matching techniques, using the WordNet semantic dictionary.
Suppose A2 is a fuzzy attribute, and A2 is similar to C21 and C22.
Alternative procedure:
Run keyword search on the data values to obtain a set of objects Z.
Find A, the set of attributes in Z that contain the keyword.
Match A against the fuzzy attributes to get B.
Return the objects in Z that have an attribute in B.
Partitions, Hidden Schema and Virtual Relation
Vertically partition the data set in a Wide Table: scanning a vertical partition is more efficient than scanning the base table.
How to partition
A reasonable number of partitions
Partitions contain minimal NULL values
Each base-table tuple is preferably stored entirely in one partition
A common way: group together co-occurring attributes. Challenge: how to define the degree of co-occurrence.
Classification: disjoint partitions (Hidden Schema); joint partitions (Virtual Relation).
After obtaining the partitions: materialized views (in positional format, because partitions are dense); covering indexes (a way to find efficient partial indexes); a browsing-based interface.
Hidden Schema (by Eric Chu and Jeffrey Naughton, University of Wisconsin-Madison): uses the Jaccard coefficient and statistical information.
Virtual Relation (applied in UTab, by NUS): clustered on attributes and tags, using semantic information.
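The Jaccard coefficient used for Hidden Schema measures how strongly two attributes co-occur across tuples; a small sketch over invented data:

```python
def jaccard(rows, a, b):
    """|tuples where both a and b are non-NULL| divided by
    |tuples where either is non-NULL| -- the attribute co-occurrence score."""
    has_a = {oid for oid, attrs in rows.items() if attrs.get(a) is not None}
    has_b = {oid for oid, attrs in rows.items() if attrs.get(b) is not None}
    union = has_a | has_b
    return len(has_a & has_b) / len(union) if union else 0.0

rows = {1: {"isbn": "x", "author": "y"}, 2: {"isbn": "z"}, 3: {"mileage": 9}}
# isbn and author co-occur in 1 of the 2 tuples where either appears.
assert jaccard(rows, "isbn", "author") == 0.5
```

Attributes with a high score are good candidates to group into the same partition; attributes that never co-occur (score 0) belong to different hidden relations.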
Ranking
A SQL query returns an unordered set of qualifying records, but a flexible query should impose an order.
Keyword search: the most classical method is tf-idf. Modern search engines combine many signals with different weights, e.g. timestamps; keyword matches in attribute names versus value data can also be weighted differently.
Hidden schemas and virtual relations should themselves be ranked, in decreasing order of the number of tuples they contain.
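The classical tf-idf score mentioned above can be sketched as tf(t, d) * log(N / df(t)) over a toy collection (a simplified variant for illustration; real engines add normalization and many more signals):

```python
import math

def tfidf(term, doc, docs):
    """Term frequency in this document times the log of the inverse
    document frequency across the collection."""
    tf = doc.count(term)                       # occurrences in this doc
    df = sum(1 for d in docs if term in d)     # docs containing the term
    return tf * math.log(len(docs) / df) if df else 0.0

docs = [["wide", "table"], ["table", "data"], ["web", "data"]]
# "table" appears in 2 of 3 docs, so it scores lower than the rarer "wide".
assert tfidf("wide", docs[0], docs) > tfidf("table", docs[0], docs)
```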
Other Issues
Weak data types: arbitrary strings, timestamps
Consistency: the CAP theorem (consistency, availability, tolerance of network partitions). Wide-table workloads are typically write-once-read-many; most queries are read-only.
References
[1] R. Agrawal, A. Somani, and Y. Xu. Storage and Querying of E-commerce Data. In Proc. of VLDB, pages 149-158, 2001.
[2] G. P. Copeland and S. N. Khoshafian. A Decomposition Storage Model. In SIGMOD, 1985.
[3] J. L. Beckmann, A. Halverson, R. Krishnamurthy, and J. F. Naughton. Extending RDBMSs to Support Sparse Datasets Using an Interpreted Attribute Storage Format. In Proc. of ICDE, 2006.
[4] B. Yu, G. Li, B. C. Ooi, and L. Z. Zhou. One Table Stores All: Enabling Painless Free-and-Easy Data Publishing and Sharing. In CIDR, 2007.
[5] M. Stonebraker, D. J. Abadi, A. Batkin, et al. C-Store: A Column-Oriented DBMS. In Proc. of VLDB, 2005.
[6] E. Chu, J. Beckmann, and J. Naughton. The Case for a Wide-Table Approach to Manage Sparse Relational Data Sets. In SIGMOD, 2007.
[7] C. Cunningham, C. A. Galindo-Legaria, and G. Graefe. PIVOT and UNPIVOT: Optimization and Execution Strategies in an RDBMS. In VLDB, 2004.
[8] F. Chang, J. Dean, S. Ghemawat, et al. Bigtable: A Distributed Storage System for Structured Data. In OSDI, 2006.
[9] S. Ghemawat, H. Gobioff, and S. T. Leung. The Google File System. In SOSP, 2003.
[10] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, 2004.
[11] D. J. Abadi. Column Stores for Wide and Sparse Data. In CIDR, 2007.
[12] J. Madhavan, S. R. Jeffery, and S. Cohen. Web-scale Data Integration: You Can Only Afford to Pay As You Go. In CIDR, 2007.
[13] A. S. Hoque. Storage and Querying of High Dimensional Sparsely Populated Data in Compressed Representation. EurAsia-ICT, 2002.
[14] V. Hristidis and Y. Papakonstantinou. DISCOVER: Keyword Search in Relational Databases. In Proc. of VLDB, 2002.
[15] J. Madhavan, A. Halevy, S. Cohen, et al. Structured Data Meets the Web: A Few Observations. IEEE Data Engineering Bulletin, 29(4), December 2006.
[16] Hadoop website. http://lucene.apache.org/hadoop/
[17] HBase website. http://wiki.apache.org/lucene-hadoop/Hbase
[18] Delicious website. http://del.icio.us/
[19] Flickr website. http://www.flickr.com/
[20] Google Base website. http://base.google.com/
[21] Google Co-op website. http://www.google.com/coop
[22] M. Stonebraker. The Case for Partial Indexes. SIGMOD Record, 1989.
[23] WordNet website. http://wordnet.princeton.edu/
[24] R. Agrawal, R. Srikant, and Y. Xu. Database Technologies for Electronic Commerce. In Proc. of VLDB, 2002.
[25] Database Systems: The Complete Book.
[26] E. A. Brewer. Combining Systems and Databases: A Search Engine Retrospective.