1
Tough Choices
Materialize nothing. Compute every cell on demand. Worst query response time. No space requirements.
Materialize part of the data cube. Many cells are computable from other cells. But which cells to materialize? More cells = better query performance.
Materialize the entire data cube. Best query response time. Excessive space requirements.
2
Data Value Hypercube
DATA VALUE HYPERCUBES store data-record indices, whereas existing data cubes can only store data aggregates.
versus ordinary data cubes
DATA VALUE HYPERCUBES are generated as quickly as existing data cubes.
3
Remember this?
Now it doesn’t matter.
[Figure: OLTP and OLAP data sources, split into STRUCTURED DATA and UNSTRUCTURED DATA, with the labels +80% and -80%. Sources shown: Email, Multi-Dimensional Databases, XML, EDI, Spreadsheets, Web Pages, RSS, Web Log, Voice recognition, Instant Messaging, Wikis, Content Management, Document Management, Taxonomies, Ontologies, Multimedia, Legacy Databases, Relational Databases, Main Frame Databases.]
4
Hypercubes are constructed so that each cell corresponds to a unique combination of database attribute values.
3 attributes require at least 8 cells.
Hypercube
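The count of 8 can be checked by enumerating every subset of the three attributes. The sketch below assumes that the eight "cells" are the eight group-by views shown in the lattice on the following slides; the function name `group_by_views` is hypothetical.

```python
from itertools import combinations

DIMENSIONS = ("Customer", "Part", "Supplier")

def group_by_views(dimensions):
    """Return every subset of the dimensions, largest first."""
    views = []
    for size in range(len(dimensions), -1, -1):
        for subset in combinations(dimensions, size):
            views.append(subset)
    return views

views = group_by_views(DIMENSIONS)
for view in views:
    # The empty subset is the view labelled "None" on the slides.
    print("".join(view) or "None")
print(len(views), "views")
```

With d attributes the same loop produces 2^d views, which is why the number of views grows quickly as dimensions are added.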
5
6
[Figure: lattice of the eight views. Top: CustomerPartSupplier. Middle: CustomerPart, CustomerSupplier, PartSupplier. Below: Customer, Part, Supplier. Bottom: None.]
7
[Figure: the eight views populated with attribute values. Suppliers: Boeing, Lockheed. Parts: Cockpit, Jet Engine, Wing. Customers: Delta, FedEx. Views: CustomerPartSupplier (12 combinations), PartSupplier (6), CustomerPart (6), CustomerSupplier (4), Part (3), Supplier (2), Customer (2), None.]
8
[Figure: the same eight views (CustomerPartSupplier, PartSupplier, CustomerPart, CustomerSupplier, Part, Supplier, Customer, None), numbered 1 through 8.]
3 attributes require at least 8 cells.
9
CustomerPartSupplier
Supplier   Part         Customer   Sales
Boeing     Cockpit      Delta      $10
Boeing     Cockpit      FedEx      $20
Lockheed   Cockpit      Delta      $30
Lockheed   Cockpit      FedEx      $40
Boeing     Jet Engine   Delta      $50
Boeing     Jet Engine   FedEx      $60
Lockheed   Jet Engine   Delta      $70
Lockheed   Jet Engine   FedEx      $80
Boeing     Wing         Delta      $90
Boeing     Wing         FedEx      $100
Lockheed   Wing         Delta      $110
Lockheed   Wing         FedEx      $120

PartSupplier
Supplier   Part         Sales
Boeing     Cockpit      $30
Boeing     Jet Engine   $110
Boeing     Wing         $190
Lockheed   Cockpit      $70
Lockheed   Jet Engine   $150
Lockheed   Wing         $230

CustomerPart
Part         Customer   Sales
Cockpit      Delta      $40
Cockpit      FedEx      $60
Jet Engine   Delta      $120
Jet Engine   FedEx      $140
Wing         Delta      $200
Wing         FedEx      $220

CustomerSupplier
Supplier   Customer   Sales
Boeing     Delta      $150
Boeing     FedEx      $180
Lockheed   Delta      $210
Lockheed   FedEx      $240

Part
Part         Sales
Cockpit      $100
Jet Engine   $260
Wing         $420

Supplier
Supplier   Sales
Boeing     $330
Lockheed   $450

Customer
Customer   Sales
Delta      $360
FedEx      $420

All
Sales
$780
This is entirely fictional data.
10
Lattice Notation
A lattice is denoted as (L, <=).
L = the set of elements (queries).
<= is the dependence relation.
ancestor(a) = {b | a <= b}.
descendant(a) = {b | b <= a}.
Every element is its own descendant and ancestor.
next(a) = the immediate proper ancestors of a.
next(a) = {b | a < b, and there is no c with a < c and c < b}.
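These definitions can be tried out on the three-dimension example. The sketch below assumes that each query is the set of attributes it groups by and that a <= b means "a's attributes are a subset of b's"; that reading matches the (part) <= (part, customer) example later in the deck, but the encoding as Python frozensets is only an illustration.

```python
from itertools import combinations

DIMS = ("part", "supplier", "customer")

# L: every subset of the dimensions, each one standing for a group-by query.
L = [frozenset(c) for k in range(len(DIMS) + 1) for c in combinations(DIMS, k)]

def leq(a, b):
    """a <= b: query a can be answered from query b (assumed: subset test)."""
    return a <= b

def ancestor(a):
    return {b for b in L if leq(a, b)}

def descendant(a):
    return {b for b in L if leq(b, a)}

def next_(a):
    """Immediate proper ancestors: a < b with no element c strictly between."""
    proper = {b for b in L if a < b}
    return {b for b in proper if not any(a < c and c < b for c in L)}

part = frozenset({"part"})
print(sorted(sorted(b) for b in next_(part)))
print(len(ancestor(part)), "ancestors,", len(descendant(part)), "descendants")
```

Each pair (a, b) with b in next_(a) is one edge of the lattice diagram described on the next slide.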
11
Lattice Diagrams
Lattice diagrams are graphs.
Elements are nodes.
There is an edge from a to b iff b is in next(a).
There is a path downward from y to x iff x <= y.
12
Hypercube Algebra
A simple data warehouse example.
Parts are purchased from suppliers and then sold to customers.
Three dimensions: Part, Supplier, and Customer.
The measure of interest is total sales.
For each cell (p, s, c), store the total sales of part p that was bought from supplier s and sold to customer c.
Users are interested in consolidated sales.
Example: what is the total sales of a given part p to a given customer c?
This query is answered by looking up the value in cube cell (p, ALL, c).
CustomerPart
Part         Customer   Sales
Cockpit      Delta      $40
Cockpit      FedEx      $60
Jet Engine   Delta      $120
Jet Engine   FedEx      $140
Wing         Delta      $200
Wing         FedEx      $220
Many cells are computable from other cells.
Dependent cells.
Example: cell (p, ALL, c) is the sum of cells (p, s1, c), …, (p, sn, c).
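The sum over suppliers can be sketched directly on the fictional sales figures from the earlier slides. The dictionary layout and the name `roll_up_supplier` are assumptions made for this illustration, not part of the original design.

```python
from collections import defaultdict

ALL = "ALL"

# Base cells (part, supplier, customer) -> sales, from the fictional example.
BASE = {
    ("Cockpit", "Boeing", "Delta"): 10,
    ("Cockpit", "Boeing", "FedEx"): 20,
    ("Cockpit", "Lockheed", "Delta"): 30,
    ("Cockpit", "Lockheed", "FedEx"): 40,
    ("Jet Engine", "Boeing", "Delta"): 50,
    ("Jet Engine", "Boeing", "FedEx"): 60,
    ("Jet Engine", "Lockheed", "Delta"): 70,
    ("Jet Engine", "Lockheed", "FedEx"): 80,
    ("Wing", "Boeing", "Delta"): 90,
    ("Wing", "Boeing", "FedEx"): 100,
    ("Wing", "Lockheed", "Delta"): 110,
    ("Wing", "Lockheed", "FedEx"): 120,
}

def roll_up_supplier(base):
    """Compute every dependent cell (p, ALL, c) by summing over suppliers."""
    totals = defaultdict(int)
    for (p, s, c), sales in base.items():
        totals[(p, ALL, c)] += sales
    return dict(totals)

cells = roll_up_supplier(BASE)
print(cells[("Cockpit", ALL, "Delta")])  # 10 + 30
```

The six resulting cells are the CustomerPart view: none of them needs to be stored if the twelve base cells are available.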
13
The Dependence Relation on Queries
Consider two queries Q1 and Q2.
Q1 <= Q2 iff Q1 can be answered using only the results of Q2.
In that case Q1 is dependent on Q2.
For example, the query (part) can be answered using only the query (part, customer).
(part) <= (part, customer).
Some queries are not comparable with each other using the <= operator.
For example, (part) !<= (customer) and (customer) !<= (part).
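A minimal sketch of the relation, assuming a query is identified by the tuple of attributes it groups by and that Q1 <= Q2 holds when Q1's attributes are a subset of Q2's. The helper names `depends_on` and `answer_from` are hypothetical; the figures are the fictional CustomerPart totals.

```python
def depends_on(q1, q2):
    """Q1 <= Q2 under the assumed subset reading."""
    return set(q1) <= set(q2)

def answer_from(view, view_dims, query_dims):
    """Answer query_dims using only the rows of an already computed view."""
    if not depends_on(query_dims, view_dims):
        raise ValueError(f"{query_dims} is not dependent on {view_dims}")
    keep = [view_dims.index(d) for d in query_dims]
    result = {}
    for key, sales in view.items():
        sub = tuple(key[i] for i in keep)
        result[sub] = result.get(sub, 0) + sales
    return result

# The (part, customer) view from the example.
PART_CUSTOMER = {
    ("Cockpit", "Delta"): 40, ("Cockpit", "FedEx"): 60,
    ("Jet Engine", "Delta"): 120, ("Jet Engine", "FedEx"): 140,
    ("Wing", "Delta"): 200, ("Wing", "FedEx"): 220,
}

by_part = answer_from(PART_CUSTOMER, ("part", "customer"), ("part",))
print(by_part)
print(depends_on(("part",), ("customer",)), depends_on(("customer",), ("part",)))
```

The last line shows the incomparable case: neither (part) nor (customer) can be answered from the other.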
[Figure: the CustomerPart view. Cockpit: Delta $40, FedEx $60. Jet Engine: Delta $120, FedEx $140. Wing: Delta $200, FedEx $220.]
14
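B-TREE LOGIC: EASIER THAN IT LOOKS
[Figure: a balanced tree over the letters A to Z and over their positions 1 to 26, listed level by level from the leaves up to the top.]
A C E G I K M O Q S U W Y Z
B F J N R V X
D L T
H P
1 3 5 7 9 11 13 15 17 19 21 23 25 26
2 6 10 14 18 22 24
4 12 20
8 16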
15
B-TREE LOGIC: B IS FOR BALANCED
GIVEN 3RD ORDER B TREE WITH THE NUMBERS: 100 20 50 80 99
[Figure: the tree after each insertion. INSERT 9: the root is 20. INSERT 49: the root is 49. INSERT 51: the root is 50.]
Insert any number < 20 and 20 becomes the root.
Insert any number > 50 and 50 becomes the root.
Insert any number > 20 and < 50 and it becomes the root.
16
B-Tree Forest
Construction time for the tree forest is O( 1≤ i ≤ d (log ni)), where d is the number of query dimensions and ni is the number of attributes in the database at level i.
17
B-Tree Forest
A Balanced B-Tree Forest is the data structure that is used to represent a Hypercube.
Each dimension in the Hypercube is represented by a separate B-Tree.
B-Trees are well suited to sparse data and have fast insertion and search: O(log n) per operation, so O(n log n) to insert n keys.
18
B-Tree Forest
A binary tree forest consists of multiple levels of binary trees.
Each level represents a cube dimension.
A binary tree consists of nodes – stems or leaves.
Stem nodes point to left and right binary trees.
Leaf nodes point to a linked list of fact table IDs.
A linked list of fact table IDs points to fact table entries with identical attribute values.
A depth first search on a binary tree forest results in a GROUP BY clause.
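The structure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the author's implementation: the trees are plain unbalanced binary search trees, every node carries a key (rather than separating stems from leaves), a Python list stands in for the linked list of fact table IDs, and the sample facts, including the repeated first record, are made up.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.child = None     # root of the next level's tree, if any
        self.fact_ids = []    # fact table IDs; filled only at the last level

def insert(root, key):
    """Insert key into an unbalanced binary search tree; return (root, node)."""
    if root is None:
        node = Node(key)
        return node, node
    cur = root
    while True:
        if key == cur.key:
            return root, cur
        branch = "left" if key < cur.key else "right"
        nxt = getattr(cur, branch)
        if nxt is None:
            node = Node(key)
            setattr(cur, branch, node)
            return root, node
        cur = nxt

def add_fact(root, values, fact_id):
    """Walk one tree per dimension, then record fact_id at the last level."""
    root, node = insert(root, values[0])
    if len(values) == 1:
        node.fact_ids.append(fact_id)
    else:
        node.child = add_fact(node.child, values[1:], fact_id)
    return root

def group_by(root, prefix=()):
    """Depth-first, in-order traversal: yields (group key, fact IDs)."""
    if root is None:
        return
    yield from group_by(root.left, prefix)
    key = prefix + (root.key,)
    if root.child is None:
        yield key, root.fact_ids
    else:
        yield from group_by(root.child, key)
    yield from group_by(root.right, prefix)

# Hypothetical facts as (part, supplier, customer); IDs 0 and 2 are identical.
FACTS = [
    ("Cockpit", "Boeing", "Delta"),
    ("Cockpit", "Boeing", "FedEx"),
    ("Cockpit", "Boeing", "Delta"),
    ("Wing", "Lockheed", "FedEx"),
]

forest = None
for fact_id, values in enumerate(FACTS):
    forest = add_fact(forest, values, fact_id)

groups = list(group_by(forest))
for key, ids in groups:
    print(key, ids)
```

Each yielded pair corresponds to one row of a GROUP BY over part, supplier, and customer, with the list of fact IDs that share those attribute values.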
19
B-Tree Forest in Reverse: A primer
[Figure: the eight views with their Sales totals, repeated from slide 9, above three trees. Supplier Tree: Boeing, Lockheed. Parts Tree: Cockpit, Jet Engine, Wing. Customer Tree: Delta, FedEx.]
20
Extensive B-Trees Are Common
[Figure: three larger trees.
Suppliers: BOEING, GENERAL DYNAMICS, LOCKHEED MARTIN, HONEYWELL INT’L, NORTHROP GRUMMAN, UNITED TECHNOLOGIES.
Parts: AVIONICS, ELEVATOR, JET ENGINE, AILERON, FLIGHT CONTROLS, STABILIZER, COCKPIT, FIN, FUSELAGE, RUDDER, WING, LANDING GEAR.
Customers: SOUTHWEST, DHL, DELTA, VIRGIN, FED EX.]
But let’s keep it simple for now.
21
Incoming Data Stream
[Figure: DATA FLOW. The CustomerPartSupplier Sales records ($10 to $120) arrive in 2 intervals of data flow, Chunk 1 and Chunk 2, and feed the same eight views with Sales totals as on slide 9.]
22
Setting up Fact & Dimension Tables
[Figure: the incoming CustomerPartSupplier Sales records, Chunk 1 and Chunk 2, feeding the tables below.]

Global String Table (UNSORTED)
String       ID
Boeing       0
Lockheed     1
Cockpit      2
Jet Engine   3
Wing         4
Delta        5
FedEx        6

Supplier Dimension Table (SORTED)
String     String Table ID   ID
Boeing     0                 0
Lockheed   1                 1

Part Dimension Table (SORTED)
String       String Table ID   ID
Cockpit      2                 0
Jet Engine   3                 1
Wing         4                 2

Customer Dimension Table (SORTED)
String   String Table ID   ID
Delta    5                 0
FedEx    6                 1

Fact Table
ID   Supplier   Part   Customer   Sales
0    0          0      0          $10
1    0          0      1          $20
2    1          0      0          $30
3    1          0      1          $40
4    0          1      0          $50
5    0          1      1          $60
6    1          1      0          $70
7    1          1      1          $80
8    0          2      0          $90
9    0          2      1          $100
10   1          2      0          $110
11   1          2      1          $120
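The encoding steps on this slide can be sketched as below. The function name `build_tables` and the choice to assign string IDs column by column (suppliers, then parts, then customers) are assumptions made so that the output reproduces the IDs shown on the slide; the slide itself does not state how the IDs were assigned.

```python
DIMENSIONS = ("supplier", "part", "customer")
SUPPLIERS = ("Boeing", "Lockheed")
PARTS = ("Cockpit", "Jet Engine", "Wing")
CUSTOMERS = ("Delta", "FedEx")

# Incoming records as (supplier, part, customer, sales), fictional data.
RECORDS = []
sales = 10
for part in PARTS:
    for supplier in SUPPLIERS:
        for customer in CUSTOMERS:
            RECORDS.append((supplier, part, customer, sales))
            sales += 10

def build_tables(records):
    # Global string table: IDs in order of first appearance (unsorted).
    string_table = {}
    for col in range(len(DIMENSIONS)):
        for rec in records:
            string_table.setdefault(rec[col], len(string_table))

    # One dimension table per attribute: string -> (string table ID, sorted ID).
    dim_tables = {}
    for col, name in enumerate(DIMENSIONS):
        values = sorted({rec[col] for rec in records})
        dim_tables[name] = {v: (string_table[v], i) for i, v in enumerate(values)}

    # Fact table rows: (ID, supplier ID, part ID, customer ID, sales).
    fact_table = []
    for fact_id, rec in enumerate(records):
        ids = tuple(dim_tables[name][rec[col]][1]
                    for col, name in enumerate(DIMENSIONS))
        fact_table.append((fact_id,) + ids + (rec[-1],))
    return string_table, dim_tables, fact_table

string_table, dim_tables, fact_table = build_tables(RECORDS)
for row in fact_table:
    print(row)
```

After this step the fact table holds only small integers plus the measure, and every string is stored once.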
23
Let’s just say ‘Parts’ is the most significant data of interest.
[Figure: the Fact Table with Part as the leading column.
ID: 0 to 11.
Part: 0 0 0 0 1 1 1 1 2 2 2 2.
Supplier: 0 0 1 1 0 0 1 1 0 0 1 1.
Customer: 0 1 0 1 0 1 0 1 0 1 0 1.
Sales: $10 to $120 in steps of $10.]
24
Understanding Nested B-Trees
[Figure: animation frames that rotate the Fact Table (columns ID, Sales, Supplier, Part, Customer; IDs 0 to 11; Sales $10 to $120) until each column is laid out as a row.]
25
Understanding Nested B-Trees
[Figure: the rotated Fact Table (rows Sales, Supplier, Part, Customer, ID for IDs 0 to 11) beside the Supplier, Part, and Customer Dimension Tables, with nested trees drawn over it: Part nodes (Cockpit, Jet Engine, Wing), each with Supplier nodes B and L, each with Customer nodes D and F.]
26
Making a B-Tree Forest
[Figure: the rotated Fact Table with the nested trees above it: Part nodes (Cockpit, Jet Engine, Wing), Supplier nodes (Boeing, Lockheed), and Customer nodes (Delta, FedEx).]
Drilling down the Hypercube to a Single Data Value
27
Data Structure & Concept Side by Side
Do you see the Data Value Hypercube to the left?
[Figure: side by side, the B-tree forest (Part nodes Cockpit, Jet Engine, Wing; Supplier nodes Boeing, Lockheed; Customer nodes Delta, FedEx) and the eight hypercube views (CustomerPartSupplier, CustomerSupplier, PartSupplier, CustomerPart, Supplier, Customer, Part, None).]
28
Network Data Stream
[Table: two chunks of network records with the columns Protocol, Content, ID, Destination IP, Source IP, Time Stamp. Protocol and Content are stored as dimension table IDs. Chunk 1 holds IDs 0 to 15, with time stamps 1166832000 to 1166832022. Chunk 2 holds IDs 0 to 14, with time stamps 1166832030 to 1166832050.]
GLOBAL String Table
String         ID
SMB            0
LDAP           1
SSH            2
AOL            3
JPEG           4
ENGLISH        5
ZIP            6
COMPRESS       7
GIFF           8
POP            9
SMTP           10
IMAP           11
FTP            12
TELNET         13
SKYPE          14
CMS            15
FRENCH         16
RUSSIAN        17
BMP            18
BASIC SOURCE   19
C SOURCE       20
DISCOVER       21

CONTENT Dimension Table
String         String Table ID   ID
BASIC SOURCE   19                0
BMP            18                1
C SOURCE       20                2
CMS            15                3
COMPRESS       7                 4
DISCOVER       21                5
ENGLISH        5                 6
FRENCH         16                7
GIFF           8                 8
JPEG           4                 9
RUSSIAN        17                10
ZIP            6                 11

PROTOCOL Dimension Table
String   String Table ID   ID
AOL      3                 0
FTP      12                1
IMAP     11                2
LDAP     1                 3
POP      9                 4
SKYPE    14                5
SMB      0                 6
SMTP     10                7
SSH      2                 8
TELNET   13                9
Only showing 2 out of 16 NETWORK DATA STREAM Dimensions
29
B-TREE Notation
Example node: FTP B (1,3)
FTP = Attribute Name
B = Node
(1,3) = (Level, Record Number)
30
NETWORK DATA STREAM
[Figure: the “Protocols” B-TREE, with nodes POP B (1,9), AOL B (1,7), IMAP B (1,8), SKYPE B (1,4), FTP B (1,3), LDAP B (1,1), TELNET B (1,6), SMTP B (1,5), SSH B (1,2), SMB B (1,0).]
31
Notation
Example node: BMP 4, B (7,9) (7,9) (7,9) (7,9)
BMP = Attribute Name
4 = Record Count
(7,9) = (Chunk, Record Number)
Tree nodes not only contain data aggregates but a linked list of data record indices.
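A node of this kind might look like the sketch below. The class name `ValueNode` and its fields are hypothetical, a Python list stands in for the linked list, and the record count is taken as the aggregate. The sample indices are the BASIC SOURCE entries listed for the SSH tree on the “Content” B-Trees slide.

```python
from dataclasses import dataclass, field

@dataclass
class ValueNode:
    attribute_name: str
    # Data record indices as (chunk, record number) pairs.
    records: list = field(default_factory=list)

    @property
    def record_count(self):
        """The aggregate kept by this node: how many records it references."""
        return len(self.records)

    def add(self, chunk, record_number):
        self.records.append((chunk, record_number))

node = ValueNode("BASIC SOURCE")
node.add(1, 15)
node.add(2, 0)
node.add(2, 1)
print(node.attribute_name, node.record_count, node.records)
```

Because the indices span chunks 1 and 2, the same node keeps accumulating references as later chunks of the stream arrive.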
32
“Content” B-Trees
Each entry reads: Attribute Name, Record Count: (Chunk, Record Number) list.
B (1,8) SSH
  ZIP 3: (2,10) (2,11) (2,12)
  C SOURCE 4: (2,3) (2,4) (2,5) (2,6)
  BMP 1: (2,2)
  BASIC SOURCE 3: (1,15) (2,0) (2,1)
  RUSSIAN 3: (2,7) (2,8) (2,9)
B (1,0) AOL
  C SOURCE 1: (1,4)
  BMP 1: (1,3)
  BASIC SOURCE 3: (1,0) (1,1) (1,2)
B (1,1) FTP
  CMS 1: (1,5)
B (1,2) IMAP
  COMPRESS 1: (1,6)
B (1,3) LDAP
  DISCOVER 2: (1,7) (1,8)
B (1,4) POP
  FRENCH 1: (1,9)
B (1,5) SKYPE
  GIFF 1: (1,10)
B (1,6) SMB
  JPEG 2: (1,11) (1,12)
B (1,7) AOL
  RUSSIAN 1: (1,14)
33
B-Tree Forest
[Figure: the “Protocols” B-tree (POP, AOL, IMAP, SKYPE, FTP, LDAP, TELNET, SMTP, SSH, SMB) with a Pointer from B (1,0) AOL to its “Content” B-tree: C SOURCE 1: (1,4); BMP 1: (1,3); BASIC SOURCE 3: (1,0) (1,1) (1,2). In the label B (1,0), 1 is the Level and 0 is the Index of Tree at the same level.]
34
[Figure: the complete B-tree forest: the “Content” B-trees of slide 32 together with the “Protocols” B-tree of slide 30.]
35
Conclusion
B-tree forests are limited to data aggregates. Data aggregates only identify the existence of a dimensional combination. They do not provide access to complete data records.
With current OLAP implementations, examining data records requires issuing additional database queries, which is inefficient.
We solve this problem by extending a balanced B-tree forest to include references to data records. We call this new type of hypercube the data value cube. In our data cube, tree nodes contain not only data aggregates but also a linked list of data record indices.
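The difference can be illustrated with a small sketch in which each cube cell keeps both its aggregate and the indices of the records behind it. This is only an illustration of the idea under loud assumptions: a fully materialized Python dictionary stands in for the balanced B-tree forest, lists stand in for linked lists, and the names `build_data_value_cube`, `total`, and `indices` are hypothetical. The four records are the Cockpit rows of the fictional example.

```python
from itertools import product

ALL = "ALL"

# Fact table: index -> (part, supplier, customer, sales).
FACT_TABLE = [
    ("Cockpit", "Boeing", "Delta", 10),
    ("Cockpit", "Boeing", "FedEx", 20),
    ("Cockpit", "Lockheed", "Delta", 30),
    ("Cockpit", "Lockheed", "FedEx", 40),
]

def build_data_value_cube(facts):
    """Map every cell, including ALL cells, to its total and record indices."""
    cube = {}
    for index, (part, supplier, customer, sales) in enumerate(facts):
        for key in product((part, ALL), (supplier, ALL), (customer, ALL)):
            cell = cube.setdefault(key, {"total": 0, "indices": []})
            cell["total"] += sales
            cell["indices"].append(index)
    return cube

cube = build_data_value_cube(FACT_TABLE)
cell = cube[("Cockpit", ALL, "Delta")]
records = [FACT_TABLE[i] for i in cell["indices"]]
print(cell["total"], cell["indices"])
print(records)
```

An aggregate-only cube would stop at the total; here the stored indices lead straight to the complete records without a second query.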
36
THE Q&A
Stephen A. Broeker