37
Inspector Joins IC-65 Advances in Data Management Systems 1 Inspector Joins By Shimin Chen, Anastassia Ailamaki, Phillip, and Todd C. Mowry VLDB 2005 Rammohan Narendula

Inspector Joins

Embed Size (px)

DESCRIPTION

Inspector Joins. By Shimin Chen, Anastassia Ailamaki, Phillip, and Todd C. Mowry VLDB 2005. Rammohan Narendula. Introduction. Query execution is I/O bound- so most of the research concentrates on main memory Goal- reduce no. of page faults thus reduce no. of disk I/Os. - PowerPoint PPT Presentation

Citation preview

Page 1: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 1

Inspector Joins

By Shimin Chen, Anastassia Ailamaki, Phillip, and Todd C. Mowry

VLDB 2005

Rammohan Narendula

Page 2: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 2

Introduction

Query execution isI/O bound- so most of theresearch concentrateson main memory Goal- reduce no. of pagefaults thus reduce no. of disk I/Os

However, hash join is a special class of techniqueswhere hash-join becomesCPU bound given sufficientI/O bandwidth and employingAdvanced I/O techniques (I/O prefetching)Goal- reduce no. of cache misses

Page 3: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 3

Exploiting Information about Data

• Ability to improve query depends on information quality• General stats on relations are inadequate

– May lead to incorrect decisions for specific queries

– Especially true for join queries

• Previous approaches exploiting dynamic information– Collecting information from previous queries

• Multi-query optimization [Sellis’88]

• Materialized views [Blakeley et al. 86]

• Join indices [Valduriez’87]

– Dynamic re-optimization of query plans [Kabra&DeWitt’98] [Markl et al. 04]

This study exploits the inner structure of hash joins

Page 4: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 4

Exploiting Multi-Pass Structure of Hash Joins

• Idea: – Examine the actual data in I/O partitioning phase

– Extract useful information to improve join phase

I/O Partitioning Join

Extra information greatly helps phase 2

Inspection

Page 5: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 5

Using Extracted Information

• Enable a new join phase algorithm – Reduce the primary performance bottleneck in hash joins

i.e. Poor CPU cache performance– Optimized for multi-processor systems

• Choose the most suitable join phase algorithm for special input cases

I/O Partitioning

decide Cache

PartitioningCache Prefetching

Simple Hash JoinInspection

Join Phase

New AlgorithmExtracted Information

Page 6: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 6

Outline

• Motivation• Previous hash join algorithms• Hash join performance on SMP systems• Inspector join• Experimental results• Conclusions

Page 7: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 7

Hash Table

• Join Phase: (simple hash join)– Build hash table, then probe hash table

GRACE Hash Join• I/O Partitioning Phase:

– Divide input relations into partitions with a hash function

Build Probe

Build Probe

• Random memory accesses cause poor CPU cache performance

Over 70% execution time

stalled on cache misses!

Page 8: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 8

Cache Partitioning• Recursively produce cache-sized partitions after I/O partitioning

• Avoid cache misses when joining cache-sized partitions• Overhead of re-partitioning

BuildProbeMemory-sized

PartitionsCache-sized

Partitions

Page 9: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 9

Cache Prefetching• Reduce impact of cache misses

– Exploit available memory bandwidth– Overlap cache misses and computations– Insert cache prefetch instructions into code

• Still incurs the same number of cache misses

Hash Table

ProbeBuild

Page 10: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 10

Outline

• Motivation• Previous hash join algorithms• Hash join performance on SMP systems• Inspector join• Experimental results• Conclusions

Page 11: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 11

Hash Joins on SMP Systems• Previous studies mainly focus on uni-processors

• Memory bandwidth is precious– It becomes the bottleneck in cache-prefetching techniques

• Each processor joins a pair of partitions in join phase

Main Memory

Shared bus

Cache

CPU

Cache

CPU

Cache

CPU

Cache

CPU

Build1

Probe1

Build4

Probe4

Build2

Probe2

Build3

Probe3

Page 12: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 12

Inspector Joins • Extracted information: summary of matching relationships

– Every K contiguous pages in a build partition forms a sub-partition

– Tells which sub-partition(s) every probe tuple matches

Build Partition

Sub-partition 0

Sub-partition 1

Sub-partition 2

Probe Partition

I/O Partitioning Join

Summary of Matching

Relationship

Page 13: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 13

Cache-Stationary Join Phase

• Recall cache partitioning: re-partition cost

I/O Partitioning Join

Build PartitionProbe Partition

Hash TableCPU

Cache

• We want to achieve zero copying

Copying cost

Copying cost

Page 14: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 14

Cache-Stationary Join Phase

• Joins a sub-partition and its matching probe tuples• Sub-partition is small enough to fit in CPU cache• Cache prefetching for the remaining cache misses

• Zero copying for generating recursive cache-sized partitions

I/O Partitioning Join

Build PartitionProbe Partition

Hash TableCPU

CacheSub-partition 0

Sub-partition 1

Sub-partition 2

Page 15: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 15

Filters in I/O Partitioning

• How to extract the summary efficiently?• Extend filter scheme in commercial hash joins• Conventional single-filter scheme

– Represent all build join keys– Filter out probe tuples having no matches

Build Relation

Filter

Mem-sized

PartitionsConstruct Test

I/O Partitioning Join

Probe Relation

Page 16: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 16

Background: Bloom Filter• A bit vector

– A key is hashed d (e.g. d=3) times and represented by d bits

• Construct: for every build join key, set its 3 bits in vector• Test: given a probe join key, check if all its 3 bits are 1

– Discard the tuple if some bits are 0– May have false positives

0 0 0 1 1 1 0 0 0 1 1 0 0 1 0 0 0 0 0 1

Bit0=H0(key)

Bit1=H1(key)

Bit2=H2(key)

Filter

Page 17: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 17

Multi-Filter Scheme• Single filter: a probe tuple entire build relation• Our goal: a probe tuple sub-partitions• Construct a filter for every sub-partition

• Replace a single large filter with multiple small filters

Single Filter

Build Relatio

n

Partition 0

Partition 1

Partition 2

Sub0,0Sub0,1Sub0,2

Sub1,0Sub1,1Sub1,2

Sub2,0Sub2,1Sub2,2

Multi-Filter

I/O Partitioning Join

Page 18: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 18

Testing Multi-FiltersWhen partitioning the probe relation

• Test a probe tuple against all the filters of a partition

• Tells which sub-partition(s) the tuple may have matches

• Store summary of matching relationships in partitions– This information is used to extract probe tuples in the order of partition IDs. A

special array is constructed using count sort technique for this purpose.

Probe Relation

Partition 0

Partition 1

Partition 2

Multi-Filter

Test

I/O Partitioning Join

Page 19: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 19

Cont’d…

• Extracting probe tuple information for every sub-partition using counting sort

– One array for each sub partition. Size of the array is number of matching probe tuples for that partition.

– The tuples are never visited or copied in the coutning sort.

• Joining pair of build and probe sub-partitions

Page 20: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 20

Minimizing Cache Misses for Testing Filters

• Single filter scheme: – Compute 3 bit positions– Test 3 bits

• Multi-filter scheme: if there are S sub-partitions in a partition– Compute 3 bit positions– Test the same 3 bits for every filter, altogether 3*S bits

• May cause 3*S cache misses !

Test

Probe Relation

Partition 0

Partition 1

Partition 2

Multi-Filter

001

111

011S filters

Page 21: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 21

Vertical Filters for Testing

• Bits at the same position are contiguous in memory• 3 cache misses instead of 3*S cache misses!

• Horizontal vertical conversion after partitioning build relation– Very small overhead in practice

Probe Relation

Partition 0

Partition 1

Partition 2

Test001

111

011

S filters

Contiguous in

memory

I/O Partitioning Join

Page 22: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 22

Outline

• Motivation• Previous hash join algorithms• Hash join performance on SMP systems• Inspector join• Experimental results• Conclusions

Page 23: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 23

Experimental Setup• Relation schema: 4-byte join attribute + fixed length payload• No selection, no projection• 50MB memory per CPU available for the join phase• Same join algorithm run on every CPU joining different partitions

• Detailed cycle-by-cycle simulations– A shared-bus SMP system with 1.5GHz processors

– Memory hierarchy is based on Itanium 2 processor

Page 24: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 24

Partition Phase Wall-Clock Time

• I/O partitioning can take advantage of multiple CPUs– Cut input relations into equal-sized chunks – Partition one chunk on every CPU– Concatenate outputs from all CPUs

• Enhanced cache partitioning: cache partitioning + advanced prefetching• Inspection incurs very small overhead

– Ratio of execution time with best algo- 0.88 to 0.94– Mainly computation cost of converting horizontal filters to vertical and testing

GRACECache prefetchingCache partitioningEnhanced cache partitioningInspector join

•500MB joins 2GB•100B tuples, 4B keys•50% probe tuples no matches

•A build matches 2 probe tuples

Number of CPUs used

Page 25: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 25

Join Phase Aggregate Time

• Inspector join achieves significantly better performancewhen 8 or more CPUs are used

– Because of local optimization + catch prefetching– 1.7-2.1X speedups over cache prefetching

• Memory B/W becomes bottleneck when more no of processors are used– 1.6-2.0X speedups over enhanced cache partitioning

•500MB joins 2GB•100B tuples, 4B keys•50% probe tuples no matches

•A build matches 2 probe tuples

Number of CPUs used

GRACECache prefetchingCache partitioningEnhanced cache partitioningInspector join

Page 26: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 26

Results on Choosing Suitable Join Phase

• Case #1: a large number of duplicate build join keys– Choose enhanced cache partitioning

– When a probe tuple on average matches 4 or more sub-partitions

• Case #2: nearly sorted input relations– Surprisingly: cache-stationary join is very good

I/O Partitioning

decide Cache

PartitioningCache Prefetching

Simple Hash JoinInspection

Join Phase

Cache StationaryExtracted Info

Page 27: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 27

Conclusions• Exploit multi-pass structure for higher quality info about data• Achieve significantly better cache performance

– 1.6X speedups over previous cache-friendly algorithms

– When 8 or more CPUs are used

• Choose most suitable algorithms for special input cases• Idea may be applicable to other multi-pass algorithms

Page 28: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 28

Thank You !

Page 29: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 29

Previous Algorithms on SMP Systems

• Join phase performance of joining a 500MB and a 2GB relations (details later in the talk)

• Aggregate performance degrades dramatically over 4 CPUs

Reduce data movement (memory to memory, memory to cache)

Wall clock time Aggregate time on all CPUsGRACE

Cache partitioningCache prefetching

Number of CPUs used

Re-partition

cost

Number of CPUs used

Bandwidth-sharing

Page 30: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 30

More Details in Paper• Moderate memory space requirement for filters• Summary information representation in intermediate partitions• Preprocessing for cache-stationary join phase• Prefetching for improving efficiency and robustness

Page 31: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 31

Partition Phase Wall-Clock Time

• I/O partitioning can take advantage of multiple CPUs– Cut input relations into equal-sized chunks

– Partition one chunk on every CPU

– Concatenate outputs from all CPUs

• Inspection incurs very small overhead

•500MB joins 2GB•100B tuples, 4B keys•50% probe tuples no matches

•A build matches 2 probe tuples

Number of CPUs used

GRACECache prefetchingCache partitioningInspector join

Page 32: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 32

Join Phase Aggregate Time

• Inspector join achieves significantly better performancewhen 8 or more CPUs are used– 1.7-2.1X speedups over cache prefetching

– 1.6-2.0X speedups over enhanced cache partitioning

•500MB joins 2GB•100B tuples, 4B keys•50% probe tuples no matches

•A build matches 2 probe tuples

Number of CPUs used

GRACECache prefetchingCache partitioningInspector join

Page 33: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 33

CPU-Cache-Friendly Hash Joins• Recent studies focus on CPU cache performance

– I/O partitioning gives good I/O performance– Random memory accesses cause poor CPU cache performance

• Cache Partitioning [Shatdal et al. 94] [Boncz et al.’99] [Manegold et al.’00]– Recursively produce cache-sized partitions from memory-sized

partitions– Avoid cache misses during join phase– Pay re-partitioning cost

• Cache Prefetching [Chen et al. 04]– Exploit memory system parallelism– Use prefetches to overlap multiple cache misses and computations

Hash Table

ProbeBuild

Page 34: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 34

Example Special Input Cases• Example case #1: a large number of duplicate build join keys

– Count the average number of sub-partitions a probe tuple matches

– Must check the tuple against all possible sub-partitions

– If too large, cache stationary join works poorly

• Example case #2: nearly sorted input relations– A merge-based join phase might be better?

Build Partition

Probe Partition

Sub-partition 0

Sub-partition 1

Sub-partition 2

A probe tuple

Page 35: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 35

Varying Number of Duplicates per Build Join Key

• Join phase aggregate performance• Choose enhanced cache part

– When a probe tuple on average matches 4 or more sub-partitions

Page 36: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 36

Nearly Sorted Cases

• Sort both input relations, then randomly move 0%-5% of tuples• Join phase aggregate performance• Surprisingly: cache-stationary join is very good

– Even better than merge join when over 1% tuples are out-of-order

Page 37: Inspector Joins

Inspector Joins IC-65 Advances in Data Management Systems 37

Analyzing Nearly Sorted Case• Partitions are also nearly sorted• Probe tuples matching a sub-partition are almost contiguous• Similar memory behavior as merge join• No cost for sorting out-of-order tuples

Build Partition

Probe Partition

Sub-partition 0

Sub-partition 1

Sub-partition 2

A probe tuple

Nearly Sorted Nearly Sorted