26
DWH-Ahsan Abdullah DWH-Ahsan Abdullah 1 Data Warehousing Data Warehousing Lecture-26 Lecture-26 Need for Speed: Need for Speed: Conventional Indexing Techniques Conventional Indexing Techniques Virtual University of Virtual University of Pakistan Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics Research www.nu.edu.pk/cairindex.asp National University of Computers & Emerging Sciences, Islamabad Email: [email protected]

Lecture 26

Embed Size (px)

Citation preview

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

11

Data Warehousing Data Warehousing Lecture-26Lecture-26

Need for Speed:Need for Speed:Conventional Indexing TechniquesConventional Indexing Techniques

Virtual University of PakistanVirtual University of Pakistan

Ahsan AbdullahAssoc. Prof. & Head

Center for Agro-Informatics Researchwww.nu.edu.pk/cairindex.asp

National University of Computers & Emerging Sciences, IslamabadEmail: [email protected]

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

22

Need For Indexing: SpeedNeed For Indexing: SpeedConsider searching your hard disk using the Windows SEARCH command.

Search goes into directory hierarchies. Takes about a minute, and there are only a few thousand files.

Assume a fast processor and (even more importantly) a fast hard disk.

Assume file size to be 5 KB. Assume hard disk scan rate of a million files per second. Resulting in scan rate of 5 GB per second.

Largest search engine indexes more than 8 billion pages At above scan rate 1,600 seconds required to scan ALL pages. This is just for one user! No one is going to wait for 26 minutes, not even 26 seconds.

Hence, a sequential scan is simply not feasible.

No text goes to graphics

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

33

Need For Indexing: Query ComplexityNeed For Indexing: Query Complexity

How many customers do I have in Karachi?

How many customers in Karachi made calls during April?

How many customers in Karachi made calls to Multan during April?

How many customers in Karachi made calls to Multan during April using a particular calling package?

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

44

Need For Indexing: I/O BottleneckNeed For Indexing: I/O Bottleneck

Throwing hardware just speeds up the CPU intensive tasks.

The problem is of I/O, which does not scales up easily.

Putting the entire table in RAM is very very expensive.

Therefore, index!

No text goes to graphics

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

55

Indexing ConceptIndexing Concept Purely physical concept, nothing to do with logical model.Purely physical concept, nothing to do with logical model.

Invisible to the end user (programmer), optimizer chooses it, Invisible to the end user (programmer), optimizer chooses it, effects only the speed, not the answer.effects only the speed, not the answer.

With the library analogy, the time complexity to find a book? The With the library analogy, the time complexity to find a book? The average time takenaverage time taken

Using a card catalog organized in many different ways i.e. author, Using a card catalog organized in many different ways i.e. author, topic, title etc and is sorted.topic, title etc and is sorted.

A little bit of extra time to first check the catalog, but it “gives” a A little bit of extra time to first check the catalog, but it “gives” a pointer to the shelf and the row where book is located.pointer to the shelf and the row where book is located.

The catalog has no data about the book, just an efficient way of The catalog has no data about the book, just an efficient way of searching.searching.

No text goes to graphics

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

66

Indexing GoalIndexing Goal

Look at as few blocks as possible to find the matching record(s)

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

77

Conventional indexing TechniquesConventional indexing Techniques DenseDense SparseSparse Multi-level (or B-Tree)Multi-level (or B-Tree)

Primary Index vs. Secondary IndexesPrimary Index vs. Secondary Indexes

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

88

Dense Index

10203040

50607080

90100110120

Data File

2010

4030

6050

8070

10090

Every key in the data file is represented in the index file

Dense Index: ConceptDense Index: Concept

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

99

Dense Index: Adv & Dis AdvDense Index: Adv & Dis Adv

Advantage:Advantage: A dense index, if fits in the memory, is very A dense index, if fits in the memory, is very

efficient in locating a record given a keyefficient in locating a record given a key

Disadvantage:Disadvantage: A dense index, if too big and doesn’t fit into the A dense index, if too big and doesn’t fit into the

memory, will be expensive when used to find a memory, will be expensive when used to find a record given its keyrecord given its key

No text goes to graphics

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

1010

Sparse Index

10305070

90110130150

170190210230

Data File

2010

4030

6050

8070

10090

Normally keeps Normally keeps only one key per only one key per data blockdata block

Some keys in the Some keys in the data file will not data file will not have an entry in have an entry in the index filethe index file

Sparse Index: ConceptSparse Index: Concept

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

1111

Sparse Index: Adv & Dis AdvSparse Index: Adv & Dis Adv

Advantage:Advantage: A sparse index uses less space at the expense of A sparse index uses less space at the expense of

somewhat more time to find a record given its keysomewhat more time to find a record given its key

Support multi-level indexing structureSupport multi-level indexing structure

Disadvantage:Disadvantage: Locating a record given a key has different Locating a record given a key has different

performance for different key valuesperformance for different key values

No text goes to graphics

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

1212

Sparse 2nd level

1090

170250

330410490570

Data File

2010

4030

6050

8070

10090

10305070

90110130150

170190210230

Sparse Index: Multi levelSparse Index: Multi level

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

1313

B-tree Indexing: ConceptB-tree Indexing: Concept

Can be seen as a general form of multi-level Can be seen as a general form of multi-level indexes.indexes.

Generalize usual (binary) search trees (BST). Generalize usual (binary) search trees (BST).

Allow efficient and fast exploration at the expense of Allow efficient and fast exploration at the expense of using slightly more space.using slightly more space.

Popular variant: BPopular variant: B++-tree-tree

Support more efficiently queries like:Support more efficiently queries like:SELECT * FROM R WHERE a = 11SELECT * FROM R WHERE a = 11

SELECT * FROM R WHERE 0<= b and b<42SELECT * FROM R WHERE 0<= b and b<42

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

1414

20

0

22

02

50

28

0

130

B-tree Indexing: ExampleB-tree Indexing: Example

Each node stored in one disk blockR

ID li

st

9 20

100

140

145

200

210

215

220

230

250

256

279

280

300

Looking for Empno 250

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

1515

B-tree Indexing: LimitationsB-tree Indexing: Limitations

If a table is large and there are fewer unique values. Capitalization is not programmatically enforced

(meaning case-sensitivity does matter and “FLASHMAN" is different from “Flashman").

Outcome varies with inter-character spaces. A noun spelled differently will result in different

results.

Insertion can be very expensive.

Nothing will go to graphics

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

1616

B-tree Indexing: Limitations ExampleB-tree Indexing: Limitations ExampleGiven that MOHAMMED is the most common first name in Pakistan, a 5-million row Customers table would produce many screens of matching rows for MOHAMMED AHMAD, yet would skip potential matching values such as the following:

VALUE MISSED REASON MISSED

Mohammed Ahmad Case sensitive

MOHAMMED AHMED AHMED versus AHMAD

MOHAMMED AHMAD Extra space between names

MOHAMMED AHMAD DR DR after AHMAD

MOHAMMAD AHMAD Alternative spelling of MOHAMMAD

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

1717

Hash Based IndexingHash Based Indexing

You may recall that in internal memory, hashing can You may recall that in internal memory, hashing can be used to quickly locate a specific key.be used to quickly locate a specific key.

The same technique can be used on external memory.The same technique can be used on external memory.

However, advantage over search trees is smaller in However, advantage over search trees is smaller in external search than internal. external search than internal. WHY?WHY?

Because part of search tree can be brought into the Because part of search tree can be brought into the main memory.main memory.

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

1818

Hash Based Indexing: ConceptHash Based Indexing: ConceptIn contrast to B-tree indexing, hash based indexes do not In contrast to B-tree indexing, hash based indexes do not (typically) keep index values in sorted order.(typically) keep index values in sorted order.

Index entry is found by hashing on index value requiring Index entry is found by hashing on index value requiring exact match.exact match.

SELECT * FROM Customers WHERE AccttNo= 110240SELECT * FROM Customers WHERE AccttNo= 110240

Index entries kept in hash organized tables rather than B-Index entries kept in hash organized tables rather than B-tree structures.tree structures.

Index entry contains ROWID values for each row Index entry contains ROWID values for each row corresponding to the index value.corresponding to the index value.

Remember few numbers in real-life to be useful for hashing.Remember few numbers in real-life to be useful for hashing.

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

1919

.

.

.

records

.

.

key h(key) disk block

Note on terminology:The word "indexing" is often used synonymously with "B-tree indexing".

Hashing as Primary IndexHashing as Primary Index

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

2020

key h(key)

Index

recordkey

Can always be transformed to a secondary index using indirection, as above.

Indexing the Index

Hashing as Secondary IndexHashing as Secondary Index

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

2121

Indexing (using B-trees) good for range Indexing (using B-trees) good for range searches, e.g.:searches, e.g.:SELECT * FROM R WHERE A > 5SELECT * FROM R WHERE A > 5

Hashing good for match based searches, Hashing good for match based searches, e.g.:e.g.:SELECT * FROM R WHERE A = 5SELECT * FROM R WHERE A = 5

B-tree vs. Hash IndexesB-tree vs. Hash Indexes

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

2222

Primary Key vs. Primary IndexPrimary Key vs. Primary Index Relation Students

Name ID dept AHMAD 123 CS Akram 567 EE Numan 999 CS

Primary Key & Primary Index:Primary Key & Primary Index:

PK is ALWAYS unique.PK is ALWAYS unique.

PI can be unique, but does not have to be.PI can be unique, but does not have to be.

In DSS environment, very few queries are PK based.In DSS environment, very few queries are PK based.

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

2323

Primary index selection criteria:Primary index selection criteria:

Common join and retrieval key.Common join and retrieval key.

Can be unique UPI or non-unique NUPI.Can be unique UPI or non-unique NUPI.

Limits on NUPI.Limits on NUPI.

Only one primary index per table (for hash-based Only one primary index per table (for hash-based file system).file system).

Primary Indexing: CriterionPrimary Indexing: Criterion

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

2424

Primary Indexing Criteria: ExamplePrimary Indexing Criteria: Example

What should be the primary index of the call table for a What should be the primary index of the call table for a large telecom company?large telecom company?

call_id decimal (15,0) NOT NULL

caller_no decimal (10,0) NOT NULL

call_duration decimal (15,2) NOT NULL

call_dt date NOT NULL

called_no decimal (15,0) NOT NULL

Call TableCall Table

No simple answer!!No simple answer!!

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

2525

Almost all joins and retrievals will occur through Almost all joins and retrievals will occur through the caller _no foreign key.the caller _no foreign key. Use caller_no as a NUPI.Use caller_no as a NUPI.

In case of non uniform distribution on caller_no In case of non uniform distribution on caller_no or or

if phone number have very large number of if phone number have very large number of outgoing calls (e.g., an institutional number could outgoing calls (e.g., an institutional number could easily have several thousand calls).easily have several thousand calls). Use call_id as UPI for good data distribution.Use call_id as UPI for good data distribution.

Primary IndexingPrimary Indexing

DWH-Ahsan AbdullahDWH-Ahsan Abdullah

2626

For a hash-based file system, primary index is For a hash-based file system, primary index is free!free! No storage cost.No storage cost. No index build required.No index build required.

OLTP databases use a page-based file system OLTP databases use a page-based file system and therefore do not deliver this performance and therefore do not deliver this performance advantage.advantage.

Primary IndexingPrimary Indexing