Kruse/Ryba ch09

Object Oriented Data Structures

Tables and Information Retrieval

Rectangular Tables

Tables of Various Shapes

Radix Sort

Hashing


What is an INDEX?

An index lets you impose order on a file without actually rearranging the file.

An index gives keyed access to fixed- or variable-length record files.


Simple Index

A simple index uses a simple array to implement the index.

IBM calls this ISAM (Indexed Sequential Access Method).


Index file

Key        Reference
ANG3795    167
COL31809   353
COL38358   211
DG139201   396
DG18807    256
FF245      442
LON2312    32
MER75016   300
RCA2626    77
WAR23699   132

Data file

Address    Actual data record
32         LON|2312|Romeo and Juliet|...
77         RCA|2626|Quartet in C Sharp...
132        WAR|23699|Touchstone|...
167        ANG|3795|Symphony No. 9|...
211        COL|38358|Nebraska|...
256        DG|18807|Symphony No. 9|...
300        MER|75016|Coq d'or Suite|...
353        COL|31809|Symphony No. 9|...
396        DG|139201|Violin Concerto|...
442        FF|245|Good News|...
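A minimal sketch of how the simple index above could be held in an array and searched. The names IndexEntry, example_index, and find_reference are hypothetical, and the use of binary search on the sorted keys is an assumption, not something the slides prescribe.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// One index record: a key and the data-file address it refers to.
struct IndexEntry {
    std::string key;   // label + ID, for example "LON2312"
    long reference;    // address of the record in the data file
};

// The index entries from the example, kept sorted by key.
std::vector<IndexEntry> example_index() {
    return {
        {"ANG3795", 167}, {"COL31809", 353}, {"COL38358", 211},
        {"DG139201", 396}, {"DG18807", 256}, {"FF245", 442},
        {"LON2312", 32}, {"MER75016", 300}, {"RCA2626", 77},
        {"WAR23699", 132}};
}

// Binary search on the sorted array.
// Returns the data-file reference, or -1 if the key is not in the index.
long find_reference(const std::vector<IndexEntry> &index,
                    const std::string &key) {
    auto it = std::lower_bound(
        index.begin(), index.end(), key,
        [](const IndexEntry &e, const std::string &k) { return e.key < k; });
    if (it != index.end() && it->key == key) return it->reference;
    return -1;
}
```

Calling find_reference(example_index(), "LON2312") yields 32, the address of the Romeo and Juliet record; the data file itself is never reordered.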


Concerns

Two files to deal with

Index file easier to deal with than data file because it has fixed-length records

Fixed-length fields impose limits on size of keys

In the example, the index carries no information other than the keys and the reference fields. Other data could be included, such as the record length.


Basic Operations

Create the original empty index and data files.

Load the index file into memory before using it.

Rewrite the index file from memory after using it.

Add records to the data file and index.

Delete records from the data file.

Update records in the data file.


Creating the Files

Create both the index and data files as empty files. Write headers to both files.


Loading the Index into Memory

Assume that the index file is small enough to fit into RAM.

Each array element is an index record.
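A rough sketch of loading the index into an array and writing it back. The on-disk layout assumed here (one "key reference" pair per line, no header) and the names IndexRecord, load_index, and write_index are illustrative assumptions; the slides do not specify the file format.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// One array element is one index record.
struct IndexRecord {
    std::string key;
    long reference;
};

// Read every "key reference" pair from the stream into memory.
std::vector<IndexRecord> load_index(std::istream &in) {
    std::vector<IndexRecord> index;
    IndexRecord rec;
    while (in >> rec.key >> rec.reference) index.push_back(rec);
    return index;
}

// Rewrite the index from memory, one record per line.
void write_index(std::ostream &out, const std::vector<IndexRecord> &index) {
    for (const IndexRecord &rec : index)
        out << rec.key << ' ' << rec.reference << '\n';
}
```

Both functions take streams, so they can be tried with std::istringstream and std::ostringstream before being pointed at real files.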


Safety Mechanisms

Know when the index is out of date.

Be able to reconstruct the index from the data file.
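Reconstructing the index from the data file can be sketched as a single scan. This version assumes, purely for illustration, that each data record sits on its own line in the form LABEL|ID|... and that the key is the label followed by the ID; rebuild_index and IndexRecord are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct IndexRecord {
    std::string key;
    long reference;
};

// Scan the data stream, note where each record starts, and build a
// fresh index sorted by key.
std::vector<IndexRecord> rebuild_index(std::istream &data) {
    std::vector<IndexRecord> index;
    std::string line;
    while (true) {
        long offset = static_cast<long>(data.tellg());  // start of record
        if (!std::getline(data, line)) break;
        std::size_t first = line.find('|');
        if (first == std::string::npos) continue;       // malformed line
        std::size_t second = line.find('|', first + 1);
        if (second == std::string::npos) continue;      // malformed line
        std::string key = line.substr(0, first) +
                          line.substr(first + 1, second - first - 1);
        index.push_back({key, offset});
    }
    std::sort(index.begin(), index.end(),
              [](const IndexRecord &a, const IndexRecord &b) {
                  return a.key < b.key;
              });
    return index;
}
```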


Record Addition

Adding a new record to the data file requires that we also add a record to the index file.


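Index file after adding the record LON|783|Sweet Somethings|... to the data file at address 486

Key        Reference
ANG3795    167
COL31809   353
COL38358   211
DG139201   396
DG18807    256
FF245      442
LON2312    32
LON783     486
MER75016   300
RCA2626    77
WAR23699   132

The existing data records stay at addresses 32 through 442. The new record is appended at address 486, and the index entries MER75016, RCA2626, and WAR23699 shift down to make room for LON783.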
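The addition of LON783 can be sketched as an insertion into a sorted array. The helper name add_to_index and the IndexEntry type are assumptions made for this example.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct IndexEntry {
    std::string key;
    long reference;
};

// Insert a new entry at the position that keeps the index sorted by key.
// Entries after that position shift down by one, as in the example.
void add_to_index(std::vector<IndexEntry> &index, const std::string &key,
                  long reference) {
    auto pos = std::lower_bound(
        index.begin(), index.end(), key,
        [](const IndexEntry &e, const std::string &k) { return e.key < k; });
    index.insert(pos, IndexEntry{key, reference});
}
```

The data record itself is simply appended to the data file; only the in-memory index needs to be rearranged.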


Record Deletion

Any of the methods discussed in chapter 5 could be used. However, the index file must now be considered.

The index entry could be removed and the array adjusted or the index entry could just be marked as deleted.
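Both options for the index entry can be sketched side by side. The deleted flag and the function names are hypothetical; the slides only name the two strategies.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct IndexEntry {
    std::string key;
    long reference;
    bool deleted;   // assumed flag used by the "mark as deleted" option
};

// Binary search in the sorted index; returns end() if the key is absent.
std::vector<IndexEntry>::iterator find_entry(std::vector<IndexEntry> &index,
                                             const std::string &key) {
    auto it = std::lower_bound(
        index.begin(), index.end(), key,
        [](const IndexEntry &e, const std::string &k) { return e.key < k; });
    return (it != index.end() && it->key == key) ? it : index.end();
}

// Option 1: remove the entry and adjust the array.
bool remove_entry(std::vector<IndexEntry> &index, const std::string &key) {
    auto it = find_entry(index, key);
    if (it == index.end()) return false;
    index.erase(it);            // later entries shift up by one
    return true;
}

// Option 2: leave the entry in place and only mark it as deleted.
bool mark_deleted(std::vector<IndexEntry> &index, const std::string &key) {
    auto it = find_entry(index, key);
    if (it == index.end() || it->deleted) return false;
    it->deleted = true;
    return true;
}
```

Marking avoids moving array elements but leaves dead entries that lookups must skip; removing keeps the array compact at the cost of shifting.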


Record Updating

Updating changes the key field
– conceptually, this is best thought of as a deletion followed by an addition

Updating does not change a key field
– this will not cause any changes in the index file but could well cause changes in the data file if the size of the record changes.


Indexes too large to fit in RAM

Essentially, the later text material deals with this problem.

Hashed Organization

Tree-structures


Access by Multiple Keys

Secondary key index organized by composer

Secondary key        Primary key
BEETHOVEN            ANG3795
BEETHOVEN            DG139201
BEETHOVEN            DG18807
BEETHOVEN            RCA2626
COREA                WAR23699
DVORAK               COL31809
PROKOFIEV            LON2312
RIMSKY-KORSAKOV      MER75016
SPRINGSTEEN          COL38358
SWEET HONEY IN THE   FF245


Record Addition

Additional indices imply additional overhead when new records are added.


Record Deletion

This usually implies removing all references to that record in the file system.

Alternatively, delete the entry from the primary index only and leave the secondary indexes alone. Since the primary index does reflect the deletion, a request that arrives through a secondary index will fail at the primary index lookup, implying the record has been deleted.

Such a method would result in wasted space in the secondary index.


Record Updating

If the update changes the secondary key
– it may be necessary to rearrange the secondary key index so it stays in sorted order

If the update changes the primary key
– this creates a major impact on secondary indices

If the update is confined to other fields
– updates that do not affect either the primary or secondary key fields do not affect the secondary key index.


Access by Multiple Keys

Secondary key index organized by recording title

Secondary key       Primary key
COQ D'OR SUITE      MER75016
GOOD NEWS           FF245
NEBRASKA            COL38358
QUARTET IN C SHAR   RCA2626
ROMEO AND JULIET    LON2312
SYMPHONY NO. 9      ANG3795
SYMPHONY NO. 9      COL31809
SYMPHONY NO. 9      DG18807
TOUCHSTONE          WAR23699
VIOLIN CONCERTO     DG139201

Find all data records with composer = BEETHOVEN and title = SYMPHONY NO. 9

[Figure: the title index and the composer index are each searched for matching entries, the two lists of primary keys are combined with a LOGICAL AND, and the surviving keys are looked up in the index file to reach the data records.]
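A sketch of the combined query, assuming each secondary index is held in memory as a list of (secondary key, primary key) pairs. The type alias SecondaryIndex and the functions primary_keys_for and match_both are hypothetical names.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Each entry pairs a secondary key with the primary key of one record.
using SecondaryIndex = std::vector<std::pair<std::string, std::string>>;

// Collect the primary keys listed under one secondary key, in sorted order.
std::vector<std::string> primary_keys_for(const SecondaryIndex &index,
                                          const std::string &secondary_key) {
    std::vector<std::string> keys;
    for (const auto &entry : index)
        if (entry.first == secondary_key) keys.push_back(entry.second);
    std::sort(keys.begin(), keys.end());
    return keys;
}

// LOGICAL AND: keep only the primary keys that appear in both lists.
std::vector<std::string> match_both(const SecondaryIndex &first_index,
                                    const std::string &first_key,
                                    const SecondaryIndex &second_index,
                                    const std::string &second_key) {
    std::vector<std::string> a = primary_keys_for(first_index, first_key);
    std::vector<std::string> b = primary_keys_for(second_index, second_key);
    std::vector<std::string> both;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(both));
    return both;
}
```

With the composer and title indexes shown earlier, the BEETHOVEN list and the SYMPHONY NO. 9 list share ANG3795 and DG18807, which the primary index maps to addresses 167 and 256.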


Problems

We have to rearrange the index file every time a new record is added to the file, even if the new record is for an existing secondary key.


A Better Solution: Linking the List of References

Inverted lists work their way backward from a secondary key to the primary key to the record itself.


Secondary key   Primary key references
BEETHOVEN       ANG3795  DG139201  DG18807  RCA2626
COREA           WAR23699
DVORAK          COL31809
PROKOFIEV       LON2312

This might create a large number of small files, one for each composer.


Improved Version

Redefine the secondary key index so it consists of records with two fields - a secondary key field, and a field containing the relative record number of the first corresponding primary key reference in the inverted list.

The actual primary key references associated with each secondary key would be stored in a separate entry-sequenced file.


Secondary Index File

Record   Secondary key     First reference
0        BEETHOVEN         3
1        COREA             2
2        DVORAK            7
3        PROKOFIEV         10
4        RIMSKY-KORSAKOV   6
5        SPRINGSTEEN       4
6        SWEET HONEY IN    9

Label ID List File

Record   Primary key   Next reference
0        LON2312       -1
1        RCA2626       -1
2        WAR23699      -1
3        ANG3795       8
4        COL38358      -1
5        DG18807       1
6        MER75016      -1
7        COL31809      -1
8        DG139201      5
9        FF245         -1
10       ANG3193       0

A next reference of -1 marks the end of a list.
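Following a linked list of references can be sketched as below. The record layouts SecondaryEntry and ListRecord and the function references_for are assumptions for illustration; -1 is taken to mean "end of list", as in the figure.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Secondary index record: a key plus the relative record number (RRN)
// of its first primary key reference in the list file.
struct SecondaryEntry {
    std::string key;
    int first;
};

// List file record: one primary key plus the RRN of the next reference
// for the same secondary key, or -1 at the end of the list.
struct ListRecord {
    std::string primary_key;
    int next;
};

// Walk the chain for one secondary key and collect its primary keys.
std::vector<std::string> references_for(
    const std::vector<SecondaryEntry> &secondary,
    const std::vector<ListRecord> &list, const std::string &key) {
    std::vector<std::string> result;
    for (const SecondaryEntry &entry : secondary) {
        if (entry.key != key) continue;
        std::size_t steps = 0;  // guard against a corrupted, cyclic chain
        for (int rrn = entry.first; rrn != -1 && steps < list.size();
             rrn = list[rrn].next, ++steps)
            result.push_back(list[rrn].primary_key);
        break;
    }
    return result;
}
```

Adding a reference for an existing secondary key then only appends one record to the list file and updates one link; the secondary index itself is not rearranged.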


Hash Functions

Truncation
– Ignore part, use the rest for key

Folding
– Partition and combine

Modular Arithmetic

Perfect Hash Function
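The first three techniques can be sketched for non-negative integer keys. The choice of three-digit groups and the function names are assumptions made only for this example.

```cpp
// Truncation: ignore the high-order digits and keep the last three.
int truncate_hash(long key) {
    return static_cast<int>(key % 1000);
}

// Folding: partition the key into three-digit groups, add the groups,
// then reduce the sum to the table range.
int fold_hash(long key, int hash_size) {
    long sum = 0;
    while (key > 0) {
        sum += key % 1000;   // take the lowest three digits
        key /= 1000;         // move on to the next group
    }
    return static_cast<int>(sum % hash_size);
}

// Modular arithmetic: the remainder after dividing by the table size.
int modular_hash(long key, int hash_size) {
    return static_cast<int>(key % hash_size);
}
```

For the key 21296876, truncation gives 876, and folding adds 876 + 296 + 21 = 1193, which a table of size 1000 reduces to 193.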


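C++ Example

int hash(const Key &target)
{
   int value = 0;
   // Combine the first 8 key letters; earlier letters get larger weights.
   for (int position = 0; position < 8; position++)
      value = 4 * value + target.key_letter(position);
   // Modular arithmetic maps the value into the range of table positions.
   return value % hash_size;
}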


Collision Resolution

Linear Probing
– Clustering

Rehashing

Increment Functions

Quadratic Probing
– h + i^2

Key-Dependent Increments
– increment = (int) the_data.key_letter(0);

Random Probing
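A small sketch of the probe sequences for linear and quadratic probing. The function names are hypothetical. The quadratic version adds the odd increments 1, 3, 5, ... in turn, which lands on h + 1, h + 4, h + 9, that is h + i^2, without any multiplication.

```cpp
#include <vector>

// Linear probing: h, h + 1, h + 2, ... (mod hash_size).
std::vector<int> linear_probes(int h, int hash_size, int count) {
    std::vector<int> probes;
    int probe = h % hash_size;
    for (int i = 0; i < count; i++) {
        probes.push_back(probe);
        probe = (probe + 1) % hash_size;
    }
    return probes;
}

// Quadratic probing: h, h + 1, h + 4, h + 9, ... (mod hash_size).
std::vector<int> quadratic_probes(int h, int hash_size, int count) {
    std::vector<int> probes;
    int probe = h % hash_size;
    int increment = 1;
    for (int i = 0; i < count; i++) {
        probes.push_back(probe);
        probe = (probe + increment) % hash_size;
        increment += 2;   // 1, 3, 5, ... sums to successive squares
    }
    return probes;
}
```

Starting from h = 3 in a table of size 11, linear probing visits 3, 4, 5, 6 while quadratic probing visits 3, 4, 7, 1, so keys that collide at 3 spread out instead of forming one cluster.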


Error_code Hash_table::insert(const Record &new_entry)
{
   Error_code result = success;
   int probe_count,   // Counter to be sure that table is not full.
       increment,     // Increment used for quadratic probing.
       probe;         // Position currently probed.
   Key null;          // Null key for comparison purposes.
   null.make_blank();
   probe = hash(new_entry);
   probe_count = 0;
   increment = 1;
   while (table[probe] != null                     // Is the location empty?
          && table[probe] != new_entry             // Duplicate key?
          && probe_count < (hash_size + 1) / 2) {  // Has overflow occurred?
      probe_count++;
      probe = (probe + increment) % hash_size;
      increment += 2;   // Prepare increment for next iteration.
   }
   if (table[probe] == null) table[probe] = new_entry;
   else if (table[probe] == new_entry) result = duplicate_error;
   else result = overflow;   // The table is full.
   return result;
}
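The method above depends on the Key and Record classes, which are not shown on the slides. As a self-contained sketch of the same quadratic-probing logic, the version below stores std::string keys and treats the empty string as the blank entry; both choices, the contains method, and the find_slot helper are assumptions for illustration.

```cpp
#include <string>
#include <vector>

enum Error_code { success, duplicate_error, overflow };

class Hash_table {
public:
    explicit Hash_table(int size) : hash_size(size), table(size) {}
    Error_code insert(const std::string &new_entry);
    bool contains(const std::string &target) const;
private:
    int hash(const std::string &key) const;
    int find_slot(const std::string &key) const;
    int hash_size;
    std::vector<std::string> table;   // "" marks an empty location
};

int Hash_table::hash(const std::string &key) const {
    unsigned value = 0;
    for (unsigned char letter : key) value = 4 * value + letter;
    return static_cast<int>(value % hash_size);
}

// Probe quadratically until an empty slot, the key itself, or the
// probe limit of (hash_size + 1) / 2 is reached.
int Hash_table::find_slot(const std::string &key) const {
    int probe = hash(key), probe_count = 0, increment = 1;
    while (!table[probe].empty() && table[probe] != key
           && probe_count < (hash_size + 1) / 2) {
        probe_count++;
        probe = (probe + increment) % hash_size;
        increment += 2;
    }
    return probe;
}

Error_code Hash_table::insert(const std::string &new_entry) {
    int probe = find_slot(new_entry);
    if (table[probe].empty()) { table[probe] = new_entry; return success; }
    if (table[probe] == new_entry) return duplicate_error;
    return overflow;
}

// Assumes targets are non-empty, since "" stands for an empty slot.
bool Hash_table::contains(const std::string &target) const {
    return table[find_slot(target)] == target;
}
```

Inserting the same key twice returns success and then duplicate_error; in a table of size 1, a second distinct key returns overflow because the probe limit is reached.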


Collision Resolution with Buckets


Collision Resolution by Chaining

Advantages
– Saving of space
– Simple, efficient collision handling
– Size of hash table does not need to exceed the number of records
– Deletion becomes quick and easy

Disadvantage
– Links require space
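A minimal sketch of a chained hash table, assuming std::string keys and one std::list per table position. The class name Chained_table and its methods are hypothetical, and the table size is assumed to be at least 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

class Chained_table {
public:
    explicit Chained_table(std::size_t size) : chains(size) {}

    // Returns false if the key is already present.
    bool insert(const std::string &key) {
        std::list<std::string> &chain = chains[hash(key)];
        if (std::find(chain.begin(), chain.end(), key) != chain.end())
            return false;
        chain.push_front(key);   // a collision just lengthens the chain
        return true;
    }

    bool contains(const std::string &key) const {
        const std::list<std::string> &chain = chains[hash(key)];
        return std::find(chain.begin(), chain.end(), key) != chain.end();
    }

    // Deletion only unlinks one node from one chain.
    bool remove(const std::string &key) {
        std::list<std::string> &chain = chains[hash(key)];
        auto it = std::find(chain.begin(), chain.end(), key);
        if (it == chain.end()) return false;
        chain.erase(it);
        return true;
    }

private:
    std::size_t hash(const std::string &key) const {
        std::size_t value = 0;
        for (unsigned char letter : key) value = 4 * value + letter;
        return value % chains.size();
    }
    std::vector<std::list<std::string>> chains;
};
```

A table with only 2 positions can still hold any number of keys, which illustrates why the table size does not need to exceed the number of records.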


Theoretical Comparison

Successful search, expected number of probes:

Load factor           0.10   0.50   0.80   0.90   0.99   2.00
Chaining              1.05   1.25   1.40   1.45   1.50   2.00
Open, random probes   1.05   1.40   2.0    2.6    4.6    -----
Open, linear probes   1.06   1.50   3.0    5.5    50.5   -----


Theoretical Comparison

Unsuccessful search, expected number of probes:

Load factor           0.10   0.50   0.80   0.90   0.99   2.00
Chaining              0.10   0.50   0.80   0.90   0.99   2.00
Open, random probes   1.1    2.00   5.0    10.0   100    -----
Open, linear probes   1.12   2.50   13.    50.    5000   -----
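The theoretical figures are consistent with the standard approximations for a load factor lambda. The slides do not state these formulas, so the functions below are a sketch based on the usual textbook results, with hypothetical names.

```cpp
#include <cmath>

// Chaining: a successful search inspects about half a chain beyond the
// first node; an unsuccessful search inspects a whole chain.
double chaining_successful(double lambda)   { return 1.0 + lambda / 2.0; }
double chaining_unsuccessful(double lambda) { return lambda; }

// Open addressing with random probes (requires lambda < 1).
double random_successful(double lambda) {
    return std::log(1.0 / (1.0 - lambda)) / lambda;
}
double random_unsuccessful(double lambda) { return 1.0 / (1.0 - lambda); }

// Open addressing with linear probes (requires lambda < 1).
double linear_successful(double lambda) {
    return 0.5 * (1.0 + 1.0 / (1.0 - lambda));
}
double linear_unsuccessful(double lambda) {
    double d = 1.0 - lambda;
    return 0.5 * (1.0 + 1.0 / (d * d));
}
```

For example, linear_unsuccessful(0.5) evaluates to 2.5 and random_unsuccessful(0.9) to about 10, matching the table entries. Open addressing has no entry at load factor 2.00 because the table cannot hold more records than it has positions.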


Empirical Comparison

Successful search, expected number of probes:

Load factor              0.10   0.50   0.80   0.90   0.99   2.00
Chaining                 1.04   1.2    1.4    1.4    1.59   2.00
Open, quadratic probes   1.04   1.50   2.1    2.7    5.2    -----
Open, linear probes      1.05   1.60   3.4    6.2    21.3   -----


Empirical Comparison

Unsuccessful search, expected number of probes:

Load factor              0.10   0.50   0.80   0.90   0.99   2.00
Chaining                 0.10   0.50   0.80   0.90   0.99   2.00
Open, quadratic probes   1.13   2.20   5.2    11.9   126.   -----
Open, linear probes      1.13   2.70   15.4   59.8   430.   -----


Highlights

Sequential search is Θ(n).

Binary search is Θ(log n).

Hash-table retrieval is Θ(1).


Chapter 9 - The End
