Finishing Up from Tuesday
Harvard University, sites.fas.harvard.edu/~cs165/notes/bdbtcl-slides.pdf
2/4/2010

• Topics
  • Cursors
  • Hashing

Cursors

[Diagram: keys cat, dog, elephant, mouse in sorted order, with a cursor marking a position]

  c_get elephant
  delete current
  where is cursor?
  insert kangaroo
  insert eagle

• Cursors mark a position in the tree (used in iterating over a file).
• A cursor cannot lose its position in the face of a delete.
  • Subsequent inserts must happen in the right spot.
  • Requires retaining the key value.
  • With multiple deletes and multiple cursors, you have to maintain positioning between cursors.
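A minimal sketch of why retaining the key value helps, assuming a sorted Tcl list stands in for the tree. The procs deleteKey, insertKey, and cursorNext are hypothetical helpers written for this illustration and are not part of Berkeley DB.

```tcl
# A sorted list stands in for the leaf level of the tree (an assumption).
set keys {cat dog elephant mouse}

# The cursor remembers the key it sits on, not an index into the list.
set cursorKey elephant

proc deleteKey {listVar key} {
    upvar 1 $listVar keys
    set idx [lsearch -exact $keys $key]
    if {$idx >= 0} { set keys [lreplace $keys $idx $idx] }
}

proc insertKey {listVar key} {
    upvar 1 $listVar keys
    lappend keys $key
    set keys [lsort $keys]
}

# "next" is the smallest key strictly greater than the remembered key.
proc cursorNext {keys cursorKey} {
    foreach k $keys {
        if {[string compare $k $cursorKey] > 0} { return $k }
    }
    return ""
}

deleteKey keys $cursorKey          ;# delete current; cursor keeps "elephant"
insertKey keys kangaroo
insertKey keys eagle
puts $keys                         ;# cat dog eagle kangaroo mouse
puts [cursorNext $keys $cursorKey] ;# kangaroo
```

Because the cursor holds the deleted key rather than a slot number, the later inserts land on the correct side of it: eagle sorts before the cursor and kangaroo after it.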

2/2/10 2

Hashing
• Your index is a collection of buckets (bucket = page).
• Define a hash function, h, that maps a key to a bucket.
• Store the corresponding data in that bucket.
• Collisions
  • Multiple keys hash to the same bucket.
  • Store multiple keys in the same bucket.
• What do you do when buckets fill?
  • Chaining: link new pages (overflow pages) off the bucket.
  • Open-hashing (more commonly called open addressing): look in the next bucket.
• Chaining versus open-hashing
  • Open-hashing does not support deletion well.
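As a rough illustration of chaining, the sketch below keeps each bucket's chain as a Tcl list inside a dict. The hash values in H are made up for the example, and a real implementation would chain pages rather than individual keys.

```tcl
# Made-up hash values: cat and mouse collide in bucket 0.
array set H {cat 0 dog 1 mouse 0}

# buckets maps a bucket number to the list (chain) of keys stored there.
set buckets [dict create]
foreach key {cat dog mouse} {
    dict lappend buckets $H($key) $key
}
puts [dict get $buckets 0]   ;# cat mouse

# Deleting from a chain only touches that chain.
set chain [dict get $buckets 0]
set idx [lsearch -exact $chain cat]
dict set buckets 0 [lreplace $chain $idx $idx]
puts [dict get $buckets 0]   ;# mouse
```

Deleting cat leaves mouse reachable, because lookups never depend on the contents of neighboring buckets.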

Hash Example

• Assume:
  • H(cat) = 0
  • H(dog) = 1
  • H(mouse) = 0
• Operations
  1. Insert cat
  2. Insert dog
  3. Insert mouse
  4. Delete dog
  5. Lookup mouse

[Diagram: bucket array holding cat, dog, and mouse]
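The five operations above can be traced with a small sketch of the "look in the next bucket" scheme. It assumes one key per bucket and a four-bucket table; probeInsert, probeLookup, and naiveDelete are hypothetical names chosen for this example.

```tcl
# Hash values are the ones assumed on the slide.
array set H {cat 0 dog 1 mouse 0}
set nbuckets 4
# One key per bucket; the empty string marks an empty bucket.
set table [lrepeat $nbuckets ""]

proc probeInsert {key} {
    global H table nbuckets
    set b $H($key)
    # Step to the next bucket until an empty one is found.
    while {[lindex $table $b] ne ""} {
        set b [expr {($b + 1) % $nbuckets}]
    }
    lset table $b $key
}

proc probeLookup {key} {
    global H table nbuckets
    set b $H($key)
    # An empty bucket means the key cannot be any further along.
    while {[lindex $table $b] ne ""} {
        if {[lindex $table $b] eq $key} { return $b }
        set b [expr {($b + 1) % $nbuckets}]
    }
    return -1
}

# Naive delete: simply empty the bucket.
proc naiveDelete {key} {
    global table
    set b [probeLookup $key]
    if {$b >= 0} { lset table $b "" }
}

probeInsert cat; probeInsert dog; probeInsert mouse
puts [probeLookup mouse]   ;# 2
naiveDelete dog
puts [probeLookup mouse]   ;# -1
```

After dog is removed, the lookup for mouse stops at the now-empty bucket 1 and reports a miss even though mouse is still stored in bucket 2. One common remedy, not shown here, is to leave a "deleted" marker in the bucket instead of emptying it.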

Static vs Dynamic Hashing

• Static: number of buckets predefined; never changes.
  • Either overflow chains grow very long, OR
  • A lot of space is wasted in unused buckets.
• Dynamic: number of buckets changes over time.
  • Hash function must adapt.
  • Usually, start revealing more bits of the hash value as the table grows.
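One way to picture "revealing more bits" is to mask off the low-order bits of a fixed hash value. The proc lowBits and the hash value 0xAD are assumptions made for this sketch.

```tcl
# Use only the low-order `bits` bits of the hash value as the bucket number.
proc lowBits {hash bits} {
    return [expr {$hash & ((1 << $bits) - 1)}]
}

set h 0xAD   ;# 173 = binary 10101101, a made-up hash value
foreach bits {1 2 3 4} {
    puts "$bits bit(s): bucket [lowBits $h $bits]"
}
# 1 bit(s): bucket 1
# 2 bit(s): bucket 1
# 3 bit(s): bucket 5
# 4 bit(s): bucket 13
```

The hash value itself never changes; only the number of bits consulted grows with the table, so a key either stays where it is or moves to exactly one new bucket.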

Practical Hashing (1)

• Buckets map to pages.
  • Must be able to directly translate from a bucket number to a page number.
  • Where do you store overflow pages?
  • If number of buckets is fixed (static hashing), store overflow buckets after regular buckets.
  • Use free list to manage overflow buckets.
• Static hashing isn’t very practical for databases.
  • Databases change in size fairly substantially.
  • If you have to preallocate, often waste space.

Practical Hashing (2)

• Dynamic hash implementation.
  • Periodically double the size of the database.
  • Rehash every key into new table.
• Dynamic Linear Hashing (Litwin)
  • Grow table one bucket at a time.
  • Split buckets sequentially; rehash just the splitting bucket.
  • Maintain overflow buckets as necessary.
  • Keep track of max bucket to identify the correct number of bits to consider in the hash value.
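The last bullet can be sketched as an address calculation. This is an illustrative version under the assumption that buckets are numbered from 0 and split in order; linearBucket is a hypothetical name, not a Berkeley DB routine.

```tcl
# maxBucket is the highest bucket number that currently exists.
proc linearBucket {hash maxBucket} {
    # Count the bits needed to represent maxBucket.
    set bits 0
    while {(1 << $bits) <= $maxBucket} { incr bits }
    set b [expr {$hash & ((1 << $bits) - 1)}]
    # If that bucket has not been created yet, consider one bit fewer.
    if {$b > $maxBucket} {
        set b [expr {$hash & ((1 << ($bits - 1)) - 1)}]
    }
    return $b
}

# Five buckets (0..4): bucket 0 has split into 0 and 4, the rest have not.
puts [linearBucket 12 4]   ;# 4  (binary 1100, low 3 bits = 100)
puts [linearBucket 13 4]   ;# 1  (low 3 bits = 101 = 5, not yet created)
puts [linearBucket 6 4]    ;# 2  (low 3 bits = 110 = 6, not yet created)
```

Only keys in the bucket being split can change address when maxBucket grows by one, which is why just that bucket needs rehashing.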

Using BDB from Tcl

• Topics
  • An Introduction to Tcl
  • The Berkeley DB Tcl API
  • Tools for performance tuning and analysis
• Learning Objectives
  • Write simple programs in Tcl
  • Create environments and databases in Tcl
  • Perform get, put, cursor, del operations
  • Use timing and statistics to analyze the behavior of Berkeley DB databases.

What is Tcl?
• Tool Command Language -- a scripting language
• Designed to be embedded easily into other systems.
• Berkeley DB provides Tcl extensions (new commands) that let you access BDB functionality from a Tcl-based shell.
• Logistics:
  • I will use the Tcl installation on FAS/NICE.
  • You can do assignment 1 on nice or on your own machine.
  • NOTE: if you intend to use your own machine, install Tcl and BDB; do NOT wait until the night before assignment 1 is due. We will not answer build/install questions in the 24 hours before the assignment is due.

Getting Started (on FAS)

• You need to know where to find the appropriate executables and where to find the appropriate shared libraries.
• Edit your .cshrc file and add the following two lines:
    setenv PREPATH /nfs/home/c/s/cs165/bin
    setenv LD_LIBRARY_PATH /nfs/home/c/s/cs165/lib
• Log out
• Log back in

Getting Started (with Tcl)
• Start up Tcl interpreter:
    ice% tclsh
• Variables:
  • Untyped
  • Variables need not be declared; created as you need them
  • Variable names are alphanumeric strings that begin with a letter: foo, a, dog, b4
  • You assign values to variables using set, e.g.,
      % set foo 4
      % set bar "cat"
  • You access the value of a variable using the $ symbol:
      % puts $foo
      % puts $bar

Calculating

• Numerical evaluation is accomplished via the expr command:
    % expr 1 + 3
    % set foo 4
    % set bar 5
    % expr $foo + $bar
    % puts [expr 3 + 4]

Control Flow
• All you should need are if statements and for loops.
• You need two additional pieces of syntax:
  • Tcl uses {} for grouping.
  • Tcl uses [] for evaluation.
• By evaluation, we mean how you tell Tcl to evaluate an expression so that you can assign it to a variable.
• For example:
    set i [expr $foo + $bar]
  • Sets i to the result of evaluating [expr $foo + $bar]
• With all this in hand, if statements should look pretty natural:
    if { boolean expression } {
        do stuff here
    } else {
        do other stuff here
    }

Boolean Statements
    if { $foo == 4 } {
        # This is a comment character
        # You can now do stuff conditionally
    }

    if { $foo < 10 } {
        # Do something
    } else {
        # Do something else
    }
    # Tcl is whitespace sensitive, so positioning
    # your {} actually matters!
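To see why brace positioning matters, the sketch below runs the same if/else twice: once laid out as above, and once with else moved to its own line. The second version is wrapped in catch so the script keeps running; the variable names are made up for the example.

```tcl
set foo 4

# Works: "else" sits on the same line as the closing brace before it.
if { $foo < 10 } {
    set result small
} else {
    set result large
}
puts $result   ;# small

# Fails: the newline after the first closing brace ends the if command,
# so Tcl then tries to run a separate command named "else".
set script {
    if { $foo < 10 } {
        set result small
    }
    else {
        set result large
    }
}
set failed [catch $script msg]
if {$failed} {
    puts "error: $msg"
}
```

A newline ends a Tcl command unless it falls inside braces, brackets, or quotes, which is why the opening brace of each body has to share a line with the word that precedes it.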

FOR Loops
    for { init } { condition } { loop increment } {
        # do stuff
    }

• So, to loop from 0-9:
    for { set i 0 } { $i < 10 } { incr i } {
        # do stuff
    }

• Note: incr i is shorthand for incr i 1, which is shorthand for set i [expr $i + 1]
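A short runnable version of the loop, with a body that sums the loop variable, followed by a check that the three increment forms agree. The variable names are arbitrary, and bracing the expr argument is an optional habit rather than something the slides require.

```tcl
# Sum the integers 0 through 9.
set sum 0
for { set i 0 } { $i < 10 } { incr i } {
    set sum [expr {$sum + $i}]
}
puts $sum   ;# 45

# The three increment forms leave the variable in the same state.
set a 5; incr a
set b 5; incr b 1
set c 5; set c [expr {$c + 1}]
puts "$a $b $c"   ;# 6 6 6
```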

BDB + Tcl
• The Berkeley DB library and its interface to the Tcl language are dynamically loaded using a series of commands that can be found in ~cs165/tools/loadme.tcl.
• You type:
    ice% tclsh
    % source ~cs165/tools/loadme.tcl
• That file contains:
    lappend auto_path /nfs/home/c/s/cs165/lib
    pkg_mkIndex /nfs/home/c/s/cs165/lib libdb_tcl-4.8.so
    load /nfs/home/c/s/cs165/lib/libdb_tcl-4.8.so
• Now you can access Berkeley DB commands.
• In general, you’ll use the berkdb command to create handles and then you’ll use those handles to execute methods.

The berkdb Command

• Used to create/open environments and databases.
• Environments: make sure the directory exists.
• Let’s call our home directory work.
    % set e [berkdb env -create -home work]
• Now, examine e:
    % puts $e
• The variable e contains a new command that represents the environment.
• That command implements other commands that are the methods of the environment.

Creating/Opening Databases

• If you do not specify an environment, then you can open a database, but you have opened it OUTSIDE the environment.
  • Note: if you want to examine things like the memory pool statistics, you need an environment (more on this later).
• Compare the following two commands:
    % set dba [berkdb open -create -btree mybtree1.db]
    % set dbb [berkdb open -create -env $e -btree mybtree2.db]
• How are they different?
  • dba is NOT in an environment (db created in current directory).
  • dbb IS in an environment and will be created in work.
• Guess how to create a hash table?
    % set db [berkdb open -create -env $e -hash myhash.db]

Adding Data to a Database

• Both environment and database handles/commands take “methods” to perform operations.
• The put method adds data to a database.
    % $db put dog fido
    % $db put cat fluffy
• How do you suppose you get data out of the database?
    % $db get dog
    % $db get cat
    % $db get elephant
• Just like with other Tcl commands, you can assign the results of these calls:
    % set dogval [$db get dog]

Putting it all Together

• Let’s put this all together and write a loop that adds 10 items to a database:
    % for {set i 0} {$i < 10} {incr i} {
        $db put key$i data$i
    }
• This inserts 10 key/data pairs that look like:
    {key0 data0} {key1 data1} ... {key9 data9}
• We can retrieve those values:
    % $db get key3
    % $db get key8

Cursors
• The last handle/command you’ll need is a cursor, which is used to iterate over a collection of data.
• Cursors are associated with databases, so we create a cursor using a database method:
    % set c [$db cursor]
• You can perform the same operations with cursors that you do with databases, plus you can use a cursor for iteration.
    % $db put dog fido
• is the same as:
    % $c put -keyfirst dog fido
• and
    % $db get dog
• is the same as
    % $c get -set dog
• except that the cursor version leaves the cursor referencing the item, so you can also issue get methods relative to that position (e.g., current, next, prev)

More Cursors
• Cursor get takes an option, -set, before specifying the key, because the cursor get method supports more operations than the database get method. In particular, it supports options like:
  • -first
  • -next
  • -last
  • -prev
• What do you suppose the following does?
    % for { set pair [$c get -first] } \
          { $pair != "" } \
          { set pair [$c get -next] } {
        puts $pair
    }

Performance Analysis
• Why?
  • Nearly every hard database problem boils down to performance.
  • It is useful to learn what performance analysis tools are available for any given data management technology.
• Our toolkit:
  • In Tcl:
    • time command: measures time to execute a Tcl command (in microseconds).
        % time { command_goes_here }
    • Caveats: includes Tcl parsing time and such (sometimes significant)
  • In Berkeley DB:
    • db_stat: produces statistics about individual databases as well as about Berkeley DB’s own subsystems.
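A small sketch of the time command on a pure-Tcl loop, which stands in here for a batch of database calls (an assumption, so the example runs without Berkeley DB loaded). The optional count argument repeats the script and reports an average.

```tcl
# Run the script 100 times and report the average time per run.
set result [time {
    set sum 0
    for {set i 0} {$i < 1000} {incr i} { incr sum $i }
} 100]

# The result is a string of the form "N microseconds per iteration".
puts $result

# The first word is the number of microseconds.
set usec [lindex $result 0]
puts "average: $usec us"
```

Repeating the script smooths out one-off costs, which partly addresses the parsing caveat above.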

db_stat

• For now, we’ll focus on only two uses of db_stat.
  • Individual database statistics
  • Memory Pool Statistics
• db_stat for databases
  • Usage:
      ice% db_stat -d database
  • OR
      ice% db_stat -h HOME -d database

Memory Pools

• What is a memory pool?
  • Recall that memory is fast and disk is slow.
  • Goal: grab data from memory whenever possible.
• How?
  • Keep recently used data in memory in the hope that you’ll use it again real soon.
• Berkeley DB is not the only one who maintains a memory pool; the operating system does as well (frequently called the buffer cache).

[Diagram, top to bottom: Tcl w/Berkeley DB; Berkeley DB memory pool (mpool); Operating System File System Buffer Cache; to disk]

Data Movement

• When you try to read a key, what really happens is:
  • Berkeley DB figures out on what page that key lives.
  • Berkeley DB looks in its mpool.
  • If the page is there, you get your key (quickly).
  • If the page isn’t there, Berkeley DB makes space in its mpool and then requests the page from the file system.
  • If the page is in the file system buffer cache, it is given to Berkeley DB (relatively quickly).
  • If the page is not in the buffer cache, then it is requested from disk.
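The lookup order above can be modeled with two Tcl dicts, one per cache level. This is a toy model only: readPage, the page contents, and the page numbers are invented for the illustration, and eviction ("makes space in its mpool") is left out.

```tcl
# Each cache maps a page number to page contents.
set mpool   [dict create]
set oscache [dict create 7 "page-7-bytes"]   ;# page 7 starts in the OS cache

# Return a two-element list: where the page was found, and its contents.
proc readPage {pageno} {
    global mpool oscache
    if {[dict exists $mpool $pageno]} {
        return [list mpool [dict get $mpool $pageno]]
    }
    if {[dict exists $oscache $pageno]} {
        set source oscache
        set data [dict get $oscache $pageno]
    } else {
        set source disk
        set data "page-$pageno-bytes"   ;# pretend this came from disk
        dict set oscache $pageno $data
    }
    # Either way, the page ends up in the mpool for next time.
    dict set mpool $pageno $data
    return [list $source $data]
}

set trace [list [lindex [readPage 7] 0] \
                [lindex [readPage 7] 0] \
                [lindex [readPage 9] 0]]
puts $trace   ;# oscache mpool disk
```

The second read of page 7 is served from the mpool, which is the effect the mpool statistics are meant to expose.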

db_stat for mpool

• You must be using an environment in order to examine the memory pool statistics.
• Summary statistics
    ice% db_stat -h HOME -m
• Detailed statistics
    ice% db_stat -h HOME -M A
• Resetting statistics
    ice% db_stat -Z -M A -h HOME