
Computer Science 112

Fundamentals of Programming II

Implementation Strategies for Unordered Collections

What They Are

• Bag - a collection of items in no particular order

• Set - a collection of unique items in no particular order

• Dictionary - a collection of values associated with unique keys

Variations

• SortedBag - a bag that allows clients to access items in sorted order

• SortedSet - a set that allows clients to access items in sorted order

• SortedDictionary - a dictionary that allows clients to access keys in sorted order

Sorted Set and Dictionary Implementations

• Array-based, using a sorted list

• Linked, using a linked binary search tree

• Must keep the tree balanced; insertions and removals will then be logarithmic as well

Dictionary Interface

d.isEmpty()
len(d)
iter(d)                           # Iterate through the keys
str(d)
key in d
d.get(key, defaultValue = None)
item = d[key]
d[key] = item                     # Add or replace
d.pop(key, defaultValue = None)
d.entries()                       # A set of entries
d.keys()                          # An iterator on the keys
d.values()                        # An iterator on the values
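A brief usage sketch of this interface, assuming one of the concrete implementations discussed below (ArrayDict and its module name are used here for illustration; any implementation behaves the same):

from arraydict import ArrayDict   # assumed course module

d = ArrayDict()
d["apple"] = 3                    # add or replace
d["pear"] = 5
print(len(d))                     # 2
print("apple" in d)               # True
print(d.get("plum", 0))           # 0, the default, since "plum" is absent
for key in d:                     # iter(d) runs through the keys
    print(key, d[key])
print(d.pop("pear"))              # 5; the entry is removed
print(d)                          # e.g. {apple:3}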

Dictionary Implementations

• Array-based (like ArraySet and ArraySortedSet)

• Linked structure (like LinkedSet and TreeSortedSet)

• All use an Entry class to contain the key/value pair

Possible Organization I

(Class diagram: AbstractCollection at the top, AbstractBag beneath it, with the concrete classes ArrayBag, LinkedBag, ArraySet, LinkedSet, ArrayDict, and LinkedDict below.)

Is a dictionary just a type of set with some additional methods?

Possible Organization II

(Class diagram: AbstractCollection at the top; AbstractBag with ArrayBag, LinkedBag, ArraySet, and LinkedSet below it; AbstractDict with ArrayDict and LinkedDict below it.)

Which methods are implemented in AbstractDict?

The Entry Class

class Entry(object):

    def __init__(self, key, value):
        self.key = key
        self.value = value

    def __eq__(self, other):
        if type(self) != type(other):
            return False
        return self.key == other.key

    def __lt__(self, other):
        if type(self) != type(other):
            return False
        return self.key < other.key

    def __le__(self, other):
        if type(self) != type(other):
            return False
        return self.key <= other.key

Goes in abstractdict.py, where all dictionaries can see it

The AbstractDict Class

from abstractcollection import AbstractCollection

class AbstractDict(AbstractCollection):

    def __init__(self):
        AbstractCollection.__init__(self, None)

    def __str__(self):
        return "{" + ", ".join(map(lambda entry: str(entry.key) + ":" + str(entry.value),
                                   self.entries())) + "}"

Example output: {2:3, 6:7}

Can We Do Better?

• If we could associate each unordered set element or each unordered dictionary key with a unique index position in an array, we could have

– Constant-time search
– Constant-time insertion
– Constant-time removal

Hashing

• Each data element has a hash value, which is an integer computed from the element

• This value can be computed in constant time by a hash function

• This computation can be performed on each insertion, access, and removal

How Are the Elements Stored?

• The hash value is used to locate the element’s index in an array, thus preserving constant-time access

• How to compute this:

hashValue % capacity of array

Position will be >= 0 and < capacity
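For example, with an array of capacity 16 and an item whose hash value is 403, the item is stored at index 403 % 16 = 3 (abs() is applied first because Python hash values can be negative).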

A Sample Access Method (Set)

def __contains__(self, item):
    index = abs(hash(item)) % len(self._array)
    return self._array[index] != None

• self._array is an array of items

• len(self._array) is the array’s current physical size

• hash(item) is a function that returns an item’s hash value

• Other access methods have a similar structure

A Sample Mutator Method (Set)

def add(self, item):
    if not item in self:
        index = abs(hash(item)) % len(self._array)
        self._array[index] = item
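Combining the two fragments above, here is a minimal runnable sketch; the class name SimpleHashSet and the use of a plain Python list in place of the course's Array class are assumptions for illustration, and resizing and collisions are ignored:

class SimpleHashSet:
    """Simplified hash-based set: one item per slot, collisions ignored."""

    def __init__(self, capacity = 16):
        self._array = [None] * capacity   # stand-in for Array(capacity)
        self._size = 0

    def __contains__(self, item):
        index = abs(hash(item)) % len(self._array)
        return self._array[index] != None

    def add(self, item):
        if not item in self:
            index = abs(hash(item)) % len(self._array)
            self._array[index] = item
            self._size += 1

mySet = SimpleHashSet()
for item in ("A", "B", "C", "D"):
    mySet.add(item)
print("A" in mySet)               # True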

Adding Items

mySet.add("A")    # stored at index 10
mySet.add("B")    # stored at index 5
mySet.add("C")    # stored at index 0
mySet.add("D")    # stored at index 14

After 12 more items are added, every position holds an item (the array now contains C E M Q B N F K T W A G L Y I D).

The array is full: resize the array and rehash all of the elements.
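A hedged sketch of resizing and rehashing, written as an extra method for the SimpleHashSet sketch shown earlier (the method name _resize is illustrative):

    def _resize(self):
        """Double the array's capacity and rehash every stored item."""
        oldArray = self._array
        self._array = [None] * (2 * len(oldArray))
        self._size = 0
        for item in oldArray:
            if item is not None:
                self.add(item)    # re-inserted at its new index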

Performance

• O(1) lookups, insertions, removals - wow!

• Cost of resizing the array is amortized over many insertions and removals

• Works as long as hashValue % capacity is not the same for two items

Problem: Collisions

• As more elements fill the array, the likelihood that their hash values map to the same array position increases

• A collision then occurs: that is, items compete for the same position in the array
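A quick way to see a collision, using the fact that CPython's hash of a small integer is the integer itself:

capacity = 10
for item in (5, 15):
    print(item, abs(hash(item)) % capacity)   # both map to index 5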

A Tester Program

def testHash(arrayLength = 10, numberOfItems = 5):
    print(" Item hash code array index")
    for i in range(1, numberOfItems + 1):
        item = "Item" + str(i)
        code = hash(item)
        index = abs(code) % arrayLength
        print("%7s%12d%8d" % (item, code, index))

Load Factor

• An array’s load factor expresses the ratio of the number of elements to its capacity

• Example: 10 elements / capacity of 30 = 0.33

• Try to keep load factor low to minimize collisions

• Does waste some memory, though
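A sketch of how the load factor might be computed and used, assuming the _size and _array fields of the SimpleHashSet sketch and an illustrative threshold of 0.5:

    MAX_LOAD_FACTOR = 0.5             # illustrative threshold

    def loadFactor(self):
        """Ratio of items stored to the array's capacity, e.g. 10 / 30 = 0.33."""
        return self._size / len(self._array)

    # In add(), after inserting a new item:
    #     if self.loadFactor() > SimpleHashSet.MAX_LOAD_FACTOR:
    #         self._resize()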

Collision Processing Strategies

• Linear collision processing - search for the next available empty slot in the array, wrapping around if the end is reached

• Can lead to clustering, where several elements that have collided now occupy consecutive positions

• Several small clusters may coalesce into a large cluster and thus degrade performance
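A minimal sketch of linear probing over a plain Python list used as the table (the function name and setup are illustrative, and the table is assumed not to be full):

def addWithLinearProbing(table, item):
    """Insert item, stepping to the next slot on a collision and wrapping around."""
    capacity = len(table)
    index = abs(hash(item)) % capacity
    while table[index] is not None and table[index] != item:
        index = (index + 1) % capacity    # wrap around at the end of the array
    table[index] = item
    return index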

Collision Processing Strategies

• Rehashing - run one or more additional hash functions until a collision does not occur

• Works well when the load factor is small

• Multiple hash functions may contribute a large constant of proportionality to the running time
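A sketch of the rehashing idea: try a sequence of hash functions until one yields a free slot (the function name and the hashFunctions parameter are illustrative):

def addWithRehashing(table, item, hashFunctions):
    """Apply each hash function in turn until no collision occurs."""
    capacity = len(table)
    for h in hashFunctions:
        index = abs(h(item)) % capacity
        if table[index] is None or table[index] == item:
            table[index] = item
            return index
    raise RuntimeError("every hash function produced a collision")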

Collision Processing Strategies

• Quadratic collision processing - Move a considerable distance from the initial collision

• Does not require other rehashing functions

• When k is the collision position, we enter a loop that repeatedly attempts to locate an empty position

k + 1²    // The first attempt to locate a position
k + 2²    // The second attempt to locate a position
...
k + r²    // The rth attempt to locate a position
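A sketch of quadratic probing from the home position k (the guard against running out of probes is an added assumption):

def addWithQuadraticProbing(table, item):
    """Probe k, k + 1**2, k + 2**2, ..., wrapping modulo the capacity."""
    capacity = len(table)
    k = abs(hash(item)) % capacity
    for r in range(capacity):
        index = (k + r * r) % capacity
        if table[index] is None or table[index] == item:
            table[index] = item
            return index
    raise RuntimeError("no free slot found")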

Collision Processing Strategies

• Chaining

– Each hash value specifies an index or bucket in the array

– This bucket is at the head of a linked structure or chain of items with the same hash value

Some Buckets and Chains

(Figure: an array indexed 0 through 4; each occupied bucket heads a chain of items with the same hash value, e.g. one bucket chains D5 → D2, another D6 → D4, another D8, and another D3 → D1 → D7; one bucket is empty.)

HashSet Data

# Instance variables for locating data
self._foundEntry   # Pointer to the item just located; undefined if not found
self._priorEntry   # Pointer to the item prior to the one just located; undefined if not found
self._index        # Index of the chain in which the item was located; undefined if not found

# Instance variables for data
self._array        # The array of collision lists
self._size         # The number of items in the set

Extra instance variables support pointer manipulations during insertions and removals

HashSet Initialization

from arrays import Array          # assumed course module providing the Array class
from node import Node
from abstractset import AbstractSet
from abstractcollection import AbstractCollection

class HashSet(AbstractCollection, AbstractSet):

    DEFAULT_CAPACITY = 1000

    def __init__(self, sourceCollection = None):
        self._array = Array(HashSet.DEFAULT_CAPACITY)
        self._foundEntry = self._priorEntry = None
        self._index = -1
        AbstractCollection.__init__(self, sourceCollection)

Uses singly linked nodes for the collision lists

HashSet Searching

def __contains__(self, item):
    self._index = abs(hash(item)) % len(self._array)
    self._priorEntry = None
    self._foundEntry = self._array[self._index]
    while self._foundEntry != None:
        if self._foundEntry.data == item:
            return True
        else:
            self._priorEntry = self._foundEntry
            self._foundEntry = self._foundEntry.next
    return False

If this method returns True, the instance variables _index, _foundEntry, and _priorEntry allow other methods to locate and manipulate an item in the array’s collision list efficiently

HashSet Insertion

def add(self, item):
    if not item in self:
        newEntry = Node(item, self._array[self._index])   # Link to head of chain
        self._array[self._index] = newEntry
        self._size += 1

HashSet Removal

def remove(self, item):
    if not item in self:
        raise KeyError(str(item) + " not in set")
    elif self._priorEntry is None:
        self._array[self._index] = self._foundEntry.next
    else:
        self._priorEntry.next = self._foundEntry.next
    self._size -= 1
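A brief usage sketch of HashSet, assuming the course modules (arrays, node, abstractset, abstractcollection) are available and that AbstractCollection supplies __len__:

s = HashSet()
for item in ("A", "B", "C", "A"):     # the duplicate "A" is ignored
    s.add(item)
print(len(s))                         # 3
print("B" in s)                       # True
s.remove("B")
print("B" in s)                       # False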

Performance of Chaining

• If chains are evenly distributed across the array, close to O(1)

• If one or two chains get very long, processing tends to be linear

• Using a larger array reduces the chance of collisions but wastes memory

• On average, performance stays close to O(1)

For Friday

Introduction to Graphs (Chapter 20)
