Copyright © 2009-2011 by Curt Hill
Searching and Sorting
A Summary on Searching
The lesson from Neural Networks
• Neural networks are only used when there are no algorithms that always work
• We only use them on hard problems
• In NN there are never any absolute answers
• Instead each project is different and we experiment with our options until we are happy
• We are never sure that this is the best answer; we only hope that it is acceptable
Apply the lesson
• This is the same problem as constructing data structures for programs
• It is extremely rare for us to know in advance all the things that would make the decision easy:
– The frequency or number of insertions, deletions, and lookups in an average run
– The frequency distribution of the key
– The density of the key
– What the optimal container class will be
– How the next revision will change all of this
Therefore
• We make fuzzy choices based on incomplete information
• We then become good at spotting trends that favor one structure over another
• With that in mind let us come back and re-examine searching and sorting
Why consider both at once?
• Our containers will fall into one of three categories:
– Unordered
– Ordered by key
– Ordered or partially ordered by something other than key
• In the first case there is no notion of sorting
• In the rest there is
– In some cases we must sort before we get started searching
– In both cases an insert must do some type of partial sort to clean it up
– A delete may also affect the sorted order but is often easier to correct
Searching Arrays or Vectors
• Three areas to review here
• Linear
• Binary
• Self-organizing lists
Linear searching
• This is the best and worst
• Advantages
– It is the easiest to code: not much more than a for loop
– Does the best for small tables, typically fewer than 10 entries
– Applicable to lists as well as tables
• Disadvantages
– Finding an item that is present in a uniformly distributed array needs ½N probes on average
– Finding that an item is not present requires looking at each of the N items
– Clearly an O(N) algorithm, which is the worst for a search
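The "not much more than a for loop" claim can be illustrated with a minimal sketch. The function name linear_search and the choice of an int table are assumptions made for this example only.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the index of the first entry equal to key,
// or -1 when the key is absent.
// A hit costs about N/2 probes on average; a miss always costs N probes.
long linear_search(const std::vector<int>& table, int key) {
    for (std::size_t i = 0; i < table.size(); ++i) {
        if (table[i] == key) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

The table needs no ordering at all, which is why this is often the first search written in a project.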
Why use?
• When the advantages outweigh the disadvantages
• For small tables it is the preferred choice
• Often chosen early in a project
– If and when performance becomes a problem then upgrade the search based on what you now know about the project
– It may be that the vector is large but searched infrequently so it is not a problem
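Sequential Search in C
• There is a sequential search function in stdlib.h
– Borland declares it there; it is not part of standard C, and POSIX systems declare lfind in search.h
• It will search an array of items using a user defined comparison
• The header is:
void* lfind ( const void * key, const void * base, size_t * num, size_t width, int (_USERENTRY * fcmp) (const void *, const void *));
– _USERENTRY is a Borland calling-convention macro; other compilers omit it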
Notes
• It uses void * to represent any pointer
• The key is a pointer to what is being searched for
• The base is an array, not necessarily of the same type as the key
– It may contain the key and other stuff
• The array has *num entries (num is passed as a pointer) and each entry is width bytes long
The passed function
• fcmp is a user defined routine to compare the key with a base item
• Key is the first parameter
• An array entry is the second
• Returns zero for equal and anything else for not equal
• If the item is found then lfind returns the pointer to it, and NULL otherwise
Commentary
• Actually figuring out how to use this thing is probably harder than coding it from scratch
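• However, the library version may be hand tuned, sometimes in machine language
• Thus it may do better than a C style loop, although the call through fcmp on every comparison can cost more than an inline test, so measure
• There is also in the C libraries:
– A binary search we will see later
– A quick sort routine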
Example
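int fcmpe(const void * a, const void * b){
  // lfind only needs zero for equal and non-zero otherwise
  if(*(const int *)a == *(const int *)b){
    return 0;
  }
  return +1;
}
...
size_t s = tablesize;
int key;
int * unsorted; // dynamic array
...
// pass the address of the key and let sizeof supply the entry width
int * found = (int *)lfind(&key, unsorted, &s, sizeof(int), fcmpe);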
Commentary
• The lfind is classic C
• It is not a template function, but it can be used much like a template function
• To manage this it:
– Uses void * pointers
– Makes the user specify the length
– Requires a user-defined function for comparison
• Then it will work on any array
STL considerations
• The STL has a search which is customarily interesting
• It may search for an item or a range of items
– In any container
• The header looks like this:
FI search(First1, Last1, First2, Last2)
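STL Notes
• The result and the first pair of parameters are forward iterators of the same type; the second pair may be a different forward iterator type
• First1 through Last1 are in the container class to be searched
• First2 through Last2 may be in another container
• If the second range holds just one item then search looks for that single item
– If First2=Last2 the second range is empty and search returns First1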
STL Results
• If search finds it the result is the beginning of the sequence
• Otherwise it returns Last1
• In order to use it, the stored types must be suitable for the equality operator
• You may also provide your own predicate
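A short sketch of std::search with the two ranges living in different container types. The helper name find_run and the choice of vector and list are assumptions made for this example.

```cpp
#include <algorithm>
#include <list>
#include <vector>

// Hypothetical helper: position of the first occurrence of pattern inside
// haystack, or -1 when std::search returns Last1 (here haystack.end()).
long find_run(const std::vector<int>& haystack, const std::list<int>& pattern) {
    auto it = std::search(haystack.begin(), haystack.end(),
                          pattern.begin(), pattern.end());
    if (it == haystack.end()) {
        return -1;
    }
    return static_cast<long>(it - haystack.begin());
}
```

An empty pattern matches at the very start, so find_run reports 0 for it on any non-empty haystack.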
Binary search
• The binary search requires a sorted table
• The sort order may be either ascending or descending
– For this presentation ascending is assumed
Basic algorithm
• Set low to 0, high to the last used index
• While low <= high
– Set mid to be halfway between low and high
– Compare the mid item with key
– If the mid item is equal you are done
– If the mid item is less than the key, remove the lower half of the table: set low to mid + 1
– If the mid item is greater than the key, remove the upper half of the table: set high to mid - 1
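A minimal sketch of the loop for an ascending table of int. The name binary_search_index is an assumption for this example. The sketch uses inclusive bounds and moves each bound one step past mid, which guarantees that the range shrinks on every pass and the loop cannot stall.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: index of key in an ascending table, or -1 if absent.
long binary_search_index(const std::vector<int>& table, int key) {
    long low = 0;
    long high = static_cast<long>(table.size()) - 1;
    while (low <= high) {
        long mid = low + (high - low) / 2;  // avoids overflow of low + high
        int item = table[static_cast<std::size_t>(mid)];
        if (item == key) {
            return mid;          // found
        }
        if (item < key) {
            low = mid + 1;       // discard the lower half, including mid
        } else {
            high = mid - 1;      // discard the upper half, including mid
        }
    }
    return -1;                   // bounds crossed: key is not present
}
```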
Commentary
• The loop terminates when we find the item or the high and low bounds collapse
• We determine which after the loop
• The advantages
– The search is O(log₂N) because at each iteration we eliminate half of what is left
• The disadvantages
– The loop is much more complicated; most people do not get it right the first time
– The array must be sorted before we get started
Sorting
• Since sorting is either O(N²) or O(N log₂N), this is a very serious ramification
– You have to do quite a few searches to pay for that sort
• If the table will allow insertions it complicates that as well
– The search to find the item is log₂N, however the insertion can be no better than linear in an array, since we have to slide all the following items down one
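The two costs of inserting into a sorted array can be seen side by side in a small sketch. The name insert_sorted is an assumption for this example; std::lower_bound and vector::insert are the standard library calls.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: keeps an ascending vector sorted across an insertion.
void insert_sorted(std::vector<int>& table, int value) {
    // Finding the slot is a binary search: O(log N) comparisons.
    auto pos = std::lower_bound(table.begin(), table.end(), value);
    // Opening the slot slides every later item down one: O(N) moves.
    table.insert(pos, value);
}
```

The insert may also trigger a reallocation and a full copy when the vector is at capacity.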
C Function
• There is a binary search function in stdlib.h
• It will search a sorted array of items using a user defined comparison
void* bsearch ( const void * key, const void * base, size_t num, size_t width, int (_USERENTRY * fcmp) (const void *, const void *) );
Commentary
• It uses void * to represent any pointer
• The key is a pointer to what is being searched for
• The base is an array, not necessarily of the same type as the key
– It may contain the key and other stuff
– The array has num entries and each entry is width bytes long
User Defined Function
• fcmp is a user defined routine to compare the key with a base item
• It returns a negative if the first parameter is less than the second
• It returns a zero if the first parameter is equal to the second
• It returns a positive if the first parameter is greater than the second
• If the item is found then bsearch returns the pointer to it, and NULL otherwise
Example
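int fcmp(const void * a, const void * b){
  if(*(const int *)a < *(const int *)b)
    return -1; // less
  if(*(const int *)a == *(const int *)b)
    return 0; // equal
  return +1; // greater
}
...
int key;
int * table; // sorted ascending
...
// pass the address of the key and let sizeof supply the entry width
int * found = (int *)bsearch(&key, table, tablesize, sizeof(int), fcmp);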
STL considerations
• There is also a binary search in the STL
• The header is:
bool binary_search(first, last, const T& value)
bool binary_search(first, last, const T& value, comp)
• first and last are ForwardIterators in the container
• value is the item looked for
• comp is an optional comparison object to allow you to specify the comparison
• Of course, the container is ordered
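A sketch of both forms. binary_search only answers yes or no, so a second helper uses std::lower_bound when the position is wanted. The names contains_desc and index_of are assumptions for this example; the comparison object must match the order the container was sorted in.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Hypothetical helper: membership test on a table sorted in descending order.
bool contains_desc(const std::vector<int>& table, int key) {
    return std::binary_search(table.begin(), table.end(), key,
                              std::greater<int>());
}

// Hypothetical helper: index of key in an ascending table, or -1 if absent.
long index_of(const std::vector<int>& table, int key) {
    auto it = std::lower_bound(table.begin(), table.end(), key);
    if (it != table.end() && *it == key) {
        return static_cast<long>(it - table.begin());
    }
    return -1;
}
```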
Segmented Search
• Intermediate between binary and linear search
– Easier to code than binary search
– Faster than linear
• Requires sorted array
• Depending on size of table may come in two to four stages
Two Stage
• Divide the array into segments
– Segment size is close to the square root of the table size
• First find the segment that contains the desired item
– Use a linear search but with the segment size as the increment
• Once the segment is found, find the desired item
– Again with linear search
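Example code
• Assume that size of table is 64 and it is sorted:
int first = 0, last = 64;
// stage one: step through the segment starts 8, 16, ..., 56
for(int i = 8; i < 64; i += 8){
  if(key < arr[i]){ last = i; break; } // key lies before this segment
  first = i;
}
// stage two: linear search inside the chosen segment
int found = -1;
for(int j = first; j < last; j++)
  if(key == arr[j]){ found = j; break; }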
Commentary
• The search should take about 2N½ probes at worst, which is O(N½)
• On the above table of 64 a linear search that finds the item would take on average 32 probes
• The segmented search will take no more than 16
– Average is 8
• A binary search would average a little over 5 probes, with a maximum of 7 (6 for a table of 63)
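The segmented search figures follow from a short calculation, sketched here under the assumption that the table of N entries is cut into equal segments of s entries and that both stages are plain linear scans. The symbols s, P_max and P_avg are introduced only for this sketch.

```latex
\begin{align*}
P_{\max}(s) &= \frac{N}{s} + s \\
\frac{dP_{\max}}{ds} &= -\frac{N}{s^{2}} + 1 = 0 \;\Rightarrow\; s = \sqrt{N} \\
P_{\max}\!\left(\sqrt{N}\right) &= 2\sqrt{N}, \qquad
P_{\text{avg}} \approx \frac{1}{2}\left(\frac{N}{s} + s\right) = \sqrt{N} \\
N = 64 &: \quad s = 8, \quad P_{\max} = 16, \quad P_{\text{avg}} \approx 8
\end{align*}
```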
Other searches
• If the frequency of lookup is not uniform you may do some other things
• One is storing the most commonly accessed items at the beginning of the list
• The most developed form of this is the self organizing list
Hashing
• Often the best vector search technique
• Should be O(1), that is constant time, if done well
• No restrictions on the key
• The problems are well known and discourage many from using it
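A minimal sketch of a hashed table of int using linear probing, to show where the constant time comes from. The class name IntHashSet, the fixed capacity, and the modulo hash are all assumptions for this example; a real table would pick the hash from knowledge of the data. Deletion is left out on purpose because it needs extra bookkeeping.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical fixed-capacity hash set with linear probing.
class IntHashSet {
public:
    explicit IntHashSet(std::size_t capacity)
        : keys_(capacity, 0), used_(capacity, false) {}

    // Returns true if key was stored, false if it was already present
    // or the table is full.
    bool insert(int key) {
        for (std::size_t i = 0; i < keys_.size(); ++i) {
            std::size_t at = (home(key) + i) % keys_.size();
            if (!used_[at]) {
                keys_[at] = key;
                used_[at] = true;
                return true;
            }
            if (keys_[at] == key) {
                return false;
            }
        }
        return false;  // every slot probed: table is full
    }

    bool contains(int key) const {
        for (std::size_t i = 0; i < keys_.size(); ++i) {
            std::size_t at = (home(key) + i) % keys_.size();
            if (!used_[at]) {
                return false;  // an empty slot ends the probe run
            }
            if (keys_[at] == key) {
                return true;
            }
        }
        return false;
    }

private:
    // Assumed hash: the key modulo the table size. Only called when size > 0.
    std::size_t home(int key) const {
        return static_cast<std::size_t>(static_cast<unsigned>(key)) % keys_.size();
    }

    std::vector<int> keys_;
    std::vector<bool> used_;
};
```

With few collisions a lookup touches one or two slots regardless of how many keys are stored; as the table fills, the probe runs grow toward a linear search.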
Problems with hashing
• Insertions and deletions
• Hash function does not generalize well
• No such thing as a general hash function
• A good hash function is most often constructed with knowledge of the data
• Performance degrades as the table fills
• Processing the data in a sorted order requires an extra sort
• Making the hash as robust as the tree is quite difficult
Sermon
• Programmers usually avoid the hash because of these problems
• Very often this is the best of the search techniques
• The only question is: Is the work needed to make the hash the search technique of choice worth it?
– Depends on the application
Other containers
• Pointer based
• Lists
• Trees
Lists
• Most of our techniques translate into lists rather easily
• Insertions and deletions are much easier
• Not needing to know the size in advance is also helpful
– Dynamic arrays, including the STL vector, are as convenient
– There is a substantial run-time penalty when the array has to be recopied to another larger array
• The main exception is the binary search
– The binary search cannot be done efficiently since a list is not a random access container
– Most sorts that rely on random access do not work on a list either
– Quick sort should work on a doubly linked list
Self organizing lists
• Only list that is recommended for searching
– Only with very narrow criteria
• Types
– Move to top: delete the item and push it onto the front
– Transpose: remember the prior pointer and exchange the two contents
– Sort by frequency: the hardest of the SOLs because you can move up a variable amount
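The first two types can be sketched on a std::list of int. The function names find_move_to_front and find_transpose are assumptions for this example. std::list is doubly linked, so the sketch uses std::prev in place of a remembered prior pointer.

```cpp
#include <algorithm>
#include <iterator>
#include <list>

// Move to top: a hit is relinked to the front of the list.
bool find_move_to_front(std::list<int>& items, int key) {
    auto it = std::find(items.begin(), items.end(), key);
    if (it == items.end()) {
        return false;
    }
    items.splice(items.begin(), items, it);  // relinks the node, no copying
    return true;
}

// Transpose: a hit swaps contents with its predecessor, moving up one place.
bool find_transpose(std::list<int>& items, int key) {
    auto it = std::find(items.begin(), items.end(), key);
    if (it == items.end()) {
        return false;
    }
    if (it != items.begin()) {
        std::iter_swap(it, std::prev(it));
    }
    return true;
}
```

Move to top reacts quickly to a change in the popular keys; transpose adapts more slowly but is less disturbed by a single odd lookup.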
Lists
• A self organizing list will provide good results only if:
– Few items dominate the sought items
– The list is relatively short
• Other than this lists are not a good search container unless
– Search, insertion and deletion are very infrequent
– None are coded by the programmer
Trees
• Trees are inherently sorted
– There is nothing like an unsorted list
• Flavors to consider
– Unbalanced
– Balanced
– Optimal Search
– B-tree
– Trie
Unbalanced tree
• Normal searches perform slightly worse than binary searches
– Rarely balanced
• Advantage of log₂N insertion time on average
• When the search fails, you are at the location where you want to insert, with no additional work
• The worst case tree deletion is better than the average table deletion and the average case is log₂N
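A sketch of an unbalanced tree of int showing that a failed search stops at exactly the link where the new key belongs. BstNode, bst_insert and bst_contains are names assumed for this example.

```cpp
#include <memory>

// Hypothetical node type for an unbalanced binary search tree.
struct BstNode {
    int key;
    std::unique_ptr<BstNode> left;
    std::unique_ptr<BstNode> right;
    explicit BstNode(int k) : key(k) {}
};

// Returns true if key was inserted, false if it was already present.
bool bst_insert(std::unique_ptr<BstNode>& root, int key) {
    std::unique_ptr<BstNode>* link = &root;
    while (*link) {
        if (key == (*link)->key) {
            return false;
        }
        link = (key < (*link)->key) ? &(*link)->left : &(*link)->right;
    }
    // The search failed, and link is the empty child where key belongs.
    link->reset(new BstNode(key));
    return true;
}

bool bst_contains(const std::unique_ptr<BstNode>& root, int key) {
    const BstNode* node = root.get();
    while (node) {
        if (key == node->key) {
            return true;
        }
        node = (key < node->key) ? node->left.get() : node->right.get();
    }
    return false;
}
```

Feeding keys in sorted order degenerates this tree into a list, so the log₂N figure is an average and not a guarantee.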
Balanced trees
• Search comparable to binary search
• Insertions and deletions are generally less painful
• A rebalance can be quite extensive and expensive
– Generally a rebalance is less painful than an insertion or deletion in a table because the sliding affects all the table to the end
– Recopying the table is a hidden cost
Tries
• Most of the advantages of the tree but it has two requirements to be useful:
• Dense key
• The key should have a small alphabet and short length
– This is not much of a consideration if the key is truly dense
• A binary tree has an O(log₂N) search time
– While a trie has search time linear on the length of the key rather than the number of entries
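A sketch of a trie whose lookup cost depends only on the key length. The 26-letter lowercase alphabet and the names TrieNode, trie_insert and trie_contains are assumptions for this example.

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical trie node: one child slot per letter of an assumed
// alphabet 'a'..'z'.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> child;
    bool terminal = false;  // true when a stored key ends at this node
};

// True when every character is inside the assumed alphabet.
bool trie_key_ok(const std::string& key) {
    for (char c : key) {
        if (c < 'a' || c > 'z') {
            return false;
        }
    }
    return true;
}

// Returns false, storing nothing, if the key leaves the assumed alphabet.
bool trie_insert(TrieNode& root, const std::string& key) {
    if (!trie_key_ok(key)) {
        return false;
    }
    TrieNode* node = &root;
    for (char c : key) {
        std::unique_ptr<TrieNode>& next = node->child[static_cast<std::size_t>(c - 'a')];
        if (!next) {
            next.reset(new TrieNode());
        }
        node = next.get();
    }
    node->terminal = true;
    return true;
}

// One step per character of the key, however many keys are stored.
bool trie_contains(const TrieNode& root, const std::string& key) {
    if (!trie_key_ok(key)) {
        return false;
    }
    const TrieNode* node = &root;
    for (char c : key) {
        node = node->child[static_cast<std::size_t>(c - 'a')].get();
        if (!node) {
            return false;
        }
    }
    return node->terminal;
}
```

Each node carries 26 pointers whether or not they are used, which is why a small alphabet and a dense key matter.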
Optimal search trees
• Somewhat similar to a list with optimal static order but faster
– Requires knowledge of the frequencies
• Like a binary search it generally cuts the remaining items in half on each pass
– The halves are weighted by frequency, not by number of keys
• Like a self organizing list it tends to find high frequency items quite quickly
Optimal search tree
• A standard unbalanced tree may be used
– Need a prior program that orders the keys based on frequency
• Generally not used if insertions and deletions are possible
• May be used for a set of keys that changes from day to day
• Keep the counts in every node and then write out tomorrow's tree based on the frequencies
• Could be quite effective but complicated to implement
B-Trees
• Offer no advantages in memory
– Searching the node offsets the shallowness of the tree
• Preferred for disks
• No DBMS should be without them