Copyright © 2009-2011 by Curt Hill

Searching and Sorting

A Summary on Searching

The lesson from Neural Networks

• Neural networks are only used when there are no algorithms that always work

• We only use them on hard problems

• In NN there are never any absolute answers

• Instead each project is different and we experiment with our options until we are happy

• We are never sure that this is the best answer; we only hope that it is acceptable


Apply the lesson

• This is the same problem we face when constructing data structures for programs

• It is extremely rare for us to know in advance all the things that would make the decision easy:

– The frequency or number of insertions, deletions, lookups in an average run

– The frequency distribution of the key

– The density of the key

– What will be the optimal container class

– How the next revision will change all of this


Therefore

• We make fuzzy choices based on incomplete information

• We then become good at spotting trends that favor one structure over another

• With that in mind let us come back and re-examine searching and sorting


Why consider both at once?

• Our containers will fall into one of three categories:

– Unordered

– Ordered by key

– Ordered or partially ordered by something other than key

• In the first case there is no notion of sorting

• In the rest there is

– In some cases we must sort before we get started searching

– In both cases an insert must do some type of partial sort to clean it up

– A delete may also affect the sorted order but is often easier to correct


Searching Arrays or Vectors

• Three areas to review here

• Linear

• Binary

• Self-organizing lists


Linear searching

• This is the best and worst

• Advantages

– It is the easiest to code

• Not much more than a for loop

– Does the best for small tables, typically fewer than 10 items

– Applicable to lists as well as tables

• Disadvantages

– Finding an item that is present in a uniformly distributed array needs about ½N probes on average

– Finding that an item is not present requires looking at each of the N items

– Clearly an O(N) algorithm, which is the worst for a search
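The "not much more than a for loop" point can be made concrete with a minimal sketch. The function name linear_search and the choice of an int vector are assumptions for illustration only, not part of the original slides.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the index of the first element equal to key,
// or -1 when the key is absent.
int linear_search(const std::vector<int>& table, int key) {
    for (std::size_t i = 0; i < table.size(); ++i) {
        if (table[i] == key) {
            return static_cast<int>(i);   // found after i + 1 probes
        }
    }
    return -1;                            // every one of the N items was examined
}
```

The table does not need to be sorted, which is why this is usually the first search written in a project.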


Why use?

• When the advantages outweigh the disadvantages

• For small tables it is the preferred choice

• Often chosen early in a project

– If and when performance becomes a problem then upgrade the search based on what you now know about the project

– It may be that the vector is large but searched infrequently so it is not a problem


Sequential Search in C

• There is a sequential search function in stdlib.h (POSIX systems declare it in search.h)

• It will search an array of items using a user defined comparison

• The header is:

void* lfind ( const void * key, const void * base,
              size_t * num, size_t width,
              int (_USERENTRY * fcmp) (const void *, const void *));


Notes

• It uses void * to represent any pointer

• The key is a pointer to what is being searched for

• The base is an array, not necessarily of the same type as the key

– It may contain the key and other stuff

• num points to the number of entries in the array and each entry is width bytes long


The passed function

• fcmp is a user defined routine to compare the key with a base item

• Key is the first parameter

• An array entry is the second

• Returns zero for equal and anything else for not equal

• If the item is found then lfind returns the pointer to it and NULL otherwise


Commentary

• Actually figuring out how to use this thing is probably harder than coding it from scratch

• However, the library routine is often hand-tuned, sometimes in machine language

• Thus it may do better than a simple C style loop, though the call through fcmp on every probe has a cost

• There is also in the C libraries:

– A binary search we will see later

– A quick sort routine


Example


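int fcmpe(const void * a, const void * b){
  if(*(const int *)a == *(const int *)b){
    return 0;   // equal
  }
  return +1;    // not equal
}
...
size_t s = tablesize;
int key;
int * unsorted; // dynamic array
...
// lfind wants the address of the key and the width of one entry
int * found = (int *)lfind(&key, unsorted, &s, sizeof(int), fcmpe);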

Commentary

• The lfind is classic C

• It is not a template function, but it can be used much like a template function

• To be that general it:

– Must use void * pointers

– Makes the user specify the length

– Requires a user-defined function for comparison

• Then it will work on any array


STL considerations

• The STL has a search which is customarily interesting

• It may search for an item or a range of items

– In any container

• The header looks like this:

FI search(First1, Last1, First2, Last2)


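STL Notes

• The result and all parameters are forward iterators

– The result has the same type as First1 and Last1

• First1 through Last1 are in the container class to be searched

• First2 through Last2 give the sequence to look for and may be in another container

• If First2 through Last2 holds just one item then search looks for that item

– If First2=Last2 the sequence is empty and search returns First1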


STL Results

• If search finds it the result is the beginning of the matching sequence

• Otherwise it returns Last1

• In order to use it the stored types must be suitable for the equality operator

• You may also provide your own predicate
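A short sketch of std::search may help here. The wrapper name find_run and the use of a vector for the searched range and a list for the sought sequence are assumptions chosen to show that the two ranges may live in different containers.

```cpp
#include <algorithm>
#include <list>
#include <vector>

// Hypothetical wrapper: returns the 0-based position where pattern first
// occurs inside haystack, or -1 when it does not occur.
int find_run(const std::vector<int>& haystack, const std::list<int>& pattern) {
    auto it = std::search(haystack.begin(), haystack.end(),
                          pattern.begin(), pattern.end());
    if (it == haystack.end()) {
        return -1;                        // search returned Last1: no match
    }
    return static_cast<int>(it - haystack.begin());
}
```

A fifth argument to std::search can supply a predicate when the element type has no suitable equality operator.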


Binary search

• The binary search requires a sorted table

• The sort order may be either ascending or descending

– For this presentation assumed ascending


Basic algorithm

• Set low to 0, high to the last used

• While low <= high

– Set mid to be halfway between low and high

– Compare the mid item with key

– If the mid item is equal you are done

– If the mid item is less than the key

• Remove the lower half of the table

• Set low to mid + 1

– If the mid item is greater than the key

• Remove the upper half of the table

• Set high to mid - 1
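One way to write the loop is sketched below, assuming inclusive low and high bounds and stepping past mid on each pass so that the range always shrinks. The function name binary_search_index and the int vector are illustrative assumptions.

```cpp
#include <vector>

// Hypothetical helper: returns the index of key in an ascending table,
// or -1 when the key is absent.
int binary_search_index(const std::vector<int>& table, int key) {
    int low = 0;
    int high = static_cast<int>(table.size()) - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;   // avoids overflow of low + high
        if (table[mid] == key) {
            return mid;                     // found
        }
        if (table[mid] < key) {
            low = mid + 1;                  // discard the lower half and mid
        } else {
            high = mid - 1;                 // discard the upper half and mid
        }
    }
    return -1;                              // bounds crossed: not present
}
```

Moving low or high exactly to mid, without the plus or minus one, is a common way for the loop to stall on a two-item range.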


Commentary

• The loop terminates when we find the item or the high and low bounds collapse

• We determine which after the loop

• The advantages

– The search is O(log₂ N) because at each iteration we eliminate half of what is left

• The disadvantages

– The loop is much more complicated

• Most people do not get it right the first time

– The array must be sorted before we get started


Sorting

• Since sorting is either O(N²) or O(N log₂ N), this is a very serious ramification

– You have to do quite a few searches to pay for that sort

• If the table will allow insertions it complicates that as well

– The search to find the insertion point is log₂ N, however the insertion itself is still linear in an array, since we have to slide all the following items down one
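The cost split for insertion into a sorted array can be seen in a small sketch. It assumes an STL vector of int and a hypothetical helper name sorted_insert; std::lower_bound does the logarithmic search and vector::insert does the linear slide.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: inserts key into an ascending vector, keeping it sorted.
void sorted_insert(std::vector<int>& table, int key) {
    // Binary search for the first position whose item is not less than key.
    auto pos = std::lower_bound(table.begin(), table.end(), key);
    // insert() shifts every following item down one slot, which is O(N).
    table.insert(pos, key);
}
```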


C Function

• There is a binary search function in stdlib.h

• It will search a sorted array of items using a user defined comparison

void* bsearch ( const void * key, const void * base,
                size_t num, size_t width,
                int (_USERENTRY * fcmp) (const void *, const void *) );


Commentary

• It uses void * to represent any pointer

• The key is a pointer to what is being searched for

• The base is an array, not necessarily of the same type as the key

– It may contain the key and other stuff

– The array has num entries and each entry is width bytes long

User Defined Function

• fcmp is a user defined routine to compare the key with a base item

• It returns a negative if the first parameter is less than second

• It returns a zero if the first parameter is equal to second

• It returns a positive if the first parameter is greater than second

• If the item is found then bsearch returns the pointer to it and NULL otherwise


Example


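int fcmp(const void * a, const void * b){
  if(*(const int *)a < *(const int *)b)
    return -1; // less
  if(*(const int *)a == *(const int *)b)
    return 0;  // equal
  return +1;   // greater
}
...
int key;
int * table; // sorted array of tablesize ints
...
// bsearch wants the address of the key and the width of one entry
int * found = (int *)bsearch(&key, table, tablesize, sizeof(int), fcmp);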
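For readers who want something that compiles as it stands, here is a small self-contained sketch of the same call made from C++ through cstdlib. The names compare_ints and find_sorted are assumptions for illustration; only std::bsearch itself comes from the library.

```cpp
#include <cstddef>
#include <cstdlib>

// Three-way comparison of two ints, in the form bsearch expects.
int compare_ints(const void* a, const void* b) {
    int x = *static_cast<const int*>(a);
    int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);             // negative, zero, or positive
}

// Hypothetical wrapper: index of key in the sorted array, or -1 when absent.
int find_sorted(const int* table, std::size_t count, int key) {
    const void* hit = std::bsearch(&key, table, count, sizeof(int), compare_ints);
    if (hit == nullptr) {
        return -1;
    }
    // Convert the returned element pointer back into an index.
    return static_cast<int>(static_cast<const int*>(hit) - table);
}
```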

STL considerations

• There is also a binary search in the STL

• The header is:

bool binary_search(first, last, const T& value, comp)

• first and last are ForwardIterators in the container

• value is the item looked for

• comp is an optional comparison object to allow you to specify the comparison

• Of course, the container must already be ordered
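The sketch below shows std::binary_search with and without a comparison object, plus std::lower_bound for the case where the position is wanted and not just a yes or no. The wrapper names contains_ascending, contains_descending and position_of are illustrative assumptions.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// True when key is present in a vector sorted in ascending order.
bool contains_ascending(const std::vector<int>& sorted, int key) {
    return std::binary_search(sorted.begin(), sorted.end(), key);
}

// Same test for a vector sorted in descending order: the comparison object
// must describe the order the data is actually in.
bool contains_descending(const std::vector<int>& sorted_desc, int key) {
    return std::binary_search(sorted_desc.begin(), sorted_desc.end(), key,
                              std::greater<int>());
}

// binary_search only answers yes or no; lower_bound gives the position.
int position_of(const std::vector<int>& sorted, int key) {
    auto it = std::lower_bound(sorted.begin(), sorted.end(), key);
    if (it == sorted.end() || *it != key) {
        return -1;
    }
    return static_cast<int>(it - sorted.begin());
}
```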


Segmented Search

• Intermediate between binary and linear search

– Easier to code than binary search

– Faster than linear

• Requires sorted array

• Depending on size of table may come in two to four stages


Two Stage

• Divide the array into segments

– Segment size is close to square root of size

• First find the segment that contains desired item

– Use a linear search but with segment size increment

• Once segment is found find the desired item

– Again with linear search


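Example code

• Assume that size of table is 64 and it is sorted:

int first = 0, last = 64, pos = -1; // pos stays -1 if key is absent
for(int i = 7; i < 64; i += 8){     // probe the last item of each segment
  if(key <= arr[i]){ last = i + 1; break; }
  first = i + 1;
}
for(int j = first; j < last; j++)   // linear search inside the segment
  if(key == arr[j]){ pos = j; break; }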


Commentary

• The search should be O(N½), about 2N½ probes at worst

• On the above table of 64 a linear search that finds the item would take on average 32 probes

• The segmented search will take no more than 16

– Average is 8

• A binary search would average 5.? and take at most 7 (log₂ 64 + 1)
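The two stage idea generalizes beyond a table of 64. The following is a sketch only, assuming an ascending int vector and a segment size of the integer square root of the table size; the name segmented_search is hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: two stage segmented search over an ascending table.
// Returns the index of key, or -1 when the key is absent.
int segmented_search(const std::vector<int>& table, int key) {
    const std::size_t n = table.size();
    if (n == 0) {
        return -1;
    }
    std::size_t step = static_cast<std::size_t>(std::sqrt(static_cast<double>(n)));
    if (step == 0) {
        step = 1;                         // defensive: never use an empty segment
    }
    // Stage 1: skip whole segments whose last item is still below the key.
    std::size_t first = 0;
    while (first + step <= n && table[first + step - 1] < key) {
        first += step;
    }
    // Stage 2: linear search inside the one segment that can hold the key.
    const std::size_t last = (first + step < n) ? first + step : n;
    for (std::size_t j = first; j < last; ++j) {
        if (table[j] == key) {
            return static_cast<int>(j);
        }
    }
    return -1;
}
```

With 64 items the segment size is 8, so stage 1 makes at most 8 probes and stage 2 at most 8 more.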


Other searches

• If the frequency of lookup is not uniform you may do some other things

• Storing the most commonly accessed items at the beginning of the list

• The most developed of which becomes a self organizing list


Hashing

• Often the best vector search technique

• Should be O(1), constant time, if done well

• No restrictions on the key

• The problems are well known and discourage many from using it
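As a minimal sketch of a hashed lookup, the standard library's std::unordered_map can stand in for a hand-built hash table. The helper name lookup_or and the string-to-int table are assumptions for illustration; a lookup is constant time on average but can degrade toward linear when many keys collide.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper: returns the value stored for key, or fallback when
// the key is not in the table.
int lookup_or(const std::unordered_map<std::string, int>& table,
              const std::string& key, int fallback) {
    auto it = table.find(key);            // hash the key, then check one bucket
    return (it == table.end()) ? fallback : it->second;
}
```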


Problems with hashing

• Insertions and deletions

• Hash function does not generalize well

• No such thing as a general hash function

• A good hash function is most often constructed with knowledge of the data

• Performance degrades when full

• Processing the data in a sorted order requires an extra sort

• Making the hash as robust as the tree is quite difficult


Sermon

• Programmers usually avoid the hash because of these problems

• Very often this is the best of the search techniques

• The only question is: Is the effort needed to make the hash the search technique of choice worth it?

– Depends on the application


Other containers

• Pointer based

• Lists

• Trees


Lists

• Most of our techniques translate into lists rather easily

• Insertions and deletions are much easier

• The lack of needing to know the size in advance is also helpful

– Dynamic arrays, including the STL vector, are as convenient

– There is a substantial run-time penalty when the array has to be recopied to another larger array

• The main exception is the binary search

– The binary search cannot be done since a list is not a random access container

– Most sorts do not work on a list either

– Quick sort should work on a doubly linked list

Self organizing lists

• Only list that is recommended for searching

– Only with very narrow criteria

• Types

– Move to top

• Delete the item and push onto front

– Transpose

• Remember the prior pointer and exchange the two contents

– Sort by frequency

• The hardest of the SOLs because you can move up a variable amount
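The first two strategies can be sketched with the STL list. The function names find_move_to_front and find_transpose are hypothetical, and the sketch assumes a list of int searched linearly with std::find.

```cpp
#include <algorithm>
#include <iterator>
#include <list>

// Move to top: on a hit, relink the found node at the front of the list.
bool find_move_to_front(std::list<int>& items, int key) {
    auto it = std::find(items.begin(), items.end(), key);
    if (it == items.end()) {
        return false;
    }
    items.splice(items.begin(), items, it);   // relinks the node, no copying
    return true;
}

// Transpose: on a hit, exchange contents with the preceding node so that
// popular items drift toward the front one step at a time.
bool find_transpose(std::list<int>& items, int key) {
    auto it = std::find(items.begin(), items.end(), key);
    if (it == items.end()) {
        return false;
    }
    if (it != items.begin()) {
        std::iter_swap(it, std::prev(it));
    }
    return true;
}
```

Move to top reacts quickly to a burst of lookups for one key, while transpose changes the order more gradually.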


Lists

• A self organizing list will provide good results only if:

– Few items dominate the sought items

– The list is relatively short

• Other than this lists are not a good search container unless

– Search, insertion and deletion are very infrequent

– None are coded by the programmer


Trees

• Trees are inherently sorted

– There is nothing like an unsorted list

• Flavors to consider

– Unbalanced

– Balanced

– Optimal Search

– Btree

– Trie


Unbalanced tree

• Normal searches perform slightly worse than binary searches

– Rarely balanced

• Advantage of log₂ N insertion time

• When the search failed, you are at the location that you want to insert at with no additional work

• The worst case tree deletion is better than the average table deletion and the average case is log₂ N
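The point that a failed search leaves you at the insertion spot can be sketched as follows. The Node layout and the names locate, tree_contains and tree_insert are assumptions for illustration; the tree is a plain unbalanced binary search tree of int keys.

```cpp
#include <memory>

// Hypothetical node type for an unbalanced binary search tree.
struct Node {
    int key;
    std::unique_ptr<Node> left;
    std::unique_ptr<Node> right;
    explicit Node(int k) : key(k) {}
};

// Walks down from the root and returns the link that either holds the key
// or is the empty link where the key would have to be attached.
std::unique_ptr<Node>& locate(std::unique_ptr<Node>& root, int key) {
    std::unique_ptr<Node>* link = &root;
    while (*link && (*link)->key != key) {
        link = (key < (*link)->key) ? &(*link)->left : &(*link)->right;
    }
    return *link;
}

bool tree_contains(std::unique_ptr<Node>& root, int key) {
    return locate(root, key) != nullptr;
}

// A failed search already ends at the insertion point, so inserting is just
// one more assignment.
bool tree_insert(std::unique_ptr<Node>& root, int key) {
    std::unique_ptr<Node>& link = locate(root, key);
    if (link) {
        return false;                     // key already present
    }
    link.reset(new Node(key));
    return true;
}
```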


Balanced trees

• Search comparable to binary search

• Insertions and deletions are generally less painful

• A rebalance can be quite extensive and expensive

– Generally a rebalance is less painful than an insertion or deletion in a table because the sliding affects all the table to the end

– Recopying table is hidden cost


Tries

• Most of the advantages of the tree but it has two requirements to be useful:

• Dense key

• The key should have a small alphabet and short length

– This is not much of a consideration if the key is truly dense

• A binary tree has a O(log₂ N) search time

– While a trie has search time linear on the length of the key rather than the number of entries
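A minimal trie sketch shows why the search time follows the key length. It assumes keys made only of the lowercase letters a to z, which is the kind of small alphabet the slide asks for; TrieNode, trie_insert and trie_contains are hypothetical names.

```cpp
#include <array>
#include <memory>
#include <string>

// Hypothetical trie node: one child link per letter of a 26 letter alphabet.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> next;
    bool is_key = false;                  // true when a stored key ends here
};

// Assumes key contains only the characters 'a' through 'z'.
void trie_insert(TrieNode& root, const std::string& key) {
    TrieNode* node = &root;
    for (char c : key) {
        std::unique_ptr<TrieNode>& child = node->next[c - 'a'];
        if (!child) {
            child.reset(new TrieNode());
        }
        node = child.get();
    }
    node->is_key = true;
}

// One step per character of the key, regardless of how many keys are stored.
bool trie_contains(const TrieNode& root, const std::string& key) {
    const TrieNode* node = &root;
    for (char c : key) {
        node = node->next[c - 'a'].get();
        if (node == nullptr) {
            return false;
        }
    }
    return node->is_key;
}
```

Each node carries 26 links, so a sparse key set wastes most of that space, which is why the dense key requirement matters.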


Optimal search trees

• Somewhat similar to a list with optimal static order but faster

– Requires knowledge of the frequencies

• Like a binary search it generally cuts the remaining items in half on each pass

– The halves are based on frequencies not on keys

• Like a self organizing list it tends to find high frequency items quite quickly


Optimal search tree

• A standard unbalanced tree may be used

– Need a prior program that orders the keys based on frequency

• Generally not used if insertions and deletions are possible

• May be used for a set of keys that changes from day to day

• Keep the counts in every node and then write out tomorrow's tree based on the frequencies

• Could be quite effective but complicated to implement


B-Trees

• Offer no advantages in memory

– Searching the node offsets the shallowness of the tree

• Preferred for disks

• No DBMS should be without one