1/28 COP 3540 Data Structures with OOP Chapter 7 - Part 1 Advanced Sorting

1/28

COP 3540 Data Structures with OOP

Chapter 7 - Part 1: Advanced Sorting

2/28

Advanced Sorting

Two topics we will cover first:

Shell Sort: roughly an O(n (log2 n)^2) sort in general, which ‘can approach’ O(n) performance in favorable cases!

Partitioning: an O(n) operation that is the basis of QuickSort, an O(n log2 n) sort.

Then, we’ll cover the QuickSort.

3/28

Recall how the Insertion Sort worked.

We took an element out of the ‘array’ and assumed all elements ‘to the left’ of it were sorted.

We marked this spot and extracted that element.

We then compared the extracted element with the elements ‘to the left’ of it, shifting each larger element one place to the right. This fills the vacated spot and opens a gap.

Finally, we ‘inserted’ the extracted element into the gap, which is its proper place among the sorted elements.
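As a reminder, the steps above can be sketched in Java roughly as follows. The class name InsertionSortSketch and the sample data are made up for illustration; this is a minimal sketch of an ascending insertion sort, not necessarily the exact code used earlier in the course.

```java
import java.util.Arrays;

class InsertionSortSketch {
    // Sorts the array in ascending order, in place.
    static void insertionSort(long[] a) {
        for (int outer = 1; outer < a.length; outer++) {
            long temp = a[outer];      // extract the marked element
            int inner = outer;
            // shift larger elements one slot to the right
            while (inner > 0 && a[inner - 1] > temp) {
                a[inner] = a[inner - 1];
                inner--;
            }
            a[inner] = temp;           // insert into the opened slot
        }
    }

    public static void main(String[] args) {
        long[] data = {77, 99, 44, 55, 22, 88, 11, 0, 66, 33};
        insertionSort(data);
        System.out.println(Arrays.toString(data));
    }
}
```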

4/28

Approach that helped us:

We helped ourselves by:

• starting with a single element to the left, so we knew ‘that’ element was sorted; a single element is certainly sorted unto itself.

Then we proceeded:

• Slowly the sorted elements to the ‘left’ of the marked element grew in number, as new numbers found their proper place in the subarray to the left, while the unsorted elements to the right diminished in number.

5/28

Potential Problems with the Insertion Sort

Now, what happens if the next number to be inserted is very small and our sort is ascending (or very large and the sort is descending)?

This may require a large number of ‘copies’ to the right to make room for the new element. The number of copies can in fact be close to n.

In the worst case (data in reverse order) the average is about n/2 copies per element. For n elements that is n * n/2, or about n^2/2 copies. For random data the count is about half of that, which is still proportional to n^2.

That may result in a very inefficient sort. This is why the insertion sort is an O(n^2) sort.

It is this number of copies (comparing and shifting) that degrades its performance.
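To make the copy count concrete, here is a small sketch that counts how many one-position shifts an insertion sort performs. The class and method names (ShiftCountSketch, countShifts) are hypothetical, chosen only for this illustration. For reverse-ordered data of size n the count is n(n-1)/2, which is close to n^2/2.

```java
class ShiftCountSketch {
    // Runs an ascending insertion sort on a copy of the data
    // and returns the number of one-position shifts it made.
    static long countShifts(long[] data) {
        long[] a = data.clone();
        long shifts = 0;
        for (int outer = 1; outer < a.length; outer++) {
            long temp = a[outer];
            int inner = outer;
            while (inner > 0 && a[inner - 1] > temp) {
                a[inner] = a[inner - 1];
                inner--;
                shifts++;              // one copy to the right
            }
            a[inner] = temp;
        }
        return shifts;
    }

    public static void main(String[] args) {
        long[] reversed = {9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
        long[] sorted = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        System.out.println(countShifts(reversed)); // 45 = 10 * 9 / 2
        System.out.println(countShifts(sorted));   // 0
    }
}
```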

6/28

Shell Sort Approach

We want to reduce the number of large shifts.

Shell sort does this by sorting a very small subset of numbers, like three or four, where the numbers themselves might be large distances apart (as in a large array); it sorts them with respect to each other.

By sorting a small number of widely spaced numbers, very small (or very large) values can be put much more nearly ‘in place’ much more quickly than by shifting one position at a time.

How is this done?

7/28

Shell Sort uses the notion of a ‘computed gap’

The Shell Sort uses a computed ‘gap’, represented by ‘h’, as the distance between the numbers in each subset to be sorted.

1. Sorts all numbers in the array that are the same ‘h’ (gap) apart

• Like, numbers thirteen apart, or four apart…

• Sorts these numbers with respect to each other.

2. Then, after doing this, the algorithm reduces the gap (or distance) to a smaller number, like maybe 4 apart.

3. Ultimately the gap has size 1; then the algorithm ‘1-sorts’ the array, which is an ordinary insertion sort.

8/28

Example

Consider sorting a few elements at a time with respect to each other, where the elements are some distance ‘h’ apart.

For array size n = 10 and gap size h = 4, we have four sub-arrays (we call this a 4-sort):

Indices: (0,4,8), (1,5,9), (2,6) and (3,7).

Each set is sorted with respect to its own members.

(Note: all ten elements take part in one of the sub-arrays.) The sub-arrays are interleaved but, again, each is sorted only with respect to its own members.

(Note: the integers are not necessarily in their final positions yet.)
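A single 4-sort pass over a 10-element array can be sketched as below. The class name HSortSketch, the method hSort and the sample data are assumptions made for this illustration; the loop is an insertion sort that steps by h instead of by 1, so it handles the interleaved sub-arrays (0,4,8), (1,5,9), (2,6) and (3,7) in one sweep.

```java
import java.util.Arrays;

class HSortSketch {
    // Performs one h-sort pass: afterwards, elements that are
    // h positions apart are in ascending order relative to each other.
    static void hSort(long[] a, int h) {
        for (int outer = h; outer < a.length; outer++) {
            long temp = a[outer];
            int inner = outer;
            while (inner > h - 1 && a[inner - h] > temp) {
                a[inner] = a[inner - h];   // shift right by one gap
                inner -= h;
            }
            a[inner] = temp;
        }
    }

    public static void main(String[] args) {
        long[] data = {9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
        hSort(data, 4);
        // prints [1, 0, 3, 2, 5, 4, 7, 6, 9, 8]
        System.out.println(Arrays.toString(data));
    }
}
```

After the pass the array is not sorted, but no element is more than one position from its final place, so the later 1-sort has very little shifting left to do.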

9/28

Consider Improved Performance!

Recall again the Insertion Sort

Recalling how the insertion sort works: it is very efficient for arrays that are nearly sorted (fewer shifts and less movement), and yet it can be very inefficient (due to shifts and copies) if the data are very unsorted.

• Particularly true for very large / very small numbers that start far from where they belong.

Shell sort does ‘h-sorting’

It capitalizes on the initial position of elements, especially if they are far from where they might ultimately end up. It brings numbers more quickly to their final position (or nearer).

The algorithm moves elements that may be very far from their final position much closer to it in a few steps, thus reducing copying and shifting!

Shell Sort can approach O(n) performance: much better than O(n^2)!

10/28

What about Larger Arrays? Gap Size?

Using a carefully researched formula to compute a good gap size, Don Knuth developed a ‘recursive’ relationship (a recurrence):

h = 3*h + 1 to build up to the starting gap, and then subsequent gaps at h = (h-1)/3.

(Note the ‘recursion’ in the formula itself: it uses the value of h to compute the new value of h.)

These h-values are referred to as the interval sequence or gap sequence and are computed repeatedly as functions of the previous h.

In more detail:

11/28

Don Knuth’s algorithm will start by sorting groups of about three numbers that are some distance apart.

As Don Knuth’s research reveals (the algorithm is a few slides ahead), for an array of size > 364 and < 1093, we start with a gap size of 364.

After that sort, use a gap size of 121; then gap size = 40; steadily decreasing…

Develop the initial gap size by repeatedly computing h (the algorithm is three slides ahead):

• h is determined by computing h = h*3 + 1 until h <= nElems/3 is false.

• So, computing h, we see that h increases from 1 to 4 to 13 to 40 to 121 to 364 to ….

• Once the original gap is determined, the sort continues and the algorithm steadily reduces the gap h from 364 to 121 … until h = 1.

• So for array size > 364 and < 1093, gap = 364, etc.

Gap sizes

h        3*h+1
1        4
4        13
13       40
40       121
121      364
364      1093
1093     3280

12/28

Algorithm (covered in previous slide)

The algorithm first uses a short loop to generate the first (initial) value of h; how far the loop runs depends on the size of the array to be sorted.

Then, once we have the initial value of h, the remaining values of h are computed from it, one after another.

The gap thus starts with the largest h-value.

For a 1000-element array, our initial gap size is 364.

After each h-sort, we successively decrease the gap using the formula h = (h-1)/3, as shown.
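The gap computation on its own can be sketched as follows. The class GapSequenceSketch and the method gapSequence are hypothetical names used only here; the two loops mirror the rules stated above (grow with h = 3*h + 1 while h <= nElems/3, then shrink with h = (h-1)/3).

```java
import java.util.ArrayList;
import java.util.List;

class GapSequenceSketch {
    // Returns the gaps used for an array of nElems items, largest first.
    static List<Integer> gapSequence(int nElems) {
        int h = 1;
        while (h <= nElems / 3) {
            h = h * 3 + 1;          // 1, 4, 13, 40, 121, 364, ...
        }
        List<Integer> gaps = new ArrayList<>();
        while (h > 0) {
            gaps.add(h);
            h = (h - 1) / 3;        // step back down; 1 becomes 0 and ends the loop
        }
        return gaps;
    }

    public static void main(String[] args) {
        System.out.println(gapSequence(1000)); // prints [364, 121, 40, 13, 4, 1]
    }
}
```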

13/28

Note:

1. As it turns out, the algorithm actually sorts the first two elements of each group for a given gap first; then it goes back and sorts all three-element groups. This results in better performance time.

You will see this if you look carefully at the algorithm.

14/28

public void shellSort()
{
    int inner, outer;
    long temp;

    // Compute the initial gap: 1, 4, 13, 40, 121, 364, ...
    // The loop stops at the first h that exceeds nElems/3,
    // so the starting gap depends on the array size, nElems.
    int h = 1;
    while (h <= nElems/3)
        h = h*3 + 1;                 // for a 1000-element array, h = 364

    while (h > 0)                    // decrease h until the final 1-sort is done
    {
        // h-sort the array; outer runs from h to nElems-1 in steps of one
        for (outer = h; outer < nElems; outer++)
        {
            temp = theArray[outer];
            inner = outer;
            // shift larger elements of this sub-array right by h
            while (inner > h-1 && theArray[inner-h] >= temp)
            {
                theArray[inner] = theArray[inner-h];
                inner -= h;
            } // end while
            theArray[inner] = temp;
        } // end for

        h = (h-1) / 3;               // compute the next, smaller gap
    } // end while (h > 0)
} // end shellSort()

15/28

Google: Shell Sort Applet

Google: applet Lafore

You will get a number of applet choices. Select one and enjoy.

16/28

Demo of Shell Sort

Do n = 12 and notice how the gap varies across the bars. You can see when h goes from 4 to 1. You can see when it compares two elements in the interval, then three, and then 1-sorts.

Do the 100-element sort. It starts with h = 40. See that it compares two of the three elements in each interval until there are only intervals of two left.

There is a larger number of intervals when it goes to h = 13.

Go to h = 4 and see more intervals yet. Finally, h = 1.

Do this.

17/28

Shell Sort - Evaluation

Good for medium-sized arrays, up to a few thousand items.

Shell Sort, at O(n (log2 n)^2), is not as fast as the Quick Sort, at O(n log2 n) (coming soon).

Not so good for large files, but:

• Easy to implement

• Requires very little extra space.

All sorts have a ‘worst case’ performance. For Shell Sort, the worst case is not much worse than the average performance, so this is good!

(The worst case is very different from the average case in a Quick Sort.)

18/28

Final Remarks on Shell Sort

Other gap sequences are available; there are many alternatives, and you can experiment.

Ultimately, the sequence needs to end with a gap of 1. This forces the last pass to be an ordinary insertion sort.

Guideline: gaps should be relatively prime (share no common divisor other than 1). Note that the gaps presented here are not all prime numbers themselves (4, 40…).

• Shell’s original sequence of repeatedly halved gaps did not follow this guideline, which led to some earlier inefficiencies.

Experiments on Shell Sort yield performance mostly between O(n^(3/2)) and O(n^(7/6)), that is, somewhere between O(n^2) and O(n)!

That is quite a difference, and the difference is realized as n increases, which makes sense.

19/28

Partitioning

20/28

Partitioning

Partitioning is key to QuickSort thinking.

Partitioning divides data into two groups dependent upon the value of a key. E.g. divide students into two groups: gpa < 3.0 and gpa >= 3.0.

• (Incidentally, why is a gpa of 3.0 important??)

We select a Pivot Value: the value used to separate data items into two groups. We end up with Data < pivot value and Data >= pivot value.

21/28

Pivot Values

Note: pivot point can be any key value. Need not be a midpoint or value ‘half-way.’

Would be nice if pivot were half-way point, but we have no way of knowing…

Later we will see how the choice of the pivot impacts performance!

Pivot value used to separate array into left side and right side.

Ideally, we’d ‘like’ the sub-arrays to be roughly the same size, and we will work toward that reality.

22/28

Run Partition Algorithm to build Sub-Arrays

Once the pivot value is selected, we run the partition algorithm.

Once run, data less than the pivot value ‘belong’ to the left side of the array (whatever number of elements may be on the left), and

data greater than or equal to (>=) the pivot value belong to the right side, however many elements are on the right side.

Note: Once partitioning is run, the data are NOT sorted. But the items are a lot ‘closer’ to their final positions, and the array is partitioned based on the pivot value.

23/28

The Partitioning Algorithm

Pick a pivot value… (more later)

Left scan: start with an index at the left end of the range to be partitioned.

• Move toward the right, comparing each element to the pivot value.

• If an element is less than the pivot value, leave it alone and move to the right.

• Advance to the right until an element is >= pivot value, and then stop.

Right scan: start with an index at the rightmost position of the range.

• Move toward the left, comparing each element to the pivot value.

• If an element is >= pivot value, leave it alone and move to the left.

• Advance to the left until an element is < pivot value, and then stop.

Swap the two values.

Iterate (back on the left; then right) until the left and right scans meet.
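The scan-and-swap steps above can be sketched in Java as follows. This is one possible arrangement under stated assumptions, not necessarily the code from the textbook: the class name PartitionSketch, the method signature and the sample data are made up for illustration, and the loop ends when the two indices cross rather than when they sit on the same entry. Both scans are bounds-checked.

```java
import java.util.Arrays;

class PartitionSketch {
    // Rearranges a[left..right] so that values < pivot come first.
    // Returns the index of the first element that is >= pivot
    // (right + 1 if every element is < pivot).
    static int partition(long[] a, int left, int right, long pivot) {
        int leftPtr = left;
        int rightPtr = right;
        while (true) {
            // left scan: skip elements that already belong on the left
            while (leftPtr <= rightPtr && a[leftPtr] < pivot) {
                leftPtr++;
            }
            // right scan: skip elements that already belong on the right
            while (rightPtr >= leftPtr && a[rightPtr] >= pivot) {
                rightPtr--;
            }
            if (leftPtr >= rightPtr) {
                break;                 // scans have met: done
            }
            long temp = a[leftPtr];    // swap the two misplaced values
            a[leftPtr] = a[rightPtr];
            a[rightPtr] = temp;
            leftPtr++;
            rightPtr--;
        }
        return leftPtr;
    }

    public static void main(String[] args) {
        long[] data = {42, 89, 63, 12, 94, 27, 78, 3, 50, 36};
        int split = partition(data, 0, data.length - 1, 50);
        System.out.println("split index = " + split);
        System.out.println(Arrays.toString(data));
    }
}
```

The returned index marks the boundary between the two groups; neither group is sorted internally.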

….

24/28

Let’s look at the applet

25/28

Partition.html

Google: applet Lafore

Run with n=12 with various orderings…

Run with n=40. Notice the partition first and the final ordering…

Note: in running the partitioning algorithm the data are not totally sorted – but they are a good bit closer.

26/28

Partitioning and the Pivot Value

Note that partitioning is not stable.

As elements on one side are moved to the other side of the pivot value, they are NOT necessarily in the same relative positions in this ‘new’ partition! In fact, they tend to be in reverse order.

Further, the number of elements on each side need not be the same either; it depends on the pivot value.

Very likely, there is NOT the same number of elements on each side of the pivot.

27/28

One (of several) Problems with Partitioning

1. What if a poor pivot value were chosen such that all elements are < pivot value?

The left-scan index keeps advancing. Without a bounds test we end up with an array index out of bounds exception.

Ditto the other way for the right scan. The leftPtr < right test in the code below is the guard:

while (leftPtr < right && theArray[++leftPtr] < pivot)
    ; // nop

Clearly, as in any program that is to be robust, there must be checks for such pivot values.
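To see the guard at work, the left scan can be isolated in a small helper. The wrapper class GuardedScanSketch and the method leftScan are hypothetical; only the while condition is taken from the slide. With a pivot larger than every element, the scan stops at the last index instead of running off the end of the array.

```java
class GuardedScanSketch {
    // Returns the index at which the left scan stops.
    // Starts one position before 'left' because the loop pre-increments.
    static int leftScan(long[] theArray, int left, int right, long pivot) {
        int leftPtr = left - 1;
        while (leftPtr < right && theArray[++leftPtr] < pivot)
            ; // nop
        return leftPtr;
    }

    public static void main(String[] args) {
        long[] data = {1, 2, 3};
        System.out.println(leftScan(data, 0, 2, 2));   // 1: stops at the first value >= 2
        System.out.println(leftScan(data, 0, 2, 100)); // 2: stopped by the bounds test
    }
}
```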

28/28

Efficiency of the Partition Algorithm

The partition algorithm is pretty efficient too: it runs in O(n) time.

Pointers move toward each other from opposite ends, comparing and swapping at a constant rate.

If the number of items doubled from n to 2n, the algorithm would take roughly twice as long.

Thus the algorithm operates in O(n) time: the time is proportional to the number of items being partitioned.

29/28

Efficiency of the Partitioning Algorithm

Non-random data can yield the worst results! If the data are inversely ordered, then every pair will be swapped, so about n/2 swaps, the maximum for one partitioning pass.

(If such a pass had to be repeated for each of n elements, the total would be about n^2/2. Poor!)

Random data yield fewer than n/2 swaps, since some elements will already be on the correct side. On average for random data, about half of the maximum number of swaps will take place.

Regardless of random / non-random data, both situations result in an efficiency proportional to n for a single partitioning pass.
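The swap counts can be checked with a small sketch. The class PartitionSwapCount and the method countSwaps are hypothetical names for this illustration, and the partition loop is a simple bounds-checked variant that puts values < pivot on the left, not necessarily the textbook code. For inversely ordered data with a middle pivot, every pair is swapped (n/2 swaps); for data already in order, no swaps are needed.

```java
class PartitionSwapCount {
    // Partitions a copy of the data around the pivot
    // and returns the number of swaps performed.
    static int countSwaps(long[] data, long pivot) {
        long[] a = data.clone();
        int leftPtr = 0;
        int rightPtr = a.length - 1;
        int swaps = 0;
        while (true) {
            while (leftPtr <= rightPtr && a[leftPtr] < pivot) {
                leftPtr++;
            }
            while (rightPtr >= leftPtr && a[rightPtr] >= pivot) {
                rightPtr--;
            }
            if (leftPtr >= rightPtr) {
                break;
            }
            long temp = a[leftPtr];
            a[leftPtr] = a[rightPtr];
            a[rightPtr] = temp;
            swaps++;
            leftPtr++;
            rightPtr--;
        }
        return swaps;
    }

    public static void main(String[] args) {
        long[] reversed = {9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
        long[] ordered = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        System.out.println(countSwaps(reversed, 5)); // 5, i.e. n/2
        System.out.println(countSwaps(ordered, 5));  // 0
    }
}
```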
