Advanced Algorithms and Data Structureselomaa/teach/AADS-2015-1.pdf · • We might consider organizing a seminar with ... two doubly linked lists, and a priority queue ... at a time

8/27/2015

1

Advanced Algorithmsand Data Structures

Prof. Tapio [email protected]

Course Prerequisites

• A seven credit unit course• Replaced OHJ-2156 Analysis of Algorithms• We take things a bit further than OHJ-2156

• We will assume familiarity with– Necessary mathematics– Elementary data structures– Programming

27-Aug-15MAT-72006 AADS, Fall 2015 2

8/27/2015

2

Course Basics

• There will be 4 hours of lectures per week• Weekly exercises start in two weeks time

• We will not have a programming exercise thisyear (unless you demand to have one)

• We might consider organizing a seminar withvoluntary presentations (yielding extra points)at the end of the course

27-Aug-15MAT-72006 AADS, Fall 2015 3

Organization & Timetable

• Lectures: Prof. Tapio Elomaa– Tue & Thu 2–4 PM in TB223 (& TB219 in PII)– Aug. 25 – Dec. 3, 2015

• Period break Oct. 12–18, 2015

• Exercises: M.Sc. Juho Lauri & M.Sc. TeemuHeinimäki,

Mon 10–12 TC315, Start: Sept. 9• Exam: Fri Dec. 18, 2015

27-Aug-15MAT-72006 AADS, Fall 2015 4

8/27/2015

3

Guest Lecture

• Dr. Timo Aho from Solita• In October

– Big data– Definitions of data analytics– Methods– Case studies

27-Aug-15MAT-72006 AADS, Fall 2015 5

Course Grading

27-Aug-15MAT-72006 AADS, Fall 2015 6

• Exam: Maximum of 30 points• Weekly exercises yield extra points

• 40% of questions answered: 1 point• 80% answered: 6 points• In between: linear scale (so that

decimals are possible)• Final grading depends on what we

agree as course components

8/27/2015

4

Material

• The textbook of the course is– Cormen, Leiserson, Rivest, Stein: Introduction

to Algorithms, 3rd ed., MIT Press, 2009• There is no prepared material, the slides

appear in the web as the lectures proceed– http://www.cs.tut.fi/~elomaa/teach/72006.html

• The exam is based on the lectures (i.e., noton the slides only)

27-Aug-15MAT-72006 AADS, Fall 2015 7

Content (Plan)

I FoundationsII (Sorting and) Order StatisticsIII Data StructuresIV Advanced Design and Analysis

TechniquesV Advanced Data StructuresVI Graph AlgorithmsVII Selected Topics

27-Aug-15MAT-72006 AADS, Fall 2015 8

8/27/2015

5

I Foundations

The Role of Algorithms in ComputingGetting Started

Growth of FunctionsRecurrences

Probabilistic Analysis andRandomized Algorithms

II (Sorting and) OrderStatistics

HeapsortQuicksort

Sorting in Linear TimeMedians and Order Statistics

8/27/2015

6

III Data Structures

Elementary Data StructuresHash Tables

Binary Search TreesRed-Black Trees

IV Advanced Design andAnalysis Techniques

Dynamic ProgrammingGreedy AlgorithmsAmortized Analysis

8/27/2015

7

V Advanced DataStructures

B-TreesBinomial Heaps

Fibonacci Heaps

VI Graph Algorithms

MAT-62756 GRAPH THEORY

Elementary Graph AlgorithmsMinimum Spanning Trees

Single-Source Shortest PathsAll-Pairs Shortest Paths

Maximum Flow

8/27/2015

8

VII Selected Topics

Matrix OperationsLinear Programming

Number-Theoretic AlgorithmsApproximation Algorithms

Part of these could also bestudent presentation topics

Online ChiMerge

• In machine learning and data mining algorithms oneoften needs to discretize numerical attributes

• There are initial intervals that cannot be divided(examples that have the same value for the attributein question)

• CHIMERGE is intended as a preprocessing techniquefor learning algorithms

• It combines adjacent initial intervals bottom-up tocome up with the final discretization of the valuerange of the numerical attribute in question

27-Aug-15MAT-72006 AADS, Fall 2015 16

8/27/2015

9

27-Aug-15MAT-72006 AADS, Fall 2015 17

Combine? Combine? Combine?

Combine? Combine?

Combine?

Combine? Combine?

Combine?

• The worst-case time complexity of the batchalgorithm is lg )

• The main requirement for the online versionof CHIMERGE (OCM) is a processing time thatis, per each received training example,logarithmic in the number of intervals

• Thus, we altogether match the lg ) timerequirement of the batch version ofCHIMERGE

27-Aug-15MAT-72006 AADS, Fall 2015 18

8/27/2015

10

• If each new example drawn fromDATASOURCE would update the currentdiscretization immediately, this would yield aslowing down of OCM by a factor ascompared to the batch version

• Therefore, we update the active discretizationonly periodically, freezing it for theintermediate period

• Examples received during this time can behandled at a logarithmic time

• The price to be paid is maintaining someother data structures in addition to the BST

27-Aug-15MAT-72006 AADS, Fall 2015 19

• OCM uses a balanced binary search tree intowhich each example drawn from DATASOURCE is(eventually) submitted

• The algorithm continuously takes in newexamples from DATASOURCE, but instead of juststoring them to , it updates the required datastructures simultaneously

• We do not halt the ow of the data streambecause of extra processing of the datastructures, but amortize the required work tonormal example handling

• BST plus a queue, two doubly linked lists, anda priority queue (heap) are maintained

27-Aug-15MAT-72006 AADS, Fall 2015 20

8/27/2015

11

27-Aug-15MAT-72006 AADS, Fall 2015 21

• The discretization process works in three phases• In Phase 1 (of length time steps), the

examples from DATASOURCE are cached fortemporary storage to the queue

• Meanwhile the initial intervals are collected oneat a time from to a list that holds theintervals in their numerical order

• For each pair of two adjacent intervals, the -value is computed and inserted into the priorityqueue

• The queue is needed to freeze for theduration of the Phase 1

27-Aug-15MAT-72006 AADS, Fall 2015 22

8/27/2015

12

Procedure OCM )Input: Integer giving the number of exampleson which initial discretization is based on and real

which is the threshold for interval mergingData structures: A BST , a queue , two doublylinked lists and , and a priority queue1. Initialize , , , and as empty;2. Read training examples into tree ;3. phase 1; TR E E MIN IM U M );4. Make an empty priority queue of size 1;5. while DA T A SO U R C E () nil do

27-Aug-15MAT-72006 AADS, Fall 2015 23

6. if phase = 1 then EN Q U E U E ( , )7. else TR E E IN S E R T fi;8. if phase = 1 then - - - - - - - - - - - - - - - - - - - - - -9. Insert a copy of as the last item in ;10. if | > 1 then11. Compute the -value of the two last intervals in

;12. Using the obtained value as a key, add a

pointer to the two intervals in into ;13. fi14. TR E E SU C C E S S O R );15. if = nil then phase 2 fi

27-Aug-15MAT-72006 AADS, Fall 2015 24

8/27/2015

13

• After seeing examples, all the initial intervalsin have been inserted to and all the -values have been computed and added to

• At this point holds (unprocessed) examples• At the end of Phase 1 all required preparations

for interval merging have been done• Phase 2 implements the merging as in batch

CHIMERGE, but without stopping the processingof DATASOURCE

• As a result of each merging, the intervals inas well as the related -values in are updated

27-Aug-15MAT-72006 AADS, Fall 2015 25

• In Phase 2 the example received fromDATASOURCE is submitted directly to along withanother, previously received example that iscached in

• The lowest -value, , corresponding to someadjacent pair of intervals and , is drawn from

• If it exceeds the merging threshold, Phase 2is complete

• Otherwise, the intervals and are merged toobtain a new interval , and the -values onboth sides of are updated into

27-Aug-15MAT-72006 AADS, Fall 2015 26

8/27/2015

14

16. else if phase = 2 then - - - - - - - - - - - -- - - - - - -17. DE Q U E U E ); TR E E IN S E R T );18. , EX T R A C T MIN );19. if nil and then20. Remove from P the D-values of I and J;21. LIS T -DE L E T E ); LIS T DE L E T E );22. ME R G E );23. Compute the -values of with its neighbors in

;24. Update the -values into priority queue ;25. else26. phase 3; fi

27-Aug-15MAT-72006 AADS, Fall 2015 27

27. else - - - - - - - - - - - - - Phase = 3 - - - - - - - - - - - -28. DE Q U E U E ); TR E E -IN S E R T );29. LIS T -DE L E T E , HE A D );30. LIS T -IN S E R T );31. if EM P T Y ) then32. phase 1; TR E E -MIN IM U M );33. Make an empty priority queue, size 1;34. fi35. fi36. od

27-Aug-15MAT-72006 AADS, Fall 2015 28

8/27/2015

15

Time Complexity of OCM• The insertion of the received example to

clearly takes constant time, as well as theinsertion of an interval from to

• Also computing the -value can be considered aconstant-time operation

• Inserting the obtained value to requires time(lg )

• Finding a successor in also takes time (lg )• Thus, the total time consumption of one iteration

of Phase 1 is (lg )27-Aug-15MAT-72006 AADS, Fall 2015 29

• Insertion of the new example and one from the queueto the BST both take time (lg )

• Extracting the lowest -value from the priority queue isalso an lg time operation

• As contains pointers to the items holding intervalsand in , removing them can be implemented inconstant time

• Merging of and to obtain can be considered aconstant-time operation as well as computing the new -values for and its neighbors in

• Updating each of the new -values into takes lgtime

• Thus, the total time consumption of one iteration ofPhase 2 is also lg

27-Aug-15MAT-72006 AADS, Fall 2015 30

8/27/2015

16

The sorting problem

• Input: A sequence of numbers, … ,

• Output: A permutation (reordering), … ,

of the input sequence such that

• The numbers that we wish to sort are alsoknown as keys

27-Aug-15MAT-72006 AADS, Fall 2015 31

INSERTION-SORT )1. for 2 to2.3. // Insert ] into the sorted sequence [1. . – 1]4.5. while > 0 and ] >6. + 1] ]7. 18. [ + 1]

27-Aug-15MAT-72006 AADS, Fall 2015 32

8/27/2015

17

5 2 4 6 1 3

27-Aug-15MAT-72006 AADS, Fall 2015 33

2 5 4 6 1 3

2 4 5 6 1 3 2 4 5 6 1 3

1 2 4 5 6 3 1 2 3 4 5 6

Correctness of the Algorithm

• The following loop invariant helps usunderstand why the algorithm is correct:

At the start of each iteration of the forloop of lines 1–8, the subarray [1. . – 1]consists of the elements originally in

[1. . – 1] but in sorted order

27-Aug-15MAT-72006 AADS, Fall 2015 34

8/27/2015

18

Initialization

• The loop invariant holds before the first loopiteration, when = 2:– The subarray, therefore, consists of just the

single element [1]– It is in fact the original element in [1]– This subarray is trivially sorted– Therefore, the loop invariant holds prior to

the first iteration of the loop

27-Aug-15MAT-72006 AADS, Fall 2015 35

Maintenance• Each iteration maintains the loop invariant:

– The body of the for loop works by moving– 1], – 2], – 3], … by one position to

the right until the proper position for isfound (lines 4–7)

– At this point the value of ] is inserted (line8)

– The subarray [1. . ] then consists of theelements originally in [1. . ], but in sortedorder

27-Aug-15MAT-72006 AADS, Fall 2015 36

8/27/2015

19

Termination

• The condition causing the for loop toterminate is that

• Because each loop iteration increases by 1,we must have + 1 at that time

• Substituting + 1 for j in the wording of loopinvariant, we have that the subarray [1. . ]consists of the elements originally in [1. . ],but in sorted order

• [1. . ] is the entire array27-Aug-15MAT-72006 AADS, Fall 2015 37

Analysis of insertion sort

• The time taken by the INSERTION-SORTdepends on the input:– sorting a thousand numbers takes longer than

sorting three numbers• Moreover, the procedure can take different

amounts of time to sort two input sequencesof the same size– depending on how nearly sorted they already

are27-Aug-15MAT-72006 AADS, Fall 2015 38

8/27/2015

20

Input size

• The time taken by an algorithm grows withthe size of the input

• Traditional to describe the running time of aprogram as a function of the size of its input

• For many problems, such as sorting mostnatural measure for input size is the numberof items in the input—i.e., the array size

27-Aug-15MAT-72006 AADS, Fall 2015 39

• For, e.g., multiplying two integers, the bestmeasure is the total number of bits needed torepresent the input in binary notation

• Sometimes, more appropriate to describe thesize with two numbers rather than one

• E.g., if the input to an algorithm is a graph,the input size can be described by thenumbers of vertices and edges in it

27-Aug-15MAT-72006 AADS, Fall 2015 40

8/27/2015

21

Running time

• Running time of an algorithm on an input:– The number of primitive operations (“steps”)

executed• Step as machine-independent as possible• For the moment:

– Constant amount of time to execute each lineof pseudocode

– We assume that each execution of the th linetakes time , where is a constant

27-Aug-15MAT-72006 AADS, Fall 2015 41

27-Aug-15MAT-72006 AADS, Fall 2015 42

INSERTION-SORT ) cost times

1 for 2 to 1

2 ] 2 – 13 // Insert into the sorted sequence [1. . ] 0 – 1

4 – 4 – 1

5 while > 0 and ] > 5

6 + 1] ] 6 1)

7 – 7 1)

8 + 1] 8 – 1

8/27/2015

22

• denotes the number of times the while looptest in line 5 is executed for that value of

• When a for or while loop exits in the usualway, the test is executed one time more thanthe loop body

• Comments are not executable statements, sothey take no time

• Running time of the algorithm is the sum ofthose for each statement executed

27-Aug-15MAT-72006 AADS, Fall 2015 43

• To compute ), the running time ofINSERTION-SORT on an input of values,– we sum the products of the cost and times

columns, obtaining

1 + 1 +

1) 1 + 1)

27-Aug-15MAT-72006 AADS, Fall 2015 44

8/27/2015

23

Best case

• The best case occurs if the array is alreadysorted

• For each = 2, 3, … , , we then nd that] < in line 5 when has its initial value of1

• Thus = 1 for = 2, 3, … , , and the best-caserunning time is

1 + 1 + 1 + 1)

27-Aug-15MAT-72006 AADS, Fall 2015 45

Worst case

• We can express this as for constantsand that depend on the statement costs

• It is a linear function of• The worst case results when the array is in

reverse sorted order — in decreasing order• We must compare each element ] with each

element in the entire sorted subarray [1. . – 1],and so for = 2, 3, … ,

27-Aug-15MAT-72006 AADS, Fall 2015 46

8/27/2015

24

• Note that

=+ 1)

21

and

1 =1)

2by the summation of an arithmetic series

=+ 1)

227-Aug-15MAT-72006 AADS, Fall 2015 47

• The worst-case running time of INSERTION-SORT is

1 + 1+ 1)

2 1 +1)

21)

2 1)

= 2 + 2 + 2 +

( + + )27-Aug-15MAT-72006 AADS, Fall 2015 48

8/27/2015

25

• We can express this worst-case running timeas 2 for constants , , and thatdepend on the statement costs

• It is a quadratic function of• The rate of growth, or order of growth, of

the running time really interests us• We consider only the leading term of a

formula ( 2); the lower-order terms arerelatively insigni cant for large values of

27-Aug-15MAT-72006 AADS, Fall 2015 49

• We also ignore the leading term’s coef cient,constant factors are less signi cant than therate of growth in determining computationalef ciency for large inputs

• For insertion sort, we are left with the factorof 2 from the leading term

• We write that insertion sort has a worst-caserunning time of 2)(“theta of -squared”)

27-Aug-15MAT-72006 AADS, Fall 2015 50

8/27/2015

26

2.3 Designing algorithms

• Insertion sort is an incremental approach: havingsorted [1. . – 1], we insert ] into its properplace, yielding sorted subarray [1. . ]

• Let us examine an alternative design approach,known as “divide-and-conquer”

• We design a sorting algorithm whose worst-caserunning time is much lower

• The running times of divide-and-conqueralgorithms are often easily determined

27-Aug-15MAT-72006 AADS, Fall 2015 51

The divide-and-conquer approach

• Many useful algorithms are recursive:– to solve a problem, they call themselves to

deal with closely related subproblems• These algorithms typically follow a divide-

and-conquer approach:– Break the problem into subproblems that

resemble the original problem but are smaller,– Solve the subproblems recursively,– Combine these solutions to create a solution

to the original problem27-Aug-15MAT-72006 AADS, Fall 2015 52

8/27/2015

27

• The paradigm involves three steps at eachlevel of the recursion:1. Divide the problem into a number of

subproblems that are smaller instances ofthe same problem

2. Conquer the subproblems by solving themrecursively

• If the sizes are small enough, just solve thesubproblems in a straightforward manner

3. Combine the solutions to the subproblemsinto the solution for the original problem

27-Aug-15MAT-72006 AADS, Fall 2015 53

The merge sort algorithm

• Divide: Divide the -element sequence intotwo subsequences of /2 elements each

• Conquer: Sort the two subsequencesrecursively using merge sort

• Combine: Merge the two sortedsubsequences to produce the sorted answer– Recursion “bottoms out” when the sequence

to be sorted has length 1: a sequence oflength 1 is already in sorted order

27-Aug-15MAT-72006 AADS, Fall 2015 54

8/27/2015

28

• The key operation is the merging of two sortedsequences in the “combine” step

• We call auxiliary procedure MERGE ),where is an array and , , and are indicessuch that

• The procedure assumes that the subarrays. . and + 1. . ] are in sorted order and

• merges them to form a single sorted subarraythat replaces the current subarray . . ]

27-Aug-15MAT-72006 AADS, Fall 2015 55

MERGE )1. 1 – + 12. 2 –3. Let [1. . 1 + 1] and

[1. . 2 + 1] be newarrays

4. for 1 to 1

5. – 1]6. for 1 to 2

7.

8. 1 + 1]9. 2 + 1]10. 111. 112. for to13. if14. ]15. + 116. else ]17. + 1

27-Aug-15MAT-72006 AADS, Fall 2015 56

8/27/2015

29

• Line 1 computes the length 1 of the subarray. . ]; similarly for 2 and + 1. . ] on line 2

• Line 3 creates arrays (left) and (right), oflengths 1 + 1 and 2 + 1, respectively– the extra position will hold the sentinel

• The for loop of lines 4–5 copies . . into[1. . ];

• Lines 6–7 copy + 1. . into [1. . 2]• Lines 8–9 put the sentinels at the ends of and

27-Aug-15MAT-72006 AADS, Fall 2015 57

• Lines 10–17 perform the + 1 basic stepsby maintaining the following loop invariant:– At the start of each iteration of the for loop of

lines 12–17, . . – 1] contains the –smallest elements of [1. . 1 + 1] and

[1. . 2 + 1], in sorted order– Moreover, ] and ] are the smallest

elements of their arrays that have not beencopied back into

27-Aug-15MAT-72006 AADS, Fall 2015 58

8/27/2015

30

1 2 3 4 5

1 2 3 6

1 2 3 4 5

2 4 5 7

27-Aug-15MAT-72006 AADS, Fall 2015 59

8 9 10 11 12 13 14 15 16 17

… 2 4 5 7 1 2 3 6 …

MERGE( , 9, 12, 16)

1 2 3 4 5

2 4 5 7

1 2 3 4 5

1 2 3 6

8 9 10 11 12 13 14 15 16 17

… 1 4 5 7 1 2 3 6 …

1 2 3 4 5

1 2 3 6

1 2 3 4 5

2 4 5 7

27-Aug-15MAT-72006 AADS, Fall 2015 60

8 9 10 11 12 13 14 15 16 17

… 1 2 2 7 1 2 3 6 …

8 9 10 11 12 13 14 15 16 17

… 1 2 5 7 1 2 3 6 …

1 2 3 4 5

2 4 5 7

1 2 3 4 5

1 2 3 6

8/27/2015

31

1 2 3 4 5

1 2 3 6

27-Aug-15MAT-72006 AADS, Fall 2015 61

8 9 10 11 12 13 14 15 16 17

… 1 2 2 3 4 2 3 6 …

1 2 3 4 5

2 4 5 7

8 9 10 11 12 13 14 15 16 17

… 1 2 2 3 1 2 3 6 …

1 2 3 4 5

2 4 5 7

i

1 2 3 4 5

1 2 3 6

1 2 3 4 5

1 2 3 6

27-Aug-15MAT-72006 AADS, Fall 2015 62

8 9 10 11 12 13 14 15 16 17

… 1 2 2 3 4 5 6 6 …

1 2 3 4 5

2 4 5 7

8 9 10 11 12 13 14 15 16 17

… 1 2 2 3 4 5 3 6 …

1 2 3 4 5

2 4 5 7

1 2 3 4 5

1 2 3 6

8/27/2015

32

27-Aug-15MAT-72006 AADS, Fall 2015 63

8 9 10 11 12 13 14 15 16 17

… 1 2 2 3 4 5 6 7 …

1 2 3 4 5

2 4 5 7

1 2 3 4 5

1 2 3 6

• The needed – + 1 iterations of the lastfor loop have been executed:• [9. . 16] is sorted, and• the two sentinels in and are the only

two elements in these arrays that havenot been copied into

• MERGE procedure runs in ) time, where– + 1

– each of lines 1–3 and 8–11 takes constanttime

– the for loops of lines 4–7 take 1 2 =) time

– there are iterations of the for loop of lines12–17, each of which takes constant time

27-Aug-15MAT-72006 AADS, Fall 2015 64

8/27/2015

33

Merge sort

• The procedure MERGE-SORT sorts theelements in . .

• If , the subarray has at most oneelement and is therefore already sorted

• Otherwise, the divide step computes an indexthat partitions . . into two subarrays:

– . . ], containing /2 elements– + 1. . ], containing /2 elements

27-Aug-15MAT-72006 AADS, Fall 2015 65

MERGE-SORT )1. if2. )/23. MERGE-SORT )4. MERGE-SORT + 1, )5. MERGE )

27-Aug-15MAT-72006 AADS, Fall 2015 66

8/27/2015

34

Analysis of merge sort

• Our analysis assumes that the originalproblem size is a power of 2

• Each divide step then yields twosubsequences of size exactly /2

• We set up the recurrence for ), the worst-case running time of merge sort onnumbers

• Merge sort on just one element takesconstant time

27-Aug-15MAT-72006 AADS, Fall 2015 67

• When we have > 1 elements, we breakdown the running time as follows:– Divide: The step just computes the middle of

the subarray, which takes constant time:) = (1)

– Conquer: We recursively solve twosubproblems, each of size /2, whichcontributes /2) to the running time

– Combine: the MERGE procedure on an -element array takes time ), and so

) = )

27-Aug-15MAT-72006 AADS, Fall 2015 68

8/27/2015

35

• When we add the ) and ), we areadding functions that are ) and (1)

• This sum is a linear function of• Adding it to the /2) term from the

“conquer” step gives the recurrence for ):

=(1) if = 1

2) + ) if > 1

27-Aug-15MAT-72006 AADS, Fall 2015 69

• To intuitively see that the solution to therecurrence is ) = lg ), where lgstands for log2 , let us rewrite it as

= if = 12) + if > 1

where constant represents time required tosolve problems of size 1 and that per arrayelement of the divide and combine steps

27-Aug-15MAT-72006 AADS, Fall 2015 70

8/27/2015

36

27-Aug-15MAT-72006 AADS, Fall 2015 71

Documents

Advanced Algorithms and Data Structureselomaa/teach/AADS-2015-1.pdf · • We might consider organizing a seminar with ... two doubly linked lists, and a priority queue ... at a time