Multi-Way Merge Sort


Multi-way merge

The basic algorithm is 2-way merge, using 2 output tapes. Assume instead that we have k tapes; then the number of passes is reduced to ⌈log_k(N/M)⌉.

At a given merge step we merge the first k runs, then the second k runs, and so on. The task at each step is to find the smallest of k elements. The solution is a priority queue. The idea: take the smallest element from each of the first k runs and store them in main memory in a heap tree. Then repeatedly output the smallest element from the heap; the smallest element is replaced with the next element from the run from which it came. When finished with the first set of runs, do the same with the next set of runs.
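To make the heap idea concrete, here is a minimal C++ sketch of the in-memory k-way merge step. It is an illustration, not a full external sort: std::vector stands in for the tapes, and the function name mergeRuns is ours.

#include <functional>
#include <iostream>
#include <queue>
#include <tuple>
#include <vector>

// Merge k sorted runs by repeatedly taking the smallest front element
// from a min-heap seeded with the first element of every run.
std::vector<int> mergeRuns(const std::vector<std::vector<int>>& runs) {
    using Entry = std::tuple<int, size_t, size_t>;  // (value, run index, position in run)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.emplace(runs[r][0], r, 0);
    std::vector<int> out;
    while (!heap.empty()) {
        auto [value, r, i] = heap.top();            // smallest of the k fronts
        heap.pop();
        out.push_back(value);
        if (i + 1 < runs[r].size())                 // replace it with the next
            heap.emplace(runs[r][i + 1], r, i + 1); // element from the same run
    }
    return out;
}

int main() {
    // The three runs formed in the example below.
    std::vector<std::vector<int>> runs = {{3, 17, 29}, {18, 24, 56}, {4, 9, 10}};
    for (int x : mergeRuns(runs)) std::cout << x << ' ';  // 3 4 9 10 17 18 24 29 56
}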

External Sorting: Example of Multiway External Sorting

Ta1:

17, 3, 29, 56, 24, 18, 4, 9, 10, 6, 45, 36, 11, 43

Assume that we have three tapes (k = 3) and the memory can hold three records.

A. Main memory sort

The first three records are read into memory, sorted and written on Tb1; the second three records are read into memory, sorted and stored on Tb2; finally, the third three records are read into memory, sorted and stored on Tb3. Now we have one run on each of the three tapes:

Tb1: 3, 17, 29
Tb2: 18, 24, 56
Tb3: 4, 9, 10

The next portion of three records is sorted in main memory and stored as the second run on Tb1:

Tb1: 3, 17, 29 | 6, 36, 45

The next portion, which is also the last one, is sorted and stored onto Tb2:

Tb2: 18, 24, 56 | 11, 43

Nothing is stored on Tb3. Thus, after the main memory sort, our tapes look like this:

Tb1: 3, 17, 29 | 6, 36, 45
Tb2: 18, 24, 56 | 11, 43
Tb3: 4, 9, 10

B. Merging

B.1. Merging runs of length M to obtain runs of length k*M

In our example we merge runs of length 3, and the resulting runs will be of length 9.

a. We build a heap tree in main memory out of the first records in each tape. These records are: 3, 18, and 4.

b. We take the smallest of them, 3, using the deleteMin operation, and store it on tape Ta1. The record '3' belonged to Tb1, so we read the next record from Tb1, 17, and insert it into the heap. Now the heap contains 18, 4, and 17.

c. The next deleteMin operation will output 4, and it will be stored on Ta1. The record '4' comes from Tb3, so we read the next record, 9, from Tb3 and insert it into the heap. Now the heap contains 18, 17 and 9.

d. Proceeding in this way, the first three runs will be stored in sorted order on Ta1.

Ta1:

3, 4, 9, 10, 17, 18, 24, 29, 56

Now it is time to build a heap of the second three runs. (In fact there are only two, and the run on Tb2 is not complete.) The resulting sorted run on Ta2 will be:

Ta2:

6, 11, 36, 43, 45

This finishes the first pass.

B.2. Building runs of length k*k*M

We have now only two tapes: Ta1 and Ta2.

o We build a heap of the first elements of the two tapes, 3 and 6, and output the smallest element, 3, to tape Tb1.
o Then we read the next record from the tape where the record '3' belonged (Ta1) and insert it into the heap.
o Now the heap contains 6 and 4, and using the deleteMin operation the smallest record, 4, is output to tape Tb1.

Proceeding in this way, the entire file will be sorted on tape Tb1.

Tb1: 3, 4, 6, 9, 10, 11, 17, 18, 24, 29, 36, 43, 45, 56

The number of passes for the multiway merging is ⌈log_k(N/M)⌉. In the example this is ⌈log_3(14/3)⌉ = 2.

http://www.personal.kent.edu/~rmuhamma/Algorithms/algorithm.html

STRATEGY 4: DIVIDE AND CONQUER

There is a folk tale about a rich farmer who had seven sons. He was afraid that when he died, his land and his animals and all his possessions would be divided among his seven sons, and that they would quarrel with one another, and that their inheritance would be splintered and lost. So he gathered them together and showed them seven sticks that he had tied together and told them that any one who could break the bundle would inherit everything. They all tried, but no one could break the bundle. Then the old man untied the bundle and broke the sticks one by one. The brothers learned that they should stay together and work together and succeed together.

The moral for problem solvers is different. If you can't solve the problem, divide it into parts, and solve one part at a time. An excellent application of this strategy is the well-known magic squares problem: you have a square formed from three columns and three rows of smaller squares.

Divide-and-Conquer

This is a method of designing algorithms that (informally) proceeds as follows: given an instance of the problem to be solved, split it into several smaller sub-instances (of the same problem), independently solve each of the sub-instances, and then combine the sub-instance solutions so as to yield a solution for the original instance. This description raises the question: by what methods are the sub-instances to be independently solved? The answer to this question is central to the concept of a Divide-and-Conquer algorithm and is a key factor in gauging its efficiency. Consider the following: we have an algorithm, alpha say, which is known to solve all problem instances of size n in at most c n^2 steps (where c is some constant). We then discover an algorithm, beta say, which solves the same problem by:

Dividing an instance into 3 sub-instances of size n/2.
Solving these 3 sub-instances.
Combining the three sub-solutions, taking d n steps to do this.

Suppose our original algorithm, alpha, is used to carry out the `solve these sub-instances' step. Let

T_alpha(n) = running time of alpha
T_beta(n) = running time of beta

Then T_beta(n) = 3 T_alpha(n/2) + d n = (3/4) c n^2 + d n, which is less than c n^2 for all sufficiently large n, so beta is the faster algorithm.

Two sorting algorithms based on divide-and-conquer are merge sort and quicksort.
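As a concrete illustration of the divide/solve/combine pattern, here is a compact merge sort sketch in C++; the function name and layout are ours, a sketch rather than a tuned implementation.

#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

// Sort a[lo, hi) by divide and conquer.
void mergeSort(std::vector<int>& a, size_t lo, size_t hi) {
    if (hi - lo < 2) return;                    // size-1 instances are trivially solved
    size_t mid = lo + (hi - lo) / 2;
    mergeSort(a, lo, mid);                      // divide: solve the two halves...
    mergeSort(a, mid, hi);
    std::vector<int> merged;                    // ...then combine in linear time
    std::merge(a.begin() + lo, a.begin() + mid,
               a.begin() + mid, a.begin() + hi,
               std::back_inserter(merged));
    std::copy(merged.begin(), merged.end(), a.begin() + lo);
}

int main() {
    std::vector<int> v = {17, 3, 29, 56, 24, 18, 4};
    mergeSort(v, 0, v.size());
    for (int x : v) std::cout << x << ' ';      // 3 4 17 18 24 29 56
}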

Dynamic Programming

This paradigm is most often applied in the construction of algorithms to solve a certain class of optimisation problems: problems which require the minimisation or maximisation of some measure. One disadvantage of using Divide-and-Conquer is that the process of recursively solving separate sub-instances can result in the same computations being performed repeatedly, since identical sub-instances may arise. The idea behind dynamic programming is to avoid this pathology by obviating the requirement to calculate the same quantity twice. The method usually accomplishes this by maintaining a table of sub-instance results.

Dynamic Programming is a bottom-up technique, in which the smallest sub-instances are explicitly solved first and the results of these are used to construct solutions to progressively larger sub-instances. In contrast, Divide-and-Conquer is a top-down technique, which logically progresses from the initial instance down to the smallest sub-instances via intermediate sub-instances. We can illustrate these points by considering the problem of calculating the binomial coefficient, "n choose k", i.e. C(n, k) = n! / (k! (n-k)!).

There are three basic elements that characterize a dynamic programming algorithm:

1. Substructure. Decompose the given problem into smaller (and hopefully simpler) subproblems. Express the solution of the original problem in terms of solutions for the smaller problems. Note that unlike divide-and-conquer problems, it is not usually sufficient to consider one decomposition, but many different ones.

2. Table-Structure. After solving the subproblems, store the answers (results) to the subproblems in a table. This is done because (typically) subproblem solutions are reused many times, and we do not want to repeatedly solve the same problem over and over again.

3. Bottom-up Computation. Using the table, combine solutions of smaller subproblems to solve larger subproblems, and eventually arrive at a solution to the complete problem. Bottom-up means:

i. Start with the smallest subproblems.
ii. Combining their solutions, obtain the solutions to subproblems of increasing size.
iii. Continue until you arrive at the solution of the original problem.

Once we have decided to attack the given problem with the dynamic programming technique, the most important step is the formulation of the problem. In other words, the most important question in designing a dynamic programming solution to a problem is how to set up the subproblem structure.
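As a small worked instance, here is a minimal C++ sketch of a bottom-up computation of the binomial coefficient C(n, k) mentioned above, using the substructure C(n, k) = C(n-1, k-1) + C(n-1, k) and a table so that no subproblem is solved twice. The function name is ours.

#include <algorithm>
#include <iostream>
#include <vector>

long long binomial(int n, int k) {
    // table[i][j] holds C(i, j); row i depends only on row i-1.
    std::vector<std::vector<long long>> table(n + 1, std::vector<long long>(k + 1, 0));
    for (int i = 0; i <= n; ++i) {
        table[i][0] = 1;                              // smallest subproblems: C(i, 0) = 1
        for (int j = 1; j <= std::min(i, k); ++j)     // build up from smaller subproblems
            table[i][j] = table[i - 1][j - 1] + table[i - 1][j];
    }
    return table[n][k];
}

int main() {
    std::cout << binomial(5, 2) << '\n';              // 10
}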

Greedy Algorithms"Greedy algorithms work in phases. In each phase, a decision is made that appears to be good, without regard for future consequences. Generally, this means that some local optimum is chosen. This 'take what you can get now' strategy is the source of the name for this class of algorithms. When the algorithm terminates, we hope that the local optimum is equal to the global optimum. If this is the case, then the algorithm is correct; otherwise, the algorithm has produced a suboptimal solution. If the best answer is not required, then simple greedy algorithms are sometimes used to generate approximate answers, rather than using the more complicated algorithms generally required to generate an exact answer."

This is another approach that is often used to design algorithms for solving optimisation problems. In contrast to dynamic programming, however, greedy algorithms do not always yield a genuinely optimal solution. In such cases the greedy method is frequently the basis of a heuristic approach. Even for problems which can be solved exactly by a greedy algorithm, establishing the correctness of the method may be a non-trivial process.

In order to give a precise description of the greedy paradigm we must first consider a more detailed definition of the environment in which typical optimisation problems occur. Thus in an optimisation problem, one will have, in the context of greedy algorithms, the following:

A collection (set, list, etc.) of candidates, e.g. nodes or edges in a graph.
A set of candidates which have already been `used'.
A predicate (solution) to test whether a given set of candidates gives a solution (not necessarily optimal).
A predicate (feasible) to test if a set of candidates can be extended to a (not necessarily optimal) solution.
A selection function (select) which chooses some candidate which has not yet been used.
An objective function which assigns a value to a solution.

In other words: an optimisation problem involves finding a subset, S, from a collection of candidates, C; the subset S must satisfy some specified criteria, i.e. be a solution, and be such that the objective function is optimised by S. `Optimised' may mean minimised or maximised, depending on the problem. Classic problems attacked with greedy algorithms include:

Knapsack Problem
  o 0-1 Knapsack
  o Fractional Knapsack
Activity Selection Problem
Huffman Codes
Minimum Spanning Tree
  o Kruskal's Algorithm
  o Prim's Algorithm
Dijkstra's Algorithm

Kruskal's Algorithm

This minimum spanning tree algorithm was first described by Kruskal in 1956, in the same paper where he rediscovered Jarnik's algorithm. The algorithm was also rediscovered in 1957 by Loberman and Weinberger, but somehow avoided being renamed after them. The basic idea of Kruskal's algorithm is as follows: scan all edges in increasing weight order; if an edge is safe, keep it (i.e. add it to the set A).

Overall Strategy

Kruskal's Algorithm, as described in CLRS, is directly based on the generic MST algorithm. It builds the MST as a forest: initially, each vertex is its own tree in the forest.

Then the algorithm considers each edge in turn, ordered by increasing weight. If an edge (u, v) connects two different trees, then (u, v) is added to the set of edges of the MST, and the two trees connected by the edge (u, v) are merged into a single tree. On the other hand, if an edge (u, v) connects two vertices in the same tree, then edge (u, v) is discarded. A little more formally, we are given a connected, undirected, weighted graph with a weight function w : E → R. The algorithm:

Starts with each vertex being its own component.
Repeatedly merges two components into one by choosing the light edge that connects them (i.e., the light edge crossing the cut between them).
Scans the set of edges in monotonically increasing order by weight.
Uses a disjoint-set data structure to determine whether an edge connects vertices in different components.

Data Structure

Before formalizing the above idea, let's quickly review the disjoint-set data structure from Chapter 21 of CLRS.

MAKE-SET(v): Creates a new set whose only member is pointed to by v. Note that for this operation v must not already be in some other set.
FIND-SET(v): Returns a pointer to the representative of the set containing v.
UNION(u, v): Unites the dynamic sets that contain u and v into a new set that is the union of these two sets.

Algorithm

Start with an empty set A, and select at every stage the shortest edge that has not been chosen or rejected, regardless of where this edge is situated in the graph.

KRUSKAL(V, E, w)
  A ← {}                          ▹ set A will ultimately contain the edges of the MST
  for each vertex v in V
      do MAKE-SET(v)
  sort E into nondecreasing order by weight w
  for each edge (u, v) taken from the sorted list
      do if FIND-SET(u) ≠ FIND-SET(v)
            then A ← A ∪ {(u, v)}
                 UNION(u, v)
  return A
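The pseudocode translates fairly directly into C++. Here is a minimal sketch (the names, graph encoding, and tiny test graph are ours): a disjoint-set with path compression, edges sorted by weight, and the FIND-SET test deciding safe versus rejected edges.

#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>

struct DisjointSet {
    std::vector<int> parent;
    explicit DisjointSet(int n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
    int find(int v) { return parent[v] == v ? v : parent[v] = find(parent[v]); } // path compression
    void unite(int u, int v) { parent[find(u)] = find(v); }
};

struct Edge { int u, v, w; };

std::vector<Edge> kruskal(int n, std::vector<Edge> edges) {
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.w < b.w; }); // nondecreasing weight
    DisjointSet ds(n);
    std::vector<Edge> mst;                       // the set A of the pseudocode
    for (const Edge& e : edges)
        if (ds.find(e.u) != ds.find(e.v)) {      // edge joins two different trees: safe
            mst.push_back(e);
            ds.unite(e.u, e.v);
        }                                        // otherwise: same tree, reject
    return mst;
}

int main() {
    // A small illustrative graph: 4 vertices, 5 weighted edges.
    std::vector<Edge> edges = {{0,1,4}, {1,2,2}, {0,2,5}, {2,3,3}, {1,3,7}};
    for (const Edge& e : kruskal(4, edges))
        std::cout << e.u << '-' << e.v << " (w=" << e.w << ")\n";
}

A production version would also use union by rank, but the simple version is enough to show the structure of the algorithm.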

Illustrative Examples

Let's run through the following graph quickly to see how Kruskal's algorithm works on it:

We get the shaded edges shown in the above figure.

Edge (c, f): safe
Edge (g, i): safe
Edge (e, f): safe
Edge (c, e): reject
Edge (d, h): safe
Edge (f, h): safe
Edge (e, d): reject
Edge (b, d): safe
Edge (d, g): safe
Edge (b, c): reject
Edge (g, h): reject
Edge (a, b): safe

At this point, we have only one component, so all other edges will be rejected. [We could add a test to the main loop of KRUSKAL to stop once |V| − 1 edges have been added to A.]

Note carefully: suppose we had examined (c, e) before (e, f). Then we would have found (c, e) safe and would have rejected (e, f).

Example (CLRS)

Step-by-Step Operation of Kruskal's Algorithm.

Step 1. In the graph, the edge (g, h) is shortest. Either vertex g or vertex h could be the representative. Let's choose vertex g arbitrarily.

Step 2. The edge (c, i) creates the second tree. Choose vertex c as the representative for the second tree.

Step 3. Edge (f, g) is the next shortest edge. Add this edge and choose vertex g as the representative.

Step 4. Edge (a, b) creates a third tree.

Step 5. Add edge (c, f) and merge two trees. Vertex c is chosen as the representative.

Step 6. Edge (g, i) is the next cheapest, but if we added this edge a cycle would be created: vertex c is the representative of the tree containing both endpoints.

Step 7. Instead, add edge (c, d).

Step 8. If we added edge (h, i), it would make a cycle.

Step 9. Instead of adding edge (h, i), add edge (a, h).

Step 10. Again, if we added edge (b, c), it would create a cycle. Add edge (d, e) instead to complete the spanning tree. In this spanning tree all trees are joined and vertex c is the sole representative.

Jarnik's (Prim's) Algorithm

The oldest and simplest MST algorithm was discovered by Boruvka in 1926. Boruvka's algorithm was rediscovered by Choquet in 1938; again by Florek, Lukasiewicz, Perkal, Steinhaus, and Zubrzycki in 1951; and again by Sollin in the early 1960s. The next oldest MST algorithm was first described by the Czech mathematician Vojtech Jarnik in a 1929 letter to Boruvka. The algorithm was independently rediscovered by Kruskal in 1956, by Prim in 1957, by Loberman and Weinberger in 1957, and finally by Dijkstra in 1958. This algorithm is (inappropriately) called Prim's algorithm, or sometimes (even more inappropriately) called 'the Prim/Dijkstra algorithm'. The basic idea of Jarnik's algorithm is very simple: find a safe edge for A and keep it (i.e. add it to the set A).

Overall Strategy

Like Kruskal's algorithm, Jarnik's algorithm, as described in CLRS, is based on a generic minimum spanning tree algorithm. The main idea of Jarnik's algorithm is similar to that of Dijkstra's algorithm for finding shortest paths in a given graph. Jarnik's algorithm has the property that the edges in the set A always form a single tree. We begin with some vertex v in a given graph G = (V, E), defining the initial set of vertices A. Then, in each iteration, we choose a minimum-weight edge (u, v), connecting a vertex v in the set A to a vertex u outside of set A. Then vertex u is brought into A. This process is repeated until a spanning tree is formed. Like Kruskal's algorithm, here too the important fact about MSTs is that we always choose the smallest-weight edge joining a vertex inside set A to one outside set A. The implication of this fact is that it adds only edges that are safe for A; therefore when Jarnik's algorithm terminates, the edges in set A form a minimum spanning tree, the MST.

Details

Jarnik's algorithm builds one tree, so A is always a tree. It starts from an arbitrary "root" r. At each step, find a light edge crossing the cut (VA, V − VA), where VA = the set of vertices that A is incident on. Add this edge to A.

Note that the edges of A are shaded. Now the question is: how do we find the light edge quickly? Use a priority queue Q:

Each object is a vertex in V − VA.
The key of v is the minimum weight of any edge (u, v), where u ∈ VA.
Then the vertex returned by EXTRACT-MIN is the v such that there exists u ∈ VA and (u, v) is the light edge crossing (VA, V − VA).
The key of v is ∞ if v is not adjacent to any vertex in VA.

The edges of A will form a rooted tree with root r:

Root r is given as an input to the algorithm, but it can be any vertex.
Each vertex knows its parent in the tree by the attribute π[v] = parent of v. π[v] = NIL if v = r or v has no parent.

As the algorithm progresses, A = {(v, π[v]) : v ∈ V − {r} − Q}. At termination, VA = V and Q = ∅, so the MST is A = {(v, π[v]) : v ∈ V − {r}}.
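Here is a minimal C++ sketch of this scheme, assuming an adjacency-list graph (the names and test graph are ours). For brevity it uses the common "lazy" variant, which pushes duplicate queue entries and skips stale ones instead of decreasing keys in place, and it returns just the total MST weight rather than the π attributes.

#include <functional>
#include <iostream>
#include <queue>
#include <utility>
#include <vector>

using AdjList = std::vector<std::vector<std::pair<int,int>>>; // (neighbor, weight)

int primTotalWeight(const AdjList& g, int root) {
    std::vector<bool> inA(g.size(), false);
    // Min-heap of (key, vertex): key = weight of the lightest known crossing edge.
    std::priority_queue<std::pair<int,int>,
                        std::vector<std::pair<int,int>>,
                        std::greater<>> pq;
    pq.emplace(0, root);
    int total = 0;
    while (!pq.empty()) {
        auto [key, u] = pq.top(); pq.pop();
        if (inA[u]) continue;               // stale entry: u was already pulled into A
        inA[u] = true;
        total += key;
        for (auto [v, w] : g[u])
            if (!inA[v]) pq.emplace(w, v);  // candidate light edges crossing (VA, V - VA)
    }
    return total;
}

int main() {
    // Same small illustrative graph as in the Kruskal sketch: 4 vertices.
    AdjList g(4);
    auto addEdge = [&](int u, int v, int w) { g[u].push_back({v,w}); g[v].push_back({u,w}); };
    addEdge(0,1,4); addEdge(1,2,2); addEdge(0,2,5); addEdge(2,3,3); addEdge(1,3,7);
    std::cout << primTotalWeight(g, 0) << '\n';   // 9 = 4 + 2 + 3
}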

Dijkstra's Algorithm

Dijkstra's algorithm solves the single-source shortest-path problem when all edges have non-negative weights. It is a greedy algorithm and similar to Prim's algorithm. The algorithm starts at the source vertex, s, and grows a tree, T, that ultimately spans all vertices reachable from s. Vertices are added to T in order of distance: first s, then the vertex closest to s, then the next closest, and so on. The following implementation assumes that graph G is represented by adjacency lists.
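A minimal C++ sketch along those lines, under the same assumptions (adjacency lists, non-negative weights); the names are ours, and the predecessor bookkeeping described in the steps below is omitted for brevity.

#include <functional>
#include <iostream>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

using AdjList = std::vector<std::vector<std::pair<int,int>>>; // (neighbor, weight)

std::vector<int> dijkstra(const AdjList& g, int s) {
    const int INF = std::numeric_limits<int>::max();
    std::vector<int> d(g.size(), INF);             // all nodes start at infinite cost...
    d[s] = 0;                                      // ...except the source
    std::priority_queue<std::pair<int,int>,
                        std::vector<std::pair<int,int>>,
                        std::greater<>> pq;        // min-heap of (distance, vertex)
    pq.emplace(0, s);
    while (!pq.empty()) {
        auto [dist, u] = pq.top(); pq.pop();
        if (dist > d[u]) continue;                 // stale entry
        for (auto [v, w] : g[u])
            if (d[u] + w < d[v]) {                 // relax edge (u, v)
                d[v] = d[u] + w;
                pq.emplace(d[v], v);
            }
    }
    return d;
}

int main() {
    AdjList g(4);                                  // a small illustrative digraph
    g[0] = {{1, 1}, {2, 4}};
    g[1] = {{2, 2}, {3, 6}};
    g[2] = {{3, 3}};
    for (int dist : dijkstra(g, 0)) std::cout << dist << ' ';  // 0 1 3 6
}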

Example: Step-by-step operation of Dijkstra's algorithm.

Step 1. Given the initial graph G = (V, E). All nodes have infinite cost except the source node, s, which has 0 cost.

Step 2. First we choose the node which is closest to the source node, s. We initialize d[s] to 0. Add it to S. Relax all nodes adjacent to the source, s. Update the predecessor (see the red arrow in the diagram below) for all nodes updated.

Step 3. Choose the closest node, x. Relax all nodes adjacent to node x. Update predecessors for nodes u, v and y (again notice red arrows in diagram below).

Step 4. Now, node y is the closest node, so add it to S. Relax node v and adjust its predecessor (red arrows remember!).

Step 5. Now we have node u that is closest. Choose this node and adjust its neighbor node v.

Step 6. Finally, add node v. The predecessor list now defines the shortest path from each node to the source node, s.

Backtracking

Suppose a problem may be expressed in terms of detecting a particular class of subgraph in a graph. Then the backtracking approach to solving such a problem would be: scan each node of the graph, following a specific order, until either

1. a subgraph constituting a solution has been found, or
2. it is discovered that the subgraph built so far cannot be extended to be a solution.

If (2) occurs then the search process is `backed-up' until a node is reached from which a solution might still be found.

Backtracking (copyright 2002 by David Matuszek)

Backtracking is a form of recursion. The usual scenario is that you are faced with a number of options, and you must choose one of these. After you make your choice you will get a new set of options; just what set of options you get depends on what choice you made. This procedure is repeated over and over until you reach a final state. If you made a good sequence of choices, your final state is a goal state; if you didn't, it isn't. Conceptually, you start at the root of a tree; the tree probably has some good leaves and some bad leaves, though it may be that the leaves are all good or all bad. You want to get to a good leaf. At each node, beginning with the root, you choose one of its children to move to, and you keep this up until you get to a leaf. Suppose you get to a bad leaf. You can backtrack to continue the search for a good leaf by revoking your most recent choice, and trying out the next option in that set of options. If you run out of options, revoke the choice that got you here, and try another choice at that node. If you end up at the root with no options left, there are no good leaves to be found. This needs an example.

1. Starting at Root, your options are A and B. You choose A.
2. At A, your options are C and D. You choose C.
3. C is bad. Go back to A.
4. At A, you have already tried C, and it failed. Try D.
5. D is bad. Go back to A.
6. At A, you have no options left to try. Go back to Root.
7. At Root, you have already tried A. Try B.
8. At B, your options are E and F. Try E.
9. E is good. Congratulations!

In this example we drew a picture of a tree. The tree is an abstract model of the possible sequences of choices we could make. There is also a data structure called a tree, but usually we don't have a data structure to tell us what choices we have. (If we do have an actual tree data structure, backtracking on it is called depth-first tree searching.)
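To make this concrete, here is a small C++ sketch of backtracking over the Root/A/B tree from the example above. The Node type and names are ours, and the tree is spelled out explicitly only so the run is reproducible; in a real problem the options would usually be generated on the fly.

#include <iostream>
#include <string>
#include <vector>

struct Node {
    std::string name;
    std::vector<Node> children;   // empty => leaf
    bool good;                    // meaningful only for leaves
};

// Returns true if a good leaf is reachable from n; prints the choices tried.
bool solve(const Node& n) {
    if (n.children.empty()) {
        std::cout << n.name << (n.good ? " is good!\n" : " is bad, backtrack\n");
        return n.good;
    }
    for (const Node& child : n.children) {
        std::cout << "try " << child.name << '\n';
        if (solve(child)) return true;   // commit to this choice
        // otherwise the choice is revoked and the next option is tried
    }
    return false;                        // no options left at this node
}

int main() {
    Node root{"Root", {
        {"A", {{"C", {}, false}, {"D", {}, false}}, false},
        {"B", {{"E", {}, true},  {"F", {}, false}}, false}
    }, false};
    solve(root);  // tries C and D (bad), backtracks, then finds E (good)
}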

Directed graph (from Wikipedia, the free encyclopedia)

[Figure: a directed graph.]

A directed graph or digraph is a pair G = (V, A) (sometimes G = (V, E)) of:[1]

a set V, whose elements are called vertices or nodes,
a set A of ordered pairs of vertices, called arcs, directed edges, or arrows (and sometimes simply edges, with the corresponding set named E instead).

Directed Graphs

The only difference between a directed graph and an undirected one is how the edges are defined. In an undirected graph, an edge is simply defined by the two vertices it connects. But in a directed graph, the direction of the edge matters. For example, let's say a graph has two vertices v and w. In a directed graph with these two vertices, we would be allowed to have more than one edge between them: we could have an edge from v TO w, and another one from w TO v. In essence, each edge not only connects a pair of vertices, but also has a direction associated with it.
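A tiny C++ sketch of this (adjacency lists; names ours): the two edges between v and w are stored separately because each has its own direction.

#include <iostream>
#include <vector>

int main() {
    const int V = 2, v = 0, w = 1;
    std::vector<std::vector<int>> adj(V);  // adj[u] lists vertices u points TO
    adj[v].push_back(w);                   // edge v -> w
    adj[w].push_back(v);                   // a separate, distinct edge w -> v
    for (int u = 0; u < V; ++u)
        for (int t : adj[u])
            std::cout << u << " -> " << t << '\n';
}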

Templates

Function templates

Function templates are special functions that can operate with generic types. This allows us to create a function template whose functionality can be adapted to more than one type or class without repeating the entire code for each type. In C++ this can be achieved using template parameters. A template parameter is a special kind of parameter that can be used to pass a type as an argument: just like regular function parameters can be used to pass values to a function, template parameters allow us to pass types to a function as well. These function templates can use these parameters as if they were any other regular type. The format for declaring function templates with type parameters is:

template <class identifier> function_declaration;
template <typename identifier> function_declaration;

The only difference between the two prototypes is the use of either the keyword class or the keyword typename. Its use is indistinct, since both expressions have exactly the same meaning and behave exactly the same way. For example, to create a template function that returns the greater one of two objects we could use:

template <class myType>
myType GetMax (myType a, myType b) {
  return (a>b? a : b);
}

Here we have created a template function with myType as its template parameter. This template parameter represents a type that has not yet been specified, but that can be used in the template function as if it were a regular type. As you can see, the function template GetMax returns the greater of two parameters of this still-undefined type. To use this function template we use the following format for the function call:

function_name <type> (parameters);

For example, to call GetMax to compare two integer values of type int we can write:

int x, y;
GetMax<int>(x, y);
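The following paragraph refers to a fuller example that uses T as the template parameter name and calls GetMax with both int and long arguments; here is a reconstruction along the lines of the original cplusplus.com tutorial this section follows:

// function template
#include <iostream>
using namespace std;

template <class T>
T GetMax (T a, T b) {
  T result;
  result = (a>b)? a : b;   // the generic type T is even used for a local object
  return (result);
}

int main () {
  int i=5, j=6, k;
  long l=10, m=5, n;
  k=GetMax<int>(i,j);      // instantiated with int
  n=GetMax<long>(l,m);     // instantiated with long
  cout << k << endl;       // 6
  cout << n << endl;       // 10
  return 0;
}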

When the compiler encounters this call to a template function, it uses the template to automatically generate a function replacing each appearance of myType by the type passed as the actual template parameter (int in this case) and then calls it. This process is automatically performed by the compiler and is invisible to the programmer. In the fuller example we used T as the template parameter name instead of myType because it is shorter and in fact is a very common template parameter name; but you can use any identifier you like. There we used the function template GetMax() twice, the first time with arguments of type int and the second with arguments of type long, and the compiler instantiated and then called the appropriate version of the function each time. As you can see, the type T is used within the GetMax() template function even to declare new objects of that type:

T result;

Therefore, result will be an object of the same type as the parameters a and b when the function template is instantiated with a specific type. In this specific case where the generic type T is used as a parameter for GetMax, the compiler can find out automatically which data type it has to instantiate, without our having to explicitly specify it within angle brackets (as we did before, specifying <int> and <long>). So we could have written instead:

int i, j;
GetMax(i, j);

Since both i and j are of type int, the compiler can automatically find out that the template parameter can only be int. This implicit method produces exactly the same result.