
SNJB's Late Sau. Kantabai Bhavarlaji Jain College of Engineering
Department of Computer Engineering

Academic Year: 2019-20 (Semester 1)

Mock End Sem Exam Paper Solution
High Performance Computing

Class: BE Computer    Date: 01/10/2019    Time: 09:30 to 12:00    Marks: 70    Duration: 2.5 Hr.

Q 1(a) Explain SIMD, MIMD and SIMT architecture.

• Parallelism can be expressed at various levels of granularity, from the instruction level to processes.

• Between these extremes exist a range of models, along with corresponding architectural support.

• Processing units in parallel computers either operate under the centralized control of a single control unit or work independently.

• If a single control unit dispatches the same instruction to various processors (which work on different data), the model is referred to as single instruction stream, multiple data stream (SIMD).

• If each processor has its own control unit, each processor can execute different instructions on different data items. This model is called multiple instruction stream, multiple data stream (MIMD).


• Some of the earliest parallel computers, such as the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1, belonged to this class of machines.

• Variants of this concept have found use in co-processing units such as the MMX units in Intel processors and DSP chips such as the Sharc.

• SIMD relies on the regular structure of computations (such as those in image processing).

• It is often necessary to selectively turn off operations on certain data items. For this reason, most SIMD programming paradigms allow for an "activity mask", which determines whether a processor should participate in a computation or not.

• SIMT (single instruction, multiple threads) is the closely related model used by NVIDIA GPUs: groups of threads (warps) execute a common instruction at a time, as in SIMD, but each thread has its own registers and data, and divergent branches are handled by hardware masking. SIMT thus combines SIMD-style execution with a multithreaded programming model.
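As an illustrative sketch (not part of the original solution; the kernel and array names are hypothetical), the following CUDA kernel shows the SIMT model with an implicit activity mask: every thread in a warp receives the same instruction stream, and the per-thread condition masks off threads that should not participate.

    // Hypothetical CUDA kernel: scale only the non-negative elements.
    // All threads in a warp execute the same instruction stream (SIMT);
    // the 'if' acts as an activity mask that idles non-participating threads.
    __global__ void scale_nonnegative(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && data[i] >= 0.0f)   /* activity mask */
            data[i] *= factor;
    }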

Q 1(b) Explain the basic working principle of a VLIW processor.

• The hardware cost and complexity of the superscalar scheduler is a major consideration in processor design.

• To address this issue, VLIW processors rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently.

• These instructions are packed and dispatched together, hence the name very long instruction word.


• This concept was used with some commercial success in the Multiflow Trace machine (circa 1984).

• Variants of this concept are employed in the Intel IA64 processors.

• Issue hardware is simpler.

• The compiler has a bigger context from which to select co-scheduled instructions.

• Compilers, however, do not have runtime information such as cache misses. Scheduling is, therefore, inherently conservative.

• Branch and memory prediction is more difficult.

• VLIW performance is highly dependent on the compiler. Techniques such as loop unrolling, speculative execution, and branch prediction are critical (a loop-unrolling sketch follows this list).

• Typical VLIW processors are limited to 4-way to 8-way parallelism.
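As a hedged illustration (not from the original solution; the function name is hypothetical), the sketch below shows the kind of loop unrolling a VLIW compiler performs automatically, written out by hand in C: the four statements in the unrolled body are independent, so they can be packed into one very long instruction word.

    /* Unrolled-by-4 vector add. The four adds in the body are independent,
       so a VLIW compiler can bundle them into a single instruction word.
       Assumes n is a multiple of 4, for brevity. */
    void vec_add_unrolled(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            c[i]     = a[i]     + b[i];
            c[i + 1] = a[i + 1] + b[i + 1];
            c[i + 2] = a[i + 2] + b[i + 2];
            c[i + 3] = a[i + 3] + b[i + 3];
        }
    }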

Q 2(a) Explain any three data decomposition techniques with examples.

So how does one decompose a task into various subtasks? While there is no single recipe that works for all problems, we present a set of commonly used techniques that apply to broad classes of problems. These include:

• Recursive decomposition

• Data decomposition

• Exploratory decomposition

• Speculative decomposition

Recursive decomposition

• Generally suited to problems that are solved using the divide-and-conquer strategy.

• A given problem is first decomposed into a set of sub-problems.


• These sub-problems are recursively decomposed further until a desired granularity is reached (see the sketch below).
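As an illustrative sketch (not from the original solution; the function name and base-case cutoff are hypothetical), the recursive function below decomposes an array sum into two independent sub-problems, each of which could be executed as a concurrent task:

    /* Recursive decomposition of an array sum: each call splits the problem
       into two independent sub-problems until a desired granularity is
       reached; the two recursive calls could run as concurrent tasks. */
    double recursive_sum(const double *a, int n)
    {
        if (n <= 1024) {                 /* base case: solve serially */
            double s = 0.0;
            for (int i = 0; i < n; i++)
                s += a[i];
            return s;
        }
        int half = n / 2;
        return recursive_sum(a, half) + recursive_sum(a + half, n - half);
    }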

Data decomposition

• Identify the data on which computations are performed.

• Partition this data across various tasks.

• This partitioning induces a decomposition of the problem.

• Data can be partitioned in various ways; this critically impacts the performance of a parallel algorithm.

• Often, each element of the output can be computed independently of the others (simply as a function of the input).

• A partition of the output across tasks then decomposes the problem naturally, as the sketch below illustrates.
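As a hedged example (not in the original solution; names are hypothetical), consider output data decomposition for a dense matrix-vector product y = A*x: each output element y[i] depends only on row i of A and on x, so partitioning the output vector across tasks partitions the work.

    /* Output data decomposition for y = A*x (A is n x n, row-major):
       task 'tid' of 'ntasks' computes an independent subset of output
       elements, each purely a function of the shared input. */
    void matvec_task(const double *A, const double *x, double *y,
                     int n, int tid, int ntasks)
    {
        for (int i = tid; i < n; i += ntasks) {   /* cyclic output partition */
            double s = 0.0;
            for (int j = 0; j < n; j++)
                s += A[i * n + j] * x[j];
            y[i] = s;
        }
    }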


Exploratory Decomposition

• In many cases, the decomposition of the problem goes hand-in-hand with its execution.

• These problems typically involve the exploration (search) of a state space of solutions.

• Problems in this class include a variety of discrete optimization problems (0/1 integer programming, QAP, etc.), theorem proving, game playing, and so on.

Q 2(b) Give the Characteristics of tasks

Once a problem has been decomposed into independent tasks, the characteristics of these tasks critically impact the choice and performance of parallel algorithms. Relevant task characteristics include:

• Task generation.

• Task sizes.

• Size of data associated with tasks.

• Static task generation: Concurrent tasks can be identified a priori. Typical matrix operations, graph algorithms, image processing applications, and other regularly structured problems fall in this class. These can typically be decomposed using data or recursive decomposition techniques.

• Dynamic task generation: Tasks are generated as we perform the computation. A classic example of this is in game playing: each 15-puzzle board is generated from the previous one. These applications are typically decomposed using exploratory or speculative decompositions.

Q 3(a) Explain the broadcast and reduction example for multiplying a matrix with a vector.


Parallel algorithms often require a single process to send identical data to all other processes or to a subset of them. This operation is known as one-to-all broadcast. Initially, only the source process has the data of size m that needs to be broadcast. At the termination of the procedure, there are p copies of the initial data, one belonging to each process. The dual of one-to-all broadcast is all-to-one reduction. In an all-to-one reduction operation, each of the p participating processes starts with a buffer M containing m words. The data from all processes are combined through an associative operator and accumulated at a single destination process into one buffer of size m. Reduction can be used to find the sum, product, maximum, or minimum of sets of numbers: the i-th word of the accumulated M is the sum, product, maximum, or minimum of the i-th words of each of the original buffers. The figure shows one-to-all broadcast and all-to-one reduction among p processes.
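For the matrix-vector case, here is a minimal MPI sketch (my own illustration, not the original solution's code; it assumes n is divisible by p and that each process owns a block of n/p columns of A): the vector x is broadcast to all processes, each process computes a partial result vector from its columns, and an all-to-one reduction sums the partials at the root.

    /* y = A*x via one-to-all broadcast and all-to-one reduction. */
    #include <mpi.h>
    #include <stdlib.h>

    void matvec_bcast_reduce(const double *Acols, /* n x (n/p) local columns */
                             double *x,  /* length-n buffer; valid on root */
                             double *y,  /* length-n result; valid on root */
                             int n, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        int nloc = n / p;

        /* One-to-all broadcast: every process receives the full vector x. */
        MPI_Bcast(x, n, MPI_DOUBLE, 0, comm);

        /* Each process forms a partial result from its columns of A. */
        double *ypart = calloc(n, sizeof(double));
        for (int j = 0; j < nloc; j++)
            for (int i = 0; i < n; i++)
                ypart[i] += Acols[i * nloc + j] * x[rank * nloc + j];

        /* All-to-one reduction: elementwise sums accumulate at process 0. */
        MPI_Reduce(ypart, y, n, MPI_DOUBLE, MPI_SUM, 0, comm);
        free(ypart);
    }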

Q 3(b) Explain the Concept of Scatter and Gather

Let's assume you have some number P of parallel processes doing some work; give them numbers from 1 to P. Now, suppose one of them has a data array, A, of P elements and wants to send each other process the corresponding i-th element. Then scatter is the pattern to use. If you want the inverse, that is, to collect an element from each process in order, then gather is the operation. Good illustrations of both appear in the tutorial "MPI Scatter, Gather, and Allgather"; a hedged code sketch is given below.
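The following MPI program is my own minimal sketch (not from the original solution): the root scatters one element of A to each process, each process does some local work on its element, and the results are gathered back in rank order.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int A[64] = {0}, elem, result[64];   /* assumes p <= 64 */
        if (rank == 0)
            for (int i = 0; i < p; i++)
                A[i] = i + 1;

        /* Scatter: process i receives A[i]. */
        MPI_Scatter(A, 1, MPI_INT, &elem, 1, MPI_INT, 0, MPI_COMM_WORLD);

        elem *= 2;   /* local work on the received element */

        /* Gather: the inverse pattern; results arrive in rank order. */
        MPI_Gather(&elem, 1, MPI_INT, result, 1, MPI_INT, 0, MPI_COMM_WORLD);

        if (rank == 0)
            for (int i = 0; i < p; i++)
                printf("result[%d] = %d\n", i, result[i]);

        MPI_Finalize();
        return 0;
    }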


Q 4(a) Explain sources of overhead in parallel programs

Using twice as many hardware resources, one can reasonably expect a program to run twice as fast. However, in typical parallel programs, this is rarely the case, due to a variety of overheads associated with parallelism. An accurate quantification of these overheads is critical to the understanding of parallel program performance.

A typical execution profile of a parallel program is illustrated in the figure. In addition to performing essential computation (i.e., computation that would be performed by the serial program for solving the same problem instance), a parallel program may also spend time in interprocess communication, idling, and excess computation (computation not performed by the serial formulation).

Interprocess Interaction: Any nontrivial parallel system requires its processing elements to interact and communicate data (e.g., intermediate results). The time spent communicating data between processing elements is usually the most significant source of parallel processing overhead.


Idling: Processing elements in a parallel system may become idle for many reasons, such as load imbalance, synchronization, and the presence of serial components in a program. In many parallel applications (for example, when task generation is dynamic), it is impossible (or at least difficult) to predict the size of the subtasks assigned to various processing elements. Hence, the problem cannot be subdivided statically among the processing elements while maintaining a uniform workload. If different processing elements have different workloads, some processing elements may be idle during part of the time that others are working on the problem. In some parallel programs, processing elements must synchronize at certain points during parallel program execution. If all processing elements are not ready for synchronization at the same time, then the ones that are ready sooner will be idle until all the rest are ready. Parts of an algorithm may be unparallelizable, allowing only a single processing element to work on them. While one processing element works on the serial part, all the other processing elements must wait.

Excess Computation: The fastest known sequential algorithm for a problem may be difficult or impossible to parallelize, forcing us to use a parallel algorithm based on a poorer but more easily parallelizable (that is, one with a higher degree of concurrency) sequential algorithm. The difference between the computation performed by the parallel program and that performed by the best serial program is the excess computation overhead incurred by the parallel program.

Q 4(b) Explain the effect of Granularity on Performance

Consider an instance of an algorithm that is not cost-optimal: one that uses as many processing elements as the number of inputs, which is excessive. In practice, we assign larger pieces of input data to processing elements. This corresponds to increasing the granularity of computation on the processing elements. Using fewer than the maximum possible number of processing elements to execute a parallel algorithm is called scaling down a parallel system in terms of the number of processing elements. A naive way to scale down a parallel system is to design a parallel algorithm for one input element per processing element, and then use fewer processing elements to simulate a large number of processing elements. If there are n inputs and only p processing elements (p < n), we can use the parallel algorithm designed for n processing elements by assuming n virtual processing elements and having each of the p physical processing elements simulate n/p virtual processing elements.


As the number of processing elements decreases by a factor of n/p, the computation at each processing element increases by a factor of n/p, because each processing element now performs the work of n/p processing elements. If virtual processing elements are mapped appropriately onto physical processing elements, the overall communication time does not grow by more than a factor of n/p. The total parallel runtime increases, at most, by a factor of n/p, and the processor-time product does not increase. Therefore, if a parallel system with n processing elements is cost-optimal, using p processing elements (where p < n) to simulate n processing elements preserves cost-optimality, as the short derivation below makes explicit.
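As a brief worked restatement (my own, not in the original solution; TS and TP are the serial and parallel runtimes defined in Q5(b)), write T_P(n) for the parallel time on n processing elements and T_P(p) for the time when p physical elements each simulate n/p virtual ones:

    % Each physical element does the work of n/p virtual ones, and
    % communication grows by at most the same factor, so:
    T_P(p) \;\le\; \frac{n}{p}\, T_P(n)
    \quad\Longrightarrow\quad
    p\, T_P(p) \;\le\; n\, T_P(n) \;=\; \Theta(T_S)

The processor-time product is therefore unchanged, which is exactly the cost-optimality condition.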

A drawback of this naive method of increasing computational granularity is that if a parallel system is not cost-optimal to begin with, it may still not be cost-optimal after the granularity of computation increases. This is illustrated by the following example for the problem of adding n numbers.

Q 5(a) Explain in detail the scalability of parallel systems.

Very often, programs are designed and tested for smaller problems on fewer processing elements. However, the real problems these programs are intended to solve are much larger, and the machines contain a larger number of processing elements. Whereas code development is simplified by using scaled-down versions of the machine and the problem, the performance and correctness of programs are much more difficult to establish based on scaled-down systems. In this section, we investigate techniques for evaluating the scalability of parallel programs using analytical tools.


Q 5(b) What are the performance metrics for parallel systems?

Performance Metrics for Parallel Systems

It is important to study the performance of parallel programs with a view to determining the best algorithm, evaluating hardware platforms, and examining the benefits from parallelism. A number of metrics have been used based on the desired outcome of performance analysis.

Execution Time

The serial runtime of a program is the time elapsed between the beginning and the end of its execution on a sequential computer. The parallel runtime is the time that elapses from the moment a parallel computation starts to the moment the last processing element finishes execution. We denote the serial runtime by TS and the parallel runtime by TP.

Total Parallel Overhead

The overheads incurred by a parallel program are encapsulated into a single expression referred to as the overhead function. We define the overhead function, or total overhead, of a parallel system as the total time collectively spent by all the processing elements over and above that required by the fastest known sequential algorithm for solving the same problem on a single processing element. We denote the overhead function of a parallel system by the symbol To.


The total time spent in solving a problem summed over all processing elements is pTP. TS units of this time are spent performing useful work, and the remainder is overhead. Therefore, the overhead function To is given by To = pTP - TS.

Q 6(a) Explain parallel depth-first search in detail.

We start our discussion of parallel depth-first search by focusing on simple backtracking. Parallel formulations of depth-first branch-and-bound and IDA* are similar to those discussed in this section.

The critical issue in parallel depth-first search algorithms is the distribution of the search space among the processors. Consider the tree shown in the figure. Note that the left subtree (rooted at node A) can be searched in parallel with the right subtree (rooted at node B). By statically assigning a node in the tree to a processor, it is possible to expand the whole subtree rooted at that node without communicating with another processor. Thus, it seems that such a static allocation yields a good parallel search algorithm.

Fig: The unstructured nature of tree search and the imbalance resulting from static partitioning.

In dynamic load balancing, when a processor runs out of work, it gets more work from another processor that has work. Consider the two-processor partitioning of the tree in the figure. Assume that nodes A and B are assigned to the two processors as we just described. In this case, when the processor searching the subtree rooted at node A runs out of work, it requests work from the other processor. Although the dynamic distribution of work results in communication overhead for work requests and work transfers, it reduces load imbalance among processors. This section explores several schemes for dynamically balancing the load between processors.

Q 6(b) What are the issues in sorting on parallel computers?

Issues in Sorting on Parallel Computers

Parallelizing a sequential sorting algorithm involves distributing the elements to be sorted onto the available processes. This process raises a number of issues that we must address in order to make the presentation of parallel sorting algorithms clearer.

Where the Input and Output Sequences are Stored

In sequential sorting algorithms, the input and the sorted sequences are stored in the process's memory. However, in parallel sorting there are two places where these sequences can reside. They may be stored on only one of the processes, or they may be distributed among the processes. The latter approach is particularly useful if sorting is an intermediate step in another algorithm. Here, we assume that the input and sorted sequences are distributed among the processes.

Consider the precise distribution of the sorted output sequence among the processes. A general method of distribution is to enumerate the processes and use this enumeration to specify a global ordering for the sorted sequence. In other words, the sequence will be sorted with respect to this process enumeration. For instance, if Pi comes before Pj in the enumeration, all the elements stored in Pi will be smaller than those stored in Pj. We can enumerate the processes in many ways. For certain parallel algorithms and interconnection networks, some enumerations lead to more efficient parallel formulations than others.

How Comparisons are Performed


A sequential sorting algorithm can easily perform a compare-exchange on two elements because they are stored locally in the process's memory. In parallel sorting algorithms, this step is not so easy. If the elements reside on the same process, the comparison can be done easily. But if the elements reside on different processes, the situation becomes more complicated.

One Element Per Process

Consider the case in which each process holds only one element of the sequence to be sorted. At some point in the execution of the algorithm, a pair of processes (Pi, Pj) may need to compare their elements, ai and aj. After the comparison, Pi will hold the smaller and Pj the larger of {ai, aj}. We can perform the comparison by having both processes send their elements to each other. Each process compares the received element with its own and retains the appropriate element. In our example, Pi will keep the smaller and Pj will keep the larger of {ai, aj}. As in the sequential case, we refer to this operation as compare-exchange. As the figure illustrates, each compare-exchange operation requires one comparison step and one communication step; a hedged MPI sketch is given below.
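The following is my own minimal MPI sketch of this compare-exchange step (not from the original solution): both processes exchange their elements in a single MPI_Sendrecv, then the lower-ranked process keeps the minimum and the higher-ranked process keeps the maximum.

    #include <mpi.h>

    /* Compare-exchange between this process and 'partner', each holding
       one element. Returns the element this process retains. */
    int compare_exchange(int my_elem, int partner, MPI_Comm comm)
    {
        int rank, recv_elem;
        MPI_Comm_rank(comm, &rank);

        /* One communication step: simultaneous send and receive. */
        MPI_Sendrecv(&my_elem, 1, MPI_INT, partner, 0,
                     &recv_elem, 1, MPI_INT, partner, 0,
                     comm, MPI_STATUS_IGNORE);

        /* One comparison step: Pi keeps the smaller, Pj the larger. */
        if (rank < partner)
            return (my_elem < recv_elem) ? my_elem : recv_elem;
        else
            return (my_elem > recv_elem) ? my_elem : recv_elem;
    }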

Q 7(a) Explain bubble sort and its variants

Since serial algorithms with Θ(n log n) time complexity exist, we should be able to use Θ(n) processes to sort n elements in time Θ(log n). As we will see, this is difficult to achieve. We can, however, easily parallelize many sequential sorting algorithms that have Θ(n^2) complexity. The algorithms we present are based on bubble sort.

The sequential bubble sort algorithm compares and exchanges adjacent elements in the sequence to be sorted. Given a sequence <a1, a2, ..., an>, the algorithm first performs n - 1 compare-exchange operations in the following order: (a1, a2), (a2, a3), ..., (an-1, an). This step moves the largest element to the end of the sequence. The last element in the transformed sequence is then ignored, and the sequence of compare-exchanges is applied to the resulting sequence. The sequence is sorted after n - 1 iterations. We can improve the performance of bubble sort by terminating when no exchanges take place during an iteration. The bubble sort algorithm is shown below.

    /* Bubble sort (the original pseudocode made runnable in C): repeatedly
       compare-exchange adjacent elements; after each outer pass the largest
       remaining element reaches its final position at the end. */
    void bubble_sort(int a[], int n)
    {
        for (int i = n - 1; i >= 1; i--)
            for (int j = 0; j < i; j++)
                if (a[j] > a[j + 1]) {      /* compare-exchange(a[j], a[j+1]) */
                    int tmp = a[j];
                    a[j] = a[j + 1];
                    a[j + 1] = tmp;
                }
    }

Q 7(b) Explain Cuda Architecture

Technology trends and advances in video games and graphics techniques have led to a need for extremely powerful dedicated computational hardware to perform the necessary calculations. Graphics hardware companies such as AMD/ATI and NVIDIA have developed graphics processors capable of massively parallel processing, with the large throughput and memory bandwidth typically necessary for displaying high-resolution graphics. However, these hardware devices have the potential to be re-purposed and used for other non-graphics-related work. NVIDIA provides a programming interface known as CUDA (Compute Unified Device Architecture) which allows direct programming of the NVIDIA hardware. Using NVIDIA devices to execute massively parallel algorithms can yield a many-times speedup over sequential implementations on conventional CPUs.


Q 8(a) Explain Parallel Programming in CUDA-C

The CUDA platform is designed to work with programming languages such as C, C++, and Fortran. This accessibility makes it easier for specialists in parallel programming to use GPU resources, in contrast to prior APIs like Direct3D and OpenGL, which required advanced skills in graphics programming.[3] CUDA also supports programming frameworks such as OpenACC and OpenCL.[2] When it was first introduced by Nvidia, the name CUDA was an acronym for Compute Unified Device Architecture, but Nvidia subsequently dropped the common use of the acronym.


Example of CUDA processing flow

1. Copy data from main memory to GPU memory

2. CPU initiates the GPU compute kernel

3. GPU's CUDA cores execute the kernel in parallel

4. Copy the resulting data from GPU memory to main memory
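A minimal CUDA-C sketch of these four steps (my own illustration, not from the original solution; the kernel and variable names are hypothetical):

    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void add_one(float *v, int n)   /* kernel run in step 3 */
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            v[i] += 1.0f;
    }

    int main(void)
    {
        const int n = 1024;
        float h[1024], *d;
        for (int i = 0; i < n; i++) h[i] = (float)i;

        cudaMalloc((void **)&d, n * sizeof(float));
        cudaMemcpy(d, h, n * sizeof(float),
                   cudaMemcpyHostToDevice);              /* 1. host -> GPU   */
        add_one<<<(n + 255) / 256, 256>>>(d, n);         /* 2. CPU launches  */
        cudaDeviceSynchronize();                         /* 3. cores execute */
        cudaMemcpy(h, d, n * sizeof(float),
                   cudaMemcpyDeviceToHost);              /* 4. GPU -> host   */
        cudaFree(d);

        printf("h[0] = %f\n", h[0]);   /* expect 1.0 */
        return 0;
    }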

Q 8(b) Explain applications of CUDA.

The main strong point of CUDA is highly parallel number crunching. Fortunately, this is a very common type of problem encountered in many high performance computing problems. Here is a list of some example applications which have been created using CUDA to achieve performance that is simply not possible on a CPU alone.

Fast Video Transcoding

Transcoding is a very common and highly complex procedure which easily involves trillions of parallel computations, many of which are floating point operations. Applications such as Badaboom have been created which harness the raw computing power of GPUs in order to transcode video much faster than ever before. For example, if you want to transcode a DVD so it will play on your iPod, it may take several hours to fully transcode. However, with Badaboom, it is possible to transcode the movie, or any video file, faster than real time.

Video Enhancement

Complicated video enhancement techniques often require an enormous amount of computation. For example, there are algorithms that can upscale a movie by using information from frames surrounding the current frame. This involves too many computations for a CPU to handle in real time. ArcSoft was able to create a plugin for its movie player which uses CUDA in order to perform DVD upscaling in real time. This is an amazing feat, and it greatly enhances the movie watching experience if you have a high definition monitor. This is a fine example of how mainstream programs are harnessing the computational power of CUDA in order to delight their customers. Another fine example is vReveal, which is able to perform a variety of enhancements to motion video, and then save the resulting video.

Oil and Natural Resource Exploration

The first two topics had to do with video, which is naturally suited to the video card. Now it's time to talk about more serious technologies involving oil, gas, and other natural resource exploration. Even using a variety of techniques, it is overwhelmingly difficult to construct a 3D view of what lies underground, especially when the ground is deeply submerged under a sea. Scientists used to work with very small sample sets and low resolutions in order to find possible sources of oil. Because the ground reconstruction algorithms are highly parallel, CUDA is perfectly suited to this type of challenge, and it is now being used to find oil sources more quickly.

Medical Imaging

CUDA is a significant advancement for the field of medical imaging. Using CUDA, MRI machines can now compute images faster than ever possible before, and for a lower price. Before CUDA, it could take an entire day to make a diagnosis of breast cancer. Now with CUDA, this can take 30 minutes. In fact, patients no longer need to wait 24 hours for the results, which will benefit many people.

Computational Sciences

In the raw field of computational sciences, CUDA is very advantageous. For example, it is now possible to use CUDA with MATLAB, which can speed up computations by a great amount. Other common tasks, such as computing eigenvalues, SVD decompositions, and other matrix mathematics, can use CUDA in order to speed up calculations.

Neural Networks

I personally worked on a program which required the training of several thousand neural networks on a large set of training data. Using the Core 2 Duo CPU that was available to me, it would have taken over a month to get a solution. However, with CUDA, I was able to reduce the time to solution to under 12 hours.

Gate-level VLSI Simulation

In college, my friend and I were able to create a simple gate-level VLSI simulation tool which used CUDA. Speedups were anywhere from 4x to 70x, depending on the circuit and the stimulus to the circuit.

Fluid Dynamics

Fluid dynamics simulations have also been created. These simulations require a huge number of calculations, and are useful for wing design and other engineering tasks.