西安交大并行计算理论赵银亮课件第五章.ppt (Xi'an Jiaotong University, Parallel Computing Theory, Zhao Yinliang's lecture slides, Chapter 5)


Analytical Modeling of Parallel Systems
Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar
To accompany the text "Introduction to Parallel Computing", Addison Wesley, 2003.

Topic Overview
- Sources of Overhead in Parallel Programs
- Performance Metrics for Parallel Systems
- Effect of Granularity on Performance
- Scalability of Parallel Systems
- Minimum Execution Time and Minimum Cost-Optimal Execution Time
- Asymptotic Analysis of Parallel Programs
- Other Scalability Metrics

Analytical Modeling - Basics
A sequential algorithm is evaluated by its runtime (in general, asymptotic runtime as a function of input size). The asymptotic runtime of a sequential program is identical on any serial platform. The parallel runtime of a program, however, depends on the input size, the number of processors, and the communication parameters of the machine. An algorithm must therefore be analyzed in the context of the underlying platform. A parallel system is a combination of a parallel algorithm and an underlying platform.

Analytical Modeling - Basics
A number of performance measures are intuitive. Wall clock time is the time from the start of the first processor to the stopping time of the last processor in a parallel ensemble. But how does this scale when the number of processors is changed or the program is ported to another machine altogether? How much faster is the parallel version? This begs the obvious follow-up question: what is the baseline serial version with which we compare? Can we use a suboptimal serial program to make our parallel program look good? Raw FLOP count: what good are FLOP counts when they don't solve a problem?

Sources of Overhead in Parallel Programs
If I use two processors, shouldn't my program run twice as fast? No - a number of overheads, including wasted computation, communication, idling, and contention, cause degradation in performance.
[Figure: the execution profile of a hypothetical parallel program executing on eight processing elements, showing the time spent performing computation (both essential and excess), communication, and idling.]

Sources of Overheads in Parallel Programs
- Interprocess interactions: processors working on any non-trivial parallel problem will need to talk to each other.
- Idling: processes may idle because of load imbalance, synchronization, or serial components.
- Excess computation: computation not performed by the serial version. This might be because the serial algorithm is difficult to parallelize, or because some computations are repeated across processors to minimize communication.

Performance Metrics for Parallel Systems: Execution Time
The serial runtime of a program is the time elapsed between the beginning and the end of its execution on a sequential computer. The parallel runtime is the time that elapses from the moment the first processor starts to the moment the last processor finishes execution. We denote the serial runtime by TS and the parallel runtime by TP.

Performance Metrics for Parallel Systems: Total Parallel Overhead
Let Tall be the total time collectively spent by all the processing elements, and let TS be the serial time. Observe that Tall - TS is then the total time spent by all processors combined on non-useful work; this is called the total overhead. The total time collectively spent by all the processing elements is Tall = p TP (p is the number of processors). The overhead function To is therefore given by

    To = p TP - TS    (1)

Performance Metrics for Parallel Systems: Speedup
What is the benefit from parallelism? Speedup (S) is the ratio of the time taken to solve a problem on a single processor to the time required to solve the same problem on a parallel computer with p identical processing elements.
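As a concrete illustration of these two definitions, here is a minimal Python sketch; the function names and the sample timings are illustrative choices, not from the slides.

    def overhead(t_serial, t_parallel, p):
        # Total overhead To = p * TP - TS, Eq. (1)
        return p * t_parallel - t_serial

    def speedup(t_serial, t_parallel):
        # Speedup S = TS / TP
        return t_serial / t_parallel

    # Illustrative figures: TS = 100 s on one processor, TP = 15 s on p = 8.
    print(overhead(100, 15, 8))   # 20 s of combined non-useful work
    print(speedup(100, 15))       # ~6.67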

Performance Metrics: Example
Consider the problem of adding n numbers by using n processing elements. If n is a power of two, we can perform this operation in log n steps by propagating partial sums up a logical binary tree of processors.
[Figure: computing the global sum of 16 partial sums using 16 processing elements; Σ(i..j) denotes the sum of the numbers with consecutive labels from i to j.]

Performance Metrics: Example (continued)
If an addition takes constant time, say tc, and communication of a single word takes time ts + tw, we have the parallel time TP = Θ(log n). We know that TS = Θ(n). Speedup S is given by S = Θ(n / log n).
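The log n-step combining pattern can be modeled directly. Below is a minimal Python sketch (a sequential simulation of the tree, assuming n is a power of two; names are illustrative).

    import math

    def tree_sum(values):
        # In each of the log n steps, element i adds in the partial sum held
        # by element i + stride, so element 0 ends up with the global sum.
        vals = list(values)
        n = len(vals)
        steps = 0
        stride = 1
        while stride < n:
            for i in range(0, n, 2 * stride):
                vals[i] += vals[i + stride]   # one communication + one addition
            stride *= 2
            steps += 1
        return vals[0], steps

    total, steps = tree_sum(range(16))
    print(total, steps)               # 120, 4 steps = log2(16)
    print(steps == math.log2(16))     # True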

Performance Metrics: Speedup
For a given problem, there might be many serial algorithms available. These algorithms may have different asymptotic runtimes and may be parallelizable to different degrees. For the purpose of computing speedup, we always consider the best sequential program as the baseline.

Performance Metrics: Speedup Example
Consider the problem of parallel bubble sort. The serial time for bubble sort is 150 seconds; the parallel time for odd-even sort (an efficient parallelization of bubble sort) is 40 seconds. The speedup would appear to be 150/40 = 3.75. But is this really a fair assessment of the system? What if serial quicksort only took 30 seconds? In this case, the speedup is 30/40 = 0.75. This is a more realistic assessment of the system.

Performance Metrics: Speedup Bounds
Speedup can be as low as 0 (the parallel program never terminates). Speedup, in theory, should be upper bounded by p - after all, we can only expect a p-fold speedup if we use p times as many resources. A speedup greater than p is possible only if each processing element spends less than time TS / p solving the problem. In this case, a single processor could be timesliced to achieve a faster serial program, which contradicts our assumption of the fastest serial program as the basis for speedup. In practice, however, superlinear speedups can occur:
- Shrinking the problem size per processor may allow it to fit in small fast memory (cache).
- The application is not deterministic: the amount of work varies depending on execution order. Search algorithms have this characteristic.

Performance Metrics: Superlinear Speedups
One reason for superlinearity is that the parallel version does less work than the corresponding serial algorithm.
[Figure: searching an unstructured tree for a node with a given label, S, on two processing elements using depth-first traversal. The two-processor version, with processor 0 searching the left subtree and processor 1 searching the right subtree, expands only the shaded nodes before the solution is found; the corresponding serial formulation expands the entire tree.]
It is clear that the serial algorithm does more work than the parallel algorithm.
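The effect can be reproduced with a toy tree of our own construction (not the one in the figure): a left-first serial DFS expands the whole left subtree before reaching the solution, while the processor assigned the right subtree finds it on its first expansion.

    # Toy illustration: nodes are (label, children); "S" is the solution label.
    def make_tree():
        left = ("L", [("L%d" % i, []) for i in range(7)])   # 8-node left subtree
        right = ("S", [("R%d" % i, []) for i in range(7)])  # solution at right root
        return ("root", [left, right])

    def dfs(node, target, visited):
        visited.append(node[0])
        if node[0] == target:
            return True
        return any(dfs(child, target, visited) for child in node[1])

    root = make_tree()

    serial = []
    dfs(root, "S", serial)            # left-first serial DFS
    print(len(serial))                # 10 nodes expanded before "S" is found

    p1 = []
    dfs(root[1][1], "S", p1)          # processor 1 starts in the right subtree
    print(len(p1))                    # 1: processor 1 hits "S" immediately

    # In a real two-processor search, processor 0 (left subtree) would be told
    # to stop as soon as processor 1 succeeds, so only about two nodes are
    # expanded in total - far less work than the serial traversal.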

Performance Metrics: Superlinear Speedups
Resource-based superlinearity: the higher aggregate cache/memory bandwidth can result in better cache-hit ratios, and therefore superlinearity. Example: a processor with 64 KB of cache yields an 80% hit ratio. If two processors are used, since the problem size per processor is smaller, the hit ratio goes up to 90%. Of the remaining 10% of accesses, 8% come from local memory and 2% from remote memory. If DRAM access time is 100 ns, cache access time is 2 ns, and remote memory access time is 400 ns, this corresponds to a speedup of 2.43!

Example: Superlinear Speedup
Problem size W, cache hit rate 80%: effective memory access time = 2*0.8 + 100*0.2 = 21.6 ns, so the processing rate is 1/(21.6 ns) ≈ 46.3 MFLOPS (assuming one FLOP per memory access).
Problem size W/2 per processor, cache hit rate 90%: effective memory access time = 2*0.9 + 100*0.08 + 400*0.02 = 17.8 ns, so the two processors together deliver 2/(17.8 ns) ≈ 2 * 56.18 MFLOPS = 112.36 MFLOPS.
Speedup = 112.36 / 46.3 ≈ 2.43.
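The same arithmetic in a short Python sketch (times in ns, one FLOP per memory access as assumed above; variable names are ours):

    t_cache, t_dram, t_remote = 2.0, 100.0, 400.0

    # One processor, problem size W, 80% cache hit rate:
    t_access_1 = 0.8 * t_cache + 0.2 * t_dram                     # 21.6 ns
    rate_1 = 1.0 / t_access_1                                      # ~46.3 MFLOPS

    # Two processors, W/2 each, 90% hits, 8% local DRAM, 2% remote:
    t_access_2 = 0.9 * t_cache + 0.08 * t_dram + 0.02 * t_remote  # 17.8 ns
    rate_2 = 2.0 / t_access_2                                      # both processors together

    print(t_access_1, t_access_2)   # 21.6 17.8
    print(rate_2 / rate_1)          # ~2.43, i.e. superlinear speedup on 2 processors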

Performance Metrics: Efficiency
Efficiency is a measure of the fraction of time for which a processing element is usefully employed. Mathematically, it is given by

    E = S / p    (2)

Following the bounds on speedup, efficiency can be as low as 0 and as high as 1.

Performance Metrics: Efficiency Example
The speedup of adding n numbers on n processors is given by S = Θ(n / log n), so the efficiency is given by E = Θ(1 / log n).

Parallel Time, Speedup, and Efficiency Example
Consider the problem of edge detection in images. The problem requires us to apply a 3 x 3 template to each pixel. If each multiply-add operation takes time tc, the serial time for an n x n image is given by TS = 9 tc n^2.
[Figure: example of edge detection: (a) an 8 x 8 image; (b) typical templates for detecting edges; and (c) partitioning of the image across four processors, with shaded regions indicating image data that must be communicated from neighboring processors to processor 1.]

Parallel Time, Speedup, and Efficiency Example (continued)
One possible parallelization partitions the image equally into vertical segments, each with n^2 / p pixels. The boundary of each segment is 2n pixels; this is also the number of pixel values that will have to be communicated, which takes time 2(ts + tw n). Templates may now be applied to all n^2 / p pixels in time 9 tc n^2 / p.

Parallel Time, Speedup, and Efficiency Example (continued)
The total time for the algorithm is therefore given by

    TP = 9 tc n^2 / p + 2(ts + tw n)

The corresponding values of speedup and efficiency are given by

    S = 9 tc n^2 / (9 tc n^2 / p + 2(ts + tw n))    and    E = 1 / (1 + 2 p (ts + tw n) / (9 tc n^2))
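A small Python sketch that plugs these expressions into numbers; the parameter values tc, ts, tw below are illustrative choices of ours, not from the slides.

    def edge_metrics(n, p, t_c=1e-3, t_s=25.0, t_w=4e-3):
        # TP, S and E for the vertical-strip edge-detection partitioning
        t_serial = 9 * t_c * n * n
        t_parallel = 9 * t_c * n * n / p + 2 * (t_s + t_w * n)
        s = t_serial / t_parallel
        return t_parallel, s, s / p

    for n in (256, 1024, 4096):
        tp, s, e = edge_metrics(n, p=16)
        print(n, round(tp, 1), round(s, 2), round(e, 2))
    # Efficiency improves with n for fixed p: the 2(ts + tw*n) boundary exchange
    # grows only linearly in n while the useful work grows as n^2.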

Cost of a Parallel System
Cost is the product of parallel runtime and the number of processing elements used (p x TP). Cost reflects the sum of the time that each processing element spends solving the problem. A parallel system is said to be cost-optimal if the cost of solving a problem on a parallel computer is asymptotically identical to the serial cost. Since E = TS / (p TP), for cost-optimal systems E = O(1). Cost is sometimes referred to as work or processor-time product.

Cost of a Parallel System: Example
Consider the problem of adding n numbers on n processors. We have TP = log n (for p = n). The cost of this system is given by p TP = n log n. Since the serial runtime of this operation is Θ(n), the algorithm is not cost optimal.

Impact of Non-Cost Optimality
Consider a sorting algorithm that uses n processing elements to sort a list in time (log n)^2. Since the serial runtime of a (comparison-based) sort is n log n, the speedup and efficiency of this algorithm are given by n / log n and 1 / log n, respectively. The p TP product of this algorithm is n (log n)^2, so the algorithm is not cost optimal, but only by a factor of log n. If p < n, assigning the n tasks to p processors gives TP = n (log n)^2 / p. The corresponding speedup of this formulation is p / log n. This speedup goes down as the problem size n is increased for a given p!
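A quick numeric illustration of this last point, using the speedup expression p / log n from above (the particular values of n and p are arbitrary):

    import math

    def sort_speedup(n, p):
        t_serial = n * math.log2(n)               # comparison-based serial sort
        t_parallel = n * math.log2(n) ** 2 / p    # TP = n (log n)^2 / p for p < n
        return t_serial / t_parallel              # = p / log n

    for n in (2**10, 2**20, 2**30):
        print(n, round(sort_speedup(n, p=64), 2))
    # 2^10 -> 6.4, 2^20 -> 3.2, 2^30 -> 2.13: larger problems give LESS speedup.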

Effect of Granularity on Performance
Often, using fewer processors improves the performance of parallel systems. Using fewer than the maximum possible number of processing elements to execute a parallel algorithm is called scaling down a parallel system. A naive way of scaling down is to think of each processor in the original case as a virtual processor and to assign virtual processors equally to scaled-down processors. Since the number of processing elements decreases by a factor of n / p, the computation at each processing element increases by a factor of n / p. The communication cost should not increase by this factor, since some of the virtual processors assigned to a physical processor might talk to each other. This is the basic reason for the improvement from building granularity.

Building Granularity: Example
Consider the problem of adding n numbers on p processing elements such that p < n and both n and p are powers of 2. Use the parallel algorithm for n processors, except that, in this case, we think of them as virtual processors. Each of the p processors is now assigned n / p virtual processors. The first log p of the log n steps of the original algorithm are simulated in (n / p) log p steps on p processing elements. The subsequent log n - log p steps do not require any communication.

Building Granularity: Example (continued)
The overall parallel execution time of this parallel system is Θ((n / p) log p). The cost is Θ(n log p), which is asymptotically higher than the Θ(n) cost of adding n numbers sequentially. Therefore, this parallel system is not cost-optimal.

Building Granularity: Example (continued)
Can we build granularity in the example in a cost-optimal fashion? Each processing element locally adds its n / p numbers in time Θ(n / p). The p partial sums on p processing elements can then be added in time Θ(log p).
[Figure: a cost-optimal way of computing the sum of 16 numbers using four processing elements.]

Building Granularity: Example (continued)
The parallel runtime of this algorithm is

    TP = Θ(n / p + log p)    (3)

The cost is Θ(n + p log p). This is cost-optimal, so long as n = Ω(p log p)!
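A minimal Python sketch of this two-phase scheme (a sequential simulation; the function name and the 16-number example are ours):

    def cost_optimal_sum(values, p):
        # Phase 1: each of the p processes adds its n/p local numbers.
        n = len(values)
        chunk = n // p
        partial = [sum(values[i * chunk:(i + 1) * chunk]) for i in range(p)]
        # Phase 2: the p partial sums are combined pairwise in log p tree steps.
        steps = 0
        stride = 1
        while stride < p:
            for i in range(0, p, 2 * stride):
                partial[i] += partial[i + stride]
            stride *= 2
            steps += 1
        return partial[0], chunk, steps

    total, local, steps = cost_optimal_sum(list(range(16)), p=4)
    print(total, local, steps)   # 120, 4 numbers per process, 2 tree steps (= log2 4)
    # TP is on the order of n/p + log p, so the cost p*TP ~ n + p log p,
    # which stays Θ(n), i.e. cost-optimal, as long as n = Ω(p log p).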

Scalability of Parallel Systems
How do we extrapolate performance from small problems and small systems to larger problems on larger configurations? Consider three parallel algorithms for computing an n-point Fast Fourier Transform (FFT) on 64 processing elements.
[Figure: a comparison of the speedups obtained by the binary-exchange, 2-D transpose, and 3-D transpose algorithms on 64 processing elements, with tc = 2, tw = 4, ts = 25, and th = 2.]
Clearly, it is difficult to infer scaling characteristics from observations on small datasets on small machines.

Scaling Characteristics of Parallel Programs
The efficiency of a parallel program can be written as E = S / p = TS / (p TP), or

    E = 1 / (1 + To / TS)    (4)

The total overhead function To is an increasing function of p.

Scaling Characteristics of Parallel Programs
For a given problem size (i.e., the value of TS remains constant), as we increase the number of processing elements, To increases. The overall efficiency of the parallel program therefore goes down. This is the case for all parallel programs.

Scaling Characteristics of Parallel Programs: Example
Consider the problem of adding n numbers on p processing elements. We have seen that:

    TP = n / p + 2 log p    (5)
    S = n / (n / p + 2 log p)    (6)
    E = 1 / (1 + 2 p log p / n)    (7)
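A short Python sketch that tabulates Eq. (6) for a few values of n and p (the particular values are illustrative), mirroring the curves plotted on the next slide:

    import math

    def add_speedup(n, p):
        # S = n / (n/p + 2 log p), Eq. (6)
        return n / (n / p + 2 * math.log2(p)) if p > 1 else 1.0

    for n in (64, 192, 320, 512):
        print(n, [round(add_speedup(n, p), 1) for p in (1, 4, 8, 16, 32)])
    # For a fixed n the speedup saturates as p grows (efficiency falls);
    # for a fixed p, a larger n gives a better speedup.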

Scaling Characteristics of Parallel Programs: Example (continued)
Plotting the speedup for various input sizes gives us:
[Figure: speedup versus the number of processing elements for adding a list of numbers. Speedup tends to saturate and efficiency drops as a consequence of Amdahl's law.]

Scaling Characteristics of Parallel Programs
The total overhead function To is a function of both the problem size TS and the number of processing elements p. In many cases, To grows sublinearly with respect to TS. In such cases, the efficiency increases if the problem size is increased while keeping the number of processing elements constant. For such systems, we can simultaneously increase the problem size and the number of processors to keep efficiency constant. We call such systems scalable parallel systems.

Scaling Characteristics of Parallel Programs
Recall that cost-optimal parallel systems have an efficiency of Θ(1). Scalability and cost-optimality are therefore related: a scalable parallel system can always be made cost-optimal if the number of processing elements and the size of the computation are chosen appropriately.

Isoefficiency Metric of Scalability
For a given problem size, as we increase the number of processing elements, the overall efficiency of the parallel system goes down; this holds for all systems. For some systems, the efficiency of a parallel system increases if the problem size is increased while keeping the number of processing elements constant.

Isoefficiency Metric of Scalability
[Figure: variation of efficiency (a) as the number of processing elements is increased for a given problem size, and (b) as the problem size is increased for a given number of processing elements. The phenomenon illustrated in graph (b) is not common to all parallel systems.]
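Both trends can be reproduced for the n-number addition example using Eq. (7); a short Python sketch (parameter values are illustrative):

    import math

    def add_efficiency(n, p):
        # E = 1 / (1 + 2 p log p / n), Eq. (7)
        return 1.0 / (1.0 + 2 * p * math.log2(p) / n)

    # (a) fixed problem size, growing p: efficiency falls.
    print([round(add_efficiency(512, p), 2) for p in (2, 4, 8, 16, 32)])

    # (b) fixed p, growing problem size: efficiency climbs back toward 1.
    print([round(add_efficiency(n, 32), 2) for n in (512, 2048, 8192, 32768)])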
