CUDA Matrix Multiplication Optimization

GPU Assignment 5KK70: MatrixMul Example

Matrix multiplication is a fundamental building block for scientific computation. When solving linear equations and finding eigenvalues, the workload is dominated by matrix multiplication. As scientific computing is an important application domain in GPU computing, optimizing matrix multiplication on GPU is the key to achieving high performance in this domain.

Contents
1 The matrixMul problem
2 Serial Implementation on CPU
3 Naive Implementation on GPU
4 Increase Computation to Memory Ratio by Tiling
5 Memory Coalescing
6 Avoiding memory bank conflict
7 Multiply / Add Balancing
8 Loop unrolling
9 Prefetching

The matrixMul problem

Given an M x K matrix A and a K x N matrix B, multiply A with B and store the result into an M x N matrix C.

The matrixMul example on this page will show several techniques to optimize matrix multiplication on GPU. Most of them are generic and can be applied to other applications. These techniques are:
1. Tiling
2. Memory coalescing
3. Avoiding memory bank conflicts
4. Increasing the floating-point portion by outer product
5. Loop unrolling
6. Prefetching

The performance of these optimization techniques is shown in the figures below. We will start with a simple serial code running on the CPU, and then go through these optimizations step by step.

The source code of these examples is available in the attachment of this page (click to download). Unzip the package to the C/src path to compile.

Serial Implementation on CPU

/* Code running on CPU */
void main() {
  define A, B, C
  for i = 0 to M do
    for j = 0 to N do
      /* compute element C(i,j) */
      for k = 0 to K do
        C(i,j) = C(i,j) + A(i,k) * B(k,j)
      end
    end
  end
}

Naive Implementation on GPU

/* Code running on CPU */
main() {
  define A_cpu, B_cpu, C_cpu in the CPU memory
  define A_gpu, B_gpu, C_gpu in the GPU memory

  memcopy A_cpu to A_gpu
  memcopy B_cpu to B_gpu

  dim3 dimBlock(16, 16)
  dim3 dimGrid(N / dimBlock.x, M / dimBlock.y)
  matrixMul<<<dimGrid, dimBlock>>>(A_gpu, B_gpu, C_gpu, K)

  memcopy C_gpu to C_cpu
}

/* Code running on GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K) {
  accu = 0
  i = blockIdx.y * blockDim.y + threadIdx.y   // Row i of matrix C
  j = blockIdx.x * blockDim.x + threadIdx.x   // Column j of matrix C
  for k = 0 to K-1 do
    accu = accu + A_gpu(i,k) * B_gpu(k,j)
  end
  C_gpu(i,j) = accu
}

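For reference, the naive kernel can be written as compilable CUDA as follows. This is a minimal sketch: the function name matrixMulNaive, the row-major indexing, and the assumption that M and N are multiples of the 16 x 16 block size are choices made here, not part of the original package.

// Naive kernel: one thread per element of C, all operands in global memory.
// Assumes row-major storage and M, N divisible by the block dimensions.
__global__ void matrixMulNaive(const float *A, const float *B, float *C,
                               int M, int N, int K) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row of C
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column of C
    float accu = 0.0f;
    for (int k = 0; k < K; ++k)
        accu += A[i * K + k] * B[k * N + j];         // row of A times column of B
    C[i * N + j] = accu;
}

// Host-side launch, matching the pseudocode above:
//   dim3 dimBlock(16, 16);
//   dim3 dimGrid(N / dimBlock.x, M / dimBlock.y);
//   matrixMulNaive<<<dimGrid, dimBlock>>>(A_gpu, B_gpu, C_gpu, M, N, K);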
A naive implementation on GPU assigns one thread to compute one element of matrix C. Each thread loads one row of matrix A and one column of matrix B from global memory, does the inner product, and stores the result back to matrix C in global memory. The figure shows the memory footprint of one thread on global memory, where matrices A, B, and C are stored.

In the naive implementation, the amount of computation is 2 x M x N x K flop, while the amount of global memory access is 2 x M x N x K words. The computation to memory ratio is approximately 1/4 (flop/byte): every 4-byte word loaded supports only one multiply and one add.

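Written out, with 4-byte single-precision words assumed, the ratio follows from dividing the flop count by the number of bytes transferred:

\[
\frac{2MNK \ \text{flop}}{2MNK \ \text{words} \times 4 \ \text{bytes/word}} = \frac{1}{4} \ \text{flop/byte}
\]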
Therefore, the naive implementation is bandwidth bound, as shown in the roofline model below.

Increase Computation to Memory Ratio by Tiling

To increase the computation to memory ratio, tiled matrix multiplication can be applied. One thread block computes one tile of matrix C, and one thread in the thread block computes one element of the tile.

The figure shows a 32 x 32 matrix divided into four 16 x 16 tiles. To compute this, four thread blocks, each with 16 x 16 threads, can be created.

The GPU kernel computes C in multiple iterations. In each iteration, one thread block loads one tile of A and one tile of B from global memory to shared memory, performs computation, and stores the partial result of C in registers. After all the iterations are done, the thread block stores one tile of C into global memory. For example, a thread block can compute C(0,0) in two iterations: C(0,0) = A(0,0) B(0,0) + A(0,1) B(1,0).

In the first iteration, the thread block loads tile A(0,0) and tile B(0,0) from global memory into shared memory. Each thread performs an inner product to produce one element of C. This element of C is kept in a register and will be accumulated in the next iteration.

In the second iteration, the thread block loads tile A(0,1) and tile B(1,0) from global memory into shared memory. Each thread performs the inner product to produce one element of C, which is accumulated with the previous value. If this is the final iteration, the element of C in the register file is stored back into global memory.

The CPU code remains the same; only the GPU kernel is shown here.

/* Code running on GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K) {
  __shared__ float A_tile(blockDim.y, blockDim.x)
  __shared__ float B_tile(blockDim.x, blockDim.y)
  accu = 0

  /* Accumulate C tile by tile. */
  for tileIdx = 0 to (K/blockDim.x - 1) do

    /* Load one tile of A and one tile of B into shared mem */
    i = blockIdx.y * blockDim.y + threadIdx.y       // Row i of matrix A
    j = tileIdx * blockDim.x + threadIdx.x          // Column j of matrix A
    A_tile(threadIdx.y, threadIdx.x) = A_gpu(i,j)   // Load A(i,j) to shared mem
    i = blockIdx.x * blockDim.x + threadIdx.y       // Column i of matrix B (j above is its row)
    B_tile(threadIdx.x, threadIdx.y) = B_gpu(j,i)   // Load B(j,i) to shared mem (not coalesced)

    __syncthreads()   // Synchronize before computation

    /* Accumulate one tile of C from the tiles of A and B in shared mem */
    for k = 0 to blockDim.x do
      accu = accu + A_tile(threadIdx.y,k) * B_tile(k,threadIdx.x)   // Accumulate for matrix C
    end

    __syncthreads()
  end

  i = blockIdx.y * blockDim.y + threadIdx.y   // Row i of matrix C
  j = blockIdx.x * blockDim.x + threadIdx.x   // Column j of matrix C
  C_gpu(i,j) = accu                           // Store the accumulated value to C(i,j)
}

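A compilable CUDA version of this tiled kernel might look as follows. The fixed tile width TILE, the row-major indexing, and the assumption that M, N, and K are multiples of TILE are choices of this sketch; the load of B is deliberately kept in the same, not yet coalesced, orientation as the pseudocode above.

#define TILE 16   // tile width; M, N, K assumed to be multiples of TILE

__global__ void matrixMulTiled(const float *A, const float *B, float *C,
                               int M, int N, int K) {
    __shared__ float A_tile[TILE][TILE];
    __shared__ float B_tile[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;   // row of C computed by this thread
    int col = blockIdx.x * TILE + threadIdx.x;   // column of C computed by this thread
    float accu = 0.0f;
    for (int t = 0; t < K / TILE; ++t) {
        // Tile of A: neighbouring threads read consecutive addresses (coalesced).
        A_tile[threadIdx.y][threadIdx.x] = A[row * K + t * TILE + threadIdx.x];
        // Tile of B: neighbouring threads read addresses N words apart (not coalesced).
        B_tile[threadIdx.x][threadIdx.y] =
            B[(t * TILE + threadIdx.x) * N + blockIdx.x * TILE + threadIdx.y];
        __syncthreads();                         // tiles fully loaded
        for (int k = 0; k < TILE; ++k)
            accu += A_tile[threadIdx.y][k] * B_tile[k][threadIdx.x];
        __syncthreads();                         // tiles fully consumed
    }
    C[row * N + col] = accu;
}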
In the tiled implementation, the amount of computation is still 2 x M x N x K flop. However, with a tile size of B, the amount of global memory access is 2 x M x N x K / B words, because every element brought into shared memory is reused B times. The computation to memory ratio is approximately B/4 (flop/byte), so the ratio can now be tuned by changing the tile size B. The roofline model is shown below.

Memory Coalescing

Two-dimensional arrays in C/C++ are row-major. In the tiled implementation above, neighbouring threads have coalesced access to matrix A, but do not have coalesced access to matrix B. In column-major languages, such as Fortran, the problem is the other way around. An obvious solution is to transpose matrix B on the CPU before offloading it to GPU memory.

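A host-side transpose is straightforward; the sketch below is illustrative (the function name and the caller-provided output buffer are assumptions, not part of the original package).

// Transpose the K x N matrix B into the N x K matrix BT on the CPU, so that
// former column accesses to B become row accesses (and thus coalesced) to BT.
void transposeB(const float *B, float *BT, int K, int N) {
    for (int k = 0; k < K; ++k)
        for (int n = 0; n < N; ++n)
            BT[n * K + k] = B[k * N + n];
}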
/* Code running on GPU. B_gpu now holds the transposed N x K matrix. */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K) {
  __shared__ float A_tile(blockDim.y, blockDim.x)
  __shared__ float B_tile(blockDim.x, blockDim.y)
  accu = 0

  /* Accumulate C tile by tile. */
  for tileIdx = 0 to (K/blockDim.x - 1) do

    /* Load one tile of A and one tile of B into shared mem */
    i = blockIdx.y * blockDim.y + threadIdx.y       // Row i of matrix A
    j = tileIdx * blockDim.x + threadIdx.x          // Column j of matrix A
    A_tile(threadIdx.y, threadIdx.x) = A_gpu(i,j)   // Load A(i,j) to shared mem
    i = blockIdx.x * blockDim.x + threadIdx.y       // Row i of the transposed matrix B
    B_tile(threadIdx.y, threadIdx.x) = B_gpu(i,j)   // Load B(i,j) to shared mem (coalesced)

    __syncthreads()   // Synchronize before computation

    /* Accumulate one tile of C from the tiles of A and B in shared mem */
    for k = 0 to blockDim.x do
      accu = accu + A_tile(threadIdx.y,k) * B_tile(threadIdx.x,k)   // Bank conflict
    end

    __syncthreads()
  end

  i = blockIdx.y * blockDim.y + threadIdx.y   // Row i of matrix C
  j = blockIdx.x * blockDim.x + threadIdx.x   // Column j of matrix C
  C_gpu(i,j) = accu                           // Store the accumulated value to C(i,j)
}

Avoiding memory bank conflict

In the coalesced kernel above, the inner loop reads B_tile(threadIdx.x, k): the threads of a half-warp access shared memory with a stride of one tile width, so they hit the same memory bank. Loading matrix B directly, without the CPU-side transpose, and storing its tile row-major as B_tile(threadIdx.y, threadIdx.x) keeps the global load coalesced while the inner loop reads consecutive words that fall into different banks.

/* Code running on GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K) {
  __shared__ float A_tile(blockDim.y, blockDim.x)
  __shared__ float B_tile(blockDim.x, blockDim.y)
  accu = 0

  /* Accumulate C tile by tile. */
  for tileIdx = 0 to (K/blockDim.x - 1) do

    /* Load one tile of A and one tile of B into shared mem */
    i = blockIdx.y * blockDim.y + threadIdx.y       // Row i of matrix A
    j = tileIdx * blockDim.x + threadIdx.x          // Column j of matrix A
    A_tile(threadIdx.y, threadIdx.x) = A_gpu(i,j)   // Load A(i,j) to shared mem
    i = tileIdx * blockDim.y + threadIdx.y          // Row i of matrix B
    j = blockIdx.x * blockDim.x + threadIdx.x       // Column j of matrix B
    B_tile(threadIdx.y, threadIdx.x) = B_gpu(i,j)   // Load B(i,j) to shared mem (no bank conflict)

    __syncthreads()   // Synchronize before computation

    /* Accumulate one tile of C from the tiles of A and B in shared mem */
    for k = 0 to blockDim.x do
      accu = accu + A_tile(threadIdx.y,k) * B_tile(k,threadIdx.x)   // No bank conflict
    end

    __syncthreads()
  end

  i = blockIdx.y * blockDim.y + threadIdx.y   // Row i of matrix C
  j = blockIdx.x * blockDim.x + threadIdx.x   // Column j of matrix C
  C_gpu(i,j) = accu                           // Store the accumulated value to C(i,j)
}

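In compilable CUDA, the coalesced, conflict-free version differs from the earlier tiled sketch only in how the tile of B is indexed and stored (same assumptions as before: row-major data, TILE dividing M, N, and K):

#define TILE 16   // tile width; M, N, K assumed to be multiples of TILE

__global__ void matrixMulTiledFinal(const float *A, const float *B, float *C,
                                    int M, int N, int K) {
    __shared__ float A_tile[TILE][TILE];
    __shared__ float B_tile[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float accu = 0.0f;
    for (int t = 0; t < K / TILE; ++t) {
        // Both loads are coalesced: consecutive threads read consecutive addresses.
        A_tile[threadIdx.y][threadIdx.x] = A[row * K + t * TILE + threadIdx.x];
        B_tile[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();
        // B_tile[k][threadIdx.x]: consecutive threads touch consecutive banks.
        for (int k = 0; k < TILE; ++k)
            accu += A_tile[threadIdx.y][k] * B_tile[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = accu;
}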
Multiply / Add Balancing

The kernel is now computation bound, so we need to increase the portion of useful floating point operations in the total instruction mix. Because the inner product consumes most of the time, it is important to make sure this part is efficient. If we check the binary code for the inner product, we discover that one line of CUDA code takes two instructions in the binary.

/* CUDA code for inner product */
accu = accu + A_tile(threadIdx.y,k) * B_tile(k,threadIdx.x)

/* Disassembled from cubin binary */
mov.b32 $r0, s[$ofs4+0x0000]
mad.rn.f32 $r9, s[$ofs1+0x002c], $r0, $r9

The current architecture of the Streaming Multiprocessor (SM) only allows one source operand from shared memory, while computing the inner product requires two source operands from shared memory. One solution is to store matrix A or matrix B in the register file, but then the matrix in the register file cannot be shared by different threads, which decreases the computation to memory ratio.

A better solution is to perform an outer product instead of the inner product. In this case, matrix A is stored in shared memory, but matrix B and matrix C are stored in registers. The outer product does not require sharing of matrix B and matrix C, so each thread only stores one element of B and one column of the tile of C in its registers. The computation to memory ratio of the outer product is the same as that of the inner product, yet each multiply-add now needs only one shared memory operand, so the extra mov instruction disappears.

/* CUDA code for outer product */
/* accu(i) and b are stored in the register file */
accu(i) = accu(i) + A_tile(i) * b

/* Disassembled from cubin binary */
mad.rn.f32 $r9, s[$ofs2+0x0010], $r29, $r9

Here is an example of multiplying tile A(0,0) and tile B(0,0) to compute C(0,0) using the outer product. In this example, A(0,0) is 16 x 16, B(0,0) is 16 x 64, and C(0,0) is 16 x 64. A thread block of 64 threads performs the computation of C(0,0).

Step 1: load A(0,0) to shared memory.

Step 2: use 16 iterations to update C(0,0). Each thread stores one element of B(0,0) in its register. Each thread also stores one column of C(0,0) in its registers.

Iteration 1: outer product between the first column of A(0,0) and the first row of B(0,0), and update C(0,0).

Iteration 2: outer product between the second column of A(0,0) and the second row of B(0,0), and update C(0,0).

Iterations 3, 4, ..., 15 continue in the same way.

Iteration 16: outer product between the 16th column of A(0,0) and the 16th row of B(0,0), and update C(0,0).

Step 3: each thread stores one column of C(0,0) from its registers to global memory.

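The walkthrough above can be sketched in CUDA as follows. The kernel name, the single-tile scope, and the leading-dimension parameters lda, ldb, and ldc are illustrative assumptions; a full matrixMul kernel would wrap this in a loop over tiles.

#define TILE_M 16   // height of the A tile (and of the C tile)
#define TILE_N 64   // width of the B and C tiles; also the block size

// One block of 64 threads computes one 16 x 64 tile of C as a sum of
// 16 outer products. Each thread owns one column of the C tile.
__global__ void tileOuterProduct(const float *A, const float *B, float *C,
                                 int lda, int ldb, int ldc) {
    __shared__ float A_tile[TILE_M][TILE_M];
    int t = threadIdx.x;                          // this thread's column of C

    // Step 1: the 64 threads cooperatively load the 16 x 16 tile of A.
    for (int e = t; e < TILE_M * TILE_M; e += blockDim.x)
        A_tile[e / TILE_M][e % TILE_M] = A[(e / TILE_M) * lda + e % TILE_M];
    __syncthreads();

    float c[TILE_M];                              // one column of C, in registers
    for (int i = 0; i < TILE_M; ++i)
        c[i] = 0.0f;

    // Step 2: 16 outer-product iterations. In iteration k, each thread keeps
    // one element of row k of B in register b; each multiply-add then has a
    // single shared memory operand, A_tile[i][k].
    for (int k = 0; k < TILE_M; ++k) {
        float b = B[k * ldb + t];
        for (int i = 0; i < TILE_M; ++i)
            c[i] += A_tile[i][k] * b;
    }

    // Step 3: store the column of C back to global memory.
    for (int i = 0; i < TILE_M; ++i)
        C[i * ldc + t] = c[i];
}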
Loop unrolling

Use the pragma to tell the compiler to unroll the loops. nvcc unrolls the inner loops by default, but it will not unroll the outer loop unless told to by the pragma:

#pragma unroll

Loop unrolling sometimes has side effects on register usage, which may limit the number of concurrent threads. However, loop unrolling does not increase register usage in the matrixMul example.

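As a placement example (not taken from the original package), the pragma goes immediately before the loop it should unroll; in the tiled sketches above that would be the tile-width loop, whose trip count TILE is a compile-time constant:

// Ask the compiler to fully unroll the tile-width loop.
#pragma unroll
for (int k = 0; k < TILE; ++k)
    accu += A_tile[threadIdx.y][k] * B_tile[k][threadIdx.x];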
Prefetching

/* Code running on GPU */
__global__ void matrixMul(A_gpu, B_gpu, C_gpu, K) {
  __shared__ float A_tile0(blockDim.y, blockDim.x)
  __shared__ float A_tile1(blockDim.y, blockDim.x)
  float *pointer0 = A_tile0
  float *pointer1 = A_tile1

  fetch one tile of matrix A_gpu to pointer0
  __syncthreads()

  /* Accumulate C tile by tile. */
  for tileIdx = 0 to (K/blockDim.x - 1) do
    prefetch one tile of matrix A_gpu to pointer1
    accumulate C using pointer0
    __syncthreads()
    swap pointer0 and pointer1
  end

  store tile C to global memory
}

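A compilable double-buffered sketch is shown below. Note two deviations that are assumptions of this sketch rather than part of the page: it double-buffers the tiles of both A and B (the pseudocode above buffers only A), and it keeps the element-per-thread mapping of the earlier tiled kernels.

#define TILE 16   // tile width; M, N, K assumed to be multiples of TILE

__global__ void matrixMulPrefetch(const float *A, const float *B, float *C,
                                  int M, int N, int K) {
    __shared__ float A_buf[2][TILE][TILE];
    __shared__ float B_buf[2][TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int numTiles = K / TILE;

    // Fetch the first tiles into buffer 0.
    A_buf[0][threadIdx.y][threadIdx.x] = A[row * K + threadIdx.x];
    B_buf[0][threadIdx.y][threadIdx.x] = B[threadIdx.y * N + col];
    __syncthreads();

    float accu = 0.0f;
    for (int t = 0; t < numTiles; ++t) {
        int cur = t & 1, nxt = cur ^ 1;
        // Prefetch the next tiles while the current ones are being consumed.
        if (t + 1 < numTiles) {
            A_buf[nxt][threadIdx.y][threadIdx.x] =
                A[row * K + (t + 1) * TILE + threadIdx.x];
            B_buf[nxt][threadIdx.y][threadIdx.x] =
                B[((t + 1) * TILE + threadIdx.y) * N + col];
        }
        for (int k = 0; k < TILE; ++k)
            accu += A_buf[cur][threadIdx.y][k] * B_buf[cur][k][threadIdx.x];
        __syncthreads();   // next tiles fully written, current tiles fully read
    }
    C[row * N + col] = accu;
}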
Attachments (1): 5kk70matrixMul.zip (47k) - Nov 5, 2009, 7:24 AM, by zhenyu ye (version 1)
