CUDA Programming Model PPT Courseware.ppt

Uploaded by: Facebook | Document ID: 3833817 | Upload date: 2018-11-20 | Format: PPT | Pages: 16 | Size: 1,015 KB


Resource description

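CUDA/GPU Programming Model
Zhou Bin (周斌), NVIDIA & USTC, October 2014

Contents
- CPU-GPU interaction model
- GPU thread organization model (reinforced throughout)
- GPU memory model
- Basic programming issues

CPU-GPU interaction
- The CPU and the GPU each have their own physical memory space
- The two are connected over the PCIe bus (8 GB/s to 16 GB/s)
- Interaction overhead is relatively high

GPU memory hierarchy (hardware): access speed
- Register: dedicated HW, single cycle
- Shared Memory: dedicated HW, single cycle
- Local Memory: DRAM, no cache, *slow*
- Global Memory: DRAM, no cache, *slow*
- Constant Memory: DRAM, cached, 1 to 10s to 100s of cycles, depending on cache locality
- Texture Memory: DRAM, cached, 1 to 10s to 100s of cycles, depending on cache locality
- Instruction Memory (invisible): DRAM, cached

GPU architecture review

GPU thread organization model

Thread organization explained
- A kernel consists of a large number of threads
- Threads are divided into thread blocks
- Threads within one block share Shared Memory and can synchronize with __syncthreads()
- Launching a kernel starts a grid containing a number of thread blocks, as set by the user
- Every thread and every thread block has a unique identifier

GPU thread mapping

Relationship between GPU memory and threads

Programming model
- Conventionally, GPUs are used for graphics and image processing
- They operate on pixels, and the operation applied to each pixel is similar
- SIMD (single instruction, multiple data) can therefore be applied

SIMD (Single Instruction Multiple Data)
- Can also be viewed as data-parallel partitioning

Single Instruction Multiple Thread (SIMT)
- The GPU version of SIMD
- A model with a large number of threads achieves a high degree of parallelism
- Switching between threads hides latency
- Multiple threads execute the same instruction stream
- The GPU hosts and schedules a large number of threads

CUDA programming model: Extended C
- Declspecs: global, device, shared, local, constant
- Keywords: threadIdx, blockIdx
- Intrinsics: __syncthreads
- Runtime API: memory, symbol, execution management
- Function launch

    __device__ float filter[N];

    __global__ void convolve(float *image) {
        __shared__ float region[M];
        ...
        region[threadIdx.x] = image[i];
        __syncthreads();
        ...
        image[j] = result;
    }

    // Allocate GPU memory
    float *myimage;
    cudaMalloc((void **)&myimage, bytes);

    // 100 blocks, 10 threads per block
    convolve<<<100, 10>>>(myimage);

CUDA function declarations
- __global__ defines a kernel function: the entry point, called from the CPU and executed on the GPU; it must return void
- __device__ and __host__ can be used together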
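The convolve fragment on the slide leaves out its indexing and its host-side setup, so a complete program may help tie the pieces together. The following is a minimal sketch, not the author's code: the kernel `halve_reversed`, the helper `halve`, the function `run_check`, and the choice of operation (reversing each block's tile and halving the values) are all hypothetical and chosen only for illustration. It keeps the slide's launch shape of 100 blocks with 10 threads each, and uses only `__global__`, `__host__ __device__`, `__shared__`, `__syncthreads()`, `blockIdx`/`threadIdx`, and standard CUDA runtime calls.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCK   10                 // threads per block (as in the slide's launch)
#define NBLOCKS 100                // thread blocks in the grid
#define N       (BLOCK * NBLOCKS)  // one element per thread, so no bounds check is needed

// __host__ and __device__ combined: compiled for both the CPU and the GPU.
__host__ __device__ float halve(float x) { return 0.5f * x; }

// __global__: kernel entry point, called from the CPU, executed on the GPU, returns void.
__global__ void halve_reversed(float *image) {
    __shared__ float region[BLOCK];                  // one tile per block, shared by its threads
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global index of this thread
    region[threadIdx.x] = image[i];
    __syncthreads();                                 // wait until the whole tile is loaded
    // Read an element that a different thread of the same block wrote.
    image[i] = halve(region[blockDim.x - 1 - threadIdx.x]);
}

// Hypothetical helper: runs the kernel and returns the number of mismatches
// against a CPU reference, or -1 if a CUDA call fails.
int run_check() {
    size_t bytes = N * sizeof(float);
    float *h_in  = (float *)malloc(bytes);
    float *h_out = (float *)malloc(bytes);
    for (int i = 0; i < N; ++i) h_in[i] = (float)i;

    float *d_image = NULL;
    cudaError_t err = cudaMalloc((void **)&d_image, bytes);   // separate GPU memory space
    if (err == cudaSuccess)
        err = cudaMemcpy(d_image, h_in, bytes, cudaMemcpyHostToDevice);  // copy over PCIe
    if (err == cudaSuccess) {
        halve_reversed<<<NBLOCKS, BLOCK>>>(d_image);          // grid of 100 blocks x 10 threads
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_image, bytes, cudaMemcpyDeviceToHost);

    int mismatches = 0;
    if (err == cudaSuccess) {
        for (int i = 0; i < N; ++i) {
            int b = i / BLOCK, t = i % BLOCK;
            float expected = halve(h_in[b * BLOCK + (BLOCK - 1 - t)]);
            if (h_out[i] != expected) ++mismatches;
        }
    } else {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        mismatches = -1;
    }
    cudaFree(d_image);
    free(h_in);
    free(h_out);
    return mismatches;
}

int main() {
    int bad = run_check();
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Assuming a CUDA toolkit and a CUDA-capable GPU are available, this could be built with something like `nvcc example.cu -o example` (the file name is arbitrary). The barrier matters here because each thread reads a tile element written by another thread in its block; without `__syncthreads()` that read could happen before the write.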
