1. GPU
- The Graphics Processing Unit (GPU) is a processor that was originally specialized for processing graphics, but it can run any algorithm, not only graphics, with high efficiency and performance.
GPU computing - features:
- Massively parallel.
- Hundreds of cores.
- Thousands of threads.
- Cheap.
- Highly available.
- Computing:
+ 1 Teraflop (Single precision)
+ 100 Gflops (Double precision)
- Programmable: CUDA
- Important factors to consider: power and cooling.
Figure: GPU vs CPU
Note: The amount of RAM featured on a GPU is typically the first spec listed when naming the card. Basically, the more VRAM a card has, the more complex the tasks it can load. If the tasks you run overflow the GPU's VRAM, the excess spills into system RAM, which significantly degrades performance.
- On Linux with an NVIDIA GPU, the commands "nvidia-smi", "watch -n 1 nvidia-smi", or "nvidia-settings" show the GPU memory in use and the processes using the GPU.
2. Compute Unified Device Architecture (CUDA)
Its characteristics:
- A compiler and toolkit for programming NVIDIA GPUs.
- API extends the C programming language.
- Runs on thousands of threads.
- Scalable model.
- Parallelism.
- Gives a high-level abstraction over the hardware.
- The CUDA language is vendor-dependent (it targets NVIDIA GPUs only).
- OpenCL aims to become an industry standard. It is a lower-level specification and more complex to program with than CUDA C.
3. CUDA architecture
Abstracting from the hardware.
Automatic thread management (can handle 100k+ threads).
Languages: C, C++, OpenCL.
OS: Windows, Linux, OS X.
Host: the CPU and its memory (host memory).
Device: the GPU and its memory (device memory).
Figure: CUDA architecture
The user only needs to take care of:
- Analyzing the algorithm to expose parallelism (e.g. block size, number of threads) - a launch-configuration sketch follows this list.
- Resource management (efficient data transfers; local data sets and register space, which are limited on-chip memory).
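A minimal sketch of the first point, choosing a 1-D launch configuration (the problem size n and the 256-thread block size are assumed values, not taken from the text):

int n = 1000000;                                                  // assumed problem size
int threadsPerBlock = 256;                                        // assumed block size (hardware dependent)
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up so every element is covered
// myKernel<<<blocksPerGrid, threadsPerBlock>>>(...);             // hypothetical kernel launch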
4. Grid - Block - Thread - Kernel
4.1 Grid
The set of blocks is referred to as a grid.
4.2 Block
- Blocks are grouped in a grid.
- Blocks are independent.
- Each invocation can refer to its block index using blockIdx.x.
4.3 Thread
- Kernels are executed by threads.
- Each thread has an ID.
- Thousands of threads execute the same kernel.
- Threads are grouped into blocks.
- Threads in a block can synchronize execution.
- Each invocation can refer to its thread index using threadIdx.x.
4.4 Kernel
- A kernel is a simple C function that runs on the device (a minimal sketch follows below).
Figure: Block - Thread - Kernel
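A minimal sketch of a kernel and its launch (the kernel name scale, its arguments, and the 4x8 launch configuration are made-up values for illustration):

// Hypothetical kernel: each thread scales one array element.
__global__ void scale(float *data, float factor) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;   // unique global thread index
    data[i] = data[i] * factor;
}

// Host-side launch: 4 blocks of 8 threads = 32 threads (assumes data has 32 elements).
// scale<<<4, 8>>>(d_data, 2.0f);                    // d_data is a device pointer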
5. Workflow
Figure: GPU calculation workflow
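In outline: allocate device memory and copy the input data from host memory to device memory, launch the kernel on the GPU, then copy the results back to host memory and free the device memory. The vector-addition example in section 6 follows exactly these steps.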
6. C extensions
Block size: (x, y, z). x*y*z = maximum of 768 threads in total. (Hardware dependent.)
Grid size: (x, y). Maximum of thousands of blocks. (Hardware dependent.)
__global__ : called from the host but executed on the GPU.
__host__ : called and executed on the host.
__shared__ : variable placed in shared memory.
__syncthreads() : synchronizes the threads within a block (a usage sketch follows this list).
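A short sketch of how __shared__, __syncthreads(), and a dim3 launch configuration fit together (the kernel name sum_block, the 256-thread block size, and the reduction pattern are assumptions for illustration, not code from the text):

#define BLOCK 256                                    // assumed block size

// Hypothetical kernel: each block sums its 256 input elements into one output value.
__global__ void sum_block(const int *in, int *out) {
    __shared__ int buf[BLOCK];                       // shared memory, visible to all threads of the block
    int tid = threadIdx.x;
    buf[tid] = in[tid + blockIdx.x * blockDim.x];    // each thread loads one element
    __syncthreads();                                 // wait until every thread has written its element

    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];           // tree reduction inside the block
        __syncthreads();                             // every thread must reach this point each iteration
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];                    // thread 0 writes the block's result
}

// Block and grid sizes can also be given as dim3 values:
// dim3 block(BLOCK, 1, 1);                          // (x, y, z)
// dim3 grid(n / BLOCK, 1);                          // (x, y), assuming n is a multiple of BLOCK
// sum_block<<<grid, block>>>(d_in, d_out);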
*Indexing Arrays with Blocks and Threads
Consider indexing an array with one element per thread (8 threads/block)
Figure: block - thread Id
With M threads per block, a unique index for each thread is given by:
int index = threadIdx.x + blockIdx.x * M;
or using
int index = threadIdx.x + blockIdx.x * blockDim.x; // blockDim.x = number of threads per block
Figure: element 21 is handled by thread 5 of block 2, since threadIdx.x + blockIdx.x * M = 5 + 2*8 = 21
Example: Vector Addition with Blocks and Threads

#include <stdlib.h>

#define N (2048*2048)
#define THREADS_PER_BLOCK 512

// Helper assumed by the original listing (not shown there): fill an array with random integers.
void random_ints(int *p, int n) {
    for (int i = 0; i < n; ++i)
        p[i] = rand();
}

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    // Avoid accessing beyond the end of the arrays
    if (index < n)
        c[index] = a[index] + b[index];
}

int main(void) {
    int *a, *b, *c;          // host copies of a, b, c
    int *d_a, *d_b, *d_c;    // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with THREADS_PER_BLOCK threads per block
    add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, N);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
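Note: the launch uses N/THREADS_PER_BLOCK blocks, which is exact here because N (2048*2048) is a multiple of 512. For sizes that are not a multiple of the block size, the grid size should be rounded up, e.g. (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK; the if (index < n) guard in the kernel then stops the extra threads from writing past the end of the arrays.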