GPU concept

1. GPU
- The Graphics Processing Unit (GPU) is a processor originally specialized for graphics processing, but it can run any algorithm, not only graphics, with high efficiency and performance.
GPU computing - features:
- Massively parallel.
- Hundreds of cores.
- Thousands of threads.
- Cheap.
- Highly available.
- Compute throughput:
    + ~1 teraflop (single precision)
    + ~100 Gflops (double precision)
- Programmable: CUDA.
- Important factors to consider: power and cooling.
Figure: GPU vs CPU
Note:
- The amount of RAM on a GPU is typically the first spec listed when naming the card. Roughly, the more VRAM a card has, the more complex the tasks it can load. If the tasks you run overload your GPU's VRAM, the overflow spills into system RAM, which degrades performance significantly.
- On Linux with an NVIDIA GPU, the commands "nvidia-smi", "watch -n 1 nvidia-smi", or "nvidia-settings" show how much GPU memory is used and which processes are using the GPU (a programmatic alternative is sketched below).
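As a complement to nvidia-smi, the free and total device memory can also be queried from code through the CUDA runtime API. A minimal sketch, assuming a single default device (the file name and error handling are illustrative):

// query_mem.cu - print free/total memory of the current CUDA device
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    size_t free_bytes = 0, total_bytes = 0;

    // cudaMemGetInfo reports memory of the device currently in use
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("GPU memory: %zu MB free / %zu MB total\n",
           free_bytes / (1024 * 1024), total_bytes / (1024 * 1024));
    return 0;
}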
2. Compute Unified Device Architecture (CUDA)
Its characteristics:
- A compiler and toolkit for programming NVIDIA GPUs.
- API that extends the C programming language (a minimal kernel sketch follows this list).
- Runs on thousands of threads.
- Scalable model.
- Parallelism.
- Gives a high-level abstraction of the hardware.
- The CUDA language is vendor dependent (NVIDIA GPUs only).
- OpenCL is the industry-standard alternative; it is a lower-level specification and more complex to program with than CUDA C.
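To show how CUDA extends C, here is a minimal sketch of a kernel and its launch; the kernel name scale and the launch configuration are illustrative, not part of the CUDA API:

// scale.cu - minimal example of the CUDA C extensions: a __global__ kernel
// and the <<<blocks, threads>>> launch syntax
__global__ void scale(float *data, float factor, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // global index of this thread
    if (i < n)
        data[i] *= factor;                          // each thread handles one element
}

// Launched from host code on a device array d_data of n floats, e.g.:
//   scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);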
3. CUDA architecture
- Abstracts away from the hardware.
- Automatic thread management (can handle more than 100,000 threads).
- Languages: C, C++, OpenCL.
- OS: Windows, Linux, OS X.
- Host: the CPU and its memory (host memory).
- Device: the GPU and its memory (device memory).
Figure: CUDA architecture
The user only has to take care of:
- Analyzing the algorithm to expose parallelism, e.g. choosing the block size and number of threads (a launch-configuration sketch follows this list).
- Resource management: efficient data transfers and the local data set (register space and other on-chip memory are limited).
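As a sketch of the first point, the launch configuration is usually derived from the problem size and a chosen block size. Everything here (the kernel process2D, the 16x16 block, the doubling of each element) is an illustrative assumption:

// launch_config.cu - deriving grid dimensions from a 2D problem size
__global__ void process2D(float *data, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column handled by this thread
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row handled by this thread
    if (x < width && y < height)
        data[y * width + x] *= 2.0f;                 // placeholder per-element work
}

void launch(float *d_data, int width, int height) {
    dim3 block(16, 16);                              // 256 threads per block
    dim3 grid((width  + block.x - 1) / block.x,      // round up so the whole
              (height + block.y - 1) / block.y);     // domain is covered
    process2D<<<grid, block>>>(d_data, width, height);
}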
4. Grid - Block - Thread - Kernel
4.1 Grid
The set of blocks is referred to as a grid.
4.2 Block
- Blocks are grouped in a grid.
- Blocks are independent.
- Each invocation can refer to its block index using blockIdx.x.
4.3 Thread
- Kernels are executed by threads.
- Each thread has an ID.
- Thousands of threads execute same kernel.
- Threads are grouped into blocks.
- Threads in a block can synchronize execution.
- Each invocation can refer to its thread index using threadIdx.x.
4.4 Kernel
- A kernel is a function, written in C, that runs on the GPU; thousands of threads execute the same kernel in parallel (a short sketch follows the figure below).
Figure: Block - Thread - Kernel
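A minimal sketch tying these terms together: the kernel below (whoami is an illustrative name) prints the coordinates each thread reads from the built-in variables. It assumes a device that supports printf in device code (compute capability 2.0 or later):

// whoami.cu - each thread reports its position in the grid/block hierarchy
#include <stdio.h>

__global__ void whoami(void) {
    printf("grid of %d blocks, block %d with %d threads, I am thread %d\n",
           gridDim.x, blockIdx.x, blockDim.x, threadIdx.x);
}

int main(void) {
    whoami<<<2, 4>>>();          // a grid of 2 blocks, 4 threads per block
    cudaDeviceSynchronize();     // wait for the kernel (and its printf) to finish
    return 0;
}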
5. Work flow
Figure: GPU calculation Work flow
6. C extensions
Block size: (x, y, z), with x*y*z up to a maximum of 768 threads in total (hardware dependent).
Grid size: (x, y), measured in blocks; it allows launching many thousands of threads (hardware dependent).
__global__ : called by the host but executed by the GPU (a kernel).
__host__ : called and executed by the host.
__shared__ : variable placed in shared memory.
__syncthreads() : synchronization of the threads within a block (a shared-memory sketch follows this list).
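A short sketch of __shared__ and __syncthreads() working together: each block reverses its own chunk of the array through shared memory. It assumes the kernel is launched with exactly BLOCK_SIZE threads per block and an array length that is a multiple of BLOCK_SIZE; all names are illustrative:

// reverse_in_block.cu - each block reverses its BLOCK_SIZE elements, staging
// them in on-chip shared memory and synchronizing before writing back
#define BLOCK_SIZE 256   // assumed block size for this sketch

__global__ void reverse_in_block(int *data) {
    __shared__ int tile[BLOCK_SIZE];      // one tile per block, in shared memory

    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;  // global index of this thread's element

    tile[t] = data[i];                    // stage the block's elements
    __syncthreads();                      // wait until every thread has written its element

    data[i] = tile[blockDim.x - 1 - t];   // write back in reversed order
}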
*Indexing Arrays with Blocks and Threads
Consider indexing an array with one element per thread (8 threads/block)
Figure: block - thread Id
With M threads per block, a unique index for each thread is given by:
int index = threadIdx.x + blockIdx.x * M;
or using
int index = threadIdx.x + blockIdx.x * blockDim.x; // blockDim.x = number of threads per block
Figure: element 21 is handled by thread 5 of block 2, following threadIdx.x + blockIdx.x * M = 5 + 2*8 = 21
Example: Vector Addition with Blocks and Threads
#include <stdlib.h>

#define N (2048*2048)
#define THREADS_PER_BLOCK 512

// Fill an array with random integers (simple helper used to set up the inputs)
void random_ints(int *p, int n) {
    for (int i = 0; i < n; ++i)
        p[i] = rand();
}

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    // Avoid accessing beyond the end of the arrays
    if (index < n)
        c[index] = a[index] + b[index];
}

int main(void) {
    int *a, *b, *c;       // host copies of a, b, c
    int *d_a, *d_b, *d_c; // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and set up input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on the GPU with THREADS_PER_BLOCK threads per block
    add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, N);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
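The example assumes N is an exact multiple of THREADS_PER_BLOCK, so N/THREADS_PER_BLOCK blocks cover every element. For an arbitrary n, the usual pattern is to round the number of blocks up and rely on the if (index < n) guard in the kernel:

add<<<(N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, N);

The program can be compiled with NVIDIA's nvcc compiler, for example: nvcc vector_add.cu -o vector_add (the file name is illustrative).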
