# Very Quick Introduction to CUDA

#### **Burak Himmetoglu**

Supercomputing Consultant

Enterprise Technology Services & Center for Scientific Computing, University of California Santa Barbara

e-mail: bhimmetoglu@ucsb.edu



- CPUs are **latency** oriented (minimize the execution time of serial code)
- GPUs are **throughput** oriented (maximize the number of floating-point operations performed per unit time)

# Data Parallelism

E.g. vector addition, serial vs. parallel
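The contrast can be sketched as follows, assuming integer vectors of length `n`. The names `add_serial` and `add_parallel` are illustrative and do not come from the slides. The serial version visits the elements one after another in a loop; the parallel version has no loop, because each GPU thread computes a single element. The index arithmetic in the kernel is covered under Thread Organization.

```cuda
#include <cuda_runtime.h>

// Serial: a single CPU thread walks through all n elements.
void add_serial(const int *a, const int *b, int *c, int n) {
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

// Parallel: the loop disappears; thread i computes element i.
// Assumes the launch provides at least n threads in total and that
// a, b, c point to device memory.
__global__ void add_parallel(const int *a, const int *b, int *c, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {              // surplus threads do nothing
        c[i] = a[i] + b[i];
    }
}
```

Both functions produce the same result; the difference is only in who does the work, one thread repeatedly or many threads once each.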





# CUDA C

- Compute Unified Device Architecture
- NVIDIA GPUs can be programmed with CUDA, an extension of the C language (CUDA Fortran is also available)
- CUDA C is compiled with nvcc
- Host -> CPU; Device -> GPU (they do not share memory!)
- The HOST launches a kernel that executes on the DEVICE
- A kernel is a data-parallel computation, executed by many threads.
- The number of threads is very large (~1000 or more)



#### **Thread Organization**


- Threads are grouped into blocks.
- The threads within a block share memory.

E.g. vector addition:
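For a vector processed by a one-dimensional launch, each thread finds its element by combining its block index, the block size, and its index within the block. The sketch below records that mapping for every element. The kernel `record_ids` and the wrapper `show_thread_layout` are hypothetical names used only for illustration.

```cuda
#include <cuda_runtime.h>

// Each thread computes its global index and records where it came from.
__global__ void record_ids(int *block_of, int *thread_of, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // unique global index
    if (i < n) {
        block_of[i]  = blockIdx.x;   // block this thread belongs to
        thread_of[i] = threadIdx.x;  // position of the thread inside its block
    }
}

// Hypothetical wrapper: launches `blocks` blocks of `threads` threads and
// copies the recorded IDs into two host arrays of length blocks * threads.
cudaError_t show_thread_layout(int blocks, int threads,
                               int *block_of, int *thread_of) {
    int n = blocks * threads;
    size_t bytes = n * sizeof(int);
    int *dev_block = NULL, *dev_thread = NULL;

    cudaError_t err = cudaMalloc((void**)&dev_block, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&dev_thread, bytes);
    if (err == cudaSuccess) {
        record_ids<<<blocks, threads>>>(dev_block, dev_thread, n);
        err = cudaGetLastError();    // catches an invalid launch configuration
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(block_of, dev_block, bytes, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(thread_of, dev_thread, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev_block);             // cudaFree(NULL) is a harmless no-op
    cudaFree(dev_thread);
    return err;
}
```

With 2 blocks of 4 threads, elements 0 to 3 are recorded as block 0, threads 0 to 3, and elements 4 to 7 as block 1, threads 0 to 3.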




- The blocks of a grid and the threads of a block can also be arranged in 2D arrays (useful for image processing)

```
dim3 blocks(2,2);
dim3 threads(16,16);
// ...
kernel<<<blocks, threads>>>( );
// ...
```

Figure: a 2 x 2 grid of blocks, block(0,0), block(1,0), block(0,1), block(1,1); each block holds 16 x 16 threads, Thread (0,0), Thread (1,0), ..., Thread (0,15), Thread (1,15), ...
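A sketch of a 2D launch in the spirit of the image-processing use case, assuming a grayscale image stored row by row in a flat array of `unsigned char`. The kernel `invert` and the wrapper `invert_image` are hypothetical names. Each thread derives an (x, y) pixel coordinate from its 2D block and thread indices.

```cuda
#include <cuda_runtime.h>

// Each thread inverts one pixel of a width x height grayscale image.
__global__ void invert(unsigned char *img, int width, int height) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;  // column
    int y = threadIdx.y + blockIdx.y * blockDim.y;  // row
    if (x < width && y < height) {                  // skip threads outside the image
        int idx = y * width + x;                    // row-major offset
        img[idx] = 255 - img[idx];
    }
}

// Hypothetical wrapper: inverts the host image in place using 16 x 16 thread blocks.
cudaError_t invert_image(unsigned char *img, int width, int height) {
    size_t bytes = (size_t)width * height;
    unsigned char *dev_img = NULL;

    cudaError_t err = cudaMalloc((void**)&dev_img, bytes);
    if (err != cudaSuccess) return err;
    err = cudaMemcpy(dev_img, img, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        dim3 threads(16, 16);
        // Round up so that the grid covers the whole image.
        dim3 blocks((width + threads.x - 1) / threads.x,
                    (height + threads.y - 1) / threads.y);
        invert<<<blocks, threads>>>(dev_img, width, height);
        err = cudaMemcpy(img, dev_img, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dev_img);
    return err;
}
```

For a 32 x 32 image the rounding gives `blocks(2,2)`, which matches the launch configuration shown above.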

Hello World!



Halt host thread execution on the CPU until the device has finished processing all previously requested tasks.
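In the CUDA runtime, the call that behaves this way is `cudaDeviceSynchronize()`. A minimal Hello World sketch shows why it matters: a kernel launch returns to the host immediately, so without the wait the program could exit before the device output appears. The kernel name `hello` and the wrapper `say_hello` are illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: every thread prints its block and thread index.
__global__ void hello() {
    printf("Hello World from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

// Hypothetical wrapper: launches the kernel and waits for it to finish.
cudaError_t say_hello(int blocks, int threads) {
    hello<<<blocks, threads>>>();   // asynchronous: control returns at once
    // Halt the host until the device is done; this also flushes the
    // device-side printf buffer.
    return cudaDeviceSynchronize();
}
```

Calling `say_hello(2, 4)` prints eight lines, one per thread. The order of the lines is not guaranteed.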

Vector Addition (Very large vectors)

```
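// N is the vector length, assumed to be defined elsewhere (e.g. with #define)
__global__ void add( int *a, int *b, int *c){
    int tid = threadIdx.x + blockIdx.x * blockDim.x ; // handle the data at this index
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x; // advance by the total number of threads
    }
}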
```

E.g. blockDim = 4, gridDim = 4: each block contains threads th 0 to th 3, so 16 threads run in total and `tid` advances in steps of 16.




E.g. N = 256, blockDim = 2, gridDim = 2, so offset = blockDim \* gridDim = 4
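To make the stride concrete, the sketch below records which thread handles each element, using the same loop structure as the `add` kernel. The names `record_owner` and `map_owners` are hypothetical. With blockDim = 2 and gridDim = 2 there are four threads in total, so thread `t` handles elements `t`, `t + 4`, `t + 8`, and so on.

```cuda
#include <cuda_runtime.h>

// Same stride loop as the add kernel, but it stores the global index of
// the thread that visits each element instead of a sum.
__global__ void record_owner(int *owner, int n) {
    int first = threadIdx.x + blockIdx.x * blockDim.x;  // global thread index
    int tid = first;
    while (tid < n) {
        owner[tid] = first;                 // element tid is handled by thread `first`
        tid += blockDim.x * gridDim.x;      // jump ahead by the total number of threads
    }
}

// Hypothetical wrapper: fills the host array owner[0..n-1].
cudaError_t map_owners(int *owner, int n, int blocks, int threads) {
    int *dev_owner = NULL;
    cudaError_t err = cudaMalloc((void**)&dev_owner, n * sizeof(int));
    if (err != cudaSuccess) return err;

    record_owner<<<blocks, threads>>>(dev_owner, n);
    // The copy is issued after the kernel, so it sees the finished result.
    err = cudaMemcpy(owner, dev_owner, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_owner);
    return err;
}
```

After `map_owners(owner, 256, 2, 2)`, every entry satisfies `owner[i] == i % 4`: four threads cover all 256 elements, 64 elements each.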



- Define the arrays to be used on the HOST, and allocate memory on the DEVICE.

```
int a[N], b[N], c[N];
int *dev_a, *dev_b, *dev_c;
// Allocate memory on the GPU
cudaMalloc( (void**)&dev_a, N * sizeof(int) );
cudaMalloc( (void**)&dev_b, N * sizeof(int) );
cudaMalloc( (void**)&dev_c, N * sizeof(int) );
```

- Copy arrays to the DEVICE

```
//Copy the arrays 'a' and 'b' to the GPU
cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice );
```

- Launch the kernel, then copy the result from DEVICE to HOST

```
add<<<128,128>>>( dev_a, dev_b, dev_c ); // Launch 128 blocks, each containing 128 threads
// Copy the array 'c' back from the GPU to the CPU
cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost );
```

- Free memory

```
//Free memory
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
```
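Put together, the steps above form one routine. The sketch below is an assumption-laden assembly rather than the original program: the value of `N` is not given in the slides (33 * 1024 is chosen here so that N exceeds the 128 x 128 launched threads and the stride loop runs more than once), the input values are arbitrary, and the `CHECK` macro and the function `run_vector_add` are hypothetical additions for error reporting and verification.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N (33 * 1024)   // assumed vector length; not specified in the slides

// Hypothetical helper: report a failed CUDA call and return -1.
// For brevity, device memory is not released on this error path.
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            return -1;                                                   \
        }                                                                \
    } while (0)

__global__ void add(int *a, int *b, int *c) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }
}

// Returns the number of wrong elements (0 on success), or -1 on a CUDA error.
int run_vector_add() {
    static int a[N], b[N], c[N];      // static: keeps large arrays off the stack
    int *dev_a, *dev_b, *dev_c;
    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2 * i; }

    // Allocate memory on the GPU
    CHECK(cudaMalloc((void**)&dev_a, N * sizeof(int)));
    CHECK(cudaMalloc((void**)&dev_b, N * sizeof(int)));
    CHECK(cudaMalloc((void**)&dev_c, N * sizeof(int)));

    // Copy the inputs to the GPU, launch, and copy the result back
    CHECK(cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice));
    add<<<128, 128>>>(dev_a, dev_b, dev_c);
    CHECK(cudaGetLastError());        // detects a failed launch
    CHECK(cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost));

    // Verify on the host: a[i] + b[i] should equal 3 * i
    int wrong = 0;
    for (int i = 0; i < N; ++i) if (c[i] != 3 * i) ++wrong;

    // Free memory
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return wrong;
}
```

Calling `run_vector_add()` from `main` and compiling the file with `nvcc` gives a complete program; a return value of 0 means every element was added correctly.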

#### Dot product



- Recall that the threads of each block share memory!
- Each block has its own copy of **cache[]**, which holds that block's partial results.
- The final step is a reduction, i.e. summing all the partial results in **cache[]** to obtain a final answer.
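One common way to organize this is sketched below; it is not necessarily the code from the original slide. Each thread accumulates its own partial sum, stores it in the block's shared `cache[]`, and the block then halves the number of active threads repeatedly until `cache[0]` holds the block total. The per-block totals are summed on the host. `THREADS_PER_BLOCK`, the block count of 32, the `float` element type, and the wrapper `dot_product` are assumptions made for this example.

```cuda
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256   // assumed; must be a power of two for the reduction

__global__ void dot(const float *a, const float *b, float *partial, int n) {
    __shared__ float cache[THREADS_PER_BLOCK];   // one copy per block

    // Each thread accumulates products over its strided share of the vectors.
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float temp = 0.0f;
    while (tid < n) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }
    cache[threadIdx.x] = temp;
    __syncthreads();             // all partial sums must be written before reducing

    // Reduction: halve the number of active threads at every step.
    for (int s = blockDim.x / 2; s > 0; s /= 2) {
        if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = cache[0];  // block total
}

// Hypothetical wrapper (error checking omitted for brevity).
float dot_product(const float *a, const float *b, int n) {
    const int blocks = 32;       // assumed number of blocks
    float partial[blocks] = {0.0f};
    float *dev_a, *dev_b, *dev_partial;

    cudaMalloc((void**)&dev_a, n * sizeof(float));
    cudaMalloc((void**)&dev_b, n * sizeof(float));
    cudaMalloc((void**)&dev_partial, blocks * sizeof(float));
    cudaMemcpy(dev_a, a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, n * sizeof(float), cudaMemcpyHostToDevice);

    dot<<<blocks, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_partial, n);
    cudaMemcpy(partial, dev_partial, blocks * sizeof(float), cudaMemcpyDeviceToHost);

    // Finish on the host: add up the per-block totals.
    float sum = 0.0f;
    for (int i = 0; i < blocks; ++i) sum += partial[i];

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_partial);
    return sum;
}
```

The `__syncthreads()` calls are what make the shared `cache[]` safe to use: no thread reads an entry before the thread responsible for it has written it.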

