Getting Started with CUDA Programming: Vector Addition
CUDA (Compute Unified Device Architecture) is a parallel computing platform and API developed by NVIDIA for harnessing the power of Graphics Processing Units (GPUs) to accelerate general-purpose computation. In this tutorial, we will explore the basics of CUDA programming by creating a simple CUDA program for vector addition. We will cover key concepts such as GPU memory management, kernel functions, and data transfer between the host (CPU) and the device (GPU).
Prerequisites
Before you begin, ensure that you have the following prerequisites in place:
- NVIDIA GPU: You’ll need an NVIDIA GPU on your machine to perform CUDA programming. Most modern NVIDIA GPUs are supported.
- NVIDIA CUDA Toolkit: Install the NVIDIA CUDA Toolkit, which includes the CUDA compiler (nvcc) and the libraries required for CUDA development.
- C/C++ Knowledge: Basic knowledge of C or C++ programming is helpful but not mandatory.
Step 1: Setting Up Your Development Environment
First, make sure you have the NVIDIA CUDA Toolkit installed on your system. You can download it from the official NVIDIA website and follow the installation instructions provided for your specific platform.
Step 2: Creating a CUDA Source File
Create a new text file with a .cu extension. CUDA source files typically use this extension to indicate that they contain CUDA C/C++ code. In this example, we'll name the file vector_addition.cu.
Step 3: Writing the CUDA Code
Open vector_addition.cu in your favourite text editor or integrated development environment (IDE) and add the following code:
#include <iostream>

// CUDA kernel to add two vectors
__global__ void vectorAdd(int* a, int* b, int* c, int n) {
    // Global thread index: one thread per output element
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    int n = 1024; // Number of elements in the vectors

    // Host memory
    int* h_a = new int[n];
    int* h_b = new int[n];
    int* h_c = new int[n];

    // Initialize input vectors
    for (int i = 0; i < n; ++i) {
        h_a[i] = i;
        h_b[i] = i * 2;
    }

    // Device memory
    int* d_a;
    int* d_b;
    int* d_c;
    cudaMalloc((void**)&d_a, sizeof(int) * n);
    cudaMalloc((void**)&d_b, sizeof(int) * n);
    cudaMalloc((void**)&d_c, sizeof(int) * n);

    // Copy input vectors from host to device
    cudaMemcpy(d_a, h_a, sizeof(int) * n, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, sizeof(int) * n, cudaMemcpyHostToDevice);

    // Launch kernel
    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;
    vectorAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy result from device to host
    cudaMemcpy(h_c, d_c, sizeof(int) * n, cudaMemcpyDeviceToHost);

    // Print the result
    for (int i = 0; i < n; ++i) {
        std::cout << h_c[i] << " ";
    }
    std::cout << std::endl;

    // Free device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Free host memory
    delete[] h_a;
    delete[] h_b;
    delete[] h_c;

    return 0;
}
Step 4: Compiling the CUDA Code
To compile the CUDA code, use the nvcc compiler provided by the CUDA Toolkit. Open your terminal or command prompt, navigate to the directory containing vector_addition.cu, and execute the following command:
nvcc -o vector_addition vector_addition.cu
This command tells nvcc to compile vector_addition.cu and generate an executable named vector_addition.
Step 5: Running the CUDA Program
After successful compilation, run the CUDA program by executing the following command:
./vector_addition
You should see the result, the element-wise sum of h_a and h_b, printed to the console. Since h_a[i] = i and h_b[i] = 2i, each output element equals 3i, so the output starts 0 3 6 9 ... and ends at 3069.
Conclusion
Congratulations! You’ve created a simple CUDA program for vector addition. This example covers the fundamentals of CUDA programming, including kernel functions, memory management, and data transfer between the host and device. You can use this knowledge as a foundation for more complex CUDA applications and explore GPU acceleration for various computational tasks.