
Getting Started with CUDA Programming: Vector Addition

CUDA (Compute Unified Device Architecture) is a parallel computing platform and API developed by NVIDIA for harnessing the power of Graphics Processing Units (GPUs) to accelerate general-purpose computation. In this tutorial, we will explore the basics of CUDA programming by creating a simple CUDA program for vector addition. We will cover key concepts such as GPU memory management, kernel functions, and data transfer between the host (CPU) and the device (GPU).


Prerequisites

Before you begin, ensure that you have the following prerequisites in place:

  1. NVIDIA GPU: You’ll need an NVIDIA GPU on your machine to perform CUDA programming. Most modern NVIDIA GPUs are supported.
  2. NVIDIA CUDA Toolkit: Install the NVIDIA CUDA Toolkit, which includes the CUDA compiler (nvcc) and libraries required for CUDA development.
  3. C/C++ Knowledge: The examples are written in CUDA C++, so basic familiarity with C or C++ is strongly recommended.

Step 1: Setting Up Your Development Environment

First, make sure you have the NVIDIA CUDA Toolkit installed on your system. You can download it from the official NVIDIA website and follow the installation instructions provided for your specific platform.
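Once installed, you can confirm that the compiler and driver are visible from the command line (both tools ship with the CUDA Toolkit and the NVIDIA driver respectively; the exact output varies by version and platform):

```shell
nvcc --version   # prints the CUDA compiler (nvcc) release
nvidia-smi       # lists detected GPUs and the installed driver version
```

If either command is not found, revisit the installation instructions for your platform before continuing.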

Step 2: Creating a CUDA Source File

Create a new text file with a .cu extension. CUDA source files typically use this extension to indicate that they contain CUDA C/C++ code. In this example, we’ll name the file vector_addition.cu.
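On a Unix-like system you can create the empty source file directly from a terminal (Windows users can simply create the file from their editor instead):

```shell
touch vector_addition.cu
```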

Step 3: Writing the CUDA Code

Open vector_addition.cu in your favourite text editor or integrated development environment (IDE) and add the following code:

#include <cuda_runtime.h> // included implicitly by nvcc, but explicit for clarity
#include <iostream>

// CUDA kernel to add two vectors
__global__ void vectorAdd(int* a, int* b, int* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    int n = 1024; // Number of elements in the vectors

    // Host memory
    int* h_a = new int[n];
    int* h_b = new int[n];
    int* h_c = new int[n];

    // Initialize input vectors
    for (int i = 0; i < n; ++i) {
        h_a[i] = i;
        h_b[i] = i * 2;
    }

    // Device memory
    int* d_a;
    int* d_b;
    int* d_c;

    cudaMalloc((void**)&d_a, sizeof(int) * n);
    cudaMalloc((void**)&d_b, sizeof(int) * n);
    cudaMalloc((void**)&d_c, sizeof(int) * n);

    // Copy input vectors from host to device
    cudaMemcpy(d_a, h_a, sizeof(int) * n, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, sizeof(int) * n, cudaMemcpyHostToDevice);

    // Launch kernel
    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;
    vectorAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy result from device to host
    cudaMemcpy(h_c, d_c, sizeof(int) * n, cudaMemcpyDeviceToHost);

    // Print the result
    for (int i = 0; i < n; ++i) {
        std::cout << h_c[i] << " ";
    }
    std::cout << std::endl;

    // Free device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Free host memory
    delete[] h_a;
    delete[] h_b;
    delete[] h_c;

    return 0;
}

Step 4: Compiling the CUDA Code

To compile the CUDA code, use the nvcc compiler provided by the CUDA Toolkit. Open your terminal or command prompt, navigate to the directory containing vector_addition.cu, and execute the following command:

nvcc -o vector_addition vector_addition.cu

This command tells nvcc to compile vector_addition.cu and generate an executable named vector_addition.
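nvcc also accepts flags to target a specific GPU architecture via its -arch option; for example (sm_70 is only an illustration, substitute your GPU's compute capability):

```shell
nvcc -arch=sm_70 -o vector_addition vector_addition.cu
```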

Step 5: Running the CUDA Program

After successful compilation, run the CUDA program by executing the following command (on Windows, run vector_addition.exe instead):

./vector_addition

You should see the element-wise sum of h_a and h_b printed to the console. Since h_a[i] = i and h_b[i] = 2 * i, each result is h_c[i] = 3 * i, so the output begins 0 3 6 9 12 and so on.


Conclusion

Congratulations! You’ve created a simple CUDA program for vector addition. This example covers the fundamentals of CUDA programming, including kernel functions, memory management, and data transfer between the host and device. You can use this knowledge as a foundation for more complex CUDA applications and explore GPU acceleration for various computational tasks.
