Playing with cuda block size

Posted on February 4, 2019February 22, 2019 by Julia Nedzelskaya

Recently I was working on algorithm implementation using NVIDIA CUDA. For testing reasons I was using a tiny-toy data sample to check whether the algorithm worked as expected. I was concentrated on what I was doing, not on how it has to be done. And optimization was the last thing to do.

After I have reached the point when things went as supposed to, I faced the problem of CPU implementation working much faster than my CUDA one. And then I suddenly realized that while I was playing with the implementation, I have set the block size to 1 and grid size to the total number of threads I wanted to be run.

After I have fine-tuned the kernels invocation range, algorithm execution time dropped dramatically. It became absolutely clear to me why there was such a gap in the performance with a block size of 1. There is no SIMT advantage in this case, so threads run in a parallel way as it is possible to run multiple blocks/warps at the same time. However there is no advantage of, for example, global memory coalescing.

Thus I decided to write a dummy program to see how a block size affects kernel execution time.

Hardware

Device 0: "GeForce 840M"

CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 2002 MBytes (2099642368 bytes)
( 3) Multiprocessors, (128) CUDA Cores/MP: 384 CUDA Cores

...

Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)

Test program

#include <iostream>
#include <sstream>

#include <cuda.h>
#include <curand.h>


void curandFillNormally(float *A, const size_t height, const size_t width = 1,
        const float mean = 0.0f, const float stddev = 1.0f) {
    curandGenerator_t prng;
    curandCreateGenerator(&prng, CURAND_RNG_PSEUDO_MT19937);

    curandSetPseudoRandomGeneratorSeed(prng, (unsigned long long) clock());

    curandGenerateNormal(prng, A, height * width, mean, stddev);
}

void cudaCheckError(cudaError_t status) {
    if (status != cudaSuccess) {
        std::stringstream ss;
        ss << "CUDA error: " << cudaGetErrorString(status);
        throw std::runtime_error(ss.str());
    }
}

__global__ void dummyKernel(const float * __restrict__ in,
			    float * __restrict__ out,
			    const uint32_t N) {
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const int stride = blockDim.x * gridDim.x;
    
    for (int i = tid; i < N; i += stride) {
        // Some messy math expression.
        out[i] = sqrtf(expf(pow(in[i] / 4.0, 3.5)));
    }
}


int main(int argc, char **argv) {
    const size_t size = 100 * 1024 * 1024;

    float *data = nullptr;
    float *out = nullptr;

    cudaCheckError(cudaMalloc(&data, size * sizeof(float)));
    cudaCheckError(cudaMalloc(&out, size * sizeof(float)));

    curandFillNormally(data, size);

    for (size_t blockSize: {1, 2, 4, 8, 16, 32, 128, 256, 512, 1024}) {
        cudaEvent_t start, stop;
        cudaCheckError(cudaEventCreate(&start));
        cudaCheckError(cudaEventCreate(&stop));

        cudaEventRecord(start);
        dummyKernel<<<size / blockSize, blockSize>>>(data, out, size);
        cudaEventRecord(stop);

        cudaCheckError(cudaEventSynchronize(stop));
        float elapsedTime = 0.0f;
        cudaEventElapsedTime(&elapsedTime, start, stop);

        std::cout << "Block size: " << blockSize << ", elapsed time: " <<
            elapsedTime << " ms" << std::endl;
    }

    cudaCheckError(cudaFree(data));
    cudaCheckError(cudaFree(out));
}

Results

As we can see, starting from the block size of 32 the execution time graph is saturated, and there are no performance improvements. Even though we are moving towards the maximum number of threads per block.

It is noticeable that the saturation point is a size of the warp. As I do not use shared memory for this particular toy kernel, there is no performance improvement with further block size growth.

Conclusion

Simply do not forget that your block size should be at least equal to the warp size. If for some reason you use shared memory in your kernel, maximizing the block size seems to be the best solution for using all of the advantages of fast shared memory.

Hardware

Test program

Results

Conclusion

Leave a Reply Cancel reply