{"id":82,"date":"2019-02-04T14:27:04","date_gmt":"2019-02-04T14:27:04","guid":{"rendered":"https:\/\/new.nestlogic.com\/?p=82"},"modified":"2019-02-22T14:49:21","modified_gmt":"2019-02-22T14:49:21","slug":"playing-with-cuda-block-size","status":"publish","type":"post","link":"https:\/\/nestlogic.com\/index.php\/2019\/02\/04\/playing-with-cuda-block-size\/","title":{"rendered":"Playing with cuda block size"},"content":{"rendered":"\n<p>Recently I was working on algorithm implementation using NVIDIA CUDA. For testing reasons I was using a tiny-toy data sample to check whether the algorithm worked as expected. I was concentrated on what I was doing, not on how it has to be done. And optimization was the last thing to do.<\/p>\n\n\n\n<p>After I have reached the point when things went as supposed to, I faced the problem of CPU implementation working much faster than my CUDA one. And then I suddenly realized that while I was playing with the implementation, I have set the block size to 1 and grid size to the total number of threads I wanted to be run.<\/p>\n\n\n\n<p>After I have fine-tuned the kernels invocation range, algorithm execution time dropped dramatically. It became absolutely clear to me why there was such a gap in the performance with a block size of 1. There is no SIMT advantage in this case, so threads run in a parallel way as it is possible to run multiple blocks\/warps at the same time. 
Nor is there any benefit from, for example, global memory coalescing.<\/p>\n\n\n\n<p>Thus I decided to write a dummy program to see how the block size affects kernel execution time.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Hardware<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code>Device 0: \"GeForce 840M\"\n\nCUDA Driver Version \/ Runtime Version 8.0 \/ 8.0\nCUDA Capability Major\/Minor version number: 5.0\nTotal amount of global memory: 2002 MBytes (2099642368 bytes)\n( 3) Multiprocessors, (128) CUDA Cores\/MP: 384 CUDA Cores\n\n...\n\nMaximum number of threads per multiprocessor: 2048\nMaximum number of threads per block: 1024\nMax dimension size of a thread block (x,y,z): (1024, 1024, 64)\nMax dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Test program<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code>#include &lt;cstdint>\n#include &lt;ctime>\n#include &lt;iostream>\n#include &lt;sstream>\n#include &lt;stdexcept>\n\n#include &lt;cuda_runtime.h>\n#include &lt;curand.h>\n\n\nvoid curandFillNormally(float *A, const size_t height, const size_t width = 1,\n        const float mean = 0.0f, const float stddev = 1.0f) {\n    curandGenerator_t prng;\n    curandCreateGenerator(&amp;prng, CURAND_RNG_PSEUDO_MT19937);\n\n    curandSetPseudoRandomGeneratorSeed(prng, (unsigned long long) clock());\n\n    curandGenerateNormal(prng, A, height * width, mean, stddev);\n    curandDestroyGenerator(prng);\n}\n\nvoid cudaCheckError(cudaError_t status) {\n    if (status != cudaSuccess) {\n        std::stringstream ss;\n        ss &lt;&lt; \"CUDA error: \" &lt;&lt; cudaGetErrorString(status);\n        throw std::runtime_error(ss.str());\n    }\n}\n\n__global__ void dummyKernel(const float * __restrict__ in,\n\t\t\t    float * __restrict__ out,\n\t\t\t    const uint32_t N) {\n    const int tid = blockIdx.x * blockDim.x + threadIdx.x;\n    const int stride = blockDim.x * gridDim.x;\n    \n    for (int i = tid; i &lt; N; i += stride) {\n        \/\/ Some messy math expression.\n        out[i] = sqrtf(expf(powf(in[i] \/ 4.0f, 
3.5f)));\n    }\n}\n\n\nint main(int argc, char **argv) {\n    const size_t size = 100 * 1024 * 1024;\n\n    float *data = nullptr;\n    float *out = nullptr;\n\n    cudaCheckError(cudaMalloc(&amp;data, size * sizeof(float)));\n    cudaCheckError(cudaMalloc(&amp;out, size * sizeof(float)));\n\n    curandFillNormally(data, size);\n\n    for (size_t blockSize: {1, 2, 4, 8, 16, 32, 128, 256, 512, 1024}) {\n        cudaEvent_t start, stop;\n        cudaCheckError(cudaEventCreate(&amp;start));\n        cudaCheckError(cudaEventCreate(&amp;stop));\n\n        cudaEventRecord(start);\n        dummyKernel&lt;&lt;&lt;size \/ blockSize, blockSize>>>(data, out, size);\n        cudaCheckError(cudaGetLastError());\n        cudaEventRecord(stop);\n\n        cudaCheckError(cudaEventSynchronize(stop));\n        float elapsedTime = 0.0f;\n        cudaCheckError(cudaEventElapsedTime(&amp;elapsedTime, start, stop));\n\n        std::cout &lt;&lt; \"Block size: \" &lt;&lt; blockSize &lt;&lt; \", elapsed time: \" &lt;&lt;\n            elapsedTime &lt;&lt; \" ms\" &lt;&lt; std::endl;\n\n        cudaCheckError(cudaEventDestroy(start));\n        cudaCheckError(cudaEventDestroy(stop));\n    }\n\n    cudaCheckError(cudaFree(data));\n    cudaCheckError(cudaFree(out));\n}<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Results<\/h2>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"605\" height=\"340\" src=\"https:\/\/new.nestlogic.com\/wp-content\/uploads\/2019\/02\/image-2.png\" alt=\"\" class=\"wp-image-121\" srcset=\"https:\/\/nestlogic.com\/wp-content\/uploads\/2019\/02\/image-2.png 605w, https:\/\/nestlogic.com\/wp-content\/uploads\/2019\/02\/image-2-300x169.png 300w\" sizes=\"auto, (max-width: 605px) 100vw, 605px\" \/><\/figure>\n\n\n\n<p>As we can see, starting from a block size of 32 the execution time curve saturates, and there are no further performance improvements even as we move towards the maximum number of threads per block.<\/p>\n\n\n\n<p>It is noticeable that the saturation point is exactly the warp size (32 threads). 
As this particular toy kernel does not use shared memory, there is no performance improvement from growing the block size further.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>Simply do not forget that your block size should be <em>at least<\/em> the warp size. And if you use shared memory in your kernel, maximizing the block size seems to be the best way to take full advantage of fast shared memory.<\/p>\n","protected":false},"excerpt":{"rendered":"Recently I was working on an algorithm implementation using NVIDIA CUDA. For testing purposes I used a tiny toy data sample to check whether the algorithm worked as expected. I was&#8230;","protected":false},"author":2,"featured_media":170,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[23,28],"class_list":["post-82","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized","tag-cuda","tag-nvidia"],"acf":[],"_links":{"self":[{"href":"https:\/\/nestlogic.com\/index.php\/wp-json\/wp\/v2\/posts\/82","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nestlogic.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nestlogic.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nestlogic.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/nestlogic.com\/index.php\/wp-json\/wp\/v2\/comments?post=82"}],"version-history":[{"count":9,"href":"https:\/\/nestlogic.com\/index.php\/wp-json\/wp\/v2\/posts\/82\/revisions"}],"predecessor-version":[{"id":134,"href":"https:\/\/nestlogic.com\/index.php\/wp-json\/wp\/v2\/posts\/82\/revisions\/134"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/nestlogic.com\/index.php\/wp-json\/wp\/v2\/media\/170"}],"wp:attachment":[{"href":"https:\/\/nestlogic.com\/index.php\/wp-jso
n\/wp\/v2\/media?parent=82"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nestlogic.com\/index.php\/wp-json\/wp\/v2\/categories?post=82"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nestlogic.com\/index.php\/wp-json\/wp\/v2\/tags?post=82"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}