{"id":135,"date":"2019-02-11T13:18:47","date_gmt":"2019-02-11T13:18:47","guid":{"rendered":"https:\/\/new.nestlogic.com\/?p=135"},"modified":"2019-02-22T14:47:16","modified_gmt":"2019-02-22T14:47:16","slug":"gpu-utilization-with-neural-networks","status":"publish","type":"post","link":"https:\/\/nestlogic.com\/index.php\/2019\/02\/11\/gpu-utilization-with-neural-networks\/","title":{"rendered":"GPU  utilization with neural networks"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Abstract<\/h2>\n\n\n\n<p> Nowadays GPUs are widely used for neural networks training and inference. It\u2019s clear that GPUs are faster than CPU, but how much and do they do their best on such tasks. In this article we\u2019re testing performance of the basic neural network training operation\u2014matrix-vector multiplication using basic and kind of top GPUs, AWS p2.xlarge instance concretely, to see whether they are doing well in such operations. We compare different approaches, such as usage of ready-to-use computation frameworks such as TensorFlow, cuBLAS as well as handwritten code using CUDA. We\u2019re figured out that on basic hardware, such as NVIDIA GeForce 840M installed on my laptop, speedup is not so significant compared to CPU, but NVIDIA K80 card gives quite a good speedup. However there was found that GPU computational facilities is not fully exploited on such operations and resulting performance is not even close to the maximum. <\/p>\n\n\n\n<h2 class=\"wp-block-heading\"> Intro<\/h2>\n\n\n\n<p>Recently we decided to try RNN in our project and started investigation in this direction. I used high-level machine learning framework Keras for this purposes. After first try, even though I use GPU for training (TensorFlow backend for Keras), it was quite a long time to fit the network for dozens of epochs on quite a tiny dataset. 
All these computations were done on my GPU-enabled laptop with an NVIDIA GeForce 840M card\u2014not the best choice, but convenient for a first try.<\/p>\n\n\n\n<p>Then I tried to fit a network with the same architecture and dataset on Amazon\u2019s p2.xlarge instance with an NVIDIA K80 on board. It was three times faster\u2014but only three times.<\/p>\n\n\n\n<p>So the natural question was: if we plan to run such networks on large amounts of data, what is our plan for scaling this process in terms of time and expense? How do we decide which is cheaper and more effective: a cluster of AWS m4.16xlarge instances with 64 vCPUs each, or a single p2.16xlarge instance with 16 GPUs on board?<\/p>\n\n\n\n<p>So the plan is to measure performance in Flops on matrix-by-vector multiplication\u2014the basic operation in neural network fitting and prediction.<\/p>\n\n\n\n<p>Our goal is to determine whether the utilization of the GPU hardware is close to optimal.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Measurements<\/h2>\n\n\n\n<p>First of all, I wrote a set of little programs to measure performance in Gflops for this basic neural network operation. 
So here\u2019s the set of tests for matrix-by-vector multiplication I used:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>cuBLAS\u2019s cublasSgemv() function call;<\/li><li>a simple CUDA kernel;<\/li><li>a CPU implementation in a single thread;<\/li><li>a CPU implementation in multiple threads;<\/li><li>an implementation using TensorFlow and Python;<\/li><li>Python + Numpy.<\/li><\/ul>\n\n\n\n<p>Each test implementation takes one parameter, the size N, and performs the multiplication of a matrix \u211d<sup>N\u00d7N<\/sup> by a vector \u211d<sup>N<\/sup>. I expect 2\u22c5N<sup>2<\/sup> floating point operations for this task\u2014N<sup>2<\/sup> multiplications and N<sup>2<\/sup> additions.<\/p>\n\n\n\n<p>Performance measurements for cuBLAS and CUDA were done by asynchronously putting kernels into a stream; the start and end of execution were marked with CUDA events to get a precise evaluation of the time spent on kernel execution.<\/p>\n\n\n\n<p>Here are the graphs I got running these tests on my laptop and on the p2.xlarge AWS instance. They show the relationship between N\u2014the size mentioned above\u2014and the achieved performance in Gflops. 
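<\/p>\n\n\n\n<p>To make the flop accounting concrete, here is a minimal Python sketch of the same measurement idea (a hypothetical illustration with made-up names, not the original test code): it times a NumPy matrix-by-vector product and converts the elapsed time to Gflops using the 2\u22c5N<sup>2<\/sup> operation count.<\/p>

```python
import time
import numpy as np

def matvec_gflops(n, repeats=10):
    """Time an n-by-n matrix times n-vector product and report Gflops.

    Assumes 2*n^2 floating point operations per product:
    n^2 multiplications plus n^2 additions.
    """
    a = np.random.rand(n, n).astype(np.float32)
    x = np.random.rand(n).astype(np.float32)
    start = time.perf_counter()
    for _ in range(repeats):
        a @ x  # the operation under test
    elapsed = time.perf_counter() - start
    return 2.0 * n * n * repeats / elapsed / 1e9

print(matvec_gflops(4096))
```

<p>The GPU variants follow the same pattern, except that timing is done with CUDA events around the kernel launches. 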
Random fill operations and memory transfers are not included in the benchmark, so it measures pure computation time.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><a href=\"http:\/\/www.nestlogic.com\/wp-content\/uploads\/2017\/04\/plot.thinkpad_yoga15-1.png\"><img decoding=\"async\" src=\"http:\/\/www.nestlogic.com\/wp-content\/uploads\/2017\/04\/plot.thinkpad_yoga15-1-1024x576.png\" alt=\"ThinkPad Yoga 15 with NVIDIA GeForce 840M onboard\" class=\"wp-image-2569\"\/><\/a><figcaption>ThinkPad Yoga 15 with NVIDIA GeForce 840M onboard<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><a href=\"http:\/\/www.nestlogic.com\/wp-content\/uploads\/2017\/04\/plot.p2.xlarge-1.png\"><img decoding=\"async\" src=\"http:\/\/www.nestlogic.com\/wp-content\/uploads\/2017\/04\/plot.p2.xlarge-1-1024x576.png\" alt=\"AWS p2.xlarge instance with NVIDIA K80 onboard\" class=\"wp-image-2567\"\/><\/a><figcaption>AWS p2.xlarge instance with NVIDIA K80 onboard<\/figcaption><\/figure>\n\n\n\n<p>I faced problems running TensorFlow on the AWS instance, so it is omitted from that graph.<\/p>\n\n\n\n<p>It also looks like the CUDA implementation\u2019s performance was not saturated, so I decided to find where it saturates, running only the GPU tests to save time:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><a href=\"http:\/\/www.nestlogic.com\/wp-content\/uploads\/2017\/04\/plot.p2.xlarge_saturated-1.png\"><img decoding=\"async\" src=\"http:\/\/www.nestlogic.com\/wp-content\/uploads\/2017\/04\/plot.p2.xlarge_saturated-1-1024x576.png\" alt=\"AWS p2.xlarge instance GPU tasks saturation\" class=\"wp-image-2568\"\/><\/a><figcaption>AWS p2.xlarge instance GPU tasks saturation<\/figcaption><\/figure>\n\n\n\n<p>A strange thing about these graphs is that the Numpy dot product on the AWS instance is much slower than the single-threaded CPU implementation. I did not dig deep enough in this direction to explain why. 
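<\/p>\n\n\n\n<p>One plausible explanation\u2014an assumption I did not verify\u2014is that NumPy on that instance was not linked against an optimized BLAS. This is easy to check:<\/p>

```python
import numpy as np

# Prints the BLAS/LAPACK configuration NumPy was built against;
# an unoptimized reference BLAS would explain a slow dot product.
np.show_config()
```

<p>If this shows no optimized BLAS, rebuilding or reinstalling NumPy against OpenBLAS or MKL should restore single-threaded dot-product performance.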
<\/p>\n\n\n\n<p style=\"text-align:center\">Performance in Gflops at N=16384 for my laptop and the AWS p2.xlarge instance. Both GPUs reach only about 1% of their declared peak performance.<\/p>\n\n\n\n<table class=\"wp-block-table\"><tbody><tr><th>Test name<\/th><th>ThinkPad Yoga 15 with NVIDIA GeForce 840M onboard, Gflops<\/th><th>AWS p2.xlarge instance with NVIDIA K80 onboard, Gflops<\/th><\/tr><tr><td>cuBLAS&#8217;s cublasSgemv<\/td><td>7.65234<\/td><td>76.4771<\/td><\/tr><tr><td>CUDA kernel<\/td><td>7.59958<\/td><td>52.6853<\/td><\/tr><tr><td>CPU single thread<\/td><td>1.70727<\/td><td>3.60644<\/td><\/tr><tr><td>CPU multiple threads<\/td><td>5.24184<\/td><td>8.70136<\/td><\/tr><tr><td>Numpy dot product<\/td><td>1.50101<\/td><td>0.12011<\/td><\/tr><tr><td>TensorFlow multiplication<\/td><td>5.12266<\/td><td>&#8212;<\/td><\/tr><\/tbody><\/table>\n\n\n\n<p>What we see in general is that there is no huge performance boost from running matrix-vector multiplications on the GPU instead of the CPU, but it is still significant\u2014\u223d10 times faster on the AWS instance and \u223d1.5 times on my laptop (which looks odd, too).<\/p>\n\n\n\n<p>Finally, I compared the achieved performance with the theoretical maximum for both cards, and there is a huge gap between them. The theoretical peak for the NVIDIA K80 is 5591\u20138736 Gflops; for the NVIDIA GeForce 840M it is 790.3 Gflops.<\/p>\n\n\n\n<p>This was somewhat discouraging: even understanding that these are theoretical maximums, the real-world numbers are nearly 100 times lower\u2014about 1% of the declared maximum.<\/p>\n\n\n\n<p>So I decided to write a dummy CUDA kernel that performs a lot of computation with a minimum of reads and writes relative to the number of arithmetic operations. 
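<\/p>\n\n\n\n<p>A back-of-the-envelope roofline estimate explains why arithmetic intensity is the key here. The sketch below is mine, and the peak and bandwidth figures in it are assumed published specs for half a K80, not measurements:<\/p>

```python
def roofline_gflops(n, peak_gflops, mem_bw_gbs):
    """Upper bound on matrix-vector Gflops from a simple roofline model.

    A float32 n-by-n matrix-vector product does 2*n^2 flops but must
    read at least the 4*n^2-byte matrix from global memory once, so
    its arithmetic intensity is only 0.5 flop/byte.
    """
    flops = 2.0 * n * n
    bytes_moved = 4.0 * n * n
    intensity = flops / bytes_moved  # flop per byte
    return min(peak_gflops, intensity * mem_bw_gbs)

# Assumed figures for half a K80: ~4370 Gflops FP32 peak, ~240 GB/s.
print(roofline_gflops(16384, 4370.0, 240.0))  # prints 120.0
```

<p>Under these assumptions the model caps matrix-vector multiplication at roughly 120 Gflops regardless of compute peak\u2014the same ballpark as the \u223d76 Gflops measured above\u2014while a compute-bound kernel avoids the cap entirely.<\/p>\n\n\n\n<p>Back to the dummy kernel. 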
It gave me the numbers I was looking for:<\/p>\n\n\n\n<p style=\"text-align:center\">Maximum achieved Gflops<\/p>\n\n\n\n<table class=\"wp-block-table\"><tbody><tr><th>Device<\/th><th>Achieved performance, Gflops<\/th><th>Theoretical maximum, Gflops<\/th><\/tr><tr><td>NVIDIA K80<\/td><td>3982.4<\/td><td>5591\u20138736<\/td><\/tr><tr><td>NVIDIA GeForce 840M<\/td><td>793.187<\/td><td>790.3<\/td><\/tr><\/tbody><\/table>\n\n\n\n<p>This looks much better, but it means that we can achieve such performance only with a huge number of arithmetic operations relative to the number of reads and writes from\/to global device memory. (The GeForce 840M result slightly exceeding the official figure is likely a matter of boost clocks.)<\/p>\n\n\n\n<p>Note also that an AWS p2.xlarge instance exposes only half of an NVIDIA K80, so the number in the table is quite natural.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusions<\/h2>\n\n\n\n<p>We suspect that matrix-vector multiplication is limited not by the computational power of the GPU, but by memory latency and bandwidth. It looks like using GPUs for machine learning is not the ideal solution, but it is the best widely available one today. It also looks like we are not the first to face this problem: Google came up with its own <a href=\"https:\/\/en.wikipedia.org\/wiki\/Tensor_processing_unit\">Tensor Processing Unit<\/a>, specialized hardware for machine learning tasks with more power and better computational efficiency. More technical details can be found <a href=\"https:\/\/drive.google.com\/file\/d\/0Bx4hafXDDq2EMzRNcy1vSUxtcEk\/view\">here<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"Nowadays GPUs are widely used for neural network training and inference. 
It\u2019s clear that GPUs are faster than CPUs, but by how much, and do they perform at their best on such tasks?","protected":false},"author":2,"featured_media":169,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[29,30],"class_list":["post-135","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized","tag-gpu","tag-neural-networks"],"acf":[],"_links":{"self":[{"href":"https:\/\/nestlogic.com\/index.php\/wp-json\/wp\/v2\/posts\/135","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nestlogic.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nestlogic.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nestlogic.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/nestlogic.com\/index.php\/wp-json\/wp\/v2\/comments?post=135"}],"version-history":[{"count":2,"href":"https:\/\/nestlogic.com\/index.php\/wp-json\/wp\/v2\/posts\/135\/revisions"}],"predecessor-version":[{"id":149,"href":"https:\/\/nestlogic.com\/index.php\/wp-json\/wp\/v2\/posts\/135\/revisions\/149"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/nestlogic.com\/index.php\/wp-json\/wp\/v2\/media\/169"}],"wp:attachment":[{"href":"https:\/\/nestlogic.com\/index.php\/wp-json\/wp\/v2\/media?parent=135"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nestlogic.com\/index.php\/wp-json\/wp\/v2\/categories?post=135"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nestlogic.com\/index.php\/wp-json\/wp\/v2\/tags?post=135"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}