{"id":1135,"date":"2012-02-28T16:49:18","date_gmt":"2012-02-28T21:49:18","guid":{"rendered":"http:\/\/codinggorilla.domemtech.com\/?p=1135"},"modified":"2012-03-03T10:09:28","modified_gmt":"2012-03-03T15:09:28","slug":"performance-comparison-of-cuda-opencl-and-c-amp","status":"publish","type":"post","link":"http:\/\/165.227.223.229\/index.php\/2012\/02\/28\/performance-comparison-of-cuda-opencl-and-c-amp\/","title":{"rendered":"Performance comparison of CUDA, OpenCL, and C++ AMP"},"content":{"rendered":"<p style=\"text-align: justify;\">Getting information about the underlying design of a GPGPU programming language environment and its hardware can be difficult. Companies will not publish design information because they do not want you or other companies to copy the technology. But sometimes you need to know details of a technology that are just not published in order to use it effectively. If they won&#8217;t tell you how the technology works, the only recourse to gain an understanding is experimentation [1, 2]. What is the performance of OpenCL, CUDA, and C++ AMP? What can we learn from this information?<\/p>\n<p><!--more--><\/p>\n<h1 style=\"text-align: justify;\">Introduction<\/h1>\n<p style=\"text-align: justify;\">The performance of OpenCL, CUDA, and C++ AMP is not easy to compare because there is no one-to-one correspondence between the APIs. For example, there is no direct analog of the OpenCL function clSetKernelArg in CUDA. In fact, clSetKernelArg can be called once and the kernel then called multiple times. In CUDA, this is impossible because the kernel arguments are set when the kernel is called: a kernel call is translated into the internal functions cudaSetupArgument and cudaLaunch.<\/p>\n<p style=\"text-align: justify;\">Moreover, each GPGPU programming language environment performs initialization when a function of the API is called, and this initialization is order dependent. 
For example, the first call to the CUDA API could be a cudaMalloc or a cudaSetDevice, depending on whether the programmer accepts the default device. However, whichever of these functions is called first, a long wait follows because CUDA is initializing. Still, a fair comparison is possible if the developer implements the solution carefully.<\/p>\n<p style=\"text-align: justify;\">As it turns out, a block of code in a program written in one language can usually be translated into equivalent code in another GPGPU programming language. For example, the code to find and select a device in CUDA has an equivalent block of code in OpenCL. These blocks of code are called phases. What are the phases of a program? This depends, but one possible definition follows:<\/p>\n<ol>\n<li>Set up resources on the GPU for problem solving, e.g., pick the GPU, allocate memory on the GPU, and compile kernel code for the GPU.<\/li>\n<li>Copy data structures from the CPU to the GPU.<\/li>\n<li>Call the kernel: run the kernel code to solve the problem on the GPU.<\/li>\n<li>Copy data structures from the GPU to the CPU.<\/li>\n<li>Release resources on the GPU.<\/li>\n<\/ol>\n<div>The functions of each language can be grouped into these phases:<\/div>\n<table width=\"100%\" border=\"1\" cellspacing=\"5\" cellpadding=\"5\">\n<tbody>\n<tr>\n<th><\/th>\n<th>CUDA<\/th>\n<th>OpenCL<\/th>\n<th>C++ AMP<\/th>\n<\/tr>\n<tr>\n<td>Setup<\/td>\n<td>cudaGetDeviceCount<br \/>\ncudaGetDeviceProperties<br \/>\ncudaSetDevice<br \/>\ncudaMalloc<\/td>\n<td>clGetPlatformIDs<br \/>\nclGetPlatformInfo<br \/>\nclGetDeviceIDs<br \/>\nclGetDeviceInfo<br \/>\nclCreateContext<br \/>\nclCreateProgramWithSource<br \/>\nclBuildProgram<br \/>\nclGetProgramBuildInfo<br \/>\nclCreateKernel<br \/>\nclCreateBuffer<br \/>\nclSetKernelArg<br \/>\nclCreateCommandQueue<\/td>\n<td>allocate vector&lt;accelerator&gt;<br \/>\nget_accelerators()<br \/>\nallocate vector::iterator<br 
\/>\naccelerators.begin()<br \/>\naccelerators.end()<br \/>\nget_description()<br \/>\nallocate array&lt;&gt;<br \/>\nallocate extent&lt;&gt;<br \/>\nallocate grid&lt;&gt;<\/td>\n<\/tr>\n<tr>\n<td>Copy to GPU<\/td>\n<td>cudaMemcpy<\/td>\n<td>clEnqueueWriteBuffer<\/td>\n<td>copy<\/td>\n<\/tr>\n<tr>\n<td>Call kernel<\/td>\n<td>kernel&lt;&lt;&lt;&#8230;&gt;&gt;&gt;(&#8230;)<br \/>\ncudaDeviceSynchronize<\/td>\n<td>clEnqueueNDRangeKernel<br \/>\nclWaitForEvents<\/td>\n<td>parallel_for_each<br \/>\nflush<br \/>\nwait<\/td>\n<\/tr>\n<tr>\n<td>Copy to CPU<\/td>\n<td>cudaMemcpy<\/td>\n<td>clEnqueueReadBuffer<\/td>\n<td>copy<\/td>\n<\/tr>\n<tr>\n<td>Release<\/td>\n<td>cudaFree<\/td>\n<td>clReleaseMemObject<br \/>\nclReleaseKernel<br \/>\nclReleaseContext<br \/>\nclReleaseProgram<\/td>\n<td>deallocate grid&lt;&gt;<br \/>\ndeallocate extent&lt;&gt;<br \/>\ndeallocate array&lt;&gt;<br \/>\ndeallocate accelerator_view<br \/>\ndeallocate vector&lt;accelerator&gt;<br \/>\ndeallocate vector::iterator<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h1 style=\"text-align: justify;\">Gathering run-time data<\/h1>\n<p style=\"text-align: justify;\">How do we collect the run-times for each phase? There are a few ways to do this:<\/p>\n<ul style=\"text-align: justify;\">\n<li>Use a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Black-box_testing\" target=\"_blank\">black-box test<\/a> to gather the run-time for the whole process. To do this, run the program within the Bourne shell using the command &#8220;time&#8221; to display the run-time. 
Unfortunately, this will not show the run-time for each phase.<\/li>\n<li>Use a <a href=\"http:\/\/en.wikipedia.org\/wiki\/White-box_testing\" target=\"_blank\">white-box test<\/a> with a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Profiling_(computer_programming)\" target=\"_blank\">profiler<\/a> to gather the run-time of each phase. However, the programmer must partition the phases into separate functions because the profiler is unaware of phases.<\/li>\n<li>Use a white-box test implemented with a &#8220;self-instrumented&#8221; program (i.e., a program in which the programmer has inserted calls to a clock function) to gather the run-time of each phase.<\/li>\n<\/ul>\n<p style=\"text-align: justify;\"><span style=\"text-align: justify;\">To make a fair comparison between environments, the problem and the solution must be the same: the same algorithm, grid\/tile size, memory usage on the GPU, etc.<\/span><\/p>\n<p><span style=\"text-align: justify;\">Let&#8217;s take a simple problem:<\/span><\/p>\n<p style=\"text-align: justify;\">input: A = array[0..n] of 1&#8217;s; integer u; integer i;<\/p>\n<p style=\"text-align: justify;\">output: O = array[0..n] of integers, where O[k] = u * i when (k % u) = 0, otherwise O[k] = 0.<\/p>\n<p style=\"text-align: justify;\">In other words, sum the 1&#8217;s in an array of <strong><em>n<\/em><\/strong> integers (all 1&#8217;s), <strong><em>i<\/em><\/strong> times, with a block size of <em><strong>u<\/strong><\/em>, and place the sum at the beginning of each block. For example, given an array of 16 integers (n = 16), u = 4, and i = 4, O[0] = O[4] = O[8] = O[12] = 16, and O[1] = O[2] = O[3] = O[5] = &#8230; = 0 (for all other indices). 
Other parameters tweak the problem: <strong><em>x<\/em><\/strong>, the number of times the entire problem is solved, and <strong><em>s<\/em><\/strong>, the stride between the elements summed.<\/p>\n<p style=\"text-align: justify;\">The solution to this problem is <a href=\"http:\/\/codinggorilla.domemtech.com\/code\/global-perf.new.zip\" target=\"_blank\">here<\/a>. The default parameters for the problem are &#8220;-u 4 -i 1 -x 1000 -n 8388608 -s 1&#8221;. The grid size is 256 threads.<\/p>\n<h1 style=\"text-align: justify;\">Test environment<\/h1>\n<p style=\"text-align: justify;\">This program was executed on two machines:<\/p>\n<ul>\n<li style=\"text-align: justify;\">Asus P5N-D motherboard, Intel Q6600 @ 2.51 GHz (overclocked), 4 GB DDR2-1066 @ 838 MHz, Windows 7 64-bit OS, an NVIDIA GeForce GTX 470, and an ATI Radeon HD 6450. A list of drivers and versions is <a href=\"http:\/\/codinggorilla.domemtech.com\/code\/report.html\" target=\"_blank\">here<\/a>.<\/li>\n<li style=\"text-align: justify;\">Gigabyte GA-A75-UD4H motherboard, AMD A8-3850 @ 2.9 GHz, 8 GB DDR3-1600 @ 800 MHz, Windows 8 64-bit OS (Developer Preview).<\/li>\n<\/ul>\n<h1>Results<\/h1>\n<h3>Setup time before copy to GPU<\/h3>\n<p style=\"text-align: justify;\">The time required to compile the OpenCL program at run time accounts for the large setup time for GPUs in the OpenCL environment. The other environments do not have this cost because their kernel code is pre-compiled to an intermediate language.<\/p>\n<table>\n<tbody>\n<tr>\n<th>CUDA (ms)<\/th>\n<th>OpenCL (ms)<\/th>\n<th>C++ AMP (ms)<\/th>\n<\/tr>\n<tr>\n<td>38.7 \u00b1 0.4<\/td>\n<td>424 \u00b1 12<\/td>\n<td>76.8 \u00b1 1.9<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>(Run-times on the Intel machine; mean \u00b1 S.E. for 5 runs.)<\/p>\n<h3>Copying data between GPU and CPU<\/h3>\n<p>The run-times for copying data from the CPU to the GPU are similar for CUDA, OpenCL, and C++ AMP. 
However, copying data from the GPU to the CPU is slower in every environment, with C++ AMP much slower than the other two.<\/p>\n<table>\n<tbody>\n<tr>\n<th>TO\/FROM GPU<\/th>\n<th>CUDA (ms)<\/th>\n<th>OpenCL (ms)<\/th>\n<th>C++ AMP (ms)<\/th>\n<\/tr>\n<tr>\n<td>To<\/td>\n<td>47.1 \u00b1 1.4<\/td>\n<td>58.2 \u00b1 5.7<\/td>\n<td>64.7 \u00b1 1.1<\/td>\n<\/tr>\n<tr>\n<td>From<\/td>\n<td>72.5 \u00b1 1.9<\/td>\n<td>74.2 \u00b1 2.7<\/td>\n<td>121 \u00b1 2<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>(Run-times on the Intel machine; mean \u00b1 S.E. for 5 runs.)<\/p>\n<h3>Kernel overhead<\/h3>\n<p>On both machines, the run-time for a loop of a varying number of kernel calls is fairly similar in CUDA, OpenCL, and C++ AMP, with C++ AMP slightly slower than the other two. However, for small problems, C++ AMP shows an overhead of around 70 ms associated with the first call to the kernel, whereas this overhead is negligible in the other two environments.<\/p>\n<table>\n<tbody>\n<tr>\n<th>PROBLEM PARAMETERS<\/th>\n<th>CUDA (ms)<\/th>\n<th>OpenCL (ms)<\/th>\n<th>C++ AMP (ms)<\/th>\n<\/tr>\n<tr>\n<td>-i 1 -x 1<\/td>\n<td>0.936 \u00b1 0.005<\/td>\n<td>0.944 \u00b1 0.017<\/td>\n<td>71.6 \u00b1 3.7<\/td>\n<\/tr>\n<tr>\n<td>-i 1 -x 10<\/td>\n<td>6.34 \u00b1 0.01<\/td>\n<td>6.46 \u00b1 0.01<\/td>\n<td>76.35 \u00b1 0.61<\/td>\n<\/tr>\n<tr>\n<td>-i 1 -x 100<\/td>\n<td>60.5 \u00b1 0.1<\/td>\n<td>60.5 \u00b1 0.1<\/td>\n<td>145 \u00b1 1<\/td>\n<\/tr>\n<tr>\n<td>-i 1 -x 1000<\/td>\n<td>602 \u00b1 0.1<\/td>\n<td>601 \u00b1 0.3<\/td>\n<td>713 \u00b1 1<\/td>\n<\/tr>\n<tr>\n<td>-i 1 -x 10000<\/td>\n<td>6022 \u00b1 0.4<\/td>\n<td>6005 \u00b1 1<\/td>\n<td>6378 \u00b1 3<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>(Run-times 
on the Intel machine; mean \u00b1 S.E. for 5 runs.)<\/p>\n<h3>Object code overhead<\/h3>\n<p style=\"text-align: justify;\">On both machines, the run-time for one kernel call containing a loop of a varying number of iterations is again fairly similar in CUDA, OpenCL, and C++ AMP. CUDA and OpenCL perform the kernel call with a loop of 10000 iterations in around 2.2 s, while C++ AMP is noticeably slower at 3.7 s. These data suggest that the object code generated for C++ AMP does not perform as well as that for OpenCL or CUDA.<\/p>\n<table>\n<tbody>\n<tr>\n<th>PROBLEM PARAMETERS<\/th>\n<th>CUDA (ms)<\/th>\n<th>OpenCL (ms)<\/th>\n<th>C++ AMP (ms)<\/th>\n<\/tr>\n<tr>\n<td>-i 1 -x 1<\/td>\n<td>0.918 \u00b1 0.010<\/td>\n<td>0.946 \u00b1 0.010<\/td>\n<td>71.6 \u00b1 3.7<\/td>\n<\/tr>\n<tr>\n<td>-i 10 -x 10<\/td>\n<td>2.87 \u00b1 0.01<\/td>\n<td>2.87 \u00b1 0.02<\/td>\n<td>70.7 \u00b1 0.5<\/td>\n<\/tr>\n<tr>\n<td>-i 100 -x 100<\/td>\n<td>22.5 \u00b1 0.02<\/td>\n<td>22.3 \u00b1 0.03<\/td>\n<td>103 \u00b1 0.3<\/td>\n<\/tr>\n<tr>\n<td>-i 1000 -x 1000<\/td>\n<td>220 \u00b1 0.1<\/td>\n<td>219 \u00b1 0.1<\/td>\n<td>428 \u00b1 1<\/td>\n<\/tr>\n<tr>\n<td>-i 10000 -x 10000<\/td>\n<td>2195 \u00b1 0.4<\/td>\n<td>2188 \u00b1 0.1<\/td>\n<td>3671 \u00b1 1<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>(Run-times on the Intel machine; mean \u00b1 S.E. for 5 runs.)<\/p>\n<h3>Performance comparison between NVIDIA 470 and AMD Llano APU<\/h3>\n<p style=\"text-align: justify;\">The relative performance of the GPUs is NVIDIA 470 &gt; AMD Llano APU &gt;&gt; ATI Radeon HD 6450. 
There is no surprise there, considering their costs.<\/p>\n<table>\n<tbody>\n<tr>\n<th>NVIDIA 470 (s)<\/th>\n<th>AMD Llano GPU (s)<\/th>\n<th>ATI Radeon HD 6450 (s)<\/th>\n<\/tr>\n<tr>\n<td>1.41 \u00b1 0.01<\/td>\n<td>8.62 \u00b1 0.02<\/td>\n<td>37.1 \u00b1 0.01<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>(Run-times on the Intel and AMD machines. The program ran with the standard parameters plus &#8220;-l opencl -q&#8221;. Mean \u00b1 S.E. for 5 runs.)<\/p>\n<h1>Discussion and conclusion<\/h1>\n<p style=\"text-align: justify;\">In order to write efficient code in GPU programming environments, developers need to understand the API. Program design requires information that may not be published or that, at best, appears only in developer forums (e.g., http:\/\/forums.nvidia.com and http:\/\/forums.amd.com). Experimentation helps identify the information the programmer needs for good design.<\/p>\n<h1>References<\/h1>\n<p style=\"text-align: justify;\">1) Wong, H., M. Papadopoulou, et al. (2010). &#8220;Demystifying GPU microarchitecture through microbenchmarking.&#8221; IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2010. IEEE.<\/p>\n<p style=\"text-align: justify;\">2) Zhang, Y., L. Peng, et al. &#8220;Architecture Comparisons between Nvidia and ATI GPUs: Computation Parallelism and Data Communications.&#8221;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Getting information about the underlying design of a GPGPU programming language environment and its hardware can be difficult. Companies will not publish design information because they do not want you or other companies to copy the technology. 
But sometimes you need to know details of a technology that are just not published in order &hellip; <\/p>\n<p class=\"link-more\"><a href=\"http:\/\/165.227.223.229\/index.php\/2012\/02\/28\/performance-comparison-of-cuda-opencl-and-c-amp\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Performance comparison of CUDA, OpenCL, and C++ AMP&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[],"tags":[],"_links":{"self":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts\/1135"}],"collection":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/comments?post=1135"}],"version-history":[{"count":0,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts\/1135\/revisions"}],"wp:attachment":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/media?parent=1135"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/categories?post=1135"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/tags?post=1135"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}