{"id":669,"date":"2011-07-28T09:49:28","date_gmt":"2011-07-28T16:49:28","guid":{"rendered":"http:\/\/domemtech.com\/?p=669"},"modified":"2011-08-05T04:12:19","modified_gmt":"2011-08-05T11:12:19","slug":"opencl-vs-cuda","status":"publish","type":"post","link":"http:\/\/165.227.223.229\/index.php\/2011\/07\/28\/opencl-vs-cuda\/","title":{"rendered":"OpenCL vs. CUDA"},"content":{"rendered":"<p style=\"text-align: justify;\">Today&#39;s processors have undergone a huge transformation from those of just&nbsp;10 years ago. &nbsp;CPU manufacturers Intel and AMD (and up-and-coming CPU designer ARM) have increased processor speed via greater emphasis on superscalar execution, deeper pipelining, branch prediction, out-of-order execution, and fast multi-level caches. &nbsp;This design philosophy has resulted in faster response time for single tasks executing on a processor, but at the expense of increased circuit complexity, high power consumption, and a small number of cores on the die. &nbsp;On the other hand, GPU manufacturers NVIDIA and ATI have&nbsp;focused their designs on processors with many simple cores that implement SIMD parallelism, which hides the latency of instruction execution <a href=\"#REF1\">[1]<\/a>.<\/p>\n<p style=\"text-align: justify;\">While GPUs have been in existence for about 10&nbsp;years, the software support for these processors has taken&nbsp;years to catch up.&nbsp; Software developers are still sifting through solutions for programming these processors. &nbsp;OpenCL and CUDA are frameworks for GPGPU computing. &nbsp;Each framework comprises a language for expressing kernel code (instructions that run on a GPU) and an API for calling kernels (from the CPU). &nbsp;While the frameworks are similar, there are some important differences.<\/p>\n<div style=\"text-align: justify;\">CUDA is a proprietary framework. It is not open source, and all changes to the language and API are made by NVIDIA. 
However, some third-party tools have been built around the framework, and it does seem to have a large following in academia. &nbsp;Unfortunately, CUDA only runs on NVIDIA devices. &nbsp;While it should be possible to run CUDA code on other platforms using <a href=\"http:\/\/code.google.com\/p\/gpuocelot\/\">Ocelot<\/a>, this only works on Linux systems.<\/div>\n<div>&nbsp;<\/div>\n<div style=\"text-align: justify;\">OpenCL is a standardized framework, and&nbsp;is starting to gain&nbsp;popularity.&nbsp; Similar to NVIDIA&#39;s CUDA C++, OpenCL allows programmers to use the massive parallel computing power of&nbsp;GPUs for general-purpose computing.&nbsp;&nbsp;Unlike CUDA, OpenCL works on any supported&nbsp;GPU or CPU,&nbsp;including Intel, AMD, NVIDIA, IBM, and&nbsp;ARM processors.&nbsp;<\/div>\n<div style=\"text-align: justify;\">&nbsp;<\/div>\n<div style=\"text-align: justify;\">Does OpenCL make programming multiple platforms easier? &nbsp;Is it as fast as CUDA, or does it sacrifice speed for diverse platform support?<\/div>\n<p><!--more--><\/p>\n<p style=\"text-align: justify;\">To answer these questions, I decided to&nbsp;solve a problem using CUDA (both the Driver API and the Runtime API) and OpenCL, and then compare the solutions with regard to the&nbsp;language, API,&nbsp;development tools, and the frequency and stability of SDK releases.<\/p>\n<h2 style=\"text-align: justify;\">Methods<\/h2>\n<p style=\"text-align: justify;\">The problem involves estimating the value of&nbsp;<a href=\"http:\/\/en.wikipedia.org\/wiki\/Pi\"><span class=\"texhtml\" style=\"line-height: 1em; font-family: &quot;Times New Roman&quot;, serif; font-size: 15px; white-space: nowrap;\">&pi;<\/span><\/a>&nbsp;using&nbsp;<a href=\"http:\/\/en.wikipedia.org\/wiki\/Numerical_integration\">numerical integration<\/a> of a <a href=\"http:\/\/www.proofwiki.org\/wiki\/Leibniz's_Formula_for_Pi\/Elementary_Proof\">well-known formula<\/a>&nbsp;you probably have seen in your 
college&nbsp;calculus course:<\/p>\n<h3 style=\"text-align: justify;\">Figure 1. Definition of&nbsp;<span class=\"texhtml\" style=\"line-height: 1em; font-family: &quot;Times New Roman&quot;, serif; font-size: 15px; white-space: nowrap;\">&pi;<\/span>.<\/h3>\n<p>&nbsp;<\/p>\n<p><script type=\"text\/javascript\" src=\"http:\/\/cdn.mathjax.org\/mathjax\/latest\/MathJax.js?config=TeX-AMS-MML_HTMLorMML\"><\/script><\/p>\n<p><math display=\"block\" xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mi>&pi;<\/mi> <mo>=<\/mo> <mrow> <msubsup> <mo>&int;<\/mo> <mrow> <mn>0<\/mn> <\/mrow> <mrow> <mn>1<\/mn> <\/mrow> <\/msubsup> <\/mrow> <mfrac> <mrow> <mrow> <mn>4<\/mn> <\/mrow> <\/mrow> <mrow> <mn>1<\/mn> <mo>+<\/mo> <msup> <mi>t<\/mi> <mn>2<\/mn> <\/msup> <\/mrow> <\/mfrac> <mrow> <mi mathvariant=\"normal\">d<\/mi> <mi>t<\/mi><\/mrow> <\/math>\n<\/p>\n<p>This integral can be evaluated in a number of ways, including the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Trapezoidal_rule\">composite trapezoid rule<\/a>, <a href=\"http:\/\/highered.mcgraw-hill.com\/sites\/0073401064\/information_center_view0\/\">Runge-Kutta (where it is used to approximate Simpson&#39;s Rule)<\/a>, and <a href=\"http:\/\/en.wikipedia.org\/wiki\/Monte_Carlo_integration\">Monte Carlo integration<\/a>.&nbsp; However, for this problem, I used the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Simpson's_rule\">Composite Simpson&#39;s Rule<\/a>:<\/p>\n<h3 style=\"text-align: justify;\">Figure 2. 
Composite Simpson&#39;s Rule.<\/h3>\n<p>&nbsp;<\/p>\n<p><script type=\"text\/javascript\" src=\"http:\/\/cdn.mathjax.org\/mathjax\/latest\/MathJax.js?config=TeX-AMS-MML_HTMLorMML\"><\/script><\/p>\n<p><math display=\"block\" xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"> <semantics> <mrow> <msubsup> <mo>&int;<\/mo> <mi>a<\/mi> <mi>b<\/mi> <\/msubsup> <mi>f<\/mi> <mrow> <mo>(<\/mo> <mi>x<\/mi> <mo>)<\/mo> <\/mrow> <mi>dx<\/mi> <mo>&asymp;<\/mo> <mrow> <mi>h<\/mi> <mo>\/<\/mo> <mn>3<\/mn> <\/mrow> <mrow> <mo>[<\/mo> <mrow> <mi>f<\/mi> <mrow> <mrow> <mo>(<\/mo> <msub> <mi>x<\/mi> <mn>0<\/mn> <\/msub> <mo>)<\/mo> <\/mrow> <mo>+<\/mo> <mn>2<\/mn> <\/mrow> <munderover> <mi>&sum;<\/mi> <mrow> <mrow> <mi>i<\/mi> <mo>=<\/mo> <mn>1<\/mn> <\/mrow> <\/mrow> <mrow> <mrow> <mrow> <mi>n<\/mi> <mo>\/<\/mo> <mn>2<\/mn> <\/mrow> <mo>&minus;<\/mo> <mn>1<\/mn> <\/mrow> <\/mrow> <\/munderover> <mi>f<\/mi> <mrow> <mrow> <mo>(<\/mo> <msub> <mi>x<\/mi> <mn>2i<\/mn> <\/msub> <mo>)<\/mo> <\/mrow> <mo>+<\/mo> <mn>4<\/mn> <\/mrow> <munderover> <mi>&sum;<\/mi> <mrow> <mrow> <mi>i<\/mi> <mo>=<\/mo> <mn>1<\/mn> <\/mrow> <\/mrow> <mrow> <mrow> <mi>n<\/mi> <mo>\/<\/mo> <mn>2<\/mn> <\/mrow> <\/mrow> <\/munderover> <mi>f<\/mi> <mrow> <mo>(<\/mo> <msub> <mi>x<\/mi> <mrow> <mrow> <mn>2i<\/mn> <mo>&minus;<\/mo> <mn>1<\/mn> <\/mrow> <\/mrow> <\/msub> <mo>)<\/mo> <mo>+<\/mo> <\/mrow> <mi>f<\/mi> <mrow> <mo>(<\/mo> <msub> <mi>x<\/mi> <mi>n<\/mi> <\/msub> <mo>)<\/mo> <\/mrow> <\/mrow> <mo>]<\/mo> <\/mrow> <\/mrow> <annotation encoding=\"StarMath 5.0\"> &int;_a^b f(x)dx&asymp;h\/3 [f(x_0 )+2&sum;_(i=1)^(n\/2-1) f(x_2i ) +4&sum;_(i=1)^(n\/2) f(x_(2i-1) ) + f(x_n )] <\/annotation> <\/semantics> <\/math>\n<\/p>\n<p style=\"text-align: justify;\">Implementation of the CUDA and OpenCL solutions of the Composite Simpson&#39;s Rule is not hard. &nbsp;The computation of the r.h.s. 
of the equation in Figure 2 can be done in three steps, each of which can be parallelized and computed on a GPU.<\/p>\n<ol>\n<li style=\"text-align: justify;\">Initialize an array of floating-point values a[i], where i = 0..n. &nbsp;a[i]&nbsp;= h * i where h = (1 &#8211; 0)\/n = 1\/n. &nbsp;The array a[i] holds the values of <em>x<sub>i<\/sub><\/em> in the r.h.s. of the equation in Figure 2, from 0 to 1.<\/li>\n<li style=\"text-align: justify;\">Compute an array of floating-point values fa[i], where i = 0..n.&nbsp; fa[i] = f(a[i]) where f(p) = c(i) * 4\/(1+p<sup>2<\/sup>). The coefficients c(i) = [1, 4, 2, 4, 2, 4, &#8230;, 1]. &nbsp;This computes the value of each term within the brackets in Figure 2.<\/li>\n<li style=\"text-align: justify;\">Compute s = &sum; fa[i], where i = 0..n. This computes the sum of the terms in Figure 2.<\/li>\n<\/ol>\n<p style=\"text-align: justify;\">After computing Step 3, the approximate value of <span class=\"texhtml\" style=\"line-height: 1em; font-family: &quot;Times New Roman&quot;, serif; font-size: 15px; white-space: nowrap;\">&pi;<\/span>&nbsp;is s * h \/ 3.<\/p>\n<p style=\"text-align: justify;\">Let&#39;s see how the calculation would work for n = 7. &nbsp;In Step 1, a[] = [&nbsp;0, 0.142857, 0.285714, 0.428571, 0.571429, 0.714286, 0.857143, 1]. &nbsp;In Step 2, fa[] = [4, 15.68, 7.396226, 13.51724, 6.030769, 10.59459, 4.611765, 2]. Finally, in Step 3, the sum of fa[] is&nbsp;63.8306. &nbsp;<span class=\"texhtml\" style=\"line-height: 1em; font-family: &quot;Times New Roman&quot;, serif; font-size: 15px; white-space: nowrap;\">&pi;<\/span>&nbsp;&asymp; h\/3 *&nbsp;63.8306 =&nbsp;63.8306 \/ 3 \/ 7 =&nbsp;3.039552, a rough estimate of 3.141592653589793&#8230; that improves as n increases.<\/p>\n<p style=\"text-align: justify;\">Steps 1 and 2 are <a href=\"http:\/\/en.wikipedia.org\/wiki\/Map_(higher-order_function)\">MAP<\/a>&nbsp;operations. 
In each implementation, the thread and block identifiers (local and group identifiers in OpenCL terminology) are converted into a&nbsp;contiguous range of&nbsp;integers&nbsp;within 0 .. n.&nbsp;&nbsp;<\/p>\n<p style=\"text-align: justify;\">Step 3 is a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Reduce_(higher-order_function)\">REDUCE<\/a> operation, in which a binary operator is applied pairwise across an array. &nbsp;Reduce(op, fa[]) is a well-known problem, and a naive PRAM-like solution for the GPU is&nbsp;shown below (Figure 3). &nbsp;This algorithm uses global memory, modifying the input array while computing the sum. &nbsp;The result of Reduce(+, fa[]) is contained in fa[0] at the end of the computation. The size of the array, which is n + 1, should be a power of 2. &nbsp;If it is not, the array can be padded with zeros to a size that is.<\/p>\n<div>&nbsp;<\/div>\n<table cellpadding=\"0\" cellspacing=\"0\" style=\"border: 1px dotted rgb(211, 211, 211);\" width=\"100%\">\n<tbody>\n<tr>\n<td style=\"border: 1px dotted rgb(211, 211, 211);\">\n<div>\n<h3>Figure 3. 
GPU pseudocode for Reduce, using global memory<\/h3>\n<p><em>Reduce<\/em>(&oplus;, f<em>a<\/em>[0..<em>m<\/em>-1])<em>&nbsp;<\/em>=<\/p>\n<p style=\"margin-left: 40px;\"><em>levels<\/em>&nbsp;=&nbsp;<em>log<sub>2<\/sub>&nbsp;<\/em><em>m<\/em><\/p>\n<p style=\"margin-left: 40px;\"><strong>for<\/strong>&nbsp;(<em>level&nbsp;<\/em>= 1;<em>&nbsp;level&nbsp;<\/em>&le;<em>&nbsp;levels<\/em>;<em>&nbsp;level&nbsp;<\/em>+= 1)<\/p>\n<p style=\"margin-left: 40px;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<em>step<\/em>&nbsp;= 2<em><sup>level<\/sup><\/em><\/p>\n<p style=\"margin-left: 40px;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<em>linear_to_tile<\/em>(<em>m&nbsp;<\/em>\/<em>&nbsp;step,&nbsp;<\/em>&amp;<em>tiles,&nbsp;<\/em>&amp;<em>tile_size<\/em>)<\/p>\n<p style=\"margin-left: 40px;\"><strong>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; parallel tile for&nbsp;<\/strong>(<em>d<\/em>&nbsp;=&nbsp;<em>tiles<\/em>,&nbsp;<em>t<\/em>&nbsp;=&nbsp;<em>tile_size<\/em>)<\/p>\n<p style=\"margin-left: 40px;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; int&nbsp;<em>p<\/em>&nbsp;=&nbsp;<em>tile_to_linear<\/em>(<em>d<\/em>,&nbsp;<em>t<\/em>)<\/p>\n<p style=\"margin-left: 40px;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; int&nbsp;<em>i<\/em>&nbsp;=&nbsp;<em>step<\/em>&nbsp;*&nbsp;<em>p<\/em><\/p>\n<p style=\"margin-left: 40px;\">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; f<em>a<\/em>[<em>i<\/em>] = f<em>a<\/em>[<em>i<\/em>] &oplus;&nbsp;f<em>a<\/em>[<em>i<\/em>&nbsp;+&nbsp;<em>step&nbsp;<\/em>\/ 2]<\/p>\n<p style=\"margin-left: 40px;\">return f<em>a<\/em>[0]<\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify;\">The solutions for the 
problem using CUDA Runtime API, CUDA Driver API, and OpenCL are shown in Figure 4, 5, and 6, respectively.<\/p>\n<h3>Figure&nbsp;4. Implementation of Composite Simpson&#39;s Rule for &pi;&nbsp;using the CUDA Runtime API.<\/h3>\n<pre lang=\"c\">#include &lt;stdio.h&gt;\r\n#include &lt;stdlib.h&gt;\r\n#include &lt;sys\/timeb.h&gt;\r\n#include &lt;time.h&gt;\r\n#include &lt;iostream&gt;\r\n#include &lt;iomanip&gt;\r\n\r\nvoid make_results(int * blocks, int * threads, int max_dimensionality, dim3 * tile_size, dim3 * tiles)\r\n{\r\n    tiles-&gt;x = blocks[0];\r\n    tiles-&gt;y = blocks[1];\r\n    tiles-&gt;z = blocks[2];\r\n    tile_size-&gt;x = threads[0];\r\n    tile_size-&gt;y = threads[1];\r\n    tile_size-&gt;z = threads[2];\r\n}\r\n\r\nvoid l2t(int size, dim3 * tile_size, dim3 * tiles)\r\n{\r\n    int max_dimensionality = 3;\r\n    int blocks[10];\r\n    for (int j = 0; j &lt; max_dimensionality; ++j)\r\n        blocks[j] = 1;\r\n    int max_threads[] = { 512, 64, 64};\r\n    int max_blocks[] = { 65535, 65535, 65535};\r\n    int threads[10];\r\n    for (int j = 0; j &lt; max_dimensionality; ++j)\r\n        threads[j] = 1;\r\n\r\n    int b = size \/ (max_threads[0] * max_blocks[0]);\r\n    if (b == 0)\r\n    {\r\n        int b = size \/ max_threads[0];\r\n        if (size % max_threads[0] != 0)\r\n            b++;\r\n\r\n        if (b == 1)\r\n            max_threads[0] = size;\r\n\r\n        \/\/ done. return the result.\r\n        blocks[0] = b;\r\n        threads[0] = max_threads[0];\r\n        make_results(blocks, threads, max_dimensionality, tile_size, tiles);\r\n        return;\r\n\r\n    }\r\n\r\n    int sqrt_size = sqrt((float)size \/ max_threads[0]);\r\n    sqrt_size++;\r\n\r\n    int b2 = sqrt_size \/ max_blocks[1];\r\n    if (b2 == 0)\r\n    {\r\n        int b = sqrt_size;\r\n\r\n        \/\/ done. 
return the result.\r\n        blocks[0] = blocks[1] = b;\r\n        threads[0] = max_threads[0];\r\n        make_results(blocks, threads, max_dimensionality, tile_size, tiles);\r\n        return;\r\n    }\r\n    throw;\r\n}\r\n\r\n__device__ int t2l()\r\n{\r\n    int i = (threadIdx.x\r\n         + blockDim.x * blockIdx.x\r\n         + blockDim.x * gridDim.x * blockDim.y * blockIdx.y\r\n         + blockDim.x * gridDim.x * threadIdx.y);\r\n    return i;\r\n}\r\n\r\ntemplate&lt;class T&gt;\r\n__global__ void init(T * v, int n)\r\n{\r\n    int i = t2l();\r\n\r\n    if (i &gt; n) \/\/ we want v[n] set!\r\n        return;\r\n\r\n    if (i &lt; 0)\r\n        return;\r\n\r\n    T w = 1.0 \/ n;\r\n    v[i] = i * w;\r\n}\r\n\r\ntemplate&lt;class T&gt;\r\nT func(T * a, int i, int n)\r\n{\r\n    T vv = a[i];\r\n    T w = 1.0L \/ (1 + vv * vv);\r\n    if (i == 0)\r\n        ;\r\n    else if (i == n)\r\n        ;\r\n    else if (i % 2 == 0)\r\n    {\r\n        w = w * 2;\r\n    }\r\n    else\r\n    {\r\n        w = w * 4;\r\n    }\r\n    return w;\r\n}\r\n\r\ntemplate&lt;class T&gt;\r\n__global__ void map(T * v, int n)\r\n{\r\n    int i = t2l();\r\n\r\n    if (i &gt; n) \/\/ we want v[n] set!\r\n        return;\r\n\r\n    if (i &lt; 0)\r\n        return;\r\n\r\n    T vv = v[i];\r\n\r\n    T w = 1.0 \/ (1 + vv * vv);\r\n    if (i == 0)\r\n        ;\r\n    else if (i == n)\r\n        ;\r\n    else if (i % 2 == 0)\r\n    {\r\n        w = w * 2;\r\n    }\r\n    else\r\n    {\r\n        w = w * 4;\r\n    }\r\n    v[i] = w;\r\n}\r\n\r\ntemplate&lt;class T&gt;\r\n__global__ void kernel_reduce_a(T * a, int n, int s)\r\n{\r\n    int i = t2l();\r\n    i *= s;\r\n\r\n    if (i &gt;= n)\r\n        return;\r\n\r\n    if (i &lt; 0)\r\n        return;\r\n\r\n    a[i] += a[i + s\/2];\r\n}\r\n\r\n\/\/ From http:\/\/graphics.stanford.edu\/~seander\/bithacks.html#IntegerLogObvious\r\nint flog2(int v)\r\n{\r\n    int x = 0;\r\n    while ((v = (v &gt;&gt; 1)) != 0)\r\n    {\r\n        x++;\r\n    
}\r\n    return (int)x;\r\n}\r\n\r\n\/\/ Compute the 2 ** v.\r\nint pow2(int v)\r\n{\r\n    int value = 1;\r\n    for ( ; v &gt; 0; v--)\r\n        value &lt;&lt;= 1;\r\n    return value;\r\n}\r\n\r\nvoid CHECK(cudaError_t x)\r\n{\r\n    if (x != 0)\r\n    {\r\n        std::cout &lt;&lt; &quot;CUDA error &quot; &lt;&lt; x &lt;&lt; &quot;\\n&quot;;\r\n        exit(1);\r\n    }\r\n}\r\n\r\nint main()\r\n{\r\n    int np1 = pow2(24);\r\n    int n = np1 - 1;\r\n\r\n#define TYPE float\r\n    {\r\n        struct _timeb  t1;\r\n        struct _timeb  t2;\r\n        std::cout &lt;&lt; &quot;Starting tests...\\n&quot;;\r\n        _ftime_s(&amp;t1);\r\n\r\n        TYPE * da = 0;\r\n        cudaError_t r1 = cudaMalloc(&amp;da, np1 * sizeof(TYPE));\r\n        CHECK(r1);\r\n        {\r\n            dim3 tile;\r\n            dim3 tiles;\r\n\r\n            l2t(np1, &amp;tile, &amp;tiles);\r\n\r\n            \/\/ set up xi&#39;s.\r\n            init&lt;TYPE&gt;&lt;&lt;&lt;tiles, tile&gt;&gt;&gt;(da, n);\r\n\r\n            \/\/ Map.\r\n            map&lt;TYPE&gt;&lt;&lt;&lt;tiles, tile&gt;&gt;&gt;(da, n);\r\n        }\r\n\r\n        \/\/ Reduce.\r\n        int levels = flog2(np1);\r\n        for (int level = 0; level &lt; levels; ++level)\r\n        {\r\n            int step = pow2(level+1);\r\n            int threads = np1 \/ step;\r\n\r\n            dim3 tile;\r\n            dim3 tiles;\r\n\r\n            l2t(threads, &amp;tile, &amp;tiles);\r\n\r\n            kernel_reduce_a&lt;TYPE&gt;&lt;&lt;&lt;tiles, tile&gt;&gt;&gt;(da, n, step);\r\n        }\r\n        TYPE a;\r\n        int rv3 = cudaMemcpy(&amp;a, da, sizeof(TYPE) * 1, cudaMemcpyDeviceToHost);\r\n        cudaFree(da);\r\n        TYPE w = 4.0L \/ 3 \/ n;\r\n\r\n        _ftime(&amp;t2);\r\n        std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; &quot; s.\\n&quot;;\r\n        std::cout &lt;&lt; n &lt;&lt; &quot; &quot; &lt;&lt; std::setprecision(44) &lt;&lt; a * w;\r\n      
  std::cout &lt;&lt; &quot;\\n&quot;;\r\n\r\n    }\r\n    return 0;\r\n}\r\n<\/pre>\n<h3>Figure&nbsp;5(a). Implementation of Composite Simpson&#39;s Rule for&nbsp;&pi;&nbsp;using CUDA Driver API.<\/h3>\n<pre lang=\"c\">#include &lt;stdio.h&gt;\r\n#include &lt;string.h&gt;\r\n#include &lt;cuda.h&gt;\r\n#include &lt;stdlib.h&gt;\r\n#include &lt;sys\/timeb.h&gt;\r\n#include &lt;time.h&gt;\r\n#include &lt;iostream&gt;\r\n#include &lt;iomanip&gt;\r\n\r\nvoid linear_to_tile(int size, int max_dimensionality, size_t * threads, size_t * blocks)\r\n{\r\n    for (int j = 0; j &lt; max_dimensionality; ++j)\r\n        blocks[j] = 1;\r\n    \/\/int max_threads[] = { 512, 512, 64};\r\n    int max_threads[] = { 512, 64, 64};\r\n    int max_blocks[] = { 65535, 65535, 65535};\r\n    for (int j = 0; j &lt; max_dimensionality; ++j)\r\n        threads[j] = 1;\r\n\r\n    int b = size \/ (max_threads[0] * max_blocks[0]);\r\n    if (b == 0)\r\n    {\r\n        int b = size \/ max_threads[0];\r\n        if (size % max_threads[0] != 0)\r\n            b++;\r\n\r\n        if (b == 1)\r\n            max_threads[0] = size;\r\n\r\n        \/\/ done. return the result.\r\n        blocks[0] = b;\r\n        threads[0] = max_threads[0];\r\n        return;\r\n    }\r\n\r\n    int sqrt_size = sqrt((float)size \/ max_threads[0]);\r\n    sqrt_size++;\r\n\r\n    int b2 = sqrt_size \/ max_blocks[1];\r\n    if (b2 == 0)\r\n    {\r\n        int b = sqrt_size;\r\n\r\n        \/\/ done. 
return the result.\r\n        blocks[0] = blocks[1] = b;\r\n        threads[0] = max_threads[0];\r\n        return;\r\n    }\r\n    throw;\r\n}\r\n\r\nvoid CpuInit(float * v, int n)\r\n{\r\n    float w = 1.0 \/ n;\r\n    for (int i = 0; i &lt;= n; ++i)\r\n        v[i] = i * w;\r\n}\r\n\r\nvoid CpuMap(float * v, int n)\r\n{\r\n    for (int i = 0; i &lt;= n; ++i)\r\n    {\r\n        float vv = v[i];\r\n\r\n        float w = 1.0 \/ (1 + vv * vv);\r\n        if (i == 0)\r\n            ;\r\n        else if (i == n)\r\n            ;\r\n        else if (i % 2 == 0)\r\n        {\r\n            w = w * 2;\r\n        }\r\n        else\r\n        {\r\n            w = w * 4;\r\n        }\r\n        v[i] = w;\r\n    }\r\n}\r\n\r\nvoid cpu_reduce(float * a, int n)\r\n{\r\n    \/\/ note: use double here because of accumulation of truncation error due to adding small numbers to a large number.\r\n    double total = 0;\r\n    for (int i = 0; i &lt;= n; ++i)\r\n        total += a[i];\r\n    a[0] = total;\r\n}\r\n\r\n\/\/ From http:\/\/graphics.stanford.edu\/~seander\/bithacks.html#IntegerLogObvious\r\nint flog2(int v)\r\n{\r\n    int x = 0;\r\n    while ((v = (v &gt;&gt; 1)) != 0)\r\n    {\r\n        x++;\r\n    }\r\n    return (int)x;\r\n}\r\n\r\n\/\/ Compute the 2 ** v.\r\nint pow2(int v)\r\n{\r\n    int value = 1;\r\n    for ( ; v &gt; 0; v--)\r\n        value &lt;&lt;= 1;\r\n    return value;\r\n}\r\n\r\nvoid CHECK(CUresult x)\r\n{\r\n    if (x != 0)\r\n    {\r\n        std::cout &lt;&lt; &quot;Error &quot; &lt;&lt; x &lt;&lt; &quot;\\n&quot;;\r\n        throw new std::string(&quot;Error &quot; + x);\r\n    }\r\n}\r\n\r\nvoid CpuPi(int np1)\r\n{\r\n    int n = np1 - 1;\r\n    struct _timeb  t1;\r\n    struct _timeb  t2;\r\n    std::cout &lt;&lt; &quot;Starting tests...\\n&quot;;\r\n    _ftime_s(&amp;t1);\r\n    float * a = (float*)malloc(np1 * sizeof(float));\r\n    CpuInit(a, n);\r\n    CpuMap(a, n);\r\n    cpu_reduce(a, n);\r\n    float w = *a * 4.0 \/ 3 \/ n;\r\n    
free(a);\r\n    _ftime(&amp;t2);\r\n    std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; &quot; s.\\n&quot;;\r\n\r\n    std::cout &lt;&lt; n &lt;&lt; &quot; &quot; &lt;&lt; std::setprecision(44) &lt;&lt; w;\r\n    std::cout &lt;&lt; &quot;\\n&quot;;\r\n}\r\n\r\nvoid TryPlatform(int np1)\r\n{\r\n    try {\r\n\r\n        int n = np1 - 1;\r\n        CUresult err;\r\n\r\n        cuInit(0);\r\n\r\n        int num = 0;\r\n        err = cuDeviceGetCount(&amp;num);\r\n        CHECK(err);\r\n\r\n        for (int d = 0; d &lt; num; ++d)\r\n        {\r\n            CUdevice device;\r\n            err = cuDeviceGet(&amp;device, d);\r\n            CHECK(err);\r\n\r\n            CUcontext context;\r\n            err = cuCtxCreate(&amp;context, 0, device);\r\n            CHECK(err);\r\n\r\n            \/\/ create the program\r\n            CUmodule program;\r\n            err = cuModuleLoad(&amp;program, &quot;pi-driver-kernels.ptx&quot;);\r\n            CHECK(err);\r\n\r\n            struct _timeb  t1;\r\n            struct _timeb  t2;\r\n            std::cout &lt;&lt; &quot;Starting tests...\\n&quot;;\r\n            _ftime_s(&amp;t1);\r\n\r\n            \/\/ for (int c = 0; c &lt; 7; ++c)\r\n            {\r\n                CUdeviceptr da;\r\n                err = cuMemAlloc(&amp;da, sizeof(float)*np1);\r\n                CHECK(err);\r\n\r\n                {\r\n                    size_t tile[3];\r\n                    size_t tiles[3];\r\n                    int max_dimensionality = 3;\r\n\r\n                    linear_to_tile(np1, max_dimensionality, tile, tiles);\r\n\r\n                    {\r\n                        \/\/ execute the kernel\r\n                        CUfunction kernel_init;\r\n                        err = cuModuleGetFunction(&amp;kernel_init, program, &quot;_Z4initIfEvPT_i&quot;);\r\n                        CHECK(err);\r\n\r\n                        void* args[] = { &amp;da, &amp;n };\r\n                    
    err = cuLaunchKernel(kernel_init,\r\n                            tiles[0], tiles[1], tiles[2], tile[0], tile[1], tile[2],\r\n                            0, 0, args, 0);\r\n                        CHECK(err);\r\n                    }\r\n\r\n                    {\r\n                        \/\/ execute the kernel\r\n                        CUfunction kernel_map;\r\n                        err = cuModuleGetFunction(&amp;kernel_map, program, &quot;_Z3mapIfEvPT_i&quot;);\r\n                        CHECK(err);\r\n\r\n                        void* args[] = { &amp;da, &amp;n };\r\n                        err = cuLaunchKernel(kernel_map,\r\n                            tiles[0], tiles[1], tiles[2], tile[0], tile[1], tile[2],\r\n                            0, 0, args, 0);\r\n                        CHECK(err);\r\n                    }\r\n                }\r\n\r\n                \/\/ Reduce.\r\n                int levels = flog2(np1);\r\n                for (int level = 0; level &lt; levels; ++level)\r\n                {\r\n                    int step = pow2(level+1);\r\n                    int threads = np1 \/ step;\r\n\r\n                    size_t tile[3];\r\n                    size_t tiles[3];\r\n                    int max_dimensionality = 3;\r\n\r\n                    linear_to_tile(threads, max_dimensionality, tile, tiles);\r\n\r\n                    \/\/ execute the kernel\r\n                    CUfunction kernel_reduce;\r\n                    err = cuModuleGetFunction(&amp;kernel_reduce, program, &quot;_Z15kernel_reduce_aIfEvPT_ii&quot;);\r\n                    CHECK(err);\r\n\r\n                    void* args[] = { &amp;da, &amp;n, &amp;step };\r\n                    err = cuLaunchKernel(kernel_reduce, tiles[0], tiles[1], tiles[2], tile[0], tile[1], tile[2], 0, 0, args, 0);\r\n                    CHECK(err);\r\n                }\r\n\r\n                float a[100];\r\n                int sss = 1;\r\n\r\n                \/\/ read output array\r\n                
err = cuMemcpyDtoH(a, da, sss*sizeof(float));\r\n                CHECK(err);\r\n                err = cuMemFree(da);\r\n                CHECK(err);\r\n\r\n                float w = *a * 4.0 \/ 3 \/ n;\r\n\r\n                _ftime(&amp;t2);\r\n                std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; &quot; s.\\n&quot;;\r\n                std::cout &lt;&lt; n &lt;&lt; &quot; &quot; &lt;&lt; std::setprecision(44) &lt;&lt; w;\r\n                std::cout &lt;&lt; &quot;\\n&quot;;\r\n\r\n                np1 *= 2;\r\n                n = np1 - 1;\r\n            }\r\n        }\r\n    }\r\n    catch(...)\r\n    {\r\n        std::cout &lt;&lt; &quot;Whoops!\\n&quot;;\r\n    }\r\n}\r\n\r\nint main()\r\n{\r\n    int np1 = pow2(24);\r\n    int n = np1 - 1;\r\n\r\n    printf(&quot;np1 = %d\\n&quot;, np1);\r\n\r\n    CpuPi(np1);\r\n\r\n    \/\/ Only one platform -- CUDA!\r\n    TryPlatform(np1);\r\n\r\n    return 0;\r\n}\r\n<\/pre>\n<h3>Figure&nbsp;5(b). 
Kernel for the CUDA Driver API solution.<\/h3>\n<pre lang=\"c\">__device__ int t2l()\r\n{\r\n    int i = (threadIdx.x\r\n         + blockDim.x * blockIdx.x\r\n         + blockDim.x * gridDim.x * blockDim.y * blockIdx.y\r\n         + blockDim.x * gridDim.x * threadIdx.y);\r\n    return i;\r\n}\r\n\r\ntemplate&lt;class T&gt;\r\n__global__ void init(T * v, int n)\r\n{\r\n    int i = t2l();\r\n\r\n    if (i &gt; n) \/\/ we want v[n] set!\r\n        return;\r\n\r\n    if (i &lt; 0)\r\n        return;\r\n\r\n    T w = 1.0 \/ n;\r\n    v[i] = i * w;\r\n}\r\n\r\ntemplate&lt;class T&gt;\r\nT func(T * a, int i, int n)\r\n{\r\n    T vv = a[i];\r\n    T w = 1.0L \/ (1 + vv * vv);\r\n    if (i == 0)\r\n        ;\r\n    else if (i == n)\r\n        ;\r\n    else if (i % 2 == 0)\r\n    {\r\n        w = w * 2;\r\n    }\r\n    else\r\n    {\r\n        w = w * 4;\r\n    }\r\n    return w;\r\n}\r\n\r\ntemplate&lt;class T&gt;\r\n__global__ void map(T * v, int n)\r\n{\r\n    int i = t2l();\r\n\r\n    if (i &gt; n) \/\/ we want v[n] set!\r\n        return;\r\n\r\n    if (i &lt; 0)\r\n        return;\r\n\r\n    T vv = v[i];\r\n\r\n    T w = 1.0 \/ (1 + vv * vv);\r\n    if (i == 0)\r\n        ;\r\n    else if (i == n)\r\n        ;\r\n    else if (i % 2 == 0)\r\n    {\r\n        w = w * 2;\r\n    }\r\n    else\r\n    {\r\n        w = w * 4;\r\n    }\r\n    v[i] = w;\r\n}\r\n\r\ntemplate&lt;class T&gt;\r\n__global__ void kernel_reduce_a(T * a, int n, int s)\r\n{\r\n    int i = t2l();\r\n    i *= s;\r\n\r\n    if (i &gt;= n)\r\n        return;\r\n\r\n    if (i &lt; 0)\r\n        return;\r\n\r\n    a[i] += a[i + s\/2];\r\n}\r\n\r\nvoid fun1()\r\n{\r\n    float * a;\r\n    int n;\r\n    init&lt;float&gt;&lt;&lt;&lt;1,1&gt;&gt;&gt;(a, n);\r\n    map&lt;float&gt;&lt;&lt;&lt;1,1&gt;&gt;&gt;(a, n);\r\n    kernel_reduce_a&lt;float&gt;&lt;&lt;&lt;1,1&gt;&gt;&gt;(a, n, n);\r\n}\r\n<\/pre>\n<h3>Figure&nbsp;6. 
Implementation of Composite Simpson&#39;s Rule for&nbsp;&pi;&nbsp;using OpenCL (and a sequential CPU implementation).<\/h3>\n<pre lang=\"c\">#include &lt;stdio.h&gt;\r\n#include &lt;string.h&gt;\r\n#include &lt;cl\/cl.h&gt;\r\n#include &lt;stdlib.h&gt;\r\n#include &lt;sys\/timeb.h&gt;\r\n#include &lt;time.h&gt;\r\n#include &lt;iostream&gt;\r\n#include &lt;iomanip&gt;\r\n\r\nchar * kernels = &quot;size_t t2l()\\\r\n{\\\r\n    size_t gsx = get_global_size(0);\\\r\n    size_t gsy = get_global_size(1);\\\r\n    size_t gsz = get_global_size(2);\\\r\n    size_t i = \\\r\n        get_global_id(0)\\\r\n        + get_global_id(1) * gsx \\\r\n        + get_global_id(2) * gsx * gsy\\\r\n        ;\\\r\n    return i;\\\r\n}\\\r\n\\\r\n__kernel void init(__global float * v, int n)\\\r\n{\\\r\n    size_t i = t2l();\\\r\n    if (i &gt; n)\\\r\n        return;\\\r\n        \\\r\n    if (i &lt; 0)\\\r\n        return;\\\r\n        \\\r\n    float w = 1.0 \/ n;\\\r\n    v[i] = i * w;\\\r\n}\\\r\n\\\r\n__kernel void map(__global float * v, int n)\\\r\n{\\\r\n    size_t i = t2l();\\\r\n         \\\r\n    if (i &gt; n)\\\r\n        return;\\\r\n        \\\r\n    if (i &lt; 0)\\\r\n        return;\\\r\n        \\\r\n    float vv = v[i];\\\r\n    \\\r\n    float w = 1.0 \/ (1 + vv * vv);\\\r\n    if (i == 0)\\\r\n        ;\\\r\n    else if (i == n)\\\r\n        ;\\\r\n    else if (i % 2 == 0)\\\r\n    {\\\r\n        w = w * 2;\\\r\n    }\\\r\n    else\\\r\n    {\\\r\n        w = w * 4;\\\r\n    }\\\r\n    v[i] = w;\\\r\n}\\\r\n\\\r\n__kernel void kernel_reduce_a(__global float * a, int n, int s)\\\r\n{\\\r\n    size_t i = t2l();\\\r\n    \\\r\n    i = i * s;\\\r\n         \\\r\n    if (i &gt;= n)\\\r\n        return;\\\r\n        \\\r\n    if (i &lt; 0)\\\r\n        return;\\\r\n        \\\r\n    a[i] = a[i] + a[i + s\/2];\\\r\n    if (i == 0)\\\r\n       a[1] = s\/2;\\\r\n}\\\r\n&quot;;\r\n\r\n\r\nvoid get_device_info(cl_device_id device_id, cl_device_info device_info, std::string* 
value, cl_int * err)\r\n{\r\n    size_t size = 0;\r\n\r\n    \/\/  Get all params for the given platform id, first query their size, then get the actual data\r\n    *err = clGetDeviceInfo(device_id, device_info, 0, NULL, &amp;size);\r\n    value-&gt;resize(size);\r\n    *err = clGetDeviceInfo(device_id, device_info, size, &amp;((*value)[0]), NULL);\r\n}\r\n\r\n\r\nvoid get_platform_info(cl_platform_id platform_id, cl_platform_info platform_info, std::string* value, cl_int * err)\r\n{\r\n    ::size_t size = 0;\r\n\r\n    \/\/  Get all params for the given platform id, first query their size, then get the actual data\r\n    *err = clGetPlatformInfo(platform_id, platform_info, 0, NULL, &amp;size);\r\n    value-&gt;resize(size);\r\n    *err = clGetPlatformInfo(platform_id, platform_info, size, &amp;((*value)[0]), NULL);\r\n}\r\n\r\n\r\ncl_program load_binary(cl_context context, cl_platform_id platform, cl_device_id device, char * file_name, cl_int * err)\r\n{\r\n    size_t len;\r\n\r\n    \/\/ for now, compile inlined source.\r\n    cl_program program = clCreateProgramWithSource(context, 1, (const char **)&amp;kernels, 0, err);\r\n    char **binaries = 0;\r\n    if (*err) \r\n    {\r\n        return 0;\r\n    }\r\n    *err = clBuildProgram(program, 1, &amp;device, &quot;-cl-opt-disable&quot;, NULL, NULL);\r\n    \/\/char log[5000];\r\n    \/\/cl_int err2 = clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, sizeof(log), log, NULL);\r\n    \/\/printf(&quot;log = %s\\n&quot;, log);\r\n    return program;\r\n}\r\n\r\nint the_blocksize = 256;\r\n\r\nvoid l2t(int size, int max_dimensionality, size_t * tile_size, size_t * tiles)\r\n{\r\n    for (int j = 0; j &lt; max_dimensionality; ++j)\r\n        tiles[j] = 1;\r\n    int max_threads[] = { the_blocksize, 64, 64};\r\n    int max_blocks[] = { 65535, 65535, 65535};\r\n    for (int j = 0; j &lt; max_dimensionality; ++j)\r\n        tile_size[j] = 1;\r\n\r\n    int b = size \/ (max_threads[0] * max_blocks[0]);\r\n    if 
(b == 0)\r\n    {\r\n        int b = size \/ max_threads[0];\r\n        if (size % max_threads[0] != 0)\r\n            b++;\r\n\r\n        if (b == 1)\r\n            max_threads[0] = size;\r\n\r\n        \/\/ done. return the result.\r\n        tiles[0] = b;\r\n        tile_size[0] = max_threads[0];\r\n\r\n        \/\/ OpenCL uses multiples of tile_size.\r\n        tiles[0] *= tile_size[0];\r\n        return;\r\n\r\n    }\r\n\r\n    int sqrt_size = sqrt((float)size \/ max_threads[0]);\r\n    sqrt_size++;\r\n\r\n    int b2 = sqrt_size \/ max_blocks[1];\r\n    if (b2 == 0)\r\n    {\r\n        int b = sqrt_size;\r\n\r\n        \/\/ done. return the result.\r\n        tiles[0] = tiles[1] = b;\r\n        tile_size[0] = max_threads[0];\r\n\r\n        \/\/ OpenCL uses multiples of tile_size.\r\n        tiles[0] *= tile_size[0];\r\n\/\/        tiles[1] *= tile_size[1];\r\n        return;\r\n    }\r\n    throw;\r\n}\r\n\r\nvoid CpuInit(float * v, int n)\r\n{\r\n    float w = 1.0 \/ n;\r\n    for (int i = 0; i &lt;= n; ++i)\r\n        v[i] = i * w;\r\n}\r\n\r\nvoid CpuMap(float * v, int n)\r\n{\r\n    for (int i = 0; i &lt;= n; ++i)\r\n    {\r\n        float vv = v[i];\r\n    \r\n        float w = 1.0 \/ (1 + vv * vv);\r\n        if (i == 0)\r\n            ;\r\n        else if (i == n)\r\n            ;\r\n        else if (i % 2 == 0)\r\n        {\r\n            w = w * 2;\r\n        }\r\n        else\r\n        {\r\n            w = w * 4;\r\n        }\r\n        v[i] = w;\r\n    }\r\n}\r\n\r\n\r\nvoid cpu_reduce(float * a, int n)\r\n{\r\n    \/\/ note: use double here because of accumulation of truncation error due to adding small numbers to a large number.\r\n    double total = 0;\r\n    for (int i = 0; i &lt;= n; ++i)\r\n        total += a[i];\r\n    a[0] = total;\r\n}\r\n\r\n\r\n\/\/ From http:\/\/graphics.stanford.edu\/~seander\/bithacks.html#IntegerLogObvious\r\nint flog2(int v)\r\n{\r\n    int x = 0;\r\n    while ((v = (v &gt;&gt; 1)) != 0)\r\n    {\r\n        x++;\r\n 
   }\r\n    return (int)x;\r\n}\r\n\r\n\/\/ Compute 2 ** v.\r\nint pow2(int v)\r\n{\r\n    int value = 1;\r\n    for ( ; v &gt; 0; v--)\r\n        value &lt;&lt;= 1;\r\n    return value;\r\n}\r\n\r\nvoid CHECK(cl_int x)\r\n{\r\n    if (x != 0)\r\n    {\r\n        std::cout &lt;&lt; &quot;Error &quot; &lt;&lt; x &lt;&lt; &quot;\\n&quot;;\r\n        \/\/ Throw the error code itself; &quot;Error &quot; + x would be pointer\r\n        \/\/ arithmetic on a string literal, not concatenation.\r\n        throw x;\r\n    }\r\n}\r\n\r\nvoid CpuPi(int np1)\r\n{\r\n    int n = np1 - 1;\r\n    struct _timeb  t1;\r\n    struct _timeb  t2;\r\n    std::cout &lt;&lt; &quot;Starting tests...\\n&quot;;\r\n    _ftime_s(&amp;t1);\r\n    float * a = (float*)malloc(np1 * sizeof(float));\r\n    CpuInit(a, n);\r\n    CpuMap(a, n);\r\n    cpu_reduce(a, n);\r\n    float w = *a * 4.0 \/ 3 \/ n;\r\n    free(a);\r\n    _ftime_s(&amp;t2);\r\n    std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; &quot; s.\\n&quot;;\r\n\r\n    std::cout &lt;&lt; n &lt;&lt; &quot; &quot; &lt;&lt; std::setprecision(44) &lt;&lt; w;\r\n    std::cout &lt;&lt; &quot;\\n&quot;;\r\n}\r\n\r\nvoid TryPlatform(cl_platform_id platform, int np1)\r\n{\r\n    try {\r\n\r\n        int n = np1 - 1;\r\n\r\n        \/\/ Choose the appropriate platform for this linked DLL.\r\n        cl_device_id dev_id[10];\r\n        cl_uint num;\r\n        cl_int err;\r\n        err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 10, dev_id, &amp;num);\r\n        CHECK(err);\r\n\r\n        printf(&quot;devices = %d\\n&quot;, num);\r\n\r\n        for (int d = 0; d &lt; num; ++d)\r\n        {\r\n            char tbuf[500];\r\n            size_t sz;\r\n\r\n            std::cout &lt;&lt; &quot;            Device [&quot; &lt;&lt; d &lt;&lt; &quot;]&quot; &lt;&lt; std::endl;\r\n            std::cout &lt;&lt; &quot;                type                          = &quot;;\r\n            {\r\n                cl_device_type type;\r\n                err = clGetDeviceInfo(dev_id[d], CL_DEVICE_TYPE, sizeof(type), 
&amp;type, NULL);\r\n                CHECK(err);\r\n\r\n                if (type &amp; CL_DEVICE_TYPE_DEFAULT       ) std::cout &lt;&lt; &quot;CL_DEVICE_TYPE_DEFAULT &quot;    ;\r\n                if (type &amp; CL_DEVICE_TYPE_CPU           ) std::cout &lt;&lt; &quot;CL_DEVICE_TYPE_CPU &quot;        ;\r\n                if (type &amp; CL_DEVICE_TYPE_GPU           ) std::cout &lt;&lt; &quot;CL_DEVICE_TYPE_GPU &quot;        ;\r\n                if (type &amp; CL_DEVICE_TYPE_ACCELERATOR   ) std::cout &lt;&lt; &quot;CL_DEVICE_TYPE_ACCELERATOR &quot;;\r\n\r\n                std::cout &lt;&lt; std::endl;\r\n            }\r\n\r\n            err = clGetDeviceInfo(dev_id[d], CL_DEVICE_NAME, sizeof(tbuf), tbuf, NULL);\r\n            CHECK(err);\r\n            std::cout &lt;&lt; &quot;                name                          = &quot; &lt;&lt; tbuf &lt;&lt; std::endl;\r\n\r\n            \/\/ Choose device.\r\n            cl_device_id device = dev_id[d];\r\n\r\n            \/\/ create the OpenCL context on a GPU device\r\n            cl_context_properties props[4];\r\n            props[0] = CL_CONTEXT_PLATFORM;\r\n            props[1] = (cl_context_properties)platform;\r\n            props[2] = 0;\r\n            cl_context context = clCreateContext(props, 1, &amp;device, NULL, NULL, &amp;err);\r\n            CHECK(err);\r\n    \r\n            \/\/ create the program\r\n            cl_program program = load_binary(context, platform, device, &quot;pi-opencl-kernels.o&quot;, &amp;err);\r\n            CHECK(err);\r\n\r\n            struct _timeb  t1;\r\n            struct _timeb  t2;\r\n            std::cout &lt;&lt; &quot;Starting tests...\\n&quot;;\r\n            _ftime_s(&amp;t1);\r\n\r\n            \/\/ for (int c = 0; c &lt; 7; ++c)\r\n            {\r\n                cl_int r1;\r\n                cl_mem da = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(cl_float)*np1, NULL, &amp;r1);\r\n                CHECK(r1);\r\n\r\n                size_t asize = 
sizeof(float);\r\n                float * a = (float*) malloc(asize);\r\n                {\r\n                    size_t tile[3];\r\n                    size_t tiles[3];\r\n                    int max_dimensionality = 3;\r\n\r\n                    l2t(np1, max_dimensionality, tile, tiles);\r\n\r\n                    {\r\n                        \/\/ execute the kernel\r\n                        cl_kernel kernel_init = clCreateKernel(program, &quot;init&quot;, &amp;err);\r\n                        CHECK(err);\r\n                        err = clSetKernelArg(kernel_init, 0, sizeof(cl_mem), (void *) &amp;da);\r\n                        CHECK(err);\r\n                        err = clSetKernelArg(kernel_init, 1, sizeof(int), (void *)&amp;n);\r\n                        CHECK(err);\r\n                        \/\/ create a command-queue\r\n                        cl_command_queue cmd_queue = clCreateCommandQueue(context, device, 0, &amp;err);\r\n                        CHECK(err);\r\n                        cl_event my_event;\r\n                        err = clEnqueueNDRangeKernel(cmd_queue, kernel_init, max_dimensionality, NULL, tiles, tile, 0, NULL, &amp;my_event);\r\n                        CHECK(err);\r\n                        err = clWaitForEvents(1, &amp;my_event);\r\n                        CHECK(err);\r\n                        err = clReleaseCommandQueue(cmd_queue);\r\n                        CHECK(err);\r\n                        err = clReleaseKernel(kernel_init);\r\n                        CHECK(err);\r\n                    }\r\n\r\n                    {\r\n                        \/\/ execute the kernel\r\n                        cl_kernel kernel_map = clCreateKernel(program, &quot;map&quot;, &amp;err);\r\n                        CHECK(err);\r\n                        err = clSetKernelArg(kernel_map, 0, sizeof(cl_mem), (void *) &amp;da);\r\n                        CHECK(err);\r\n                        err = clSetKernelArg(kernel_map, 1, sizeof(int), (void 
*)&amp;n);\r\n                        CHECK(err);\r\n                        cl_command_queue cmd_queue = clCreateCommandQueue(context, device, 0, &amp;err);\r\n                        CHECK(err);\r\n                        cl_event my_event;\r\n                        err = clEnqueueNDRangeKernel(cmd_queue, kernel_map, max_dimensionality, NULL, tiles, tile, 0, NULL, &amp;my_event);\r\n                        CHECK(err);\r\n                        err = clWaitForEvents(1, &amp;my_event);\r\n                        CHECK(err);\r\n                        err = clReleaseCommandQueue(cmd_queue);\r\n                        CHECK(err);\r\n                        err = clReleaseKernel(kernel_map);\r\n                        CHECK(err);\r\n                    }\r\n                }\r\n\r\n                \/\/ Reduce.\r\n                int levels = flog2(np1);\r\n                for (int level = 0; level &lt; levels; ++level)\r\n                {\r\n                    int step = pow2(level+1);\r\n                    int threads = np1 \/ step;\r\n\r\n                    size_t tile[3];\r\n                    size_t tiles[3];\r\n                    int max_dimensionality = 3;\r\n\r\n                    l2t(threads, max_dimensionality, tile, tiles);\r\n\r\n                    \/\/ execute the kernel\r\n                    cl_kernel kernel_reduce = clCreateKernel(program, &quot;kernel_reduce_a&quot;, &amp;err);\r\n                    CHECK(err);\r\n                    err = clSetKernelArg(kernel_reduce, 0, sizeof(cl_mem), (void *) &amp;da);\r\n                    CHECK(err);\r\n                    err = clSetKernelArg(kernel_reduce, 1, sizeof(int), (void *)&amp;n);\r\n                    CHECK(err);\r\n                    err = clSetKernelArg(kernel_reduce, 2, sizeof(int), (void *)&amp;step);\r\n                    CHECK(err);\r\n                    cl_command_queue cmd_queue = clCreateCommandQueue(context, device, 0, &amp;err);\r\n                    CHECK(err);\r\n          
          cl_event my_event;\r\n                    err = clEnqueueNDRangeKernel(cmd_queue, kernel_reduce, max_dimensionality, NULL, tiles, tile, 0, NULL, &amp;my_event);\r\n                    CHECK(err);\r\n                    err = clWaitForEvents(1, &amp;my_event);\r\n                    CHECK(err);\r\n                    err = clReleaseCommandQueue(cmd_queue);\r\n                    CHECK(err);\r\n                    err = clReleaseKernel(kernel_reduce);\r\n                    CHECK(err);\r\n                }\r\n\r\n                \/\/ read output array\r\n                cl_command_queue cmd_queue = clCreateCommandQueue(context, device, 0, &amp;err);\r\n                CHECK(err);\r\n                err = clEnqueueReadBuffer(cmd_queue, da, CL_TRUE, 0, sizeof(float), a, 0, NULL, NULL);\r\n                CHECK(err);\r\n                err = clReleaseCommandQueue(cmd_queue);\r\n                CHECK(err);\r\n                err = clReleaseMemObject(da);\r\n                CHECK(err);\r\n\r\n                float a1 = *a;\r\n                float a2 = a1 * 4.0;\r\n                float a3 = a2 \/ 3;\r\n                float a4 = a3 \/ n;\r\n                float w = ((*a * 4.0) \/ 3) \/ n;\r\n\r\n                _ftime(&amp;t2);\r\n                std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; &quot; s.\\n&quot;;\r\n                std::cout &lt;&lt; n &lt;&lt; &quot; &quot; &lt;&lt; std::setprecision(44) &lt;&lt; w;\r\n                std::cout &lt;&lt; &quot;\\n&quot;;\r\n\r\n            }\r\n            err = clReleaseContext(context);\r\n            CHECK(err);\r\n            err = clReleaseProgram(program);\r\n            CHECK(err);\r\n        }\r\n    }\r\n    catch(...)\r\n    {\r\n        std::cout &lt;&lt; &quot;Whoops!\\n&quot;;\r\n    }\r\n}\r\n\r\nint main(int argc, char *argv[])\r\n{\r\n    \/\/ input block size\r\n    argc--; argv++;\r\n    if (argc)\r\n    {\r\n        the_blocksize = 
atoi(*argv);\r\n    }\r\n\r\n\/\/    int np1 = pow2(24);\r\n    int np1 = pow2(24);\r\n    int n = np1 - 1;\r\n\r\n    printf(&quot;np1 = %d\\n&quot;, np1);\r\n\r\n    CpuPi(np1);\r\n\r\n    cl_int err = NULL;\r\n    cl_platform_id plat_id[20];\r\n    cl_uint num;\r\n\r\n    err = clGetPlatformIDs(20, plat_id, &amp;num);\r\n\r\n    if (err != NULL)\r\n        return err;\r\n\r\n    printf(&quot;Number of platforms = %d\\n&quot;, num);\r\n\r\n    cl_uint i = 0;\r\n    for (i = 0; i &lt; num; ++i)\r\n    {\r\n        char buf[500];\r\n        size_t sz;\r\n        err = clGetPlatformInfo(plat_id[i], CL_PLATFORM_PROFILE, 500, buf, &amp;sz);\r\n        if (err != NULL)\r\n            return err;\r\n        printf(&quot;Platform profile: %s\\n&quot;, buf);\r\n        err = clGetPlatformInfo(plat_id[i], CL_PLATFORM_VERSION, 500, buf, &amp;sz);\r\n        if (err != NULL)\r\n            return err;\r\n        printf(&quot;Platform version: %s\\n&quot;, buf);\r\n        err = clGetPlatformInfo(plat_id[i], CL_PLATFORM_NAME, 500, buf, &amp;sz);\r\n        if (err != NULL)\r\n            return err;\r\n        printf(&quot;Platform name: %s\\n&quot;, buf);\r\n        char vendor[500];\r\n        err = clGetPlatformInfo(plat_id[i], CL_PLATFORM_VENDOR, 500, vendor, &amp;sz);\r\n        if (err != NULL)\r\n            return err;\r\n        printf(&quot;Platform vendor: %s\\n&quot;, vendor);\r\n        err = clGetPlatformInfo(plat_id[i], CL_PLATFORM_EXTENSIONS, 500, buf, &amp;sz);\r\n        if (err != NULL)\r\n            return err;\r\n        printf(&quot;Platform extensions: %s\\n&quot;, buf);\r\n\r\n        TryPlatform(plat_id[i], np1);\r\n    }\r\n    return 0;\r\n}\r\n<\/pre>\n<h2>Discussion<\/h2>\n<p style=\"text-align: justify;\">Three parallel solutions for estimating the value of&nbsp;<span class=\"texhtml\" style=\"line-height: 1em; font-family: &quot;Times New Roman&quot;, serif; font-size: 15px; white-space: nowrap;\">&pi;, along with a sequential solution for 
comparison,<\/span>&nbsp;were implemented. &nbsp;The programs were tested on an NVIDIA&nbsp;GeForce GTX 470, an ATI Radeon HD 6450, and an Intel Q6600 @ 2.51 GHz (overclocked), 4 GB RAM, run on the Windows 7 64-bit OS. &nbsp;All solutions produced the same results, but the capabilities of these processors couldn&#39;t be more different. &nbsp;The NVIDIA GPU has 448 cores. &nbsp;Threads are executed in groups of 32 called a warp. &nbsp;The AMD GPU has 160 cores. &nbsp;Threads are executed in groups called wavefronts. &nbsp;It also uses <a href=\"http:\/\/en.wikipedia.org\/wiki\/Very_long_instruction_word\">Very Long Instruction Word<\/a>&nbsp;(VLIW)&nbsp;instructions to simplify the hardware and decrease instruction latency. &nbsp;The Intel CPU has 4 independent cores. &nbsp;In this experiment, the CPU is used for sequential computation and for OpenCL through the AMD APP SDK. &nbsp;For the purpose of the comparison, the block (work group) size was the same for the CUDA and OpenCL solutions. With the given hardware and software (problem size n = 2<sup>24<\/sup> &#8211; 1, programs built as 32-bit Windows applications in Release configuration), the runtimes are shown below (Figure 7).&nbsp;<\/p>\n<p style=\"text-align: justify;\">Notes: (1) The CPU could run OpenCL through the AMD APP SDK v2. &nbsp;The Intel OpenCL SDK does not work with this CPU because it does not implement&nbsp;<a href=\"http:\/\/en.wikipedia.org\/wiki\/SSE4\">SSE4<\/a>. (2) The ATI and NVIDIA graphics cards could not co-exist properly because the motherboard does not implement hybrid SLI\/Crossfire multi-GPU. &nbsp;In order to get it to work, one board would have to be disabled or removed. Newer boards that contain the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Hydra_Engine\">Lucid Hydra<\/a> bypass this problem.<\/p>\n<h3 style=\"text-align: justify;\">Figure 7. 
Runtime of the estimation of&nbsp;<span class=\"texhtml\" style=\"line-height: 1em; font-family: &quot;Times New Roman&quot;, serif; font-size: 15px; white-space: nowrap;\">&pi;<\/span>&nbsp;for various platforms and implementations<\/h3>\n<table border=\"1\" cellpadding=\"1\" cellspacing=\"1\">\n<tbody>\n<tr>\n<td><strong>System\/Program<\/strong><\/td>\n<td><strong>Time (s)<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Sequential implementation, Intel Q6600 @ 2.51 GHz, 4 GB RAM<\/td>\n<td>0.3<\/td>\n<\/tr>\n<tr>\n<td>CUDA Runtime API with NVIDIA GTX 470<\/td>\n<td>0.08<\/td>\n<\/tr>\n<tr>\n<td>CUDA Driver API with NVIDIA GTX 470<\/td>\n<td>0.016<\/td>\n<\/tr>\n<tr>\n<td>OpenCL with NVIDIA GTX 470<\/td>\n<td>0.027<\/td>\n<\/tr>\n<tr>\n<td>OpenCL with ATI Radeon HD 6450<\/td>\n<td>2.15<\/td>\n<\/tr>\n<tr>\n<td>OpenCL with Intel Q6600 @ 2.51 GHz, 4 GB RAM<\/td>\n<td>1.05<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p style=\"text-align: justify;\">Not too much of a surprise, the CUDA and OpenCL implementations on the NVIDIA GPU run faster than the sequential implementation on the CPU&#8211;for the given problem size. &nbsp;But, what is a surprise is that the OpenCL implementation on the ATI device is much slower than the CPU sequential implementation. &nbsp;A larger problem size for n+1 could reverse the standing, but OpenCL runtime errors appear for larger problem sizes. The runtimes for the three parallel solutions vary quite a bit, even for the same device (i.e., the NVIDIA GTX 470), with the CUDA Runtime API roughly three to five times slower than the OpenCL and CUDA Driver API solutions.<\/p>\n<p style=\"text-align: justify;\">On the surface, the languages for specifying kernels appear similar. &nbsp;However, if features and capabilities are compared item by item, the differences are pronounced (Figure 8).<\/p>\n<p style=\"text-align: justify;\">The language for the CUDA framework is a C++ dialect. 
&nbsp;It is translated by a stand-alone compiler (<em>nvcc<\/em>, developed by NVIDIA) into standard C++. &nbsp;The C++ code is then compiled into an executable using the native C++ compiler (e.g., gcc for Linux, and MSVC++ for Windows). &nbsp;An advantage of this method is that kernel code can intermingle with host code. &nbsp;The kernel code is translated into PTX assembly code, the lingua franca of NVIDIA devices. &nbsp;PTX code is used directly by the CUDA Driver API. &nbsp;Kernel calls can be specified using a chevron syntax (e.g., line 208 of Figure 4), which is translated into CUDA Runtime API calls. &nbsp;(There is no provision to translate the code into CUDA Driver API calls.) &nbsp;The chevron syntax reduces the verbosity of kernel calls that one would find with the CUDA Driver framework (lines 191 through 198 of Figure 5(a)) or the OpenCL framework (lines 200 through 216 of Figure 6(a)). &nbsp;Kernels can be extended in a number of ways: they can be used in templates (e.g., line 65 of Figure 4), use classes, call recursive functions, etc. &nbsp;However, kernel code in C++ classes is only partially supported. &nbsp;Classes can contain device code, declared __device__, that kernels can call, but kernels themselves cannot be class members.<\/p>\n<p style=\"text-align: justify;\">The OpenCL language is similar to CUDA C++, but much more restricted. &nbsp;OpenCL is based on the C99 standard, not C++. &nbsp;Developers normally write the kernels as a &quot;char *&quot; string. &nbsp;This code is then compiled using calls to&nbsp;clCreateProgramWithSource and&nbsp;clBuildProgram. &nbsp;Specifying kernel code as multiline string literals is a hassle in C++. &nbsp;The code can be written as adjacent string literals (each in double quotes), one for each line of text for the kernel. &nbsp;Alternatively, a trailing backslash (\\) at the end of each line can be employed. &nbsp;Either way, it is quite annoying. 
&nbsp;<\/p>\n<p style=\"text-align: justify;\">Unfortunately, there is no stand-alone compiler for OpenCL. &nbsp;A developer can certainly write one (see here), but that introduces a new problem: you must remember to use the same compiler options when loading the binary object code (see line 130 of Figure 6(b)). &nbsp;The OpenCL compilation (i.e., clCreateProgramWithSource and&nbsp;clBuildProgram) results in binary code that can be retrieved with clGetProgramInfo(), saved to a file, and reloaded for execution later with clCreateProgramWithBinary(). &nbsp;For NVIDIA, the binary code is PTX, which is essentially text. &nbsp;For ATI, the code is an ELF binary file.<\/p>\n<p style=\"text-align: justify;\">There are many differences between OpenCL, the CUDA Runtime API, and the CUDA Driver API. &nbsp;Item by item, the functions often do not correspond at all to each other. &nbsp;However, groups of API calls have similar functionality. &nbsp;The following table is a summary of the differences, similarities, and equivalences (Figure 8).<\/p>\n<h3 style=\"text-align: justify;\">Figure 8. 
Comparison of features between OpenCL, CUDA Runtime API, and CUDA Driver API<\/h3>\n<table border=\"1\" cellpadding=\"1\" cellspacing=\"1\">\n<thead>\n<tr>\n<th scope=\"col\">Term \/ feature<\/th>\n<th scope=\"col\">OpenCL<\/th>\n<th scope=\"col\">CUDA Runtime<\/th>\n<th scope=\"col\">CUDA Driver<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><em>Processor hardware<\/em><\/td>\n<td>Device<\/td>\n<td colspan=\"2\">GPU<\/td>\n<\/tr>\n<tr>\n<td><em>Processor hardware<\/em><\/td>\n<td>Compute unit<\/td>\n<td colspan=\"2\">Multiprocessor<\/td>\n<\/tr>\n<tr>\n<td><em>Processor hardware<\/em><\/td>\n<td>Processing element<\/td>\n<td colspan=\"2\">Scalar core<\/td>\n<\/tr>\n<tr>\n<td><em>Memory hardware<\/em><\/td>\n<td>Global memory<\/td>\n<td colspan=\"2\">Global memory<\/td>\n<\/tr>\n<tr>\n<td><em>Memory hardware<\/em><\/td>\n<td>Local memory<\/td>\n<td colspan=\"2\">Shared memory<\/td>\n<\/tr>\n<tr>\n<td><em>Memory hardware<\/em><\/td>\n<td>Private memory<\/td>\n<td colspan=\"2\">Local memory<\/td>\n<\/tr>\n<tr>\n<td><em>Software<\/em><\/td>\n<td>Kernel (or program)<\/td>\n<td colspan=\"2\">Kernel<\/td>\n<\/tr>\n<tr>\n<td><em>Software<\/em><\/td>\n<td>Work item<\/td>\n<td colspan=\"2\">Thread<\/td>\n<\/tr>\n<tr>\n<td><em>Software<\/em><\/td>\n<td>Work group<\/td>\n<td colspan=\"2\">Block<\/td>\n<\/tr>\n<tr>\n<td><em>Software<\/em><\/td>\n<td>NDRange<\/td>\n<td colspan=\"2\">Grid<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\"><strong>Language Features<\/strong><\/td>\n<td style=\"text-align: center;\"><strong>OpenCL<\/strong><\/td>\n<td style=\"text-align: center;\"><strong>CUDA Runtime<\/strong><\/td>\n<td style=\"text-align: center;\"><strong>CUDA Driver<\/strong><\/td>\n<\/tr>\n<tr>\n<td><em>Language base<\/em><\/td>\n<td>C99<\/td>\n<td colspan=\"2\">C++03<\/td>\n<\/tr>\n<tr>\n<td>Format of kernel<\/td>\n<td>string, e.g., a literal or constructed by reading a file<\/td>\n<td colspan=\"2\">C++ function with __global__ qualifier<\/td>\n<\/tr>\n<tr>\n<td>Qualifier and 
type of kernel<\/td>\n<td>__kernel void<\/td>\n<td colspan=\"2\">__global__ void<\/td>\n<\/tr>\n<tr>\n<td>Qualifier of non-kernel device functions<\/td>\n<td>Qualifier cannot appear<\/td>\n<td colspan=\"2\">__device__<\/td>\n<\/tr>\n<tr>\n<td>Qualifier of host functions<\/td>\n<td>Host functions cannot appear in OpenCL code<\/td>\n<td colspan=\"2\">__host__<\/td>\n<\/tr>\n<tr>\n<td>Qualifiers for kernel formal parameters<\/td>\n<td>Address qualifiers must be used (global, local, constant)<\/td>\n<td colspan=\"2\">Address qualifiers CANNOT be used<\/td>\n<\/tr>\n<tr>\n<td>Global memory variables<\/td>\n<td>__global<\/td>\n<td colspan=\"2\">__device__<\/td>\n<\/tr>\n<tr>\n<td>Shared memory variables<\/td>\n<td>__local<\/td>\n<td colspan=\"2\">__shared__<\/td>\n<\/tr>\n<tr>\n<td>Constant memory variables<\/td>\n<td>__constant<\/td>\n<td colspan=\"2\">__constant__<\/td>\n<\/tr>\n<tr>\n<td>Private (per task) memory variables<\/td>\n<td>__private or unqualified<\/td>\n<td colspan=\"2\">Qualifiers cannot appear<\/td>\n<\/tr>\n<tr>\n<td>Kernel call syntax<\/td>\n<td>API call in host code, via clEnqueueNDRangeKernel()<\/td>\n<td>Chevron syntax for launch<\/td>\n<td>API call in host code, via cuLaunchGrid(), using PTX assembly code in a string variable<\/td>\n<\/tr>\n<tr>\n<td>Templates available?<\/td>\n<td>No<\/td>\n<td colspan=\"2\">Yes<\/td>\n<\/tr>\n<tr>\n<td>Device memory type in host code<\/td>\n<td>Handle<\/td>\n<td>Pointer, but cannot be dereferenced<\/td>\n<td>Handle<\/td>\n<\/tr>\n<tr>\n<td>Device memory type in device code<\/td>\n<td colspan=\"3\">Pointer<\/td>\n<\/tr>\n<tr>\n<td>Double-precision floating point<\/td>\n<td>Yes (with #pragma only)<\/td>\n<td colspan=\"2\">Yes<\/td>\n<\/tr>\n<tr>\n<td>Thread identification<\/td>\n<td>get_local_id<\/td>\n<td colspan=\"2\">threadIdx<\/td>\n<\/tr>\n<tr>\n<td>Block identification<\/td>\n<td>get_group_id<\/td>\n<td colspan=\"2\">blockIdx<\/td>\n<\/tr>\n<tr>\n<td>Block dimension<\/td>\n<td>get_local_size<\/td>\n<td 
colspan=\"2\">blockDim<\/td>\n<\/tr>\n<tr>\n<td>Grid dimension<\/td>\n<td>get_num_groups<\/td>\n<td colspan=\"2\">gridDim<\/td>\n<\/tr>\n<tr>\n<td>atomic add\/sub\/&#8230;<\/td>\n<td>AMD, yes with #pragma; No, NVIDIA<\/td>\n<td colspan=\"2\">Yes<\/td>\n<\/tr>\n<tr>\n<td>Thread synchronization within a block<\/td>\n<td>barrier()<\/td>\n<td colspan=\"2\">__syncthreads()<\/td>\n<\/tr>\n<tr>\n<td>Thread synchronization within a block<\/td>\n<td>No equivalence with CUDA<\/td>\n<td colspan=\"2\">__threadfence()<\/td>\n<\/tr>\n<tr>\n<td>Thread synchronization within a block<\/td>\n<td>mem_fence(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE)<\/td>\n<td colspan=\"2\">__threadfence_block()<\/td>\n<\/tr>\n<tr>\n<td>Thread synchronization within a block<\/td>\n<td>read_mem_fence()<\/td>\n<td colspan=\"2\">No equivalence with OpenCL<\/td>\n<\/tr>\n<tr>\n<td>Thread synchronization within a block<\/td>\n<td>write_mem_fence()<\/td>\n<td colspan=\"2\">No equivalence with OpenCL<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\"><strong>API Features<\/strong><\/td>\n<td style=\"text-align: center;\"><strong>OpenCL<\/strong><\/td>\n<td style=\"text-align: center;\"><strong>CUDA Runtime<\/strong><\/td>\n<td style=\"text-align: center;\"><strong>CUDA Driver<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Initialization<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td>cuInit()<\/td>\n<\/tr>\n<tr>\n<td>Kernel creation<\/td>\n<td>clGetPlatformIDs(); clGetDeviceIDs(); clCreateProgramWithSource(); clBuildProgram(); clCreateKernel();<\/td>\n<td>(Generated by CUDA compiler: cudaRegisterFatBinary(); cudaRegisterFunction();<\/td>\n<td>cuDeviceGet(); cuCtxCreate(); cuModuleLoad(); cuModuleGetFunction();<\/td>\n<\/tr>\n<tr>\n<td>Global memory allocation<\/td>\n<td>clCreateBuffer()<\/td>\n<td>cudaMalloc()<\/td>\n<td>cuMemAlloc()<\/td>\n<\/tr>\n<tr>\n<td>Global memory deallocation<\/td>\n<td>clReleaseMemObj()<\/td>\n<td>cudaFree()<\/td>\n<td>cuMemFree()<\/td>\n<\/tr>\n<tr>\n<td>CPU to GPU memory 
copy<\/td>\n<td>clEnqueueWriteBuffer()<\/td>\n<td>cudaMemcpy()<\/td>\n<td>cuMemcpyHtoD()<\/td>\n<\/tr>\n<tr>\n<td>Kernel call from host<\/td>\n<td>clSetKernelArg(); clCreateCommandQueue(); clEnqueueNDRangeKernel();<\/td>\n<td>Chevron syntax&nbsp;for launch, which generates cudaConfigureCall(); cudaSetupArgument(); cudaLaunch();<\/td>\n<td>cuLaunchKernel();<\/td>\n<\/tr>\n<tr>\n<td>GPU to CPU memory copy<\/td>\n<td>clEnqueueReadBuffer()<\/td>\n<td>cudaMemcpy()<\/td>\n<td>cuMemcpyDtoH()<\/td>\n<\/tr>\n<tr>\n<td>Device information<\/td>\n<td>clGetContextInfo(); clGetDeviceInfo();<\/td>\n<td>cudaGetDeviceProperties()<\/td>\n<td>cuDeviceGetProperties()<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\"><strong>General capabilities and tools<\/strong><\/td>\n<td style=\"text-align: center;\"><strong>OpenCL<\/strong><\/td>\n<td style=\"text-align: center;\"><strong>CUDA Runtime<\/strong><\/td>\n<td style=\"text-align: center;\"><strong>CUDA Driver<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Official, stand-alone compiler<\/td>\n<td>No (though the third-party clcc exists)<\/td>\n<td>Yes<\/td>\n<td>Yes<\/td>\n<\/tr>\n<tr>\n<td>JIT compilation<\/td>\n<td>Yes, from OpenCL C<\/td>\n<td>Yes, from PTX<\/td>\n<td>Yes, from PTX<\/td>\n<\/tr>\n<tr>\n<td>Current version<\/td>\n<td>1.1 AMD; 1.0 NVIDIA (still on 1.0 a year after the 1.1 spec was released); 1.1 Intel<\/td>\n<td colspan=\"2\">4.0<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div>&nbsp;<\/div>\n<h2>Conclusions<\/h2>\n<p style=\"text-align: justify;\">OpenCL is a standardized framework for programming heterogeneous processors; CUDA is a proprietary framework. &nbsp;If you want cross-platform support, the choice is clear: use OpenCL.&nbsp; While some studies show that the OpenCL framework is slower than the CUDA framework [<a href=\"#REF2\">2<\/a>, <a href=\"#REF3\">3<\/a>, <a href=\"#REF4\">4<\/a>], this study found that OpenCL is faster than the CUDA Runtime API. 
&nbsp;Further studies should be performed to determine whether these runtime speeds are consistent.<\/p>\n<p style=\"text-align: justify;\">While CUDA C++ as a language is more advanced than OpenCL, both are quite primitive relative to the <a href=\"http:\/\/blogs.msdn.com\/b\/somasegar\/archive\/2011\/06\/15\/targeting-heterogeneity-with-c-amp-and-ppl.aspx\">proposed C++ Accelerated Massive Parallelism language (C++ AMP) by Microsoft<\/a>. &nbsp;In CUDA and OpenCL, kernels must be abstracted into functions. &nbsp;In C++ AMP, code will look much cleaner because parallel for-loops operate on inlined lambda functions, and the memory model is relatively hidden.<\/p>\n<p style=\"text-align: justify;\">The microprocessor industry is on the cusp of a major change. &nbsp;While discrete GPUs will still exist for several years to come, most manufacturers are releasing integrated CPU\/GPU chips. &nbsp;These processors share a common bus to memory. &nbsp;So, much of the functionality of CUDA and OpenCL, designed to copy memory between the CPU and the GPU, will be obsolete.&nbsp;<\/p>\n<h2>Code<\/h2>\n<p>Code for this example is here:<\/p>\n<p><a href=\"http:\/\/domemtech.com\/code\/pi-opencl.zip\">MSVC++ 2010 OpenCL solution<\/a><\/p>\n<p><a href=\"http:\/\/domemtech.com\/code\/pi-driver.zip\">MSVC++ 2010 CUDA Driver API solution<\/a><\/p>\n<p><a href=\"http:\/\/domemtech.com\/code\/pi-runtime.zip\">MSVC++ 2010 CUDA Runtime API solution<\/a><\/p>\n<h2>References<\/h2>\n<p style=\"text-align: justify; margin-left: 0.5in;\"><a name=\"REF1\"><\/a>1. Lee, V. W., C. Kim, et al. (2010). Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. Proceedings of the 37th annual international symposium on Computer architecture. Saint-Malo, France, ACM: 451-460.<\/p>\n<p style=\"text-align: justify; margin-left: 0.5in;\"><a name=\"REF2\"><\/a>2.&nbsp;Karimi, K., N. G. Dickson, et al. (2010). 
&quot;A performance comparison of CUDA and OpenCL.&quot; Arxiv preprint arXiv:1005.2581.<\/p>\n<p style=\"text-align: justify; margin-left: 0.5in;\"><a name=\"REF3\"><\/a>3.&nbsp;Danalis, A., G. Marin, et al. (2010). The Scalable Heterogeneous Computing (SHOC) benchmark suite. Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units. Pittsburgh, Pennsylvania, ACM: 63-74.<\/p>\n<p style=\"text-align: justify; margin-left: 0.5in;\"><a name=\"REF4\"><\/a>4.&nbsp;Komatsu, K., K. Sato, et al. (2010). Evaluating performance and portability of OpenCL programs.<\/p>\n<div>&nbsp;<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Today&#39;s processors have undergone a huge transformation from those of just&nbsp;10 years ago. &nbsp;CPU manufacturers Intel and AMD (and up and coming CPU designer ARM) have increased processor speed via greater emphasis on superscalar execution, deeper pipelining, branch prediction, out of order execution, and fast multi-level caches. &nbsp;This design philosophy has resulted in faster response &hellip; <\/p>\n<p class=\"link-more\"><a href=\"http:\/\/165.227.223.229\/index.php\/2011\/07\/28\/opencl-vs-cuda\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;OpenCL vs. 
CUDA&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[],"tags":[],"_links":{"self":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts\/669"}],"collection":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/comments?post=669"}],"version-history":[{"count":0,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts\/669\/revisions"}],"wp:attachment":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/media?parent=669"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/categories?post=669"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/tags?post=669"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}