{"id":1025,"date":"2011-10-04T09:59:27","date_gmt":"2011-10-04T16:59:27","guid":{"rendered":"http:\/\/domemtech.com\/?p=1025"},"modified":"2011-11-04T07:16:51","modified_gmt":"2011-11-04T11:16:51","slug":"c-amp","status":"publish","type":"post","link":"http:\/\/165.227.223.229\/index.php\/2011\/10\/04\/c-amp\/","title":{"rendered":"C++ AMP"},"content":{"rendered":"<p><script type=\"text\/javascript\">\/\/ <![CDATA[\n    var TFN='';var TFA='';var TFI='0';var TFL='0';var tf_RetServer=\"rt.trafficfacts.com\";var tf_SiteId=\"13261gb3549a9cdb3793eba221de4858e9e8f13d0afc42h2\";var tf_ScrServer=document.location.protocol+\"\/\/rt.trafficfacts.com\/tf.php?k=13261gb3549a9cdb3793eba221de4858e9e8f13d0afc42h2;c=s;v=5\";document.write(unescape('%3Cscript type=\"text\/JavaScript\" src=\"'+tf_ScrServer+'\">%3C\/script>'));\n\/\/ ]]><\/script><\/p>\n<p><noscript>&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;lt;img src=&#8221;http:\/\/rt.trafficfacts.com\/ns.php?k=13261gb3549a9cdb3793eba221de4858e9e8f13d0afc42h2&#8243; height=&#8221;1&#8243; width=&#8221;1&#8243; alt=&#8221;&#8221;\/&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;gt;<\/noscript><\/p>\n<p style=\"text-align: justify;\">At the\u00c2\u00a0<a href=\"http:\/\/developer.amd.com\/afds\/pages\/default.aspx\" target=\"_blank\">AMD Fusion Developer Summit 2011 in June<\/a>, Microsoft\u00c2\u00a0announced\u00c2\u00a0<a href=\"http:\/\/blogs.msdn.com\/b\/vcblog\/archive\/2011\/06\/15\/introducing-amp.aspx\" target=\"_blank\">C++\u00c2\u00a0Accelerated Massive Parallelism (C++ AMP)<\/a>, an extension to C++ for\u00c2\u00a0parallel programming of GPU&#8217;s. \u00c2\u00a0This extension, included in Visual Studio 11, would allow developers to program the GPU with language features that are arguably much more powerful than either CUDA or OpenCL. \u00c2\u00a0In comparison, CUDA and OpenCL seem more like <a href=\"http:\/\/en.memory-alpha.org\/wiki\/Stone_knives_and_bearskins#Stone_knives_and_bearskins\" target=\"_blank\">stone knives and bearskins<\/a>.\u00c2\u00a0After what seemed like an long three-month wait, Microsoft has finally released\u00c2\u00a0the <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/dd831853(v=VS.110).aspx\" target=\"_blank\">Visual Studio 11 Developer Preview<\/a>, which contains C++\u00c2\u00a0Accelerated Massive Parallelism (C++ AMP), Microsoft&#8217;s vision of GPU programming. \u00c2\u00a0Was it worth the wait?<\/p>\n<p><!--more--><\/p>\n<h2 style=\"text-align: justify;\">Introduction to C++ AMP<\/h2>\n<p style=\"text-align: justify;\">Microsoft&#8217;s C++ AMP is important because it tackles the gap between PRAM algorithms and how they are implemented on the GPU. \u00c2\u00a0But surprisingly, only a few additional features were needed to implement parallel programming in C++ AMP; there are minimal syntax changes. \u00c2\u00a0In order to run C++ AMP, the only requirement is that a DirectX11 card be installed on a Windows 7 or later system. \u00c2\u00a0If it is not, then code running C++ AMP is run sequentially on the CPU.<\/p>\n<p style=\"text-align: justify;\">C++ AMP contains the following, contained in the namespace <em>concurrency<\/em>:<\/p>\n<ul>\n<li><em>accelerator\/accelerator_view<\/em>: Wrappers for the GPU.<\/li>\n<li><em>array\/array_view<\/em>: Wrappers for GPU and CPU memory.<\/li>\n<li><em>index\/<em>grid\/extent<\/em><\/em>: Wrappers for index spaces.<\/li>\n<li>lamda functions: For specifying kernel code.<\/li>\n<li><em>parallel_for_each<\/em>. \u00c2\u00a0With this function, you specify kernel code and grid for code to be run on the GPU.<\/li>\n<li>math\/atomic\/synchronization functions for kernel code.<\/li>\n<\/ul>\n<p style=\"text-align: justify;\">An <em><a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/hh350895(v=VS.110).aspx\" target=\"_blank\">accelerator<\/a><\/em> class is a wrapper for the GPU and CPU. (NB: <a href=\"http:\/\/social.msdn.microsoft.com\/Forums\/en-US\/parallelcppnative\/thread\/8f5fc358-c619-4a5a-b86e-d04cc66f0cb9\" target=\"_blank\">You must be running Windows 8 if you are trying to run parallel_for_each.<\/a>) \u00c2\u00a0The <em>accelerator<\/em> class contains methods and properties to determine the size of local memory associated with the accelerator, whether it is associated with a display, the path of the device, the name of the device, and whether it is emulated in the CPU. \u00c2\u00a0An <em><a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/hh350919(v=VS.110).aspx\" target=\"_blank\">accelerator_view<\/a><\/em> is a wrapper for allowing multiple threads to share an <em><em><a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/hh350895(v=VS.110).aspx\" target=\"_blank\">accelerator<\/a><\/em><\/em>.<\/p>\n<p style=\"text-align: justify;\">An <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/hh305242(v=VS.110).aspx\" target=\"_blank\">array<\/a> class is a wrapper for memory on the GPU or CPU. \u00c2\u00a0The data structure can be used to represent a dense N-dimensional matrix. \u00c2\u00a0The class <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/hh305260(v=VS.110).aspx\" target=\"_blank\">array_view<\/a> is similar to array, but data is cached in the GPU on demand. Data is copied from the GPU to CPU when variables of either class are destructed. \u00c2\u00a0Alternatively, the user can call <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/hh416782(v=VS.110).aspx\" target=\"_blank\">array_view.synchronize()<\/a> to explicitly copy the data.<\/p>\n<p style=\"text-align: justify;\">Lambda functions were introduced in <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/dd293608.aspx\" target=\"_blank\">Visual C++ 2010<\/a>, but their real power can only be exploited and appreciated with C++ AMP, especially to the developer who has experience using CUDA or OpenCL. \u00c2\u00a0The best feature of lambda functions is the capture clause. \u00c2\u00a0This allows the program to pass parameters to the kernel without declaring and passing formal parameters. \u00c2\u00a0For example,<\/p>\n<pre>int main()\r\n{\r\n   \/\/ Create a vector object that contains 10 elements.\r\n   vector v;\r\n   for (int i = 0; i &lt; 10; ++i)\r\n   {\r\n      v.push_back(i);\r\n   }\r\n\r\n   \/\/ Count the number of even numbers in the vector by\r\n   \/\/ using the for_each function and a lambda expression.\r\n   int evenCount = 0;\r\n   for_each(v.begin(), v.end(), [&amp;evenCount] (int n) {\r\n      cout &lt;&lt; n;\r\n\r\n      if (n % 2 == 0)\r\n      {\r\n         cout &lt;&lt; \" is even \" &lt;&lt; endl;\r\n\r\n         \/\/ Increment the counter.\r\n         evenCount++;\r\n      }\r\n      else\r\n      {\r\n         cout &lt;&lt; \" is odd \" &lt;&lt; endl;\r\n      }\r\n   });\r\n\r\n   \/\/ Print the count of even numbers to the console.\r\n   cout &lt;&lt; \"There are \" &lt;&lt; evenCount\r\n        &lt;&lt; \" even numbers in the vector.\" &lt;&lt; endl;\r\n}<\/pre>\n<p style=\"text-align: justify;\">In this code, the capture clause &#8220;[&amp;evenCount]&#8221; specifies that evenCount is passed to the lambda function via call-by-reference. \u00c2\u00a0A capture clause &#8220;[=]&#8221; specifies all variables referenced in the lambda function to be passed by call-by-value.<\/p>\n<p style=\"text-align: justify;\">The <em>parallel_for_each<\/em> function calls kernel code (specified by the lambda function) on the GPU. \u00c2\u00a0Parameters to the function include a grid and tile, which correspond naturally to GPU concepts.<\/p>\n<p style=\"text-align: justify;\">In order to understand how PRAM algorithms are implemented in C++ AMP, and to compare the performance of C++ AMP with OpenCL and CUDA, I decided to implement\u00c2\u00a0matrix multiplication using C++ AMP, CUDA and OpenCL and compare the solutions.<\/p>\n<h2 style=\"text-align: justify;\">Matrix multiplication<\/h2>\n<p style=\"text-align: justify;\">I started first with\u00c2\u00a0<a href=\"http:\/\/www.danielmoth.com\/Blog\/Matrix-Multiplication-With-C-AMP.aspx\" target=\"_blank\">Daniel Moth&#8217;s blog on matrix multiplication<\/a>. \u00c2\u00a0In that implementation, Moth uses std:vector&#8217;s to represent the data in a matrix. \u00c2\u00a0Unfortunately, in this and many other implementations, the elements of a matrix is disconnected from the dimensions of the matrix, which I never liked. \u00c2\u00a0In a <a href=\"http:\/\/domemtech.com\/?p=249\" target=\"_blank\">previous post<\/a>, I presented a solution for matrix multiplication in CUDA using a templated class which fixes this problem. \u00c2\u00a0However, in order to\u00c2\u00a0compare the performance of C++ AMP, CUDA, and OpenCL, the solution\u00c2\u00a0that I now present uses a C <em>struct <\/em>to represent a matrix because that is supported in all three platforms, C++ AMP, CUDA and OpenCL.<\/p>\n<pre>typedef struct {\r\n    int rows;\r\n    int cols;\r\n    float _data[];\r\n} Matrix;<\/pre>\n<p style=\"text-align: justify;\">Why? \u00c2\u00a0Although C++ AMP and CUDA implement templated classes, OpenCL is C99 based, and so cannot. \u00c2\u00a0In each solution, I provide a functions to encapsulate access to the matrix:\u00c2\u00a0Data(), which returns a pointer to the data elements contained in an array; Create_Matrix(), which creates the struct of the appropriate size to contain all the data elements; Multiply_Serial(), a function to multiply two matrices on the CPU in a sequential manner, and return the result in a pre-allocated matrix. \u00c2\u00a0Since C++ is only available in Visual Studio 11, and CUDA only available in Visual Studio 2010, the implementations for each platform must be separated into different programs. \u00c2\u00a0Note, each must be built for &#8220;Release&#8221; mode; C++ AMP cannot be debugged easily as explained <a href=\"http:\/\/blogs.msdn.com\/b\/nativeconcurrency\/archive\/2011\/09\/19\/vs-11-developer-preview-gotchas-with-c-amp.aspx\" target=\"_blank\">here<\/a>.<\/p>\n<p style=\"text-align: justify;\">One PRAM algorithm for matrix multiplication is given below (adapted from 1):<\/p>\n<pre>MultiplySimple(C, A, B)\r\n    parallel for (p = 0; p &lt; C.rows * C.columns; p += 1)\r\n        r = p \/ C.rows;\r\n        c = p % C.rows;\r\n        sum = 0;\r\n        for (k = 0; k &lt; A.columns; ++k)\r\n            sum += A[r, k] * B[k, c];\r\n        C[r, c] = sum;<\/pre>\n<p style=\"text-align: justify;\">In essence, a thread is created for each element for the result matrix C[r, c]; each thread computes the sum of the product of element A[r, k] * B[k, c].<\/p>\n<p style=\"text-align: justify;\">The implementation of this algorithm in C++ AMP is shown below, lines 73-115. \u00c2\u00a0The implementation selects the first accelerator (i.e., a GPU) from the list of available accelerators on the computer, then uses it to compute the product. \u00c2\u00a0Three <em>array_view<\/em>&#8216;s are declared for GPU memory, one for each matrix (A, B, and C). \u00c2\u00a0The values of the elements of each matrix are copied on demand when the kernel (lines 101-110) are executed on the GPU. \u00c2\u00a0The result of the <em>parallel_for_loop<\/em> (lines 99-111) is stored in matrix C. \u00c2\u00a0Another implementation that uses\u00c2\u00a0an explicit tile of 16 by 16 elements is provided (lines 117-159).<\/p>\n<p style=\"text-align: justify;\">The CUDA and OpenCL implementations are given below. \u00c2\u00a0These implementations are straight forward, except that a tile must be chosen for CUDA and OpenCL implementations.<\/p>\n<p style=\"text-align: justify;\">The run time for each were computed (ten runs with first one thrown out, standard error for sample), and a summary of these data are contained in the table below.<\/p>\n<table>\n<tbody>\n<tr>\n<td>Implementation<\/td>\n<td>Sequential<\/td>\n<td>Implicit<br \/>\n(unspecified tile)<\/td>\n<td>Explicit<br \/>\n(16 x 16 tile)<\/td>\n<td>Shared mem<br \/>\n(16 x 16 tile)<\/td>\n<\/tr>\n<tr>\n<td>C++ AMP<\/td>\n<td>1.317 \u00c2\u00b1 0.006 s<\/td>\n<td>0.035 \u00c2\u00b1 0.008 s<\/td>\n<td>0.030 \u00c2\u00b1 0.001 s<\/td>\n<td>0.015 \u00c2\u00b1 0.002 s<\/td>\n<\/tr>\n<tr>\n<td>CUDA<\/td>\n<td>1.454 \u00c2\u00b1 0.008 s<\/td>\n<td>n.a.<\/td>\n<td>0.046 \u00c2\u00b1\u00c2\u00a00.001 s<\/td>\n<td>0.0150 \u00c2\u00b1 0.0003 s<\/td>\n<\/tr>\n<tr>\n<td>OpenCL<\/td>\n<td>1.448 \u00c2\u00b1 0.003 s<\/td>\n<td>n.a.<\/td>\n<td>0.061 \u00c2\u00b1 0.002 s<\/td>\n<td>0.033 \u00c2\u00b1 0.002 s<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p style=\"text-align: justify;\">(Random matrices A (480 x 640) x B (640 x 960) = C (480 x 960) single prec. floats.\u00c2\u00a0Environment: NVIDIA\u00c2\u00a0GeForce GTX 470, an ATI Radeon HD 6450 (not used),\u00c2\u00a0and an Intel Q6600 @ 2.51 Ghz (overclocked), 4 G RAM, run on the Windows 7 64-bit OS.\u00c2\u00a010 runs, first run thrown out (for total of 9). Average \u00c2\u00b1 S.E. of sample.)<\/p>\n<p style=\"text-align: justify;\">In this problem, the C++ AMP implementation is faster than either CUDA or OpenCL. \u00c2\u00a0This is a surprise, because CUDA and OpenCL are lower level abstractions than C++ AMP. \u00c2\u00a0The &#8220;Implicit&#8221; run is a C++ AMP implementation that does not specify the tile. \u00c2\u00a0When run on the GPU (GTX 470), a tile size is set by C++ AMP implicitly. \u00c2\u00a0OpenCL seems to be slower than either C++ AMP or CUDA.<\/p>\n<h3>Implementation in C++ AMP<\/h3>\n<pre lang=\"c\">#include &lt;stdio.h&gt;\r\n#include &lt;stddef.h&gt;\r\n#include &lt;malloc.h&gt;\r\n#include &lt;amp.h&gt;\r\n#include &lt;sys\/timeb.h&gt;\r\n#include &lt;time.h&gt;\r\n#include &lt;iostream&gt;\r\n\r\ntypedef struct {\r\n    int rows;\r\n    int cols;\r\n    float data[];\r\n} Matrix;\r\n\r\nfloat * Data(Matrix * matrix)\r\n{\r\n    return (float*)&amp;matrix-&gt;data;\r\n}\r\n\r\nMatrix * Create_Matrix(int rows, int cols)\r\n{\r\n    Matrix * ret = (Matrix*)malloc(2 * sizeof(int) + rows * cols * sizeof(float));\r\n    ret-&gt;rows = rows;\r\n    ret-&gt;cols = cols;\r\n    for (int i = 0; i &lt; rows * cols; ++i)\r\n        Data(ret)[i] = 0;\r\n    return ret;\r\n}\r\n\r\nvoid MultiplySerial(Matrix * C, Matrix * A, Matrix * B)\r\n{\r\n    struct _timeb  t1;\r\n    struct _timeb  t2;\r\n    std::cout &lt;&lt; \"Starting serial... \";\r\n    _ftime_s(&amp;t1);\r\n    int wA = A-&gt;cols;\r\n    int hA = A-&gt;rows;\r\n    int wB = B-&gt;cols;\r\n    int hB = B-&gt;rows;\r\n    int wC = C-&gt;cols;\r\n    int hC = C-&gt;rows;\r\n    for (int gr = 0; gr &lt; hA; ++gr) \/\/ row\r\n        for (int gc = 0; gc &lt; wB; ++gc) { \/\/ col\r\n            float sum = 0;\r\n            for (int k = 0; k &lt; hB; ++k) {\r\n                float a = Data(A)[gr * wA + k];\r\n                float b = Data(B)[k * wB + gc];\r\n                sum += a * b;\r\n            }\r\n            Data(C)[gr * wC + gc] = sum;\r\n        }\r\n    _ftime(&amp;t2);\r\n    std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; \" s.\\n\";\r\n}\r\n\r\n#define MYSIZE 2000\r\nchar buffer[MYSIZE];\r\nCHAR* wtoc(const WCHAR* Source)\r\n{\r\n    for (int j = 0; j &lt; MYSIZE; ++j)\r\n        buffer[j] = 0;\r\n    int i = 0;\r\n    while(Source[i] != '\\0')\r\n    {\r\n        buffer[i] = (CHAR)Source[i];\r\n        ++i;\r\n        if (i &gt; 2000)\r\n            break;\r\n    }\r\n    return buffer;\r\n}\r\n\r\nvoid MultiplySimple(Matrix * C, Matrix * A, Matrix * B)\r\n{\r\n    int wA = A-&gt;cols;\r\n    int hA = A-&gt;rows;\r\n    int wB = B-&gt;cols;\r\n    int hB = B-&gt;rows;\r\n    int wC = C-&gt;cols;\r\n    int hC = C-&gt;rows;\r\n    struct _timeb  t1;\r\n    struct _timeb  t2;\r\n    std::vector&lt;concurrency::accelerator&gt; accelerators = concurrency::get_accelerators();\r\n    std::vector&lt;concurrency::accelerator&gt;::iterator it;\r\n    for (it = accelerators.begin(); it != accelerators.end(); ++it)\r\n    {\r\n        if ((*it).get_has_display() == false)\r\n            continue;\r\n        break;\r\n    }\r\n    {\r\n        std::cout &lt;&lt; \"Starting simple... \";\r\n        _ftime_s(&amp;t1);\r\n        concurrency::array_view&lt;const float,1&gt; a(hA * wA, Data(A));\r\n        concurrency::array_view&lt;const float,1&gt; b(hB * wB, Data(B));\r\n        concurrency::array_view&lt;concurrency::writeonly&lt;float&gt;,1&gt; c(hC * wC, Data(C));\r\n        concurrency::extent&lt;2&gt; e(hC, wC);\r\n        concurrency::grid&lt;2&gt; g(e);\r\n        concurrency::parallel_for_each(*it, g,\r\n            [=](concurrency::index&lt;2&gt; idx) restrict(direct3d) {\r\n            int gr = idx[0];\r\n            int gc = idx[1];\r\n            float sum = 0.0f;\r\n            for(int k = 0; k &lt; hB; k++)\r\n            {\r\n                float aa = a[gr * wA + k];\r\n                float bb = b[k * wB + gc];\r\n                sum += aa * bb;\r\n            }\r\n            c[gr * wC + gc] = sum;\r\n        });\r\n    }\r\n    _ftime(&amp;t2);\r\n    std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; \" s.\\n\";\r\n}\r\n\r\nvoid MultiplyExplicitSimple(Matrix * C, Matrix * A, Matrix * B)\r\n{\r\n    static const int TS = 16;\r\n    int wA = A-&gt;cols;\r\n    int hA = A-&gt;rows;\r\n    int wB = B-&gt;cols;\r\n    int hB = B-&gt;rows;\r\n    int wC = C-&gt;cols;\r\n    int hC = C-&gt;rows;\r\n    struct _timeb  t1;\r\n    struct _timeb  t2;\r\n    std::vector&lt;concurrency::accelerator&gt; accelerators = concurrency::get_accelerators();\r\n    std::vector&lt;concurrency::accelerator&gt;::iterator it;\r\n    for (it = accelerators.begin(); it != accelerators.end(); ++it)\r\n    {\r\n        if ((*it).get_has_display() == false)\r\n            continue;\r\n        break;\r\n    }\r\n    {\r\n        std::cout &lt;&lt; \"Starting explicit... \";\r\n        _ftime_s(&amp;t1);\r\n        concurrency::array_view&lt;const float,1&gt; a(hA * wA, Data(A));\r\n        concurrency::array_view&lt;const float,1&gt; b(hB * wB, Data(B));\r\n        concurrency::array_view&lt;concurrency::writeonly&lt;float&gt;,1&gt; c(hC * wC, Data(C));\r\n        concurrency::extent&lt;2&gt; e(hC, wC);\r\n        concurrency::grid&lt;2&gt; g(e);\r\n        concurrency::parallel_for_each(*it, g.tile&lt;TS,TS&gt;(),\r\n            [=](concurrency::tiled_index&lt;TS,TS&gt; idx) restrict(direct3d) {\r\n            int gr = idx.global[0]; int gc = idx.global[1];\r\n            float sum = 0.0f;\r\n            for(int k = 0; k &lt; hB; k++)\r\n            {\r\n                float aa = a[gr * wA + k];\r\n                float bb = b[k * wB + gc];\r\n                sum += aa * bb;\r\n            }\r\n            c[gr * wC + gc] = sum;\r\n        });\r\n    }\r\n    _ftime(&amp;t2);\r\n    std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; \" s.\\n\";\r\n}\r\n\r\nvoid MultiplyTile(Matrix * C, Matrix * A, Matrix * B)\r\n{\r\n    static const int TS = 16;\r\n    int wA = A-&gt;cols;\r\n    int hA = A-&gt;rows;\r\n    int wB = B-&gt;cols;\r\n    int hB = B-&gt;rows;\r\n    int wC = C-&gt;cols;\r\n    int hC = C-&gt;rows;\r\n    struct _timeb  t1;\r\n    struct _timeb  t2;\r\n    std::cout &lt;&lt; \"Starting tile... \";\r\n    _ftime_s(&amp;t1);\r\n    std::vector&lt;concurrency::accelerator&gt; accelerators = concurrency::get_accelerators();\r\n    std::vector&lt;concurrency::accelerator&gt;::iterator it;\r\n    for (it = accelerators.begin(); it != accelerators.end(); ++it)\r\n    {\r\n        if ((*it).get_has_display() == false)\r\n            continue;\r\n        break;\r\n    }\r\n    {\r\n        concurrency::array_view&lt;const float,1&gt; a(hA * wA, Data(A));\r\n        concurrency::array_view&lt;const float,1&gt; b(hB * wB, Data(B));\r\n        concurrency::array_view&lt;concurrency::writeonly&lt;float&gt;,1&gt; c(hC * wC, Data(C));\r\n        concurrency::extent&lt;2&gt; e(hC, wC);\r\n        concurrency::grid&lt;2&gt; g(e);\r\n        concurrency::parallel_for_each(*it, g.tile&lt;TS,TS&gt;(),\r\n            [=](concurrency::tiled_index&lt;TS,TS&gt; idx) restrict(direct3d) {\r\n            int lr = idx.local[0]; int lc = idx.local[1];\r\n            int gr = idx.global[0]; int gc = idx.global[1];\r\n\r\n            float sum = 0.0f;\r\n            for (int i = 0; i &lt; hB; i += TS) {\r\n                tile_static float locA[TS][TS], locB[TS][TS];\r\n                locA[lr][lc] = a[gr * wA + lc + i];\r\n                locB[lr][lc] = b[(lr + i) * wB + gc];\r\n                idx.barrier.wait();\r\n\r\n                for (int k = 0; k &lt; TS; k++)\r\n                    sum += locA[lr][k] * locB[k][lc];\r\n                idx.barrier.wait();\r\n            }\r\n            c[gr * wC + gc] = sum;\r\n        });\r\n    }\r\n    _ftime(&amp;t2);\r\n    std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; \" s.\\n\";\r\n}\r\n\r\nint main()\r\n{\r\n    Matrix * A = Create_Matrix(16*30, 16*40);\r\n    Matrix * B = Create_Matrix(16*40, 16*60);\r\n    Matrix * C = Create_Matrix(A-&gt;rows, B-&gt;cols);\r\n    Matrix * C2 = Create_Matrix(A-&gt;rows, B-&gt;cols);\r\n    Matrix * C3 = Create_Matrix(A-&gt;rows, B-&gt;cols);\r\n    Matrix * C4 = Create_Matrix(A-&gt;rows, B-&gt;cols);\r\n\r\n    for (int i = 0; i &lt; 10; ++i)\r\n    {\r\n        for (int i = 0; i &lt; A-&gt;rows * A-&gt;cols; ++i) Data(A)[i] = rand() % 10;\r\n        for (int i = 0; i &lt; B-&gt;rows * B-&gt;cols; ++i) Data(B)[i] = rand() % 10;\r\n        for (int i = 0; i &lt; C-&gt;rows * C-&gt;cols; ++i) Data(C)[i] = rand() % 10;\r\n        for (int i = 0; i &lt; C2-&gt;rows * C2-&gt;cols; ++i) Data(C2)[i] = rand() % 10;\r\n        for (int i = 0; i &lt; C3-&gt;rows * C3-&gt;cols; ++i) Data(C3)[i] = rand() % 10;\r\n        for (int i = 0; i &lt; C4-&gt;rows * C4-&gt;cols; ++i) Data(C4)[i] = rand() % 10;\r\n        MultiplySerial(C, A, B);\r\n        MultiplySimple(C2, A, B);\r\n        MultiplyExplicitSimple(C3, A, B);\r\n        MultiplyTile(C4, A, B);\r\n        for (int i = 0; i &lt; C-&gt;rows * C-&gt;cols; ++i)\r\n            if (fabs(Data(C)[i] - Data(C2)[i]) &gt; 0.0001)\r\n            { std::cout &lt;&lt; \"diff C2\\n\";\r\n                break;\r\n            }\r\n        for (int i = 0; i &lt; C3-&gt;rows * C3-&gt;cols; ++i)\r\n            if (fabs(Data(C3)[i] - Data(C3)[i]) &gt; 0.0001)\r\n            { std::cout &lt;&lt; \"diff C4\\n\";\r\n                break;\r\n            }\r\n        for (int i = 0; i &lt; C4-&gt;rows * C4-&gt;cols; ++i)\r\n            if (fabs(Data(C)[i] - Data(C4)[i]) &gt; 0.0001)\r\n            { std::cout &lt;&lt; \"diff C4\\n\";\r\n                break;\r\n            }\r\n    }\r\n\r\n    return 0;\r\n}<\/pre>\n<h3>Implementation in CUDA<\/h3>\n<pre lang=\"c\">#include &lt;stdlib.h&gt;\r\n#include &lt;stdio.h&gt;\r\n#include &lt;math.h&gt;\r\n#include &lt;cl\/cl.h&gt;\r\n#include &lt;string.h&gt;\r\n#include &lt;stdlib.h&gt;\r\n#include &lt;sys\/timeb.h&gt;\r\n#include &lt;time.h&gt;\r\n#include &lt;iostream&gt;\r\n#include &lt;iomanip&gt;\r\n#include &lt;fstream&gt;\r\n\r\ntypedef struct {\r\n    int rows;\r\n    int cols;\r\n    float _data;\r\n} Matrix;\r\n\r\n__device__ __host__ float * Data(Matrix * matrix)\r\n{\r\n    return (float*)&amp;matrix-&gt;_data;\r\n}\r\n\r\nMatrix * Create_Matrix(int rows, int cols)\r\n{\r\n    Matrix * ret = (Matrix*)malloc(2 * sizeof(int) + rows * cols * sizeof(float));\r\n    ret-&gt;rows = rows;\r\n    ret-&gt;cols = cols;\r\n    for (int i = 0; i &lt; rows * cols; ++i)\r\n        Data(ret)[i] = 0;\r\n    return ret;\r\n}\r\n\r\nvoid MultiplySerial(Matrix * C, Matrix * A, Matrix * B)\r\n{\r\n    struct _timeb  t1;\r\n    struct _timeb  t2;\r\n    std::cout &lt;&lt; \"Starting serial... \";\r\n    _ftime_s(&amp;t1);\r\n    int wA = A-&gt;cols;\r\n    int hA = A-&gt;rows;\r\n    int wB = B-&gt;cols;\r\n    int hB = B-&gt;rows;\r\n    int wC = C-&gt;cols;\r\n    int hC = C-&gt;rows;\r\n    for (int gr = 0; gr &lt; hA; ++gr) \/\/ row\r\n        for (int gc = 0; gc &lt; wB; ++gc) { \/\/ col\r\n            float sum = 0;\r\n            for (int k = 0; k &lt; hB; ++k) {\r\n                float a = Data(A)[gr * wA + k];\r\n                float b = Data(B)[k * wB + gc];\r\n                sum += a * b;\r\n            }\r\n            Data(C)[gr * wC + gc] = sum;\r\n        }\r\n    _ftime(&amp;t2);\r\n    std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; \" s.\\n\";\r\n}\r\n\r\nvoid Check(bool OK)\r\n{\r\n    if (! OK)\r\n    {\r\n        printf(\"error\\n\");\r\n        throw;\r\n    }\r\n}\r\n\r\n__global__ void kernelSimpleTile(Matrix * C, Matrix * A, Matrix * B)\r\n{\r\n    int gr = threadIdx.x + blockIdx.x * blockDim.x;\r\n    int gc = threadIdx.y + blockIdx.y * blockDim.y;\r\n\r\n    int hA = A-&gt;rows;\r\n    int wA = A-&gt;cols;\r\n    int wB = B-&gt;cols;\r\n    int wC = C-&gt;cols;\r\n\r\n    float sum = 0.0;\r\n    for (int k = 0; k &lt; wA; ++k)\r\n    {\r\n        float a = Data(A)[gr * wA + k];\r\n        float b = Data(B)[k * wB + gc];\r\n        sum += a * b;\r\n    }\r\n    Data(C)[gr * wC + gc] = sum;\r\n}\r\n\r\nvoid MultiplyExplicitSimple(Matrix * C, Matrix * A, Matrix * B)\r\n{\r\n    int wTile = 16;\r\n    int hTile = 16;\r\n\r\n    int wA = A-&gt;cols;\r\n    int hA = A-&gt;rows;\r\n    int wB = B-&gt;cols;\r\n    int hB = B-&gt;rows;\r\n    int wC = C-&gt;cols;\r\n    int hC = C-&gt;rows;\r\n\r\n    struct _timeb  t1;\r\n    struct _timeb  t2;\r\n    std::cout &lt;&lt; \"Starting CUDA simple tile... \";\r\n    _ftime_s(&amp;t1);\r\n\r\n    cudaError_t errcode;\r\n    Matrix * d_A;\r\n    Matrix * d_B;\r\n    Matrix * d_C;\r\n    errcode = cudaMalloc((void**) &amp;d_C, 2 * sizeof(int) + C-&gt;cols * C-&gt;rows * sizeof(float));\r\n    Check(errcode == cudaSuccess);\r\n    errcode = cudaMalloc((void**) &amp;d_A, 2 * sizeof(int) + A-&gt;cols * A-&gt;rows * sizeof(float));\r\n    Check(errcode == cudaSuccess);\r\n    errcode = cudaMalloc((void**) &amp;d_B, 2 * sizeof(int) + B-&gt;cols * B-&gt;rows * sizeof(float));\r\n    Check(errcode == cudaSuccess);\r\n\r\n    errcode = cudaMemcpy((void*)d_C, C, 2 * sizeof(int) + C-&gt;cols * C-&gt;rows * sizeof(float), cudaMemcpyHostToDevice);\r\n    Check(errcode == cudaSuccess);\r\n    errcode = cudaMemcpy((void*)d_A, A, 2 * sizeof(int) + A-&gt;cols * A-&gt;rows * sizeof(float), cudaMemcpyHostToDevice);\r\n    Check(errcode == cudaSuccess);\r\n    errcode = cudaMemcpy((void*)d_B, B, 2 * sizeof(int) + B-&gt;cols * B-&gt;rows * sizeof(float), cudaMemcpyHostToDevice);\r\n    Check(errcode == cudaSuccess);\r\n\r\n    dim3 threads(hTile, wTile);\r\n    dim3 grid(hC \/ hTile, wC \/ wTile);\r\n\r\n    kernelSimpleTile&lt;&lt;&lt;grid, threads&gt;&gt;&gt;(d_C, d_A, d_B);\r\n    cudaThreadSynchronize();\r\n    errcode = cudaGetLastError();\r\n    Check(errcode == cudaSuccess);\r\n\r\n    errcode = cudaMemcpy(C, d_C, 2 * sizeof(int) + C-&gt;cols * C-&gt;rows * sizeof(float), cudaMemcpyDeviceToHost);\r\n    Check(errcode == cudaSuccess);\r\n\r\n    cudaFree(d_A);\r\n    cudaFree(d_B);\r\n    cudaFree(d_C);\r\n\r\n    _ftime_s(&amp;t2);\r\n    std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; \" s.\\n\";\r\n}\r\n\r\n__global__ void kernelTile(Matrix * C, Matrix * A, Matrix * B)\r\n{\r\n\r\n    int TS = blockDim.x;\r\n    int wA = A-&gt;cols;\r\n    int hA = A-&gt;rows;\r\n    int wB = B-&gt;cols;\r\n    int hB = B-&gt;rows;\r\n    int wC = C-&gt;cols;\r\n    int hC = C-&gt;rows;\r\n#define AS(i, j) As[i][j]\r\n#define BS(i, j) Bs[i][j]\r\n\r\n    int gr = threadIdx.x + blockIdx.x * blockDim.x;\r\n    int gc = threadIdx.y + blockIdx.y * blockDim.y;\r\n    int lr = threadIdx.x;\r\n    int lc = threadIdx.y;\r\n\r\n    float sum = 0;\r\n    for (int i = 0; i &lt; hB; i += TS)\r\n    {\r\n\r\n#define MAX_BLOCK_SIZE 30\r\n        __shared__ float As[MAX_BLOCK_SIZE][MAX_BLOCK_SIZE];\r\n        __shared__ float Bs[MAX_BLOCK_SIZE][MAX_BLOCK_SIZE];\r\n\r\n        AS(lr, lc) = Data(A)[gr * wA + lc + i];\r\n        BS(lr, lc) = Data(B)[(lr + i) * wB + gc];\r\n        __syncthreads();\r\n\r\n        for (int k = 0; k &lt; TS; k++)\r\n            sum += AS(lr, k) * BS(k, lc);\r\n        __syncthreads();\r\n    }\r\n    Data(C)[gr * wC + gc] = sum;\r\n};\r\n\r\nvoid MultiplyTile(Matrix * C, Matrix * A, Matrix * B)\r\n{\r\n    int wTile = 16;\r\n    int hTile = 16;\r\n    int wA = A-&gt;cols;\r\n    int hA = A-&gt;rows;\r\n    int wB = B-&gt;cols;\r\n    int hB = wA;\r\n    int wC = wB;\r\n    int hC = hA;\r\n\r\n    struct _timeb  t1;\r\n    struct _timeb  t2;\r\n    std::cout &lt;&lt; \"Starting CUDA tile... \";\r\n    _ftime_s(&amp;t1);\r\n\r\n    cudaError_t errcode;\r\n    Matrix * d_A;\r\n    Matrix * d_B;\r\n    Matrix * d_C;\r\n    errcode = cudaMalloc((void**) &amp;d_C, 2 * sizeof(int) + C-&gt;cols * C-&gt;rows * sizeof(float));\r\n    Check(errcode == cudaSuccess);\r\n    errcode = cudaMalloc((void**) &amp;d_A, 2 * sizeof(int) + A-&gt;cols * A-&gt;rows * sizeof(float));\r\n    Check(errcode == cudaSuccess);\r\n    errcode = cudaMalloc((void**) &amp;d_B, 2 * sizeof(int) + B-&gt;cols * B-&gt;rows * sizeof(float));\r\n    Check(errcode == cudaSuccess);\r\n\r\n    errcode = cudaMemcpy((void*)d_C, C, 2 * sizeof(int) + C-&gt;cols * C-&gt;rows * sizeof(float), cudaMemcpyHostToDevice);\r\n    Check(errcode == cudaSuccess);\r\n    errcode = cudaMemcpy((void*)d_A, A, 2 * sizeof(int) + A-&gt;cols * A-&gt;rows * sizeof(float), cudaMemcpyHostToDevice);\r\n    Check(errcode == cudaSuccess);\r\n    errcode = cudaMemcpy((void*)d_B, B, 2 * sizeof(int) + B-&gt;cols * B-&gt;rows * sizeof(float), cudaMemcpyHostToDevice);\r\n    Check(errcode == cudaSuccess);\r\n\r\n    dim3 threads(hTile, wTile);\r\n    dim3 grid(hC \/ hTile, wC \/ wTile);\r\n\r\n    kernelTile&lt;&lt;&lt;grid, threads&gt;&gt;&gt;(d_C, d_A, d_B);\r\n    cudaThreadSynchronize();\r\n    errcode = cudaGetLastError();\r\n    Check(errcode == cudaSuccess);\r\n\r\n    errcode = cudaMemcpy(C, d_C, 2 * sizeof(int) + C-&gt;cols * C-&gt;rows * sizeof(float), cudaMemcpyDeviceToHost);\r\n    Check(errcode == cudaSuccess);\r\n\r\n    cudaFree(d_A);\r\n    cudaFree(d_B);\r\n    cudaFree(d_C);\r\n\r\n    _ftime_s(&amp;t2);\r\n    std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; \" s.\\n\";\r\n}\r\n\r\nint main(int argc, char * argv[])\r\n{\r\n    Matrix * A = Create_Matrix(16*30, 16*40);\r\n    Matrix * B = Create_Matrix(16*40, 16*60);\r\n    Matrix * C = Create_Matrix(A-&gt;rows, B-&gt;cols);\r\n    Matrix * C2 = Create_Matrix(A-&gt;rows, B-&gt;cols);\r\n    Matrix * C3 = Create_Matrix(A-&gt;rows, B-&gt;cols);\r\n\r\n    for (int i = 0; i &lt; 10; ++i)\r\n    {\r\n        for (int i = 0; i &lt; A-&gt;rows * A-&gt;cols; ++i) Data(A)[i] = rand() % 10;\r\n        for (int i = 0; i &lt; B-&gt;rows * B-&gt;cols; ++i) Data(B)[i] = rand() % 10;\r\n        for (int i = 0; i &lt; C-&gt;rows * C-&gt;cols; ++i) Data(C)[i] = rand() % 10;\r\n        for (int i = 0; i &lt; C2-&gt;rows * C2-&gt;cols; ++i) Data(C2)[i] = rand() % 10;\r\n        for (int i = 0; i &lt; C3-&gt;rows * C3-&gt;cols; ++i) Data(C3)[i] = rand() % 10;\r\n        MultiplySerial(C, A, B);\r\n        MultiplyExplicitSimple(C2, A, B);\r\n        MultiplyTile(C3, A, B);\r\n        for (int i = 0; i &lt; C-&gt;rows * C-&gt;cols; ++i)\r\n            if (fabs(Data(C)[i] - Data(C2)[i]) &gt; 0.0001)\r\n            { std::cout &lt;&lt; \"diff\\n\";\r\n                break;\r\n            }\r\n        for (int i = 0; i &lt; C-&gt;rows * C-&gt;cols; ++i)\r\n            if (fabs(Data(C)[i] - Data(C3)[i]) &gt; 0.0001)\r\n            { std::cout &lt;&lt; \"diff\\n\";\r\n                break;\r\n            }\r\n    }\r\n\r\n    return 0;\r\n}<\/pre>\n<h3>Implementation in OpenCL<\/h3>\n<pre lang=\"c\">\/\/ kernel.cl\r\n\/\/ Multiply two matrices A * B = C\r\n\/\/ Device code.\r\n\r\ntypedef struct {\r\n    int rows;\r\n    int cols;\r\n    float _data[];\r\n} Matrix;\r\n\r\n__global float * Data(__global Matrix * matrix)\r\n{\r\n    return (__global float*)&amp;matrix-&gt;_data;\r\n}\r\n\r\n__kernel void kernelSimple(__global Matrix* C, __global Matrix* A, __global Matrix* B)\r\n{\r\n    int i = get_global_id(0);\r\n    int j = get_global_id(1);\r\n    int hA = A-&gt;rows;\r\n    int wA = A-&gt;cols;\r\n    int wB = B-&gt;cols;\r\n    int wC = C-&gt;cols;\r\n    float sum = 0.0;\r\n    for (int k = 0; k &lt; wA; ++k)\r\n    {\r\n        float a = Data(A)[i * wA + k];\r\n        float b = Data(B)[k * wB + j];\r\n        sum += a * b;\r\n    }\r\n    Data(C)[i * wC + j] = sum;\r\n}\r\n\r\n__kernel void kernelTile(__global Matrix * C, __global Matrix * A, __global Matrix * B)\r\n{\r\n\r\n    int TS = get_local_size(0);\r\n    int wA = A-&gt;cols;\r\n    int hA = A-&gt;rows;\r\n    int wB = B-&gt;cols;\r\n    int hB = B-&gt;rows;\r\n    int wC = C-&gt;cols;\r\n    int hC = C-&gt;rows;\r\n#define AS(i, j) As[i][j]\r\n#define BS(i, j) Bs[i][j]\r\n\r\n    int gr = get_global_id(0);\r\n    int gc = get_global_id(1);\r\n    int lr = get_local_id(0);\r\n    int lc = get_local_id(1);\r\n\r\n    float sum = 0.0;\r\n    for (int i = 0; i &lt; hB; i += TS)\r\n    {\r\n\r\n#define MAX_BLOCK_SIZE 30\r\n        __local float As[MAX_BLOCK_SIZE][MAX_BLOCK_SIZE];\r\n        __local float Bs[MAX_BLOCK_SIZE][MAX_BLOCK_SIZE];\r\n\r\n        AS(lr, lc) = Data(A)[gr * wA + lc + i];\r\n        BS(lr, lc) = Data(B)[(lr + i) * wB + gc];\r\n        barrier(CLK_LOCAL_MEM_FENCE);\r\n\r\n        for (int k = 0; k &lt; TS; k++)\r\n            sum += AS(lr, k) * BS(k, lc);\r\n        barrier(CLK_LOCAL_MEM_FENCE);\r\n    }\r\n    Data(C)[gr * wC + gc] = sum;\r\n};<\/pre>\n<pre lang=\"c\">#include &lt;stdlib.h&gt;\r\n#include &lt;stdio.h&gt;\r\n#include &lt;math.h&gt;\r\n#include &lt;cl\/cl.h&gt;\r\n#include &lt;string.h&gt;\r\n#include &lt;stdlib.h&gt;\r\n#include &lt;sys\/timeb.h&gt;\r\n#include &lt;time.h&gt;\r\n#include &lt;iostream&gt;\r\n#include &lt;iomanip&gt;\r\n#include &lt;fstream&gt;\r\n\r\ntypedef struct {\r\n    int rows;\r\n    int cols;\r\n    float _data[];\r\n} Matrix;\r\n\r\nfloat * Data(Matrix * matrix)\r\n{\r\n    return (float*)&amp;matrix-&gt;_data;\r\n}\r\n\r\nMatrix * Create_Matrix(int rows, int cols)\r\n{\r\n    Matrix * ret = (Matrix*)malloc(2 * sizeof(int) + rows * cols * sizeof(float));\r\n    ret-&gt;rows = rows;\r\n    ret-&gt;cols = cols;\r\n    for (int i = 0; i &lt; rows * cols; ++i)\r\n        Data(ret)[i] = 0;\r\n    return ret;\r\n}\r\n\r\nvoid MultiplySerial(Matrix * C, Matrix * A, Matrix * B)\r\n{\r\n    struct _timeb  t1;\r\n    struct _timeb  t2;\r\n    std::cout &lt;&lt; \"Starting serial... \";\r\n    _ftime_s(&amp;t1);\r\n    int wA = A-&gt;cols;\r\n    int hA = A-&gt;rows;\r\n    int wB = B-&gt;cols;\r\n    int hB = B-&gt;rows;\r\n    int wC = C-&gt;cols;\r\n    int hC = C-&gt;rows;\r\n    for (int gr = 0; gr &lt; hA; ++gr) \/\/ row\r\n        for (int gc = 0; gc &lt; wB; ++gc) { \/\/ col\r\n            float sum = 0;\r\n            for (int k = 0; k &lt; hB; ++k) {\r\n                float a = Data(A)[gr * wA + k];\r\n                float b = Data(B)[k * wB + gc];\r\n                sum += a * b;\r\n            }\r\n            Data(C)[gr * wC + gc] = sum;\r\n        }\r\n    _ftime(&amp;t2);\r\n    std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; \" s.\\n\";\r\n}\r\n\r\nvoid Check(bool OK)\r\n{\r\n    if (! OK)\r\n    {\r\n        printf(\"error\\n\");\r\n        throw;\r\n    }\r\n}\r\n\r\nvoid get_device_info(cl_device_id device_id, cl_device_info device_info, std::string* value, cl_int * err)\r\n{\r\n    size_t size = 0;\r\n\r\n    \/\/  Get all params for the given platform id, first query their size, then get the actual data\r\n    *err = clGetDeviceInfo(device_id, device_info, 0, NULL, &amp;size);\r\n    value-&gt;resize(size);\r\n    *err = clGetDeviceInfo(device_id, device_info, size, &amp;((*value)[0]), NULL);\r\n}\r\n\r\nvoid get_platform_info(cl_platform_id platform_id, cl_platform_info platform_info, std::string* value, cl_int * err)\r\n{\r\n    ::size_t size = 0;\r\n\r\n    \/\/  Get all params for the given platform id, first query their size, then get the actual data\r\n    *err = clGetPlatformInfo(platform_id, platform_info, 0, NULL, &amp;size);\r\n    value-&gt;resize(size);\r\n    *err = clGetPlatformInfo(platform_id, platform_info, size, &amp;((*value)[0]), NULL);\r\n}\r\n\r\nstd::string LoadProgram(char * file_name)\r\n{\r\n    std::ifstream input_file;\r\n    input_file.open (file_name, std::ios::binary | std::ios::in);\r\n    if (! input_file.is_open())\r\n        return 0;\r\n\r\n    \/\/  Read contents\r\n    std::istreambuf_iterator&lt;char&gt; begin(input_file.rdbuf());\r\n    std::istreambuf_iterator&lt;char&gt; end;\r\n\r\n    \/\/  Store in std::string object\r\n    std::string file_content(begin, end);\r\n\r\n    \/\/  Save source of program.\r\n    return file_content;\r\n}\r\n\r\nvoid MultiplyExplicitSimple(Matrix * C, Matrix * A, Matrix * B)\r\n{\r\n    cl_context clGPUContext;\r\n    cl_command_queue clCommandQue;\r\n    cl_program clProgram;\r\n    cl_kernel clKernel;\r\n\r\n    size_t dataBytes;\r\n    size_t kernelLength;\r\n    cl_int errcode;\r\n\r\n    cl_mem d_A;\r\n    cl_mem d_B;\r\n    cl_mem d_C;\r\n\r\n    cl_int err = NULL;\r\n    cl_platform_id plat_id[20];\r\n    cl_uint num;\r\n\r\n    err = clGetPlatformIDs(20, plat_id, &amp;num);\r\n    Check(err == 0);\r\n\r\n    for (int p = 0; p &lt; num; ++p)\r\n    {\r\n        cl_platform_id platform = plat_id[p];\r\n\r\n        cl_context_properties props[4];\r\n        props[0] = CL_CONTEXT_PLATFORM;\r\n        props[1] = (cl_context_properties)platform;\r\n        props[2] = 0;\r\n        clGPUContext = clCreateContextFromType(props, CL_DEVICE_TYPE_GPU, NULL, NULL, &amp;errcode);\r\n        Check(errcode == CL_SUCCESS);\r\n\r\n        errcode = clGetContextInfo(clGPUContext, CL_CONTEXT_DEVICES, 0, NULL, &amp;dataBytes);\r\n        cl_device_id * clDevices = (cl_device_id *)malloc(dataBytes);\r\n        errcode |= clGetContextInfo(clGPUContext, CL_CONTEXT_DEVICES, dataBytes, clDevices, NULL);\r\n        Check(errcode == CL_SUCCESS);\r\n\r\n        for (int d = 0; d &lt; dataBytes; ++d)\r\n        {\r\n            struct _timeb  t1;\r\n            struct _timeb  t2;\r\n            std::cout &lt;&lt; \"Starting explicit simple... \";\r\n            _ftime_s(&amp;t1);\r\n\r\n            clCommandQue = clCreateCommandQueue(clGPUContext, clDevices[d], 0, &amp;errcode);\r\n            Check(errcode == CL_SUCCESS);\r\n\r\n            d_C = clCreateBuffer(clGPUContext, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 2 * sizeof(int) + C-&gt;cols * C-&gt;rows * sizeof(float), C, &amp;errcode);\r\n            Check(errcode == CL_SUCCESS);\r\n            d_A = clCreateBuffer(clGPUContext, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 2 * sizeof(int) + A-&gt;cols * A-&gt;rows * sizeof(float), A, &amp;errcode);\r\n            Check(errcode == CL_SUCCESS);\r\n            d_B = clCreateBuffer(clGPUContext, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 2 * sizeof(int) + B-&gt;cols * B-&gt;rows * sizeof(float), B, &amp;errcode);\r\n            Check(errcode == CL_SUCCESS);\r\n\r\n            std::string clMatrixMul = LoadProgram(\"kernel.cl\");\r\n            Check(clMatrixMul.c_str() != NULL);\r\n            char * kernel_code = (char*)clMatrixMul.c_str();\r\n\r\n            clProgram = clCreateProgramWithSource(clGPUContext, 1, (const char **)&amp;kernel_code, 0, &amp;errcode);\r\n            Check(errcode == CL_SUCCESS);\r\n\r\n            errcode = clBuildProgram(clProgram, 0, NULL, NULL, NULL, NULL);\r\n            Check(errcode == CL_SUCCESS);\r\n\r\n            clKernel = clCreateKernel(clProgram, \"kernelSimple\", &amp;errcode);\r\n            Check(errcode == CL_SUCCESS);\r\n\r\n            size_t localWorkSize[2], globalWorkSize[2];\r\n\r\n            errcode = clSetKernelArg(clKernel, 0, sizeof(cl_mem), (void *)&amp;d_C);\r\n            errcode |= clSetKernelArg(clKernel, 1, sizeof(cl_mem), (void *)&amp;d_A);\r\n            errcode |= clSetKernelArg(clKernel, 2, sizeof(cl_mem), (void *)&amp;d_B);\r\n            Check(errcode == CL_SUCCESS);\r\n\r\n            localWorkSize[0] = 16;\r\n            localWorkSize[1] = 16;\r\n            globalWorkSize[0] = C-&gt;rows;\r\n            globalWorkSize[1] = C-&gt;cols;\r\n\r\n            errcode = clEnqueueNDRangeKernel(clCommandQue, clKernel, 2, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);\r\n            Check(errcode == CL_SUCCESS);\r\n\r\n            errcode = clEnqueueReadBuffer(clCommandQue, d_C, CL_TRUE, 0, 2 * sizeof(int) + C-&gt;cols * C-&gt;rows * sizeof(float), C, 0, NULL, NULL);\r\n            Check(errcode == CL_SUCCESS);\r\n\r\n            clReleaseMemObject(d_A);\r\n            clReleaseMemObject(d_C);\r\n            clReleaseMemObject(d_B);\r\n\r\n            free(clDevices);\r\n            clReleaseContext(clGPUContext);\r\n            clReleaseKernel(clKernel);\r\n            clReleaseProgram(clProgram);\r\n            clReleaseCommandQueue(clCommandQue);\r\n\r\n            _ftime_s(&amp;t2);\r\n            std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; \" s.\\n\";\r\n\r\n            return;\r\n        }\r\n    }\r\n}\r\n\r\nvoid MultiplyTile(Matrix * C, Matrix * A, Matrix * B)\r\n{\r\n    cl_context clGPUContext;\r\n    cl_command_queue clCommandQue;\r\n    cl_program clProgram;\r\n    cl_kernel clKernel;\r\n\r\n    size_t dataBytes;\r\n    size_t kernelLength;\r\n    cl_int errcode;\r\n\r\n    cl_mem d_A;\r\n    cl_mem d_B;\r\n    cl_mem d_C;\r\n\r\n    cl_int err = NULL;\r\n    cl_platform_id plat_id[20];\r\n    cl_uint num;\r\n\r\n    err = clGetPlatformIDs(20, plat_id, &amp;num);\r\n    Check(err == 0);\r\n\r\n    for (int p = 0; p &lt; num; ++p)\r\n    {\r\n        cl_platform_id platform = plat_id[p];\r\n\r\n        cl_context_properties props[4];\r\n        props[0] = CL_CONTEXT_PLATFORM;\r\n        props[1] = (cl_context_properties)platform;\r\n        props[2] = 0;\r\n        clGPUContext = clCreateContextFromType(props, CL_DEVICE_TYPE_GPU, NULL, NULL, &amp;errcode);\r\n        Check(errcode == CL_SUCCESS);\r\n\r\n        errcode = clGetContextInfo(clGPUContext, CL_CONTEXT_DEVICES, 0, NULL, &amp;dataBytes);\r\n        cl_device_id * clDevices = (cl_device_id *)malloc(dataBytes);\r\n        errcode |= clGetContextInfo(clGPUContext, CL_CONTEXT_DEVICES, dataBytes, clDevices, NULL);\r\n        Check(errcode == CL_SUCCESS);\r\n\r\n        for (int d = 0; d &lt; dataBytes; ++d)\r\n        {\r\n            struct _timeb  t1;\r\n            struct _timeb  t2;\r\n            std::cout &lt;&lt; \"Starting tile... \";\r\n            _ftime_s(&amp;t1);\r\n\r\n            clCommandQue = clCreateCommandQueue(clGPUContext, clDevices[d], 0, &amp;errcode);\r\n            Check(errcode == CL_SUCCESS);\r\n\r\n            d_C = clCreateBuffer(clGPUContext, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 2 * sizeof(int) + C-&gt;cols * C-&gt;rows * sizeof(float), C, &amp;errcode);\r\n            Check(errcode == CL_SUCCESS);\r\n            d_A = clCreateBuffer(clGPUContext, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 2 * sizeof(int) + A-&gt;cols * A-&gt;rows * sizeof(float), A, &amp;errcode);\r\n            Check(errcode == CL_SUCCESS);\r\n            d_B = clCreateBuffer(clGPUContext, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 2 * sizeof(int) + B-&gt;cols * B-&gt;rows * sizeof(float), B, &amp;errcode);\r\n            Check(errcode == CL_SUCCESS);\r\n\r\n            std::string clMatrixMul = LoadProgram(\"kernel.cl\");\r\n            Check(clMatrixMul.c_str() != NULL);\r\n            char * kernel_code = (char*)clMatrixMul.c_str();\r\n\r\n            clProgram = clCreateProgramWithSource(clGPUContext, 1, (const char **)&amp;kernel_code, 0, &amp;errcode);\r\n            Check(errcode == CL_SUCCESS);\r\n\r\n            errcode = clBuildProgram(clProgram, 0, NULL, NULL, NULL, NULL);\r\n            Check(errcode == CL_SUCCESS);\r\n\r\n            clKernel = clCreateKernel(clProgram, \"kernelTile\", &amp;errcode);\r\n            Check(errcode == CL_SUCCESS);\r\n\r\n            size_t localWorkSize[2], globalWorkSize[2];\r\n\r\n            errcode = clSetKernelArg(clKernel, 0, sizeof(cl_mem), (void *)&amp;d_C);\r\n            errcode |= clSetKernelArg(clKernel, 1, sizeof(cl_mem), (void *)&amp;d_A);\r\n            errcode |= clSetKernelArg(clKernel, 2, sizeof(cl_mem), (void *)&amp;d_B);\r\n            Check(errcode == CL_SUCCESS);\r\n\r\n            localWorkSize[0] = 16;\r\n            localWorkSize[1] = 16;\r\n            globalWorkSize[0] = C-&gt;rows;\r\n            globalWorkSize[1] = C-&gt;cols;\r\n\r\n            errcode = clEnqueueNDRangeKernel(clCommandQue, clKernel, 2, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);\r\n            Check(errcode == CL_SUCCESS);\r\n\r\n            errcode = clEnqueueReadBuffer(clCommandQue, d_C, CL_TRUE, 0, 2 * sizeof(int) + C-&gt;cols * C-&gt;rows * sizeof(float), C, 0, NULL, NULL);\r\n            Check(errcode == CL_SUCCESS);\r\n\r\n            clReleaseMemObject(d_A);\r\n            clReleaseMemObject(d_C);\r\n            clReleaseMemObject(d_B);\r\n\r\n            free(clDevices);\r\n            clReleaseContext(clGPUContext);\r\n            clReleaseKernel(clKernel);\r\n            clReleaseProgram(clProgram);\r\n            clReleaseCommandQueue(clCommandQue);\r\n\r\n            _ftime_s(&amp;t2);\r\n            std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; \" s.\\n\";\r\n\r\n            return;\r\n        }\r\n    }\r\n}\r\n\r\nint\r\nmain(int argc, char** argv)\r\n{\r\n    Matrix * A = Create_Matrix(16*30, 16*40);\r\n    Matrix * B = Create_Matrix(16*40, 16*60);\r\n    Matrix * C = Create_Matrix(A-&gt;rows, B-&gt;cols);\r\n    Matrix * C2 = Create_Matrix(A-&gt;rows, B-&gt;cols);\r\n    Matrix * C4 = Create_Matrix(A-&gt;rows, B-&gt;cols);\r\n\r\n    for (int i = 0; i &lt; 10; ++i)\r\n    {\r\n        for (int i = 0; i &lt; A-&gt;rows * A-&gt;cols; ++i) Data(A)[i] = rand() % 10;\r\n        for (int i = 0; i &lt; B-&gt;rows * B-&gt;cols; ++i) Data(B)[i] = rand() % 10;\r\n        for (int i = 0; i &lt; C-&gt;rows * C-&gt;cols; ++i) Data(C)[i] = rand() % 10;\r\n        for (int i = 0; i &lt; C2-&gt;rows * C2-&gt;cols; ++i) Data(C2)[i] = rand() % 10;\r\n        for (int i = 0; i &lt; C4-&gt;rows * C4-&gt;cols; ++i) Data(C4)[i] = rand() % 10;\r\n        MultiplySerial(C, A, B);\r\n        MultiplyExplicitSimple(C2, A, B);\r\n        MultiplyTile(C4, A, B);\r\n        for (int i = 0; i &lt; C-&gt;rows * C-&gt;cols; ++i)\r\n            if (fabs(Data(C)[i] - Data(C2)[i]) &gt; 0.0001)\r\n            { std::cout &lt;&lt; \"diff C2\\n\";\r\n                break;\r\n            }\r\n        for (int i = 0; i &lt; C-&gt;rows * C-&gt;cols; ++i)\r\n            if (fabs(Data(C)[i] - Data(C4)[i]) &gt; 0.0001)\r\n            { std::cout &lt;&lt; \"diff C4\\n\";\r\n                break;\r\n            }\r\n    }\r\n    return 0;\r\n}<\/pre>\n<p>&nbsp;<\/p>\n<h2>Bitonic sort<\/h2>\n<p style=\"text-align: justify;\">A second problem I chose for comparison was <a href=\"http:\/\/en.wikipedia.org\/wiki\/Bitonic_sorter\" target=\"_blank\">bitonic sort<\/a>\u00c2\u00a0of an integer array. \u00c2\u00a0An in-place, power of 2 sequential solution is outlined in Gedik <em>et al<\/em>., but I did not choose that as a starting point. \u00c2\u00a0While there is a CUDA implementation available in the\u00c2\u00a0<a href=\"http:\/\/developer.download.nvidia.com\/compute\/DevZone\/C\/Projects\/sortingNetworks.zip\" target=\"_blank\">public domain<\/a>\u00c2\u00a0by NVIDIA, that solution is difficult to understand, and the implementation changes the algorithm used based on the size of the array. \u00c2\u00a0My implementation is based on the description in Peters <em>et al<\/em>. \u00c2\u00a0The C++ AMP and CUDA implementations are presented below.<\/p>\n<p style=\"text-align: justify;\">Both solutions performs a sequence of parallel computations of groups of comparisons and swaps of a uni-directional (also known as &#8220;normalized&#8221;) sorting network. \u00c2\u00a0An example of this network is displayed in the figure below, for an array of 16 elements (figure from <a href=\"http:\/\/en.wikipedia.org\/wiki\/File:BitonicSort.svg\" target=\"_blank\">Wikipedia<\/a>).<\/p>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/domemtech.com\/wordpress\/wp-content\/uploads\/2011\/10\/2000px-BitonicSort1.png\"><img loading=\"lazy\" class=\"aligncenter size-large wp-image-1069\" title=\"2000px-BitonicSort\" src=\"http:\/\/domemtech.com\/wordpress\/wp-content\/uploads\/2011\/10\/2000px-BitonicSort1-1024x390.png\" alt=\"\" width=\"584\" height=\"222\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">For the sort, each group of vertically oriented swaps in the network are parallelized. \u00c2\u00a0As shown in the figure above, each blue box is labelled with a <em>phase, and\u00c2\u00a0<\/em>each red box is labelled with a <em>level<\/em>. \u00c2\u00a0(The red boxes in each phase is also known as a bitonic merge, because that portion of the network merges two bitonic sequences into one bitonic sequence.) \u00c2\u00a0Orange boxes are unlabeled because there is only one per blue box. \u00c2\u00a0(The orange box is also know as a bitonic split, because it creates two bitonic sequences.) \u00c2\u00a0Swaps proceed in a left to right fashion, first with the orange box of Phase 1, then the orange box of Phase 2, then the red box of Phase 2, and so on.<\/p>\n<p style=\"text-align: justify;\">In the C++ AMP implementation, swaps in blue box are computed in lines 77-99. \u00c2\u00a0For each blue box, the orange box is computed (lines 78-86), followed by a number of red boxes (lines 88-99). \u00c2\u00a0In this solution, the size of the tile is TS. \u00c2\u00a0The length of the array must be a power of 2, and the length must be divisible by the size of the tile.<\/p>\n<p style=\"text-align: justify;\">The run time for each were computed (nine runs, standard error for sample), and a summary of these data are contained in the table below.<\/p>\n<table>\n<tbody>\n<tr>\n<td>Implementation<\/td>\n<td>Sequential<\/td>\n<td>Parallel (1024 threads per tile)<\/td>\n<\/tr>\n<tr>\n<td>C++ AMP<\/td>\n<td>12.5 \u00c2\u00b1 0.4 s<\/td>\n<td>0.41 \u00c2\u00b1 0.04 s<\/td>\n<\/tr>\n<tr>\n<td>CUDA<\/td>\n<td>12.6 \u00c2\u00b1 0.01 s<\/td>\n<td>0.372 \u00c2\u00b1\u00c2\u00a00.002 s<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p style=\"text-align: justify;\">(Array of length 1024 * 32 * 16 * 16 = 8388608, of integers.\u00c2\u00a0Environment: NVIDIA\u00c2\u00a0GeForce GTX 470, an ATI Radeon HD 6450 (not used),\u00c2\u00a0and an Intel Q6600 @ 2.51 Ghz (overclocked), 4 G RAM, run on the Windows 7 64-bit OS.\u00c2\u00a010 runs, first run thrown out (for total of 9). Average \u00c2\u00b1 S.E. of sample.)<\/p>\n<p style=\"text-align: justify;\">For this problem, the C++ AMP implementation is on par with the CUDA implementation (if slightly slower than the CUDA implementation).<\/p>\n<p style=\"text-align: justify;\">The naive bitionic sort implementation performs one comparison per thread. \u00c2\u00a0Each thread loads the two elements to compare, then swaps the values if they are out of order. \u00c2\u00a0 However, global memory is accessed quite often with this implementation because the same elements are loaded from global memory over and over, between levels of the bitonic merge. \u00c2\u00a0For example, element 0 of the array is loaded three times in the bitonic merge in phase 3: the first comparisons of level 2, 1, and 0.<\/p>\n<p style=\"text-align: justify;\">An improved implementation is possible by using fewer threads, with each thread doing more work. \u00c2\u00a0Each thread works on a partition of comparisons of the bitionic merge portion of the network. \u00c2\u00a0The partition is defined so that it does not overlap with other threads working on the same levels of a phase. \u00c2\u00a0Partitions that involve <em>n<\/em> levels of a merge are known as partitions with <em>degree<\/em> <em>n<\/em>. \u00c2\u00a0The figure below shows two partitions. \u00c2\u00a0One partition works with comparisons labelled with red lines; a second partition works with comparisons labelled with green lines. \u00c2\u00a0Both partitions have degree 2 because they involve only two levels of the merge.<\/p>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/domemtech.com\/wordpress\/wp-content\/uploads\/2011\/10\/2000px-BitonicSort-partitioned.png\"><img loading=\"lazy\" class=\"aligncenter size-large wp-image-1105\" title=\"2000px-BitonicSort-partitioned\" src=\"http:\/\/domemtech.com\/wordpress\/wp-content\/uploads\/2011\/10\/2000px-BitonicSort-partitioned-1024x390.png\" alt=\"\" width=\"584\" height=\"222\" \/><\/a>For comparison, I implemented bitonic sort with partitions of degree 5 (see Peters paper). \u00c2\u00a0The runtimes for C++ AMP and CUDA showed an improvement, particularly with the CUDA implementation.<\/p>\n<table>\n<tbody>\n<tr>\n<td>Implementation<\/td>\n<td>Naive parallel (512 threads per tile)<\/td>\n<td>Degree-5 parallel (512 threads per tile)<\/td>\n<\/tr>\n<tr>\n<td>C++ AMP<\/td>\n<td>0.285 \u00c2\u00b1 0.4 s<\/td>\n<td>0.239 \u00c2\u00b1 0.001 s<\/td>\n<\/tr>\n<tr>\n<td>CUDA<\/td>\n<td>0.280 \u00c2\u00b1 0.001 s<\/td>\n<td>0.187 \u00c2\u00b1\u00c2\u00a00.001 s<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p style=\"text-align: justify;\">What is particularly interesting is that, while the C++ AMP implementation is slower than the CUDA implementation, the C++ AMP implementation with degree-5 merges performed better than its naive implementation. \u00c2\u00a0Why? \u00c2\u00a0In C++ AMP, there is no way for the programmer to choose a register representation of variables. \u00c2\u00a0However, the compiler chose a register representation none the less. \u00c2\u00a0It it hadn&#8217;t, the the implementation would have been slower than the naive implementation, which is was not.<\/p>\n<p style=\"text-align: justify;\"><span class=\"Apple-style-span\" style=\"font-size: 10px; letter-spacing: 1px; line-height: 26px; text-transform: uppercase;\">Implementation in CUDA<\/span><\/p>\n<pre lang=\"c\">#include &lt;fstream&gt;\r\n#include &lt;iomanip&gt;\r\n#include &lt;iostream&gt;\r\n#include &lt;math.h&gt;\r\n#include &lt;stdio.h&gt;\r\n#include &lt;stdlib.h&gt;\r\n#include &lt;string.h&gt;\r\n#include &lt;sys\/timeb.h&gt;\r\n#include &lt;time.h&gt;\r\n\r\nvoid exchange(int &amp; i, int &amp; j)\r\n{\r\n    int t = i;\r\n    i = j;\r\n    j = t;\r\n}\r\n\r\nvoid bitonicSortSequentialGedik(int * a, int length)\r\n{\r\n    struct _timeb  t1;\r\n    struct _timeb  t2;\r\n    std::cout &lt;&lt; \"Starting sequential (Gedik)... \";\r\n    _ftime_s(&amp;t1);\r\n    int i,j,k;\r\n    for (k = 2; k &lt;= length; k = 2*k)\r\n    {\r\n        for (j = k &gt;&gt; 1; j &gt; 0; j = j&gt;&gt;1)\r\n        {\r\n            for (i=0; i &lt; length; i++)\r\n            {\r\n                int ixj = i ^ j;\r\n                if (ixj &gt; i)\r\n                {\r\n                    if ((i &amp; k) == 0 &amp;&amp; a[i] &gt; a[ixj])\r\n                        exchange(a[i], a[ixj]);\r\n                    if ((i &amp; k) != 0 &amp;&amp; a[i] &lt; a[ixj])\r\n                        exchange(a[i], a[ixj]);\r\n                }\r\n            }\r\n        }\r\n    }\r\n    _ftime(&amp;t2);\r\n    std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; \" s.\\n\";\r\n}\r\n\r\n\/\/__device__ __host__ inline void swap(int &amp; a, int &amp; b)\r\n\/\/{\r\n\/\/    if (a &lt; b)\r\n\/\/        return;\r\n\/\/    int tmp = a;\r\n\/\/    a = b;\r\n\/\/    b = tmp;\r\n\/\/}\r\n\/\/__device__ __host__ inline void swap(int &amp; a, int &amp; b)\r\n\/\/{\r\n\/\/    if (a &gt; b) {\r\n\/\/        int tmp = a;\r\n\/\/        a = b;\r\n\/\/        b = tmp;\r\n\/\/    }\r\n\/\/}\r\n\r\n__device__ __host__ inline void swap(int &amp; a, int &amp; b)\r\n{\r\n    if (a &gt; b) {\r\n        a ^= b;\r\n        b ^= a;\r\n        a ^= b;\r\n    }\r\n}\r\n\r\nunsigned int mylog2(unsigned int v)\r\n{\r\n    unsigned int x = 0;\r\n    while ((v = (v &gt;&gt; 1)) != 0)\r\n    {\r\n        x++;\r\n    }\r\n    return x;\r\n}\r\n\r\nunsigned int pow2(int e)\r\n{\r\n    unsigned int x = 1;\r\n    for (int i = 0; i &lt; e; ++i)\r\n        x *= 2;\r\n    return x;\r\n}\r\n\r\n__device__ __host__ inline void orange_box(int i, int p2, int &amp; cross, int &amp; pair)\r\n{\r\n    int imp2 = i % p2;\r\n    int j = imp2 + 2 * p2 * (int)(i \/ p2);\r\n    cross = j;\r\n    pair = -1 - imp2 + 2 * p2 * (int)((i+p2)\/p2);\r\n}\r\n\r\n__device__ __host__ inline void red_box(int i, int p2, int &amp; cross, int &amp; pair)\r\n{\r\n    int imp2 = i % p2;\r\n    int j = imp2 + 2 * p2 * (int)(i \/ p2);\r\n    cross = j;\r\n    pair = j + p2;\r\n}\r\n\r\nvoid bitonicSortSequential(int * data, int length)\r\n{\r\n    unsigned int log2length = mylog2(length);\r\n    unsigned int checklength = pow2(log2length);\r\n    if (length != checklength)\r\n    {\r\n        std::cout &lt;&lt; \"Length not a power of 2.\\n\";\r\n        throw;\r\n    }\r\n    struct _timeb  t1;\r\n    struct _timeb  t2;\r\n    std::cout &lt;&lt; \"Starting sequential... \";\r\n    _ftime_s(&amp;t1);\r\n\r\n    {\r\n        for (int phase = 0; phase &lt; log2length; ++phase)\r\n        {\r\n            int compares = length \/ 2;\r\n            unsigned int phase2 = pow2((unsigned int)phase);\r\n            for (int ig = 0; ig &lt; compares; ++ig) {\r\n                int cross;\r\n                int paired;\r\n                orange_box(ig, phase2, cross, paired);\r\n                swap(data[cross], data[paired]);\r\n            }\r\n            for (int level = phase-1; level &gt;= 0; --level)\r\n            {\r\n                unsigned int level2 = pow2((unsigned int)level);\r\n                for (int ig = 0; ig &lt; compares; ++ig) {\r\n                    int cross;\r\n                    int paired;\r\n                    red_box(ig, level2, cross, paired);\r\n                    swap(data[cross], data[paired]);\r\n                }\r\n            }\r\n        }\r\n    }\r\n    _ftime(&amp;t2);\r\n    std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; \" s.\\n\";\r\n}\r\n\r\nvoid Check(bool OK)\r\n{\r\n    if (! OK)\r\n    {\r\n        printf(\"error\\n\");\r\n        throw;\r\n    }\r\n}\r\n\r\n__global__ void Kernel1(int * a, int length, int phase2)\r\n{\r\n    int ig = threadIdx.x\r\n            + blockDim.x * blockIdx.x\r\n            + blockDim.x * gridDim.x * blockDim.y * blockIdx.y\r\n            + blockDim.x * gridDim.x * threadIdx.y;\r\n    if (ig &lt; 0)\r\n        return;\r\n    if (ig &gt;= length)\r\n        return;\r\n    int cross;\r\n    int paired;\r\n    orange_box(ig, phase2, cross, paired);\r\n    swap(a[cross], a[paired]);\r\n}\r\n\r\n__global__ void Kernel2(int * a, int length, int phase2)\r\n{\r\n    int ig = threadIdx.x\r\n            + blockDim.x * blockIdx.x\r\n            + blockDim.x * gridDim.x * blockDim.y * blockIdx.y\r\n            + blockDim.x * gridDim.x * threadIdx.y;\r\n    if (ig &lt; 0)\r\n        return;\r\n    if (ig &gt;= length)\r\n        return;\r\n    int cross;\r\n    int paired;\r\n    red_box(ig, phase2, cross, paired);\r\n    swap(a[cross], a[paired]);\r\n}\r\n\r\nvoid bitonicSortExplicitSimple(int * data, int length)\r\n{\r\n    unsigned int log2length = mylog2(length);\r\n    unsigned int checklength = pow2(log2length);\r\n    if (length != checklength)\r\n    {\r\n        std::cout &lt;&lt; \"Length not a power of 2.\\n\";\r\n        throw;\r\n    }\r\n    struct _timeb  t1;\r\n    struct _timeb  t2;\r\n    std::cout &lt;&lt; \"Starting explicit simple... \";\r\n    _ftime_s(&amp;t1);\r\n\r\n    {\r\n        static const int TS = 512;\r\n        cudaError_t errcode;\r\n        int * a;\r\n        errcode = cudaMalloc((void**) &amp;a, length * sizeof(int));\r\n        Check(errcode == cudaSuccess);\r\n\r\n        errcode = cudaMemcpy((void*)a, data, length * sizeof(int), cudaMemcpyHostToDevice);\r\n        Check(errcode == cudaSuccess);\r\n\r\n        for (int phase = 0; phase &lt; log2length; ++phase)\r\n        {\r\n            int threads = length \/ 2;\r\n            unsigned int phase2 = pow2((unsigned int)phase);\r\n\r\n            dim3 th(TS);\r\n            dim3 gr(threads \/ TS, 1);\r\n\r\n            Kernel1&lt;&lt;&lt;gr, th&gt;&gt;&gt;(a, length, phase2);\r\n            cudaThreadSynchronize();\r\n            errcode = cudaGetLastError();\r\n            Check(errcode == cudaSuccess);\r\n\r\n            for (int level = phase-1; level &gt;= 0; --level)\r\n            {\r\n                unsigned int level2 = pow2((unsigned int)level);\r\n                Kernel2&lt;&lt;&lt;gr, th&gt;&gt;&gt;(a, length, level2);\r\n                cudaThreadSynchronize();\r\n                errcode = cudaGetLastError();\r\n                Check(errcode == cudaSuccess);\r\n            }\r\n        }\r\n        errcode = cudaMemcpy((void**)data, a, length * sizeof(int), cudaMemcpyDeviceToHost);\r\n        Check(errcode == cudaSuccess);\r\n        cudaFree(a);\r\n    }\r\n\r\n    _ftime(&amp;t2);\r\n    std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; \" s.\\n\";\r\n}\r\n\r\nvoid PrecomputePartition(int phase, int level, int degree, int ** _memory, int ** _compares, int ** _normalized_compares, int ** _levels, int * _msize, int * _csize, int * _threads, int * _block_size)\r\n{\r\n    \/\/std::cout &lt;&lt; \"\\nPartitioning with phase \" &lt;&lt; phase &lt;&lt; \", level \" &lt;&lt; level &lt;&lt; \", degree \" &lt;&lt; degree &lt;&lt; \"\\n\";\r\n    if (degree == 3)\r\n    {\r\n        const int msize = 8;\r\n        const int csize = 24;\r\n        static int normalized_compares[csize] = {0, 4, 1, 5, 2, 6, 3, 7,\r\n                                                 0, 2, 1, 3, 4, 6, 5, 7,\r\n                                                 0, 1, 2, 3, 4, 5, 6, 7};\r\n        static int memory[msize];\r\n        for (int i = 0; i &lt; msize; ++i)\r\n            memory[i] = i * pow2(level + 1 - degree);\r\n        static int levels[csize];\r\n        for (int i = 0; i &lt; csize; ++i)\r\n            levels[i] = level - i \/ degree;\r\n        static int compares[csize];\r\n        for (int i = 0; i &lt; csize; ++i)\r\n            compares[i] = memory[normalized_compares[i]];\r\n        int threads = memory[1];\r\n        int block_size = memory[1] * msize;\r\n\r\n        *_msize = msize;\r\n        *_csize = csize;\r\n        *_normalized_compares = normalized_compares;\r\n        *_memory = memory;\r\n        *_levels = levels;\r\n        *_compares = compares;\r\n        *_threads = threads;\r\n        *_block_size = block_size;\r\n    } else if (degree == 5)\r\n    {\r\n        const int msize = 32;\r\n        const int csize = 160;\r\n        static int normalized_compares[csize] = {\r\n    0, 16, 1, 17, 2, 18, 3, 19, 4, 20,  5, 21,  6, 22,  7, 23,  8, 24,  9, 25, 10, 26, 11, 27, 12, 28, 13, 29, 14, 30, 15, 31,\r\n    0,  8, 1,  9, 2, 10, 3, 11, 4, 12,  5, 13,  6, 14,  7, 15, 16, 24, 17, 25, 18, 26, 19, 27, 20, 28, 21, 29, 22, 30, 23, 31,\r\n    0,  4, 1,  5, 2,  6, 3,  7, 8, 12,  9, 13, 10, 14, 11, 15, 16, 20, 17, 21, 18, 22, 19, 23, 24, 28, 25, 29, 26, 30, 27, 31,\r\n    0,  2, 1,  3, 4,  6, 5,  7, 8, 10,  9, 11, 12, 14, 13, 15, 16, 18, 17, 19, 20, 22, 21, 23, 24, 26, 25, 27, 28, 30, 29, 31,\r\n    0,  1, 2,  3, 4,  5, 6,  7, 8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31};\r\n        static int memory[msize];\r\n        for (int i = 0; i &lt; msize; ++i)\r\n            memory[i] = i * pow2(level + 1 - degree);\r\n        static int levels[csize];\r\n        for (int i = 0; i &lt; csize; ++i)\r\n            levels[i] = level - i \/ degree;\r\n        static int compares[csize];\r\n        for (int i = 0; i &lt; csize; ++i)\r\n            compares[i] = memory[normalized_compares[i]];\r\n        int threads = memory[1];\r\n        int block_size = memory[1] * msize;\r\n\r\n    }\r\n}\r\n\r\n__constant__ const int normalized_compares[160] = {\r\n    0, 16, 1, 17, 2, 18, 3, 19, 4, 20,  5, 21,  6, 22,  7, 23,  8, 24,  9, 25, 10, 26, 11, 27, 12, 28, 13, 29, 14, 30, 15, 31,\r\n    0,  8, 1,  9, 2, 10, 3, 11, 4, 12,  5, 13,  6, 14,  7, 15, 16, 24, 17, 25, 18, 26, 19, 27, 20, 28, 21, 29, 22, 30, 23, 31,\r\n    0,  4, 1,  5, 2,  6, 3,  7, 8, 12,  9, 13, 10, 14, 11, 15, 16, 20, 17, 21, 18, 22, 19, 23, 24, 28, 25, 29, 26, 30, 27, 31,\r\n    0,  2, 1,  3, 4,  6, 5,  7, 8, 10,  9, 11, 12, 14, 13, 15, 16, 18, 17, 19, 20, 22, 21, 23, 24, 26, 25, 27, 28, 30, 29, 31,\r\n    0,  1, 2,  3, 4,  5, 6,  7, 8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31};\r\n\r\n__global__ void Kernel3(int * a, int length, int phase, int level, int level2)\r\n{\r\n    int ig = threadIdx.x\r\n            + blockDim.x * blockIdx.x\r\n            + blockDim.x * gridDim.x * blockDim.y * blockIdx.y\r\n            + blockDim.x * gridDim.x * threadIdx.y;\r\n    if (ig &lt; 0)\r\n        return;\r\n    if (ig &gt;= length)\r\n        return;\r\n    const int degree = 5;\r\n    const int msize = 32;\r\n    const int csize = 160;\r\n    int memory[msize];\r\n    for (int i = 0; i &lt; msize; ++i)\r\n        memory[i] = i * level2;\r\n    int threads = memory[1];\r\n    int block_size = memory[1] * msize;\r\n    int base = ig % threads + block_size * (int)(ig \/ threads);\r\n    int mm[msize];\r\n    #pragma unroll\r\n    for (int i = 0; i &lt; msize; ++i)\r\n    {\r\n        mm[i] = a[base + memory[i]];\r\n    }\r\n\r\n\/\/#define SCREW_YOU\r\n#ifdef SCREW_YOU\r\n    #pragma unroll\r\n    for (int i = 0; i &lt; csize; i += 2)\r\n    {\r\n        int cross = normalized_compares[i];\r\n        int paired = normalized_compares[i+1];\r\n        swap(mm[cross], mm[paired]);\r\n    }\r\n#else\r\n    swap(mm[0], mm[16]);\r\n    swap(mm[1], mm[17]);\r\n    swap(mm[2], mm[18]);\r\n    swap(mm[3], mm[19]);\r\n    swap(mm[4], mm[20]);\r\n    swap(mm[5], mm[21]);\r\n    swap(mm[6], mm[22]);\r\n    swap(mm[7], mm[23]);\r\n    swap(mm[8], mm[24]);\r\n    swap(mm[9], mm[25]);\r\n    swap(mm[10], mm[26]);\r\n    swap(mm[11], mm[27]);\r\n    swap(mm[12], mm[28]);\r\n    swap(mm[13], mm[29]);\r\n    swap(mm[14], mm[30]);\r\n    swap(mm[15], mm[31]);\r\n\r\n    swap(mm[0], mm[8]);\r\n    swap(mm[1], mm[9]);\r\n    swap(mm[2], mm[10]);\r\n    swap(mm[3], mm[11]);\r\n    swap(mm[4], mm[12]);\r\n    swap(mm[5], mm[13]);\r\n    swap(mm[6], mm[14]);\r\n    swap(mm[7], mm[15]);\r\n    swap(mm[16], mm[24]);\r\n    swap(mm[17], mm[25]);\r\n    swap(mm[18], mm[26]);\r\n    swap(mm[19], mm[27]);\r\n    swap(mm[20], mm[28]);\r\n    swap(mm[21], mm[29]);\r\n    swap(mm[22], mm[30]);\r\n    swap(mm[23], mm[31]);\r\n\r\n    swap(mm[0], mm[4]);\r\n    swap(mm[1], mm[5]);\r\n    swap(mm[2], mm[6]);\r\n    swap(mm[3], mm[7]);\r\n    swap(mm[8], mm[12]);\r\n    swap(mm[9], mm[13]);\r\n    swap(mm[10], mm[14]);\r\n    swap(mm[11], mm[15]);\r\n    swap(mm[16], mm[20]);\r\n    swap(mm[17], mm[21]);\r\n    swap(mm[18], mm[22]);\r\n    swap(mm[19], mm[23]);\r\n    swap(mm[24], mm[28]);\r\n    swap(mm[25], mm[29]);\r\n    swap(mm[26], mm[30]);\r\n    swap(mm[27], mm[31]);\r\n\r\n    swap(mm[0], mm[2]);\r\n    swap(mm[1], mm[3]);\r\n    swap(mm[4], mm[6]);\r\n    swap(mm[5], mm[7]);\r\n    swap(mm[8], mm[10]);\r\n    swap(mm[9], mm[11]);\r\n    swap(mm[12], mm[14]);\r\n    swap(mm[13], mm[15]);\r\n    swap(mm[16], mm[18]);\r\n    swap(mm[17], mm[19]);\r\n    swap(mm[20], mm[22]);\r\n    swap(mm[21], mm[23]);\r\n    swap(mm[24], mm[26]);\r\n    swap(mm[25], mm[27]);\r\n    swap(mm[28], mm[30]);\r\n    swap(mm[29], mm[31]);\r\n\r\n    swap(mm[0], mm[1]);\r\n    swap(mm[2], mm[3]);\r\n    swap(mm[4], mm[5]);\r\n    swap(mm[6], mm[7]);\r\n    swap(mm[8], mm[9]);\r\n    swap(mm[10], mm[11]);\r\n    swap(mm[12], mm[13]);\r\n    swap(mm[14], mm[15]);\r\n    swap(mm[16], mm[17]);\r\n    swap(mm[18], mm[19]);\r\n    swap(mm[20], mm[21]);\r\n    swap(mm[22], mm[23]);\r\n    swap(mm[24], mm[25]);\r\n    swap(mm[26], mm[27]);\r\n    swap(mm[28], mm[29]);\r\n    swap(mm[30], mm[31]);\r\n#endif\r\n\r\n    #pragma unroll\r\n    for (int i = 0; i &lt; msize; ++i)\r\n    {\r\n        a[base + memory[i]] = mm[i];\r\n    }\r\n}\r\n\r\nvoid bitonicSortWithDegree(int * data, int length)\r\n{\r\n    unsigned int log2length = mylog2(length);\r\n    unsigned int checklength = pow2(log2length);\r\n    if (length != checklength)\r\n    {\r\n        std::cout &lt;&lt; \"Length not a power of 2.\\n\";\r\n        throw;\r\n    }\r\n    struct _timeb  t1;\r\n    struct _timeb  t2;\r\n    std::cout &lt;&lt; \"Starting with degree... \";\r\n    _ftime_s(&amp;t1);\r\n    {\r\n        static const int TS = 512;\r\n        cudaError_t errcode;\r\n        int * a;\r\n        errcode = cudaMalloc((void**) &amp;a, length * sizeof(int));\r\n        Check(errcode == cudaSuccess);\r\n\r\n        errcode = cudaMemcpy((void*)a, data, length * sizeof(int), cudaMemcpyHostToDevice);\r\n        Check(errcode == cudaSuccess);\r\n\r\n        for (int phase = 0; phase &lt; log2length; ++phase)\r\n        {\r\n            int threads = length \/ 2;\r\n            unsigned int phase2 = pow2((unsigned int)phase);\r\n\r\n            dim3 th(TS);\r\n            dim3 gr(threads \/ TS, 1);\r\n\r\n            Kernel1&lt;&lt;&lt;gr, th&gt;&gt;&gt;(a, length, phase2);\r\n            cudaThreadSynchronize();\r\n            errcode = cudaGetLastError();\r\n            Check(errcode == cudaSuccess);\r\n\r\n            for (int level = phase-1; level &gt;= 0; )\r\n            {\r\n                if (level &gt; 6)\r\n                {\r\n                    const int degree = 5;\r\n                    const int msize = 32;\r\n                    const int csize = 160;\r\n                    static int memory[msize];\r\n                    for (int i = 0; i &lt; 2; ++i)\r\n                        memory[i] = i * pow2(level + 1 - degree);\r\n                    int threads = memory[1];\r\n                    int block_size = memory[1] * msize;\r\n\r\n                    \/\/ The size of each block is\r\n                    int ncompares = threads * (length \/ block_size);\r\n\r\n                    dim3 th(TS);\r\n                    if (ncompares &lt; TS)\r\n                    throw;\r\n                    dim3 gr(ncompares \/ TS, 1);\r\n                    int l2 = pow2(level + 1 - degree);\r\n                    Kernel3&lt;&lt;&lt;gr, th&gt;&gt;&gt;(a, length, phase, level, l2);\r\n                    cudaThreadSynchronize();\r\n                    errcode = cudaGetLastError();\r\n                    Check(errcode == cudaSuccess);\r\n\r\n                    level -= 5;\r\n                }\r\n                else\r\n                {\r\n                    unsigned int level2 = pow2((unsigned int)level);\r\n                    Kernel2&lt;&lt;&lt;gr, th&gt;&gt;&gt;(a, length, level2);\r\n                    cudaThreadSynchronize();\r\n                    errcode = cudaGetLastError();\r\n                    Check(errcode == cudaSuccess);\r\n                    level -= 1;\r\n                }\r\n            }\r\n        }\r\n        errcode = cudaMemcpy((void**)data, a, length * sizeof(int), cudaMemcpyDeviceToHost);\r\n        Check(errcode == cudaSuccess);\r\n        cudaFree(a);\r\n    }\r\n    _ftime(&amp;t2);\r\n    std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; \" s.\\n\";\r\n}\r\n\r\nint main()\r\n{\r\n    int size = 1024 * 32 * 16 * 16;\r\n    int * A = (int*)malloc(size * sizeof(int));\r\n    int * A2 = (int*)malloc(size * sizeof(int));\r\n    int * A3 = (int*)malloc(size * sizeof(int));\r\n    int * A4 = (int*)malloc(size * sizeof(int));\r\n    for (int j = 0; j &lt; 10; ++j)\r\n    {\r\n        for (int i = 0; i &lt; size; ++i) A[i] = (size - i);\r\n        for (int i = 0; i &lt; size; ++i) A2[i] = A[i];\r\n        for (int i = 0; i &lt; size; ++i) A3[i] = A[i];\r\n        for (int i = 0; i &lt; size; ++i) A4[i] = A[i];\r\n        bitonicSortSequentialGedik(A, size);\r\n        bitonicSortSequential(A2, size);\r\n        bitonicSortExplicitSimple(A3, size);\r\n        bitonicSortWithDegree(A4, size);\r\n        for (int i = 0; i &lt; size; ++i)\r\n            if (A[i] != A2[i])\r\n            {\r\n                std::cout &lt;&lt; \"diff A2.\\n\";\r\n                break;\r\n            }\r\n        for (int i = 0; i &lt; size; ++i)\r\n            if (A[i] != A3[i])\r\n            {\r\n                std::cout &lt;&lt; \"diff A3.\\n\";\r\n                break;\r\n            }\r\n        for (int i = 0; i &lt; size; ++i)\r\n            if (A[i] != A4[i])\r\n            {\r\n                std::cout &lt;&lt; \"diff A4.\\n\";\r\n                break;\r\n            }\r\n    }\r\n    return 0;\r\n}<\/pre>\n<p><span class=\"Apple-style-span\" style=\"font-size: 10px; letter-spacing: 1px; line-height: 26px; text-transform: uppercase;\">Implementation in C++ AMP<\/span><\/p>\n<pre lang=\"c\">#include &lt;stdlib.h&gt;\r\n#include &lt;stdio.h&gt;\r\n#include &lt;iostream&gt;\r\n#include &lt;amp.h&gt;\r\n#include &lt;sys\/timeb.h&gt;\r\n#include &lt;time.h&gt;\r\n#include &lt;stack&gt;\r\n#include &lt;list&gt;\r\n\r\nusing namespace concurrency;\r\n\r\nvoid exchange(int &amp; i, int &amp; j)\r\n{\r\n    int t = i;\r\n    i = j;\r\n    j = t;\r\n}\r\n\r\nvoid bitonicSortSequentialGedik(int * a, int length)\r\n{\r\n    struct _timeb  t1;\r\n    struct _timeb  t2;\r\n    std::cout &lt;&lt; \"Starting sequential (Gedik)... \";\r\n    _ftime_s(&amp;t1);\r\n    int i,j,k;\r\n    for (k = 2; k &lt;= length; k = 2*k)\r\n    {\r\n        for (j = k &gt;&gt; 1; j &gt; 0; j = j&gt;&gt;1)\r\n        {\r\n            for (i=0; i &lt; length; i++)\r\n            {\r\n                int ixj = i ^ j;\r\n                if (ixj &gt; i)\r\n                {\r\n                    if ((i &amp; k) == 0 &amp;&amp; a[i] &gt; a[ixj])\r\n                        exchange(a[i], a[ixj]);\r\n                    if ((i &amp; k) != 0 &amp;&amp; a[i] &lt; a[ixj])\r\n                        exchange(a[i], a[ixj]);\r\n                }\r\n            }\r\n        }\r\n    }\r\n    _ftime(&amp;t2);\r\n    std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; \" s.\\n\";\r\n}\r\n\r\ninline void swap(int &amp; a, int &amp; b) restrict(direct3d, cpu)\r\n{\r\n    if (a &lt; b)\r\n        return;\r\n    int tmp = a;\r\n    a = b;\r\n    b = tmp;\r\n}\r\n\r\nunsigned int log2(unsigned int v)\r\n{\r\n    unsigned int x = 0;\r\n    while ((v = (v &gt;&gt; 1)) != 0)\r\n    {\r\n        x++;\r\n    }\r\n    return x;\r\n}\r\n\r\nunsigned int pow2(int e) restrict(direct3d, cpu)\r\n{\r\n    unsigned int x = 1;\r\n    for (int i = 0; i &lt; e; ++i)\r\n        x *= 2;\r\n    return x;\r\n}\r\n\r\ninline void orange_box(int i, int p2, int &amp; cross, int &amp; pair) restrict(direct3d, cpu)\r\n{\r\n    int imp2 = i % p2;\r\n    int j = imp2 + 2 * p2 * (int)(i \/ p2);\r\n    cross = j;\r\n    pair = -1 - imp2 + 2 * p2 * (int)((i+p2)\/p2);\r\n}\r\n\r\ninline void orange_box_phase(int i, int phase, int &amp; cross, int &amp; pair) restrict(direct3d, cpu)\r\n{\r\n    int p2 = pow2(phase);\r\n    int imp2 = i % p2;\r\n    int j = imp2 + 2 * p2 * (int)(i \/ p2);\r\n    cross = j;\r\n    pair = -1 - imp2 + 2 * p2 * (int)((i+p2)\/p2);\r\n}\r\n\r\ninline void red_box(int i, int p2, int &amp; cross, int &amp; pair) restrict(direct3d, cpu)\r\n{\r\n    int imp2 = i % p2;\r\n    int j = imp2 + 2 * p2 * (int)(i \/ p2);\r\n    cross = j;\r\n    pair = j + p2;\r\n}\r\n\r\ninline void red_box_level(int i, int level, int &amp; cross, int &amp; pair) restrict(direct3d, cpu)\r\n{\r\n    int p2 = pow2(level);\r\n    int imp2 = i % p2;\r\n    int j = imp2 + 2 * p2 * (int)(i \/ p2);\r\n    cross = j;\r\n    pair = j + p2;\r\n}\r\n\r\n\/\/ compute the COEX index for the index of the element in the array.\r\nint reverse_red_box(int level, int element)\r\n{\r\n    int a = pow2(level);\r\n    int b = element % a;\r\n    int c = a * (int)(element \/ (2 * a));\r\n    return b + c;\r\n}\r\n\r\n\/\/ Return the COEX of an index into the array.\r\nint reverse_orange_box(int phase, int element)\r\n{\r\n    phase++;\r\n    int c = element % pow2(phase-1);\r\n    int d = (int)(abs((int)((element + pow2(phase-1))\/pow2(phase-1))));\r\n    int e = d % 2;\r\n    int f = (d - 1) % 2;\r\n    int g = (int)(abs((int)((element + pow2(phase))\/pow2(phase))));\r\n    int h = (g-1)*pow2(phase-1);\r\n    int i = pow2(phase-1) - 1 - element % pow2(phase-1);\r\n    int j = e*c+f*i;\r\n    int k = j+h;\r\n    return k;\r\n}\r\n\r\nvoid bitonicSortSequential(int * data, int length)\r\n{\r\n    unsigned int log2length = log2(length);\r\n    unsigned int checklength = pow2(log2length);\r\n    if (length != checklength)\r\n    {\r\n        std::cout &lt;&lt; \"Length not a power of 2.\\n\";\r\n        throw;\r\n    }\r\n    struct _timeb  t1;\r\n    struct _timeb  t2;\r\n    std::cout &lt;&lt; \"Starting sequential... \";\r\n    _ftime_s(&amp;t1);\r\n\r\n    {\r\n        for (int phase = 0; phase &lt; log2length; ++phase)\r\n        {\r\n            int compares = length \/ 2;\r\n            unsigned int phase2 = pow2((unsigned int)phase);\r\n            for (int ig = 0; ig &lt; compares; ++ig) {\r\n                int cross;\r\n                int paired;\r\n                orange_box(ig, phase2, cross, paired);\r\n                swap(data[cross], data[paired]);\r\n            }\r\n            for (int level = phase-1; level &gt;= 0; --level)\r\n            {\r\n                unsigned int level2 = pow2((unsigned int)level);\r\n                for (int ig = 0; ig &lt; compares; ++ig) {\r\n                    int cross;\r\n                    int paired;\r\n                    red_box(ig, level2, cross, paired);\r\n                    swap(data[cross], data[paired]);\r\n                }\r\n            }\r\n        }\r\n    }\r\n    _ftime(&amp;t2);\r\n    std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; \" s.\\n\";\r\n}\r\n\r\nvoid bitonicSortWithDegree(int * data, int length)\r\n{\r\n    unsigned int log2length = log2(length);\r\n    unsigned int checklength = pow2(log2length);\r\n    if (length != checklength)\r\n    {\r\n        std::cout &lt;&lt; \"Length not a power of 2.\\n\";\r\n        throw;\r\n    }\r\n    struct _timeb  t1;\r\n    struct _timeb  t2;\r\n    std::cout &lt;&lt; \"Starting with degree... \";\r\n    _ftime_s(&amp;t1);\r\n    std::vector&lt;accelerator&gt; accelerators = get_accelerators();\r\n    std::vector&lt;accelerator&gt;::iterator it;\r\n    for (it = accelerators.begin(); it != accelerators.end(); ++it)\r\n    {\r\n        if ((*it).get_has_display() == false)\r\n            continue;\r\n        break;\r\n    }\r\n    {\r\n        static const int TS = 512;\r\n        array_view&lt;int,1&gt; a(length, data);\r\n\r\n        for (int phase = 0; phase &lt; log2length; ++phase)\r\n        {\r\n            int compares = length \/ 2;\r\n            extent&lt;1&gt; e(compares);\r\n            grid&lt;1&gt; g(e);\r\n\r\n            unsigned int phase2 = pow2((unsigned int)phase);\r\n            parallel_for_each(*it, g.tile&lt;TS&gt;(),\r\n                [phase2, a](tiled_index&lt;TS&gt; idx) restrict(direct3d) {\r\n                int ig = idx.global[0];\r\n                int cross;\r\n                int paired;\r\n                orange_box(ig, phase2, cross, paired);\r\n                swap(a[cross], a[paired]);\r\n            });\r\n\r\n            for (int level = phase-1; level &gt;= 0; )\r\n            {\r\n                if (level &gt; 6)\r\n                {\r\n                    const int degree = 5;\r\n                    const int msize = 32;\r\n                    const int csize = 160;\r\n                    static int memory[msize];\r\n                    for (int i = 0; i &lt; 2; ++i)\r\n                        memory[i] = i * pow2(level + 1 - degree);\r\n                    int threads = memory[1];\r\n                    int block_size = memory[1] * msize;\r\n                    int ncompares = threads * (length \/ block_size);\r\n                    int l2 = pow2(level + 1 - degree);\r\n                    extent&lt;1&gt; e(ncompares);\r\n                    grid&lt;1&gt; g(e);\r\n                    parallel_for_each(*it, g.tile&lt;TS&gt;(),\r\n                        [a, length, phase, level, l2](tiled_index&lt;TS&gt; idx) restrict(direct3d)\r\n                    {\r\n                        int ig = idx.global[0];\r\n                        const int degree = 5;\r\n                        const int msize = 32;\r\n                        const int csize = 160;\r\n                        int memory[msize];\r\n                        for (int i = 0; i &lt; msize; ++i)\r\n                            memory[i] = i * l2;\r\n                        int threads = memory[1];\r\n                        int block_size = memory[1] * msize;\r\n                        int base = ig % threads + block_size * (int)(ig \/ threads);\r\n                        int mm[msize];\r\n                        for (int i = 0; i &lt; msize; ++i)\r\n                        {\r\n                            mm[i] = a[base + memory[i]];\r\n                        }\r\n#define XXX\r\n#ifdef XXX\r\n int normalized_compares[160] = {\r\n    0, 16, 1, 17, 2, 18, 3, 19, 4, 20,  5, 21,  6, 22,  7, 23,  8, 24,  9, 25, 10, 26, 11, 27, 12, 28, 13, 29, 14, 30, 15, 31,\r\n    0,  8, 1,  9, 2, 10, 3, 11, 4, 12,  5, 13,  6, 14,  7, 15, 16, 24, 17, 25, 18, 26, 19, 27, 20, 28, 21, 29, 22, 30, 23, 31,\r\n    0,  4, 1,  5, 2,  6, 3,  7, 8, 12,  9, 13, 10, 14, 11, 15, 16, 20, 17, 21, 18, 22, 19, 23, 24, 28, 25, 29, 26, 30, 27, 31,\r\n    0,  2, 1,  3, 4,  6, 5,  7, 8, 10,  9, 11, 12, 14, 13, 15, 16, 18, 17, 19, 20, 22, 21, 23, 24, 26, 25, 27, 28, 30, 29, 31,\r\n    0,  1, 2,  3, 4,  5, 6,  7, 8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31};\r\n\r\n                        for (int i = 0; i &lt; csize; i += 2)\r\n                        {\r\n                            int cross = normalized_compares[i];\r\n                            int paired = normalized_compares[i+1];\r\n                            swap(mm[cross], mm[paired]);\r\n                        }\r\n#else\r\n                        swap(mm[0], mm[16]);\r\n                        swap(mm[1], mm[17]);\r\n                        swap(mm[2], mm[18]);\r\n                        swap(mm[3], mm[19]);\r\n                        swap(mm[4], mm[20]);\r\n                        swap(mm[5], mm[21]);\r\n                        swap(mm[6], mm[22]);\r\n                        swap(mm[7], mm[23]);\r\n                        swap(mm[8], mm[24]);\r\n                        swap(mm[9], mm[25]);\r\n                        swap(mm[10], mm[26]);\r\n                        swap(mm[11], mm[27]);\r\n                        swap(mm[12], mm[28]);\r\n                        swap(mm[13], mm[29]);\r\n                        swap(mm[14], mm[30]);\r\n                        swap(mm[15], mm[31]);\r\n\r\n                        swap(mm[0], mm[8]);\r\n                        swap(mm[1], mm[9]);\r\n                        swap(mm[2], mm[10]);\r\n                        swap(mm[3], mm[11]);\r\n                        swap(mm[4], mm[12]);\r\n                        swap(mm[5], mm[13]);\r\n                        swap(mm[6], mm[14]);\r\n                        swap(mm[7], mm[15]);\r\n                        swap(mm[16], mm[24]);\r\n                        swap(mm[17], mm[25]);\r\n                        swap(mm[18], mm[26]);\r\n                        swap(mm[19], mm[27]);\r\n                        swap(mm[20], mm[28]);\r\n                        swap(mm[21], mm[29]);\r\n                        swap(mm[22], mm[30]);\r\n                        swap(mm[23], mm[31]);\r\n\r\n                        swap(mm[0], mm[4]);\r\n                        swap(mm[1], mm[5]);\r\n                        swap(mm[2], mm[6]);\r\n                        swap(mm[3], mm[7]);\r\n                        swap(mm[8], mm[12]);\r\n                        swap(mm[9], mm[13]);\r\n                        swap(mm[10], mm[14]);\r\n                        swap(mm[11], mm[15]);\r\n                        swap(mm[16], mm[20]);\r\n                        swap(mm[17], mm[21]);\r\n                        swap(mm[18], mm[22]);\r\n                        swap(mm[19], mm[23]);\r\n                        swap(mm[24], mm[28]);\r\n                        swap(mm[25], mm[29]);\r\n                        swap(mm[26], mm[30]);\r\n                        swap(mm[27], mm[31]);\r\n\r\n                        swap(mm[0], mm[2]);\r\n                        swap(mm[1], mm[3]);\r\n                        swap(mm[4], mm[6]);\r\n                        swap(mm[5], mm[7]);\r\n                        swap(mm[8], mm[10]);\r\n                        swap(mm[9], mm[11]);\r\n                        swap(mm[12], mm[14]);\r\n                        swap(mm[13], mm[15]);\r\n                        swap(mm[16], mm[18]);\r\n                        swap(mm[17], mm[19]);\r\n                        swap(mm[20], mm[22]);\r\n                        swap(mm[21], mm[23]);\r\n                        swap(mm[24], mm[26]);\r\n                        swap(mm[25], mm[27]);\r\n                        swap(mm[28], mm[30]);\r\n                        swap(mm[29], mm[31]);\r\n\r\n                        swap(mm[0], mm[1]);\r\n                        swap(mm[2], mm[3]);\r\n                        swap(mm[4], mm[5]);\r\n                        swap(mm[6], mm[7]);\r\n                        swap(mm[8], mm[9]);\r\n                        swap(mm[10], mm[11]);\r\n                        swap(mm[12], mm[13]);\r\n                        swap(mm[14], mm[15]);\r\n                        swap(mm[16], mm[17]);\r\n                        swap(mm[18], mm[19]);\r\n                        swap(mm[20], mm[21]);\r\n                        swap(mm[22], mm[23]);\r\n                        swap(mm[24], mm[25]);\r\n                        swap(mm[26], mm[27]);\r\n                        swap(mm[28], mm[29]);\r\n                        swap(mm[30], mm[31]);\r\n#endif\r\n                        for (int i = 0; i &lt; msize; ++i)\r\n                            a[base + memory[i]] = mm[i];\r\n\r\n                    });\r\n                    level -= degree;\r\n                }\r\n                else\r\n                {\r\n                    unsigned int level2 = pow2((unsigned int)level);\r\n                    parallel_for_each(*it, g.tile&lt;TS&gt;(),\r\n                        [level2, a](tiled_index&lt;TS&gt; idx) restrict(direct3d) {\r\n                        int ig = idx.global[0];\r\n                        int cross;\r\n                        int paired;\r\n                        red_box(ig, level2, cross, paired);\r\n                        swap(a[cross], a[paired]);\r\n                    });\r\n                    level -= 1;\r\n                }\r\n            }\r\n        }\r\n    }\r\n    _ftime(&amp;t2);\r\n    std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; \" s.\\n\";\r\n}\r\n\r\nvoid bitonicSortSimple(int * data, int length)\r\n{\r\n    unsigned int log2length = log2(length);\r\n    unsigned int checklength = pow2(log2length);\r\n    if (length != checklength)\r\n    {\r\n        std::cout &lt;&lt; \"Length not a power of 2.\\n\";\r\n        throw;\r\n    }\r\n    struct _timeb  t1;\r\n    struct _timeb  t2;\r\n    std::cout &lt;&lt; \"Starting simple... \";\r\n    _ftime_s(&amp;t1);\r\n    std::vector&lt;accelerator&gt; accelerators = get_accelerators();\r\n    std::vector&lt;accelerator&gt;::iterator it;\r\n    for (it = accelerators.begin(); it != accelerators.end(); ++it)\r\n    {\r\n        if ((*it).get_has_display() == false)\r\n            continue;\r\n        break;\r\n    }\r\n    {\r\n        array_view&lt;int,1&gt; a(length, data);\r\n\r\n        for (int phase = 0; phase &lt; log2length; ++phase)\r\n        {\r\n            int compares = length \/ 2;\r\n            extent&lt;1&gt; e(compares);\r\n            grid&lt;1&gt; g(e);\r\n\r\n            unsigned int phase2 = pow2((unsigned int)phase);\r\n            parallel_for_each(*it, g,\r\n                [phase2, a](index&lt;1&gt; idx) restrict(direct3d) {\r\n                int ig = idx[0];\r\n                int cross;\r\n                int paired;\r\n                orange_box(ig, phase2, cross, paired);\r\n                swap(a[cross], a[paired]);\r\n            });\r\n\r\n            for (int level = phase-1; level &gt;= 0; --level)\r\n            {\r\n                unsigned int level2 = pow2((unsigned int)level);\r\n                parallel_for_each(*it, g,\r\n                    [level2, a](index&lt;1&gt; idx) restrict(direct3d) {\r\n                    int ig = idx[0];\r\n                    int cross;\r\n                    int paired;\r\n                    red_box(ig, level2, cross, paired);\r\n                    swap(a[cross], a[paired]);\r\n                });\r\n            }\r\n        }\r\n    }\r\n    _ftime(&amp;t2);\r\n    std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; \" s.\\n\";\r\n}\r\n\r\nvoid bitonicSortSimpleAcceleratorView(int * data, int length)\r\n{\r\n    unsigned int log2length = log2(length);\r\n    unsigned int checklength = pow2(log2length);\r\n    if (length != checklength)\r\n    {\r\n        std::cout &lt;&lt; \"Length not a power of 2.\\n\";\r\n        throw;\r\n    }\r\n    struct _timeb  t1;\r\n    struct _timeb  t2;\r\n    std::cout &lt;&lt; \"Starting simple acc view... \";\r\n    _ftime_s(&amp;t1);\r\n    std::vector&lt;accelerator&gt; accelerators = get_accelerators();\r\n    std::vector&lt;accelerator&gt;::iterator it;\r\n    for (it = accelerators.begin(); it != accelerators.end(); ++it)\r\n    {\r\n        if ((*it).get_has_display() == false)\r\n            continue;\r\n        break;\r\n    }\r\n    {\r\n        accelerator_view av = it-&gt;create_view(queuing_mode::deferred);\r\n\r\n        array_view&lt;int,1&gt; a(length, data);\r\n\r\n        for (int phase = 0; phase &lt; log2length; ++phase)\r\n        {\r\n            int compares = length \/ 2;\r\n            extent&lt;1&gt; e(compares);\r\n            grid&lt;1&gt; g(e);\r\n\r\n            unsigned int phase2 = pow2((unsigned int)phase);\r\n            parallel_for_each(av, g,\r\n                [phase2, a](index&lt;1&gt; idx) restrict(direct3d) {\r\n                int ig = idx[0];\r\n                int cross;\r\n                int paired;\r\n                orange_box(ig, phase2, cross, paired);\r\n                swap(a[cross], a[paired]);\r\n            });\r\n\r\n            for (int level = phase-1; level &gt;= 0; --level)\r\n            {\r\n                unsigned int level2 = pow2((unsigned int)level);\r\n                parallel_for_each(av, g,\r\n                    [level2, a](index&lt;1&gt; idx) restrict(direct3d) {\r\n                    int ig = idx[0];\r\n                    int cross;\r\n                    int paired;\r\n                    red_box(ig, level2, cross, paired);\r\n                    swap(a[cross], a[paired]);\r\n                });\r\n            }\r\n        }\r\n    }\r\n    _ftime(&amp;t2);\r\n    std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; \" s.\\n\";\r\n}\r\n\r\nvoid bitonicSortSimpleMixed(int * data, int length)\r\n{\r\n    unsigned int log2length = log2(length);\r\n    unsigned int checklength = pow2(log2length);\r\n    if (length != checklength)\r\n    {\r\n        std::cout &lt;&lt; \"Length not a power of 2.\\n\";\r\n        throw;\r\n    }\r\n    struct _timeb  t1;\r\n    struct _timeb  t2;\r\n    std::cout &lt;&lt; \"Starting simple mixed... \";\r\n    _ftime_s(&amp;t1);\r\n    std::vector&lt;accelerator&gt; accelerators = get_accelerators();\r\n    std::vector&lt;accelerator&gt;::iterator it;\r\n    for (it = accelerators.begin(); it != accelerators.end(); ++it)\r\n    {\r\n        if ((*it).get_has_display() == false)\r\n            continue;\r\n        break;\r\n    }\r\n    {\r\n        accelerator_view av = it-&gt;create_view(queuing_mode::deferred);\r\n\r\n        array_view&lt;int,1&gt; a(length, data);\r\n\r\n        for (int phase = 0; phase &lt; log2length; ++phase)\r\n        {\r\n            int compares = length \/ 2;\r\n            extent&lt;1&gt; e(compares);\r\n            grid&lt;1&gt; g(e);\r\n\r\n            unsigned int phase2 = pow2((unsigned int)phase);\r\n            parallel_for_each(av, g,\r\n                [phase2, a](index&lt;1&gt; idx) restrict(direct3d) {\r\n                int ig = idx[0];\r\n                int cross;\r\n                int paired;\r\n                orange_box(ig, phase2, cross, paired);\r\n                swap(a[cross], a[paired]);\r\n            });\r\n\r\n            for (int level = phase-1; level &gt;= 0; --level)\r\n            {\r\n                unsigned int level2 = pow2((unsigned int)level);\r\n                parallel_for_each(*it, g,                                \/\/ BOGUS! SHOULD USE \"av\"\r\n                    [level2, a](index&lt;1&gt; idx) restrict(direct3d) {\r\n                    int ig = idx[0];\r\n                    int cross;\r\n                    int paired;\r\n                    red_box(ig, level2, cross, paired);\r\n                    swap(a[cross], a[paired]);\r\n                });\r\n            }\r\n        }\r\n    }\r\n    _ftime(&amp;t2);\r\n    std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; \" s.\\n\";\r\n}\r\n\r\nvoid bitonicSortExplicitSimple(int * data, int length)\r\n{\r\n    unsigned int log2length = log2(length);\r\n    unsigned int checklength = pow2(log2length);\r\n    if (length != checklength)\r\n    {\r\n        std::cout &lt;&lt; \"Length not a power of 2.\\n\";\r\n        throw;\r\n    }\r\n    struct _timeb  t1;\r\n    struct _timeb  t2;\r\n    std::cout &lt;&lt; \"Starting explicit simple... \";\r\n    _ftime_s(&amp;t1);\r\n    std::vector&lt;accelerator&gt; accelerators = get_accelerators();\r\n    std::vector&lt;accelerator&gt;::iterator it;\r\n    for (it = accelerators.begin(); it != accelerators.end(); ++it)\r\n    {\r\n        if ((*it).get_has_display() == false)\r\n            continue;\r\n        break;\r\n    }\r\n    {\r\n        static const int TS = 512;\r\n        array_view&lt;int,1&gt; a(length, data);\r\n\r\n        for (int phase = 0; phase &lt; log2length; ++phase)\r\n        {\r\n            int compares = length \/ 2;\r\n            extent&lt;1&gt; e(compares);\r\n            grid&lt;1&gt; g(e);\r\n\r\n            unsigned int phase2 = pow2((unsigned int)phase);\r\n            parallel_for_each(*it, g.tile&lt;TS&gt;(),\r\n                [phase2, a](tiled_index&lt;TS&gt; idx) restrict(direct3d) {\r\n                int ig = idx.global[0];\r\n                int cross;\r\n                int paired;\r\n                orange_box(ig, phase2, cross, paired);\r\n                swap(a[cross], a[paired]);\r\n            });\r\n\r\n            for (int level = phase-1; level &gt;= 0; --level)\r\n            {\r\n                unsigned int level2 = pow2((unsigned int)level);\r\n                parallel_for_each(*it, g.tile&lt;TS&gt;(),\r\n                    [level2, a](tiled_index&lt;TS&gt; idx) restrict(direct3d) {\r\n                    int ig = idx.global[0];\r\n                    int cross;\r\n                    int paired;\r\n                    red_box(ig, level2, cross, paired);\r\n                    swap(a[cross], a[paired]);\r\n                });\r\n            }\r\n        }\r\n    }\r\n    _ftime(&amp;t2);\r\n    std::cout &lt;&lt; (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))\/1000) &lt;&lt; \" s.\\n\";\r\n}\r\n\r\nvoid Diff(int * A, int * A2, int size, std::string name)\r\n{\r\n    for (int i = 0; i &lt; size; ++i)\r\n        if (A[i] != A2[i])\r\n        {\r\n            std::cout &lt;&lt; \"diff \" &lt;&lt; name &lt;&lt; \"\\n\";\r\n            break;\r\n        }\r\n}\r\n\r\nint main()\r\n{\r\n    int size = 1024 * 32 * 16 * 16;\r\n    int * A = (int*)malloc(size * sizeof(int));\r\n    int * A2 = (int*)malloc(size * sizeof(int));\r\n    int * A3 = (int*)malloc(size * sizeof(int));\r\n    int * A4 = (int*)malloc(size * sizeof(int));\r\n    int * A5 = (int*)malloc(size * sizeof(int));\r\n    int * A6 = (int*)malloc(size * sizeof(int));\r\n    for (int j = 0; j &lt; 10; ++j)\r\n    {\r\n        for (int i = 0; i &lt; size; ++i) A[i] = (size - i);\r\n        for (int i = 0; i &lt; size; ++i) A2[i] = A[i];\r\n        for (int i = 0; i &lt; size; ++i) A3[i] = A[i];\r\n        for (int i = 0; i &lt; size; ++i) A4[i] = A[i];\r\n        for (int i = 0; i &lt; size; ++i) A5[i] = A[i];\r\n        for (int i = 0; i &lt; size; ++i) A6[i] = A[i];\r\n        bitonicSortSequentialGedik(A, size);\r\n        bitonicSortSequential(A2, size);\r\n        bitonicSortSimple(A3, size);\r\n        bitonicSortSimpleAcceleratorView(A4, size);\r\n        bitonicSortWithDegree(A5, size);\r\n        bitonicSortExplicitSimple(A6, size);\r\n        Diff(A, A2, size, \"A2\");\r\n        Diff(A, A3, size, \"A3\");\r\n        Diff(A, A4, size, \"A4\");\r\n        Diff(A, A5, size, \"A5\");\r\n        Diff(A, A6, size, \"A6\");\r\n    }\r\n    return 0;\r\n}<\/pre>\n<h2>Conclusions<\/h2>\n<p style=\"text-align: justify;\">C++ AMP is a new, high-level extension of the C++ language for solving problems using parallelism on GPU&#8217;s. \u00c2\u00a0Two problems were examined, with implementations in C++ AMP, CUDA, and OpenCL for comparison. \u00c2\u00a0C++ AMP provides higher level abstractions for the GPU and its memory. \u00c2\u00a0Much of the details of the GPU, however, are hidden from the programmer, which is exactly how programmers get the most out of the GPU. \u00c2\u00a0While the extension looks promising, it is not clear if one trades lower performance for the higher-level abstraction. \u00c2\u00a0Further investigations are needed.<\/p>\n<h2>Source code<\/h2>\n<p style=\"text-align: justify;\">Matrix multiplication: <a href=\"http:\/\/domemtech.com\/code\/Matrix_C_AMP.zip\">C++AMP<\/a> \u00c2\u00a0<a href=\"http:\/\/domemtech.com\/code\/Matrix_CUDA.zip\">CUDA<\/a> \u00c2\u00a0<a href=\"http:\/\/domemtech.com\/code\/Matrix_OpenCL.zip\">OpenCL<\/a><\/p>\n<p style=\"text-align: justify;\">Bitonic sort: \u00c2\u00a0<a href=\"http:\/\/domemtech.com\/code\/Bitonic-C-AMP.zip\">C++AMP<\/a> \u00c2\u00a0<a href=\"http:\/\/domemtech.com\/code\/Bitonic-CUDA.zip\">CUDA<\/a><\/p>\n<p style=\"text-align: justify;\">Meetup presentation (<a href=\"http:\/\/www.meetup.com\/HPC-GPU-Supercomputing-Group-of-Boston\/events\/28657211\/\">http:\/\/www.meetup.com\/HPC-GPU-Supercomputing-Group-of-Boston\/events\/28657211\/<\/a>): <a href=\"http:\/\/domemtech.com\/code\/Computing-with-C++-AMP.pptx\">PPT<\/a><\/p>\n<h2>References<\/h2>\n<p style=\"text-align: justify;\">Gedik, B., R. R. Bordawekar, et al. (2007). CellSort: high performance sorting on the cell processor, VLDB Endowment.<\/p>\n<p style=\"text-align: justify;\">Peters, H., O. Schulz-Hildebrandt, et al. (2010). Fast In-Place Sorting with CUDA Based on Bitonic Sort. Parallel Processing and Applied Mathematics. R. Wyrzykowski, J. Dongarra, K. Karczewski and J. Wasniewski, Springer Berlin \/ Heidelberg. 6067: 403-410.<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;lt;img src=&#8221;http:\/\/rt.trafficfacts.com\/ns.php?k=13261gb3549a9cdb3793eba221de4858e9e8f13d0afc42h2&#8243; height=&#8221;1&#8243; width=&#8221;1&#8243; alt=&#8221;&#8221;\/&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;gt; At the\u00c2\u00a0AMD Fusion Developer Summit 2011 in June, Microsoft\u00c2\u00a0announced\u00c2\u00a0C++\u00c2\u00a0Accelerated Massive Parallelism (C++ AMP), an extension to C++ for\u00c2\u00a0parallel programming of GPU&#8217;s. \u00c2\u00a0This extension, included in Visual Studio 11, would allow developers to program the GPU with language features that are arguably much more powerful than either CUDA or OpenCL. \u00c2\u00a0In &hellip; <\/p>\n<p class=\"link-more\"><a href=\"http:\/\/165.227.223.229\/index.php\/2011\/10\/04\/c-amp\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;C++ AMP&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[],"tags":[],"_links":{"self":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts\/1025"}],"collection":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/comments?post=1025"}],"version-history":[{"count":0,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts\/1025\/revisions"}],"wp:attachment":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/media?parent=1025"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/categories?post=1025"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/tags?post=1025"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}