{"id":1211,"date":"2012-03-14T06:21:39","date_gmt":"2012-03-14T10:21:39","guid":{"rendered":"http:\/\/codinggorilla.domemtech.com\/?p=1211"},"modified":"2012-03-14T07:34:22","modified_gmt":"2012-03-14T11:34:22","slug":"syntactic-sugar-with-c-amp","status":"publish","type":"post","link":"http:\/\/165.227.223.229\/index.php\/2012\/03\/14\/syntactic-sugar-with-c-amp\/","title":{"rendered":"Syntactic sugar with C++ AMP"},"content":{"rendered":"<p style=\"text-align: justify;\">In CUDA, OpenCL, and C++ AMP, a <em>group<\/em> is a collection of threads that execute in parallel in lock-step fashion. In CUDA, it is called a <em>block<\/em>; in OpenCL, it is called a <em>work-group<\/em>; in C++ AMP, it is called a <em>tile<\/em>. The purpose of a group is to allow threads within the group to communicate with each other using synchronization and\/or shared memory. The size of thread groups is set by the programmer, but hardware constraints limit the maximum size to 512 or 1024. 
While programmers usually need to tailor algorithms to be aware of thread groups, there are a few tricks that can make programming easier.<\/p>\n<p style=\"text-align: justify;\"><!--more--><\/p>\n<h1>An example<\/h1>\n<p>Let&#8217;s take a concrete example to illustrate some C++ AMP programming tricks: initialize an array of integers with the index of each element in the array.<\/p>\n<ul>\n<li>input: <em>A<\/em> = array[0 .. <em>n<\/em>-1] of integers, where <em>A<\/em>[<em>k<\/em>] = 0 for 0 \u2264 <em>k<\/em> \u2264 <em>n<\/em>-1;<\/li>\n<li>output: <em>A<\/em> = array[0 .. <em>n<\/em>-1] of integers, where <em>A<\/em>[<em>k<\/em>] = <em>k<\/em> for 0 \u2264 <em>k<\/em> \u2264 <em>n<\/em>-1.<\/li>\n<\/ul>\n<h1>Switching from a non-tiled to a tiled solution<\/h1>\n<p style=\"text-align: justify;\">In C++ AMP, programmers can specify non-tiled solutions for problems that do not require synchronization. The initialization problem does not require a tile, so a non-tiled solution is possible.<\/p>\n<pre>    int n = 1000000;\r\n    std::vector&lt;int&gt; host_data(n);\r\n    array_view&lt;int, 1&gt; a(n, &amp;host_data[0]);\r\n    parallel_for_each(extent&lt;1&gt;(n), [=](index&lt;1&gt; idx) restrict(amp)\r\n    {\r\n        a[idx] = idx[0];\r\n    });\r\n    a.synchronize();<\/pre>\n<p style=\"text-align: justify;\">However, if the programmer wants to optimize performance, he will very likely need to switch to a tiled solution. 
When making the switch, the extent&lt;&gt;, index&lt;&gt;, and kernel of the <em>parallel_for_each<\/em> statement must be changed:<\/p>\n<pre>    int n = 1000000;\r\n    std::vector&lt;int&gt; host_data(n);\r\n    array_view&lt;int, 1&gt; a(n, &amp;host_data[0]);\r\n    parallel_for_each(extent&lt;1&gt;(n).tile&lt;1000&gt;(), [=](tiled_index&lt;1000&gt; idx) restrict(amp)\r\n    {\r\n        a[idx] = idx.global[0];\r\n    });\r\n    a.synchronize();<\/pre>\n<p style=\"text-align: justify;\">But there is an easier way to change to a tiled implementation without changing any of the parameters of the <em>parallel_for_each<\/em>: wrap the call in a templated <em>tiled_parallel_for_each<\/em>.<\/p>\n<pre>template &lt;typename Kernel&gt;\r\nvoid tiled_parallel_for_each(extent&lt;1&gt; e, Kernel kernel)\r\n{\r\n    auto krn = [=] (tiled_index&lt;1000&gt; idx) restrict(amp)\r\n    {\r\n        index&lt;1&gt; i;\r\n        i[0] = idx.global[0];\r\n        kernel(i);\r\n    };\r\n    concurrency::parallel_for_each(e.tile&lt;1000&gt;().pad(), krn);\r\n}\r\n\r\n...\r\n\r\n    int n = 1000000;\r\n    std::vector&lt;int&gt; host_data(n);\r\n    array_view&lt;int, 1&gt; a(n, &amp;host_data[0]);\r\n    tiled_parallel_for_each(extent&lt;1&gt;(n), [=](index&lt;1&gt; idx) restrict(amp)\r\n    {\r\n        a[idx] = idx[0];\r\n    });\r\n    a.synchronize();<\/pre>\n<h1 style=\"text-align: justify;\">Padding thread groups<\/h1>\n<p style=\"text-align: justify;\">When programming a tiled solution using CUDA, OpenCL, or C++ AMP, the programmer must use an integral number of groups. In other words, the total number of threads must be an integer multiple of the group size. What happens if the desired number of threads is not a multiple of the group size?<\/p>\n<p style=\"text-align: justify;\">If <em>n<\/em> = 1000000, what is the size of the tile (<em>ts<\/em>), and how many tiles are there, in order to have each thread set one element of the array? 
That depends on the size of the tile, of course. If <em>ts<\/em> = 1000, then 1000 tiles would be needed. If <em>ts<\/em> = 500, then 2000 tiles would be needed.<\/p>\n<p style=\"text-align: justify;\">In C++ AMP, the solution is very simple:<\/p>\n<pre>    int n = 1000000;\r\n    std::vector&lt;int&gt; host_data(n);\r\n    array_view&lt;int, 1&gt; a(n, &amp;host_data[0]);\r\n    parallel_for_each(extent&lt;1&gt;(n).tile&lt;1000&gt;(), [=](tiled_index&lt;1000&gt; idx) restrict(amp)\r\n    {\r\n        a[idx] = idx.global[0];\r\n    });\r\n    a.synchronize();<\/pre>\n<p style=\"text-align: justify;\">But if <em>n<\/em> = 104729 (a prime number), there is no tile size (other than 104729 or 1) which allows an integral number of tiles to cover the problem size. A tile size larger than 1024 is not possible, and a tile size of 1 uses the GPU inefficiently. The solution is to pad the tiled extent so that the tiles cover <strong><em>at least<\/em><\/strong> <em>n<\/em> integers. Unfortunately, the kernel must then have guards inserted so that it does not write past the end of the array.<\/p>\n<pre>    int n = 104729;\r\n    std::vector&lt;int&gt; host_data(n);\r\n    array_view&lt;int, 1&gt; a(n, &amp;host_data[0]);\r\n    parallel_for_each(extent&lt;1&gt;(n).tile&lt;1000&gt;().pad(), [=](tiled_index&lt;1000&gt; idx) restrict(amp)\r\n    {\r\n        if (idx.global[0] &lt; a.extent.size())\r\n            a[idx] = idx.global[0];\r\n    });\r\n    a.synchronize();<\/pre>\n<p style=\"text-align: justify;\">But, as we will see below, it is possible to add the guards automatically by creating a new memory class.<\/p>\n<h1>Automatic guards<\/h1>\n<p style=\"text-align: justify;\">When padding an extent for a tiling, the kernel must change to make sure access to global memory does not go out of bounds. 
One way to do this is to have the programmer add bounds checking to the kernel. Another way is to create a class derived from array_view&lt;&gt; that performs the check automatically by overriding the []-operator. The version below clamps out-of-range indices (which padded threads produce) to the last valid element; this keeps every access inside the view, but padded threads then redundantly access that element, so the trick is only appropriate when such accesses are harmless.<\/p>\n<pre>template &lt;typename T, int R&gt;\r\nclass xarray_view\r\n    : public array_view&lt;T, R&gt;\r\n{\r\npublic:\r\n    xarray_view(int _E0, T * _Src)\r\n        : array_view&lt;T, R&gt;(_E0, _Src)\r\n    {\r\n    }\r\n\r\npublic:\r\n    T &amp; operator[] (const index&lt;R&gt; &amp; idx) const restrict(amp, cpu)\r\n    {\r\n        // Clamp indices introduced by pad() to the last valid element\r\n        // in each dimension, so the access stays within the view.\r\n        index&lt;R&gt; i = idx;\r\n        for (int d = 0; d &lt; R; d++)\r\n            if (i[d] &gt;= this-&gt;extent[d])\r\n                i[d] = this-&gt;extent[d] - 1;\r\n        return array_view&lt;T, R&gt;::operator[](i);\r\n    }\r\n};\r\n\r\n...\r\n\r\n    int n = 1000000 - 1;\r\n    std::vector&lt;int&gt; host_data(n);\r\n    xarray_view&lt;int, 1&gt; a(n, &amp;host_data[0]);\r\n    tiled_parallel_for_each(extent&lt;1&gt;(n), [=](index&lt;1&gt; idx) restrict(amp)\r\n    {\r\n        a[idx] = idx[0];\r\n    });\r\n    a.synchronize();<\/pre>\n<h1>Linearizing a tiled_index&lt;&gt;<\/h1>\n<p style=\"text-align: justify;\">If a problem space is very large, tiling cannot be done in only one dimension. In this situation, programmers need to use a multi-dimensional tile, and convert tiled coordinates into a linear space. One solution is to assume a 2-dimensional square matrix. The size of each side of the matrix is the square root of the problem size. 
To linearize the space, the index is calculated from the 2-dimensional coordinates using a row-major format.<\/p>\n<pre>template &lt;typename Kernel&gt;\r\nvoid xtiled_parallel_for_each(int ext_size, Kernel kernel)\r\n{\r\n    // Round up so that size * size &gt;= ext_size.\r\n    int size = (int)ceil(sqrt((double)ext_size));\r\n\r\n    auto krn = [=] (tiled_index&lt;16, 16&gt; idx) restrict(amp)\r\n    {\r\n        kernel(idx.global[0] * size + idx.global[1]);\r\n    };\r\n    concurrency::parallel_for_each(extent&lt;2&gt;(size, size).tile&lt;16, 16&gt;().pad(), krn);\r\n}\r\n\r\n...\r\n\r\n    int n = 1024 * 1024;\r\n    std::vector&lt;int&gt; host_data(n);\r\n    array_view&lt;int, 1&gt; a(n, &amp;host_data[0]);\r\n    xtiled_parallel_for_each(n, [=](int idx) restrict(amp)\r\n    {\r\n        a[idx] = idx;\r\n    });\r\n    a.synchronize();<\/pre>\n<h1>Putting it all together<\/h1>\n<p style=\"text-align: justify;\">All the tricks we have seen can be composed. Together, they are great helpers for C++ AMP.<\/p>\n<pre>template &lt;typename Kernel&gt;\r\nvoid tiled_parallel_for_each(int ext_size, Kernel kernel)\r\n{\r\n    auto krn = [=] (tiled_index&lt;1000&gt; idx) restrict(amp)\r\n    {\r\n        kernel(idx.global[0]);\r\n    };\r\n    concurrency::parallel_for_each(extent&lt;1&gt;(ext_size).tile&lt;1000&gt;().pad(), krn);\r\n}\r\n\r\ntemplate &lt;typename Kernel&gt;\r\nvoid xtiled_parallel_for_each(int ext_size, Kernel kernel)\r\n{\r\n    // Round up so that size * size &gt;= ext_size.\r\n    int size = (int)ceil(sqrt((double)ext_size));\r\n\r\n    auto krn = [=] (tiled_index&lt;16, 16&gt; idx) restrict(amp)\r\n    {\r\n        kernel(idx.global[0] * size + idx.global[1]);\r\n    };\r\n    concurrency::parallel_for_each(extent&lt;2&gt;(size, size).tile&lt;16, 16&gt;().pad(), krn);\r\n}\r\n\r\ntemplate &lt;typename T, int R&gt;\r\nclass xarray_view\r\n    : public array_view&lt;T, R&gt;\r\n{\r\npublic:\r\n    xarray_view(int _E0, T * _Src)\r\n        : array_view&lt;T, R&gt;(_E0, _Src)\r\n    {\r\n    }\r\n\r\npublic:\r\n    T &amp; operator[] (const index&lt;R&gt; &amp; idx) const restrict(amp, cpu)\r\n    {\r\n        // Clamp indices introduced by pad() to the last valid element\r\n        // in each dimension, so the access stays within the view.\r\n        index&lt;R&gt; i = idx;\r\n        for (int d = 0; d &lt; R; d++)\r\n            if (i[d] &gt;= this-&gt;extent[d])\r\n                i[d] = this-&gt;extent[d] - 1;\r\n        return array_view&lt;T, R&gt;::operator[](i);\r\n    
}\r\n};<\/pre>\n<h1>References<\/h1>\n<p><a href=\"http:\/\/blogs.msdn.com\/b\/nativeconcurrency\/archive\/2012\/02\/26\/tiled-extent-pad-in-c-amp.aspx\">http:\/\/blogs.msdn.com\/b\/nativeconcurrency\/archive\/2012\/02\/26\/tiled-extent-pad-in-c-amp.aspx<\/a><\/p>\n<p><a href=\"http:\/\/blogs.msdn.com\/b\/nativeconcurrency\/archive\/2012\/01\/24\/simplifying-single-dimensional-c-amp-code.aspx\">http:\/\/blogs.msdn.com\/b\/nativeconcurrency\/archive\/2012\/01\/24\/simplifying-single-dimensional-c-amp-code.aspx<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In CUDA, OpenCL, and C++ AMP, a group is a collection of threads that execute in parallel in lock-step fashion. In CUDA, it is called a block; in OpenCL, it is called a work-group; in C++ AMP, it is called a tile. The purpose of a group is to allow threads within the group to communicate with &hellip; <\/p>\n<p class=\"link-more\"><a href=\"http:\/\/165.227.223.229\/index.php\/2012\/03\/14\/syntactic-sugar-with-c-amp\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Syntactic sugar with C++ 
AMP&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[],"tags":[],"_links":{"self":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts\/1211"}],"collection":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/comments?post=1211"}],"version-history":[{"count":0,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts\/1211\/revisions"}],"wp:attachment":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/media?parent=1211"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/categories?post=1211"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/tags?post=1211"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}