In CUDA, OpenCL, and C++ AMP, a group is a collection of threads that execute concurrently on the same compute unit. In CUDA it is called a block; in OpenCL, a work-group; in C++ AMP, a tile. The purpose of a group is to allow the threads within it to communicate with each other using barrier synchronization and/or fast shared memory. The size of a thread group is set by the programmer, but hardware constraints limit the maximum size, typically to 512 or 1024 threads. Programmers usually need to tailor their algorithms around thread groups, but a few tricks can make the programming easier.
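For example, here is a minimal C++ AMP sketch of intra-tile communication (the array and tile sizes, and the neighbor_example wrapper, are arbitrary choices for illustration): each thread writes one slot of tile_static shared memory, waits at a barrier, and then reads the slot written by its neighbor.

#include <amp.h>
#include <vector>
using namespace concurrency;

void neighbor_example()
{
    std::vector<int> host_data(1024);
    array_view<int, 1> out(1024, &host_data[0]);
    parallel_for_each(extent<1>(1024).tile<256>(),
        [=](tiled_index<256> idx) restrict(amp)
    {
        tile_static int shared[256];           // memory shared by all 256 threads of the tile
        shared[idx.local[0]] = idx.global[0];  // each thread fills its own slot
        idx.barrier.wait();                    // no thread proceeds until every slot is written
        out[idx.global] = shared[(idx.local[0] + 1) % 256];  // read a neighbor's value
    });
    out.synchronize();
}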
An example
Let’s take a concrete example to illustrate some C++ AMP programming tricks: initialize each element of an integer array with its own index (a one-line CPU reference follows the specification).
- input: A = array[0 .. n-1] of integers, where A[k] = 0, for 0 ≤ k ≤ n-1;
- output: A = array[0 .. n-1] of integers, where A[k] = k, for 0 ≤ k ≤ n-1.
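For reference, the same computation is a one-liner on the CPU; every GPU version below must reproduce this result:

std::vector<int> A(n);
std::iota(A.begin(), A.end(), 0);  // from <numeric>: A[k] = k, for 0 <= k <= n-1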
Switching from a non-tiled to a tiled solution
In C++ AMP, programmers can write non-tiled solutions for problems that do not require synchronization. The initialization problem needs no inter-thread communication, so a non-tiled solution is possible:
int n = 1000000;
std::vector<int> host_data(n);
array_view<int, 1> a(n, &host_data[0]);
parallel_for_each(extent<1>(n), [=](index<1> idx) restrict(amp)
{
    a[idx] = idx[0];
});
a.synchronize();
However, a programmer who wants to optimize performance will very likely need to switch to a tiled solution. When making the switch, the extent<>, the index<>, and the kernel of the parallel_for_each statement must all be changed:
int n = 1000000;
std::vector<int> host_data(n);
array_view<int, 1> a(n, &host_data[0]);
parallel_for_each(extent<1>(n).tile<1000>(),
    [=](tiled_index<1000> idx) restrict(amp)
{
    a[idx.global] = idx.global[0];
});
a.synchronize();
But there is an easier way to switch to a tiled implementation without changing any of the arguments of the parallel_for_each: wrap it in a templated tiled_parallel_for_each.
template <typename Kernel>
void tiled_parallel_for_each(extent<1> e, Kernel kernel)
{
    // Wrap the user's kernel so that it still receives a plain index<1>,
    // exactly as in the non-tiled version. (Here the size of e must be a
    // multiple of the tile size; padding is covered in the next section.)
    auto krn = [=](tiled_index<1000> idx) restrict(amp)
    {
        index<1> i(idx.global[0]);
        kernel(i);
    };
    concurrency::parallel_for_each(e.tile<1000>(), krn);
}
...
int n = 1000000;
std::vector<int> host_data(n);
array_view<int, 1> a(n, &host_data[0]);
tiled_parallel_for_each(extent<1>(n), [=](index<1> idx) restrict(amp)
{
    a[idx] = idx[0];
});
a.synchronize();
Padding thread groups
When programming a tiled solution in CUDA, OpenCL, or C++ AMP, the programmer must launch a whole number of groups. In other words, the total number of threads is an integer multiple of the group size. What happens if the number of elements to process is not a multiple of the group size?
If n = 1000000, what tile size (ts) should be chosen, and how many tiles are needed, so that each thread sets exactly one element of the array? If ts = 1000, then 1000 tiles are needed; if ts = 500, then 2000 tiles are needed.
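In general, covering n elements with tiles of ts threads takes ceil(n / ts) tiles, which the usual integer idiom computes without floating point (number_of_tiles is a hypothetical helper, not part of the C++ AMP API):

// Number of tiles needed to cover n elements with tiles of ts threads.
// Rounds up, so the last tile may be only partially used.
int number_of_tiles(int n, int ts)
{
    return (n + ts - 1) / ts;
}
// number_of_tiles(1000000, 1000) == 1000
// number_of_tiles(104729, 1000)  == 105   (the 105th tile is mostly idle)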
In C++ AMP, when the tile size divides n evenly, the solution is very simple:
int n = 1000000;
std::vector<int> host_data(n);
array_view<int, 1> a(n, &host_data[0]);
parallel_for_each(extent<1>(n).tile<1000>(),
    [=](tiled_index<1000> idx) restrict(amp)
{
    a[idx.global] = idx.global[0];
});
a.synchronize();
But if n = 104729 (a prime number), there is no tile size, other than 104729 or 1, which allows an integral number of tiles to cover the problem size. A tile size larger than 1024 is not possible, and a tile size of 1 uses the GPU inefficiently. The solution is to launch enough tiles to cover at least n integers; tiled_extent::pad() rounds the extent up to the next multiple of the tile size. Unfortunately, the kernel must then contain a guard so that the extra threads do not write past the end of the array.
int n = 104729;
std::vector<int> host_data(n);
array_view<int, 1> a(n, &host_data[0]);
parallel_for_each(extent<1>(n).tile<1000>().pad(),
    [=](tiled_index<1000> idx) restrict(amp)
{
    if (idx.global[0] < (int)a.extent.size())  // guard against the padded threads
        a[idx.global] = idx.global[0];
});
a.synchronize();
But, as we will see below, it is possible to automatically add guards by creating a new memory class.
Automatic guards
When padding an extent for a tiling, the kernel must change to make sure access to global memory does not go out of bounds. One way to do this is to have the programmer add bounds checking to the kernel, as above. Another way is to create a class derived from array_view<> that performs the check automatically in an overloaded []-operator.
template <typename T, int R>
class xarray_view : public array_view<T, R>
{
    // One-element scratch view, not bound to host data. This is one possible
    // guard: out-of-range accesses are redirected to the sink, so padded
    // threads never touch the real array.
    array_view<T, 1> m_sink;
public:
    xarray_view(int _E0, T * _Src)
        : array_view<T, R>(_E0, _Src), m_sink(concurrency::extent<1>(1)) { }

    T & operator[] (const int nIndex) const restrict(amp, cpu)
    {
        if (nIndex < 0 || nIndex >= (int)this->extent.size())
            return m_sink[0];
        return array_view<T, R>::operator[](nIndex);
    }
};
...
int n = 1000000 - 1;
std::vector<int> host_data(n);
xarray_view<int, 1> a(n, &host_data[0]);
parallel_for_each(extent<1>(n).tile<1000>().pad(),
    [=](tiled_index<1000> idx) restrict(amp)
{
    a[idx.global[0]] = idx.global[0];  // no explicit guard needed
});
a.synchronize();
Linearizing a tiled_index<>
If a problem space is very large, tiling cannot be done in only one dimension: the underlying DirectX 11 runtime limits a tiled dispatch to 65535 tiles per dimension. In this situation, programmers need a multi-dimensional tile and a way to convert the tiled coordinates back into a linear space. One solution is to treat the problem as a 2-dimensional square matrix whose side is the square root of the problem size. To linearize the space, the index is computed from the 2-dimensional coordinates in row-major order.
template <typename Kernel>
void xtiled_parallel_for_each(int ext_size, Kernel kernel)
{
    // Treat the 1-D problem as a size x size square; round up so that
    // size * size >= ext_size even when ext_size is not a perfect square.
    int size = (int)ceil(sqrt((double)ext_size));
    auto krn = [=](tiled_index<16, 16> idx) restrict(amp)
    {
        // Linearize the 2-D global coordinates in row-major order.
        kernel(idx.global[0] * size + idx.global[1]);
    };
    concurrency::parallel_for_each(extent<2>(size, size).tile<16, 16>().pad(), krn);
}
...
int n = 1024 * 1024;  // a perfect square, so no index exceeds n here;
                      // for other sizes, combine with xarray_view (above)
std::vector<int> host_data(n);
array_view<int, 1> a(n, &host_data[0]);
xtiled_parallel_for_each(n, [=](int idx) restrict(amp)
{
    a[idx] = idx;
});
a.synchronize();
Putting it all together
All of the tricks we have seen can be composed. Together, they take much of the boilerplate out of tiled C++ AMP programming.
template <typename Kernel>
void tiled_parallel_for_each(int ext_size, Kernel kernel)
{
    auto krn = [=](tiled_index<1000> idx) restrict(amp)
    {
        kernel(idx.global[0]);
    };
    concurrency::parallel_for_each(extent<1>(ext_size).tile<1000>().pad(), krn);
}

template <typename Kernel>
void xtiled_parallel_for_each(int ext_size, Kernel kernel)
{
    int size = (int)ceil(sqrt((double)ext_size));
    auto krn = [=](tiled_index<16, 16> idx) restrict(amp)
    {
        kernel(idx.global[0] * size + idx.global[1]);
    };
    concurrency::parallel_for_each(extent<2>(size, size).tile<16, 16>().pad(), krn);
}

template <typename T, int R>
class xarray_view : public array_view<T, R>
{
    // One-element scratch view; out-of-range accesses are redirected here
    // so that padded threads never touch the real data.
    array_view<T, 1> m_sink;
public:
    xarray_view(int _E0, T * _Src)
        : array_view<T, R>(_E0, _Src), m_sink(concurrency::extent<1>(1)) { }

    T & operator[] (const int nIndex) const restrict(amp, cpu)
    {
        if (nIndex < 0 || nIndex >= (int)this->extent.size())
            return m_sink[0];
        return array_view<T, R>::operator[](nIndex);
    }
};
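As a hypothetical end-to-end example (not in the original post), here is the prime-sized initialization problem solved with the composed helpers; the out-of-range writes from the padded threads are absorbed by xarray_view, so the kernel needs no guard:

int n = 104729;  // prime: no tile size between 2 and 1024 divides it evenly
std::vector<int> host_data(n);
xarray_view<int, 1> a(n, &host_data[0]);
tiled_parallel_for_each(n, [=](int idx) restrict(amp)
{
    a[idx] = idx;  // indices >= n are redirected to the sink element
});
a.synchronize();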
References
http://blogs.msdn.com/b/nativeconcurrency/archive/2012/02/26/tiled-extent-pad-in-c-amp.aspx