Trying to get information about the underlying design of a GPGPU programming language environment and its hardware can be difficult. Companies will not publish design information because they do not want you or other companies to copy the technology. But sometimes you need unpublished details of a technology in order to use it effectively. If they won't tell you how the technology works, the only recourse is experimentation [1, 2]. What is the performance of OpenCL, CUDA, and C++ AMP? What can we learn from this information?
Introduction
The performance of OpenCL, CUDA, and C++ AMP is not easy to compare because there is no one-to-one correspondence between the APIs. For example, there is no direct analog of the OpenCL function clSetKernelArg in CUDA. In fact, clSetKernelArg can be called once and the kernel then called multiple times. In CUDA this is impossible, because kernel arguments are set each time the kernel is called: a kernel call is translated into the internal functions cudaSetupArgument and cudaLaunch.
Moreover, each GPGPU programming language environment performs initialization when a function of the API is called, and the cost is order dependent. For example, the first call to the CUDA API could be cudaMalloc or cudaSetDevice, depending on whether the programmer accepts the default device. Whichever is called first, a long wait follows because CUDA is initializing. But it is not hopeless if the developer implements the solution carefully.
As it turns out, a block of code written in one GPGPU programming language can usually be translated into equivalent code in another. For example, the code to find and select a device in CUDA has an equivalent block of code in OpenCL. These blocks of code are called phases. What are the phases of a program? This depends, but one possible definition follows:
- Setup resources on the GPU for problem solving, e.g., pick the GPU, allocate memory on the GPU, and compile kernel code for the GPU.
- Copy data structures from the CPU to the GPU.
- Call kernel: Run the kernel code to solve the problem on the GPU.
- Copy data structures from the GPU to the CPU.
- Release resources on the GPU.
| Phase | CUDA | OpenCL | C++ AMP |
|---|---|---|---|
| Setup | cudaGetDeviceCount, cudaGetDeviceProperties, cudaSetDevice, cudaMalloc | clGetPlatformIDs, clGetPlatformInfo, clGetDeviceIDs, clGetDeviceInfo, clCreateContext, clCreateProgramWithSource, clBuildProgram, clGetProgramBuildInfo, clCreateKernel, clCreateBuffer, clSetKernelArg, clCreateCommandQueue | allocate vector<accelerator> (get_accelerators()), allocate vector::iterator (accelerators.begin(), accelerators.end()), get_description(), allocate array<>, allocate extent<>, allocate grid<> |
| Copy to GPU | cudaMemcpy | clEnqueueWriteBuffer | copy |
| Call kernel | kernel<<<…>>>(…), cudaDeviceSynchronize | clEnqueueNDRangeKernel, clWaitForEvents | parallel_for_each, flush, wait |
| Copy to CPU | cudaMemcpy | clEnqueueReadBuffer | copy |
| Release | cudaFree | clReleaseMemObject, clReleaseKernel, clReleaseContext, clReleaseProgram | deallocate grid<>, deallocate extent<>, deallocate array<>, deallocate accelerator_view, deallocate vector<accelerator>, deallocate vector::iterator |
Gathering run-time data
How do we collect the run-times for each phase? There are a few ways to do this:
- Use a black-box test to gather the run-time for the whole process. To do this, run the program within the Bourne shell using the command “time” to display the run-time. Unfortunately, this will not show the run-time for each phase.
- Use a white-box test with a profiler to gather the run-time for phases. But the programmer must partition the phases into functions, because the profiler is phase-unaware.
- Use a white-box test, implemented with a “self-instrumented” program (i.e., a program in which the programmer inserted calls to a clock function) to gather the run-time for the phases.
To make a fair comparison between environments, the problem and the solution must be the same, using the same algorithm, grid/tile size, memory in the GPU, etc.
Let’s take a simple problem:
input: A = array[0..n] of 1’s; integer u; integer i;
output: O = array[0..n] of integers, where O[k] = u * i when (k % u) = 0, otherwise O[k] = 0.
In other words, sum the number of 1’s in an array of n integers (all 1’s), i times, with a block size of u, and place the sum at the beginning of the block.  For example, given an array of 16 integers (n = 16), u = 4, and i = 4, O[0] = O[4] = O[8] = O[12] = 16, and O[1] = O[2] = O[3] = O[5] = … = 0 (for all other indices).  Other parameters to tweak the problem are: x iterations of solving the entire problem; stride s distance between elements summed.
The solution of this problem is here. The defaults for the problem are: “-u 4 -i 1 -x 1000 -n 8388608 -s 1”. The grid size is 256 threads.
Test environment
This program was executed on two machines:
- Asus P5N-D motherboard, Intel Q6600 @ 2.51 GHz (overclocked), 4 GB DDR2-1066 @ 838 MHz, Windows 7 64-bit OS, NVIDIA GeForce GTX 470, an ATI Radeon HD 6450. A list of drivers and versions is here.
- Gigabyte GA-A75-UD4H, AMD A8-3850 @ 2.9 GHz, 8 GB DDR3-1600 @ 800 MHz, Windows 8 64-bit OS (Developer Preview).
Results
Setup time before copy to GPU
The time needed to compile an OpenCL program accounts for the large setup time for GPUs in the OpenCL environment. The other environments do not have this cost because their code is pre-compiled to an intermediate language.
| CUDA (ms) | OpenCL (ms) | C++ AMP (ms) |
|---|---|---|
| 38.7 ± 0.4 | 424 ± 12 | 76.8 ± 1.9 |

(Run-times on Intel machine, mean ± S.E. of sample for 5 runs.)
Copying data between GPU and CPU
The run-times for copying data from the CPU to the GPU are similar for CUDA, OpenCL, and C++ AMP. However, copying data from the GPU to the CPU is slower on all platforms, with C++ AMP much slower.
| To/from GPU | CUDA (ms) | OpenCL (ms) | C++ AMP (ms) |
|---|---|---|---|
| To | 47.1 ± 1.4 | 58.2 ± 5.7 | 64.7 ± 1.1 |
| From | 72.5 ± 1.9 | 74.2 ± 2.7 | 121 ± 2 |

(Run-times on Intel machine, mean ± S.E. of sample for 5 runs.)
Kernel Overhead
On both machines, the run-times for a loop of a varying number of kernel calls in CUDA, OpenCL, and C++ AMP are fairly similar, with C++ AMP slightly slower than the other two. However, for small problems, C++ AMP shows an overhead of around 70 ms associated with the first call to the kernel, whereas this overhead is negligible in the other two environments.
| Problem parameters | CUDA (ms) | OpenCL (ms) | C++ AMP (ms) |
|---|---|---|---|
| -i 1 -x 1 | 0.936 ± 0.005 | 0.944 ± 0.017 | 71.6 ± 3.7 |
| -i 1 -x 10 | 6.34 ± 0.01 | 6.46 ± 0.01 | 76.35 ± 0.61 |
| -i 1 -x 100 | 60.5 ± 0.1 | 60.5 ± 0.1 | 145 ± 1 |
| -i 1 -x 1000 | 602 ± 0.1 | 601 ± 0.3 | 713 ± 1 |
| -i 1 -x 10000 | 6022 ± 0.4 | 6005 ± 1 | 6378 ± 3 |

(Run-times on Intel machine, mean ± S.E. of sample for 5 runs.)
Object code overhead
On both machines, the run-times for one kernel call containing a loop of a varying number of iterations in CUDA, OpenCL, and C++ AMP are again fairly similar. CUDA and OpenCL perform the kernel call with a loop of 10000 iterations in around 2.2 s, while C++ AMP is slower at about 3.7 s. These data suggest that the object code generated for C++ AMP is not as good as that for OpenCL or CUDA.
| Problem parameters | CUDA (ms) | OpenCL (ms) | C++ AMP (ms) |
|---|---|---|---|
| -i 1 -x 1 | 0.918 ± 0.010 | 0.946 ± 0.010 | 71.6 ± 3.7 |
| -i 10 -x 10 | 2.87 ± 0.01 | 2.87 ± 0.02 | 70.7 ± 0.5 |
| -i 100 -x 100 | 22.5 ± 0.02 | 22.3 ± 0.03 | 103 ± 0.3 |
| -i 1000 -x 1000 | 220 ± 0.1 | 219 ± 0.1 | 428 ± 1 |
| -i 10000 -x 10000 | 2195 ± 0.4 | 2188 ± 0.1 | 3671 ± 1 |

(Run-times on Intel machine, mean ± S.E. of sample for 5 runs.)
Performance comparison between NVIDIA 470 and AMD Llano APU
The relative performance of the GPUs is NVIDIA 470 > AMD Llano APU >> ATI Radeon HD 6450. There is no surprise there, considering their cost.
| NVIDIA 470 (s) | AMD Llano GPU (s) | ATI Radeon HD 6450 (s) |
|---|---|---|
| 1.41 ± 0.01 | 8.62 ± 0.02 | 37.1 ± 0.01 |

(Run-times on the Intel and AMD machines. Program run with standard parameters plus “-l opencl -q”. Mean ± S.E. of sample for 5 runs.)
Discussion and conclusion
In order to write efficient code in GPU programming environments, developers need to understand the API. Program design requires information that may not be published, or, at best, is found only in developer forums (e.g., http://forums.nvidia.com and http://forums.amd.com). Experimentation helps uncover the information the programmer needs for good design.
References
1) Wong, H., M. Papadopoulou, et al. (2010). Demystifying GPU microarchitecture through microbenchmarking. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2010, IEEE.
2) Zhang, Y., L. Peng, et al. “Architecture Comparisons between Nvidia and ATI GPUs: Computation Parallelism and Data Communications.”