Performance comparison of CUDA, OpenCL, and C++ AMP

Getting information about the underlying design of a GPGPU programming environment and its hardware can be difficult.  Companies often do not publish design details because they do not want competitors to copy the technology.  Yet sometimes you need exactly those unpublished details in order to use the technology effectively.  If the vendor won’t tell you how it works, the only recourse is experimentation [1, 2].  So how do OpenCL, CUDA, and C++ AMP perform, and what can we learn from that?

Introduction

The performance of OpenCL, CUDA, and C++ AMP is not easy to compare because there is no one-to-one correspondence between the APIs.  For example, CUDA has no direct analog of the OpenCL function clSetKernelArg: in OpenCL, clSetKernelArg can be called once and the kernel then enqueued multiple times, whereas in CUDA the kernel arguments must be supplied at every call, because a kernel call is translated into the internal runtime functions cudaSetupArgument and cudaLaunch.
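
To make the argument-binding difference concrete, here is a minimal sketch (not the article's program) of the two launch styles.  The kernel, command queue, and buffer objects are assumed to have been created elsewhere, and the CUDA fragment would live in its own .cu file:

    #include <CL/cl.h>

    // OpenCL: bind the buffer to the kernel object once, then enqueue the
    // same kernel many times without re-binding the argument.
    void launch_many_opencl(cl_command_queue queue, cl_kernel kernel, cl_mem buf,
                            size_t global, size_t local, int launches)
    {
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
        for (int n = 0; n < launches; ++n)
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                   &global, &local, 0, NULL, NULL);
    }

    // CUDA: there is no separate "set argument" call; the argument is supplied
    // at every launch, and the <<<...>>> syntax expands into the runtime's
    // internal argument-setup and launch functions.
    __global__ void sum_kernel(int *out) { out[threadIdx.x] = 0; }

    void launch_many_cuda(int *dev_out, int blocks, int threads, int launches)
    {
        for (int n = 0; n < launches; ++n)
            sum_kernel<<<blocks, threads>>>(dev_out);
        cudaDeviceSynchronize();
    }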

Moreover, each GPGPU programming environment performs initialization the first time an API function is called, so measured times depend on the order of calls.  For example, the first call to the CUDA API could be cudaMalloc or cudaSetDevice, depending on whether the programmer accepts the default device.  Whichever function is called first incurs a long wait, because that is when CUDA initializes.  This is manageable as long as the developer accounts for it carefully.
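
The sketch below (my own, not the article's test program) makes the hidden cost visible: whichever runtime call comes first absorbs the initialization time, so it is timed separately from the later calls.

    #include <chrono>
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        using clk = std::chrono::high_resolution_clock;

        auto t0 = clk::now();
        cudaSetDevice(0);            // first runtime call: pays the initialization cost
        auto t1 = clk::now();

        int *d = nullptr;
        cudaMalloc(&d, 1 << 20);     // later calls are comparatively fast
        auto t2 = clk::now();

        std::printf("first call: %.3f ms\n",
                    std::chrono::duration<double, std::milli>(t1 - t0).count());
        std::printf("cudaMalloc: %.3f ms\n",
                    std::chrono::duration<double, std::milli>(t2 - t1).count());

        cudaFree(d);
        return 0;
    }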

As it turns out, a block of code in a program written in one GPGPU language can usually be translated into an equivalent block in another.  For example, the code that finds and selects a device in CUDA has an equivalent block of code in OpenCL.  These blocks of code are called phases here.  What are the phases of a program?  That depends, but one possible definition follows:

  1. Setup resources on the GPU for problem solving, e.g., pick the GPU, allocate memory on the GPU, and compile kernel code for the GPU.
  2. Copy data structures from the CPU to the GPU.
  3. Call kernel: Run the kernel code to solve the problem on the GPU.
  4. Copy data structures from the GPU to the CPU.
  5. Release resources on the GPU.
The functions of each language can be grouped into these phases as follows (a phase-annotated CUDA skeleton appears after the listing):
Setup
  CUDA:    cudaGetDeviceCount, cudaGetDeviceProperties, cudaSetDevice, cudaMalloc
  OpenCL:  clGetPlatformIDs, clGetPlatformInfo, clGetDeviceIDs, clGetDeviceInfo,
           clCreateContext, clCreateProgramWithSource, clBuildProgram,
           clGetProgramBuildInfo, clCreateKernel, clCreateBuffer, clSetKernelArg,
           clCreateCommandQueue
  C++ AMP: allocate vector<accelerator> via get_accelerators(), allocate
           vector::iterator, accelerators.begin(), accelerators.end(),
           get_description(), allocate array<>, allocate extent<>, allocate grid<>

Copy to GPU
  CUDA:    cudaMemcpy
  OpenCL:  clEnqueueWriteBuffer
  C++ AMP: copy

Call kernel
  CUDA:    kernel<<<…>>>(…), cudaDeviceSynchronize
  OpenCL:  clEnqueueNDRangeKernel, clWaitForEvents
  C++ AMP: parallel_for_each, flush, wait

Copy to CPU
  CUDA:    cudaMemcpy
  OpenCL:  clEnqueueReadBuffer
  C++ AMP: copy

Release
  CUDA:    cudaFree
  OpenCL:  clReleaseMemObject, clReleaseKernel, clReleaseContext, clReleaseProgram
  C++ AMP: deallocate grid<>, deallocate extent<>, deallocate array<>,
           deallocate accelerator_view, deallocate vector<accelerator>,
           deallocate vector::iterator
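
As an illustration of how the five phases line up in code, here is a bare-bones CUDA host skeleton (names and kernel are illustrative, not the article's program):

    #include <cuda_runtime.h>

    __global__ void solve(int *data) { /* kernel body elided */ }

    void run(const int *host_in, int *host_out, size_t n)   // assumes n is a multiple of 256
    {
        // Phase 1: setup (pick the device and allocate memory on the GPU).
        cudaSetDevice(0);
        int *dev = nullptr;
        cudaMalloc(&dev, n * sizeof(int));

        // Phase 2: copy data structures from the CPU to the GPU.
        cudaMemcpy(dev, host_in, n * sizeof(int), cudaMemcpyHostToDevice);

        // Phase 3: call the kernel and wait for it to finish.
        solve<<<(unsigned)(n / 256), 256>>>(dev);
        cudaDeviceSynchronize();

        // Phase 4: copy data structures from the GPU back to the CPU.
        cudaMemcpy(host_out, dev, n * sizeof(int), cudaMemcpyDeviceToHost);

        // Phase 5: release resources on the GPU.
        cudaFree(dev);
    }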

Gathering run-time data

How do we collect the run-times for each phase?  There are a few ways to do this:

  • Use a black-box test to gather the run-time for the process.  To do this, run a program within the Bourne shell using the command “time” to display the run-time.  Unfortunately, this will not show the run-time for each phase.
  • Use a white-box test with a profiler to gather the run-time for the phases.  However, the programmer must partition the phases into separate functions, because the profiler is phase-unaware.
  • Use a white-box test implemented with a “self-instrumented” program (i.e., a program in which the programmer has inserted calls to a clock function) to gather the run-time for the phases (a minimal timer sketch follows this list).
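
A minimal sketch of the self-instrumented approach, assuming the phase functions named in the usage comment stand in for the real API calls of one environment:

    #include <chrono>
    #include <cstdio>

    struct PhaseTimer {
        std::chrono::high_resolution_clock::time_point start;
        void begin() { start = std::chrono::high_resolution_clock::now(); }
        void end(const char *phase) {
            auto stop = std::chrono::high_resolution_clock::now();
            double ms = std::chrono::duration<double, std::milli>(stop - start).count();
            std::printf("%-12s %8.3f ms\n", phase, ms);
        }
    };

    // Usage (the phase functions are placeholders for the real API calls):
    //   PhaseTimer t;
    //   t.begin(); setup();        t.end("setup");
    //   t.begin(); copy_to_gpu();  t.end("copy to GPU");
    //   t.begin(); call_kernel();  t.end("kernel");
    //   t.begin(); copy_to_cpu();  t.end("copy to CPU");
    //   t.begin(); release();      t.end("release");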

To make a fair comparison between environments, the problem and the solution must be the same, using the same algorithm, grid/tile size, memory in the GPU, etc.

Let’s take a simple problem:

input: A = array[0..n] of 1’s; integer u; integer i;

output: O = array[0..n] of integers, where O[k] = u * i when (k % u) = 0, otherwise O[k] = 0.

In other words, sum the 1’s in each block of u elements of the array (all 1’s), i times, and place the sum at the beginning of the block.  For example, given an array of 16 integers (n = 16), u = 4, and i = 4, then O[0] = O[4] = O[8] = O[12] = 16 and O[k] = 0 for all other indices.  Other parameters for tweaking the problem are x, the number of times the entire problem is solved, and s, the stride between summed elements.

The solution of this problem is here.  The defaults for the problem are: “-u 4 -i 1 -x 1000 -u 8388608 -s 1”.  The grid size is 256 threads.
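
The linked solution is not reproduced here, so the following is a hypothetical CUDA version of the kernel, just to make the problem statement concrete.  Each thread owns one segment of u elements (the article's "block"), the launch uses 256 threads per CUDA thread block as in the article, and n is assumed to be a multiple of u.

    __global__ void block_sum(const int *A, int *O, int n, int u, int i, int s)
    {
        int k = (blockIdx.x * blockDim.x + threadIdx.x) * u;  // start of this thread's segment
        if (k >= n) return;

        int total = 0;
        for (int rep = 0; rep < i; ++rep)      // repeat the summation i times
            for (int j = 0; j < u; j += s)     // stride s between summed elements
                total += A[k + j];

        O[k] = total;                          // u * i when A is all 1's and s = 1
        for (int j = 1; j < u; ++j)            // the rest of the segment is 0
            O[k + j] = 0;
    }

    // Launch (256 threads per thread block, as in the article):
    //   block_sum<<<((n / u) + 255) / 256, 256>>>(dev_A, dev_O, n, u, i, s);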

Test environment

This program was executed on two machines:

  • Asus P5N-D motherboard, Intel Q6600 @ 2.51 GHz (overclocked), 4 GB DDR2-1066 @ 838 MHz, Windows 7 64-bit OS, an NVIDIA GeForce GTX 470, and an ATI Radeon HD 6450.  A list of drivers and versions is here.
  • Gigabyte GA-A75-UD4H, AMD A8-3850 @ 2.9 GHz, 8 GB DDR3-1600 @ 800 MHz, Windows 8 64-bit OS (Developer Preview).

Results

Setup time before copy to GPU

The time required to compile the OpenCL kernel source at run time accounts for the large setup time in the OpenCL environment.  The other environments do not pay this cost because their kernel code is pre-compiled to an intermediate language.  (A sketch for isolating this build step follows the table.)

CUDA (ms)     OpenCL (ms)   C++ AMP (ms)
38.7 ± 0.4    424 ± 12      76.8 ± 1.9

(Run-times on Intel machine, mean ± S.E. of sample for 5 runs.)
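
To confirm that the build step dominates, clBuildProgram can be timed in isolation.  A minimal sketch, assuming the context and device have already been created:

    #include <CL/cl.h>
    #include <chrono>

    // Returns the time, in milliseconds, spent compiling the kernel source.
    double time_build(cl_context ctx, cl_device_id dev, const char *src)
    {
        cl_int err = CL_SUCCESS;
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);

        auto t0 = std::chrono::high_resolution_clock::now();
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        auto t1 = std::chrono::high_resolution_clock::now();

        clReleaseProgram(prog);
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    }

One possible mitigation, not used in these tests, is to cache the compiled binary with clGetProgramInfo(CL_PROGRAM_BINARIES) and reload it on later runs with clCreateProgramWithBinary.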

Copying data between GPU and CPU

The run-times for copying data from the CPU to the GPU are similar for CUDA, OpenCL, and C++ AMP.  Copying data from the GPU back to the CPU, however, is slower in every environment, and markedly slower in C++ AMP.  (A sketch for timing the two directions separately follows the table.)

To/from GPU   CUDA (ms)     OpenCL (ms)   C++ AMP (ms)
To            47.1 ± 1.4    58.2 ± 5.7    64.7 ± 1.1
From          72.5 ± 1.9    74.2 ± 2.7    121 ± 2

(Run-times on Intel machine, mean ± S.E. of sample for 5 runs.)
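
A sketch for timing the two directions separately with CUDA and ordinary pageable host memory (my own fragment, with a buffer size of the same order as the test data):

    #include <chrono>
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        const size_t n = 8u * 1024 * 1024;       // 8M ints (contents irrelevant for timing)
        int *host = new int[n];
        int *dev = nullptr;
        cudaSetDevice(0);
        cudaMalloc(&dev, n * sizeof(int));

        auto t0 = std::chrono::high_resolution_clock::now();
        cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);   // CPU -> GPU
        auto t1 = std::chrono::high_resolution_clock::now();
        cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);   // GPU -> CPU
        auto t2 = std::chrono::high_resolution_clock::now();

        std::printf("to GPU:   %.2f ms\n", std::chrono::duration<double, std::milli>(t1 - t0).count());
        std::printf("from GPU: %.2f ms\n", std::chrono::duration<double, std::milli>(t2 - t1).count());

        cudaFree(dev);
        delete[] host;
        return 0;
    }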

Kernel Overhead

On both machines, the run-time for a loop of a varying number of kernel calls is fairly similar across CUDA, OpenCL, and C++ AMP, with C++ AMP slightly slower than the other two.  However, for small problems C++ AMP shows an overhead of around 70 ms associated with the first kernel call, whereas this overhead is negligible in the other two environments.  (A warm-up sketch that separates this first-call cost follows the table.)

Problem parameters   CUDA (ms)       OpenCL (ms)     C++ AMP (ms)
-i 1 -x 1            0.936 ± 0.005   0.944 ± 0.017   71.6 ± 3.7
-i 1 -x 10           6.34 ± 0.01     6.46 ± 0.01     76.35 ± 0.61
-i 1 -x 100          60.5 ± 0.1      60.5 ± 0.1      145 ± 1
-i 1 -x 1000         602 ± 0.1       601 ± 0.3       713 ± 1
-i 1 -x 10000        6022 ± 0.4      6005 ± 1        6378 ± 3

(Run-times on Intel machine, mean ± S.E. of sample for 5 runs.)
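
When the first-call cost matters, one common workaround (a sketch, not part of the article's program) is to issue a throwaway parallel_for_each before the timed region, so that the C++ AMP runtime's one-time work is not charged to the measured kernel:

    #include <amp.h>
    using namespace concurrency;

    void warm_up(accelerator_view av)
    {
        extent<1> e(256);
        array<int, 1> dummy(e, av);
        parallel_for_each(av, e,
            [&dummy](index<1> idx) restrict(amp) { dummy[idx] = 0; });
        av.wait();   // block until the runtime's one-time work has completed
    }

    // Usage, before starting the timer:
    //   warm_up(accelerator().default_view);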

Object code overhead

On both machines, the run-time for one kernel call containing a loop of a varying number of iterations is again fairly similar across CUDA, OpenCL, and C++ AMP.  CUDA and OpenCL complete the kernel call with a loop of 10000 iterations in around 2.2 s, while C++ AMP is noticeably slower at 3.7 s.  These data suggest that the object code generated for C++ AMP does not perform as well as that for OpenCL or CUDA.  (A sketch of how the -x and -i parameters scale the work follows the table.)

Problem parameters   CUDA (ms)       OpenCL (ms)     C++ AMP (ms)
-i 1 -x 1            0.918 ± 0.010   0.946 ± 0.010   71.6 ± 3.7
-i 10 -x 10          2.87 ± 0.01     2.87 ± 0.02     70.7 ± 0.5
-i 100 -x 100        22.5 ± 0.02     22.3 ± 0.03     103 ± 0.3
-i 1000 -x 1000      220 ± 0.1       219 ± 0.1       428 ± 1
-i 10000 -x 10000    2195 ± 0.4      2188 ± 0.1      3671 ± 1

(Run-times on Intel machine, mean ± S.E. of sample for 5 runs.)
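
For reference, here is how the two scaling parameters would drive the hypothetical block_sum kernel sketched earlier: -x multiplies the number of launches (and therefore the launch overhead), while -i only lengthens the loop inside each launch.

    // Reuses the hypothetical block_sum kernel from the earlier sketch.
    void run_problem(int *dev_A, int *dev_O, int n, int u, int i, int s,
                     int x, int blocks, int threads)
    {
        // -x: repeat the full launch, paying launch overhead each time.
        // -i: passed into the kernel, where it only lengthens the inner loop.
        for (int rep = 0; rep < x; ++rep)
            block_sum<<<blocks, threads>>>(dev_A, dev_O, n, u, i, s);
        cudaDeviceSynchronize();
    }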

Performance comparison between NVIDIA 470 and AMD Llano APU

The relative performance of the GPUs is NVIDIA 470 > AMD Llano APU >> ATI Radeon HD 6450.  No surprise there, considering their cost.

NVIDIA 470 (s)   AMD Llano GPU (s)   ATI Radeon HD 6450 (s)
1.41 ± 0.01      8.62 ± 0.02         37.1 ± 0.01

(Run-times on Intel and AMD machines.  Program ran with standard parameters, and “-l opencl -q”.  Mean ± S.E. of sample for 5 runs.)

Discussion and conclusion

To write efficient code in GPU programming environments, developers need to understand the API.  Good program design requires information that may not be published, or that is, at best, scattered across developer forums (e.g., http://forums.nvidia.com and http://forums.amd.com).  Experimentation helps uncover the information a programmer needs for good design.

References

1) Wong, H., M. Papadopoulou, et al. (2010). “Demystifying GPU microarchitecture through microbenchmarking.” IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2010).

2) Zhang, Y., L. Peng, et al. “Architecture Comparisons between Nvidia and ATI GPUs: Computation Parallelism and Data Communications.”
