Writing a wrapper for the CUDA memory allocation API

As a software developer, you may be faced with the problem of modifying the behavior of a third-party program. For example, you're writing a debugger that will check the usage of memory allocation in the program being debugged. You often do not have the source for the program, so you cannot change it. And, you do not have the sources for the memory allocation library that the program uses, so you cannot change that code either. This is just the problem I was facing with Nvidia's CUDA API [1]. Although it is not a particularly hard problem, what I learned was yet another example of how frustrating it is to find a solution that should be well described, but is not.

The problem

I have been programming with CUDA C++ SDK for Windows lately, and noticed that I often don't use the API correctly. Some of the oversights I make are:

I forget to check the return value from cudaMalloc. In this example, the return values of the two CUDA API calls are never checked. However, most likely, they would have succeeded, but this is a bad assumption.

...
int h = 1;
int * d;
cudaMalloc(&d, sizeof(h));
cudaMemcpy(d, &h, sizeof(h), cudaMemcpyHostToDevice);

I allocate a buffer with the wrong size. In this example, the size of the variable “d” allocated in device global memory is 10 bytes long, not 40 bytes long, as it should be. In addition, the return values of the two CUDA API calls are never checked.

...
int h[10];
int * d;
cudaMalloc(&d, 10);
cudaMemcpy(d, &h, 10, cudaMemcpyHostToDevice);

I use a host pointer as a device pointer, and vice versa. In this example, the intension was to initialize the device memory, but the parameters to cudaMemcpy are reversed.

...
int h[10];
int * d;
cudaMalloc(&d, 10);
cudaMemcpy(&h, d, 10, cudaMemcpyHostToDevice);

Two developer tools, Nsight [2] and Ocelot [3], seem as though they should help, but do not. Nsight does not detect the first two problems, but does for the third. Nsight also requires you to recompile your program with the -G0 option [1] and relink. Ocelot, which should be able to detect all of these problems, does not exist for Windows. There are no additional tools that seem to help. So, I decided to write a wrapper for the CUDA API memory allocation functions. This would help report errors, and find memory overwrites, unfreed blocks of memory, and incorrect use of pointers with these API functions.

There are several ways to implement this programming task, some solutions easier than others. However, I did not want to recompile and relink my program in order to debug it. I decided wanted a tool that would add wrapper functions for the CUDA API at runtime.

First steps

The first the problem I needed to solve was to write the wrapper for cudaMalloc and cudaFree. I first noticed that cudaMalloc and cudaFree are very much like malloc and free. This similarity reminded me of an old solution for Windows, BoundsChecker [4], which I used many years ago. In those days, BoundsChecker required you to recompile and relink your program using a special include file and library. The include file contained macros for malloc and free, which substituted debugging wrapper functions for malloc and free. A more contemporary example that uses the same solution is Dmalloc [5]. Following this by example, I wrote an include file and library for debugging a CUDA program.

How do I do DLL injection and hooking?

The next step was to find a way to insert this code without the need to recompile and relink my program. Searching Google for “intercepting calls to a DLL windows”, I learned that I would most likely need to use “DLL injection” [6]. “In computer programming, DLL injection is a technique used to run code within the address space of another process by forcing it to load a dynamic-link library.”

But, DLL injection was not sufficient. Even if I was able to copy my wrapper code into another process, each call to cudaMalloc and cudaFree needed to be changed. To do this, I needed to do “hooking” [6]. “In computer programming, the term hooking covers a range of techniques used to alter or augment the behavior of anoperating system, of applications, or of other software components by intercepting function calls or messages or eventspassed between software components.”

The problem is, of course, how to actually do DLL injection and hooking. The more I searched for a simple example that would show how to implement these methods, the more I became confused and aggravated. I didn’t want multiple ways to do it, just one [8]; I didn’t want a huge class library sitting on top of Windows calls [9]; I need to hook CUDA API calls, not Windows Kernel functions [10]; I didn’t want to use something that restricted it’s use just because of the license associated with the source [10]; I didn’t want a complete substitute for the CUDA API, just a wrapper [12]; an example which has only DLL injection, no hooking [13]; an explanation from a peer-reviewed article, but unavailable code because the URL link is invalid [14]; reasonable code, but not a good description [15, 16]. I finally found a way after choosing the most promising solution [9], then single stepped through the program to discover all the system-level calls. The two other simple solutions ([15, 16]) would have worked equally well.

Hooking

Although there are several different methods to perform the hooking, I chose Import Address Table (IAT) patching [9]. In Windows, an executable or a DLL, also known as a module, can call other modules. When a program is executed, the file is loaded into memory. All modules that the program depends on are also loaded. The loader copies each callee module into memory, and updates the addresses of all referenced locations in the caller module associated with the callee module. These addresses are stored in the IAT for the convenience of the loader. In the same way as a loader updates the IAT’s for all modules, hooking involves updating the IAT of the caller module. However, for a hooking wrapper function, the previous value must be saved in order for the wrapper function to call the original CUDA API function.

A typical CUDA program, its compiled and loaded representation is now examined (Figure 1). At 00B41A47in the executable, a call is performed to an inline template for cudaMalloc via a jump table (to 00B4137F). At 00B4137Fin the executable, a jump is performed to the inline template for cudaMalloc (to 00B44AE0). At 00B44AEBin the executable, a call is performed to the IAT for cudaMalloc (to 00B48906). Finally, at 00B48906in the executable, a jump is performed to cudaMalloc in the CUDA runtime DLL (10002F00). Please note that the addresses depend on where the two modules (the executable and the CUDA runtime DLL) are loaded, and will change each time the program is executed. When hooking is done for cudaMalloc, the jump instruction at 00B48906, within the IAT, is changed.

Figure 1: C++ and Assembly code for cudaMalloc.

00B41370  jmp         std::numpunct<char>::`scalar deleting destructor' (0B482D0h)
00B41375  jmp         std::basic_ios<char,std::char_traits><char> >::fill (0B43DC0h)
00B4137A  jmp         std::_Container_base_secure::_Container_base_secure (0B441D0h)
00B4137F  jmp         cudaMalloc<double> (0B44AE0h)
00B41384  jmp         std::_Container_base_secure::~_Container_base_secure (0B43A50h)
00B41389  jmp         std::ios_base::flags (0B42B30h)
00B4138E  jmp         std::allocator<char>::allocator<char> (0B43360h)
…
		double h[] = {1, 2, 3, 4, 5, 6, 7, 8 };
00B419E9  fld1
00B419EB  fstp        qword ptr [h]
00B419EE  fld         qword ptr [__real@4000000000000000 (0CAC4D8h)]
00B419F4  fstp        qword ptr [ebp-40h]
00B419F7  fld         qword ptr [__real@4008000000000000 (0CAC4C8h)]
00B419FD  fstp        qword ptr [ebp-38h]
00B41A00  fld         qword ptr [__real@4010000000000000 (0CAC4B8h)]
00B41A06  fstp        qword ptr [ebp-30h]
00B41A09  fld         qword ptr [__real@4014000000000000 (0CAC4A8h)]
00B41A0F  fstp        qword ptr [ebp-28h]
00B41A12  fld         qword ptr [__real@4018000000000000 (0CAC498h)]
00B41A18  fstp        qword ptr [ebp-20h]
00B41A1B  fld         qword ptr [__real@401c000000000000 (0CAC488h)]
00B41A21  fstp        qword ptr [ebp-18h]
00B41A24  fld         qword ptr [__real@4020000000000000 (0CAC478h)]
00B41A2A  fstp        qword ptr [ebp-10h]
00B41A2D  push        50h
00B41A2F  lea         eax,[d]
00B41A32  push        eax
00B41A33  call        cudaMalloc<double> (0B4137Fh)
00B41A38  add         esp,8
		double * d;
		cudaMalloc(&d, 10*sizeof(double));
		cudaMemcpy(d, h, 10*sizeof(double), cudaMemcpyHostToDevice);
00B41A3B  push        1
00B41A3D  push        50h
00B41A3F  lea         ecx,[h]
00B41A42  push        ecx
00B41A43  mov         edx,dword ptr [d]
00B41A46  push        edx
00B41A47  call        cudaMemcpy (0B488FAh)
		fun<<<1,1>>>(d);
00B41A4C  push        0
00B41A4E  push        0
00B41A50  sub         esp,0Ch
00B41A53  mov         ecx,esp
00B41A55  push        1
00B41A57  push        1
00B41A59  push        1
00B41A5B  call        dim3::dim3 (0B41145h)
00B41A60  sub         esp,0Ch
00B41A63  mov         ecx,esp
00B41A65  push        1
00B41A67  push        1
00B41A69  push        1
00B41A6B  call        dim3::dim3 (0B41145h)
00B41A70  call        cudaConfigureCall (0B488F4h)
00B41A75  test        eax,eax
00B41A77  je          main+0ABh (0B41A7Bh)
00B41A79  jmp         main+0B7h (0B41A87h)
00B41A7B  mov         eax,dword ptr [d]
00B41A7E  push        eax
00B41A7F  call        fun (0B410FFh)
00B41A84  add         esp,4
		cudaThreadSynchronize();
00B41A87  call        cudaThreadSynchronize (0B488EEh)
		int rv = cudaGetLastError();
00B41A8C  call        cudaGetLastError (0B488E8h)
00B41A91  mov         dword ptr [rv],eax
		cudaMemcpy(h, d, 10*sizeof(double), cudaMemcpyDeviceToHost);
00B41A94  push        2
00B41A96  push        50h
00B41A98  mov         ecx,dword ptr [d]
00B41A9B  push        ecx
00B41A9C  lea         edx,[h]
00B41A9F  push        edx
00B41AA0  call        cudaMemcpy (0B488FAh)
…
template<class t="">
__inline__ __host__ cudaError_t cudaMalloc(
  T      **devPtr,
  size_t   size
)
{
00B44AE0  push        ebp
00B44AE1  mov         ebp,esp
00B44AE3  mov         eax,dword ptr [size]
00B44AE6  push        eax
00B44AE7  mov         ecx,dword ptr [devPtr]
00B44AEA  push        ecx
00B44AEB  call        cudaMalloc (0B48906h)
  return cudaMalloc((void**)(void*)devPtr, size);
}
00B44AF0  pop         ebp
00B44AF1  ret
…
cudaFree:
00B488E2  jmp         dword ptr [__imp__cudaFree@4 (0CC7574h)]
cudaGetLastError:
00B488E8  jmp         dword ptr [__imp__cudaGetLastError@0 (0CC7570h)]
cudaThreadSynchronize:
00B488EE  jmp         dword ptr [__imp__cudaThreadSynchronize@0 (0CC756Ch)]
cudaConfigureCall:
00B488F4  jmp         dword ptr [__imp__cudaConfigureCall@32 (0CC7568h)]
cudaMemcpy:
00B488FA  jmp         dword ptr [__imp__cudaMemcpy@16 (0CC7564h)]
cudaSetDevice:
00B48900  jmp         dword ptr [__imp__cudaSetDevice@4 (0CC7560h)]
cudaMalloc:
00B48906  jmp         dword ptr [__imp__cudaMalloc@8 (0CC755Ch)]
cuMemGetInfo:
00B4890C  jmp         dword ptr [__imp__cuMemGetInfo@8 (0CC75C4h)]
…
10002EFE  int         3
10002EFF  int         3
10002F00  push        ebp
10002F01  mov         ebp,esp
10002F03  push        0FFFFFFFFh
10002F05  push        10023E40h
10002F0A  mov         eax,dword ptr fs:[00000000h]
10002F10  push        eax
10002F11  sub         esp,0Ch
10002F14  push        ebx
10002F15  push        esi
10002F16  push        edi
10002F17  mov         eax,dword ptr ds:[1003124Ch]

The algorithm for hooking is simple, but an understanding depends on knowing several Windows functions and data structures (Figure 2). Basically, given a module and function to hook, it finds all locations in memory that reference that function and substitutes the contents with the address of the wrapper function.

Figure 2: Hooking(String callee-module, String callee-function, Pointer subst)

Make sure callee-module is loaded.  Otherwise, there is no IAT to patch in any caller module.

// Using Windows LoadLibraryA and GetProcAddress:
Pointer previous-address = the address in memory of the callee-function

// Using Windows GetCurrentProcessId, OpenProcess, and EnumProcessModules:
HMODULE module-array[ ] = …;

foreach (HMODULE caller-module in module-array)
{
    // Using caller-module and Windows ImageDirectoryEntryToData, get an
    // array of IMAGE_IMPORT_DESCRIPTOR.
    IMAGE_IMPORT_DESCRIPTOR caller-module-array[ ] = …;

    foreach (IMAGE_IMPORT_DESCRIPTOR import-description in caller-module-array)
    {
        String callee =  import-description.Name;
        if (callee == callee-module)
        {
            IMAGE_THUNK_DATA callee-iat-array[ ] = import-description.FirstThunk;
            {
                foreach (IMAGE_THUNK_DATA callee-iat in callee-iat-array)
                {
                    Pointer callee-addr = callee-iat.Function;
                    if (callee-addr != previous-address)
                        continue;
                    Pointer ptr-callee-addr = & callee-iat.Function;
                    // Using Windows VirtualQuery, set protection level of
                    // memory containing IAT so it can be written to.
                    * ptr-callee-addr = subst;
                }
            }
        }
    }
}

DLL injection

In order to add the hooking code to a user program, the code must be packaged into a DLL, and the DLL injected into another process. Injection is a two-step process. The first step is to create a chunk of memory in a target process, then copy code that loads the DLL. The second step is to execute this code. The problem is when and how to create the process. If you want to perform hooking of calls that could potentially occur during the initialization of that process, you must create the process in a suspended state so that DLL injection can be done prior to the program running. The most difficult part of this algorithm is creating actual x86 instructions to copy into the target process. The algorithm for DLL injection is shown in Figure 3.

Figure 3: DLL-injection(String program-name, String dll-name)

Given the name of the program, create target process in suspended state using Windows CreateProcess, Int process-id = …;

Using OpenProcessToken, LookupPrivilegeValue, AdjustTokenPrivileges, adjust privileges of host process in order to write to target process;

Using GetThreadContext, get pointer to start of the program, Pointer originalEip;

Using OpenProcess, VirtualAllocEx, and WriteProcessMemory, create and write out dll-name into memory.  Let the address of that memory be target-dll-name.

Using VirtualAllocEx, and WriteProcessMemory, create and write out code that performs a LoadLibraryA(target-dll-name), and a jump to the start of the program, originalEip;

Resume the target process.

Optional: Wait for the process to end before finishing host process.

Conclusion

Writing a simple wrapper for cudaMalloc and cudaFree is fairly easy, once you understand how to implement DLL injection and API hooking. Unfortunately, what should be a well-described technique is actually hard to find. The algorithms listed in pseudo code are hard to find, if not non-existant. This article describes the two algorithms, free of the unnecessary clutter of other abstractions that developers add for full-featured programs.

Code

Code that demonstrates the two algorithms is here: http://domemtech.com/code/injection-and-hooking.zip, which is a Visual Studio 2010 project. The CUDA memory debugger can be found at http://code.google.com/p/cuda-memory-debug/.

References

http://developer.nvidia.com/object/cuda_3_1_downloads.html
http://developer.nvidia.com/object/nsight.html
http://code.google.com/p/gpuocelot/
http://en.wikipedia.org/wiki/BoundsChecker
http://dmalloc.com/
DLL Injection, http://en.wikipedia.org/wiki/DLL_injection
Hooking, http://en.wikipedia.org/wiki/Hooking
Three Ways To Inject Your Code Into Another Process, Robert Kuster, August 4, 2003, http://www.codeguru.com/Cpp/W-P/system/processesmodules/article.php/c5767; also http://www.codeproject.com/KB/threads/winspy.aspx
API hooking revealed, Ivo Ivanov, Dec 3, 2002, http://www.codeproject.com/KB/system/hooksys.aspx
Hooks and DLLs, Joseph M. Newcomer, April 1, 2001, http://www.codeproject.com/KB/DLL/hooks.aspx
http://research.microsoft.com/en-us/projects/detours/
http://sandsprite.com/CodeStuff/IAT_Hooking.html
http://newgre.net/ninjectlib
Berdajs, J. and BosniÄ, Z. (2010), Extending applications using an advanced approach to DLL injection and API hooking. Software: Practice and Experience, 40: 567–584. doi: 10.1002/spe.973
A (working) implementation of API hooking (Part I), http://www.codeproject.com/KB/system/APIHookingRevisited.aspx?msg=3186060
DLL Injection and function interception tutorial, http://www.codeproject.com/KB/DLL/DLL_Injection_tutorial.aspx

Writing a wrapper for the CUDA memory allocation API

Leave a comment

Cancel reply