As a software developer, you may be faced with the problem of modifying the behavior of a third-party program. For example, you're writing a debugger that will check the usage of memory allocation in the program being debugged. You often do not have the source for the program, so you cannot change it. And, you do not have the sources for the memory allocation library that the program uses, so you cannot change that code either. This is just the problem I was facing with Nvidia's CUDA API [1]. Although it is not a particularly hard problem, what I learned was yet another example of how frustrating it is to find a solution that should be well described, but is not.
The problem
I have been programming with CUDA C++ SDK for Windows lately, and noticed that I often don't use the API correctly. Some of the oversights I make are:
- I forget to check the return value from cudaMalloc. In this example, the return values of the two CUDA API calls are never checked. However, most likely, they would have succeeded, but this is a bad assumption.
... int h = 1; int * d; cudaMalloc(&d, sizeof(h)); cudaMemcpy(d, &h, sizeof(h), cudaMemcpyHostToDevice);
- I allocate a buffer with the wrong size. In this example, the size of the variable “d” allocated in device global memory is 10 bytes long, not 40 bytes long, as it should be. In addition, the return values of the two CUDA API calls are never checked.
... int h[10]; int * d; cudaMalloc(&d, 10); cudaMemcpy(d, &h, 10, cudaMemcpyHostToDevice);
- I use a host pointer as a device pointer, and vice versa. In this example, the intension was to initialize the device memory, but the parameters to cudaMemcpy are reversed.
... int h[10]; int * d; cudaMalloc(&d, 10); cudaMemcpy(&h, d, 10, cudaMemcpyHostToDevice);
Two developer tools, Nsight [2] and Ocelot [3], seem as though they should help, but do not. Nsight does not detect the first two problems, but does for the third. Nsight also requires you to recompile your program with the -G0 option [1] and relink. Ocelot, which should be able to detect all of these problems, does not exist for Windows. There are no additional tools that seem to help. So, I decided to write a wrapper for the CUDA API memory allocation functions. This would help report errors, and find memory overwrites, unfreed blocks of memory, and incorrect use of pointers with these API functions.
There are several ways to implement this programming task, some solutions easier than others. However, I did not want to recompile and relink my program in order to debug it. I decided wanted a tool that would add wrapper functions for the CUDA API at runtime.
First steps
The first the problem I needed to solve was to write the wrapper for cudaMalloc and cudaFree. I first noticed that cudaMalloc and cudaFree are very much like malloc and free. This similarity reminded me of an old solution for Windows, BoundsChecker [4], which I used many years ago. In those days, BoundsChecker required you to recompile and relink your program using a special include file and library. The include file contained macros for malloc and free, which substituted debugging wrapper functions for malloc and free. A more contemporary example that uses the same solution is Dmalloc [5]. Following this by example, I wrote an include file and library for debugging a CUDA program.
How do I do DLL injection and hooking?
The next step was to find a way to insert this code without the need to recompile and relink my program. Searching Google for “intercepting calls to a DLL windows”, I learned that I would most likely need to use “DLL injection” [6]. “In computer programming, DLL injection is a technique used to run code within the address space of another process by forcing it to load a dynamic-link library.”
But, DLL injection was not sufficient. Even if I was able to copy my wrapper code into another process, each call to cudaMalloc and cudaFree needed to be changed. To do this, I needed to do “hooking” [6]. “In computer programming, the term hooking covers a range of techniques used to alter or augment the behavior of anoperating system, of applications, or of other software components by intercepting function calls or messages or eventspassed between software components.”
The problem is, of course, how to actually do DLL injection and hooking. The more I searched for a simple example that would show how to implement these methods, the more I became confused and aggravated. I didn’t want multiple ways to do it, just one [8]; I didn’t want a huge class library sitting on top of Windows calls [9]; I need to hook CUDA API calls, not Windows Kernel functions [10]; I didn’t want to use something that restricted it’s use just because of the license associated with the source [10]; I didn’t want a complete substitute for the CUDA API, just a wrapper [12]; an example which has only DLL injection, no hooking [13]; an explanation from a peer-reviewed article, but unavailable code because the URL link is invalid [14]; reasonable code, but not a good description [15, 16]. I finally found a way after choosing the most promising solution [9], then single stepped through the program to discover all the system-level calls. The two other simple solutions ([15, 16]) would have worked equally well.
Hooking
Although there are several different methods to perform the hooking, I chose Import Address Table (IAT) patching [9]. In Windows, an executable or a DLL, also known as a module, can call other modules. When a program is executed, the file is loaded into memory. All modules that the program depends on are also loaded. The loader copies each callee module into memory, and updates the addresses of all referenced locations in the caller module associated with the callee module. These addresses are stored in the IAT for the convenience of the loader. In the same way as a loader updates the IAT’s for all modules, hooking involves updating the IAT of the caller module. However, for a hooking wrapper function, the previous value must be saved in order for the wrapper function to call the original CUDA API function.
A typical CUDA program, its compiled and loaded representation is now examined (Figure 1). At 00B41A47in the executable, a call is performed to an inline template for cudaMalloc via a jump table (to 00B4137F). At 00B4137Fin the executable, a jump is performed to the inline template for cudaMalloc (to 00B44AE0). At 00B44AEBin the executable, a call is performed to the IAT for cudaMalloc (to 00B48906). Finally, at 00B48906in the executable, a jump is performed to cudaMalloc in the CUDA runtime DLL (10002F00). Please note that the addresses depend on where the two modules (the executable and the CUDA runtime DLL) are loaded, and will change each time the program is executed. When hooking is done for cudaMalloc, the jump instruction at 00B48906, within the IAT, is changed.
Figure 1: C++ and Assembly code for cudaMalloc.
00B41370 jmp std::numpunct<char>::`scalar deleting destructor' (0B482D0h) 00B41375 jmp std::basic_ios<char,std::char_traits><char> >::fill (0B43DC0h) 00B4137A jmp std::_Container_base_secure::_Container_base_secure (0B441D0h) 00B4137F jmp cudaMalloc<double> (0B44AE0h) 00B41384 jmp std::_Container_base_secure::~_Container_base_secure (0B43A50h) 00B41389 jmp std::ios_base::flags (0B42B30h) 00B4138E jmp std::allocator<char>::allocator<char> (0B43360h) … double h[] = {1, 2, 3, 4, 5, 6, 7, 8 }; 00B419E9 fld1 00B419EB fstp qword ptr [h] 00B419EE fld qword ptr [__real@4000000000000000 (0CAC4D8h)] 00B419F4 fstp qword ptr [ebp-40h] 00B419F7 fld qword ptr [__real@4008000000000000 (0CAC4C8h)] 00B419FD fstp qword ptr [ebp-38h] 00B41A00 fld qword ptr [__real@4010000000000000 (0CAC4B8h)] 00B41A06 fstp qword ptr [ebp-30h] 00B41A09 fld qword ptr [__real@4014000000000000 (0CAC4A8h)] 00B41A0F fstp qword ptr [ebp-28h] 00B41A12 fld qword ptr [__real@4018000000000000 (0CAC498h)] 00B41A18 fstp qword ptr [ebp-20h] 00B41A1B fld qword ptr [__real@401c000000000000 (0CAC488h)] 00B41A21 fstp qword ptr [ebp-18h] 00B41A24 fld qword ptr [__real@4020000000000000 (0CAC478h)] 00B41A2A fstp qword ptr [ebp-10h] 00B41A2D push 50h 00B41A2F lea eax,[d] 00B41A32 push eax 00B41A33 call cudaMalloc<double> (0B4137Fh) 00B41A38 add esp,8 double * d; cudaMalloc(&d, 10*sizeof(double)); cudaMemcpy(d, h, 10*sizeof(double), cudaMemcpyHostToDevice); 00B41A3B push 1 00B41A3D push 50h 00B41A3F lea ecx,[h] 00B41A42 push ecx 00B41A43 mov edx,dword ptr [d] 00B41A46 push edx 00B41A47 call cudaMemcpy (0B488FAh) fun<<<1,1>>>(d); 00B41A4C push 0 00B41A4E push 0 00B41A50 sub esp,0Ch 00B41A53 mov ecx,esp 00B41A55 push 1 00B41A57 push 1 00B41A59 push 1 00B41A5B call dim3::dim3 (0B41145h) 00B41A60 sub esp,0Ch 00B41A63 mov ecx,esp 00B41A65 push 1 00B41A67 push 1 00B41A69 push 1 00B41A6B call dim3::dim3 (0B41145h) 00B41A70 call cudaConfigureCall (0B488F4h) 00B41A75 test eax,eax 00B41A77 je main+0ABh (0B41A7Bh) 00B41A79 jmp main+0B7h (0B41A87h) 00B41A7B mov eax,dword ptr [d] 00B41A7E push eax 00B41A7F call fun (0B410FFh) 00B41A84 add esp,4 cudaThreadSynchronize(); 00B41A87 call cudaThreadSynchronize (0B488EEh) int rv = cudaGetLastError(); 00B41A8C call cudaGetLastError (0B488E8h) 00B41A91 mov dword ptr [rv],eax cudaMemcpy(h, d, 10*sizeof(double), cudaMemcpyDeviceToHost); 00B41A94 push 2 00B41A96 push 50h 00B41A98 mov ecx,dword ptr [d] 00B41A9B push ecx 00B41A9C lea edx,[h] 00B41A9F push edx 00B41AA0 call cudaMemcpy (0B488FAh) … template<class t=""> __inline__ __host__ cudaError_t cudaMalloc( T **devPtr, size_t size ) { 00B44AE0 push ebp 00B44AE1 mov ebp,esp 00B44AE3 mov eax,dword ptr [size] 00B44AE6 push eax 00B44AE7 mov ecx,dword ptr [devPtr] 00B44AEA push ecx 00B44AEB call cudaMalloc (0B48906h) return cudaMalloc((void**)(void*)devPtr, size); } 00B44AF0 pop ebp 00B44AF1 ret … cudaFree: 00B488E2 jmp dword ptr [__imp__cudaFree@4 (0CC7574h)] cudaGetLastError: 00B488E8 jmp dword ptr [__imp__cudaGetLastError@0 (0CC7570h)] cudaThreadSynchronize: 00B488EE jmp dword ptr [__imp__cudaThreadSynchronize@0 (0CC756Ch)] cudaConfigureCall: 00B488F4 jmp dword ptr [__imp__cudaConfigureCall@32 (0CC7568h)] cudaMemcpy: 00B488FA jmp dword ptr [__imp__cudaMemcpy@16 (0CC7564h)] cudaSetDevice: 00B48900 jmp dword ptr [__imp__cudaSetDevice@4 (0CC7560h)] cudaMalloc: 00B48906 jmp dword ptr [__imp__cudaMalloc@8 (0CC755Ch)] cuMemGetInfo: 00B4890C jmp dword ptr [__imp__cuMemGetInfo@8 (0CC75C4h)] … 10002EFE int 3 10002EFF int 3 10002F00 push ebp 10002F01 mov ebp,esp 10002F03 push 0FFFFFFFFh 10002F05 push 10023E40h 10002F0A mov eax,dword ptr fs:[00000000h] 10002F10 push eax 10002F11 sub esp,0Ch 10002F14 push ebx 10002F15 push esi 10002F16 push edi 10002F17 mov eax,dword ptr ds:[1003124Ch]
The algorithm for hooking is simple, but an understanding depends on knowing several Windows functions and data structures (Figure 2). Basically, given a module and function to hook, it finds all locations in memory that reference that function and substitutes the contents with the address of the wrapper function.
Figure 2: Hooking(String callee-module, String callee-function, Pointer subst)
Make sure callee-module is loaded. Otherwise, there is no IAT to patch in any caller module. // Using Windows LoadLibraryA and GetProcAddress: Pointer previous-address = the address in memory of the callee-function // Using Windows GetCurrentProcessId, OpenProcess, and EnumProcessModules: HMODULE module-array[ ] = …; foreach (HMODULE caller-module in module-array) { // Using caller-module and Windows ImageDirectoryEntryToData, get an // array of IMAGE_IMPORT_DESCRIPTOR. IMAGE_IMPORT_DESCRIPTOR caller-module-array[ ] = …; foreach (IMAGE_IMPORT_DESCRIPTOR import-description in caller-module-array) { String callee = import-description.Name; if (callee == callee-module) { IMAGE_THUNK_DATA callee-iat-array[ ] = import-description.FirstThunk; { foreach (IMAGE_THUNK_DATA callee-iat in callee-iat-array) { Pointer callee-addr = callee-iat.Function; if (callee-addr != previous-address) continue; Pointer ptr-callee-addr = & callee-iat.Function; // Using Windows VirtualQuery, set protection level of // memory containing IAT so it can be written to. * ptr-callee-addr = subst; } } } } }
DLL injection
In order to add the hooking code to a user program, the code must be packaged into a DLL, and the DLL injected into another process. Injection is a two-step process. The first step is to create a chunk of memory in a target process, then copy code that loads the DLL. The second step is to execute this code. The problem is when and how to create the process. If you want to perform hooking of calls that could potentially occur during the initialization of that process, you must create the process in a suspended state so that DLL injection can be done prior to the program running. The most difficult part of this algorithm is creating actual x86 instructions to copy into the target process. The algorithm for DLL injection is shown in Figure 3.
Figure 3: DLL-injection(String program-name, String dll-name)
Given the name of the program, create target process in suspended state using Windows CreateProcess, Int process-id = …; Using OpenProcessToken, LookupPrivilegeValue, AdjustTokenPrivileges, adjust privileges of host process in order to write to target process; Using GetThreadContext, get pointer to start of the program, Pointer originalEip; Using OpenProcess, VirtualAllocEx, and WriteProcessMemory, create and write out dll-name into memory. Let the address of that memory be target-dll-name. Using VirtualAllocEx, and WriteProcessMemory, create and write out code that performs a LoadLibraryA(target-dll-name), and a jump to the start of the program, originalEip; Resume the target process. Optional: Wait for the process to end before finishing host process.
Conclusion
Writing a simple wrapper for cudaMalloc and cudaFree is fairly easy, once you understand how to implement DLL injection and API hooking. Unfortunately, what should be a well-described technique is actually hard to find. The algorithms listed in pseudo code are hard to find, if not non-existant. This article describes the two algorithms, free of the unnecessary clutter of other abstractions that developers add for full-featured programs.
Code
Code that demonstrates the two algorithms is here: http://domemtech.com/code/injection-and-hooking.zip, which is a Visual Studio 2010 project. The CUDA memory debugger can be found at http://code.google.com/p/cuda-memory-debug/.
References
- http://developer.nvidia.com/object/cuda_3_1_downloads.html
- http://developer.nvidia.com/object/nsight.html
- http://code.google.com/p/gpuocelot/
- http://en.wikipedia.org/wiki/BoundsChecker
- http://dmalloc.com/
- DLL Injection, http://en.wikipedia.org/wiki/DLL_injection
- Hooking, http://en.wikipedia.org/wiki/Hooking
- Three Ways To Inject Your Code Into Another Process, Robert Kuster, August 4, 2003, http://www.codeguru.com/Cpp/W-P/system/processesmodules/article.php/c5767; also http://www.codeproject.com/KB/threads/winspy.aspx
- API hooking revealed, Ivo Ivanov, Dec 3, 2002, http://www.codeproject.com/KB/system/hooksys.aspx
- Hooks and DLLs, Joseph M. Newcomer, April 1, 2001, http://www.codeproject.com/KB/DLL/hooks.aspx
- http://research.microsoft.com/en-us/projects/detours/
- http://sandsprite.com/CodeStuff/IAT_Hooking.html
- http://newgre.net/ninjectlib
- Berdajs, J. and BosniÄ, Z. (2010), Extending applications using an advanced approach to DLL injection and API hooking. Software: Practice and Experience, 40: 567–584. doi: 10.1002/spe.973
- A (working) implementation of API hooking (Part I), http://www.codeproject.com/KB/system/APIHookingRevisited.aspx?msg=3186060
- DLL Injection and function interception tutorial, http://www.codeproject.com/KB/DLL/DLL_Injection_tutorial.aspx