{"id":504,"date":"2010-08-13T13:34:14","date_gmt":"2010-08-13T20:34:14","guid":{"rendered":"http:\/\/domemtech.com\/?p=504"},"modified":"2010-08-22T05:37:08","modified_gmt":"2010-08-22T12:37:08","slug":"writing-a-wrapper-for-the-cuda-memory-allocation-api","status":"publish","type":"post","link":"http:\/\/165.227.223.229\/index.php\/2010\/08\/13\/writing-a-wrapper-for-the-cuda-memory-allocation-api\/","title":{"rendered":"Writing a wrapper for the CUDA memory allocation API"},"content":{"rendered":"<p>&nbsp;<\/p>\n<p style=\"text-align: justify; \">As a software developer, you may be faced with the problem of modifying the behavior of a third-party program. &nbsp;For example, you&#39;re writing a debugger that will check the usage of memory allocation in the program being debugged. &nbsp;You often do not have the source for the program, so you cannot change it. &nbsp;And, you do not have the sources for the memory allocation library that the program uses, so you cannot change that code either. &nbsp; This is just the problem I was facing with Nvidia&#39;s CUDA API [<a href=\"#ref1\">1<\/a>]. &nbsp;Although it is not a particularly hard problem, what I learned was yet another example of how frustrating it is to find a solution that should be well described, but is not.&nbsp;<\/p>\n<p><!--more--><\/p>\n<p><strong>The problem<\/strong><\/p>\n<p style=\"text-align: justify; \">I have&nbsp;been programming with CUDA C++&nbsp;SDK for Windows&nbsp;lately, and noticed that I often don&#39;t use the API correctly. &nbsp;Some of the oversights I make are:<\/p>\n<ul>\n<li style=\"text-align: justify; \">I forget to check the return value from cudaMalloc.&nbsp; In this example, the return values of the two CUDA API calls are never checked.&nbsp; However, most likely, they would have succeeded, but this is a bad assumption.<\/li>\n<\/ul>\n<p><script type=\"text\/javascript\" src=\"syntaxhighlighter\/scripts\/shCore.js\"><\/script><script type=\"text\/javascript\" src=\"syntaxhighlighter\/scripts\/shBrushCpp.js\"><\/script><script type=\"text\/javascript\" src=\"syntaxhighlighter\/scripts\/shBrushPlain.js\"><\/script><script type=\"text\/javascript\" src=\"syntaxhighlighter\/scripts\/shBrushAsm.js\"><\/script><\/p>\n<link href=\"syntaxhighlighter\/styles\/shCore.css\" rel=\"stylesheet\" type=\"text\/css\" \/>\n<link href=\"syntaxhighlighter\/styles\/shThemeEclipse.css\" rel=\"Stylesheet\" type=\"text\/css\" \/>\n<pre class=\"brush: cpp; first-line: 1\">...\r\nint h = 1;\r\nint * d;\r\ncudaMalloc(&amp;d, sizeof(h));\r\ncudaMemcpy(d, &amp;h, sizeof(h),&nbsp;cudaMemcpyHostToDevice);\r\n<\/pre>\n<ul>\n<li style=\"text-align: justify; \">I allocate a buffer with the wrong size.&nbsp; In this example, the size of the variable &ldquo;d&rdquo; allocated in device global memory is 10 bytes long, not 40 bytes long, as it should be.&nbsp; In addition, the return values of the two CUDA API calls are never checked.<\/li>\n<\/ul>\n<pre class=\"brush: cpp; first-line: 1\">...\r\nint h[10];\r\nint * d;\r\ncudaMalloc(&amp;d, 10);\r\ncudaMemcpy(d, &amp;h, 10,&nbsp;cudaMemcpyHostToDevice);\r\n<\/pre>\n<ul>\n<li style=\"text-align: justify; \">I use a host pointer as a device pointer, and vice versa.&nbsp; In this example, the intension was to initialize the device memory, but the parameters to cudaMemcpy are reversed.<\/li>\n<\/ul>\n<pre class=\"brush: cpp; first-line: 1\">...\r\nint h[10];\r\nint * d;\r\ncudaMalloc(&amp;d, 10);\r\ncudaMemcpy(&amp;h, d, 10,&nbsp;cudaMemcpyHostToDevice);\r\n<\/pre>\n<p style=\"text-align: justify; \">Two developer tools, Nsight [<a href=\"#ref-nsight\">2<\/a>]&nbsp;and Ocelot [<a href=\"#ref-ocelot\">3<\/a>], seem as though they should help, but do not.&nbsp; Nsight does not detect the first two problems, but does for the third.&nbsp; Nsight also requires you to recompile your program with the -G0 option [<a href=\"#ref1\">1<\/a>] and relink.&nbsp; Ocelot, which should be able to detect all of these problems, does not exist for Windows.&nbsp; There are no additional tools that seem to help. &nbsp;So, I decided to write a wrapper for the CUDA API memory allocation functions. &nbsp;This would help report errors, and find memory overwrites, unfreed blocks of memory, and incorrect use of pointers with these API functions.<\/p>\n<p style=\"text-align: justify; \">There are several ways to implement this programming task, some solutions easier than others. &nbsp;However, I did not want to recompile and relink my program in order to debug it. &nbsp;I decided wanted a tool that would add wrapper functions for the CUDA API at runtime.<\/p>\n<p><strong>First steps<\/strong><\/p>\n<p style=\"text-align: justify; \">The first the problem I needed to solve was to write the wrapper for&nbsp;<em>cudaMalloc&nbsp;<\/em>and&nbsp;<em>cudaFree<\/em>. &nbsp;I first noticed that <em>cudaMalloc <\/em>and <em>cudaFree <\/em>are very much like&nbsp;<em>malloc<\/em>&nbsp;and&nbsp;<em>free.<\/em>&nbsp;&nbsp;This similarity reminded me of an old solution for Windows, <em>BoundsChecker <\/em>[<a href=\"#ref-boundschecker\">4<\/a>], which I used many years ago. &nbsp;In those days, BoundsChecker required you to recompile and relink your program using a special include file and library.&nbsp; The include file contained macros for <em>malloc&nbsp;<\/em>and&nbsp;<em>free, <\/em>which&nbsp;substituted debugging wrapper functions for <em>malloc<\/em> and <em>free<\/em>. &nbsp;A more contemporary example that uses the same solution is Dmalloc [<a href=\"#ref-dmalloc\">5<\/a>]. Following this by example, I wrote an include file and library for debugging a CUDA program.<\/p>\n<p><strong>How do I do DLL injection and hooking?<\/strong><\/p>\n<p style=\"text-align: justify; \">The next step was to find a way to insert this code without the need to recompile and relink my program.&nbsp; Searching Google for &ldquo;intercepting calls to a DLL windows&rdquo;, I learned that I would most likely need to use &ldquo;DLL injection&rdquo; [<a href=\"#ref-dll-injection\">6<\/a>].&nbsp; &ldquo;In&nbsp;<a href=\"http:\/\/en.wikipedia.org\/wiki\/Computer_programming\" title=\"Computer programming\">computer programming<\/a>,&nbsp;<strong>DLL injection<\/strong>&nbsp;is a technique used to run&nbsp;<a href=\"http:\/\/en.wikipedia.org\/wiki\/Code_(computer_programming)\" title=\"Code (computer programming)\">code<\/a>&nbsp;within the&nbsp;<a href=\"http:\/\/en.wikipedia.org\/wiki\/Address_space\" title=\"Address space\">address space<\/a>&nbsp;of another&nbsp;<a href=\"http:\/\/en.wikipedia.org\/wiki\/Process_(computing)\" title=\"Process (computing)\">process<\/a>&nbsp;by forcing it to load a&nbsp;<a href=\"http:\/\/en.wikipedia.org\/wiki\/Dynamic-link_library\" title=\"Dynamic-link library\">dynamic-link library<\/a>.&rdquo;<\/p>\n<p style=\"text-align: justify; \">But, DLL injection was not sufficient.&nbsp; Even if I was able to copy my wrapper code into another process, each call to <em>cudaMalloc<\/em> and <em>cudaFree <\/em>needed to be changed. &nbsp;To do this, I needed to do &ldquo;hooking&rdquo; [<a href=\"#ref-hooking\">6<\/a>].&nbsp; &ldquo;In&nbsp;<a href=\"http:\/\/en.wikipedia.org\/wiki\/Computer_programming\" title=\"Computer programming\">computer programming<\/a>, the term&nbsp;<strong>hooking<\/strong>&nbsp;covers a range of techniques used to alter or augment the behavior of an<a href=\"http:\/\/en.wikipedia.org\/wiki\/Operating_system\" title=\"Operating system\">operating system<\/a>, of&nbsp;<a href=\"http:\/\/en.wikipedia.org\/wiki\/Application_software\" title=\"Application software\">applications<\/a>, or of other software components by intercepting&nbsp;<a href=\"http:\/\/en.wikipedia.org\/wiki\/Subroutine\" title=\"Subroutine\">function calls<\/a>&nbsp;or&nbsp;<a href=\"http:\/\/en.wikipedia.org\/wiki\/Message_passing\" title=\"Message passing\">messages<\/a>&nbsp;or&nbsp;<a href=\"http:\/\/en.wikipedia.org\/wiki\/Event_(computing)\" title=\"Event (computing)\">events<\/a>passed between&nbsp;<a href=\"http:\/\/en.wikipedia.org\/wiki\/Module\" title=\"Module\">software components<\/a>.&rdquo;<\/p>\n<p style=\"text-align: justify; \">The problem is, of course, how to actually do DLL injection and hooking. &nbsp;The more I searched for a simple example that would show how to implement these methods, the more I became confused and aggravated.&nbsp;&nbsp;I didn&rsquo;t want multiple ways to do it, just one [<a href=\"#ref-three-ways\">8<\/a>]; I didn&rsquo;t want a huge class library sitting on top of Windows calls [<a href=\"#ref-class-lib\">9<\/a>]; I need to hook CUDA API calls, not Windows Kernel functions [<a href=\"#ref-only-kern\">1<\/a>0]; I didn&rsquo;t want to use something that restricted it&rsquo;s use just because of the license associated with the source [<a href=\"#ref-detours\">10<\/a>]; I didn&rsquo;t want a complete substitute for the CUDA API, just a wrapper [<a href=\"#ref-iat-hooking\">12<\/a>]; an example which has only DLL injection, no hooking [<a href=\"#ref-ninjectlib\">13<\/a>]; an explanation from a peer-reviewed article, but unavailable code because the URL link is invalid [<a href=\"#ref-berdajs\">14<\/a>]; reasonable code, but not a good description [<a href=\"#ref-good-1\">15<\/a>, <a href=\"#ref-good-2\">16<\/a>]. &nbsp;I finally found a way after choosing the most promising solution [<a href=\"#ref-class-lib\">9<\/a>], then single stepped through the program to discover all the system-level calls. &nbsp;The two other simple solutions ([<a href=\"#ref-good-1\">15<\/a>,&nbsp;<a href=\"#ref-good-2\">16<\/a>]) would have worked equally well.<\/p>\n<p><strong>Hooking<\/strong><\/p>\n<p style=\"text-align: justify; \">Although there are several different methods to perform the hooking, I chose&nbsp;<em>Import Address Table (IAT) patching&nbsp;<\/em>[<a href=\"#ref-class-lib\">9<\/a>]. &nbsp;In Windows, an executable or a DLL, also known as a <em>module<\/em>, can call other modules.&nbsp; When a program is executed, the file is loaded into memory.&nbsp; All modules that the program depends on are also loaded.&nbsp; The loader copies each callee module into memory, and updates the addresses of all referenced locations in the caller module associated with the callee module. &nbsp;These addresses are stored in the IAT for the convenience&nbsp;of the loader. &nbsp;In the same way as a loader updates the IAT&rsquo;s for all modules, hooking involves updating the IAT of the caller module.&nbsp; However, for a hooking wrapper function, the previous value must be saved in order for the wrapper function to call the original CUDA API function.<\/p>\n<p style=\"text-align: justify; \">A typical CUDA program, its compiled and loaded representation is now examined (Figure 1).&nbsp; At 00B41A47in the executable, a call is performed to an inline template for cudaMalloc via a jump table (to 00B4137F).&nbsp; At 00B4137Fin the executable, a jump is performed to the inline template for cudaMalloc (to 00B44AE0). &nbsp;At 00B44AEBin the executable, a call is performed to the IAT for cudaMalloc (to 00B48906).&nbsp; Finally, at 00B48906in the executable, a jump is performed to cudaMalloc in the CUDA runtime DLL (10002F00).&nbsp; Please note that the addresses depend on where the two modules (the executable and the CUDA runtime DLL) are loaded, and will change each time the program is executed.&nbsp; When hooking is done for cudaMalloc, the jump instruction at 00B48906, within the IAT, is changed.<\/p>\n<p>Figure 1: C++ and Assembly code for cudaMalloc.<\/p>\n<pre class=\"brush: asm; first-line: 1\">00B41370  jmp         std::numpunct&lt;char&gt;::`scalar deleting destructor&#39; (0B482D0h)\r\n00B41375  jmp         std::basic_ios&lt;char,std::char_traits&gt;&lt;char&gt; &gt;::fill (0B43DC0h)\r\n00B4137A  jmp         std::_Container_base_secure::_Container_base_secure (0B441D0h)\r\n00B4137F  jmp         cudaMalloc&lt;double&gt; (0B44AE0h)\r\n00B41384  jmp         std::_Container_base_secure::~_Container_base_secure (0B43A50h)\r\n00B41389  jmp         std::ios_base::flags (0B42B30h)\r\n00B4138E  jmp         std::allocator&lt;char&gt;::allocator&lt;char&gt; (0B43360h)\r\n&hellip;\r\n\t\tdouble h[] = {1, 2, 3, 4, 5, 6, 7, 8 };\r\n00B419E9  fld1\r\n00B419EB  fstp        qword ptr [h]\r\n00B419EE  fld         qword ptr [__real@4000000000000000 (0CAC4D8h)]\r\n00B419F4  fstp        qword ptr [ebp-40h]\r\n00B419F7  fld         qword ptr [__real@4008000000000000 (0CAC4C8h)]\r\n00B419FD  fstp        qword ptr [ebp-38h]\r\n00B41A00  fld         qword ptr [__real@4010000000000000 (0CAC4B8h)]\r\n00B41A06  fstp        qword ptr [ebp-30h]\r\n00B41A09  fld         qword ptr [__real@4014000000000000 (0CAC4A8h)]\r\n00B41A0F  fstp        qword ptr [ebp-28h]\r\n00B41A12  fld         qword ptr [__real@4018000000000000 (0CAC498h)]\r\n00B41A18  fstp        qword ptr [ebp-20h]\r\n00B41A1B  fld         qword ptr [__real@401c000000000000 (0CAC488h)]\r\n00B41A21  fstp        qword ptr [ebp-18h]\r\n00B41A24  fld         qword ptr [__real@4020000000000000 (0CAC478h)]\r\n00B41A2A  fstp        qword ptr [ebp-10h]\r\n00B41A2D  push        50h\r\n00B41A2F  lea         eax,[d]\r\n00B41A32  push        eax\r\n00B41A33  call        cudaMalloc&lt;double&gt; (0B4137Fh)\r\n00B41A38  add         esp,8\r\n\t\tdouble * d;\r\n\t\tcudaMalloc(&amp;d, 10*sizeof(double));\r\n\t\tcudaMemcpy(d, h, 10*sizeof(double), cudaMemcpyHostToDevice);\r\n00B41A3B  push        1\r\n00B41A3D  push        50h\r\n00B41A3F  lea         ecx,[h]\r\n00B41A42  push        ecx\r\n00B41A43  mov         edx,dword ptr [d]\r\n00B41A46  push        edx\r\n00B41A47  call        cudaMemcpy (0B488FAh)\r\n\t\tfun&lt;&lt;&lt;1,1&gt;&gt;&gt;(d);\r\n00B41A4C  push        0\r\n00B41A4E  push        0\r\n00B41A50  sub         esp,0Ch\r\n00B41A53  mov         ecx,esp\r\n00B41A55  push        1\r\n00B41A57  push        1\r\n00B41A59  push        1\r\n00B41A5B  call        dim3::dim3 (0B41145h)\r\n00B41A60  sub         esp,0Ch\r\n00B41A63  mov         ecx,esp\r\n00B41A65  push        1\r\n00B41A67  push        1\r\n00B41A69  push        1\r\n00B41A6B  call        dim3::dim3 (0B41145h)\r\n00B41A70  call        cudaConfigureCall (0B488F4h)\r\n00B41A75  test        eax,eax\r\n00B41A77  je          main+0ABh (0B41A7Bh)\r\n00B41A79  jmp         main+0B7h (0B41A87h)\r\n00B41A7B  mov         eax,dword ptr [d]\r\n00B41A7E  push        eax\r\n00B41A7F  call        fun (0B410FFh)\r\n00B41A84  add         esp,4\r\n\t\tcudaThreadSynchronize();\r\n00B41A87  call        cudaThreadSynchronize (0B488EEh)\r\n\t\tint rv = cudaGetLastError();\r\n00B41A8C  call        cudaGetLastError (0B488E8h)\r\n00B41A91  mov         dword ptr [rv],eax\r\n\t\tcudaMemcpy(h, d, 10*sizeof(double), cudaMemcpyDeviceToHost);\r\n00B41A94  push        2\r\n00B41A96  push        50h\r\n00B41A98  mov         ecx,dword ptr [d]\r\n00B41A9B  push        ecx\r\n00B41A9C  lea         edx,[h]\r\n00B41A9F  push        edx\r\n00B41AA0  call        cudaMemcpy (0B488FAh)\r\n&hellip;\r\ntemplate&lt;class t=&quot;&quot;&gt;\r\n__inline__ __host__ cudaError_t cudaMalloc(\r\n  T      **devPtr,\r\n  size_t   size\r\n)\r\n{\r\n00B44AE0  push        ebp\r\n00B44AE1  mov         ebp,esp\r\n00B44AE3  mov         eax,dword ptr [size]\r\n00B44AE6  push        eax\r\n00B44AE7  mov         ecx,dword ptr [devPtr]\r\n00B44AEA  push        ecx\r\n00B44AEB  call        cudaMalloc (0B48906h)\r\n  return cudaMalloc((void**)(void*)devPtr, size);\r\n}\r\n00B44AF0  pop         ebp\r\n00B44AF1  ret\r\n&hellip;\r\ncudaFree:\r\n00B488E2  jmp         dword ptr [__imp__cudaFree@4 (0CC7574h)]\r\ncudaGetLastError:\r\n00B488E8  jmp         dword ptr [__imp__cudaGetLastError@0 (0CC7570h)]\r\ncudaThreadSynchronize:\r\n00B488EE  jmp         dword ptr [__imp__cudaThreadSynchronize@0 (0CC756Ch)]\r\ncudaConfigureCall:\r\n00B488F4  jmp         dword ptr [__imp__cudaConfigureCall@32 (0CC7568h)]\r\ncudaMemcpy:\r\n00B488FA  jmp         dword ptr [__imp__cudaMemcpy@16 (0CC7564h)]\r\ncudaSetDevice:\r\n00B48900  jmp         dword ptr [__imp__cudaSetDevice@4 (0CC7560h)]\r\ncudaMalloc:\r\n00B48906  jmp         dword ptr [__imp__cudaMalloc@8 (0CC755Ch)]\r\ncuMemGetInfo:\r\n00B4890C  jmp         dword ptr [__imp__cuMemGetInfo@8 (0CC75C4h)]\r\n&hellip;\r\n10002EFE  int         3\r\n10002EFF  int         3\r\n10002F00  push        ebp\r\n10002F01  mov         ebp,esp\r\n10002F03  push        0FFFFFFFFh\r\n10002F05  push        10023E40h\r\n10002F0A  mov         eax,dword ptr fs:[00000000h]\r\n10002F10  push        eax\r\n10002F11  sub         esp,0Ch\r\n10002F14  push        ebx\r\n10002F15  push        esi\r\n10002F16  push        edi\r\n10002F17  mov         eax,dword ptr ds:[1003124Ch]\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify; \">The algorithm for hooking is simple, but an understanding depends on knowing several Windows functions and data structures (Figure 2).&nbsp; Basically, given a module and function to hook, it finds all locations in memory that reference that function and substitutes the contents with the address of the wrapper function.<\/p>\n<p>Figure 2:&nbsp;<strong>Hooking<\/strong>(String&nbsp;<em>callee-module<\/em>, String&nbsp;<em>callee-function<\/em>, Pointer&nbsp;<em>subst<\/em>)<\/p>\n<pre class=\"brush: cpp; first-line: 1\">Make sure callee-module is loaded.  Otherwise, there is no IAT to patch in any caller module.\r\n\r\n\/\/ Using Windows LoadLibraryA and GetProcAddress:\r\nPointer previous-address = the address in memory of the callee-function\r\n\r\n\/\/ Using Windows GetCurrentProcessId, OpenProcess, and EnumProcessModules:\r\nHMODULE module-array[ ] = &hellip;;\r\n\r\nforeach (HMODULE caller-module in module-array)\r\n{\r\n    \/\/ Using caller-module and Windows ImageDirectoryEntryToData, get an\r\n    \/\/ array of IMAGE_IMPORT_DESCRIPTOR.\r\n    IMAGE_IMPORT_DESCRIPTOR caller-module-array[ ] = &hellip;;\r\n\r\n    foreach (IMAGE_IMPORT_DESCRIPTOR import-description in caller-module-array)\r\n    {\r\n        String callee =  import-description.Name;\r\n        if (callee == callee-module)\r\n        {\r\n            IMAGE_THUNK_DATA callee-iat-array[ ] = import-description.FirstThunk;\r\n            {\r\n                foreach (IMAGE_THUNK_DATA callee-iat in callee-iat-array)\r\n                {\r\n                    Pointer callee-addr = callee-iat.Function;\r\n                    if (callee-addr != previous-address)\r\n                        continue;\r\n                    Pointer ptr-callee-addr = &amp; callee-iat.Function;\r\n                    \/\/ Using Windows VirtualQuery, set protection level of\r\n                    \/\/ memory containing IAT so it can be written to.\r\n                    * ptr-callee-addr = subst;\r\n                }\r\n            }\r\n        }\r\n    }\r\n}\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p><strong>DLL injection<\/strong><\/p>\n<p style=\"text-align: justify; \">In order to add the hooking code to a user program, the code must be packaged into a DLL, and the DLL injected into another process.&nbsp; Injection is a two-step process.&nbsp; The first step is to create a chunk of memory in a target process, then copy code that loads the DLL.&nbsp; The second step is to execute this code.&nbsp; The problem is when and how to create the process. &nbsp;If you want to perform hooking of calls that could potentially occur during the initialization of that process, you must create the process in a suspended state so that DLL injection can be done prior to the program running.&nbsp; The most difficult part of this algorithm is creating actual x86 instructions to copy into the target process.&nbsp; The algorithm for DLL injection is shown in Figure 3.<\/p>\n<p>Figure 3: <strong>&nbsp;DLL-injection<\/strong>(String program-name, String dll-name)<\/p>\n<pre class=\"brush: cpp; first-line: 1\">Given the name of the program, create target process in suspended state using Windows CreateProcess, Int process-id = &hellip;;\r\n\r\nUsing OpenProcessToken, LookupPrivilegeValue, AdjustTokenPrivileges, adjust privileges of host process in order to write to target process;\r\n\r\nUsing GetThreadContext, get pointer to start of the program, Pointer originalEip;\r\n\r\nUsing OpenProcess, VirtualAllocEx, and WriteProcessMemory, create and write out dll-name into memory.  Let the address of that memory be target-dll-name.\r\n\r\nUsing VirtualAllocEx, and WriteProcessMemory, create and write out code that performs a LoadLibraryA(target-dll-name), and a jump to the start of the program, originalEip;\r\n\r\nResume the target process.\r\n\r\nOptional: Wait for the process to end before finishing host process.\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p><strong>Conclusion<\/strong><\/p>\n<p style=\"text-align: justify; \">Writing a simple wrapper for <em>cudaMalloc <\/em>and <em>cudaFree <\/em>is fairly easy, once you understand how to implement DLL injection and API hooking. &nbsp;Unfortunately, what should be a well-described technique is actually hard to find. &nbsp;The algorithms listed in pseudo code are hard to find, if not non-existant. &nbsp;This article describes the two algorithms, free of the unnecessary clutter of other abstractions that developers add for full-featured programs.<\/p>\n<p><strong>Code<\/strong><\/p>\n<p style=\"text-align: justify; \">Code that demonstrates the two algorithms is here: <a href=\"http:\/\/domemtech.com\/code\/injection-and-hooking.zip\">http:\/\/domemtech.com\/code\/injection-and-hooking.zip<\/a>, which is a Visual Studio 2010 project. &nbsp;The CUDA memory debugger can be found at&nbsp;<a href=\"http:\/\/code.google.com\/p\/cuda-memory-debug\/\">http:\/\/code.google.com\/p\/cuda-memory-debug\/<\/a>.<\/p>\n<p><strong>References<\/strong><\/p>\n<ol>\n<li><a name=\"ref1\"><\/a><a href=\"http:\/\/developer.nvidia.com\/object\/cuda_3_1_downloads.html\">http:\/\/developer.nvidia.com\/object\/cuda_3_1_downloads.html<\/a><\/li>\n<li><a name=\"ref-nsight\"><\/a><a href=\"http:\/\/developer.nvidia.com\/object\/nsight.html\">http:\/\/developer.nvidia.com\/object\/nsight.html<\/a><\/li>\n<li><a name=\"ref-ocelot\"><\/a><a href=\"http:\/\/code.google.com\/p\/gpuocelot\/\">http:\/\/code.google.com\/p\/gpuocelot\/<\/a><\/li>\n<li><a name=\"ref-boundschecker\"><\/a><a href=\"http:\/\/en.wikipedia.org\/wiki\/BoundsChecker\">http:\/\/en.wikipedia.org\/wiki\/BoundsChecker<\/a><\/li>\n<li><a name=\"ref-dmalloc\"><\/a>http:\/\/dmalloc.com\/<\/li>\n<li><a name=\"ref-dll-injection\"><\/a>DLL Injection,&nbsp;<a href=\"http:\/\/en.wikipedia.org\/wiki\/DLL_injection\">http:\/\/en.wikipedia.org\/wiki\/DLL_injection<\/a><\/li>\n<li><a name=\"ref-hooking\"><\/a>Hooking,&nbsp;<a href=\"http:\/\/en.wikipedia.org\/wiki\/Hooking\">http:\/\/en.wikipedia.org\/wiki\/Hooking<\/a><\/li>\n<li><a name=\"ref-three-ways\"><\/a>Three Ways To Inject Your Code Into Another Process,&nbsp;Robert Kuster, August 4, 2003,&nbsp;<a href=\"http:\/\/www.codeguru.com\/Cpp\/W-P\/system\/processesmodules\/article.php\/c5767\">http:\/\/www.codeguru.com\/Cpp\/W-P\/system\/processesmodules\/article.php\/c5767<\/a>; also&nbsp;&nbsp;<a href=\"http:\/\/www.codeproject.com\/KB\/threads\/winspy.aspx\">http:\/\/www.codeproject.com\/KB\/threads\/winspy.aspx<\/a><\/li>\n<li><a name=\"ref-class-lib\"><\/a>API hooking revealed, Ivo Ivanov, Dec 3, 2002,&nbsp;<a href=\"http:\/\/www.codeproject.com\/KB\/system\/hooksys.aspx\">http:\/\/www.codeproject.com\/KB\/system\/hooksys.aspx<\/a><\/li>\n<li><a name=\"ref-only-kern\"><\/a>Hooks and DLLs, Joseph M. Newcomer, April 1, 2001, <a href=\"http:\/\/www.codeproject.com\/KB\/DLL\/hooks.aspx\">http:\/\/www.codeproject.com\/KB\/DLL\/hooks.aspx<\/a><\/li>\n<li><a name=\"ref-detours\"><\/a>http:\/\/research.microsoft.com\/en-us\/projects\/detours\/<\/li>\n<li><a name=\"ref-iat-hooking\"><\/a>http:\/\/sandsprite.com\/CodeStuff\/IAT_Hooking.html<\/li>\n<li><a name=\"ref-ninjectlib\"><\/a>http:\/\/newgre.net\/ninjectlib<\/li>\n<li><a name=\"ref-berdajs\"><\/a>Berdajs, J. and Bosni\u00c4\u0087, Z. (2010), Extending applications using an advanced approach to DLL injection and API hooking. Software: Practice and Experience, 40:&nbsp;567&ndash;584. doi:&nbsp;10.1002\/spe.973<\/li>\n<li><a name=\"ref-good-1\"><\/a>A (working) implementation of API hooking (Part I),&nbsp;<a href=\"http:\/\/www.codeproject.com\/KB\/system\/APIHookingRevisited.aspx?msg=3186060\">http:\/\/www.codeproject.com\/KB\/system\/APIHookingRevisited.aspx?msg=3186060<\/a><\/li>\n<li><a name=\"ref-good-2\"><\/a>DLL Injection and function interception tutorial, <a href=\"http:\/\/www.codeproject.com\/KB\/DLL\/DLL_Injection_tutorial.aspx\">http:\/\/www.codeproject.com\/KB\/DLL\/DLL_Injection_tutorial.aspx<\/a><\/li>\n<\/ol>\n<p><script type=\"text\/javascript\">\n     SyntaxHighlighter.all()\n<\/script><\/p>\n","protected":false},"excerpt":{"rendered":"<p>&nbsp; As a software developer, you may be faced with the problem of modifying the behavior of a third-party program. &nbsp;For example, you&#39;re writing a debugger that will check the usage of memory allocation in the program being debugged. &nbsp;You often do not have the source for the program, so you cannot change it. &nbsp;And, &hellip; <\/p>\n<p class=\"link-more\"><a href=\"http:\/\/165.227.223.229\/index.php\/2010\/08\/13\/writing-a-wrapper-for-the-cuda-memory-allocation-api\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Writing a wrapper for the CUDA memory allocation API&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[],"tags":[],"_links":{"self":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts\/504"}],"collection":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/comments?post=504"}],"version-history":[{"count":0,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts\/504\/revisions"}],"wp:attachment":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/media?parent=504"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/categories?post=504"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/tags?post=504"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}