Parallel programming using CUDA

Last month, I took a short course called Introduction to Multicore Programming, taught by James T. Demers of Physical Sciences Inc. The course introduced me to current hardware and software systems for parallel computing. For me, the timing was good: I had been reading about how important parallel programming is becoming, and I was starting to become interested in parallel algorithms, but my knowledge of parallel computing was embarrassingly sparse. The last time I looked at the subject was in the early 1990s, when I wrote a program that used MPI. Though I own an Intel Q6600 quad-core CPU machine (Figure 1, Figure 2), a byproduct of the rivalry between Intel and AMD, I never really bothered to program it for parallel computing because I did not think four cores would offer that much over one. What I learned from this course was that I had my head in the sand. In fact, I was surprised to learn that the graphics card I owned, an NVIDIA GeForce 9800 GT (Figure 3), was a parallel computing system that I could program. So, I decided to apply what I learned in the class by solving two programming exercises on my system (Figure 4): matrix multiplication and graph breadth-first search. In this post, I describe the first problem, matrix multiplication.

Figure 1. A snapshot of the hardware: CPU.
Figure 2. A snapshot of the hardware: motherboard.
Figure 3. A snapshot of the hardware: GPU.
Figure 4. Overview of the hardware test system.

The fact that graphics cards could be used for anything other than video games or 3D modeling was a revelation to me. Graphics Processing Units (GPUs), the hardware that performs specialized graphics computations, have existed since the mid-1970s [3]. At that time, the operations were primitive and the hardware slow. These processors manipulated the positions of bitmap images called “sprites” [4] on a graphics screen for video games. In the 1990s, as Windows gained popularity, the capability to manipulate 2-dimensional graphic objects such as lines, boxes, circles, regions, and bitmap images was added to GPUs, and the software to manipulate these objects, Microsoft’s DirectDraw, took advantage of these hardware-accelerated computations. Intel introduced the Peripheral Component Interconnect (PCI) bus in 1993 to help increase the I/O bandwidth that GPUs needed. By the mid-1990s, the use of 3-dimensional graphics for video gaming was increasing, and GPUs were now rendering 3D models (IBM Professional Graphics Controller (1984), S3 Virtual Reality Graphics Engine (1995), ATI 3D RAGE (1996), NVIDIA RIVA 128 (1997)). Communication over the shared PCI bus became inadequate because of the large data transfers needed by all peripherals, and because the GPU needed a large share of that bandwidth. In 1996, Intel developed the Accelerated Graphics Port (AGP) to provide a faster, dedicated bus for the GPU. As the processing requirements of the GPU grew, the capacity of the AGP bus also became inadequate, and in 2004 PCI Express was introduced. In 1999, NVIDIA introduced a graphics processor with hardware transform, clipping, and lighting, which it called the first “GPU” [2]. A few years later, the capability to program the GPU for pixel and vertex shading was added. Multicore GPUs were then introduced: “The driving force continues to be the real-time graphics performance needed to render complex, high-resolution 3D scenes at interactive frame rates for games” [5]. In the early 2000s, General-Purpose computation on Graphics Processing Units (GPGPU) was being explored [6]. Nevertheless, I had never heard about any of this, because I thought GPUs were only for graphics, and I was never particularly interested in programming video games.

The Tesla Architecture

In 2006, NVIDIA altered the GPU industry with the introduction of the Tesla architecture for GPGPU [7] [8]. The first chip to use this architecture was the GeForce 8800 GTS. The Tesla architecture consists of a collection of hundreds of cores, each known as a stream processor (SP). Eight stream processors are organized into one streaming multiprocessor (SM); two streaming multiprocessors are organized into one multiprocessor (MP, also called a texture/processor cluster, TPC); and a variable number of TPCs make up a GPU. The GeForce 9800 GT has seven TPCs (Figure 5). With each TPC containing two SMs, for 16 stream processors per TPC (Figure 6), there are 7 × 2 × 8 = 112 stream processors in the GPU.

Figure 5. Architecture of the GeForce 9800 GT [9].
Figure 6. The Multiprocessor (MP) of the GeForce 9800 GT.

Parallelism is achieved through single-instruction, multiple-thread (SIMT) processing. In SIMT, the SM manages groups of threads called warps. Threads in a warp execute the same code, called a kernel, in parallel, starting from the same initial instruction. However, threads can take different paths if there is a data-dependent branch within the code. In that case, the threads in the warp that share a common instruction execute together, while threads at a different instruction execute in another cycle. As such, branching to different instructions serializes sets of threads within a warp. For this reason, branching should be considered carefully in order to achieve high performance.
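As a rough illustration (this kernel is hypothetical, not from the course material), a data-dependent branch like the one below splits a warp into two groups that the hardware must execute one after the other:

    // Hypothetical kernel illustrating warp divergence: threads whose data
    // satisfies the condition take one path, the rest take the other, and
    // the two paths are serialized within the warp.
    __global__ void Kernel_Divergent(float * data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
        {
            if (data[i] > 0.0f)             // data-dependent branch
                data[i] = data[i] * 2.0f;   // executed by one subset of the warp
            else
                data[i] = -data[i];         // executed by the remaining threads
        }
    }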

The Tesla Architecture has six different types of memory accesses: registers, shared memory, constant, texture (L2), global, and CPU motherboard memory (Table 1).  Registers, shared memory, constant, and texture (L2) memory are located on the GPU in the streaming multiprocessor.  Global memory is located off the GPU.  The Tesla can access memory associated with the CPU motherboard, but the speed is dependent on the speed of the Northbridge chipset used in the motherboard.

Table 1. Types of Memory of the Tesla [12].

Shared, constant, and texture memory are shared among the eight stream processors of an SM, and access is very fast. If possible, threads should use this fast memory to improve performance, but that is not always easy to do.
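As a minimal sketch (hypothetical kernel, not from the course material), a block can stage values in shared memory and synchronize at a barrier before using them:

    // Hypothetical kernel: each thread of a block stages one element in
    // fast shared memory, the block synchronizes at a barrier, and the
    // subsequent read hits shared memory instead of global memory.
    __global__ void Kernel_Shared_Stage(const float * in, float * out, int n)
    {
        __shared__ float tile[256];      // assumes blockDim.x <= 256
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            tile[threadIdx.x] = in[i];
        __syncthreads();                 // wait until the whole tile is loaded
        if (i < n)
            out[i] = tile[threadIdx.x];  // read back from shared memory
    }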

The native instruction set for the Tesla is not published, although it presumably contains a wide variety of operations. Instead, NVIDIA defines the PTX assembly language [13], which is translated into the native Tesla instruction set at run time. The PTX instruction set includes load, store, add, subtract, multiply, divide, test, branch, and, or, xor, compare-and-swap (CAS), and barrier synchronization, to name a few [14].

The CUDA Programming Model

The Compute Unified Device Architecture (CUDA) programming model follows a master/slave coprocessor model [11].  In this model, the GPU, called the “device”, is dependent on instructions from the CPU, called the “host”.  A driver program resides on the CPU, and offloads parallel computations to the GPU using a C/C++ API [15].  First, data for the GPU is copied to the GPU’s memory.  Then, the GPU computation, a kernel, is loaded into the GPU and started.  The CPU continues asynchronously while the GPU performs its computations.  Finally, at some point, the CPU synchronizes with the GPU computation, and then retrieves the results from the GPU memory.

For the programmer, instead of having to work directly with warps, a hierarchy of threads is used in the CUDA programming model. Threads are organized into blocks, each block called a Cooperative Thread Array (CTA), and blocks are organized into a grid (see Figure 7). Blocks are three-dimensional structures of threads; each thread is addressed by a unique coordinate (x, y, z). Grids are two-dimensional structures of blocks; each block is addressed by a unique coordinate (x, y). The maximum number of threads per block is 512 (higher on some newer chips), and the maximum dimension of the grid is (65535, 65535) blocks.

A thread is self-referential: the kernel code can reference the block and thread coordinates of the thread executing it. The variables blockIdx.x and blockIdx.y are the coordinates of the block that the thread belongs to; the variables threadIdx.x, threadIdx.y, and threadIdx.z are the coordinates of the thread within the block. Using these variables, the kernel code can partition the data so that each thread operates on different data.

For example (see Figure 7), if the block size is (1+n) columns by (1+m) rows, then block (0,0) contains threads (0,0), (0,1), (0,2), …, (0,n) in its first row through (m,0), (m,1), (m,2), …, (m,n) in its last row.

Figure 7. A two-dimensional grid of blocks, and a block of threads. Each block has z = 1.
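As a hedged sketch (the kernel below is hypothetical, not from the course code), this is how the built-in coordinates are typically combined into a unique element index so that each thread works on its own piece of the data:

    // Hypothetical kernel: combine block and thread coordinates into a
    // unique (row, column) pair, then scale one element of a 2D array
    // stored in row-major order.
    __global__ void Kernel_Scale(float * data, int width, int height, float factor)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (col < width && row < height)
            data[row * width + col] *= factor;
    }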

Threads in a block can communicate with each other, use barrier synchronization, and share memory. A basic CUDA program has these steps (a minimal sketch follows the list):

  • Allocate device memory (cudaMalloc);
  • Copy initial data into the device memory (cudaMemcpy);
  • Call the kernel using syntax similar to a procedure call (e.g., myprocedure<<<grid, block>>>());
  • Synchronize with the kernel (cudaThreadSynchronize);
  • Copy results from the device to host (cudaMemcpy).
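Here is a minimal sketch of these steps for a single float array (the kernel name and sizes are hypothetical, and error checking is omitted for brevity):

    // Hypothetical kernel: each thread doubles one element.
    __global__ void Kernel_Double(float * data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = 2.0f * data[i];
    }

    void Run_Example()
    {
        const int n = 1024;
        float h_data[n];
        for (int i = 0; i < n; ++i)
            h_data[i] = (float)i;

        float * d_data = 0;
        cudaMalloc((void**)&d_data, n * sizeof(float));                          // 1. allocate device memory
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);   // 2. copy input to the device
        Kernel_Double<<< 4, 256 >>>(d_data, n);                                  // 3. launch the kernel (4 blocks of 256 threads)
        cudaThreadSynchronize();                                                 // 4. synchronize with the kernel
        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);   // 5. copy the results back to the host
        cudaFree(d_data);
    }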

Matrix multiply

In the course, the instructor introduced us to GPU computing with CUDA using the example of matrix multiplication. This example is described in the NVIDIA documents [5] and the GPU Computing SDK code samples (http://developer.nvidia.com/object/cuda_3_0_downloads.html).  But there are several issues with the example:

  • The example, while very efficient, is difficult to understand because it uses shared memory and thread synchronization. There are at least two simpler solutions, neither of which uses shared memory. One is shown in the NVIDIA CUDA™ Programming Guide, but the other is not.
  • In the example, the base type of the matrices is floating point.  For comparison, the base type of the matrix should be abstracted using templates.
  • The example breaks up the problem into blocks of a fixed size of 16 by 16 threads. Does the size of the block matter?  In order to test this, the block size should be abstracted.
  • The example is in C, not C++.  While perfectly acceptable, I wondered why none of the examples in the GPU Computing SDK are in C++.  What problems are there in using C++ classes?
  • None of the algorithms are described very well.  This document tries to describe the code at a level of detail I need in order to re-familiarize myself with the problem if I pick it up months later.

The code presented here is a C++ template that has four solutions to matrix multiplication: serial (Multiply_Host), simple one column parallel (Multiply_Simple), simple tiled parallel (Multiply_Simple_Tile), and shared memory tiled parallel (Multiply_Fancy_Tile), in file matrix.cuh.  The type T is a parameterized type for the base type of the matrix (e.g., int or float).

In this template implementation, instances of the class Matrix are constructed through a factory.  A Boolean is passed to the factory (line 539) to indicate whether the instance is in host space or device space.

    static Matrix * Factory(bool _host, int w, int h)
    {
        Matrix * m = 0;
        if (_host)
        {
            m = (Matrix*)malloc(sizeof(Matrix));
            if (m == 0)
                return 0;
            m->data = (T*)malloc(w * h * sizeof(T));
            if (m->data == 0)
                return 0;
            m->width = w;
            m->height = h;
            m->host = _host;
        }
        else
        {
            // Creating this object is kind of tricky.
            if (Check_CUDA_Error(cudaMalloc((void**) &m, sizeof(Matrix)), "Alloc"))
                return 0;
            Add_Ptr_In_Device_Space(m);
            T * da;
            if (Check_CUDA_Error(cudaMalloc((void**) &da, sizeof(T)*w*h), "Alloc"))
                return 0;
            Add_Ptr_In_Device_Space(da);
            // Auto allocated Matrix will not work because it won't be freed correctly.
            // Allocate heap-based Matrix.
            Matrix * hm = (Matrix*)malloc(sizeof(Matrix));
            hm->host = _host;
            hm->width = w;
            hm->height = h;
            hm->data = da;
            if (Check_CUDA_Error(cudaMemcpy((void*)m, hm, sizeof(Matrix), cudaMemcpyHostToDevice), "Memcpy"))
                return 0;
            free(hm);
        }
        return m;
    };

In addition, instances of the class are copied using a function (Copy, lines 467-537) that takes into account the four cases of copying: host to host, host to device, device to host, and device to device.

    static bool Copy(Matrix<T> * dst, Matrix<T> * src)
    {
        bool src_host;
        bool dst_host;
        if (! Is_In_Device_Space((void*)dst))
            dst_host = true;
        else
            dst_host = false;
        if (! Is_In_Device_Space((void*)src))
            src_host = true;
        else
            src_host = false;

        // four cases possible, give device or host space matrices.
        if (dst_host && src_host)
        {
            if (!(src->width == dst->width && src->height == dst->height))
            {
                dst->width = src->width;
                dst->height = src->height;
                free(dst->data);
                dst->data = (T*)malloc(sizeof(T) * src->width * src->height);
            }
            int size = dst->width * dst->height;
            for (int i = 0; i < size; ++i)
                dst->data[i] = src->data[i];
        }
        else if (dst_host && ! src_host)
        {
            // First, get a local copy of the source matrix header.
            Matrix * local = (Matrix*)malloc(sizeof(Matrix));
            if (Check_CUDA_Error(cudaMemcpy((void*)local, src, sizeof(Matrix), cudaMemcpyDeviceToHost), "Memcpy"))
            {
                free(local);
                return false;
            }
            if (!(local->width == dst->width && local->height == dst->height))
            {
                dst->width = local->width;
                dst->height = local->height;
                free(dst->data);
                dst->data = (T*)malloc(sizeof(T) * local->width * local->height);
            }
            // Copy the element data from the device to the host matrix.
            if (Check_CUDA_Error(cudaMemcpy(dst->data, local->data, sizeof(T) * local->width * local->height, cudaMemcpyDeviceToHost), "MemcpySimple"))
            {
                free(local);
                return false;
            }
            free(local);
        }
        else if (src_host && ! dst_host)
        {
            // First, get a local copy of the destination matrix header.
            Matrix * local = (Matrix*)malloc(sizeof(Matrix));
            if (Check_CUDA_Error(cudaMemcpy((void*)local, dst, sizeof(Matrix), cudaMemcpyDeviceToHost), "MemcpyA"))
            {
                free(local);
                return false;
            }
            if (!(local->width == src->width && local->height == src->height))
            {
                local->width = src->width;
                local->height = src->height;
                if (local->data != 0)
                    cudaFree(local->data);
                if (Check_CUDA_Error(cudaMalloc((void**) &local->data, sizeof(T) * local->width * local->height), "Alloc"))
                {
                    free(local);
                    return false;
                }
                Add_Ptr_In_Device_Space(local->data);
                // Write the resized header back to the device-space matrix.
                if (Check_CUDA_Error(cudaMemcpy((void*)dst, local, sizeof(Matrix), cudaMemcpyHostToDevice), "MemcpyB"))
                {
                    free(local);
                    return false;
                }
            }
            // Copy the element data from the host matrix to the device.
            if (Check_CUDA_Error(cudaMemcpy((void*)local->data, src->data, sizeof(T) * local->width * local->height, cudaMemcpyHostToDevice), "MemcpyC"))
            {
                free(local);
                return false;
            }
            free(local);
        }
        else
        {
            return false;
        }
        return true;
    };
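For reference, a hedged usage sketch of Factory and Copy (assuming the template is instantiated with float; initialization of the element data is omitted):

    // Hypothetical usage of Factory and Copy: build a host-space matrix,
    // mirror it in device space, and copy the data in both directions.
    Matrix<float> * h_A = Matrix<float>::Factory(true, 64, 64);    // host-space matrix
    Matrix<float> * d_A = Matrix<float>::Factory(false, 64, 64);   // device-space matrix
    // ... fill h_A->data ...
    Matrix<float>::Copy(d_A, h_A);   // host -> device
    Matrix<float>::Copy(h_A, d_A);   // device -> host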


The serial solution

The serial solution of matrix multiplication is a straightforward implementation of the definition. Three matrices are passed to the function Multiply_Host: A, B, and C, where A and B are the operands of the operation, and C is the result. The implementation computes Cij by summing the products Aik * Bkj for all k (0 .. Width(A)-1) (line 198), for all i (0 .. Height(C)-1) and j (0 .. Width(C)-1). The function returns true to indicate that there were no errors in the computation.

In Figure 8, the element C12 = A10*B02 + A11*B12 + A12*B22 + A13*B32 + A14*B42 + A15*B52. All Cij are computed in a similar manner.

    static bool Multiply_Host(Matrix<T> * C, Matrix<T> * A, Matrix<T> * B)
    {
        int hA = A->height;
        int wA = A->width;
        int wB = B->width;
        for (int i = 0; i < hA; ++i)
            for (int j = 0; j < wB; ++j) {
                T sum = 0;
                for (int k = 0; k < wA; ++k) {
                    T a = A->data[i * wA + k];
                    T b = B->data[k * wB + j];
                    sum += a * b;
                }
                C->data[i * wB + j] = sum;
            }
        return true;
    };
Figure 8. Matrix multiplication via the Serial method.


The simple parallel solution

One obvious parallelization of the serial algorithm is to perform all Cij computations in parallel threads. The simple parallel solution implements this approach. A grid is constructed that contains one block of Width(C) by Height(C) threads. Two procedures are defined: one for the host (Multiply_Simple, lines 205-238), and the other for the device (Kernel_Matrix_Multiply_Simple, lines 9-28). The implementation computes Cij by summing the products Aik * Bkj for all k (line 24), where i is the current row (line 17) and j is the current column (line 16). The function Multiply_Simple returns a Boolean derived from the CUDA calls. If it returns false, an error occurred in the computation, either because the block size is too large for the kernel call, or because the matrices are too large to be allocated by cudaMalloc. The type modifier __global__ means that the function code resides on the device and is only callable from the host (line 247).

In Figure 9, the thread tid12 (lines 15-16) identifies which Cij instance to compute. There is only one block in the grid (line 230), and it is the size of the matrix (line 229). Based on this information, C12 is computed as the sum of the products of a row of A and a column of B (lines 19-24). All Cij are computed in a similar manner in parallel.

This solution has one restriction: the block cannot exceed 512 threads, so the size of the matrix is very limited (for example, a 22 × 22 product has 484 elements and fits, but a 23 × 23 product has 529 elements and does not).

template <class T>
__global__ void Kernel_Matrix_Multiply_Simple(Matrix<T> * C, Matrix<T> * A, Matrix<T> * B)
{
    int wA = A->width;
    int wB = B->width;
    int wC = C->width;
    // 2D Thread ID
    int col = threadIdx.x;
    int row = threadIdx.y;
    // Pvalue stores the Pd element that is computed by the thread
    T Pvalue = 0;
    for (int k = 0; k < wA; ++k)
    {
        T Aelement = A->data[row * wA + k];
        T Belement = B->data[k * wB + col];
        Pvalue += Aelement * Belement;
    }
    // Write the matrix to device memory each thread writes one element
    C->data[row * wC + col] = Pvalue;
}

    static bool Multiply_Simple(Matrix<T> * C, Matrix<T> * A, Matrix<T> * B)
    {
        // Assert that A, B, and Product are host-space matrices.
        if (Is_In_Device_Space(A) || Is_In_Device_Space(B) || Is_In_Device_Space(C))
            return false;

        int wA = A->width;
        int hA = A->height;
        int wB = B->width;
        int hB = wA;
        int wC = wB;
        int hC = hA;

        // Copy host-space matrices to device-space matrices.
        Matrix * d_A = Matrix::Factory(false, wA, hA);
        if (d_A == 0)
        {
            std::cout << "Cannot allocate matrix\n";
            return false;
        }
        Matrix * d_B = Matrix::Factory(false, wB, hB);
        if (d_B == 0)
        {
            std::cout << "Cannot allocate matrix\n";
            return false;
        }
        Matrix::Copy(d_A, A);
        Matrix::Copy(d_B, B);
        Matrix * d_C = Matrix::Factory(false, wC, hC);
        if (d_C == 0)
        {
            std::cout << "Cannot allocate matrix\n";
            return false;
        }

        // Try host page locked memory if required.
        std::cout << "Simple matrix multiply\n";

        // setup execution parameters
        dim3 threads(wC, hC);
        dim3 grid(1,1);

        Kernel_Matrix_Multiply_Simple<T><<< grid, threads >>>(d_C, d_A, d_B);
        cudaThreadSynchronize();
        cudaError_t err = cudaGetLastError();
        if (Check_CUDA_Error(err, "kernel call error"))
            return false;
        return Matrix::Copy(C, d_C);
    };


Figure 9. Matrix multiplication via the Simple Parallel method.
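A short usage sketch (hypothetical driver code; matrix initialization and cleanup are omitted) that checks the parallel result against the serial one:

    // Hypothetical driver: multiply two small host matrices both ways and
    // compare the serial and parallel results.
    Matrix<float> * A = Matrix<float>::Factory(true, 16, 16);
    Matrix<float> * B = Matrix<float>::Factory(true, 16, 16);
    Matrix<float> * C_host = Matrix<float>::Factory(true, 16, 16);
    Matrix<float> * C_gpu  = Matrix<float>::Factory(true, 16, 16);
    // ... fill A->data and B->data ...
    Matrix<float>::Multiply_Host(C_host, A, B);      // serial reference
    Matrix<float>::Multiply_Simple(C_gpu, A, B);     // one 16 x 16 block = 256 threads
    // C_host->data[i] and C_gpu->data[i] should now agree (within float rounding).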


The simple tiled parallel solution

In the simple parallel solution, only matrices of up to 512 elements could be multiplied. In the simple tiled parallel solution, the problem is broken into a number of blocks, each block covering a portion of the matrix C. The tile size is assumed to divide the dimensions of C evenly (lines 255-268). Strictly speaking this is not necessary, because the kernel code checks for an out-of-bounds element of C (lines 45-46). Two procedures are defined: one for the host (Multiply_Simple_Tile, lines 255-320), and the other for the device (Kernel_Matrix_Multiply_Simple_Tile, lines 31-59). The implementation computes Cij by summing the products Aik * Bkj for all k (line 54), where i is the current row (line 38) and j is the current column (line 35).

In Figure 10, there are four blocks in the grid, and the kernel code uses both the block id and the thread id to determine which element to compute. For thread tid10 of block bid01, element C12 is computed, which is A10*B02 + A11*B12 + A12*B22 + A13*B32 + A14*B42 + A15*B52. All Cij are computed in a similar manner in parallel.

template <class T>
__global__ void Kernel_Matrix_Multiply_Simple_Tile(int wTile, int hTile, Matrix<T> * C, Matrix<T> * A, Matrix<T> * B)
{
    // get column number (x).
    int tx = threadIdx.x + blockIdx.x * wTile;

    // get row number (y).
    int ty = threadIdx.y + blockIdx.y * hTile;

    int wA = A->width;
    int wB = B->width;
    int wC = C->width;

    // Bounds checking...
    if (tx >= C->width || ty >= C->height)
        return;

    // Pvalue stores the Pd element that is computed by the thread
    T Pvalue = 0;
    for (int k = 0; k < wA; ++k)
    {
        T Aelement = A->data[ty * wA + k];
        T Belement = B->data[k * wB + tx];
        Pvalue += Aelement * Belement;
    }

    // Write the matrix to device memory each thread writes one element
    C->data[ty * wC + tx] = Pvalue;
}

    static bool Multiply_Simple_Tile(Matrix<T> * C, Matrix<T> * A, Matrix<T> * B, int wTile, int hTile)
    {
        // Assert that A, B, and Product are host-space matrices.
        if (Is_In_Device_Space(A) || Is_In_Device_Space(B) || Is_In_Device_Space(C))
            return false;

        int wA = A->width;
        int hA = A->height;
        int wB = B->width;
        int hB = wA;
        int wC = wB;
        int hC = hA;

        // Note if the tile is not sized so that an integral number fit exactly into the
        // product, return error.
        if (wC % wTile != 0)
        {
            std::cout << "Tile width " << wTile
                << " does not divide matrix width " << wC
                << " evenly.  Try a different wTile size\n";
            return false;
        }
        if (hC % hTile != 0)
        {
            std::cout << "Tile height " << hTile
                << " does not divide matrix height " << hC
                << " evenly.  Try a different hTile size\n";
            return false;
        }

        // Copy host-space matrices to device-space matrices.
        Matrix * d_A = Matrix::Factory(false, wA, hA);
        if (d_A == 0)
        {
            std::cout << "Cannot allocate matrix\n";
            return false;
        }
        Matrix * d_B = Matrix::Factory(false, wB, hB);
        if (d_B == 0)
        {
            std::cout << "Cannot allocate matrix\n";
            return false;
        }
        Matrix::Copy(d_A, A);
        Matrix::Copy(d_B, B);
        Matrix * d_C = Matrix::Factory(false, wC, hC);
        if (d_C == 0)
        {
            std::cout << "Cannot allocate matrix\n";
            return false;
        }
        std::cout << "Simple tile matrix multiply\n";

        // setup execution parameters
        // Divide the matrix into tiles, which is passed into this routine.

        dim3 threads(wTile, hTile);
        dim3 grid(wC / wTile, hC / hTile);

        Kernel_Matrix_Multiply_Simple_Tile<T><<< grid, threads >>>(wTile, hTile, d_C, d_A, d_B);
        cudaThreadSynchronize();
        cudaError_t err = cudaGetLastError();
        if (Check_CUDA_Error(err, "kernel call error"))
            return false;
        return Matrix::Copy(C, d_C);
    };

Figure 10. Matrix multiplication via the Simple Tiled Parallel method.


The shared memory tiled parallel solution

In this implementation, shared memory is used to increase performance. In the previous solution (simple tiled parallel), threads in the same row i of C each load row i of A from global memory, and threads in the same column j of C each load column j of B, so the same elements are fetched from global memory repeatedly. In the shared memory tiled parallel solution, threads cooperate, with each thread loading only one element of A and one element of B per tile. Two procedures are defined: one for the host (Multiply_Fancy_Tile, lines 322-387), and the other for the device (Kernel_Matrix_Multiply_Fancy, lines 62-155). The implementation computes Cij by first loading a tile of A and a tile of B cooperatively (lines 115-135), then summing the products Aik * Bkj for all k within the tile (lines 140-143), where i is the current row and j is the current column.

In Figure 11, there are four blocks in the grid. The thread tid10 of block bid01 computes C12. Threads cooperate in loading one tile at a time, starting with the tile shown in red: tid10, tid00, tid01, tid11 load A10, A00, A01, A11 and B12, B02, B03, B13, respectively. A partial sum is computed: C12 = A10*B02 + A11*B12. Then the block shifts to the next tile (orange), and the threads cooperate in loading it: tid10, tid00, tid01, tid11 load A12, A02, A03, A13 and B32, B22, B23, B33, respectively. The partial sum for C12 becomes A10*B02 + A11*B12 + A12*B22 + A13*B32. The computation finishes with the last tile (yellow), handled in the same manner.

template <class T>
__global__ void Kernel_Matrix_Multiply_Fancy(int wTile, int hTile, Matrix<T> * C, Matrix<T> * A, Matrix<T> * B)
{
    // x refers to the column number; y refers to the row number.

    // Define AS and BS.  Note i is row, j is column.
#define AS(i, j) As[i][j]
#define BS(i, j) Bs[i][j]

    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    int wA = A->width;
    int wB = B->width;

    // Index of the first sub-matrix of A processed by the block.
    // Note that this starts at column 0, row hTile*by.
    int aBegin = wA * hTile * by;

    // Index of the last sub-matrix of A processed by the block.
    // We want to process the entire row of A, row number hTile*by.
    int aEnd = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A to the right of the
    // last tile.  This is just wTile columns to the right.
    int aStep = wTile;

    // A tile is a little different for matrix B.  In this situation, the height
    // and width are reversed relative to A: the tile is wTile units high,
    // and hTile units wide.

    // Index of the first sub-matrix of B processed by the block.  This is the block id
    // column (x) times the width of the tile.
    int bBegin = wTile * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep  = wTile * wB;

    // Csub is used to store the element of the block sub-matrix
    // that is computed by the thread
    T Csub = 0;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin;
             a <= aEnd;
             a += aStep, b += bStep) {

        // The Cuda compiler is very inadequate for variable-sized shared blocks.
        // Declare a maximum and hope it doesn't crap out.  Alternatively, I could
        // pass an additional parameter on the kernel call to allocate shared memory,
        // but it can only handle one __shared__ variable.
#define MAX_BLOCK_SIZE 30
        // Declaration of the shared memory array As used to
        // store the sub-matrix of A
        __shared__ T As[MAX_BLOCK_SIZE][MAX_BLOCK_SIZE];

        // Declaration of the shared memory array Bs used to
        // store the sub-matrix of B
        __shared__ T Bs[MAX_BLOCK_SIZE][MAX_BLOCK_SIZE];

        // Load the matrices from device memory
        // to shared memory; each thread loads
        // one element of each matrix
        AS(ty, tx) = A->data[a + wA * ty + tx];
        BS(ty, tx) = B->data[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        for (int k = 0; k < wTile; ++k)
        {
            Csub += AS(ty, k) * BS(k, tx);
        }

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * hTile * by + wTile * bx;
    C->data[c + wB * ty + tx] = Csub;
};

    static bool Multiply_Fancy_Tile(Matrix * C, Matrix * A, Matrix * B, int wTile, int hTile)
    {
        // Assert that A, B, and Product are host-space matrices.
        if (Is_In_Device_Space(A) || Is_In_Device_Space(B) || Is_In_Device_Space(C))
            return false;

        int wA = A->width;
        int hA = A->height;
        int wB = B->width;
        int hB = wA;
        int wC = wB;
        int hC = hA;

        // Note if the tile is not sized so that an integral number fit exactly into the
        // product, return error.
        if (wC % wTile != 0)
        {
            std::cout << "Tile width " << wTile
                << " does not divide matrix width " << wC
                << " evenly.  Try a different wTile size\n";
            return false;
        }
        if (hC % hTile != 0)
        {
            std::cout << "Tile height " << hTile
                << " does not divide matrix height " << hC
                << " evenly.  Try a different hTile size\n";
            return false;
        }

        // Copy host-space matrices to device-space matrices.
        Matrix<T> * d_A = Matrix<T>::Factory(false, wA, hA);
        Matrix<T> * d_B = Matrix<T>::Factory(false, wB, hB);
        Matrix<T>::Copy(d_A, A);
        Matrix<T>::Copy(d_B, B);
        Matrix<T> * d_C = Matrix<T>::Factory(false, wC, hC);
        std::cout << "Fancy tile matrix multiply\n";

        // setup execution parameters
        // Divide the matrix into tiles, which is passed into this routine.

        dim3 threads(wTile, hTile);
        dim3 grid(wC / wTile, hC / hTile);

        Kernel_Matrix_Multiply_Fancy<T><<< grid, threads >>>(wTile, hTile, d_C, d_A, d_B);
        cudaThreadSynchronize();
        cudaError_t err = cudaGetLastError();
        if (Check_CUDA_Error(err, "kernel call error"))
            return false;
        return Matrix::Copy(C, d_C);
    };


Figure 11. Matrix multiplication via the Shared Memory Tiled Parallel method.
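A hedged usage note: the tile dimensions must divide the matrix evenly, keep the block under 512 threads, and stay within MAX_BLOCK_SIZE. For example (assuming A, B, and C are 512 × 512 host-space matrices created with Factory):

    // Hypothetical call: a 16 x 16 tile gives 256 threads per block (under
    // the 512-thread limit), is below MAX_BLOCK_SIZE (30), and divides a
    // 512 x 512 matrix evenly.
    Matrix<float>::Multiply_Fancy_Tile(C, A, B, 16, 16);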

Comparing the runtime

The program was run on the test hardware (Figure 4), and the results are shown below (Figure 12). The runtime for parallel matrix multiplication is significantly better than the serial CPU implementation (roughly 4× to 13× faster for these cases), with the shared memory tiled algorithm having the best runtime.

Figure 12. Results of Matrix multiplication.

    Matrix<T>    CPU time    Tiled parallel    Shared memory tiled parallel
    short        22.7 s      5.8 s             2.5 s
    float        32.1 s      5.7 s             2.5 s

Code

You can find the code to this example at http://domemtech.com/code/CudaMatrixMultiply.zip.

Bibliography and Footnotes

  1. Figure adapted from: Intel® G31/P31 Express Chipset Datasheet at http://www.intel.com/assets/pdf/datasheet/317495.pdf; Product Brief of the Intel® G31 Express Chipset at http://www.intel.com/assets/pdf/prodbrief/317838.pdf; and machine specific information from CPU-Z as captured in Figure 1 and Figure 2.
  2. What is a GPU? This depends on your definition. NVIDIA claims the invention of the first GPU (http://www.nvidia.com/object/gpu.html), which is defined on their website as "a single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines that is capable of processing a minimum of 10 million polygons per second."
  3. Wikipedia.org. Graphics processing unit, 2010.
  4. Wikipedia.org. Sprite (computer graphics), 2010. 
  5. Nickolls, J., Buck, I., Garland, M. and Skadron, K. Scalable parallel programming with CUDA. Queue, 6 (2). 40-53.
  6. Macedonia, M. The GPU Enters Computing's Mainstream. COMPUTER. 106-108.
  7. Lindholm, E., Nickolls, J., Oberman, S. and Montrym, J. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro. 39-55.
  8. The exact details of this computer architecture are still unknown four years after the introduction of the chip. There are no detailed programming documents corresponding to the Intel architectures (http://www.intel.com/products/processor/manuals/index.htm) that describe registers, instruction sets, etc. The information here is derived from a variety of sources.
  9.  Information derived from multiple sources: [10], [6], [11], and data obtained from GPU-Z in Figure 3.
  10. Cecilia, J., García, J., Guerrero, G., Martínez–del–Amor, M., Pérez–Hurtado, I. and Pérez–Jiménez, M., Implementing P Systems Parallelism by Means of GPUs. in 10th International Workshop Membrane Computing, (Curtea de Arges, Romania, 2010), Springer, 227-241.
  11. Wikipedia.org. Asymmetric multiprocessing, 2010.
  12. This table was culled from a number of sources: [10], [6], and [11].
  13. NVIDIA Parallel Thread Execution, ISA Version 1.4, Santa Clara, CA 95050, 2009. 
  14. The PTX instructions as generated from the CUDA C++ compiler can be viewed using the compiler switch “-keep”.  Note: only GPU architectures sm_11 or above will accept atomic operations. To set these options, open the Property page for the project in Visual Studio, then find and change the CUDA build rule options. The output after a build will be located in files with a “.ptx” suffix.
  15. NVIDIA NVIDIA CUDA™ Programming Guide Version 3.0, 2010.
