Reboot …

After a long hiatus from this blog, I am returning to write about programming, C++, C#, PHP, Java, Linux, Windows griping, and other tidbits. Several years have passed and technologies have matured, ever so slightly on the desktop, IMHO. I have started a new project related to GPGPU computing: Campy.NET, a “better” API for accessing the GPU from CLI programs on Windows. Please check out http://campynet.com. I hope to write more as soon as I figure out why the latest VirtualBox release (4.3.20) will not boot one of my virtual disk images and no longer recognizes the Intel Virtualization setting in my PC’s BIOS. Sigh.

There is much news on the hardware and software fronts. Intel will be releasing various versions of its Broadwell chip over the next year; NVIDIA now has CUDA 7.0 and the Maxwell GPU, and will soon release the large (and probably over-the-top expensive) GM200-400-A1; AMD is limping along, improving its Graphics Core Next processors. Today, I read an interesting article on how Intel lost the mobile business with Apple and the iPhone, a good read, like an Aesop’s fable.

–Ken

Syntactic sugar with C++ AMP

In CUDA, OpenCL, and C++ AMP, a group is a collection of threads that execute in parallel in lock-step fashion. In CUDA, it is called a block; in OpenCL, a work-group; in C++ AMP, a tile. The purpose of a group is to allow the threads within it to communicate with each other using synchronization and/or shared memory. The size of a thread group is set by the programmer, but hardware constraints limit the maximum to 512 or 1024 threads. While programmers usually need to tailor algorithms to be aware of thread groups, there are a few tricks that can make programming easier.
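
To make the terms concrete, here is a minimal C++ AMP sketch (my own illustration, not code from the post) in which each 256-thread tile reduces its chunk of the input using tile_static memory and a barrier:

```cpp
#include <amp.h>
#include <vector>
using namespace concurrency;

// Produce one partial sum per 256-thread tile.
// Assumes data.size() is a multiple of 256 and
// partial.size() == data.size() / 256.
void tiled_sum(const std::vector<float>& data, std::vector<float>& partial)
{
    array_view<const float, 1> in((int)data.size(), data);
    array_view<float, 1> out((int)partial.size(), partial);
    out.discard_data();

    parallel_for_each(in.extent.tile<256>(),
        [=](tiled_index<256> idx) restrict(amp)
    {
        tile_static float shared[256];          // group-shared memory
        shared[idx.local[0]] = in[idx.global[0]];
        idx.barrier.wait();                     // lock-step sync point

        // Tree reduction within the tile.
        for (int s = 128; s > 0; s /= 2)
        {
            if (idx.local[0] < s)
                shared[idx.local[0]] += shared[idx.local[0] + s];
            idx.barrier.wait();
        }
        if (idx.local[0] == 0)
            out[idx.tile[0]] = shared[0];
    });
    out.synchronize();
}
```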

Continue reading “Syntactic sugar with C++ AMP”

Performance comparison of CUDA, OpenCL, and C++ AMP

Getting information about the underlying design of a GPGPU programming environment and its hardware can be difficult. Companies will not publish design details because they do not want you, or other companies, to copy the technology. But sometimes you need to know details that are just not published in order to use a technology effectively. If the companies won’t tell you how it works, the only recourse is experimentation [1, 2]. What is the performance of OpenCL, CUDA, and C++ AMP? What can we learn from this information?
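
For reference, the usual way to time a kernel from the host in CUDA is with events; a generic sketch (not the post’s benchmark code) looks like this:

```cpp
#include <cuda_runtime.h>

// Time whatever kernel launch the caller wraps in 'launch'.
// Returns elapsed GPU time in milliseconds.
float time_kernel(void (*launch)())
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    launch();                       // enqueue the kernel
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);     // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```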

Continue reading “Performance comparison of CUDA, OpenCL, and C++ AMP”

Developing CUDA and C++ AMP in Visual Studio 2011

Every time Microsoft releases a new version of Visual Studio C++, NVIDIA releases a new version of CUDA to work with it. Unfortunately, that took NVIDIA an incredibly long time the last time around. After Visual Studio 2010 was released on April 4, 2010, CUDA integration with Visual Studio 2010 didn’t arrive until CUDA 4.0 RC1 on March 4, 2011, nearly a year later! And, to this day, the build rules have never worked cleanly for me (http://stackoverflow.com/questions/6156037/issue-with-production-release-of-cuda-toolkit-4-0-and-nsight-2-0). Because I’m developing C++ AMP and CUDA side by side and cannot wait for NVIDIA, I decided to develop the build rules myself, and to work out the details of calling CUDA from Visual Studio C++ 2012.
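
The general shape of the workaround, sketched below with hypothetical file and function names, is to compile the .cu file with nvcc in a custom build step and expose the kernel launch to the VC++-compiled host code through a plain C entry point:

```cpp
// kernels.cu : compiled in a custom build step, e.g.
//   nvcc -c kernels.cu -o kernels.obj
// and kernels.obj is then added to the Visual Studio link line.
#include <cuda_runtime.h>

__global__ void scale(float* x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Plain C entry point, so host code compiled by the VC++ compiler
// can link to it without ever seeing CUDA syntax.
extern "C" void scale_on_gpu(float* dev_x, float a, int n)
{
    scale<<<(n + 255) / 256, 256>>>(dev_x, a, n);
}

// Host side (compiled by VC++), in some .cpp file:
//   extern "C" void scale_on_gpu(float*, float, int);
//   ...
//   scale_on_gpu(dev_buffer, 2.0f, n);
```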

Continue reading “Developing CUDA and C++ AMP in Visual Studio 2011”

C++ AMP

At the AMD Fusion Developer Summit 2011 in June, Microsoft announced C++ Accelerated Massive Parallelism (C++ AMP), an extension to C++ for parallel programming of GPUs. This extension, included in Visual Studio 11, would allow developers to program the GPU with language features that are arguably much more powerful than either CUDA or OpenCL. In comparison, CUDA and OpenCL seem more like stone knives and bearskins. After what seemed like a long three-month wait, Microsoft has finally released the Visual Studio 11 Developer Preview, which contains C++ AMP, Microsoft’s vision of GPU programming. Was it worth the wait?
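
To see why, look at the canonical vector add written in C++ AMP: it is ordinary C++ with a restricted lambda, no separate kernel language and no explicit buffer copies. A minimal sketch (mine, not the post’s):

```cpp
#include <amp.h>
using namespace concurrency;

// Element-wise c = a + b, written entirely in C++.
void vector_add(const float* a, const float* b, float* c, int n)
{
    array_view<const float, 1> av(n, a), bv(n, b);
    array_view<float, 1> cv(n, c);
    cv.discard_data();   // no need to copy c's old contents to the GPU

    parallel_for_each(cv.extent, [=](index<1> i) restrict(amp)
    {
        cv[i] = av[i] + bv[i];
    });
    cv.synchronize();    // copy results back into c
}
```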

Continue reading “C++ AMP”

Antlr PHP target

My wife recently started a WordPress blog on regulatory affairs in the drug and medical device industries. When she started it, she decided to add posts from time to time about newly approved drugs and generics, based on information from the FDA website. After proofreading a few of her drafts, I realized that this was a good example of an aggregator problem, something I've been wanting to tackle for my own blog. The aggregator would read the Drugs@FDA and MedlinePlus websites, extract newly approved pharmaceuticals and generics, then create a draft post for each new drug. Sounds like an easy project, right?

Continue reading “Antlr PHP target”

OpenCL vs. CUDA

Today's processors have undergone a huge transformation from those of just 10 years ago. CPU manufacturers Intel and AMD (and up-and-coming CPU designer ARM) have increased processor speed through greater emphasis on superscalar execution, deeper pipelining, branch prediction, out-of-order execution, and fast multi-level caches. This design philosophy has resulted in faster response times for single tasks executing on a processor, but at the expense of increased circuit complexity, high power consumption, and a small number of cores per die. GPU manufacturers NVIDIA and ATI, on the other hand, have focused their designs on processors with many simple cores that implement SIMD parallelism, which hides the latency of instruction execution [1].

While GPUs have been around for about 10 years, the software support for these processors has taken years to catch up, and developers are still sifting through solutions for programming them. OpenCL and CUDA are frameworks for GPGPU computing. Each framework comprises a language for expressing kernel code (the instructions that run on the GPU) and an API for calling kernels from the CPU. While the frameworks are similar, there are some important differences.
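
A generic CUDA vector add (my own sketch, not code from the post) shows that split: the kernel is CUDA C++, and the host drives it through the runtime API.

```cpp
#include <cuda_runtime.h>

// Kernel code: runs on the GPU, one thread per element.
__global__ void add(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Host code: the API side. Allocate, copy, launch, copy back.
void add_on_gpu(const float* a, const float* b, float* c, int n)
{
    size_t bytes = n * sizeof(float);
    float *da, *db, *dc;
    cudaMalloc((void**)&da, bytes);
    cudaMalloc((void**)&db, bytes);
    cudaMalloc((void**)&dc, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

    add<<<(n + 255) / 256, 256>>>(da, db, dc, n);

    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
    cudaFree(da); cudaFree(db); cudaFree(dc);
}
```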

CUDA is a proprietary framework. It is not open source, and all changes to the language and API are made by NVIDIA. But, some third-party tools have been built around the framework and it does seem to have a large following in academia.  Unfortunately, CUDA only runs on NVIDIA devices.  While it should be possible to run CUDA code on other platforms using Ocelot, this only works on Linux systems.
 
OpenCL is a standardized framework, and it is starting to gain popularity. Like NVIDIA's CUDA C++, OpenCL allows programmers to use the massive parallel computing power of GPUs for general-purpose computing. Unlike CUDA, OpenCL works on any supported GPU or CPU, including Intel, AMD, NVIDIA, IBM, and ARM processors.
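
The same vector add through OpenCL's C API, for contrast (error handling elided; again a sketch, not the post's code). Note that the kernel is plain text, compiled at run time for whatever device is present, which is the portability OpenCL trades on:

```cpp
#include <CL/cl.h>

static const char* src =
    "__kernel void add(__global const float* a,"
    "                  __global const float* b,"
    "                  __global float* c)"
    "{ int i = get_global_id(0); c[i] = a[i] + b[i]; }";

void add_with_opencl(const float* a, const float* b, float* c, size_t n)
{
    cl_platform_id plat;  clGetPlatformIDs(1, &plat, nullptr);
    cl_device_id dev;     clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT,
                                         1, &dev, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, nullptr);

    // Compile the kernel source for this device.
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, nullptr);
    clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
    cl_kernel k = clCreateKernel(prog, "add", nullptr);

    size_t bytes = n * sizeof(float);
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               bytes, (void*)a, nullptr);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               bytes, (void*)b, nullptr);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, nullptr, nullptr);

    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, bytes, c, 0, nullptr, nullptr);

    clReleaseMemObject(da); clReleaseMemObject(db); clReleaseMemObject(dc);
    clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
}
```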
 
Does OpenCL make programming multiple platforms easier? Is it as fast as CUDA, or does it sacrifice speed for diverse platform support?

Continue reading “OpenCL vs. CUDA”

WASTE — A CUDA Emulator

CUDA C++ is a parallel programming language for NVIDIA GPUs. Using the CUDA Toolkit (http://developer.nvidia.com/object/cuda_3_2_toolkit_rc.html), developers can write programs that take advantage of the large parallel computing potential of the GPU, speeding up their programs by several orders of magnitude. CUDA programs execute on the GPU, which works as a coprocessor to the CPU, with its own memory and instruction set. But the question is: after a developer has invested the time to parallelize his program, can the CUDA program run on a PC without an NVIDIA GPU? Does the developer have to redo all his software?
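
Ordinarily, a CUDA host program discovers the hardware with cudaGetDeviceCount, and on a PC without an NVIDIA GPU that is a dead end; a minimal sketch of the check (mine, not WASTE's code):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0)
    {
        // Without an emulator, execution stops here on a GPU-less PC.
        std::printf("No CUDA-capable device found.\n");
        return 1;
    }
    std::printf("Found %d CUDA device(s).\n", count);
    return 0;
}
```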

Continue reading “WASTE — A CUDA Emulator”

Writing a wrapper for the CUDA memory allocation API

As a software developer, you may be faced with the problem of modifying the behavior of a third-party program. For example, you're writing a debugger that checks how memory allocation is used in the program being debugged. You often do not have the source for that program, so you cannot change it. And you do not have the source for the memory allocation library the program uses, so you cannot change that code either. This is just the problem I was facing with NVIDIA's CUDA API [1]. Although it is not a particularly hard problem, what I learned was yet another example of how frustrating it is to find a solution that should be well described, but is not.
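
One common shape for the solution is to interpose your own cudaMalloc between the program and the runtime. The sketch below uses Linux's LD_PRELOAD and dlsym(RTLD_NEXT); it is one possible approach, not necessarily the one worked out in the post, and a Windows build would hook differently:

```cpp
// Build: g++ -shared -fPIC hook.cpp -o libhook.so -ldl
// Run:   LD_PRELOAD=./libhook.so ./the_program
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cstdio>
#include <cuda_runtime.h>

// Our cudaMalloc shadows the runtime's. Preloaded, it runs instead of
// the real function for the whole process; it records the call, then
// forwards to the real implementation found with RTLD_NEXT.
extern "C" cudaError_t cudaMalloc(void** devPtr, size_t size)
{
    typedef cudaError_t (*real_t)(void**, size_t);
    static real_t real = (real_t)dlsym(RTLD_NEXT, "cudaMalloc");

    cudaError_t err = real(devPtr, size);
    void* p = (err == cudaSuccess) ? *devPtr : nullptr;
    std::fprintf(stderr, "cudaMalloc(%zu bytes) -> %p (status %d)\n",
                 size, p, (int)err);
    return err;
}
```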

Continue reading “Writing a wrapper for the CUDA memory allocation API”

Parallel programming using CUDA

Last month, I took a short course called Introduction to Multicore Programming, taught by James T. Demers of Physical Sciences Inc. The course introduced me to current hardware and software systems for parallel computing. For me, it was timely: I had been reading about how important parallel programming is becoming, and I was starting to become interested in parallel algorithms, but my knowledge of parallel computing was embarrassingly sparse. The last time I looked at the subject was in the early 1990s, when I wrote a program that used MPI. Though I own an Intel Q6600 quad-core CPU machine (Figure 1, Figure 2), a byproduct of the rivalry between Intel and AMD, I never really bothered to program it for parallel computing because I did not think four cores would offer much over one. What I learned from this course was that I had my head in the sand. In fact, I was surprised to learn that the graphics card I owned, an NVIDIA GeForce 9800 GT (Figure 3), was a parallel computing system that I could program. So, I decided to apply what I learned in the class by solving two programming exercises on my system (Figure 4): matrix multiplication and graph breadth-first search. In this post, I describe the first problem, matrix multiplication.
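
For readers who want the shape of that first exercise, here is the naive CUDA kernel (a generic sketch, not the post's code): one thread computes one element of C = A * B.

```cpp
#include <cuda_runtime.h>

// C = A * B for square n-by-n matrices in row-major order.
__global__ void matmul(const float* A, const float* B, float* C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n)
    {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

// Launch with 16x16 thread blocks covering the whole matrix.
void matmul_on_gpu(const float* dA, const float* dB, float* dC, int n)
{
    dim3 block(16, 16);
    dim3 grid((n + 15) / 16, (n + 15) / 16);
    matmul<<<grid, block>>>(dA, dB, dC, n);
}
```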

Continue reading “Parallel programming using CUDA”