Thursday, September 18 2008, 23:42
First steps in CUDA
I really want to throw myself in CUDA, so here I go. The first step will be to create a simple project, that runs under visual studio.
First, I just want to have a working environment that works with cuda. Since I don't want to fight with VS to handle cuda, i just searched for a VS2005 (sigh, cuda doesn't support VS2008 yet :'( ) project template. I found this one, which was the google first guess : http://forums.nvidia.com/index.php?showtopic=69183 It provides the Debug/Release/EmuDebug/EmuRelease modes, and the appropriated paths for compiling, which is exactly what i was looking for.
Good surprise, i works perfectly. The provided sample compiles without any issue, but it won't launch itself, because it can't find "cutild64.dll". It seems that the SDK doesn't set the PATH so the system can find cuda's dlls... I don't want my cuda's dll to be somewhere in \system32 or anywhere else than the cuda sdk, so i will upgrade my env variables. Go to the environment variable setting pannel (win key + pause key => advanced => env variables) For the moment I will only add debug DLLs, in order to try. I add a CUDA_DLL_PATH var that is equal to "%NVSDKCUDA_ROOT%\bin\win64\Debug" and added the following to my Path variable : ";%CUDA_DLL_PATH%"
After relaunching visual studio, it worked verry well. So let's add the release path, (before i forget to :D ) by adding "%NVSDKCUDA_ROOT%\bin\win64\Debug" to your newly created CUDA_DLL_PATH From now, I can compile and then launch CUDA project, after associating .cu files to visual c++, it's finally ok to start coding :)
Now lets go into cuda itself...
In order to launch code on the GPU, CUDA provides "Kernels". These are function that you can call from the CPU, but that will be run on the GPU. These functions are declared by the "global" keyword.
The "device" keywords means that the function will be callable from the GPU only, and run on it. The "host" keyword means that the function will be callable from the CPU only, and run on it. "host" is the default function qualifier. If you don't specify one, it will be host by default.
GPUs are highly parallel architectures, a high number of threads will run through your code, but you have to specify this number of threads. This is done when calling the kernel :
kernelFunctionName<<<blocksNumber, blockSize>(params);
It allows you to set your thread blocks parameters. A thread block is... well it's a block containing threads :D These informations are contained in a cuda special type called "dim3" which is basically an array of 3 integers.
If you set your blockSize to "dim3(5, 5)" you will have 5 * 5 ( * 1, for the implicit Z dimension) = 25 threads.
Then you can choose to create multiple threads blocks, by doing the same for the first kernel parameter (blocksNumber).
A set of threads blocks is called a grid.
If you choose dim3(2, 2) you will have 4 blocks of 25threads running through your code. Optimally choosing the right number of blocks and threads probably requires a high knowledge of the cuda doc, mostly the chapter 5, which is focused on optimisation.
But if you call your kernel, you must have a way to get the results of your computation back. This is done by allocating memory on the device, copying data to it, and then copy the results back in the main memory. CUDA provides a set of functions for this purpose. The one I will use are :
- cudaMalloc(void** ptrToAlloc, size_t sizeToAlloc);
- cudaMemcpy(void* destination, void* source, size_t sizeToCopy, cudaMemcpyKind mode); //cudaMemcpyKind is an enum to set the copy mode (host to device, device to host...)
- cudaFree(void* ptrToFree);
For a test, i will write a little main.cu which will square every elements in an array. :
First, the main itself. It consists of classic c/c++
int main() { //setting the number of elements to compute const int NbElems = 100; //classic c++ allocation of the data to compute float *toSquare = new float[NbElems]; for (int i = 0; i < NbElems; ++i) toSquare[i] = (float)i; float* device = NULL; //allocating memory on the device cudaMalloc((void**)&device, sizeof(*toSquare) * NbElems); // and copying datas to this mevory space cudaMemcpy(device, toSquare, sizeof(*toSquare) * NbElems, cudaMemcpyHostToDevice); //Cuda Kernel execution { //we will have 5 * 5 threads blocks : dim3 blockSize(10, 10); //Z dimension is implicitly 1 const int nbThreads = blockSize.x * blockSize.y; //we must have a grid containing enougth threads to compute every data, so the number of elements divided by the number of threads, and if it remains data to compute, we add one thread block int nbBlocks = NbElems / nbThreads + (NbElems % nbThreads == 0 ? 0 : 1); // and we launch the kernel ! SquareArray<<<nbBlocks, blockSize>>>(device, NbElems); } // getting the results back cudaMemcpy(toSquare, device, sizeof(*toSquare) * NbElems, cudaMemcpyDeviceToHost); //and this doesn't need commentary i guess :p for (int i = 0; i < NbElems; ++i) std::cout << i << " => " << toSquare[i] << "\n"; std::cout.flush(); cudaFree(device); delete[] toSquare; return 0; }
So we have our main code, but it misses a kernel to compile and run... Beside the global qualifier, a cuda function is exactly the same as in C :
__global__ void SquareArray(float* in, int nb) { int id = ((blockIdx.x + blockIdx.y * blockDim.x) * (blockDim.x * blockDim.y)) + blockDim.x * threadIdx.y + threadIdx.x; if (id < nb) in[id] = in[id] * in[id]; }
The first line is just to know the array of the current array index this kernel has do compute. Knowing the thread and blocks size and id allow a cuda code to run lock free, which is kinda fast :p
Then there's just a "bounds" check, which, actually, isn't necessary. Without it the code will be executed just fine. Maybe there's a change in perfs, but I didn't pay attention to this yet.
Now it's complete, both main and kernel are done, compile and run, and you've a nice "array squarer" in cuda ! Hurray !
Now the next step will be to abstract the memory stuff in a class, to ease the pain :p
one comment
Thanks for this, came across the same problem of missing DLL's, your blog helped.