Most tutorials, guides, books, and Q&A on the web seem to refer to CUDA 3 and 4.x, which is why I'm asking specifically about CUDA 5.0. On to the question...
I would like to program for an environment with two CUDA devices but use only one host thread, to keep the design simple (especially because it is a prototype). I want to know if the following code is valid:

    float *x[2];
    float *dev_x[2];

    for (int d = 0; d < 2; d++) {
        cudaSetDevice(d);
        cudaMalloc(&dev_x[d], 1024);
    }

    for (int repeats = 0; repeats < 100; repeats++) {
        for (int d = 0; d < 2; d++) {
            cudaSetDevice(d);
            cudaMemcpy(dev_x[d], x[d], 1024, cudaMemcpyHostToDevice);
            some_kernel<<<...>>>(dev_x[d]);
            cudaMemcpy(x[d], dev_x[d], 1024, cudaMemcpyDeviceToHost);
        }
        cudaStreamSynchronize(0);
    }

I would like to know specifically whether the cudaMalloc(...) allocations made before the test loop persist even when cudaSetDevice() is switched back and forth within the same thread. Also, I would like to know whether the same holds for context-dependent objects such as cudaEvent_t and cudaStream_t.
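To make the second half of the question concrete, the pattern being asked about could be sketched as below. It is only an illustration under stated assumptions, not the real application: scale_kernel, run_on_devices, and the CHECK macro are hypothetical names, and each buffer holds 256 floats to match the 1024 bytes used above. The sketch follows the CUDA programming guide's description that a stream or event is associated with the device that was current when it was created, so each one is created and used only after the matching cudaSetDevice() call.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: leave the enclosing function on any CUDA error.
// For brevity, the early return skips cleanup.
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Hypothetical stand-in for some_kernel: doubles every element.
__global__ void scale_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Returns the number of devices used, or -1 on a CUDA error or wrong result.
int run_on_devices(int max_devices)
{
    const int n = 256;                          // 256 floats = 1024 bytes
    const size_t bytes = n * sizeof(float);

    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) count = 0;  // no usable GPU
    if (max_devices < 0) max_devices = 0;
    if (count > max_devices) count = max_devices;

    std::vector<std::vector<float> > host(count, std::vector<float>(n, 1.0f));
    std::vector<float *> dev(count);
    std::vector<cudaStream_t> stream(count);
    std::vector<cudaEvent_t> done(count);

    // Create every per-device object while its own device is current.
    for (int d = 0; d < count; d++) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaMalloc(&dev[d], bytes));
        CHECK(cudaStreamCreate(&stream[d]));
        CHECK(cudaEventCreate(&done[d]));
    }

    // Queue work on each device in turn from the same host thread.
    // The host buffers are pageable, so the copies may not actually overlap.
    for (int d = 0; d < count; d++) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaMemcpyAsync(dev[d], &host[d][0], bytes,
                              cudaMemcpyHostToDevice, stream[d]));
        scale_kernel<<<(n + 127) / 128, 128, 0, stream[d]>>>(dev[d], n);
        CHECK(cudaGetLastError());              // catches launch errors
        CHECK(cudaMemcpyAsync(&host[d][0], dev[d], bytes,
                              cudaMemcpyDeviceToHost, stream[d]));
        CHECK(cudaEventRecord(done[d], stream[d]));
    }

    // Wait for each device, verify the result, and release its objects.
    int result = count;
    for (int d = 0; d < count; d++) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaEventSynchronize(done[d]));
        for (int i = 0; i < n; i++)
            if (host[d][i] != 2.0f) result = -1;
        CHECK(cudaEventDestroy(done[d]));
        CHECK(cudaStreamDestroy(stream[d]));
        CHECK(cudaFree(dev[d]));
    }
    return result;
}
```

Calling run_on_devices(2) from main() would exercise the two-device case. The point of the sketch is that nothing is re-created after a device switch: the allocations, streams, and events made in the first loop are reused as-is in the later loops.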
I am asking because I have an application written in this style that keeps failing with a mapping error, and I can't tell whether the cause is a memory leak or incorrect API usage.
Note: In my original code, I do check the return value of every single CUDA call. I left the checks out here for readability.
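For readers who want to reproduce the checking mentioned in the note, one common shape for it is sketched below. This is an assumption about what such checking could look like, not the code from the original application: cuda_ok, CUDA_OK, and sync_all_devices are hypothetical names. Only cudaGetErrorString(), cudaSetDevice(), and cudaDeviceSynchronize() come from the CUDA runtime.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report a failed CUDA call and return false.
bool cuda_ok(cudaError_t err, const char *what, const char *file, int line)
{
    if (err == cudaSuccess) return true;
    std::fprintf(stderr, "%s:%d: %s failed: %s\n",
                 file, line, what, cudaGetErrorString(err));
    return false;
}

// Hypothetical macro: wraps a call and records where it was made.
#define CUDA_OK(call) cuda_ok((call), #call, __FILE__, __LINE__)

// Synchronize the first `count` devices in turn. cudaDeviceSynchronize()
// waits only for the current device, so each device is selected first.
// Errors from earlier asynchronous work on a device surface here.
bool sync_all_devices(int count)
{
    bool ok = true;
    for (int d = 0; d < count; d++) {
        ok = CUDA_OK(cudaSetDevice(d)) && CUDA_OK(cudaDeviceSynchronize()) && ok;
    }
    return ok;
}
```

With two devices, a check such as sync_all_devices(2) at the end of each outer iteration would report which device a deferred error belongs to, which may help narrow down an error that only shows up intermittently.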