Limits on the grid size in Nvidia CUDA with a two-dimensional grid?

0 like 0 dislike
Hi all. My question may seem obvious at first sight, but I ask you not to dismiss it as stupid until you have read it to the end.

So, the issue. As the documentation states, the CUDA grid size at kernel launch has limits that depend on the specific device. For most modern graphics cards the limit is 65535x65535x1; I verified this on my G210M and on an 8800GT. But here I ran into something very strange: in my program, for some unknown reason, I cannot launch a kernel larger than 5808x5808 threads (this number may be smaller depending on the block size; I quote the strict maximum), or 264x264 if measured in blocks, and the latter number is constant. As soon as the block count reaches 265x265, the kernel launches and completes, but the result is always zero.

The debugger and Nvidia Nsight are silent, no error is thrown, and the profiler shows the kernel running normally. The restriction shows up on every card I have run the program on — 8 different models in total (8400M G, 8800GT, 9600GSO, 8500GT, 9600GT, ION, G210M, GF9300).
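(A side note, not from the original post: "no error is thrown" may simply mean the error was never queried. CUDA kernel launches are asynchronous, so an invalid launch configuration is only reported by `cudaGetLastError()` after the launch, and execution-time faults only by the status of `cudaDeviceSynchronize()`. A minimal checking pattern might look like this; `CUDA_CHECK` and `testKernel` are illustrative names, not from the thread:)

```cuda
// Sketch: surfacing launch errors that the debugger/profiler won't show.
#include <cstdio>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
        }                                                           \
    } while (0)

__global__ void testKernel(int* g_odata) { g_odata[0] = 1; }

int main() {
    int* d_odata = nullptr;
    CUDA_CHECK(cudaMalloc(&d_odata, sizeof(int)));

    testKernel<<<dim3(264, 264), dim3(256)>>>(d_odata);
    CUDA_CHECK(cudaGetLastError());        // launch-configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());   // errors during execution

    CUDA_CHECK(cudaFree(d_odata));
    return 0;
}
```

If the launch configuration really were out of range, `cudaGetLastError()` would return `cudaErrorInvalidConfiguration` here instead of staying silent.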

All of this leads me to believe that there is a limit not only on the grid dimensions but also on the total number of threads in the grid (after all, there is a limit on threads per block — why shouldn't there be one here too). But neither the official documentation, nor the Boreskov/Kharlamov tutorial, nor the best practices guide says anything about this — they only state the limit I already quoted at the beginning of the question.
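(Another aside, not from the original post: rather than relying on the documented 65535x65535x1 figure, the actual limits of a given card can be queried at runtime from `cudaDeviceProp`. A small sketch, assuming device 0 is the card in question:)

```cuda
// Sketch: print the real per-device limits instead of assuming them.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0

    printf("device:             %s\n", prop.name);
    printf("maxGridSize:        %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("maxThreadsDim:      %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```

Note that `maxThreadsPerBlock` limits threads per block only; CUDA documents no separate cap on total threads in a grid beyond the grid-dimension limits themselves.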

I have been digging into this for about two hours a day for the past week with no progress, so I am asking for help — where should I dig? Any comments are welcome; if you need any clarification, just ask.

1 Answer

0 like 0 dislike
Just checked — I was not able to reproduce your problem.
I have a GTX 470.
So, I wrote this kernel:
__global__ void testKernel(int* g_odata)
{
    if (threadIdx.x == 0) {
        g_odata[2 * (blockIdx.y * gridDim.x + blockIdx.x)]     = blockIdx.y;
        g_odata[2 * (blockIdx.y * gridDim.x + blockIdx.x) + 1] = blockIdx.x;
    }
}

Launched it on 8192x8192 blocks of 1024 threads each (your cards max out at 512 threads per block; 1024 is for Fermi):
dim3 grid(8192, 8192, 1);
dim3 threads(1024, 1, 1);
testKernel<<<grid, threads, 0>>>(d_odata);

Naturally, I allocated memory and so on.
And the last element of the array came out as 8191x8191.
I didn't test anything larger because I ran out of memory :( You'd have to implement some extra logic for that.
And I don't understand where your round numbers 265 and 264 come from.
