Hi all. My question may seem obvious to many at first glance, but I'd still ask you not to dismiss it as stupid until you've read to the end.
So here is the issue. As the documentation states, the size of the CUDA grid used to launch a kernel is limited, and the limit depends on the specific device; on most modern graphics cards it is 65535x65535x1. I checked this myself on my G210M and on an 8800GT. But here I ran into something very strange: in my program, for some unknown reason, I cannot launch a kernel whose dimensions exceed 5808x5808 threads (this number can be smaller depending on the block size; I'm quoting the strict maximum) or 264x264 blocks, and the latter number is constant. As soon as the grid reaches 265x265 blocks, the kernel launches and completes, but the result is always all zeros.
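To make the failure mode concrete, here is a minimal sketch of the kind of launch I mean, with error checks after every step. The kernel name `fillKernel` and the 16x16 block size are my own assumptions for illustration, not the actual code; the point is that `cudaGetLastError()` right after the launch catches configuration errors, and the error from the following synchronization catches errors during execution.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes a 1 into its cell of a 2D array.
__global__ void fillKernel(int *out, int pitch)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    out[y * pitch + x] = 1;
}

int main()
{
    dim3 block(16, 16);
    dim3 grid(265, 265);        // the grid size that reportedly starts failing
    int w = grid.x * block.x, h = grid.y * block.y;

    int *d_out = 0;
    cudaError_t err = cudaMalloc((void **)&d_out, (size_t)w * h * sizeof(int));
    if (err != cudaSuccess) { printf("malloc: %s\n", cudaGetErrorString(err)); return 1; }

    fillKernel<<<grid, block>>>(d_out, w);
    err = cudaGetLastError();            // catches bad launch configurations
    if (err != cudaSuccess) printf("launch: %s\n", cudaGetErrorString(err));

    err = cudaDeviceSynchronize();       // catches errors during kernel execution
    if (err != cudaSuccess) printf("exec: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}
```

If this sketch prints nothing and still returns zeros at 265x265 on your cards, that would confirm the error really is being swallowed rather than reported.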
The debugger and NVIDIA Nsight are silent, no error is thrown, and the profiler shows the kernel running. The restriction pops up on every card I've run the program on, eight different models in total (8400M G, 8800GT, 9600GSO, 8500GT, 9600GT, ION, G210M, GF9300).
So all this leads me to believe that there is a limit not only on the grid dimensions but also on the total number of threads in the grid (after all, there is a limit on the number of threads per block, so why shouldn't there be one here too). But neither the official documentation, nor the Boreskov/Kharlamov tutorial, nor the Best Practices Guide says anything on this score: they only state the limit I already quoted at the beginning of the question.
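One way to test the "undocumented total limit" theory is to dump what the driver itself reports for the device. This sketch (my own, using only standard CUDA runtime calls) prints the per-dimension grid and block limits; it also prints `kernelExecTimeoutEnabled`, since on a card that is driving a display the OS watchdog can kill long-running kernels, which is a documented limit that can masquerade as a size limit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) { printf("%s\n", cudaGetErrorString(err)); return 1; }

    printf("device:             %s\n", prop.name);
    printf("maxGridSize:        %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("maxThreadsDim:      %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    // Nonzero means the display-driver watchdog applies to this device.
    printf("kernelExecTimeoutEnabled: %d\n", prop.kernelExecTimeoutEnabled);
    return 0;
}
```

If `maxGridSize` really is 65535x65535x1 on all eight cards, then whatever is failing at 265x265 blocks is not the documented grid limit.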
I've been digging into this for about two hours a day for the past week with no progress, so I'm asking for help: where should I dig next? Any comments are welcome, and if you need any clarification, just ask.