In my application I wanted to copy 3D array data from linear memory (allocated with
) to a
with
. Now, I had the problem that
reported
. If you encounter a problem like that it might be because of the fact that
was used instead of
for allocatimg the linear memory. If
was used then you are forced to create you own
with
in order to call
. According to the manual (this information is not extremely emphasized in it)
is not guaranteed to construct a pitched pointer which is valid for memory copy procedures. One should always use
or
whenever we want to memcpy because these methods guarantee valid pitched pointers.
“For allocations of 2D and 3D objects, it is highly recommended that programmers perform allocations using cudaMalloc3D() or cudaMallocPitch(). Due to alignment restrictions in the hardware, this is especially true if the application will be performing memory copies involving 2D or 3D objects (whether linear memory or CUDA arrays).” [from the Reference Manual]
This, in fact, solved my problem of the
when I tried to
. However, the fact that we use pitched pointer complicates the array access in the kernels. First of all we have to keep track of the data type stored in a pitched pointer (ptr is a void pointer). Furthermore, we have to be careful when dealing with pointer arithmectics because the pitch is, obviously, a byte offset (width of the array). Here is some sample code that shows how to access pitched pointer data. We assume we want to store
data in our 3D array which is of size (width, height, depth). In the kernel we want to access the element (x,y,z).
Host Code:
1
2
3
4
5
| ...<br />
cudaPitchedPtr data;<br />
extent = make_cudaExtent(width* sizeof(float4), height, depth);<br />
cudaMalloc3D(&(d_data), extent);<br />
... |
...<br />
cudaPitchedPtr data;<br />
extent = make_cudaExtent(width* sizeof(float4), height, depth);<br />
cudaMalloc3D(&(d_data), extent);<br />
...
Kernel Code:
1
| __global__ kernel(char* data, size_t pitch, ...) { |
__global__ kernel(char* data, size_t pitch, ...) {
1
2
3
| ...<br />
float4 element = *((float4*) (data + (x*sizeof(float4) + y*pitch + z*pitch*height)));<br />
... |
...<br />
float4 element = *((float4*) (data + (x*sizeof(float4) + y*pitch + z*pitch*height)));<br />
...
The one thing I’d like to point out is the fact that we have to use
for accessing the data. This makes sense because
is a char/byte pointer at the time of the pointer arithmetic. Thus, we have to move the pointer by
. This musn’t be done in the y and z direction because the
is already considered in the pitch (the pitch is the width of the array in byte, probably padded for faster memory access).