
cuda - Loading from global memory

Question:

Suppose a simple kernel like this:

__global__ void fg(struct s_tp tp, struct s_param p)
{
  const uint bid = blockIdx.y * gridDim.x + blockIdx.x;
  const uint tid = threadIdx.x;
  const uint idx = bid * blockDim.x + tid;

  if(idx >= p.ntp) return;

  double3 r = tp.rh[idx];

  double d = sqrt(r.x*r.x + r.y*r.y + r.z*r.z);

  tp.d[idx] = d;
}

Is this true?

 double3 r = tp.rh[idx];

  • The data is loaded from global memory into r.

  • r is stored in registers, or in local memory if there are too many variables.

  • r is not stored in shared memory.

  • d is calculated and then written back to global memory.

  • Registers are faster than the other memory spaces.

  • If the register space is full (in some big kernels), local memory is used, and access is slower.

  • When I need doubles, is there any way to speed it up? For example, load the data into shared memory first and then operate on it?

Thanks to all.

Answer:

Yes, it's pretty much all true.

• When I need doubles, is there any way to speed it up? For example, load the data into shared memory first and then operate on it?

Using shared memory is useful when there is either data reuse (loading the same data item more than once, usually by more than one thread in a threadblock), or possibly when you are making a specialized use of shared memory to aid in global coalescing, such as during an optimized matrix transpose.

Data reuse means that you are using (loading) the data more than once, and for shared memory to be useful, it means you are loading it more than once by more than one thread. If you are using it more than once in a single thread, then the single load plus the compiler (automatic) "optimization" of storing it in a register is all you need.
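For illustration, here is a minimal sketch of the data-reuse case (my own example, not code from the question): in a 1-D stencil each input element is read by several threads of the block, so staging it in shared memory once per block pays off:

  __global__ void stencil3(const double* in, double* out, int n)
  {
    __shared__ double tile[256 + 2];                    // assumes blockDim.x == 256, halo of 1 on each side
    int g = blockIdx.x * blockDim.x + threadIdx.x;      // global index
    int s = threadIdx.x + 1;                            // index into the shared tile

    if(g < n) tile[s] = in[g];                          // each element loaded from global memory once per block
    if(threadIdx.x == 0 && g > 0) tile[0] = in[g - 1];  // left halo
    if(threadIdx.x == blockDim.x - 1 && g + 1 < n) tile[s + 1] = in[g + 1];  // right halo
    __syncthreads();

    if(g > 0 && g + 1 < n)
      out[g] = (tile[s - 1] + tile[s] + tile[s + 1]) / 3.0;  // each tile value is reused by up to 3 threads
  }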

EDIT: The answer given by @Jez has some good ideas for optimal loading. Another idea I would suggest is to convert your AoS (array-of-structures) data storage scheme to an SoA (structure-of-arrays) scheme. Data storage transformation is a common approach to improving the speed of CUDA code.

Your s_tp struct, which you haven't shown, appears to have storage for several double quantities per item/struct. If you instead create separate arrays for each of these quantities, you'll have opportunities for optimal loading/storage. Something like this:

__global__ void fg(struct s_tp tp, double* s_tp_rx, double* s_tp_ry, double* s_tp_rz, double* s_tp_d, struct s_param p)
{
  const uint bid = blockIdx.y * gridDim.x + blockIdx.x;
  const uint tid = threadIdx.x;
  const uint idx = bid * blockDim.x + tid;

  if(idx >= p.ntp) return;

  double rx = s_tp_rx[idx];
  double ry = s_tp_ry[idx];
  double rz = s_tp_rz[idx];

  double d = sqrt(rx*rx + ry*ry + rz*rz);

  s_tp_d[idx] = d;
}

This approach is likely to have benefits elsewhere in your device code also, for similar types of usage patterns.
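For completeness, a possible host-side call for the SoA version might look like the sketch below (the array names, grid/block variables, and p.ntp usage are illustrative, since the original s_tp layout wasn't shown):

  double *d_rx, *d_ry, *d_rz, *d_d;
  cudaMalloc(&d_rx, p.ntp * sizeof(double));
  cudaMalloc(&d_ry, p.ntp * sizeof(double));
  cudaMalloc(&d_rz, p.ntp * sizeof(double));
  cudaMalloc(&d_d,  p.ntp * sizeof(double));
  // ... copy the x/y/z components into the separate device arrays ...
  fg<<<grid, block>>>(tp, d_rx, d_ry, d_rz, d_d, p);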

Answer:

It's all true.

When I need doubles, is there any way to speed it up? For example, load the data into shared memory first and then operate on it?

For the example you gave, your implementation is possibly not optimal. The first thing you should do is compare the bandwidth achieved to that of a reference kernel, for example, a cudaMemcpy. If the gap is large, and the speedup you'll gain from closing this gap is significant, optimisations may be possible.
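As a rough sketch of that comparison (the variable names dst, src, grid and block here are illustrative, not from the question): time the kernel and a device-to-device cudaMemcpy that moves the same number of bytes, then compare the resulting GB/s:

  cudaEvent_t start, stop;
  cudaEventCreate(&start); cudaEventCreate(&stop);
  float ms_kernel, ms_copy;

  size_t bytes = p.ntp * (sizeof(double3) + sizeof(double));  // the kernel reads r and writes d

  cudaEventRecord(start);
  fg<<<grid, block>>>(tp, p);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&ms_kernel, start, stop);

  cudaEventRecord(start);
  cudaMemcpy(dst, src, bytes / 2, cudaMemcpyDeviceToDevice);  // reads and writes bytes/2 each
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&ms_copy, start, stop);

  printf("kernel: %.1f GB/s   memcpy: %.1f GB/s\n",
         bytes / ms_kernel / 1e6, bytes / ms_copy / 1e6);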

Looking at your kernel there are two things that strike me as potentially suboptimal:

  1. There's not much work per thread. If possible, processing multiple elements per thread can improve performance (see the grid-stride sketch after this list). This is, in part, because it avoids thread initialisation/removal overheads.
  2. Loading from a double3 isn't as efficient as loading from other types. The best way to load data is usually using 128-bit loads per thread. Loading three consecutive 64-bit values will be slower, perhaps not by a lot, but slower all the same.
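On the first point, a minimal sketch (mine, not from the question) of a grid-stride loop, launched with a 1-D grid, so that each thread processes several elements:

  __global__ void fg(struct s_tp tp, struct s_param p)
  {
    const uint stride = gridDim.x * blockDim.x;
    // each thread walks through the array in steps of the total thread count
    for(uint idx = blockIdx.x * blockDim.x + threadIdx.x; idx < p.ntp; idx += stride)
    {
      double3 r = tp.rh[idx];
      tp.d[idx] = sqrt(r.x*r.x + r.y*r.y + r.z*r.z);
    }
  }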

EDIT: Robert Crovella's answer above gives a good solution to the second point, which requires changing your data type. For some reason I had originally thought this wasn't an option, so the solution below is probably over-the-top if you can just change your data type!

While adding more work per thread is a fairly simple thing to try, optimising your memory access pattern (without changing your data type) is less so. Fortunately there are libraries that can help. I think that using CUB, and in particular the BlockLoad collective, should allow you to load more efficiently. By loading, say, 6 double items per thread using a transpose operator, you can process two elements per thread, pack the results into a double2, and store them normally.
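A rough sketch of that approach (my own code, not tested against your data): it assumes tp.rh is effectively a flat array of doubles laid out x,y,z,x,y,z,..., that the output array is suitably aligned for double2 stores, and that p.ntp is a multiple of the per-block tile size:

  #include <cub/cub.cuh>

  template <int BLOCK_THREADS>
  __global__ void fg_cub(const double* rh, double* d_out, uint ntp)
  {
    const int ITEMS = 6;  // 6 doubles = two double3 points per thread
    typedef cub::BlockLoad<double, BLOCK_THREADS, ITEMS, cub::BLOCK_LOAD_TRANSPOSE> BlockLoad;
    __shared__ typename BlockLoad::TempStorage temp;

    double items[ITEMS];
    // coalesced striped load, transposed through shared memory into a blocked arrangement
    BlockLoad(temp).Load(rh + blockIdx.x * BLOCK_THREADS * ITEMS, items);

    double2 d;
    d.x = sqrt(items[0]*items[0] + items[1]*items[1] + items[2]*items[2]);
    d.y = sqrt(items[3]*items[3] + items[4]*items[4] + items[5]*items[5]);

    // each thread owns two consecutive outputs, stored with a single 128-bit write
    reinterpret_cast<double2*>(d_out)[blockIdx.x * BLOCK_THREADS + threadIdx.x] = d;
  }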
