
c++ - How can you factor out branching from a tight loop?

Question:

My question is: how can I add features to my processing loop without the overhead of checking the true/false state of the user settings to navigate the branches? The settings are the same for every iteration of the loop. Do modern processors with branch prediction make this unnecessary?

My programs often follow this pattern:

  1. User adjusts settings, checkboxes, sliders, numerical entries.
  2. Data is processed when an update is triggered

    1. Apply settings to local variables
    2. Run loop over a large dataset

      • Use if statements to bypass code disabled by the user settings.
      • Return from the loop.
    3. Return the transformed data.

How can you template or inline out all permutations ahead of time?

example:

bool setting1 = true;
bool setting2 = false;

vector<float> data;
for (int i = 0; i < 100000; i++)
    data.push_back(i);

for (int i = 0; i < 100000; i++) {
    if (setting1)
    {
        doStuff(data[i]);
        ....
    }
    if (setting2)
    {
        doMoreStuff(data[i]);
        .....
    }

    .... // etc
}

I know this is a silly example. But I'd like to know what pattern scales when there are lots of branches.

Answer:

Use templates for the main loop.

template <bool A, bool B>
void loop() {
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (A) // will get compiled out if A == false
        {
            doStuff(data[i]);
            ....
        }
        if (B)
        {
            doMoreStuff(data[i]);
            .....
        }

        .... // etc
    }
}

When the settings change, dispatch to the matching instantiation (you could probably make this less code; see the sketch below):

if (setting1) {
  if (setting2)
    loop<1,1>();
  else
    loop<1,0>();
}
else {
  if (setting2)
    loop<0,1>();
  else
    loop<0,0>();
}
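
A more compact dispatch is possible; as a sketch of my own (not part of the original answer), the two settings can be packed into an index into a table of the four instantiations:

// Illustrative helper, assuming loop<A, B>() as defined above.
void run_loop(bool setting1, bool setting2) {
    using Fn = void (*)();
    static const Fn table[4] = {
        loop<false, false>, loop<false, true>,
        loop<true,  false>, loop<true,  true>,
    };
    // re-dispatch only when the settings actually change
    table[(setting1 ? 2 : 0) | (setting2 ? 1 : 0)]();
}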

You want to keep calling the same instantiation of loop() until the settings change.

This should be used with care, as it can lead to code bloat.


Profiled the answers (g++, -O2 optimization):

 %time  cumulative s  self s
 46.15      0.84       0.84    fb()  (blackbear)
 38.37      1.53       0.69    fa()  (OP)
 16.13      1.82       0.29    fc()  (pubby8)
Answer:

Firstly, unless the operations are incredibly cheap compared to the cost of an iteration of the loop (branches + loop overhead), simply don't worry about it and do whatever is most readable. Premature optimisation is the root of much evil; don't just assume things will be slow, do some profiling so that you know.

If you do find yourself genuinely spending more time iterating than doing useful work (that is, your overhead is too high), you need a sensible way to reduce that overhead, for example by selecting between different loop bodies/implementations optimised for particular combinations of inputs.

Factoring the conditions out of the loop, to make multiple loops, might initially seem like a good idea; however, if most of the settings are usually enabled and your actual operations are cheap enough for the overhead to matter at all, you may find the performance largely unchanged: each of the new loops still has a per-iteration cost!

If that is the case, a way forward might be to use templates or other means to instantiate variants of the loop body for the most common combinations of inputs, select at a high level between loops calling those when a suitable one is available, and fall back to the generic case when it is not.
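
A minimal sketch of that idea (my own illustration, not from the answer), assuming the question's doStuff/doMoreStuff operations on a vector<float>:

#include <vector>

void doStuff(float&);       // assumed per-element operations from the question
void doMoreStuff(float&);

void process(std::vector<float>& data, bool setting1, bool setting2) {
    if (setting1 && !setting2) {
        // variant specialised for an expected-common combination:
        // no per-element branching inside the loop
        for (float& x : data)
            doStuff(x);
    } else {
        // generic fallback: keeps the per-element branches
        for (float& x : data) {
            if (setting1) doStuff(x);
            if (setting2) doMoreStuff(x);
        }
    }
}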

Answer:

If the size of the dataset is known at compile time, then the compiler can potentially perform:

  • loop unrolling

If it is a mathematical operation:

  • vectorization can come into play (see the sketch below)
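
A minimal sketch of a loop the optimizer can unroll and vectorize once the trip count is a compile-time constant (the size N and the scaling operation are purely illustrative, not from the answer):

#include <array>
#include <cstddef>

constexpr std::size_t N = 1024;   // size known at compile time

void scale(std::array<float, N>& data, float factor) {
    for (std::size_t i = 0; i < N; ++i)
        data[i] *= factor;        // branch-free arithmetic: unroll/SIMD friendly
}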

You can also do the logic inside-out:

if (epic_setting)
{
    // massive for loop
}

Note that splitting the work into multiple passes over the data can be bad for memory locality, as someone else pointed out.

Branch prediction will help you a good deal, if and only if the cost of the missed branch is less than the speedup given (for a large dataset, it should help, not hurt).

If your data operation is fully parallel, i.e. the per-element work is independent (SIMD-style), you could investigate threading out the operations: e.g. spin up T threads and have thread t process every element whose index i satisfies i % T == t (you can partition the data in other ways too). For a large enough dataset, presuming you don't need synchronisation between the operations, you will see a near-linear speedup.
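
A hedged sketch of that strided partitioning with std::thread (doStuff stands in for the per-element work and the thread count T is a parameter; this is my illustration, not code from the answer):

#include <cstddef>
#include <thread>
#include <vector>

void doStuff(float&);   // assumed per-element operation from the question

void processParallel(std::vector<float>& data, unsigned T) {
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < T; ++t) {
        // thread t handles every index i with i % T == t
        workers.emplace_back([&data, t, T] {
            for (std::size_t i = t; i < data.size(); i += T)
                doStuff(data[i]);
        });
    }
    for (auto& w : workers)
        w.join();       // no other synchronisation needed: elements are independent
}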

If you are writing this for a specialized system, e.g., an industrial computer with a given CPU, and you can assume you will always have that CPU, you can optimize much more heavily for what that CPU can support. Things like exact cache size, depth of pipeline, etc, can all be coded in. Unless you can assume that, it's sketchy to try and assume on those counts.

Answer:

You can avoid the overhead this way (supposing settingx doesn't affect settingy):

if(setting1) {
    for(int i=0;i<100000;i++) {
        // ...
    }
}

if(setting2) {
    for(int i=0;i<100000;i++) {
        // ...
    }
}

if(setting3) {
    for(int i=0;i<100000;i++) {
        // ...
    }
}

But, in my opinion, the best solution is keeping your code. Today's branch prediction units are very powerful, and considering that you'll loop many thousands of times with every branch having the same result, you can afford a few cycles of warm up ;)

EDIT:
I compared our approaches to the problem with a simple console program (sources, it's C# though). The loop is executed 1000000 times and I used trigonometric functions together with double precision floating point operations. Test 2 is the solution I showed above, and the three letters are the values of setting1, setting2 and setting3.
Results are:
Results are:

test 1 - fft: 13974 ms
test 2 - fft: 14106 ms

test 1 - tft: 27728 ms
test 2 - tft: 28081 ms

test 1 - ttt: 41833 ms
test 2 - ttt: 41982 ms

I also did a test run with all three test functions empty to prove loop and calling overhead is minimal:

test 1 - fft: 4 ms
test 2 - fft: 4 ms

test 1 - tft: 8 ms
test 2 - tft: 8 ms

test 1 - ttt: 12 ms
test 2 - ttt: 12 ms

Effectively, my solution is about 1% slower. The second point of my answer, though, is proven correct: the loop overhead is completely negligible.
