This is a quick high-level tour of my recent paper with Vashti Galpin, *Bounds on series-parallel slowdown*.

Suppose you are writing some parallel code. You fire up Matlab, and plan to use the Parallel Computing Toolbox to sort out the tedious details of scheduling. But it doesn’t run nearly as fast as you were hoping. What went wrong?

There are many ways for parallel computations to fail to achieve their potential. Ideally, running a computation on processors should take only times as long as on one processor, a speedup of times. At the very least, one would hope for speedup. For problems involving searching, starting the search in many different places can even make the speedup superlinear; so randomized restarts after a timeout can be a good strategy for such problems. But usually various delays are introduced in the parallel implementation, compared to the idealised computation that one would wish to see, so there is a *slowdown*.

The first attempt to explain slowdown was Amdahl’s Law: if a fraction of the computation is inherently not parallel, then as the rest of the computation disappears to zero with the addition of more and more processors, this non-parallel fraction will dominate. So if 1% is inherently not parallel, then even if the rest takes no time at all, the computation can never achieve speedup of more than 100. The figure illustrates how the speedup *S* is limited by both the number of processors *m* as well as the fraction *f* of the program that is inherently not parallel.

In practice, it is not clear how to find the magic fraction — what exactly does it mean for part of the code to be inherently not parallelisable?

Yet it turns out that adding just a little more detail to our model of parallel programs allows some progress. Let slowdown be the ratio of the time to run the program to the time one would hope it would take.

To keep things really simple, consider a particular instance of a computation, corresponding to the instructions and state of a parallel program while it runs. Group sequences of instructions into *activities* in some way, each having some duration. Then some parts of the computation will depend on the results produced by other parts, creating *precedence constraints* between activities: some activities will need to complete before other activities can start.

Now even if there are no delays at all due to communication, cache misses, resource contention, or any of the other fun things that make parallel systems so interesting, there will still be some inherent slowdown. *(Note added 07-May-2010: we work with such a simplistic model because any other sources of slowdown just make things worse! Amdahl’s argument also works this way. So a situation that is bad for the simple model will be even worse in a real system.)*

On the one hand, the activities could be represented directly, and we could tell the system about all the precedence constraints between them, letting it work things out.

But this is not what actually happens. Creating a synchronisation object (as in OpenCL) for every precedence constraint is tedious and adds a lot of overhead. Usually one instead uses a toolkit of parallel operations provided by the system. These involve creating and managing threads, sending and waiting for messages, or most commonly, operations like Matlab’s `parfor`

which flags activities that all may run in parallel.

Our representation of the parallel computation tells the system about our choice of activities and the precedence constraints between them. Each particular representation has syntactic limitations, which may implicitly introduce additional precedence constraints between activities.

We consider the question:

How much slowdown can we attribute to being restricted to only series-parallel language constructs?

Let’s look at an example. Suppose the activities are neighbour synchronised, as illustrated by the black arrows in the second figure. This structure is common in pipelines, or with finite element method calculations. The red arrows indicate precedence constraints added implicitly when we represent this computation in Matlab in the obvious way:

`for i = 1 : t`

parfor j = 1 : N

% compute activity

end

end

Even if everything else is ideal, the additional precedence constraints will cause slowdown. The combination of the `for`

and `parfor`

constructs turns the activity network into a *series-parallel* activity network, adding extra constraints implicitly. Many of the standard constructs to express parallelism implicitly series-parallelise the computation.

In the paper, we show that the slowdown can be arbitrarily large in the worst case, *for any choice of how to series-parallelise the computation*! If we choose the representation before the computation actually happens, and that representation forces the activity network to be of series-parallel form, then all the stochastic effects in the parallel system may interact badly with the particular choices made.

We also show that if one postpones the choice of which series-parallelisation to use until the activity durations are known, then a slowdown of at most should be possible. This remains a conjecture, although one that has been proved for small activity networks.

Finally, actually achieving a series-parallelisation with slowdown seems to be a difficult problem.

What does this mean? If the system is not prone to stochastic effects that may cause some activities to take much longer than we expect, then all we have to do is to ensure all activities are similar in duration. We show that this can then be handled very easily, and the slowdown is bounded above by the ratio of the largest to the smallest activity duration. But if this is not possible to achieve, then we need to express our computation in a better way.

I propose something like

`neisyn d w g f(d,w,g)`

where `d`

and `w`

are iterators, `g`

is an optional parameter describing the maximum out-degree of the network of constraints, and `f`

is a block of code that depends on `g`

and the current value of `d`

and `w`

. As a first attempt to implement `neisyn`

, it could be translated into the `for`

/`parfor`

combination above (or its equivalent), which is most likely how the code would have been written anyway if those were the only constructs available. If a better implementation of `neisyn`

were to replace this simple hack, the code would then not need to change.

A construct such as `neisyn`

is straightforward to use. To implement it efficiently probably requires highly non-trivial work behind the scenes, for instance an efficient dynamic scheduler. Yet there is clear value in trying this approach. Imagine the computation of taking a very long time for some reason. With a particular choice of series-parallelisation, every activity that has been requested to depend on will have to wait for it to finish. Instead, `neisyn`

allows the computation to proceed as a wave front across the activities, achieving better performance overall than a more rigidly structured computation. Testing this experimentally would be interesting.

There’s another situation where you can get superlinear speedup: up to the point where your local memory footprint drops below the size where most of it fits in the caches, adding more CPUs can give you better than linear speedup if the extra communication overheads are more than offset by fewer misses to DRAM.

Comment by Philip Machanick — 1 June 2009 @ 4:18 |