This is a quick high-level tour of my recent paper with Vashti Galpin, Bounds on series-parallel slowdown.
Suppose you are writing some parallel code. You fire up Matlab, and plan to use the Parallel Computing Toolbox to sort out the tedious details of scheduling. But it doesn’t run nearly as fast as you were hoping. What went wrong?
There are many ways for parallel computations to fail to achieve their potential. Ideally, running a computation on processors should take only
times as long as on one processor, a speedup of
times. At the very least, one would hope for
speedup. For problems involving searching, starting the search in many different places can even make the speedup superlinear; so randomized restarts after a timeout can be a good strategy for such problems. But usually various delays are introduced in the parallel implementation, compared to the idealised computation that one would wish to see, so there is a slowdown.
The first attempt to explain slowdown was Amdahl’s Law: if a fraction of the computation is inherently not parallel, then as the rest of the computation disappears to zero with the addition of more and more processors, this non-parallel fraction will dominate. So if 1% is inherently not parallel, then even if the rest takes no time at all, the computation can never achieve speedup of more than 100. The figure illustrates how the speedup S is limited by both the number of processors m as well as the fraction f of the program that is inherently not parallel.

In practice, it is not clear how to find the magic fraction — what exactly does it mean for part of the code to be inherently not parallelisable?
Yet it turns out that adding just a little more detail to our model of parallel programs allows some progress. Let slowdown be the ratio of the time to run the program to the time one would hope it would take.
To keep things really simple, consider a particular instance of a computation, corresponding to the instructions and state of a parallel program while it runs. Group sequences of instructions into activities in some way, each having some duration. Then some parts of the computation will depend on the results produced by other parts, creating precedence constraints between activities: some activities will need to complete before other activities can start.
Now even if there are no delays at all due to communication, cache misses, resource contention, or any of the other fun things that make parallel systems so interesting, there will still be some inherent slowdown.
On the one hand, the activities could be represented directly, and we could tell the system about all the precedence constraints between them, letting it work things out.
But this is not what actually happens. Creating a synchronisation object (as in OpenCL) for every precedence constraint is tedious and adds a lot of overhead. Usually one instead uses a toolkit of parallel operations provided by the system. These involve creating and managing threads, sending and waiting for messages, or most commonly, operations like Matlab’s parfor which flags activities that all may run in parallel.
Our representation of the parallel computation tells the system about our choice of activities and the precedence constraints between them. Each particular representation has syntactic limitations, which may implicitly introduce additional precedence constraints between activities.
We consider the question:
How much slowdown can we attribute to being restricted to only series-parallel language constructs?

Let’s look at an example. Suppose the activities are neighbour synchronised, as illustrated by the black arrows in the second figure. This structure is common in pipelines, or with finite element method calculations. The red arrows indicate precedence constraints added implicitly when we represent this computation in Matlab in the obvious way:
for i = 1 : t
parfor j = 1 : N
% compute activity
end
end
Even if everything else is ideal, the additional precedence constraints will cause slowdown. The combination of the for and parfor constructs turns the activity network into a series-parallel activity network, adding extra constraints implicitly. Many of the standard constructs to express parallelism implicitly series-parallelise the computation.
In the paper, we show that the slowdown can be arbitrarily large in the worst case, for any choice of how to series-parallelise the computation! If we choose the representation before the computation actually happens, and that representation forces the activity network to be of series-parallel form, then all the stochastic effects in the parallel system may interact badly with the particular choices made.
We also show that if one postpones the choice of which series-parallelisation to use until the activity durations are known, then a slowdown of at most should be possible. This remains a conjecture, although one that has been proved for small activity networks.
Finally, actually achieving a series-parallelisation with slowdown seems to be a difficult problem.
What does this mean? If the system is not prone to stochastic effects that may cause some activities to take much longer than we expect, then all we have to do is to ensure all activities are similar in duration. We show that this can then be handled very easily, and the slowdown is bounded above by the ratio of the largest to the smallest activity duration. But if this is not possible to achieve, then we need to express our computation in a better way.
I propose something like
neisyn d w g f(d,w,g) where d and w are iterators, g is an optional parameter describing the maximum out-degree of the network of constraints, and f is a block of code that depends on g and the current value of d and w. As a first attempt to implement neisyn, it could be translated into the for/parfor combination above (or its equivalent), which is most likely how the code would have been written anyway if those were the only constructs available. If a better implementation of neisyn were to replace this simple hack, the code would then not need to change.
A construct such as neisyn is straightforward to use. To implement it efficiently probably requires highly non-trivial work behind the scenes, for instance an efficient dynamic scheduler. Yet there is clear value in trying this approach. Imagine the computation of taking a very long time for some reason. With a particular choice of series-parallelisation, every activity that has been requested to depend on
will have to wait for it to finish. Instead,
neisyn allows the computation to proceed as a wave front across the activities, achieving better performance overall than a more rigidly structured computation. Testing this experimentally would be interesting.
Three results
Tags: papers
Three fascinating results that appear to have no connection other than being rather surprising.
First, the chromatic number of the plane. The number of colours required for colouring the real plane, where no two points unit distance apart may be the same colour, is known to be either 4, 5, 6, or 7. Shelah and Soifer noted in 2003 that the actual value relates to the proposed axiom LM of set theory that requires that every set of reals is Lebesgue measurable. Specifically, the chromatic number of the plane is 4 with the usual ZF axioms of set theory, augmented with the Axiom of Choice. This was shown by Erdős and de Bruijn in 1951. However, the chromatic number of the plane is at least 5 in ZF together with LM and a weaker version of AOC, the Principle of Dependent Choices that Bernays proposed in 1942. So the Axiom of Choice does impact the real world, after all, at least for some definition of “real world”!
The second result is a problem that only requires a poly-logarithmic number of nondeterministic steps for a polynomial-time solution, and which has a deterministic runtime upper bound that is almost, but not quite, polynomial. A hitting set X of a hypergraph H (also known as a transversal of H) intersects each edge of H. A minimal transversal contains no strict subset that is a transversal. The hypergraph of all minimal transversals of a hypergraph H is its transversal hypergraph. TRANS-HYP is the problem of deciding, given two hypergraphs H and H’, whether H’ is the transversal hypergraph of H. TRANS-HYP is in co-NP but not obviously in NP. Fredman and Khachiyan showed that it is possible to recognize the transversal hypergraph of a given H in time
, where
is the size of the input. (This result was actually expressed in terms of recognition of duals of monotone Boolean functions.) The surprising result is that it is possible to show that a given hypergraph is not a transversal hypergraph of another in polynomial time using only
nondeterministic steps. This was shown by Eiter, Gottlob, Makino and independently by Kavvadias and Stavropoulos, in 2002-3. If one is allowed a polynomial number of nondeterministic steps then of course one can capture NP, while a logarithmic number of nondeterministic steps captures P. This important problem therefore occupies a surprising place in the bounded nondeterminism hierarchy (which was last surveyed by Goldsmith, Levy, Mundhenk in 1997). A succinct overview of the state of the art is the paper by Eiter, Makino, Gottlob published in 2008.
Finally, an upcoming paper. Michael Mitzenmacher has briefly discussed the STOC 2009 programme; following on, I would recommend the recent paper by Moser and Tardos; it presents a beautiful and short algorithmic proof of the Lovász Local Lemma in a very general form, and appears to supersede the (excellent) paper of Moser at STOC. But back to the paper I actually wanted to highlight: in among the STOC 2009 abstracts there was the intriguing comment in the abstract of a paper by Kawarabayashi and Reed discussing their algorithm deciding the Hadwiger Conjecture. For the special case
this algorithm apparently implies the Four-Colour Theorem. Since the abstract seems to indicate that the detailed case analysis of the previous constructive proofs is largely avoided, I await the full paper with great anticipation. (Edit: the full paper has corrected the sentence. The result seems to be an important step forward in proving the Hadwiger Conjecture, but it doesn’t supply a short proof for the 4CT, alas.)