Working with my spiffy new abstract math library in development, I got really deep in the weeds when it came down to implementing fully generalized parallel execution. So deep that I realized I pretty much need to write a full-on blog article of its own to explain everything that is going on in the code; mere documentation comments won’t do the trick.
That being said, I’ll indeed have to actually put together something more like a “math textbook” in conjunction with the code I am developing. But, for now, at least we have a better explanation of what is going on.
All vector operations can be defined by the following attributes and their available configurations.
- Input: scalar/vector
- Output: scalar/vector
- Working memory: scalar/vector
- Compute model: sequential/parallel
- Loop programming model: To be discussed later
The three higher-order vector functions are defined solely by how their inputs and outputs relate in dimension.
- Unfold (like a generator function): scalar input, vector output
- Map: vector input, vector output
- Reduce: vector input, scalar output
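To make those dimension relations concrete, here is a rough sketch of the three shapes as C prototypes (the names and the `Scalar` type here are placeholders for illustration, not part of any final interface):
void unfold(Scalar *out_vec, Scalar seed);        /* scalar in, vector out */
void map(Scalar *out_vec, const Scalar *in_vec);  /* vector in, vector out */
Scalar reduce(const Scalar *in_vec);              /* vector in, scalar out */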
Now, the trick is to take these simple operations, defined abstractly, and map them down into efficient computational primitives. Because they are defined abstractly, it is then easy to write concrete functions based on these abstract primitives that inherit the efficient computational primitives they embody.
So, that being said, let’s dive down into the details about the different methods by which these higher-order functions can be implemented. There are two tough realities that need to be worked with.
- When it comes down to writing actual code to implement these primitives, there is often one easiest way to write the code, but that one preferred implementation style may be efficient for neither sequential nor parallel computation.
- Although it would be ideal to execute the primitives in parallel as much as possible, there will always be times when real-world conditions of the particular compute environment make sequential computation the most efficient choice. Therefore, we must also ensure that efficient sequential computational primitives are available.
So now, let’s briefly describe how the higher-order vector functions can be implemented either sequentially or in parallel.
- Sequential Unfold: Given the starting scalar value, loop to repeatedly apply the expansion operation the given number of times to generate the output sequence.
- Parallel Unfold: Given a previous vector (a scalar for the starting value), apply the expansion operation in parallel across each previous scalar to get a set of next scalars. The result is that the sequence is doubled in count. A logarithmic number of sequential steps is required to expand the sequence the given number of times.
- Sequential Map: Loop through each scalar and compute the output one at a time from the corresponding inputs.
- Parallel Map: Compute all output scalars in parallel from the corresponding inputs.
- Sequential Reduce: Combine together the first two scalars, then combine each next scalar with the accumulated result, one at a time.
- Parallel Reduce: Combine together each pair of adjacent scalars in parallel. This halves the scalar count. Keep doing this until you are down to one scalar. A logarithmic number of sequential steps is required to reduce down to a single scalar.
Okay, that’s nice and easy to abstractly phrase out and understand in concept, but how do you implement this as actual code? We need coding constructs to describe sequential and parallel computation.
Sequential looping: Loops, just like in the C programming language: while, do-while, for.
Parallel execution: Threads. Some implementations may encourage writing a “loop body” that will be used in each parallel thread, but essentially it is the same thing.
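As a rough sketch of the loop-body-as-thread idea (hypothetical, using POSIX threads and one thread per element purely for illustration; a real implementation would batch work across a thread pool or OpenCL work-items):
#include <stddef.h>
#include <pthread.h>

#define VEC_LEN 8
static double a[VEC_LEN], b[VEC_LEN];

/* The "loop body," written once and run by each thread with its own
   index. */
static void *loop_body(void *arg)
{
    unsigned i = (unsigned)(size_t)arg;
    a[i] = b[i] * 2.0;
    return 0;
}

static void parallel_map_example(void)
{
    pthread_t th[VEC_LEN];
    unsigned i;
    for (i = 0; i < VEC_LEN; i++)
        pthread_create(&th[i], 0, loop_body, (void *)(size_t)i);
    for (i = 0; i < VEC_LEN; i++)
        pthread_join(th[i], 0);
}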
At the machine level, this is, of course, lower level and more difficult for programmers to understand.
Sequential looping:
- Conditional jump, i.e. “branch” instruction.
Parallel execution:
- SIMD (Single Instruction Multiple Data): a single instruction that operates on multiple data items at the same time.
- SIMT (Single Instruction Multiple Thread): multiple threads can execute concurrently only when they are sharing the same instruction; otherwise they must execute sequentially. Each thread has its own independent scalar registers and can load and store to its own memory addresses, without needing data to conform to a specific vector data format, unlike SIMD. On the other hand, this can be trivially shown to be almost identical to a SIMD architecture that implements gather-scatter load-store instructions. Even with just strided load-store instructions, almost all parallel execution procedures can be implemented.
- MIMD (Multiple Instruction Multiple Data): this is basically another way of describing a superscalar processor (processor = microprocessor, the CPU (Central Processing Unit) when it is the main one), or even just a multi-core processor. Instructions are split up into separate instruction streams that execute on independent scalar processors. For a superscalar processor, a compiler may optimize by performing loop unrolling (if necessary). For a multi-core processor, thread-level parallelism is used.
Fortunately, it is trivial for compilers to translate from the high-level language forms to the low-level machine forms. For example, loops can be rewritten so they encompass a branch instruction at the beginning and the end.
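For example, here is a hand-written sketch of what that lowering looks like for a counted loop, with the branches made explicit via goto (illustrative only; an actual compiler works on its intermediate representation):
/* Original: for (i = 0; i < n; i++) { ...body... } */
void lowered_loop(unsigned n)
{
    unsigned i = 0;
loop_check:
    if (!(i < n))        /* conditional branch at the top of the loop */
        goto loop_done;
    /* ... loop body using i ... */
    i++;
    goto loop_check;     /* unconditional branch back at the bottom */
loop_done:
    ;
}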
Another example: to take advantage of a SIMD processor, OpenCL uses the model of thread-level parallelism to describe the logical computation on one scalar. When the threads execute their instruction streams in lock-step, that trivially maps down to a stream of SIMD instructions. When there is branching, the code can be transformed so that nullification operations are computed where the branch values are false, essentially creating a stream of no-ops where threads would otherwise diverge. And, when memory addresses diverge, strided load-store instructions and gather-scatter load-store instructions can be used to keep the vector data formats in lock-step by the time they reach the compute instructions.
By the way, if you are concerned that your compiler is not generating nullification operations but instead generating divergent threads executed sequentially, how do you express that nullification operation in software to guarantee the threads stay in lock-step? Easy, basically you can express a conditional move in software by means of generating AND and OR masks. Or, alternatively, by multiplying and adding.
/* Conditional move: if (cond) dest = src; */
/* To generate masks, we use negation as a sign extension trick. This
means that the value 1 must be strictly used for `true` in
`cond`. */
#define CMOV(cond, int_type, dest, src) \
dest &= -(int_type)(!(cond)); \
dest |= (-(int_type)(cond)) & (src);
/* Alternative implementation that is compatible with floating point
arithmetic. */
#define CMOV(cond, float_type, dest, src) \
dest *= !(cond); \
dest += (float_type)(cond) * (src);
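As a quick usage sketch of the integer-mask CMOV above: clamping negative elements to zero without a branch (relational operators in C already yield strictly 0 or 1, satisfying the `cond` requirement):
/* Clamp each element of v to be non-negative, branch-free, using the
   integer AND/OR mask variant of CMOV defined above. */
void clamp_nonneg(int *v, unsigned len)
{
    unsigned i;
    for (i = 0; i < len; i++) {
        CMOV((v[i] < 0), int, v[i], 0);
    }
}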
And actually… we can go both ways. If we want to generate conditional code rather than execute a conditional move, we can define the following macro.
/* Helper macro to replace a conditional move with just a C-style
conditional assign. */
#define CMOV(cond, int_type, dest, src) \
if (cond) (dest) = (src);
However, the main caveat with this software formulation is that address bus reads and writes are still generated even when the condition evaluates to be false. In the event that this reaches into addresses prohibited by the MMU (Memory Management Unit) or addresses mapped to devices, this could cause your program (or even the system!) to crash.
Finally, another variant, conditional move with an arbitrary code block that can be completely skipped.
/* Actual conditional move. */
#define CMOV_CODE_BLK(cond, int_type, dest, src, code_block) \
code_block; \
CMOV(cond, int_type, dest, src);
/* Replacement that turns into C-style conditional branch. */
#define CMOV_CODE_BLK(cond, int_type, dest, src, code_block) \
if (cond) { \
code_block; \
(dest) = (src); \
}
Higher-order vector function implementations
So… let’s look at different ways of implementing these primitives.
First of all, the easiest: a map, sequential compute model.
void ABSTRACT_MAP(Scalar *a, Scalar *b, ...) {
unsigned i;
for (i = 0; i < VEC_LEN; i++) {
/* PARALLEL LOOP */
MAP_OPERS(a[i], i);
}
}
MAP_OPERS is a macro expansion that uses two arguments: where to write the output and what index to use. It must also conform to the restrictions of a map operation, namely by using the same index value in all vector operands and not having any dependencies across iterations. It can use any other arbitrary inputs specified as additional function arguments.
Here is a concrete example of map: divide each number by d.
void div_each(Scalar *a, Scalar *b, Scalar d) {
unsigned i;
for (i = 0; i < VEC_LEN; i++) {
/* PARALLEL LOOP */
a[i] = b[i] / d;
}
}
How about the parallel compute model? Fortunately, we can use the nifty loop-body trick to express thread-level parallelism for map. So, it is trivially just spawning a thread that calls a function containing the MAP_OPERS commands and passing it the output location to write and the current index.
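A minimal sketch of that per-element thread body (the `MapJob` struct and `map_thread_body` names are made up here for illustration; the actual spawning would go through whatever thread or OpenCL work-item mechanism is in use):
/* Hypothetical per-element thread body for the parallel map: each
   thread receives the output slot to write and the index to use, and
   runs MAP_OPERS exactly once. */
typedef struct {
    Scalar *out;   /* where to write this component's output */
    unsigned idx;  /* which index to use */
} MapJob;

void *map_thread_body(void *arg)
{
    MapJob *job = (MapJob *)arg;
    MAP_OPERS(*job->out, job->idx);
    return 0;
}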
Since our code must be able to run in parallel as well as sequentially, the order of iteration in the sequential compute model must not matter. Therefore, we can restructure the loop like this, which reduces the register burden on the CPU executing the loop.
void ABSTRACT_MAP(Scalar *a, Scalar *b, ...) {
unsigned i = VEC_LEN;
while (i > 0) {
i--;
MAP_OPERS(a[i], i);
}
}
reduce is fairly straightforward to implement in a sequential compute model… almost. There are two main variants, the significance of which we’ll get into later.
/* Variant #1 */
/* If vector length (dimension) is zero, the result is undefined. */
Scalar ABSTRACT_REDUCE(Scalar *v) {
Scalar accum = v[0]; /* Initialize to vector's first scalar. */
unsigned i;
for (i = 1; i < VEC_LEN; i++) {
accum = COMBINE(accum, v[i]);
}
return accum;
}
/* Variant #2 */
Scalar ABSTRACT_REDUCE(Scalar *v) {
/* Initialize to "identity value." */
Scalar accum = IDENT_VAL;
unsigned i;
for (i = 0; i < VEC_LEN; i++) {
accum = COMBINE(accum, v[i]);
}
return accum;
}
The “identity value” is the value that when combined with any other value, results in the same other value unchanged. A few examples:
0 + n = n
1 * n = n
min(INT_MAX, n) = n
max(INT_MIN, n) = n
Here are two concrete examples of reduce: sum and product.
Scalar sum(Scalar *v) {
/* Initialize to "identity value." */
Scalar accum = 0;
unsigned i;
for (i = 0; i < VEC_LEN; i++) {
accum = accum + v[i];
}
return accum;
}
Scalar product(Scalar *v) {
/* Initialize to "identity value." */
Scalar accum = 1;
unsigned i;
for (i = 0; i < VEC_LEN; i++) {
accum = accum * v[i];
}
return accum;
}
Now, let’s show a sequential compute model unfold. Our implementation applies one restriction that helps make defining the parallel compute model easier: the BASE value is always used as the second operand to the binary COMBINE operator.
/* Unfold to the vector width in-place. */
void ABSTRACT_UNFOLD(Scalar *v) {
Scalar accum = BASE;
unsigned i;
for (i = 0; i < VEC_LEN; i++) {
v[i] = accum;
accum = COMBINE(accum, BASE);
}
}
Now for two concrete examples: a list of consecutive integers counting by n, and a list of consecutive powers of a base n.
void count_by_n(Scalar *v, Scalar n) {
Scalar accum = n;
unsigned i;
for (i = 0; i < VEC_LEN; i++) {
v[i] = accum;
accum = accum + n;
}
}
void powers_of_n(Scalar *v, Scalar n) {
Scalar accum = n;
unsigned i;
for (i = 0; i < VEC_LEN; i++) {
v[i] = accum;
accum = accum * n;
}
}
Now, for the parallel compute models. reduce and unfold are trickier because they require both sequential and parallel loops. The sequential loop iterations must be delineated by “barrier” synchronization primitives, so that if the threads are being executed sequentially on a scalar processor, they are all guaranteed to be in lock-step before the next loop iteration. On a vector processor, this turns into a no-op when all threads fit within the width of a single hardware SIMD/SIMT vector unit.
First, let’s start with parallel reduce. Here, the goal is to combine adjacent scalars together in parallel as much as possible. A logarithmic number of sequential steps is required to reduce down to a single scalar. A simple diagram for the concrete case of summing a list of integers illustrates this succinctly.
1 2 3 4 5 6 7 8
1+2 3+4 5+6 7+8
1+2+3+4 5+6+7+8
1+2+3+4+5+6+7+8
Now, for the code to implement this in general.
/* Reduce a vector in-place, the result of the reduction is stored in
the first element of the vector.
PLEASE NOTE: We rewrite the vector in place as our working memory,
so pass a copy! */
void ABSTRACT_REDUCE(Scalar *v) {
unsigned stride = 2;
unsigned stride_log2 = 1;
unsigned sd_half = 1; /* Half stride */
unsigned sub_len;
/* sub_len computation is a reformulation of the following:
while (stride * i + sd_half < VEC_LEN) { ... }
-> reorganize with correct rounding ->
while ((sub_len = ((VEC_LEN - sd_half) + (stride - 1)) /
stride) > 0)
{ ... }
->
while ((sub_len = (VEC_LEN + sd_half - 1) / stride) > 0) { ... }
N.B. Since we divide strictly by powers of two, this can be
simplified to bit-shifting. Very important! */
while ((sub_len = (VEC_LEN + sd_half - 1) >> stride_log2) > 0) {
unsigned i;
for (i = 0; i < VEC_LEN; i++) {
/* PARALLEL LOOP */
CMOV((i < sub_len), Scalar, \
v[stride*i], COMBINE(v[stride*i], v[stride*i+sd_half]));
}
/* OpenCL barrier ensures that all threads are synced in execution
and local memory consistency state by this point, for the sake
of the sequential outer loop. */
barrier(CLK_LOCAL_MEM_FENCE);
stride *= 2;
stride_log2++;
sd_half *= 2;
}
}
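To make the general code above concrete, here is a sketch of the same in-place pairwise scheme instantiated for summation, with the parallel inner loop simulated sequentially and a plain bound check in place of the CMOV (VEC_LEN is assumed to be defined, and need not be a power of two):
/* Pairwise in-place sum; the result ends up in v[0]. The inner loop
   is the "parallel" loop, iterated sequentially here. */
Scalar sum_pairwise(Scalar *v)
{
    unsigned sd_half = 1;       /* half stride */
    unsigned stride_log2 = 1;   /* log2 of the stride (stride = 2 * sd_half) */
    unsigned sub_len;
    while ((sub_len = (VEC_LEN + sd_half - 1) >> stride_log2) > 0) {
        unsigned i;
        for (i = 0; i < sub_len; i++) {   /* PARALLEL LOOP */
            unsigned stride = sd_half << 1;
            v[stride*i] = v[stride*i] + v[stride*i + sd_half];
        }
        sd_half <<= 1;
        stride_log2++;
    }
    return v[0];
}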
A few considerations are worth mentioning in relation to running reduce on a GPU via OpenCL. First of all, the GPU memory architecture allows a fixed-size group of threads to have access to a chunk of common memory, called local memory in OpenCL and “shared memory” in CUDA terminology. The fixed-size group of threads generally corresponds to the width of the hardware SIMD/SIMT vector units. Threads outside of a thread group do not have access to that group’s local memory. To implement an arbitrary reduce on a GPU, you need to first do a reduce on each individual block of data, in local memory, then you need to coalesce the results into a single contiguous block of OpenCL global memory for a second iteration. It must be global memory because multiple different work-groups will be accessing it to coalesce the results. Finally, in OpenCL, synchronization is required between these outer iterations by completing one work-group and scheduling another work-group that waits on the completion event of the previous work-group. This (ideally) causes global memory to be consistent across different work-groups, though admittedly we’re fudging the limits of OpenCL a little with respect to global memory consistency. Please note that the OpenCL 1.0 specification has a reduce example that does exactly this, so it’s probably kosher.
Also, as this memory coalescing operation has a rather high cost compared to typical GPU memory access patterns, it might not even make sense to execute on the GPU. In that case, a CPU subroutine can perform the coalescing instead. In the end, what it really comes down to is the system architecture. For a desktop GPU that is connected to the CPU via a PCI-E bus interface, GPU coalescing is probably more efficient, even with the high overhead to do it on the GPU. But, for embedded systems where the GPU has DMA to the CPU main memory on the same chip, CPU coalescing may be more efficient. Also, another obvious advantage of CPU coalescing is that we don’t need to fudge on OpenCL’s global memory consistency model.
Now, let’s introduce parallel unfold. Thanks to our particular restriction that the BASE value must be used as the second operand to the binary COMBINE operator, this can trivially be almost the reverse of a parallel reduce.
/* Unfold to the vector width in-place.
VEC_LEN must be a power of two. */
void ABSTRACT_UNFOLD(Scalar *v)
{
unsigned sub_len;
v[0] = BASE; /* BASE VALUE */
sub_len = 1;
while (sub_len < VEC_LEN) {
/* BEGIN MAP */
unsigned i;
Scalar accum = v[sub_len-1];
for (i = 0; i < sub_len; i++) {
/* PARALLEL LOOP */
v[i+sub_len] = COMBINE(v[i], accum);
}
/* END MAP */
/* OpenCL barrier ensures that all threads are synced in execution
and local memory consistency state by this point, for the sake
of the sequential outer loop. */
barrier(CLK_LOCAL_MEM_FENCE);
sub_len += sub_len;
}
}
The idea here, similar to parallel reduce, is that we only unfold to the width of the hardware SIMD/SIMT vector units. But, once we do that, there’s a different, more efficient trick up our sleeves that we can play here, unlike the case of reduce. Again, this hinges on some restrictions we imposed on unfold, otherwise it would be more similar to a reverse parallel reduce. Each block can then be unfolded totally independently without the need to scatter the results of the previous unfold.
But wait, before we get into that trick, there’s yet another clever trick we can play here, in relation to OpenCL GPU computing. Do we really have to do an unfold in OpenCL to get a list of consecutive integers? Absolutely not. Since we get thread IDs for indices, we can just use the thread ID directly as our seed value. Then in one fell swoop we generate our list of consecutive integers. Like the other unfold, this routine is confined to a single block; once a single block is initialized, you can use the multi-block routine.
Technically, that means our inner loop is not a “map,” strictly speaking, but we seriously need to take advantage of this low-level optimization!
void gpu_unfold_n(int *v)
{
unsigned i;
for (i = 0; i < VEC_LEN; i++) {
/* PARALLEL LOOP */
v[i] = i + 1;
}
}
Now, back to the block-wise expansion trick for unfold.
void ABSTRACT_UNFOLD_NEXT(Scalar *v, unsigned add_idx, Scalar accum)
{
/* BEGIN MAP */
unsigned i;
for (i = 0; i < VEC_LEN; i++) {
/* PARALLEL LOOP */
v[i+add_idx] = COMBINE(v[i], accum);
}
/* END MAP */
}
Here, you can see that we simply take an existing block and combine it with an accumulator value to get the new block. The final trick we need to play is that we need to be able to compute the accumulator values in parallel, in advance, to have a fully parallel block-wise generator; a concrete sketch follows below. When this is not possible, it is still possible to get efficiency gains by using a modified variant of the unfold for the first block: a little bit more code can be used to generalize it to a block-wise power-of-two unfold.
Beyond single higher-order vector functions
Having higher-order vector functions is very useful in getting started writing efficient parallel code. However, many useful subroutines are not just a single higher-order vector function, but a combination of two or more. The most common combination is map-reduce. This is trivial to express in optimal form on a vector processor, but a scalar processor? This is where things get complicated.
A map can be expressed as an ordinary loop over all the elements. But a reduce? We need that special first-element handling, and that breaks up our otherwise nice loop format by requiring either a conditional to special-case the first iteration or peeling the first iteration out of the loop. Fortunately, if there is an “identity value” defined for the combinator, that is, combining any scalar with the identity value results in the same scalar value itself, then we can eliminate the special-case first iteration by initializing the accumulator to the identity value.
As for unfolds, that can be replaced by a generator function that simply yields the next value. So, unfold-map-reduce can be implemented efficiently.
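A minimal sketch of such a generator in C, matching our restricted unfold definition (the `Unfolder` struct and `unfold_next` names are made up here; BASE and COMBINE are the abstract placeholders used throughout):
/* Hypothetical generator object for the restricted unfold: it holds
   the accumulator and yields the next component on each call. */
typedef struct {
    Scalar accum;
} Unfolder;

Scalar unfold_next(Unfolder *u)
{
    Scalar out = u->accum;
    u->accum = COMBINE(u->accum, BASE);
    return out;
}

/* Usage inside a fused loop:
     Unfolder u = { BASE };
     for (i = 0; i < VEC_LEN; i++) { Scalar x = unfold_next(&u); ... }
*/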
Now this allows us to write the map-reduce as two loops with identical loop control. Why not merge them into one? That is the most efficient way to implement this on a scalar processor. The problem then becomes: how do we take advantage of this optimization and still be able to write fully general code?
The answer is that we need to make a few different definitions and classifications of looping constructs. That was the final attribute of “loop programming model” that we’ve hinted at right in the beginning of our definitions.
Let’s take a look at two different ways of implementing a map-reduce, sequential compute model. The first method:
Scalar ABSTRACT_MAP_REDUCE(Scalar *a, Scalar *b, ...) {
Scalar work[VEC_LEN];
/* Initialize to "identity value." */
Scalar accum = IDENT_VAL;
unsigned i;
/* BEGIN MAP */
for (i = 0; i < VEC_LEN; i++) {
/* PARALLEL LOOP */
MAP_OPERS(work[i], i);
/* Result: `work[i]` is written. */
}
/* END MAP */
/* BEGIN REDUCE */
for (i = 0; i < VEC_LEN; i++) {
accum = COMBINE(accum, work[i]);
}
return accum;
/* END REDUCE */
}
The second method:
Scalar ABSTRACT_MAP_REDUCE(Scalar *a, Scalar *b, ...) {
Scalar work;
/* Initialize to "identity value." */
Scalar accum = IDENT_VAL;
unsigned i;
for (i = 0; i < VEC_LEN; i++) {
/* BEGIN MAP */
MAP_OPERS(work, i);
/* Result: `work` is written. */
/* END MAP */
/* BEGIN REDUCE */
accum = COMBINE(accum, work);
/* END REDUCE */
}
return accum;
}
In the first implementation, we basically just pasted a map and a reduce together to create the map-reduce higher-order vector function. But in the second, we built off of the general structure of the first implementation to make an optimization in both sequential compute time and memory consumption: we combined the two identical loop control structures into one, and we used a scalar for work rather than a vector. Another optimization that we could make, but do not strictly need to, is to move work to the inside of the loop body since it does not need to persist between loop iterations.
One important property of map and reduce (under our restrictions) is that their iterations are “commutative.” That means we can also restructure the loop to count down to zero, reducing the register burden of the loop control structure.
Scalar ABSTRACT_MAP_REDUCE(Scalar *a, Scalar *b, ...) {
/* Initialize to "identity value." */
Scalar accum = IDENT_VAL;
unsigned i = VEC_LEN;
while (i > 0) {
Scalar work;
i--;
/* BEGIN MAP */
MAP_OPERS(work, i);
/* Result: `work` is written. */
/* END MAP */
/* BEGIN REDUCE */
accum = COMBINE(accum, work);
/* END REDUCE */
}
return accum;
}
What about combining unfold with other higher-order vector functions? At the outset, this is pretty much the same as reduce. Unfortunately, our principal definition of unfold is not associative: you cannot rearrange the parenthetical groupings in any way other than left associativity. In essence, a sequential compute model unfold is, in fact, a generator function. So, unless we do slightly esoteric programming such as using negative indices from the end of a vector, we must use the for-loop style of loop control. For illustrative purposes, let’s show an unfold-map-reduce.
Scalar ABSTRACT_UNFOLD_MAP_REDUCE(...) {
Scalar unfold_accum = BASE;
/* Initialize to "identity value." */
Scalar accum = IDENT_VAL;
unsigned i;
for (i = 0; i < VEC_LEN; i++) {
Scalar work;
/* BEGIN UNFOLD */
work = unfold_accum;
unfold_accum = COMBINE(unfold_accum, BASE);
/* END UNFOLD */
/* BEGIN MAP */
MAP_OPERS(work, i);
/* Result: `work` is written. */
/* END MAP */
/* BEGIN REDUCE */
accum = COMBINE(accum, work);
/* END REDUCE */
}
return accum;
}
One thing that becomes apparent is that there are many times when you may want to write the inner body of a loop, but not necessarily pack it into a loop right away. Keep this in mind: we are going to define a formal method for defining subroutines that are just the body of a loop, which can then be combined with other subroutines and packed into a loop by a higher-level calling subroutine.
That pretty much completes the survey of different ways to implement the higher-order vector functions. Again, let’s review the key considerations that we may want to optimize.
- Easiest, most intuitive code to program.
- Best performance in parallel compute model.
- Best performance in sequential compute model.
The resulting implementation decisions can be split into two main categories.
- The calling conventions of a subroutine.
- How a subroutine computes its results.
Subroutine calling conventions:
- Vector (vec): compute all vector components by the end of the subroutine. The subroutine itself may use a parallel compute model.
- Scalar (s): stateless computation of a single vector component, i.e. the order in which you request the components is not important. The subroutine can readily be called in parallel to compute all components in parallel.
- Ordered scalar (seq): stateful computation of a vector component; you must request computation of vector components strictly in ascending index order, without skipping any index. Essentially, this is a generator function or iterator. The subroutine cannot be used in parallel.
- Synchronized scalar (sync): another form of stateful computation of a vector component. This indicates a barrier() synchronization primitive is used somewhere in the computation of a single vector component. Therefore, execution must be treated as synchronized threads, which basically amounts to a vec calling convention from a data flow standpoint, since you cannot separate one scalar result from the rest. To get the vec calling convention results, pass this subroutine to a thread job executor subroutine. This form is also useful for building higher-level thread subroutines. Because of the required use of threads, the subroutine can readily be called in parallel to compute all components in parallel.
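To illustrate how the calling conventions differ at the signature level, here is a rough sketch with hypothetical names (the exact parameters are not part of the library; they just show the shape of each convention):
/* vec: computes the whole vector before returning. */
void scale_vec(Scalar *out, Scalar *in, Scalar a);

/* s: stateless, computes one arbitrary component; safe to call in
   parallel for all indices. */
void scale_s(Scalar *out_component, Scalar *in, Scalar a, unsigned i);

/* seq: stateful generator/iterator; components must be requested in
   ascending index order with no gaps. */
void counter_seq(struct counter_state *state, Scalar *out_component);

/* sync: one component per thread, but a barrier() is used internally,
   so one result cannot be separated from the rest; effectively vec
   from a data flow standpoint. */
void reduce_block_sync(void *thread_ctx, Scalar *work, unsigned real_len);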
Methods for a subroutine to compute its results:
- Sequential compute model (seq)
- Parallel compute model (par). Use threads.
So, how does ease of programming fit in here? There is one combination of calling convention and compute model that results in the greatest ease of programming: vec and seq. Any vector method you use operates on a whole vector before returning. And the individual map, reduce, and unfold higher-order vector functions are implemented in a sequential compute model. This combination choice means that when you combine higher-order vector functions together, it really is the same as mere aggregation. No special optimizations are taken advantage of anywhere. Of course, that is the problem from a compute efficiency standpoint.
Let’s classify the sequential and parallel higher-order vector function variants by their calling conventions. In particular, we analyze the loop bodies, the interior of the loop. (After all, they’re all vec on the outside.)
- map: s
- reduce: N/A, intermediate computation via s or sync
- unfold: s, seq, sync
These calling conventions of the loop body determine how many different ways we can compose higher-level subroutines that call one of these subroutines. In particular, for reduce there is only one possible calling convention, vec, since the inner loop only computes an intermediate. By extension, this means that there is also only one calling convention for any map-reduce operation, vec. It is only in the case of map and unfold where it may make sense to support more than one calling convention. In total:
- map: vec, s
- reduce: vec, intermediate computation via s or sync
- unfold: vec, s, seq, sync
Depending on the properties of the combinator, not all calling conventions may be possible for unfold.
The different calling conventions determine the range of ways the higher-order functions can be combined together, and some of those combinations yield the most efficient composite computational operations.
We can declare a single optimization goal for both sequential and parallel compute models:
- Sequential compute model: Compute on only a single component for as long as possible.
- Parallel compute model: Compute on only a single thread for as long as possible.
In the case of map, there is no distinction between threads and components, so that makes map easy. reduce is a terminal operation on a component or thread group. unfold spawns components or threads.
So, we can understand this through the following consideration: subroutines that derive from map, unfold, or both need to provide a per-component calling convention. If reduce is used, it is special: the s or sync calling conventions can partially compute the result, but the final result is only available from the vec calling convention. So, in both cases for reduce, we define a special subroutine (or macro) that only computes part of the result, and it is up to the user to include that as the last fragment of the loop body and close up the loop or thread primitive to complete the computation.
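In code, for the sequential/s case, the idea is roughly the following (names hypothetical): the reduce fragment only folds one component into the caller's accumulator, and the caller owns the loop and reads the final result only after the loop is closed.
/* Hypothetical `reduce` fragment: fold a single component into the
   caller's accumulator. */
#define SUM_REDUCE_FRAGMENT(accum, x) ((accum) = (accum) + (x))

Scalar caller_owned_loop(Scalar *b)
{
    Scalar accum = 0;   /* identity value for + */
    unsigned i;
    for (i = 0; i < VEC_LEN; i++) {
        Scalar work = b[i] * b[i];          /* some map fragment */
        SUM_REDUCE_FRAGMENT(accum, work);   /* reduce fragment, last in the body */
    }
    return accum;       /* final result valid only after the loop closes */
}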
However, we can apply a special-case optimization in the event that a parallel reduce fits within the width of the hardware SIMD/SIMT vector units. In this case, we can simply combine the reduce fragment into a thread subroutine, and subsequent thread execution can reference the computed value.
Okay, now that’s all quite a mouthful to describe. We need code examples to make this clear. First, let’s define the abstract building blocks that we need as macros. However, for ease of comprehension, we will not actually define them as C macros, but as C functions representing the abstract operations. It should be fairly straightforward to see how these could be transformed into macros that generate the respective concrete functions.
/* Here's the general idea. Any time you use a `map` followed by a
`reduce`, you must structure your code so that you generate a
map-reduce subroutine. You then call that map-reduce subroutine
from your higher-level code.
For simplicity, let's start by only working out map-reduce. We'll
work out unfold later. N.B. If unfold after the first block can be
implemented as a fully parallel map, that trivially also applies
here. */
/* Easiest case to implement, already described previously.
Sequential compute model, abstract map-reduce. */
#define CONCRETE_MAP_REDUCE(NAME, MAP_OPERS, IDENT_VAL, COMBINE) \
void NAME(Scalar *a, Scalar *b, ...) { ... }
/* Now, the tricky part to implement is, of course, the parallel
subroutine. We actually need a whole host of generic supporting
subroutines. */
/* Notes on special ALL CAPITALS variables:
* `BLOCK_LEN`: Width of an OpenCL work-group, i.e. width of the
hardware SIMD/SIMT vector units.
* `VEC_LEN`: Total allocated storage in a vector. Accessing
elements at indices greater than or equal to this is undefined.
Typically, this will be padded out to be a multiple of
`BLOCK_LEN` so that OpenCL parallel compute jobs can be scheduled
to operate on the vector.
* `REAL_LEN`: The unpadded length of the actual data contained
within the vector. Elements at indices greater than or equal to
`REAL_LEN` are strictly padding and do not correspond to the
original data. */
/* Thread subroutine: Coalesce `in` data spaced by `stride` to be
packed in `out` data. */
void coalesce_stride_thsubr(THCtx *ctx,
Scalar *out, Scalar *in,
unsigned REAL_LEN, unsigned stride)
{
unsigned i = GET_GLOBAL_TH_IDX(ctx);
/* Depends on mode whether we make this check, if it is possible
that we are not aligned to the block size. */
CMOV((i < REAL_LEN), Scalar, out[i], in[stride*i]);
}
/* Thread subroutine: Run one thread of the parallel `reduce`
subroutine to reduce in-place. Only each individual block is
reduced. You must then use a thread job executor to reduce these
results down to a single scalar.
The result of the reduction depends on mode, either it is stored in
the first element of each block of the vector OR it is stored
contiguously at the first elements of the vector.
PLEASE NOTE: We rewrite the vector in place as our working memory,
so pass a copy!
BLOCK_LEN must be a power of two. */
void ABSTRACT_REDUCE_BLOCK_THSUBR(THCtx *ctx, Scalar *work,
unsigned REAL_LEN) {
unsigned stride = 2;
unsigned stride_log2 = 1;
unsigned sd_half = 1; /* Half stride */
unsigned sub_len;
unsigned i = GET_GLOBAL_TH_IDX(ctx);
unsigned my_blk_len = BLOCK_LEN;
/* Depends on mode whether we do this at all. If reduce operates on
non-power-of-two input. */
CMOV((i >= (REAL_LEN & ~(BLOCK_LEN - 1))), unsigned, \
my_blk_len, REAL_LEN & (BLOCK_LEN - 1));
/* sub_len computation is a reformulation of the following:
while (stride * i + sd_half < my_blk_len) { ... }
-> reorganize with correct rounding ->
while ((sub_len = ((my_blk_len - sd_half) + (stride - 1)) /
stride) > 0)
{ ... }
->
while ((sub_len = (my_blk_len + sd_half - 1) / stride) > 0) { ... }
N.B. Since we divide strictly by powers of two, this can be
simplified to bit-shifting. Very important! */
while ((sub_len = (my_blk_len + sd_half - 1) >> stride_log2) > 0) {
CMOV((i < sub_len), Scalar, \
work[stride*i], COMBINE(work[stride*i], work[stride*i+sd_half]));
/* OpenCL barrier ensures that all threads are synced in execution
and local memory consistency state by this point, for the sake
of the sequential outer loop. */
barrier(CLK_LOCAL_MEM_FENCE);
stride *= 2;
stride_log2++;
sd_half *= 2;
}
/* Depends on mode whether we do this here. Coalesce the results. */
CMOV((GET_LOCAL_TH_IDX(ctx) == 0), Scalar, \
work[GET_GROUP_IDX(ctx)], work[i]);
}
/* Thread subroutine: `map` one element, then run one thread of the
parallel `reduce` subroutine. You must use a thread job executor
to map-reduce an entire vector. The entire vector is mapped, but
only each individual block is reduced. You must then use a thread
job executor to reduce these results down to a single scalar.
The result of each block-wise reduction is stored in the first
scalar of each `BLOCK_LEN` block of `out`. The rest of each
`BLOCK_LEN` in `out` is used as working scratch space. */
void ABSTRACT_MAP_REDUCE_BLOCK_THSUBR(THCtx *ctx, Scalar *out,
unsigned REAL_LEN, ...) {
/* BEGIN MAP */
unsigned i = GET_GLOBAL_TH_IDX(ctx);
/* Code structure here depends on mode. */
Scalar temp;
CMOV_CODE_BLK((i < REAL_LEN), Scalar, out[i], temp,
MAP_OPERS(temp, i));
/* NOTE: If `reduce` only operates on power-of-two input, then `map`
will write out `IDENT_VAL` padding to ensure the `out` vector is
filled respectively. */
CMOV((i >= REAL_LEN), Scalar, out[i], IDENT_VAL);
/* Result: `out` is written. */
/* END MAP */
/* OpenCL barrier ensures that all threads are synced in execution
and local memory consistency state by this point, so that the
following `reduce` is consistent. */
barrier(CLK_LOCAL_MEM_FENCE);
/* return */ ABSTRACT_REDUCE_BLOCK_THSUBR(ctx, out, REAL_LEN);
}
/* Thread subroutine: Generate a vector with a length (dimension)
aligned to a multiple of `BLOCK_LEN` by padding out remainder items
with the given value. */
void pad_align_thsubr(THCtx *ctx,
Scalar *work, unsigned REAL_LEN, Scalar pad_value)
{
unsigned i = GET_GLOBAL_TH_IDX(ctx);
CMOV((i >= REAL_LEN), Scalar, work[i], pad_value);
}
/* Sequential compute model alternative. Take a direct pointer and
length and write the padding directly. */
void pad_align_seq(Scalar *work, unsigned pad_len, Scalar pad_value)
{
while (pad_len > 0) {
pad_len--;
*work++ = pad_value;
}
}
/* CALLING CONVENTION of `sched_th_job()` EXAMPLE:
unsigned roundup_len = (REAL_LEN + BLOCK_LEN - 1) & ~(BLOCK_LEN - 1);
int err_val = sched_th_job(
roundup_len, /* Total thread count (global_work_size) */
BLOCK_LEN, /* Work-group block size (local_work_size) */
0, NULL, /* List of events to wait for completion before starting. */
NULL, /* Pointer to store resulting event object */
coalesce_stride_thsubr, /* Function pointer to subroutine */
... /* Function arguments, not including thread context */
);
*/
/* PLEASE NOTE: We use this OpenCL API convention for convenience.
OpenCL does provide adequate garbage collection features on event
objects. Here's how it works.
1. Save your event wait outputs before creating your new event.
2. Create your new event with the wait list, save the new event.
3. Call `clReleaseEvent()` on all events in the wait list. The
reference count will drop, but the event objects will not be
freed until the event itself has completed and any events
waiting on this event no longer require a reference, i.e. they
have already issued execution.
And, that's actually how OpenCL does the memory management of event
objects, via reference counts. Which works, by the way, because
circular dependencies are strictly verboten. */
/* Thread job executor to finish the `reduce`: run the thread job
executor on the `reduce` concrete thread subroutine multiple times
until there is only one element remaining.
BLOCK_LEN must be a power of two so that BLOCK_LEN_LOG2 can be used
for fast divide by shift right.
The final result is in `work[0]`. */
void ABSTRACT_REDUCE_ALL(unsigned wait_list_len, cl_event *wait_list,
cl_event *out_event,
Scalar *work, unsigned REAL_LEN)
{
cl_event wait_event, old_wait_event;
unsigned sub_len = REAL_LEN >> BLOCK_LEN_LOG2;
while (sub_len > 1) {
unsigned align_len = (sub_len + BLOCK_LEN - 1) & ~(BLOCK_LEN - 1);
/* Depends on mode whether we do this here. */
sched_th_job(align_len, BLOCK_LEN,
wait_list_len, wait_list, &wait_event,
coalesce_stride_thsubr, work, work, sub_len, BLOCK_LEN);
while (wait_list_len > 0) {
wait_list_len--;
clReleaseEvent(*wait_list++);
}
old_wait_event = wait_event;
/* Check if we need to add padding to fill `BLOCK_LEN`. */
if (sub_len != align_len) {
/* unsigned rem = sub_len & (BLOCK_LEN - 1);
unsigned frag = sub_len - rem; */
/* Ideally we'd compute on only the subset
`work[frag...frag+rem]`. But OpenCL 1.0 doesn't have
convenient syntax to support that, so just compute on
everything. PLEASE NOTE: OpenCL 1.1 added support for
sub-buffer objects, clCreateSubBuffer(). */
/* N.B.: This could possibly be faster on the CPU in a Unified
Memory Architecture but slower on non-unified architectures.
So it depends on mode whether we execute on the GPU. */
sched_th_job(align_len, BLOCK_LEN, 1, &old_wait_event, &wait_event,
pad_align_thsubr, work, sub_len, IDENT_VAL);
clReleaseEvent(old_wait_event);
old_wait_event = wait_event;
}
sched_th_job(align_len, BLOCK_LEN, 1, &old_wait_event, &wait_event,
ABSTRACT_REDUCE_BLOCK_THSUBR, work, sub_len);
clReleaseEvent(old_wait_event);
old_wait_event = wait_event;
sub_len >>= BLOCK_LEN_LOG2;
}
*out_event = old_wait_event;
}
/* MAP-REDUCE TOP-LEVEL CONTROL SUBROUTINE. Result is returned in
`out[0]`, the remainder is used as working scratch space. */
void ABSTRACT_MAP_REDUCE(unsigned wait_list_len, cl_event *wait_list,
cl_event *out_event,
Scalar *out, unsigned REAL_LEN, ...)
{
cl_event wait_event, old_wait_event;
/* NOTE: Calculating `align_len` depends on mode; otherwise we just
copy `REAL_LEN` as they are one and the same. */
unsigned align_len = (REAL_LEN + BLOCK_LEN - 1) & ~(BLOCK_LEN - 1);
/* Check if we need to add padding to fill `BLOCK_LEN`. Depends on
mode. */
if (REAL_LEN != align_len) {
/* NOTE: We will need to run pad jobs for each array input
argument. Here, for example, we assume there is at least one
named `b`. */
sched_th_job(align_len, BLOCK_LEN,
wait_list_len, wait_list, &wait_event,
pad_align_thsubr, b, REAL_LEN, IDENT_VAL);
while (wait_list_len > 0) {
wait_list_len--;
clReleaseEvent(*wait_list++);
}
old_wait_event = wait_event;
/* And so on ... */
/* Then we finish off doing this so that the rest of the code runs
cleanly. */
wait_list_len = 1;
wait_list = &old_wait_event;
}
sched_th_job(align_len, BLOCK_LEN,
wait_list_len, wait_list, &wait_event,
ABSTRACT_MAP_REDUCE_BLOCK_THSUBR, out, REAL_LEN, ...);
while (wait_list_len > 0) {
wait_list_len--;
clReleaseEvent(*wait_list++);
}
old_wait_event = wait_event;
if (REAL_LEN > BLOCK_LEN) {
ABSTRACT_REDUCE_ALL(1, &old_wait_event, &wait_event, out, REAL_LEN);
old_wait_event = wait_event;
}
*out_event = old_wait_event;
}
/* Okay, so now here's the trick. When `REAL_LEN` is less than
`BLOCK_LEN`, we can take advantage of an optimization trick. We
can build higher-order thread subroutines by dropping the initial
`ABSTRACT_MAP_REDUCE_BLOCK_THSUBR()` right in place. But we also
want to be flexible so that the same higher-level code also
flexibly handles the case where `REAL_LEN > BLOCK_LEN`. And, to
keep the whole code compact, we also want the sequential compute
model to be handled the same.
So, here's how it works. At the base level, there is a "maybe
thread job executor" that will run the thread subroutine as-is if
`REAL_LEN <= BLOCK_LEN`, otherwise it will use the thread job
executors to complete the computation.
The next higher level is the "maybe thread subroutine." If the
lower level was in fact generated as a thread subroutine, then this
higher-level subroutine is also a thread subroutine. Otherwise,
this is a host subroutine that is in fact a thread job executor by
virtue of the thread job execution of the subroutines.
So, with the "maybe thread subroutine" (MTS) and "maybe thread job
executor" (MTJE), let's map out the modes and restrictions.
seq (REAL_LEN <= BLOCK_LEN) (REAL_LEN > BLOCK_LEN)
MTJE loop close thsubr host subr, sched_th_job
MTS vec thsubr host subr, sched_th_job
Okay... so, really, it is the same, `loop close` also results in a
`vec` calling convention. That makes it all pretty easy. The
sequential case basically maps out to the same treatment as
`REAL_LEN > BLOCK_LEN` from a calling convention standpoint.
*/
void ABSTRACT_MAP_REDUCE_MAYBE_SCTHJ(THCtx *ctx, Scalar *out,
unsigned REAL_LEN, ...)
{
if (REAL_LEN <= BLOCK_LEN)
ABSTRACT_MAP_REDUCE_BLOCK_THSUBR(ctx, out, REAL_LEN, ...);
else
ABSTRACT_MAP_REDUCE(out, REAL_LEN, ...);
}
/* Above the "maybe thread subroutine", of course we need another
"maybe thread job executor". */
void ABSTRACT_MAYBE_SCTHJ(THCtx *ctx,
unsigned wait_list_len, cl_event *wait_list,
cl_event *out_event,
void (*maybe_thsubr)(), ...)
{
cl_event wait_event, old_wait_event;
if (REAL_LEN <= BLOCK_LEN) {
unsigned align_len = (REAL_LEN + BLOCK_LEN - 1) & ~(BLOCK_LEN - 1);
/* Check if we need to add padding to fill `BLOCK_LEN`. Depends
on mode. */
if (REAL_LEN != align_len) {
/* NOTE: We will need to run pad jobs for each array input
argument. Here, for example, we assume there is at least one
named `b`. */
sched_th_job(align_len, BLOCK_LEN,
wait_list_len, wait_list, &wait_event,
pad_align_thsubr, b, REAL_LEN, IDENT_VAL);
while (wait_list_len > 0) {
wait_list_len--;
clReleaseEvent(*wait_list++);
}
old_wait_event = wait_event;
/* And so on ... */
/* Then we finish off doing this so that the rest of the code runs
cleanly. */
wait_list_len = 1;
wait_list = &old_wait_event;
}
sched_th_job(align_len, BLOCK_LEN,
wait_list_len, wait_list, &wait_event,
maybe_thsubr, ...);
while (wait_list_len > 0) {
wait_list_len--;
clReleaseEvent(*wait_list++);
}
old_wait_event = wait_event;
*out_event = old_wait_event;
}
else
maybe_thsubr(...);
}
/* And you can see this is now a general pattern. We have two
subroutines, the thread subroutine itself for special cases and the
thread job executor for everything else. At each level, we have a
decision point about which one to call. Even if REAL_LEN is not
fixed at compile-time, we can still gracefully handle both cases at
run-time. We expose two subroutines when declaring, we always use
the dispatcher when calling.
Sequential compute model is obviously special. Since we only
support one case (`vec` calling convention), the dispatcher
subroutine disappears and it **is** the same as the underlying
subroutine. Additionally, any calls to the "maybe thread
subroutine" call a single iteration of the loop (i.e. compute a
single component), so the outer "maybe thread subroutine" is
likewise not a thread subroutine. This will be more clearly
demonstrated in a later example. */
Now, that was a heck-lot of complex code to take in. Yeah, it’s definitely all in the complexity and sophistication of the ideas involved, not so much the length of the code itself. But, that’s the point. The point is to present a simplified example that, though incomplete from an implementation standpoint, is easier to conceptually understand. Once you are familiar with this code, it should be a lot easier to understand the real implementation code and its motivations. Though the real implementation code is not that much different in structure, arguably it is different enough to confuse a novice not familiar with the concepts and structure.
Now, let’s consider the higher level. Once you define all those higher-order vector functions generically at a lower level, what does higher-level programming look like? It absolutely works wonders. Now, you can suddenly leverage the computational power of GPU parallel computing with a programming model that doesn’t look too much different from the familiar programming primitives available in other high-level programming languages with simple functional programming constructs, such as Python for example.
/* NOTE: We have a little bit of pseudo-code here, an actual
implementation would need an elegant way to specify additional
arguments. Maybe the most portable way is to just pass a data
structure with the remaining arguments? Wrapper functions can then
be defined that expose the specific number of additional arguments.
No, actually probably the best way is to user-define the whole
function prototype, structures won't work well with OpenCL.
Another thing we need to do, we need the ability to specify the
`work` temporary scalar/vector should be declared as a Bool rather
than a Scalar.
Maybe the easiest way is to define a special set of code generators
that are stated to be Bool type intermediates?
Also note, in these generated map-reduce subroutines, in parallel
compute models, `out` is not just a pointer to write the output
result, but it is also a pointer to a sufficiently large scratch
space vector to use during intermediate computations. This is a
technicality of the underlying code implementation, more work could
probably be done to make a nicer high-level programmer interface
while still preserving efficiency and generality. */
/* For simplicity, define booleans to be one byte wide, although it is
most efficient to define them as 1-bit bit-fields. */
typedef unsigned char Bool;
/* Also, PLEASE NOTE: I totally forgot about passing around the thread
context object (`ctx`) as a function argument, that's okay. For
the sake of simplicity of presenting the high-level coding ideas
here, it is generally not necessary. */
/* Test if all components are equal to zero. Map-reduce
implementation. */
#define IS_ZERO_MAP(work, i) (work) = (b[i] == 0);
#define IS_ZERO_COMBINE(b, c) (b && c)
CONCRETE_MAP_REDUCE_BOOL(is_zero, IS_ZERO_MAP, 1, IS_ZERO_COMBINE);
#undef IS_ZERO_MAP
#undef IS_ZERO_COMBINE
/* Generated function prototypes: */
void is_zero_maybe_thsubr(Bool *out, Scalar *b);
void is_zero(Bool *out, Scalar *b);
/* Test if all components are equal to the `NOSOL` magic number,
indicating that there is no solution when attempting to solve a
particular system of equations. Map-reduce implementation. */
#define IS_NOSOL_MAP(work, i) work = (b[i] == NOSOL);
#define IS_NOSOL_COMBINE(b, c) (b && c)
CONCRETE_MAP_REDUCE_BOOL(is_nosol, IS_NOSOL_MAP, 1, IS_NOSOL_COMBINE);
#undef IS_NOSOL_MAP
#undef IS_NOSOL_COMBINE
/* Generated function prototypes: */
void is_nosol_maybe_thsubr(Bool *out, Scalar *b);
void is_nosol(Bool *out, Scalar *b);
/* Test if two vectors are equal. Map-reduce implementation. */
#define IS_EQUAL_MAP(work, i) work = (b[i] == c[i])
#define IS_EQUAL_COMBINE(b, c) (b && c)
CONCRETE_MAP_REDUCE_BOOL(is_equal, IS_EQUAL_MAP, 1, IS_EQUAL_COMBINE);
#undef IS_EQUAL_MAP
#undef IS_EQUAL_COMBINE
/* Generated function prototypes: */
void is_equal_maybe_thsubr(Bool *out, Scalar *b, Scalar *c);
void is_equal(Bool *out, Scalar *b, Scalar *c);
/* Arithmetic mean. Map-reduce implementation, requires
floating point concrete type for Scalar, doing divide as a `map`
step mainly for illustrative purposes, though arguably it is more
efficient to do after `reduce`. */
#define ARMEAN_MAP(work, i) work = b[i] / REAL_LEN
#define ARMEAN_COMBINE(b, c) (b + c)
CONCRETE_MAP_REDUCE(armean, ARMEAN_MAP, 0, ARMEAN_COMBINE);
#undef ARMEAN_MAP
#undef ARMEAN_COMBINE
/* Generated function prototypes: */
void armean_maybe_thsubr(Scalar *out, Scalar *b);
void armean(Scalar *out, Scalar *b);
/* Dot product. Map-reduce implementation, requires floating point
concrete type for Scalar. If we wanted to support integer
arithmetic, the implementation details would be somewhat similar to
what we need to do for Bool: use a double-width integer type for
working scratch space and the final result. */
#define DOT_MAP(work, i) work = b[i] * c[i]
#define DOT_COMBINE(b, c) (b + c)
CONCRETE_MAP_REDUCE(dot, DOT_MAP, 0, DOT_COMBINE);
#undef DOT_MAP
#undef DOT_COMBINE
/* Generated function prototypes: */
void dot_maybe_thsubr(Scalar *out, Scalar *b, Scalar *c);
void dot(Scalar *out, Scalar *b, Scalar *c);
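As a sketch of what a call site might then look like (the buffer names and sizes here are assumptions; recall that in the parallel compute model the output argument doubles as scratch space, so it must be a full VEC_LEN vector even though only element 0 holds the answer):
void example_call_site(Scalar *vel)
{
    Bool zero_scratch[VEC_LEN];    /* Bool output/scratch for is_zero */
    Scalar dot_scratch[VEC_LEN];   /* Scalar output/scratch for dot */

    is_zero(zero_scratch, vel);    /* zero_scratch[0]: are all components zero? */
    dot(dot_scratch, vel, vel);    /* dot_scratch[0]: squared magnitude of vel */

    if (!zero_scratch[0]) {
        /* ... safe to divide by dot_scratch[0] here ... */
    }
}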
/********************************************************************/
/* Okay, that was a good number of easy examples, let's go beyond the
basics. Now, let's consider the implementation of higher-order
composite subroutines, ones where a map-reduce is only part of
the overall operation. Specifically, we'll just start by taking a
look at map-reduce-map subroutines since they are illustrative of
the general principle that is just repeated for even higher-order
subroutines. */
/* Simplified macro definition: CONCRETE_MAP */
#define CONCRETE_MAP(NAME, MAP_OPERS)
/* Generated function prototypes (example): */
void NAME_thsubr(Scalar *out, ...);
void NAME(Scalar *out, ...);
/* Depends on mode, cast a vector `reduce` result to a scalar in
vector mode, else just pass through the scalar. */
#define RR_SCALAR(v) (*(v))
/* Sequential compute model (i.e. "Scalar" mode): */
/* #define RR_SCALAR(v) (v) */
/* The reverse conversion of `RR_SCALAR()`. */
#define RR_VECTOR(v) (v)
/* Sequential compute model (i.e. "Scalar" mode): */
/* #define RR_VECTOR(v) (&(v)) */
/* Depends on mode, perform the indicated operation only on the
`reduce` result if possible (i.e. operating in sequential compute
model), but if that's not possible, we don't care if the
computation is repeated across all vector components. */
#define RR_CONCRETE_MAP(NAME, MAP_OPERS) \
void NAME##_maybe_thsubr(Scalar *out, ...) \
{ unsigned i = GET_GLOBAL_TH_IDX(ctx); \
MAP_OPERS(out[i], i); } \
void NAME(Scalar *out, ...) \
{ unsigned i = GET_GLOBAL_TH_IDX(ctx); \
MAP_OPERS(out[0], 0); }
/* Sequential compute model */
/* #define RR_CONCRETE_MAP(NAME, MAP_OPERS) \
void NAME##_maybe_thsubr(Scalar *out, ...) \
{ MAP_OPERS(out[0], 0); } \
void NAME(Scalar *out, ...) \
{ MAP_OPERS(out[0], 0); } */
/* Map only: Conditionally set all vector components to `val`. */
#define SCALAR_CMOV_MAP(work, i) CMOV(cond, Scalar, work, val);
CONCRETE_MAP(scalar_cmov, SCALAR_CMOV_MAP);
#undef SCALAR_CMOV_MAP
/* Generated function prototypes: */
void scalar_cmov_thsubr(Scalar *out, Bool cond, Scalar val);
void scalar_cmov(Scalar *out, Bool cond, Scalar val);
/* Map only: Multiply-divide floating point scalars only, due to
previously mentioned technicalities with the dot product. */
#define MULDIV_MAP(work, i) work = b[i] * c / d
CONCRETE_MAP(muldiv, MULDIV_MAP);
#undef MULDIV_MAP
/* Generated function prototypes: */
void muldiv_thsubr(Scalar *out, Scalar *b, Scalar c, Scalar d);
void muldiv(Scalar *out, Scalar *b, Scalar c, Scalar d);
/* Map only: Subtract vectors. */
#define SUB_VEC_MAP(work, i) work = b[i] - c[i]
CONCRETE_MAP(sub_vec, SUB_VEC_MAP);
#undef SUB_VEC_MAP
/* Generated function prototypes: */
void sub_vec_thsubr(Scalar *out, Scalar *b, Scalar *c);
void sub_vec(Scalar *out, Scalar *b, Scalar *c);
#define SQRT_RR_MAP(work, i) work = sqrt(b[i])
RR_CONCRETE_MAP(sqrt_rr, SQRT_RR_MAP);
#undef SQRT_RR_MAP
/* Generated function prototypes: */
void sqrt_rr_maybe_thsubr(Scalar *out, Scalar *b);
void sqrt_rr(Scalar *out, Scalar *b);
#define SCALAR_RR_CMOV_MAP(work, i) CMOV(cond, Scalar, work, val)
RR_CONCRETE_MAP(scalar_rr_cmov, SCALAR_RR_CMOV_MAP);
#undef SCALAR_RR_CMOV_MAP
/* Generated function prototypes: */
void scalar_rr_cmov_maybe_thsubr(Scalar *out, Bool cond, Scalar val);
void scalar_rr_cmov(Scalar *out, Bool cond, Scalar val);
#define SCALAR_SUB_RR_MAP(work, i) work = b[i] - c
RR_CONCRETE_MAP(scalar_sub_rr, SCALAR_SUB_RR_MAP);
#undef SCALAR_SUB_RR_MAP
/* Generated function prototypes: */
void scalar_sub_rr_maybe_thsubr(Scalar *out, Scalar *b, Scalar c);
void scalar_sub_rr(Scalar *out, Scalar *b, Scalar c);
/* Compute vector magnitude squared. */
void magn2q_maybe_thsubr(Scalar *out, Scalar *b)
{
dot_maybe_thsubr(RR_VECTOR(out), b, b);
}
/* Explained in depth later. */
THSUBR_EXECUTOR(magn2q, magn2q_maybe_thsubr);
/* Compute vector magnitude. */
void magnitude_maybe_thsubr(Scalar *out, Scalar *b)
{
magn2q_maybe_thsubr(RR_VECTOR(out), b);
sqrt_rr_maybe_thsubr(RR_VECTOR(out), RR_VECTOR(out));
}
/* Explained in depth later. */
THSUBR_EXECUTOR(magnitude, magnitude_maybe_thsubr);
/* Project a point onto a vector. Only works for floating point
scalars, due to previous mentioned technicalities with the dot
product.
proj_a_on_b(A, B) = B * dot_product(A, B) / magnitude^2(B)
If there is no solution, the resulting point's coordinates are all
set to NOSOL.
Scratch space `n` and `d` must be pre-allocated to `VEC_LEN` if
they are vectors. Otherwise, they may simply be scalars.
N.B. For efficient general-case computation, we use conditional
moves and redundant computation for the special cases where there
is no solution.
*/
void proj_a_on_b_maybe_thsubr(Scalar *out, Scalar *b, Scalar *c,
MAYBE_VEC n, MAYBE_VEC d)
{
Bool out_nosol;
magn2q_maybe_thsubr(RR_VECTOR(d), c);
out_nosol = (RR_SCALAR(d) == 0);
/* Avoid divide by zero. */
scalar_rr_cmov_maybe_thsubr(RR_VECTOR(d), out_nosol, 1);
dot_maybe_thsubr(RR_VECTOR(n), b, c);
SEQ_LOOP_BEGIN {
muldiv_thsubr(out, c, RR_SCALAR(n), RR_SCALAR(d));
/* Nullify output if there is in fact no solution. */
scalar_cmov_thsubr(out, out_nosol, NOSOL);
} SEQ_LOOP_END;
}
/* Essentially, our `maybe_thsubr` we've defined is the inner loop of
a map, but this time because we're working with more lines of code,
we've used a different syntax. Also, the fact that `maybe_thsubr`
subroutines are included means this needs to be treated
differently. Nevertheless, a macro is used to generate the thread
job executor or subroutine alias, which one is applicable depends
on mode.
Unfortunately, to maintain our compatibility with OpenCL, our
working scratch space variable handling has become a bit clumsy,
but we'll figure out the right syntax and implementation.
Also note the use of `SEQ_LOOP_BEGIN { ... } SEQ_LOOP_END`. We
need this construct due to our mixing of `reduce` and `map`. In
the sequential compute model, normally a `map_maybe_thsubr` goes
straight to computing only a single component, so we run that in a
loop to compute all components. But if we use `reduce` in the
middle, we don't want to run that in a loop because that will be
redundant and even inaccurate if variables are being reused. So we
need to explicitly specify which computations should be looped, and
it is permissible to use more than one such loop construct in a
"maybe thread subroutine". In the parallel compute model, there is
no need for a distinction because it is perfectly fine if we compute
identical values multiple times in parallel across different
threads, and that is typically what we will prefer to do when there
are such scalar variables that are identical across threads. */
THSUBR_EXECUTOR(proj_a_on_b, proj_a_on_b_maybe_thsubr);
/* Project a point to a plane along the perpendicular:
Let L = location vector of point,
A = plane surface normal vector,
d = plane offset from origin
plane equation = Ax - d = 0
L - A * dist_point_to_plane(L, A, d) / magnitude(A)
Because we are dividing by magnitude twice, the division quantity
is squared so we therefore can avoid computing the square root and
simplify as follows:
L - A * (dot_product(L, A) - d) / magnitude^2(A)
If there is no solution, the resulting point's coordinates are all
set to NOSOL.
*/
void proj_pt_plane_maybe_thsubr(Scalar *out, Scalar *b,
Scalar *c, Scalar offset,
MAYBE_VEC n, MAYBE_VEC d, MAYBE_VEC t)
{
Bool out_nosol;
magn2q_maybe_thsubr(RR_VECTOR(d), c);
out_nosol = (RR_SCALAR(d) == 0);
/* Avoid divide by zero. */
scalar_rr_cmov_maybe_thsubr(RR_VECTOR(d), out_nosol, 1);
dot_maybe_thsubr(RR_VECTOR(n), b, c);
scalar_sub_rr_maybe_thsubr(RR_VECTOR(n), RR_VECTOR(n), offset);
SEQ_LOOP_BEGIN {
muldiv_thsubr(t, c, RR_SCALAR(n), RR_SCALAR(d));
sub_vec_thsubr(out, b, t);
/* Nullify output if there is in fact no solution. */
scalar_cmov_thsubr(out, out_nosol, NOSOL);
} SEQ_LOOP_END
}
THSUBR_EXECUTOR(proj_pt_plane, proj_pt_plane_maybe_thsubr);
/* Please note: In some places in the previous subroutines, the
thread subroutines were so simple that they could have been inlined
with direct C math operator syntax. However, we used the functional
form for the purpose of better demonstrating the generality of the
code structuring. */
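For reference, here is what that last projection boils down to as a plain, standalone sequential C sketch with none of the code generation machinery. This is only an illustration under assumptions not made by the library code above: a fixed dimension DIM of 3, Scalar defined as double, and an arbitrary placeholder value for NOSOL.
#define DIM 3
#define NOSOL 0.0 /* placeholder "no solution" marker for this sketch */
typedef double Scalar;
/* Project point b onto the plane { x : dot(c, x) - offset = 0 }:
   out = b - c * (dot(b, c) - offset) / dot(c, c).
   If c is the zero vector there is no solution and every output
   coordinate is set to NOSOL. */
void proj_pt_plane_seq(Scalar out[DIM], const Scalar b[DIM],
                       const Scalar c[DIM], Scalar offset)
{
  Scalar d = 0, n = 0;
  int i;
  for (i = 0; i < DIM; i++) {
    d += c[i] * c[i]; /* squared magnitude of the plane normal */
    n += b[i] * c[i]; /* dot(b, c) */
  }
  if (d == 0) { /* degenerate plane normal: no solution */
    for (i = 0; i < DIM; i++)
      out[i] = NOSOL;
    return;
  }
  n -= offset;
  for (i = 0; i < DIM; i++)
    out[i] = b[i] - c[i] * n / d;
}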
Whoah! Now that is really a powerful programming experience, feels like flying high at 20,000 feet, cruising at Mach 0.8. Yet still, there is room for improvement.
Code generation remarks
Some other remarks on parallel vector compute are worth mentioning.
Does it always make sense to compute in parallel on vector data
structures at any given opportunity? Sure, it is always possible, but
that does not mean it is always ideal. The prime problem with
parallel reduce is that it does not keep all parallel worker threads
busy at all times. The first stage has all threads busy, but as the
parallel stages proceed, the number of effectively idle threads
increases. With SIMD/SIMT architecture, since identical instruction
streams must be executed in lock-step, you can’t really schedule
these threads to do anything else. Now, if there were matching
binary-tree parallel unfold operations to be done, that could
potentially balance everything out, but reduce operations are so much
more common than unfold that this simply can’t be the case.
Additionally… some very simple GPU architectures like VideoCore IV on the Raspberry Pi can’t really do a parallel reduce efficiently due to the lack of adequate features and functions for memory access.
So, here’s the point to consider. If you have algorithms with reduce stages in the middle, consider configuring code generation of the “inner loop” entirely using the sequential compute model. Oftentimes, vector code has a higher-level outer loop that is effectively a simple map over all the complex operations inside. So, the point here is to define only the top level as parallel threads, and ensure that all lower-level operations are structured so that the parallel threads can execute identical instruction streams in lock-step. Namely, this can be easily achieved by using conditional moves instead of conditional branches and by structuring loops to use identical iteration counts across threads, again using conditional-move nullification if any need to diverge arises. This allows the threads to be auto-vectorized into SIMD/SIMT instructions that run efficiently on the GPU.
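To make the conditional-move point concrete, here is a tiny standalone sketch (plain C, with hypothetical function names and Scalar assumed to be float; not library code) contrasting a branchy loop body with a lock-step equivalent that every SIMD/SIMT lane can execute identically.
typedef float Scalar;
/* Divergent version: the branch splits the instruction stream, which
   SIMD/SIMT hardware has to serialize one side at a time. */
void clamp_scale_branchy(Scalar *y, const Scalar *x, Scalar scale, int n)
{
  int i;
  for (i = 0; i < n; i++) {
    if (x[i] < 0)
      y[i] = 0;
    else
      y[i] = x[i] * scale;
  }
}
/* Lock-step version: every iteration performs the same work and a
   conditional select nullifies the unwanted result, so all lanes can
   run the identical instruction stream. */
void clamp_scale_lockstep(Scalar *y, const Scalar *x, Scalar scale, int n)
{
  int i;
  for (i = 0; i < n; i++) {
    Scalar t = x[i] * scale;            /* redundant work when x[i] < 0 */
    y[i] = (x[i] < 0) ? (Scalar)0 : t;  /* a select, not a branch */
  }
}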
The main caveat to watch out for here is the fact that GPUs use
simplified architectures that make them less suited to the sequential
compute model. For example, a GPU typically has pipelining, but it
may not have out-of-order execution, branch prediction, or
speculative execution. So, small loops (and reduce operations on
small vectors) may be most efficiently executed when they are
unrolled (ideally done automatically by the compiler), and you want
your sequential computations to be structured to minimize pipeline
hazards as much as possible, namely by avoiding dependent
computations in rapid sequence and by structuring your code so that
many non-dependent computations can be issued in close proximity to
each other. In the end, these code generation considerations might
come down to benchmarking to determine which structure of code
generation is better tuned to a particular GPU. It is my goal to make
my library flexible enough to support optimal code generation via
parametric tuning.
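As an illustration of what “avoiding dependent computations in rapid sequence” can look like, here is a standalone sketch (plain C, not library output; Scalar assumed to be float, an unroll factor of four chosen arbitrarily, and n assumed to be a multiple of four) of a sequential reduce rewritten with independent accumulators so the pipeline is not stalled by one long dependency chain.
typedef float Scalar;
/* Naive sequential reduce: every addition depends on the previous
   one, so the pipeline waits on a single dependency chain. */
Scalar sum_dependent(const Scalar *x, unsigned n)
{
  Scalar acc = 0;
  unsigned i;
  for (i = 0; i < n; i++)
    acc += x[i];
  return acc;
}
/* Same reduce with four independent accumulators: the four chains
   can be kept in flight together even without out-of-order execution.
   Assumes n is a multiple of 4 to keep the sketch short. */
Scalar sum_interleaved(const Scalar *x, unsigned n)
{
  Scalar a0 = 0, a1 = 0, a2 = 0, a3 = 0;
  unsigned i;
  for (i = 0; i < n; i += 4) {
    a0 += x[i];
    a1 += x[i + 1];
    a2 += x[i + 2];
    a3 += x[i + 3];
  }
  return (a0 + a1) + (a2 + a3);
}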
A simplifying assumption was made in most of my parallel compute model
code related to OpenCL work-groups. Typically, an OpenCL work-group
corresponds to one “processor core” on a GPU. My code simply assumes
that the entire data is broken up into BLOCK_LEN-sized blocks (i.e.,
the width of the hardware SIMD/SIMT vector units) and as many of those
blocks are executed in parallel as possible (i.e. by running across
multiple GPU processor cores). When there are more work-groups than
processor cores, the OpenCL runtime will handle dividing up the work
so that the work-groups sequentially execute on the available GPU
processor cores. Of course, if you’re writing the low-level code with
this knowledge, you can better optimize it by only requesting as many
work-groups as there are GPU processor cores, then using loops inside
each work-group to process all the blocks of data. This is more
efficient because setup and tear-down computations that are common
across work-groups can then be coalesced so that they only run once
per GPU processor core.
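To illustrate that last optimization, here is a hypothetical OpenCL C kernel sketch (not generated by the library): instead of launching one work-item per element, you launch only as many work-groups as there are compute units and let each one stride across the data, so any per-work-group setup runs once per core rather than once per block.
/* Hypothetical grid-stride map kernel. Launch (number of compute
   units) work-groups of BLOCK_LEN work-items; each work-item then
   loops over its share of the elements. */
__kernel void map_scale_gridstride(__global float *out,
                                   __global const float *in,
                                   float scale,
                                   unsigned real_len)
{
  unsigned i = get_global_id(0);
  unsigned stride = get_global_size(0); /* total work-items launched */
  for (; i < real_len; i += stride)
    out[i] = in[i] * scale;
}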
Finally, some other minor points related to OpenCL implementation efficiency. The code I presented is not completely annotated with memory address space disciplines, but generally it should be obvious that inputs and outputs should go to global memory and intermediate computations should go to local memory, with additional copy stages to move data back and forth if necessary. Another important memory constraint is the GPU memory limit. If the data you are processing exceeds the size of available GPU memory, then you will need to break the data up into “tiles” and process each tile one at a time, sending and receiving data to and from CPU main memory to exchange results and load the data for the next tile.
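Here is a toy sketch of just the tiling arithmetic (standalone C, with an arbitrary TILE_LEN and data size; in real code each iteration would enqueue the buffer write, the kernel launches, and the result read for that tile):
#include <stdio.h>

#define TILE_LEN 4096u /* illustrative: how many scalars fit on the GPU at once */

int main(void)
{
  unsigned total_len = 10000u; /* hypothetical data set size */
  unsigned off;
  /* Walk the data one tile at a time; note the shorter final tile. */
  for (off = 0; off < total_len; off += TILE_LEN) {
    unsigned len = total_len - off;
    if (len > TILE_LEN)
      len = TILE_LEN;
    printf("process tile at offset %u, length %u\n", off, len);
  }
  return 0;
}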
Parallel power series computation example
Let’s consider an even more complex example of unfold: computing a
sequence of factorials. Once we implement this last operation, we
have all the pieces of the puzzle necessary to show a very useful and
impressive example: computing a power series in parallel. The
strength of power series computations is their generality and
familiarity; they can be used to evaluate a large range of
single-variable differentiable functions and are therefore common
knowledge to anyone who has studied single-variable Calculus. The
fact that we can run this computation in parallel means that we can
present a familiar interface to mathematical programmers, who can be
confident that parallel acceleration will be provided to run it at
maximum speed.
The alternative of using programming cleverness to creatively refactor computations to run in parallel is much less attractive. For example, rather than computing pi via a power series, you can compute pi via a statistical area process: generate random points, test whether each one falls inside the circle, compute the weighted average to determine the area of the circle, and obtain pi via a few simple arithmetic conversions on that number. But, alas, I digress; historically, relying on programming cleverness has been far more common among the few programmers who have optimized the parallel compute algorithms used in the mass market, for video compression codecs and the like.
So, the last key computational piece that we need to compute a power
series? Computing a sequence of consecutive factorials. Computing a
single factorial is obvious: do an unfold of consecutive integers,
then a reduce to determine the product of all those integers.
1 2 3 4 5 6 7 8
1*2 3*4 5*6 7*8
1*2*3*4 5*6*7*8
1*2*3*4*5*6*7*8
But computing a sequence of consecutive factorials is special: in a sequential compute model, you’d normally just multiply by the next integer in a step-wise fashion, but how do you do this in parallel? The key is in considering what happens to the scratch space when we normally compute just a single factorial in parallel. Let’s take a look.
1, 2, 3, 4, 5, 6, 7, 8
1*2, 2, 3*4, 4, 5*6, 6, 7*8, 8
1*2*3*4, 2, 3*4, 4, 5*6*7*8, 6, 7*8, 8
1*2*3*4*5*6*7*8, 2, 3*4, 4, 5*6*7*8, 6, 7*8, 8
Now, here’s the idea. In the working scratch space, we have room to store partials. We just need to select which partials we keep and which we discard in such a way that we have enough partials to efficiently compute any factorial in the sequence. As it turns out, if we change our strategy to overwriting the last element in a parallel slice rather than the first one, we have all the partials we need to efficiently compute any factorial.
1, 2, 3, 4, 5, 6, 7, 8
1, 1*2, 3, 3*4, 5, 5*6, 7, 7*8
1, 1*2, 3, 1*2*3*4, 5, 5*6, 7, 5*6*7*8
1, 1*2, 3, 1*2*3*4, 5, 5*6, 7, 1*2*3*4*5*6*7*8
Now, you can see that we can compute 1! to 8! efficiently using these partials.
1! = 1
2! = 1*2
3! = 1*2 * 3
4! = 1*2*3*4
5! = 1*2*3*4 * 5
6! = 1*2*3*4 * 5*6
7! = 1*2*3*4 * 5*6 * 7
8! = 1*2*3*4*5*6*7*8
And in fact, if you look at what is going on carefully, you can actually decode any binary number into a set of partials that need to be multiplied together to compute n!. Here’s what’s going on.
1! -> 0b0001 -> 0b0001 -> a[1]
2! -> 0b0010 -> 0b0010 -> a[2]
3! -> 0b0011 -> 0b0010, 0b0011 -> a[2] * a[3]
4! -> 0b0100 -> 0b0100 -> a[4]
5! -> 0b0101 -> 0b0100, 0b0101 -> a[4] * a[5]
6! -> 0b0110 -> 0b0100, 0b0110 -> a[4] * a[6]
7! -> 0b0111 -> 0b0100, 0b0110, 0b0111 -> a[4] * a[6] * a[7]
8! -> 0b1000 -> 0b1000 -> a[8]
So, you see what’s going on? You start with all bits masked out to zero, then you progressively unmask bits, from the most significant bit down to the least significant bit. If the bit just unmasked is set, a new number is revealed, and you have a new factor to multiply in: the partial stored at the index given by that revealed number. (Ideally you start your array with 0!.) You stop once you have unmasked all bits.
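To convince yourself the bookkeeping works, here is a small standalone C program (a throwaway sketch, not the library code; it uses a one-based array and plain loops so the numbers line up with the tables above) that builds the partials with the end-writing reduce and then decodes each factorial from them by walking the set bits:
#include <stdio.h>

#define N 8 /* must be a power of two for this sketch */

int main(void)
{
  /* One-based array a[1..N] of consecutive integers; a[0] holds the
     multiplicative identity, matching the zero-based partials
     convention described later. */
  unsigned long a[N + 1];
  unsigned i, stride;
  a[0] = 1;
  for (i = 1; i <= N; i++)
    a[i] = i;
  /* End-writing reduce that preserves partials: each stage folds the
     left half of every block into that block's last element. */
  for (stride = 2; stride <= N; stride *= 2)
    for (i = stride; i <= N; i += stride)
      a[i] *= a[i - stride / 2];
  /* a[] is now { 1, 1, 1*2, 3, 1*2*3*4, 5, 5*6, 7, 1*...*8 }. */
  /* Decode i! from the partials by walking the set bits of i. */
  for (i = 1; i <= N; i++) {
    unsigned long f = 1;
    unsigned bt;
    for (bt = 1; bt <= i; bt <<= 1)
      if (i & bt)
        f *= a[i & ~(bt - 1)];
    printf("%u! = %lu\n", i, f);
  }
  return 0;
}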
This method can obviously be extended to parallel, namely by asserting a factor if the bit just unmasked is nonzero, else there is no factor to consider. A binary decision list to pick or not pick factors is then created, and this can be trivially converted to a list of factors via a parallel map, followed by a parallel reduce to get the final result. The maximum number of factors is, of course, the maximum number of bits in the factorial source number, or log_2(n) for n!.
In parallel with “unlimited width,” the total number of sequential steps is O(log_2(n) + log_2(log_2(n))) = O(log_2(n * log_2(n))), so this is still pretty efficient; for n = 1024, for example, that is roughly 10 + 4 = 14 parallel steps rather than over a thousand sequential multiplications. When the hardware parallel width limit is reached, subsequent computations are completed sequentially, and runtime then eventually diminishes to O(n / c), where c is the initial speedup constant related to the limited parts that can be computed in parallel.
Now, let’s put it all together in pseudo-code. Basically, the main new additional higher-order primitive we need here is a modified parallel reduce, designed to “preserve partials.” This is fairly easy to code up since what it comes down to is just an increment and a decrement in the index computation.
/* Thread subroutine: Run one thread of the parallel `reduce`
subroutine to reduce in-place, writing the `reduce` result to the
end of the parallel block rather than the beginning. Only each
individual block is reduced. You must then use a thread job
executor to reduce these results down to a single scalar.
The result of the reduction is stored in the last element of each
block of the vector.
PLEASE NOTE: We rewrite the vector in place as our working memory,
so pass a copy!
BLOCK_LEN must be a power of two. */
void ABSTRACT_REDUCE_END_BLOCK_THSUBR(THCtx *ctx, Scalar *work,
unsigned REAL_LEN) {
unsigned stride_1 = 1; /* stride - 1 */
unsigned stride_log2 = 1;
unsigned sd_half = 1; /* Half stride */
unsigned sub_len;
unsigned i = GET_GLOBAL_TH_IDX(ctx);
unsigned my_blk_len = BLOCK_LEN;
/* Whether we do this at all depends on mode, i.e. on whether the
reduce has to operate on non-power-of-two input. */
CMOV((i >= (REAL_LEN & ~(BLOCK_LEN - 1))), unsigned, \
my_blk_len, REAL_LEN & (BLOCK_LEN - 1));
/* sub_len computation is a reformulation of the following:
while (stride * (i + 1) - 1 < my_blk_len) { ... }
-> reorganize with correct rounding ->
while ((sub_len = ((my_blk_len + 1 - stride) +
(stride - 1)) / stride) > 0)
{ ... }
->
while ((sub_len = my_blk_len / stride) > 0) { ... }
N.B. Since we divide strictly by powers of two, this can be
simplified to bit-shifting. Very important! */
while ((sub_len = my_blk_len >> stride_log2) > 0) {
/* Fold the left half of each sub-block into its last element:
work[stride*i + stride - 1] = COMBINE(work[stride*i + stride/2 - 1],
work[stride*i + stride - 1]), with stride*i computed by shift. */
CMOV((i < sub_len), Scalar, \
work[(i << stride_log2) + stride_1], \
COMBINE(work[(i << stride_log2) + sd_half - 1], \
work[(i << stride_log2) + stride_1]));
/* OpenCL barrier ensures that all threads are synced in execution
and local memory consistency state by this point, for the sake
of the sequential outer loop. */
barrier(CLK_LOCAL_MEM_FENCE);
stride_1 = stride_1 * 2 + 1; /* = (stride_1 + 1) * 2 - 1 */
stride_log2++;
sd_half *= 2;
}
}
/* Now, for the `REDUCE_ALL` subroutine. Unfortunately, due to
needing to keep partials, this must be a bit more complicated. We
need to gather the block-wise reduce results for higher-order
computation, then scatter them to store them. So this means that
we also need more local memory to handle this.
Fortunately, however, we can still make an optimization. We
actually only need to scatter to store partial results, but we can
still coalesce in place just like we do with the conventional
parallel `REDUCE_ALL`. */
void scatter_store_end_thsubr(THCtx *ctx, Scalar *out, Scalar *work,
unsigned REAL_LEN, unsigned stride)
{
unsigned i = GET_GLOBAL_TH_IDX(ctx);
/* Depends on mode whether we make this check, if it is possible
that we are not aligned to the block size. */
CMOV((i < REAL_LEN), Scalar, out[stride*i], work[i]);
}
/* Thread job executor to finish the `reduce`: run the `reduce_end`
concrete thread subroutine multiple times until there is only one
element remaining. Each time we `reduce_end`, we
scatter-store the partials to the destination memory.
BLOCK_LEN must be a power of two so that BLOCK_LEN_LOG2 can be used
for fast divide by shift right. */
void ABSTRACT_REDUCE_END_ALL(unsigned wait_list_len, cl_event *wait_list,
cl_event *out_event,
Scalar *out, Scalar *work, unsigned REAL_LEN)
{
cl_event wait_event, old_wait_event;
unsigned sub_len = REAL_LEN >> BLOCK_LEN_LOG2;
unsigned scatter_stride = BLOCK_LEN;
while (sub_len > 1) {
unsigned align_len = (sub_len + BLOCK_LEN - 1) & ~(BLOCK_LEN - 1);
sched_th_job(align_len, BLOCK_LEN,
wait_list_len, wait_list, &wait_event,
coalesce_stride_thsubr, work, work, sub_len, BLOCK_LEN);
while (wait_list_len > 0) {
wait_list_len--;
clReleaseEvent(*wait_list++);
}
old_wait_event = wait_event;
/* Check if we need to add padding to fill `BLOCK_LEN`. */
if (sub_len != align_len) {
/* unsigned rem = sub_len & (BLOCK_LEN - 1);
unsigned frag = sub_len - rem; */
/* Ideally we'd compute on only the subset
`work[frag...frag+rem]`. But OpenCL 1.0 doesn't have
convenient syntax to support that, so just compute on
everything. PLEASE NOTE: OpenCL 1.1 added support for
sub-buffer objects, clCreateSubBuffer(). */
/* N.B.: This could possibly be faster on the CPU in a Unified
Memory Architecture but slower on non-unified architectures.
So it depends on mode whether we execute on the GPU. */
sched_th_job(align_len, BLOCK_LEN, 1, &old_wait_event, &wait_event,
pad_align_thsubr, work, sub_len, IDENT_VAL);
clReleaseEvent(old_wait_event);
old_wait_event = wait_event;
}
sched_th_job(align_len, BLOCK_LEN, 1, &old_wait_event, &wait_event,
ABSTRACT_REDUCE_END_BLOCK_THSUBR, work, sub_len);
clReleaseEvent(old_wait_event);
old_wait_event = wait_event;
/* Now scatter-store the partials to `out`. */
sched_th_job(align_len, BLOCK_LEN, 1, &old_wait_event, &wait_event,
scatter_store_end_thsubr, out, work, sub_len, scatter_stride);
clReleaseEvent(old_wait_event);
old_wait_event = wait_event;
sub_len >>= BLOCK_LEN_LOG2;
scatter_stride <<= BLOCK_LEN_LOG2;
}
*out_event = old_wait_event;
}
The abstract map-reduce routine above, ABSTRACT_REDUCE_END_ALL(), is
almost exactly the same as in the conventional case. Basically, you
need to make sure you also allocate additional local memory scratch
space for work and initialize it to be a copy of out. The
scatter-store output goes straight to out. For brevity, we will not
show it with the changes. Likewise, the changes for
ABSTRACT_MAP_REDUCE_END_BLOCK_THSUBR() are extremely trivial: just
use ABSTRACT_REDUCE_END_BLOCK_THSUBR() instead of
ABSTRACT_REDUCE_BLOCK_THSUBR().
Now, following the computation of partials via reduce_end, we can
implement the map-reduce subroutine for computing the final sequence
of factorials.
/* Stage 2 is trivial; technically we have already shown the code for
it, so we will not repeat it here. Generate a list of consecutive
integers; we can do this using the optimized `gpu_unfold_n()`
subroutine. */
/* Now, the final stage, stage 3. For one factorial input number,
collect all partial factors and multiply them together. These jobs
must be executed for each consecutive input integer in the
sequence, and, of course, as each job is independent, they can be
issued in parallel. */
#define FACTRAL_MAP(work, i) \
work = ((num & (1 << i)) ? partials[num & ~((1 << i) - 1)] : 1)
#define FACTRAL_COMBINE(b, c) (b * c)
CONCRETE_MAP_REDUCE(factral_stage3, FACTRAL_MAP, 1, FACTRAL_COMBINE);
#undef FACTRAL_MAP
#undef FACTRAL_COMBINE
/* Generated function prototypes: */
void factral_stage3_maybe_thsubr(Scalar *out, unsigned num);
void factral_stage3(Scalar *out, unsigned num);
Now… stages 2 and 3 are indeed a bit weird, I must admit. The unfortunate fact is that the code that schedules all these parallel jobs must run on the CPU, so you have a sequential bottleneck that probably far outweighs the gains from being able to reduce in parallel. Since we are, after all, working with partials, we have exponentially fewer factors to multiply together, so it might make more sense to use the sequential compute model formulation to multiply the partials together, then use GPU thread-level parallelism to process all such numbers in parallel. This also better ensures that the parallel threads are kept as busy as possible. Let’s show the code for that, since the implementation is rather specialized.
/* Combine partials to generate a sequence of consecutive unfold
values, i.e. generate a factorial sequence from computed partials.
Because this needs a bit of specialized optimization, while at the
same time being fairly general for other types of unfolds, we will
define the thread subroutine directly for combining partials
computed from a parallel `map-reduce*`
(i.e. the variant that preserves partials).
What's going on here in higher-order vector functions?
Technically, this is an `unfold-unfold-map-reduce`. Since we are
using GPU optimization for unfolding consecutive integers, it looks
like a `map-map-reduce`. Since we run the interior
`unfold-map-reduce` sequentially, as a whole it looks like a `map`.
So in the end, it seemed simpler to just write the straight code
without using the higher-order code generation, except for the use
of defining the reduce generically by way of `IDENT_VAL` and
`COMBINE()`.
PLEASE NOTE: `partials` must be zero-based, i.e. the first result
of an `unfold` would be found at index 1 rather than index zero.
The item at index zero is the identity value defined for the
particular combinator. So, for example, `unfold` count by one
would be `{ 0, 1, 2, 3, ... }` and `unfold` power series of three
would be `{ 1, 3, 9, 27, ... }`. */
void combine_parts_thsubr(THCtx *ctx, Scalar *out, Scalar *partials,
unsigned REAL_LEN, unsigned bits_len)
{
unsigned i = GET_GLOBAL_TH_IDX(ctx);
unsigned j = bits_len;
/* Sequential compute model map-reduce inlined right here, run in
parallel for the width of `out`. */
Scalar accum = IDENT_VAL;
while (j > 0) {
unsigned bt; /* bit test */
j--;
bt = 1 << j;
Scalar part =
(i & bt) ? partials[i & ~(bt - 1)] : IDENT_VAL;
accum = COMBINE(accum, part);
}
/* Depends on mode whether we make this check, if it is possible
that we are not aligned to the block size. */
CMOV((i < REAL_LEN), Scalar, out[i], accum);
}
Now that we have the factorial sequence figured out, let’s work on the big picture. The Taylor power series is defined as follows:
sum_{n = 0}^oo f^(n)(a) * (x - a)^n / n!
f^(n)(a) is my notation for the nth derivative of the function f(x),
evaluated at the value a. When a = 0, the simplified form is called
the Maclaurin series.
With all the pieces of the puzzle in place, we can now implement the power series computation kernel. This can be outlined in the following stages:
- Compute the derivative sequence
- Compute the power sequence (unfold)
- Compute the sequence of factorials (unfold-map-reduce*, unfold-map-reduce)
- Multiply-divide to combine these three factors (map)
- Sum up the terms (reduce)
For computing the derivative sequence, the function f(x) = e^x is an
easy special case because f'(x) = e^x. So computing the derivative
sequence is trivial: it is e^0 = 1 for every term. Likewise, sine and
cosine are easy because their derivatives at zero cycle through 0, 1,
0, -1; ignoring the signs, the zero/nonzero pattern can be computed
in parallel using the following equation: ((-1)^n + 1) / 2. Since
this is readily a power sequence followed by a map, a parallel unfold
can be used. Or, if you’re a programmer, the other alternative is
obvious: n % 2 == n & 1, so the pattern can be read straight off the
low bit. Using low-level GPU optimization magic, that can be computed
fully in parallel.
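For what it's worth, here is a throwaway standalone sketch of that bit trick (plain C, hypothetical helper names; the signs of the sine and cosine terms would still be handled separately):
#include <stdio.h>

/* Branchless parity masks suitable for a parallel map. even_mask(n)
   equals ((-1)^n + 1) / 2: it is 1 for even n and 0 for odd n, which
   selects the nonzero derivative terms of cosine at zero; odd_mask(n)
   is the complementary pattern for sine. */
static unsigned even_mask(unsigned n) { return 1u - (n & 1u); }
static unsigned odd_mask(unsigned n) { return n & 1u; }

int main(void)
{
  unsigned n;
  for (n = 0; n < 8; n++)
    printf("n=%u cos-mask=%u sin-mask=%u\n", n, even_mask(n), odd_mask(n));
  return 0;
}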
So, all that being said, let’s pick the easiest concrete example to
compute: e^x. This makes the step of computing the derivative
sequence drop out completely, since we compute a Maclaurin series on
e^x, which gives e^0 = 1 every single time. What we’ve developed in
bits and pieces, we can now put together and see how it works as a
whole.
#define MAX(b, c) (((b) > (c)) ? (b) : (c))
#define FACTRAL_MAP(work, i) work = MAX(1, i)
#define FACTRAL_COMBINE(b, c) (b * c)
CONCRETE_MAP_REDUCE_END(factral_gen_parts, FACTRAL_MAP, 1, \
FACTRAL_COMBINE);
CONCRETE_COMBINE_PARTIALS(factral_combine_parts, 1, FACTRAL_COMBINE);
#undef FACTRAL_MAP
#undef FACTRAL_COMBINE
/* Generated function prototypes: */
void factral_gen_parts_maybe_thsubr(Scalar *out, unsigned REAL_LEN);
void factral_gen_parts(Scalar *out, unsigned REAL_LEN);
void factral_combine_parts_maybe_thsubr(Scalar *out, Scalar *partials,
unsigned REAL_LEN, unsigned bits_len);
void factral_combine_parts(Scalar *out, Scalar *partials,
unsigned REAL_LEN, unsigned bits_len);
/* For posterity, a quick definition: Find the bit index of the most
significant bit. Count leading zeroes (`clz()` subroutine), then
subtract the result from the width in bits minus one; for a zero
input the result wraps around to an all-ones value. Essentially,
this computes the base 2 logarithm of positive unsigned integers. */
unsigned char msbidx(unsigned b)
{
return 8 * sizeof(unsigned) - 1 - clz(b);
}
/* Now with all this code generation, it's a breeze to craft our
subroutine to generate a factorial sequence, starting at `0!` and
including `n!`. Results are returned in `out`, which must be
sufficiently allocated. */
void gen_factral_list_n(Scalar *out, unsigned n)
{
unsigned REAL_LEN = n + 1;
unsigned VEC_LEN = (REAL_LEN + BLOCK_LEN - 1) & ~(BLOCK_LEN - 1);
unsigned bit_len = msbidx(REAL_LEN) + 1;
/* OpenCL ALLOCATE GLOBAL MEMORY. `xmalloc()` used only for
simplicity of demonstration. */
Scalar *work = (Scalar*)xmalloc(sizeof(Scalar) * VEC_LEN);
factral_gen_parts(work, REAL_LEN);
factral_combine_parts(out, work, REAL_LEN, bit_len);
/* OpenCL FREE GLOBAL MEMORY. */
xfree(work);
}
/* We can generate a power sequence in parallel with ease, thanks to
our `unfold` code generators. */
/* Parameterize the base value to a variable. */
#define GEN_POW_LIST_NZ_BASE base
#define GEN_POW_LIST_NZ_COMBINE(b, c) (b * c)
CONCRETE_UNFOLD(gen_pow_list_nz, \
GEN_POW_LIST_NZ_BASE, GEN_POW_LIST_NZ_COMBINE);
#undef GEN_POW_LIST_NZ_BASE
#undef GEN_POW_LIST_NZ_COMBINE
/* Generated function prototypes: */
void gen_pow_list_nz_blkone(Scalar *out, Scalar base);
void gen_pow_list_nz_maybe_thsubr(Scalar *out, Scalar *blkone,
unsigned REAL_LEN, Scalar base);
void gen_pow_list_nz(Scalar *out, Scalar *blkone, unsigned REAL_LEN,
Scalar base);
/* We just need one little helper subroutine to correctly set up a
zero-based array. And, well, to provide a fully high-level
programmer interface. Compute a power sequence of base, starting
at `base^0` and including `base^n`. Results are returned in `out`,
which must be sufficiently allocated. */
void gen_pow_list(Scalar *out, Scalar base, unsigned n)
{
unsigned REAL_LEN = n + 1;
/* OpenCL ALLOCATE GLOBAL MEMORY. `xmalloc()` used only for
simplicity of demonstration. */
Scalar *blkone = (Scalar*)xmalloc(sizeof(Scalar) * BLOCK_LEN);
gen_pow_list_nz_blkone(blkone, base);
/* N.B.: Use OpenCL 1.1 `clCreateSubBuffer()` as necessary. */
gen_pow_list_nz(&out[1], blkone, REAL_LEN, base);
/* Initialize out[0] as `base^0 = 1`. */
out[0] = 1;
/* OpenCL FREE GLOBAL MEMORY. */
xfree(blkone);
}
/* Finally, the last stage of the computation of `e^x` can be
implemented as a simple `map-reduce`. */
#define DIVSUM_MAP(work, i) work = b[i] / c[i]
#define DIVSUM_REDUCE(b, c) (b + c)
CONCRETE_MAP_REDUCE(divsum, DIVSUM_MAP, 0, DIVSUM_REDUCE);
#undef DIVSUM_MAP
#undef DIVSUM_REDUCE
/* Generated function prototypes: */
void divsum_maybe_thsubr(Scalar *out, Scalar *b, Scalar *c);
void divsum(Scalar *out, Scalar *b, Scalar *c);
/* Now for a little helper function to put it all together. Compute
`e^x` using `n + 1` power series iterations. */
Scalar pow_e_x(Scalar x, unsigned n)
{
Scalar result;
unsigned REAL_LEN = n + 1;
unsigned VEC_LEN = (REAL_LEN + BLOCK_LEN - 1) & ~(BLOCK_LEN - 1);
/* OpenCL ALLOCATE GLOBAL MEMORY. `xmalloc()` used only for
simplicity of demonstration. */
Scalar *pow_list = (Scalar*)xmalloc(sizeof(Scalar) * VEC_LEN);
Scalar *factral_list = (Scalar*)xmalloc(sizeof(Scalar) * VEC_LEN);
Scalar *work = (Scalar*)xmalloc(sizeof(Scalar) * VEC_LEN);
gen_pow_list(pow_list, x, n);
gen_factral_list_n(factral_list, n);
divsum(work, pow_list, factral_list);
/* OpenCL READ GLOBAL MEMORY BUFFER. */
result = work[0];
/* OpenCL FREE GLOBAL MEMORY. */
xfree(pow_list);
xfree(factral_list);
xfree(work);
return result;
}
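And a quick usage sketch of the top-level helper, assuming the definitions above with Scalar as a floating-point type (the expected values are just the familiar Maclaurin approximations, so this doubles as a sanity check):
#include <stdio.h>

int main(void)
{
  /* e^1 with 12 + 1 power series terms; the truncation error is on
     the order of 1/13!, so this should print roughly 2.718281828. */
  printf("e ~= %.9f\n", (double)pow_e_x((Scalar)1, 12));
  /* e^0.5 for comparison against sqrt(e) = 1.648721271... */
  printf("sqrt(e) ~= %.9f\n", (double)pow_e_x((Scalar)0.5, 12));
  return 0;
}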
Conclusion
An important observation about the path to implementing this code is
worth reflecting on. The full optimization of combined operations
introduces a certain degree of complexity for the programmer, both
for sequential and parallel computation, especially when GPU memory
access constraints and optimizations also have to be considered.
Subroutines that behave as if they compute all scalars in parallel
are easiest to program with. However, in the case of unfold and
reduce, it is easier to program the actual body code in a sequential
compute model. This ease-of-programming fact is not to be
underestimated if you are writing new algorithms! Likewise, the
reason why much existing math code was written specialized to
assumptions that only work well for a particular application is
obvious: there are simply fewer things you need to keep track of
simultaneously in your head when implementing specialized code at the
bottom layer, and if the goal is to develop only a single
application, why bother generalizing?
It’s very easy to write a sequential compute model unfold (generator
function), but rather challenging to write a parallel compute model
unfold. But, in the end, it is very rewarding, because this is the
last piece of the puzzle you need to be able to get a reasonably fast
generalized parallel power series computation. That allows you to
evaluate many arbitrary functions with ease, to a high number of bits
of precision. Applying a few restrictions on what can be done by a
generator function and how the code is to be structured is likewise
very helpful for implementing parallel unfold operations.
Now, the ad ultimum question: how does the performance compare against a sequential compute model alternative? That is a question to answer at a different time. I have thus far only presented the core concepts here in pseudo-code; there are tiny little changes I need to make all over to turn this into real, working code. But the main idea of this discussion was to set the stage with some good ideas that will serve as guidance during my process of implementing the actual working code. And, might I mention one other idea: it would be great to venture into the world of Fourier series, as that is extremely common in multimedia processing.
Obviously, simply putting together an ideal math subroutine library is not enough. Without connecting it to some baseline practical applications, it will never gain enough popularity on its own for other people to consider implementing other software on top of it. This is the prime reason why so much software does not use very good math libraries and implementations. Essentially, at least until recently, modern software development has been a “race to the bottom”: whoever develops some crude software that showcases some function first wins, in total disregard of how much those short-term decisions may slow down long-term future development. But because the software profession historically dumped the software bloat problem onto brute-force increases in sequential compute speed in computer hardware, that approach worked pretty well for a good 15 years.
That being said, the reason why quality software development is making a comeback is obvious. The brute-force methodology of throwing more sequential compute speed at otherwise broken software came to an end over a decade ago. In the meantime, it has become more than obvious that the incumbent players were failing to improve their software, and that has opened up ample opportunity for quality-focused competitors to take the stage: not just to break even with the old software that was slapped together in a race, but to go much further, much faster, and eclipse it in features and functions.