After a "brief" hiatus…

Parallel prefix/suffix operations are a family of operations that follow either this template:

```
x = x ⊕ (x >> 1);
x = x ⊕ (x >> 2);
x = x ⊕ (x >> 4);
x = x ⊕ (x >> 8);
x = x ⊕ (x >> 16);
```

Or this template:

```
x = x ⊕ (x << 1);
x = x ⊕ (x << 2);
x = x ⊕ (x << 4);
x = x ⊕ (x << 8);
x = x ⊕ (x << 16);
```

Where ⊕ is either OR or XOR. AND is pointless, but perhaps there are undiscovered useful variations.

There is some disagreement (stemming from the usual "which end of an integer is the front" question) about which of them should be called the "parallel prefix" and which the "parallel suffix"; for the purposes of this post, I'll call the first one (with the shifts to the right) the "parallel prefix". Either way, these operations are bit-level analogues of the better-known prefix sum, prefix max, prefix product, etc. The word "parallel" in the name refers to the bit-level parallelism, which has the same structure as the simple (not work-efficient) parallel prefix sum algorithm.

### Some members of the family

The best-known member of the parallel prefix/suffix family is the parallel prefix with ⊕ = XOR (PP-XOR), which converts Gray code back to normally-ordered integers and produces the parity of an integer in the lowest bit.
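As a concrete sketch (the function name is mine, not from any standard library):

```c
#include <stdint.h>

/* PP-XOR: each bit becomes the XOR of itself and all higher bits.
   This decodes Gray code, and leaves the parity of x in bit 0. */
static uint32_t pp_xor(uint32_t x) {
    x ^= x >> 1;
    x ^= x >> 2;
    x ^= x >> 4;
    x ^= x >> 8;
    x ^= x >> 16;
    return x;
}
```

For example, since `n ^ (n >> 1)` is the Gray encoding of `n`, `pp_xor(n ^ (n >> 1))` recovers `n`, and `pp_xor(n) & 1` is the parity of `n`.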

The parallel *suffix* with ⊕ = XOR (PS-XOR) is a carryless multiplication by -1, which isn't very interesting by itself (probably), but gives a hint about the algebraic structure that these operations give rise to.

The parallel prefix with ⊕ = OR (PP-OR) often shows up whenever bitwise operations interact with ranges/intervals, since `PP-OR(low ^ high)` gives a mask of the bits that are not constrained by the interval. I have used this operation (though I did not name it then) several times in my series on propagating bounds through bitwise operations.
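A small illustration of that use (the interval values and names are my own):

```c
#include <stdint.h>

/* PP-OR: each bit becomes the OR of itself and all higher bits. */
static uint32_t pp_or(uint32_t x) {
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return x;
}

/* Mask of bits not constrained by the interval [low, high]:
   the highest differing bit gets smeared downward, since that
   bit and everything below it can vary within the interval. */
static uint32_t free_bits(uint32_t low, uint32_t high) {
    return pp_or(low ^ high);
}
```

For the interval [0x120, 0x13F], `low ^ high` is 0x1F, so `free_bits` returns 0x1F: bits 0 through 4 take on every combination somewhere in the interval, while the higher bits are fixed.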

This operation lends itself to some optimizations (I put "performance" in the title, which I admit I've mostly ignored); for example, on x64 you could implement it as:

```
lzcnt rax, rax  ; input in rax
sbb rdx, rdx
not rdx
shrx rax, rdx, rax
```

Or:

```
lzcnt rax, rax
mov edx, 64
sub edx, eax
or rax, -1
bzhi rax, rax, rdx
```

They each have their pros and cons, and hopefully there's a better way to do it, but I couldn't really find any. I made sure to avoid the infamous false dependency that `popcnt`, `tzcnt` and `lzcnt` all share on any Intel processor that implements them to date. Probably the biggest problem here is that both versions require BMI2; that can be avoided, e.g. (suggested by Jasper Neumann):
```
xor edx, edx    // provide a zero register for cmov
bsr ecx, eax
mov eax, -1
not ecx         // flags not modified
cmovz eax, edx
shr eax, cl
```
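For reference, the leading-zero-count trick corresponds to something like the following C (a sketch using the GCC/Clang-specific `__builtin_clzll`; the function name is mine):

```c
#include <stdint.h>

/* PP-OR via a leading-zero count, mirroring the lzcnt versions above.
   __builtin_clzll is undefined for 0, so the zero case is handled
   explicitly (the sbb version handles it through the carry flag). */
static uint64_t pp_or64(uint64_t x) {
    if (x == 0) return 0;
    return ~UINT64_C(0) >> __builtin_clzll(x);
}
```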

### Properties/structure

To start with the obvious, the prefix and suffix versions are exactly each other's mirror images. So I'm going to look just at the suffix part of the family; for the prefix part, just mirror everything.

#### PS-XOR(x) is clmul(x, -1)

Every step is a clmul by `1 + 2^k`, and if you clmul those constants together you get -1 (the two's complement -1, not the -1 in the ring formed by XOR and clmul over bitvectors of length k, which would just be 1 again). Or, from a more intuitive angle: what the PS-XOR is supposed to do in the first place is XOR each bit into all higher bits. So it inherits the properties of clmul, such as invertibility (-1 is odd). The inverse of PS-XOR is `x ^ (x << 1)`.
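A quick sanity check of that inverse (a sketch in C; the names are mine):

```c
#include <stdint.h>

/* PS-XOR: XOR each bit into all higher bits, i.e. clmul(x, -1). */
static uint32_t ps_xor(uint32_t x) {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    return x;
}

/* The inverse: a clmul by 1 + 2, since carrylessly multiplying
   1 + 2 by -1 gives 1 (mod 2^32). */
static uint32_t ps_xor_inv(uint32_t x) {
    return x ^ (x << 1);
}
```

Both round trips hold: `ps_xor_inv(ps_xor(x)) == x` and `ps_xor(ps_xor_inv(x)) == x` for all 32-bit `x`.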

It also inherits `PS-XOR(x) ^ PS-XOR(y) == PS-XOR(x ^ y)` from the distributivity of clmul.

Since clmul is commutative and associative, the steps in PS-XOR can be reordered.
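Both facts are easy to check empirically (a sketch; the reordering below is one arbitrary permutation of the steps):

```c
#include <stdint.h>

/* PS-XOR with the steps in the usual order... */
static uint32_t ps_xor(uint32_t x) {
    x ^= x << 1; x ^= x << 2; x ^= x << 4; x ^= x << 8; x ^= x << 16;
    return x;
}

/* ...and with the steps shuffled: each step is a clmul by 1 + 2^k,
   and those factors commute, so the result is identical. */
static uint32_t ps_xor_reordered(uint32_t x) {
    x ^= x << 8; x ^= x << 1; x ^= x << 16; x ^= x << 4; x ^= x << 2;
    return x;
}
```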

#### PS-OR(x) is or_mul(x, -1)

This isn't as nice as the XOR case (XOR and clmul form a ring), since OR doesn't form a group: it has no negation (which is just a fancy way of saying that you can't un-OR numbers, the way you can un-ADD by subtracting or un-XOR by XORing again), so it doesn't extend to a ring. It can, however, be extended into a semiring by defining the following multiplication:

```
static uint or_mul(uint a, uint b)
{
    uint r = 0;
    for (int i = 0; i < 32; i++)
    {
        if ((a & 1) == 1)
            r |= b; // the only difference is here
        a >>= 1;
        b <<= 1;
    }
    return r;
}
```

It's not fancy math notation but you get the idea.

Since this (with OR) forms a commutative semiring (proof "left as an exercise to the reader"), it has some nice properties:

- `or_mul(a, b) == or_mul(b, a)` (commutativity)
- `or_mul(a, or_mul(b, c)) == or_mul(or_mul(a, b), c)` (associativity)
- `or_mul(a, b | c) == or_mul(a, b) | or_mul(a, c)` (left distributivity)
- `or_mul(a | b, c) == or_mul(a, c) | or_mul(b, c)` (right distributivity)

`PS-OR(x) == or_mul(x, -1)`, and the individual steps of the PS-OR are of the form `x = or_mul(x, 1 + 2^k)`. The or_mul of all those constants together is, unsurprisingly, -1 (though this notation is now slightly odd since this is all a semiring; I mean the element that has all bits set).

And again, PS-OR inherits `PS-OR(x) | PS-OR(y) == PS-OR(x | y)` from distributivity.

And again, since or_mul is commutative and associative, the steps in PS-OR can be reordered.
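Putting the pieces together (a small self-contained check; the C translation of `or_mul` to fixed-width types is mine):

```c
#include <stdint.h>

/* or_mul: binary long multiplication, but with the partial
   products combined by OR instead of addition. */
static uint32_t or_mul(uint32_t a, uint32_t b) {
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        if (a & 1)
            r |= b;
        a >>= 1;
        b <<= 1;
    }
    return r;
}

/* PS-OR: OR each bit into all higher bits. */
static uint32_t ps_or(uint32_t x) {
    x |= x << 1; x |= x << 2; x |= x << 4; x |= x << 8; x |= x << 16;
    return x;
}
```

For any `x`, `ps_or(x) == or_mul(x, 0xFFFFFFFF)` (the all-bits-set element standing in for -1), and `ps_or(x) | ps_or(y) == ps_or(x | y)`.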

#### PS-OR(x) is also `x | -x`

This one doesn't have a nice mirrored equivalent, just the obvious one where you insert a bit-reversal before and after. It also doesn't have an obvious relation to or_mul. As for why it's the case: in string notation, -(a10^{k}) = (~a)10^{k}, so a10^{k} | -(a10^{k}) = (~a|a)10^{k} = 1^{∞}10^{k}. That is, x | -x copies the lowest set bit into all higher bits, just as PS-OR does.
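The identity is easy to verify directly (a sketch; `lowest_bit_smear` is my name for the `x | -x` form):

```c
#include <stdint.h>

/* PS-OR: OR each bit into all higher bits. */
static uint32_t ps_or(uint32_t x) {
    x |= x << 1; x |= x << 2; x |= x << 4; x |= x << 8; x |= x << 16;
    return x;
}

/* x | -x: negating flips every bit above the lowest set bit while
   keeping that bit itself, so the OR smears the lowest set bit
   into all higher positions, matching PS-OR. */
static uint32_t lowest_bit_smear(uint32_t x) {
    return x | (0u - x);
}
```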