Bits, Math and Performance(?): Parallel prefix/suffix operations

After a "brief" hiatus..

Parallel prefix/suffix operations is a family of operations that follow either this template:

x = x ⊕ (x >> 1);
x = x ⊕ (x >> 2);
x = x ⊕ (x >> 4);
x = x ⊕ (x >> 8);
x = x ⊕ (x >> 16);

Or this template:

x = x ⊕ (x << 1);
x = x ⊕ (x << 2);
x = x ⊕ (x << 4);
x = x ⊕ (x << 8);
x = x ⊕ (x << 16);

Where ⊕ is typically OR or XOR. I've never seen ⊕=AND show up naturally, but it might be useful for something.

There is some disagreement (stemming from the usual "which end of an integer is the front" question) about which of them should be called the "parallel prefix" and which the "parallel suffix", for the purpose of this post I'll be calling the first one (with the shifts to the right) "parallel prefix". Either way, these operations are bit-level analogues to the more well known prefix sum, prefix max, prefix product, etc. The word "parallel" in the name refers to the bit-level parallelism, which has the same structure as the simple (not work-efficient) parallel prefix sum algorithm.

Some members of the family

The best-known member of parallel prefix/suffix family is the parallel prefix with ⊕ = XOR (PP-XOR), which converts Gray code back to normally-ordered integers and produces the parity of an integer in the lowest bit.

The parallel suffix with ⊕ = XOR (PS-XOR) is a carryless multiplication by -1, which isn't very interesting by itself (probably), but gives a hint about the algebraic structure that these operations give rise to.

The parallel prefix with ⊕ = OR (PP-OR) often shows up whenever bitwise operations are interacting with ranges/intervals, since PP-OR(low ^ high) gives a mask of the bits that are not constrained by the interval. I have used this operation (though I did not name it then) several times in my series on propagating bounds through bitwise operations.
This operation lends itself to some optimizations (I put "performance" in the title, which I admit I've mostly ignored), for example on x64 you could implement it as

   lzcnt rax, rax  ; input in rax
   sbb rdx, rdx
   not rdx
   shrx rax, rdx, rax

Or:

   lzcnt rax, rax
   mov edx, 64
   sub edx, eax
   or rax, -1
   bzhi rax, rax, rdx

The each have their pros and cons, and hopefully there's a better way to do it, but I couldn't really find any. I made sure to avoid the infamous false dependency that popcnt, tzcnt and lzcnt all share on any Intel processor that implements them to date. Probably the biggest problem here is that both versions require BMI2, that can be avoided, eg (suggested by Jasper Neumann)

   xor edx, edx    // provide a zero register for cmov
   bsr ecx, eax
   mov eax, -1
   not ecx         // flags not modified
   cmovz eax, edx
   shr eax, cl

Properties/structure

To start with the obvious, the prefix and suffix versions are exactly each others mirror image. So I'm going to look just at the suffix part of the family, for the prefix part just mirror everything.

PS-XOR(x) is clmul(x, -1)

Since every step is a clmul by 1 + 2^k and if you clmul those constants together you get -1 (the two's complement -1, not the -1 in the ring formed by XOR and clmul over bitvectors of length k, which would just be 1 again), or from a more intuitive angle, what the PS-XOR is supposed to do in the first place is XOR each bit into all higher bits. So it inherits the properties, such as inversibility (-1 is odd). The inverse of PS-XOR is x ^ (x << 1).

It also inherits that PS-XOR(x) ^ PS-XOR(y) == PS-XOR(x ^ y) from the distributivity of clmul.

Since clmul is commutative and associative, the steps in PS-XOR can be reordered.

PS-OR(x) is or_mul(x, -1)

This isn't as nice (XOR and clmul form a ring structure) since OR doesn't form a group because it has no negation (this is just a fancy way of saying that you can't un-OR numbers, in the way you can un-ADD by subtraction or un-XOR with XOR again) and so it doesn't extended to a ring, but it can be extended into a semiring by defining the following multiplication:

static uint or_mul(uint a, uint b)
{
    uint r = 0;
    for (int i = 0; i < 32; i++)
    {
        if ((a & 1) == 1)
            r |= b; // the only difference is here
        a >>= 1;
        b <<= 1;
    }
    return r;
}

It's not fancy math notation but you get the idea.

This (with OR) forming a commutative semiring (proof "left as an exercise to the reader"), it has some nice properties:

or_mul(a, b) == or_mul(b, a) commutivity
or_mul(a, or_mul(b, c)) == or_mul(or_mul(a, b), c) associativity
or_mul(a, b | c) == or_mul(a, b) | or_mul(a, c) left distributivity
or_mul(a | b, c) == or_mul(a, c) | or_mul(b, c) right distributivity

But, of course, no multiplicative inverses. Except when multiplying by 1, but that doesn't count since it's the multiplicative identity.

PS-OR(x) == or_mul(x, -1), and the individual steps of the PS-OR are of the form x = or_mul(x, 1 + 2^k). The or_mul of all those constants together is, unsurprisingly, -1 (though this notation is now slightly odd since this was all a semiring, I mean the element that has all bits set).

And again, PS-OR inherits PS-OR(x) | PS-OR(y) = PS-OR(x | y) from distributivity.

And again, since or_mul is commutative and associative, the steps in PS-OR can be reordered.

PS-OR(x) is also `x | -x`

This one doesn't have a nice mirrored equivalent, just the obvious one where you insert a bit-reversal before and after. It also doesn't have an obvious relation to or_mul. As for why it's the case, given that (in string notation) -(a10^k) = (~a)10^k; a10^k | -(a10^k) = (~a|a)10^k = 1^∞10^k, so it copies the lowest set bit into all higher bits, just as PS-OR does.

Thursday 24 November 2016

Parallel prefix/suffix operations

Some members of the family

Properties/structure

PS-XOR(x) is clmul(x, -1)

PS-OR(x) is or_mul(x, -1)

PS-OR(x) is also x | -x

PS-OR(x) is also `x | -x`