Bits, Math and Performance(?): 2013

Saturday 28 September 2013

Determining the cardinality of a set described by a mask and bounds

In other words, calculating this \begin{equation} | \left\{ x | (x \& m) = v \land x \ge a \land x \le b \right\} | \end{equation}

A BDD (which haroldbot uses to solve this) has no trouble with that at all, but the structure the problem is so nice that I thought it should be possible to do better. And it is, though the solution I'm about to present is probably far from optimal. I don't know how to significantly improve on it, but I just get a very non-scientific gut-feeling that there should be a fundamentally better way to do it.

The cardinality of the set without the bounds (only the mask) is obviously trivial to compute (as 1 << popcnt(~m)), and the bounds divide that set in three parts: items that are lower than the lower bound ("left part"), items that are higher than the upper bound ("right part")), and items that are actual set of interest ("middle part"). The main idea it is built on is that it's relatively easy to solve the problem if it didn't have a lower bound and the upper bound was a power of two, with a formula roughly similar to the one above. Using that, all the powers of two that fit before lower bound can be counted from high to low, giving the cardinality of the "left part". The same can be done for the "right part", actually in exactly the same way, by complementing the mask and the upper bound. Obviously the cardinality of the "middle part" can be computed from this.

And here's the code. It doesn't like it when the cardinality is 2³², and watch out for weird corner cases such as when the lower bound is bigger than the upper bound (why would you even do that?). It usually works, that's about as much as I can say - I didn't prove it correct.

static uint Cardinality(uint m, uint v, uint a, uint b)
{
    // count x such that a <= x <= b && (x & m) == v
    // split in three parts:
    // left  = 0 <= x < a
    // right = b < x <= -1
    // the piece in the middle is (1 << popcnt(~m)) - left - right
    uint left = 0;
    uint x = 0;
    for (int i = 32 - 1; i >= 0; i--)
    {
        uint mask = 1u << i;
        if (x + mask <= a)
        {
            x += mask;
            uint mask2 = 0 - mask;
            if ((x - 1 & m & mask2) == (v & mask2))
            {                        
                uint amount = 1u << popcnt(~(m | mask2));
                left += amount;
            }
        }
    }
    uint right = 0;
    uint y = 0;
    for (int i = 32 - 1; i >= 0; i--)
    {
        uint mask = 1u << i;
        if (y + mask <= ~b)
        {
            y += mask;
            uint mask2 = 0 - mask;
            if ((y - 1 & m & mask2) == (~v & m & mask2))
            {
                uint amount = 1u << popcnt(~(m | mask2));
                right += amount;
            }
        }
    }
    uint res = (uint)((1UL << popcnt(~m)) - (left + right));
    return res;
}

The loops can be merged of course, but for clarity they're separate here.

If you have any improvements, please let me know.

Saturday 10 August 2013

Announcing haroldbot

haroldbot was an ircbot (hence the name) that solves some bitmath problems. The title is actually a lie - haroldbot has been around for a while now. But now it finally got its own website.
haroldbot.nl
Check it out, if you work with bits you will probably find this useful.

Thursday 30 May 2013

Carryless multiplicative inverse

Note: this post is neither about the normal multiplicative inverse, nor the modular multiplicative inverse. This other post has information about the modular multiplicative inverse, which might be what you were looking for.

Mathematicians may call carryless multiplication "multiplication in GF(2^n)", but that doesn't explain how it works - recall the shift-and-add algorithm for multiplication:

static uint mul(uint a, uint b)
{
    uint r = 0;
    while (b != 0)
    {
        if ((a & 1) != 0)
            r += b;
        a >>= 1;
        b <<= 1;
    }
    return r;
}

Carryless multiplication is a very simple variation on that: do the addition without carries. That's just a XOR.

static uint cl_mul(uint a, uint b)
{
    uint r = 0;
    while (b != 0)
    {
        if ((a & 1) != 0)
            r ^= b;      // carryless addition is xor
        a >>= 1;
        b <<= 1;
    }
    return r;
}

It has some applications in complicated cryptography related algorithms, but it also seems like this should be an interesting and powerful operation when working with bits, and it may well be, but I know of almost no uses for it (~~besides, Intel's implementation is so slow that it often wouldn't help~~ Intel made it over twice as fast in Haswell). But anyway, let's just start with its basic properties: like normal multiplication, it's commutative and associative. It's also distributive, but over xor instead of over addition. None of this is very surprising.

As an aside, using associativity, it can be shown that the parallel suffix with XOR (which does have some known uses in bitmath, for example in implementing compress_right in software), code shown below, is equivalent to a carryless multiplication by -1.

// parallel suffix with XOR
x ^= x << 1;
x ^= x << 2;
x ^= x << 4;
x ^= x << 8;
x ^= x << 16;

Every step is clearly a carryless multiplication, by 3, 5, 17, 257, and 65537 respectively. So it's equivalent to:
clmul(clmul(clmul(clmul(clmul(x, 3), 5), 17), 257), 65537) which can be rearranged (using associativity) to:
clmul(x, clmul(clmul(clmul(clmul(3, 5), 17), 257), 65537)) which works out to clmul(x, -1). Of course it was supposed to work out that way, because every bit of the result should be the XOR of all bits up to (and including) that bit, but it's nice that it also follows easily from a basic property. Incidentally if you have a full-width carryless multiplication, multiplying by -1 also computes the parallel prefix with XOR in the upper bits (the upper bit of the low word, which is the parity of the input, is shared by the suffix and the prefix.)

Carryless multiplication also shares an other property with normal multiplication: there are multiplicative inverses modulo 2ⁿ (and also modulo other numbers, but 2ⁿ is of particular interest since we're working in that by default anyway). Again there are only inverses for odd numbers, and it's equally obvious (as for normal multiplication) why that should be so - an even multiplier will throw out at least one high order bit. First, here's an example of carrlessly multiplying x by -1 and then carrylessly multiplying that by 3.

x = {d}{c}{b}{a}    // the letters are bits
y = cl_mul(x, -1) = {d^c^b^a}{c^b^a}{b^a}{a}
z = cl_mulinv(-1) = 0011
cl_mul(y, z) = {d^c^b^a ^ c^b^a}{c^b^a ^ b^a}{b^a ^ a}{a} = {d}{c}{b}{a}

Ok, so that worked out well, and it also gives part the answer to exercise 3 in chapter 5 of Hacker's Delight (about whether parallel prefix/suffix with XOR is invertible and how) because a carryless multiplication by -1 is the same as the parallel suffix with XOR. A carryless multiplication of y by 3 is of course just y ^ (y << 1).

But back to actually computing the inverse. The inverse had better be odd, so bit 0 is already known, and for all the other bits follow these steps

if the remainder is 1, stop
if bit k is 0, go to step 5
set bit k in the inverse
xor the remainder with input << k
increment k and go to step 1

Step 4 always resets the offending bit because the input had to be odd, so it's obvious that the remainder always ends up being 1 eventually, and so the algorithm always terminates. Moreover, even in the worst case it only has to process every bit but one, and continuing after the remainder becomes 1 simply does nothing, so step 1 could read "if k is 32" (or some other number, depending on how many bits your ints are wide), which is easier to unroll and better suited for a hardware implementation (not that I've seen this operation implemented in hardware anywhere).
For example, in C# it could look like this:

static uint clmulinv(uint x)
{
    uint inv = 1;
    uint rem = x;
    for (int i = 1; i < 32; i++)
    {
        if (((rem >> i) & 1) != 0)
        {
            rem ^= x << i;
            inv |= 1u << i;
        }
    }
    return inv;
}

A variation of the algorithm to find a multiplicative inverse modulo a power of two (see inv here) also works, which is useful when clmul is fast:

static uint clmulinv(uint d)
{
    uint x = 1;
    for (int i = 0; i < 5; i++)
    {
        x = clmul(x, clmul(x, d));
    }
    return x;
}

The first iteration sets x to d, that can be done immediately to skip an iteration.

Some sample inverses

1  0x00000001
3  0xFFFFFFFF
5  0x55555555
7  0xDB6DB6DB
9  0x49249249
11 0x72E5CB97
13 0xD3A74E9D
15 0x33333333

The definition of clmul at the start of the post was meant to be just that, a faster way to emulate it is this:

static uint clmul(uint a, uint b)
{
    uint r = 0;
    do
    {
        r ^= a * (b & (0 - b));
        b &= b - 1;
        r ^= a * (b & (0 - b));
        b &= b - 1;
        r ^= a * (b & (0 - b));
        b &= b - 1;
        r ^= a * (b & (0 - b));
        b &= b - 1;
    } while (b != 0);
    return r;
}

This works by extracting a bit from b and multiplying by it (which just shifts a left), then resetting that bit. This can be unrolled safely since when b == 0, no further changes are made to r automatically. The 0 - b thing is due to an unfortunate misfeature of C#, negating unsigned integers converts it to a long.

A similar trick works for the inverse:

static uint clmulinv(uint x)
{
    uint inv = 1;
    uint rem = x - 1;
    while (rem != 0)
    {
        uint m = rem & (0 - rem);
        rem ^= x * m;
        inv += m;
    }
    return inv;
}

By the way, why is this post so popular? Please let me know in the comments down below.

Thursday 11 April 2013

Improving bounds when some bits have a known value

This problem is closely related the series of problems discussed in calculating the lower bound of the bitwise OR of two bounded variables (and some of the posts after that one), and the algorithm is very closely related, too. The question in this post is, suppose some bits may be known to be zero and some may be known to be one, is there a better lower/upper bound than the given one, and if so, what is it? That is, calculate \begin{equation} \min _{x \in [a, b] \wedge (x | \sim z) = x \wedge (x \& \sim o) = 0 } x \end{equation} and \begin{equation} \max _{x \in [a, b] \wedge (x | \sim z) = x \wedge (x \& \sim o) = 0 } x \end{equation} where z is a bitmask containing the bits that are allowed to be 0, and o is a bitmask containing the bits that are allowed to be 1.

The idea behind the algorithms is to do a binary search (the one-sided variant) over the numbers that the masks allow, for the lowest value bigger than or equal to the original lower bound (or smaller than or equal to the original upper bound, for the new upper bound). Just as in the case of propagating bounds through XOR, it may take more than one step, so there aren't many shortcuts. I called them both "reduce" even though ReduceMin actually increases the value, because their purpose is to reduce the range [min, max].

static uint ReduceMin(uint min, uint z, uint o)
{
    uint mask = z & o;   // note: mask is a subset of r
    uint r = o;
    uint m = 0x80000000 >> nlz(mask);
    while (m != 0)
    {
        // reset the bit if it can be freely chosen
        uint newr = r ^ (m & mask);
        if (newr >= min)
            // keep the change if still within bounds
            r = newr;
        m >>= 1;
    }
    return r;
}

static uint ReduceMax(uint max, uint z, uint o)
{
    uint mask = z & o;
    uint r = ~z;
    uint m = 0x80000000 >> nlz(mask);
    while (m != 0)
    {
        // set the bit if it can be freely chosen
        uint newr = r | (m & mask);
        if (newr <= max)
            // keep the change if still within bounds
            r = newr;
        m >>= 1;
    }
    return r;
}

There is one shortcut (that I know of): using nlz on every iteration, thereby skipping iterations where the current bit isn't even changed. With the implementation of nlz I was working with, that wasn't worth it, so whether it's actually a real shortcut or not is up for debate.

Occasionally the new lower bound can be higher than the new upper bound, that means the set of values was actually empty. If you were working with clockwise intervals, that changes to "if the new bounds aren't ordered the same way as the old ones" - ie if the interval was proper and the new one isn't or vice versa, the set is empty.