Thursday 29 November 2012

Tesseral arithmetic - useful snippets

This post doesn't introduce anything new, and is, in my opinion, boring. Feel free to skip.

My previous post didn't have too many useful snippets in it (mainly useful techniques to make your own snippets), and I thought I could improve on that. This post is not a good read in isolation - it's probably a good idea to read my previous post first, if you haven't already.

Tesseral addition (see previous post) was nice, but very often you only need to increment/decrement one dimension of a coordinate (for example when iterating over a portion of a Z-ordered grid in reading order), equivalent to adding/subtracting (1, 0) or (0, 1) to/from a coordinate. Since only one part of the coordinate changes, only about half as much code is necessary. Also, since the thing being added to the coordinate is a constant, one of the masking operations can be merged with it.

static uint IncX(uint z)
{
    uint xsum = (z | 0xAAAAAAAA) + 1;
    return (xsum & 0x55555555) | (z & 0xAAAAAAAA);
}

static uint IncY(uint z)
{
    uint ysum = (z | 0x55555555) + 2;
    return (ysum & 0xAAAAAAAA) | (z & 0x55555555);
}

static uint DecX(uint z)
{
    uint xsum = (z & 0x55555555) - 1;
    return (xsum & 0x55555555) | (z & 0xAAAAAAAA);
}

static uint DecY(uint z)
{
    uint ysum = (z & 0xAAAAAAAA) - 2;
    return (ysum & 0xAAAAAAAA) | (z & 0x55555555);
}
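
As a rough illustration of the "reading order" use case, here is a minimal C sketch (the snippets above are C#; the name visit_block and its parameters are made up for this example) that walks a small block of a Z-ordered grid row by row using only the increments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Increment only the x part (even bits) of a Z-order index. */
static uint32_t inc_x(uint32_t z)
{
    uint32_t xsum = (z | 0xAAAAAAAAu) + 1u;
    return (xsum & 0x55555555u) | (z & 0xAAAAAAAAu);
}

/* Increment only the y part (odd bits) of a Z-order index. */
static uint32_t inc_y(uint32_t z)
{
    uint32_t ysum = (z | 0x55555555u) + 2u;
    return (ysum & 0xAAAAAAAAu) | (z & 0x55555555u);
}

/* Hypothetical helper: stores the Z-order indices of a w-by-h block,
   starting at `origin`, in reading order. Returns the number stored. */
static size_t visit_block(uint32_t origin, unsigned w, unsigned h,
                          uint32_t *out, size_t cap)
{
    size_t count = 0;
    uint32_t row = origin;
    for (unsigned j = 0; j < h; j++) {
        uint32_t z = row;
        for (unsigned i = 0; i < w; i++) {
            assert(count < cap);   /* caller must provide enough room */
            out[count++] = z;
            z = inc_x(z);          /* step right within the row */
        }
        row = inc_y(row);          /* step down to the next row */
    }
    return count;
}
```

Starting at (1, 1), which is index 3, a 2 by 2 block visits the indices 3, 6, 9 and 12.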

My previous post only had TesseralMin, not the corresponding TesseralMax, so here you go:

public static uint TesseralMax(uint z, uint w)
{
    uint xdiff = (z & 0x55555555) - (w & 0x55555555);
    uint ydiff = (z >> 1 & 0x55555555) - (w >> 1 & 0x55555555);
    uint maskx = (uint)((int)xdiff >> 31);
    uint masky = (uint)((int)ydiff >> 31);
    uint xmax = (~maskx & z) | (maskx & w);
    uint ymax = (~masky & z) | (masky & w);
    return (xmax & 0x55555555) | (ymax & 0xAAAAAAAA);
}

Note that the only difference is that the mask and the complemented mask have switched places.

This TesseralMax and the TesseralMin from the previous post can be combined with the increments and decrements (and with full tesseral addition, but that's less frequently useful) to form saturating increments and decrements, useful for sampling around a position on a Z-ordered grid without getting out of bounds.

static uint IncXSat(uint z, uint xmax)
{
    uint xsum = ((z | 0xAAAAAAAA) + 1) & 0x55555555;
    uint xdiff = xsum - xmax;
    uint maskx = (uint)((int)xdiff << 1 >> 31);
    uint xsat = (maskx & xsum) | (~maskx & xmax);
    return xsat | (z & 0xAAAAAAAA);
}

static uint IncYSat(uint z, uint ymax)
{
    uint ysum = ((z | 0x55555555) + 2) & 0xAAAAAAAA;
    uint ydiff = ysum - ymax;
    uint masky = (uint)((int)ydiff >> 31);
    uint ysat = (masky & ysum) | (~masky & ymax);
    return ysat | (z & 0x55555555);
}

static uint DecXSat(uint z, uint xmin)
{
    uint xsum = ((z & 0x55555555) - 1) & 0x55555555;
    uint xdiff = xsum - xmin;
    uint maskx = (uint)((int)xdiff << 1 >> 31);
    uint xsat = (~maskx & xsum) | (maskx & xmin);
    return xsat | (z & 0xAAAAAAAA);
}

static uint DecYSat(uint z, uint ymin)
{
    uint ysum = ((z & 0xAAAAAAAA) - 2) & 0xAAAAAAAA;
    uint ydiff = ysum - ymin;
    uint masky = (uint)((int)ydiff >> 31);
    uint ysat = (~masky & ysum) | (masky & ymin);
    return ysat | (z & 0x55555555);
}

Merging them this way is nice, because only "half" of a TesseralMin or TesseralMax is necessary that way. On the other hand, they do have the overflow problem again, though that usually won't be a problem.
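
To see the saturation in action, here is a small C sketch of the x versions. Two things are assumptions on my part: the bound is passed as an interleaved x coordinate with all y bits zero (which is how the subtraction seems to be meant), and the mask is built by negating the extracted bit rather than with an arithmetic shift, since C does not guarantee the latter. The helper name x_sign_mask is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Broadcast bit 30 (the top x bit) to all 32 bits using only
   unsigned arithmetic. */
static uint32_t x_sign_mask(uint32_t v)
{
    return 0u - ((v >> 30) & 1u);
}

static uint32_t inc_x_sat(uint32_t z, uint32_t xmax)
{
    assert((xmax & 0xAAAAAAAAu) == 0);   /* bound must live in the x bits only */
    uint32_t xsum = ((z | 0xAAAAAAAAu) + 1u) & 0x55555555u;
    /* all ones if the 16-bit difference xsum - xmax is negative */
    uint32_t maskx = x_sign_mask(xsum - xmax);
    uint32_t xsat = (maskx & xsum) | (~maskx & xmax);
    return xsat | (z & 0xAAAAAAAAu);
}

static uint32_t dec_x_sat(uint32_t z, uint32_t xmin)
{
    assert((xmin & 0xAAAAAAAAu) == 0);   /* bound must live in the x bits only */
    uint32_t xsum = ((z & 0x55555555u) - 1u) & 0x55555555u;
    /* all ones if the 16-bit difference xsum - xmin is negative */
    uint32_t maskx = x_sign_mask(xsum - xmin);
    uint32_t xsat = (~maskx & xsum) | (maskx & xmin);
    return xsat | (z & 0xAAAAAAAAu);
}
```

With xmax = 5 (that is x = 3), incrementing from (2, 1) gives (3, 1), and incrementing again stays at (3, 1). Decrementing (0, 1) with xmin = 0 stays at (0, 1), because the wrapped-around x of "minus one" compares as negative.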

Next time, back to "stuff with bounds".

Tuesday 27 November 2012

Tesseral arithmetic

Introductions are boring, feel free to skip to the interesting stuff

Tesseral arithmetic is a type of arithmetic that operates on interleaved coordinates. That may not seem very useful, so first, when would you want to do that?

The Z-order curve is a space-filling curve (also known as Morton order, Morton coordinates, etc) that is closely related to quad trees (and octrees) and (in some contexts) improves the locality of reference when working with multidimensional data.

In essence, it maps multidimensional coordinates to single-dimensional coordinates, which can be used to address memory, and it does so in a way that sometimes leads to better locality of reference than concatenating the parts of a coordinate into a longer one. The trick is to interleave the bits. That is not the best mapping (ie. it does not give optimal locality of reference), but it's interesting that it works so well for such a simple trick.
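
For concreteness, here is a minimal C sketch of the interleaving for two 16-bit coordinates, with x in the even bits and y in the odd bits (the convention the masks 0x55555555 and 0xAAAAAAAA in this post imply). The function names are made up for this example, and the shift-and-mask steps are the usual "spread the bits" approach, not something specific to this blog.

```c
#include <assert.h>
#include <stdint.h>

/* Spread the low 16 bits of x so that bit i moves to bit 2i. */
static uint32_t dilate16(uint32_t x)
{
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Inverse of dilate16: gather the even bits of z into the low 16 bits. */
static uint32_t undilate16(uint32_t z)
{
    z &= 0x55555555u;
    z = (z | (z >> 1)) & 0x33333333u;
    z = (z | (z >> 2)) & 0x0F0F0F0Fu;
    z = (z | (z >> 4)) & 0x00FF00FFu;
    z = (z | (z >> 8)) & 0x0000FFFFu;
    return z;
}

static uint32_t morton_pack(uint32_t x, uint32_t y)
{
    assert(x <= 0xFFFFu && y <= 0xFFFFu);
    return dilate16(x) | (dilate16(y) << 1);
}

static uint32_t morton_x(uint32_t z) { return undilate16(z); }
static uint32_t morton_y(uint32_t z) { return undilate16(z >> 1); }
```

For example, (3, 5) packs to 39: x = 0b011 lands in the even bits as 0b000101, y = 0b101 lands in the odd bits as 0b100010.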

But where it really gets interesting is when you have interleaved coordinates and you want to do math with them. You could unpack them, do your math, and then repack, but if you follow the previous link you can see that while unpacking and packing are simple and fast relative to the mappings of other space-filling curves, unpacking and packing would add a lot of overhead to what would otherwise be simple math.

That's where tesseral arithmetic comes in.

Bitwise AND, OR and XOR still work the same way, because the bits of the result only depend on the corresponding bits in the inputs. Shifts are simple - the shift count must be multiplied by two. So for example x ^ (x << 1) becomes x ^ (x << 2) in tesseral arithmetic.
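
A quick way to convince yourself of the doubled shift count is to compare against the unpack/compute/repack route. The C sketch below does that for x ^ (x << 1); all names are made up for this example, and the coordinates are assumed to be 16 bits each.

```c
#include <assert.h>
#include <stdint.h>

/* Spread the low 16 bits of x so that bit i moves to bit 2i. */
static uint32_t dilate16(uint32_t x)
{
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

static uint32_t morton_pack(uint32_t x, uint32_t y)
{
    assert(x <= 0xFFFFu && y <= 0xFFFFu);
    return dilate16(x) | (dilate16(y) << 1);
}

/* Reference: apply v ^ (v << 1) to each 16-bit coordinate, then pack. */
static uint32_t xorshift_unpacked(uint32_t x, uint32_t y)
{
    uint32_t nx = (x ^ (x << 1)) & 0xFFFFu;
    uint32_t ny = (y ^ (y << 1)) & 0xFFFFu;
    return morton_pack(nx, ny);
}

/* Same operation directly on the interleaved value: the shift count doubles. */
static uint32_t xorshift_tesseral(uint32_t z)
{
    return z ^ (z << 2);
}
```

Both coordinates are processed at once, and bits shifted out of the top of either coordinate simply fall off the top of the 32-bit word.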

Addition is more trouble. The carries in normal addition propagate into bits they shouldn't be affecting in tesseral arithmetic. But consider what would happen if the bit pairs at odd positions each summed to 1. A carry coming into an odd position would always be passed on, and no extra carries would be generated from odd positions. So if the bits at odd positions are just right, the bits at the even positions are summed tesserally, with the carry moving two places instead of one. Obviously this extends to the odd bits as well, when the bits at even positions are just right. This actually makes tesseral addition quite simple:

static uint TesseralAdd(uint z, uint w)
{
    uint xsum = (z | 0xAAAAAAAA) + (w & 0x55555555);
    uint ysum = (z | 0x55555555) + (w & 0xAAAAAAAA);
    return (xsum & 0x55555555) | (ysum & 0xAAAAAAAA);
}
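
As a sanity check, here is a C port of the same addition next to a reference that adds the unpacked coordinates (each wrapping modulo 2^16) and packs the result. The helper names are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Spread the low 16 bits of x so that bit i moves to bit 2i. */
static uint32_t dilate16(uint32_t x)
{
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

static uint32_t morton_pack(uint32_t x, uint32_t y)
{
    assert(x <= 0xFFFFu && y <= 0xFFFFu);
    return dilate16(x) | (dilate16(y) << 1);
}

static uint32_t tesseral_add(uint32_t z, uint32_t w)
{
    /* odd bit pairs sum to one, so x carries skip over the y bits */
    uint32_t xsum = (z | 0xAAAAAAAAu) + (w & 0x55555555u);
    /* even bit pairs sum to one, so y carries skip over the x bits */
    uint32_t ysum = (z | 0x55555555u) + (w & 0xAAAAAAAAu);
    return (xsum & 0x55555555u) | (ysum & 0xAAAAAAAAu);
}

/* Reference: add the plain coordinates, wrap to 16 bits, then interleave. */
static uint32_t add_reference(uint32_t x1, uint32_t y1,
                              uint32_t x2, uint32_t y2)
{
    return morton_pack((x1 + x2) & 0xFFFFu, (y1 + y2) & 0xFFFFu);
}
```

For example, (3, 5) + (1, 2) is 39 + 9 tesserally, which gives 58, the index of (4, 7).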

Unsurprisingly, the same principle applies to subtraction. In subtraction, borrows are passed on unmodified through a pair of bits if they sum to zero, or in other words, if both are zero. In a way that's conceptually even simpler than addition.

static uint TesseralSubtract(uint z, uint w)
{
    uint xdiff = (z & 0x55555555) - (w & 0x55555555);
    uint ydiff = (z & 0xAAAAAAAA) - (w & 0xAAAAAAAA);
    return (xdiff & 0x55555555) | (ydiff & 0xAAAAAAAA);
}

But multiplication isn't that nice. The problem is that multiplication is basically built out of a lot of shifts and additions (even though it's not implemented that way in hardware anymore), and the additions aren't tesseral, nor can they be made tesseral.
Unless, of course, we implement multiplication in software:

static uint TesseralMultiply(uint z, uint w)
{
    uint x = z & 0x55555555;
    uint y = w & 0x55555555;
    uint xres = 0;
    while (x != 0)
    {
        if ((x & 1) != 0)
            xres = (xres | 0xAAAAAAAA) + y;
        y <<= 2;
        x >>= 2;
    }

    x = z & 0xAAAAAAAA;
    y = w & 0xAAAAAAAA;
    uint yres = 0;
    while (x != 0)
    {
        if ((x & 2) != 0)
            yres = (yres | 0x55555555) + y;
        y <<= 2;
        x >>= 2;
    }

    return (xres & 0x55555555) | (yres & 0xAAAAAAAA);
}

But that doesn't achieve the goal of being faster than unpacking, doing math, and repacking. If anyone has a better idea, please let me know.

So ok, no tricks for multiplication or division. But we're not done. As I hinted in my previous post, many bitwise tricks extend to tesseral arithmetic. For example, taking the absolute value of both parts of the coordinate simultaneously, using the same trick as in my previous post (working with the signbit). The basic principle is simple: replace all operations by their tesseral counterparts. Then look for simplifications and other improvements.

static uint TesseralAbs(uint z)
{
    uint maskx = (uint)((int)z << 1 >> 31);
    uint masky = (uint)((int)z >> 31);

    // this is a simplified tesseral addition (followed by a xor)
    uint xabs = (z & 0x55555555) + maskx ^ maskx;
    uint yabs = (z & 0xAAAAAAAA) + masky ^ masky;

    return (xabs & 0x55555555) | (yabs & 0xAAAAAAAA);
}

The mask is known to be either all ones or all zeroes. It may seem at first as though that means we'd have to OR it with something to make the "in between" bits sum to one, but when the mask is zero there are no carries to pass on anyway. So the OR can be skipped.

But calculating absolute values of coordinates doesn't happen that often. So let's calculate an element-wise minimum, using the same basic principle as before: replace normal operators by tesseral operators. This time however, a substantial improvement over the non-tesseral version is possible.

static uint TesseralMin(uint z, uint w)
{
    // these are tesseral subtractions, of course
    uint xdiff = (z & 0x55555555) - (w & 0x55555555);
    uint ydiff = (z >> 1 & 0x55555555) - (w >> 1 & 0x55555555);

    uint maskx = (uint)((int)xdiff >> 31);
    uint masky = (uint)((int)ydiff >> 31);

    uint xmin = (maskx & z) | (~maskx & w);
    uint ymin = (masky & z) | (~masky & w);

    return (xmin & 0x55555555) | (ymin & 0xAAAAAAAA);
}

And there's something very nice about how that worked out. In the normal min, there was a problem with overflow. That doesn't happen here, because for xdiff there was an extra bit anyway, and for ydiff that extra bit could easily be arranged by shifting right by 1. That makes the comparison unsigned, though, because the "extra bit" is zero, not a sign-extended bit.

So that's it for this post. Many other bitwise tricks can be extended to tesseral math, using the same basic principle. And of course this all generalizes to higher dimensions as well.

In the next post, I'll have some more useful snippets for tesseral arithmetic.

There are some other references for this type of arithmetic or its generalizations, for example The Art of Computer Programming volume 4A which calls this "working with fragmented fields" and Morton-order Matrices Deserve Compilers’ Support which calls this the "algebra of dilated integers".

By the way I originally wrote this post thanks to (or maybe due to?) this article, which I found by searching how to do coordinate arithmetic in a quad tree with Morton order. That's where the title comes from. Unfortunately the article didn't really say how to actually do it, so I worked that out (though the algebra of dilated integers had been explored before, I did not know it went by that name) and posted it for the benefit of other people who perhaps traversed the same steps up to that point.

Sunday 18 November 2012

The basics of working with the signbit

this is a filler (in that it is much easier than the usual material), but it seems like most readers only read the fillers anyway

When I write signbit, I mean the upper bit in a bit string that is interpreted as a two's complement signed integer.

Central to working with the signbit is the idea that signed shift right aka arithmetic shift right copies the signbit to other bits, and specifically, a signed shift right by 31 (or 63 or in general, one less than the size of your numbers) broadcasts the signbit to all other bits.

Perhaps the most obvious thing you can do with that is broadcasting an arbitrary bit to all other bits. Simply shift that bit into the signbit, and then shift right by 31:

static int broadcastbit(int value, int bitindex)
{
    // put the target bit in the sign
    int temp = value << (31 - bitindex);
    // copy it to all bits
    return temp >> 31;
}

In C, that's not portable. Letting a signed left shift overflow (which could easily happen here) is undefined behaviour (UB), and right-shifting a negative signed value is implementation-defined. But this is C# code (the source of this page will tell you so) where it's perfectly well-defined. And anyway, this is the kind of non-portable code that is safe to use in practice; the expected thing happens when you combine a sane compiler with a typical platform (say, MSVC on x86). But, of course, purists won't like it and on platforms without arithmetic right shift it's probably not going to work.
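
For the purists, here is one way to get the same result in C using only unsigned arithmetic. Note that this swaps the technique: instead of shifting the bit into the sign and shifting it back, it extracts the bit and negates it. The function name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Returns all ones if bit `bitindex` of value is set, otherwise zero.
   Unsigned negation wraps modulo 2^32, so 0 - 1 is 0xFFFFFFFF. */
static uint32_t broadcast_bit(uint32_t value, unsigned bitindex)
{
    assert(bitindex < 32);
    return 0u - ((value >> bitindex) & 1u);
}
```

A decent compiler will probably produce something reasonable for this, but I have not checked what any particular compiler does with it.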

That actually applies to most of this blog, I suppose.

On to other tricks. This one is slightly harder to grasp, but more useful: calculating the absolute value of an integer without branching. First, the simple to understand version.

static int abs(int value)
{
    // make a mask that is all ones if negative, or all zeroes if non-negative
    int mask = value >> 31;
    // select -value if negative, or value if non-negative
    return (mask & -value) | (~mask & value);
}

That's just the usual branchless selection between two things.

The better way to do this has to do with how negation works. The negation of a number x is ~x + 1 (first definition) or ~(x - 1) (second definition). Those definitions are, of course, equivalent. The trick (and you may have seen this coming), is to make the complement and the increment/decrement conditional based on the mask.

static int abs(int value)
{
    // make a mask that is all ones if negative, or all zeroes if non-negative
    int mask = value >> 31;
    // conditionally complement and subtract -1 (first definition)
    return (value ^ mask) - mask;
    // alternative: conditionally add -1 and complement (second definition)
    // return (value + mask) ^ mask;
}

I've heard that the version of abs using the first definition is patented. That probably doesn't hold up (there will be a mountain of prior art and it's an obvious trick that anyone could derive), and no one's going to find out you're using it much less sue you for it, but you could use the version using the second definition just to be on the safe side.
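
Here are both definitions side by side as a C sketch. To stay within well-defined C, the arithmetic is done on unsigned values and the result is returned as unsigned, which also gives the most negative input a sensible answer (2^31). The function names are made up for this example.

```c
#include <stdint.h>

/* All ones if value is negative, all zeroes otherwise. */
static uint32_t sign_mask(int32_t value)
{
    return 0u - ((uint32_t)value >> 31);
}

/* First definition of negation: -x == ~x + 1, applied conditionally. */
static uint32_t abs_first(int32_t value)
{
    uint32_t mask = sign_mask(value);
    return ((uint32_t)value ^ mask) - mask;
}

/* Second definition of negation: -x == ~(x - 1), applied conditionally. */
static uint32_t abs_second(int32_t value)
{
    uint32_t mask = sign_mask(value);
    return ((uint32_t)value + mask) ^ mask;
}
```

When the mask is all ones, subtracting it adds one and adding it subtracts one, so each function performs exactly one of the two definitions of negation; when the mask is zero, both leave the value alone.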

One good thing about the simple version of abs is that it's using a generic branchless selection. That means you're not limited to choosing between value and -value, you can select anything. For example, you can subtract two numbers and use the sign of the difference to select the (unsigned) smallest one. That doesn't always work. The subtraction must not overflow, otherwise it selects the wrong one. The problem goes away if the inputs are smaller than ints, for example if they are bytes.

static byte min(byte x, byte y)
{
    int difference = x - y;
    // make a mask that is all ones if x < y, or all zeroes if x >= y
    int mask = difference >> 31;
    // select x if x < y, or y if x >= y
    return (byte)((mask & x) | (~mask & y));
    // alternative: use arithmetic to select the minimum
    // return (byte)(y + (difference & mask));
}

The weird mixing of signed and unsigned may be confusing. Try to think of numbers as pure bit strings and only look at the type when an operator depends on it. That's closer to what actually happens in a computer, and it's less confusing that way.

The problem also goes away if you can use the carry flag instead of the signbit, because then the flag is not held in a bit of the result but in a separate place, and thus it doesn't "eat into the range of values". But high level languages are too good for the carry flag or something like that, and don't enable you to use it. So here's min in x86 assembly:

    ; inputs are in eax and edx, result in eax
    sub eax, edx
    sbb ecx, ecx    ; makes ecx all ones if carry (ie. if eax < edx)
    and eax, ecx
    add eax, edx

Whether this or the more usual branchless version with cmov is faster depends on the processor.
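
The closest thing in C is to compute the borrow explicitly: for unsigned operands, x < y is exactly the borrow out of x - y. The sketch below mirrors the assembly step by step; whether a compiler actually turns it into sub/sbb is an assumption I have not verified, and the function name is made up.

```c
#include <stdint.h>

/* Branchless unsigned minimum over the full 32-bit range. */
static uint32_t min_u32(uint32_t x, uint32_t y)
{
    uint32_t difference = x - y;             /* wraps modulo 2^32 */
    uint32_t mask = 0u - (uint32_t)(x < y);  /* all ones iff the subtraction borrowed */
    return y + (difference & mask);          /* y + (x - y) if x < y, else y */
}
```

Because the borrow is kept separately from the difference, this works even when the inputs are more than 2^31 apart.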

And that has nothing to do with the signbit anymore, I know.

These tricks, and many others, also extend to tesseral arithmetic, which I'll cover in my next post, which isn't a filler.

Sunday 16 September 2012

Calculating the lower and upper bound of the bitwise OR of two variables that are bounded and may have bits known to be zero

This new problem clearly is related to two of my previous posts. But this time, there is slightly more information. It may look like a contrived, purely theoretical, problem, but it actually has applications in abstract interpretation. Static knowledge about the values that variables could have at runtime often takes the form of a range and a number that the variable is known to be a multiple of, which is most commonly a power of two.

The lower bound will be \begin{equation} \min _{x \in [a, b] \wedge m\backslash x, y \in [c, d] \wedge n\backslash y} x | y \end{equation} And the upper bound will be \begin{equation} \max _{x \in [a, b] \wedge m\backslash x, y \in [c, d] \wedge n\backslash y} x | y \end{equation} Where m\x means "x is divisible by m".

So how can we calculate them faster than direct evaluation? I don't know, and to my knowledge, no one else does either. But if sound (ie only overapproximating) but non-tight bounds are OK, then there is a way. Part of the trick is constraining m and n to be powers of two. It's safe to use m = m & -m. That should look familiar - it's extracting the rightmost bit of m. Another explanation of "the rightmost bit of m" is "the highest power of two that divides m". That doesn't rule out any values of x that were valid before, so it's a sound approximation.
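
In C, with made-up names, the rounding of m and the resulting (weaker) divisibility test could look like this sketch:

```c
#include <assert.h>
#include <stdint.h>

/* The highest power of two that divides m, ie. the rightmost bit of m. */
static uint32_t largest_pow2_divisor(uint32_t m)
{
    assert(m != 0);   /* zero has no rightmost bit */
    return m & (0u - m);
}

/* Nonzero if x is a multiple of p, where p must be a power of two. */
static int is_multiple_of_pow2(uint32_t x, uint32_t p)
{
    assert(p != 0 && (p & (p - 1u)) == 0);
    return (x & (p - 1u)) == 0;
}
```

For m = 12 this gives 4. Every multiple of 12 is also a multiple of 4, so no valid x is lost, but values like 4 and 8 become allowed even though they are not multiples of 12.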

Strangely, for minOR, if the bounds are pre-rounded to multiples of their corresponding powers of two, there is absolutely no difference in the code whatsoever. It is possible to set a bit that is known to be zero in that bound, but that can only happen if that bit is one in the other bound anyway, so it doesn't affect the result. The other case, setting a bit that is not known to be zero, is the same as it would be with only the range information.

maxOR is a problem though. In maxOR, bits at the right are set which may be known to be zero. Some of those bits may have to be reset. But how many? To avoid resetting too many bits, we have to round the result down to a multiple of min(m, n). That's clearly sound - if a bit can't be one in either x or y, obviously it can't be one in the result. But it turns out not to be tight - for example for [8, 9] 1\x and [0, 8] 4\y, it computes 0b1111, even though the last two bits can only be 0b00 or 0b01 (y does not contribute to these bits, and the range of x is so small that the bits only have those values) so the tight upper bound is 0b1101. If that's acceptable, the code would be

static uint maxOR(uint a, uint b, uint c, uint d, uint m, uint n)
{
    uint resettableb = (a ^ b) == 0 ? 0 : 0xFFFFFFFF >> nlz(a ^ b);
    uint resettabled = (c ^ d) == 0 ? 0 : 0xFFFFFFFF >> nlz(c ^ d);
    uint resettable = b & d & (resettableb | resettabled);
    uint target = resettable == 0 ? 0 : 1u << bsr(resettable);
    uint targetb = target & resettableb;
    uint targetd = target & resettabled & ~resettableb;
    uint newb = b | (targetb == 0 ? 0 : targetb - 1);
    uint newd = d | (targetd == 0 ? 0 : targetd - 1);
    uint mask = (m | n) & (0 - (m | n));
    return (newb | newd) & (0 - mask);
}

Which also uses a sneaky way of getting min(m, n) - by ORing them and then taking the rightmost bit. Because why not.
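
To make the gap between "sound" and "tight" concrete, here is a C port of the routine above next to a brute-force evaluation. The names max_or_bound and max_or_brute are made up, nlz and bsr are written as plain loops because they are not defined in this post, and the brute-force version is only meant for small ranges.

```c
#include <assert.h>
#include <stdint.h>

/* Number of leading zeros, simple loop version. */
static unsigned nlz(uint32_t x)
{
    unsigned count = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; count++; }
    return count;
}

/* Index of the leftmost set bit; x must not be zero. */
static unsigned bsr(uint32_t x) { return 31u - nlz(x); }

static uint32_t max_or_bound(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                             uint32_t m, uint32_t n)
{
    uint32_t resettableb = (a ^ b) == 0 ? 0 : 0xFFFFFFFFu >> nlz(a ^ b);
    uint32_t resettabled = (c ^ d) == 0 ? 0 : 0xFFFFFFFFu >> nlz(c ^ d);
    uint32_t resettable = b & d & (resettableb | resettabled);
    uint32_t target = resettable == 0 ? 0 : 1u << bsr(resettable);
    uint32_t targetb = target & resettableb;
    uint32_t targetd = target & resettabled & ~resettableb;
    uint32_t newb = b | (targetb == 0 ? 0 : targetb - 1u);
    uint32_t newd = d | (targetd == 0 ? 0 : targetd - 1u);
    uint32_t mask = (m | n) & (0u - (m | n));   /* min(m, n) for powers of two */
    return (newb | newd) & (0u - mask);         /* round down to a multiple of it */
}

/* Direct evaluation over all valid pairs; returns 0 if there are none. */
static uint32_t max_or_brute(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                             uint32_t m, uint32_t n)
{
    assert(a <= b && c <= d && b < 0xFFFFFFFFu && d < 0xFFFFFFFFu);
    assert(m != 0 && n != 0);
    uint32_t best = 0;
    for (uint32_t x = a; x <= b; x++) {
        if (x % m != 0) continue;
        for (uint32_t y = c; y <= d; y++) {
            if (y % n != 0) continue;
            if ((x | y) > best) best = x | y;
        }
    }
    return best;
}
```

For the example above, max_or_bound(8, 9, 0, 8, 1, 4) gives 0b1111 while max_or_brute gives 0b1101.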

I haven't (yet?) found a nice way to calculate the tight upper bound. Even if I do, that still leaves things non-tight when the old m or n were not powers of two.

Friday 14 September 2012

Calculating the lower and upper bounds of the bitwise AND of two bounded variables

This post is closely related to the previous post and the post before it, so I strongly suggest you read those two first.

It's the same idea as before, but with bitwise AND instead of OR. That leads to some interesting symmetries. First, the definitions. The lower bound will be \begin{equation} \min _{x \in [a, b], y \in [c, d]} x \& y \end{equation} And the upper bound will be \begin{equation} \max _{x \in [a, b], y \in [c, d]} x \& y \end{equation} The algorithms given by Warren are

unsigned minAND(unsigned a, unsigned b, 
                unsigned c, unsigned d) {
   unsigned m, temp; 
 
   m = 0x80000000; 
   while (m != 0) {
      if (~a & ~c & m) {
         temp = (a | m) & -m; 
         if (temp <= b) {a = temp; break;} 
         temp = (c | m) & -m; 
         if (temp <= d) {c = temp; break;} 
      } 
      m = m >> 1; 
   } 
   return a & c; 
}

unsigned maxAND(unsigned a, unsigned b, 
                unsigned c, unsigned d) {
   unsigned m, temp; 
 
   m = 0x80000000; 
   while (m != 0) {
      if (b & ~d & m) {
         temp = (b & ~m) | (m - 1); 
         if (temp >= a) {b = temp; break;} 
      } 
      else if (~b & d & m) {
         temp = (d & ~m) | (m - 1); 
         if (temp >= c) {d = temp; break;} 
      } 
      m = m >> 1; 
   } 
   return b & d; 
}

Obviously, they follow the same basic idea. Try to set a bit so you can reset the bits to the right of it in the lower bound, or try to reset a bit so you can set the bits to the right of it in the upper bound. The same reasoning about starting at 0x80000000 >> nlz(~a & ~c) or 0x80000000 >> nlz(b ^ d) applies, and the same reasoning about "bits at and to the right of a ^ b" applies as well. I'll skip the "sparse loops" this time, they're nice enough but mainly instructive, and repeating the same idea twice doesn't make it twice as instructive. So straight to the loopless algorithms:

static uint minAND(uint a, uint b, uint c, uint d)
{
    uint settablea = (a ^ b) == 0 ? 0 : 0xFFFFFFFF >> nlz(a ^ b);
    uint settablec = (c ^ d) == 0 ? 0 : 0xFFFFFFFF >> nlz(c ^ d);
    uint settable = ~a & ~c & (settablea | settablec);
    uint target = settable == 0 ? 0 : 1u << bsr(settable);
    uint targeta = target & settablea;
    uint targetc = target & settablec & ~settablea;
    uint newa = a & (targeta == 0 ? 0xFFFFFFFF : 0-targeta);
    uint newc = c & (targetc == 0 ? 0xFFFFFFFF : 0-targetc);
    return newa & newc;
}

static uint maxAND(uint a, uint b, uint c, uint d)
{
    uint resettableb = (a ^ b) == 0 ? 0 : 0xFFFFFFFF >> nlz(a ^ b);
    uint resettabled = (c ^ d) == 0 ? 0 : 0xFFFFFFFF >> nlz(c ^ d);
    uint candidatebitsb = b & ~d & resettableb;
    uint candidatebitsd = ~b & d & resettabled;
    uint candidatebits = candidatebitsb | candidatebitsd;
    uint target = candidatebits == 0 ? 0 : 1u << bsr(candidatebits);
    uint targetb = target & b;
    uint targetd = target & d & ~b;
    uint newb = b | (targetb == 0 ? 0 : targetb - 1);
    uint newd = d | (targetd == 0 ? 0 : targetd - 1);
    return newb & newd;
}

Symmetry everywhere. But not really anything new to explain.

Next post, something new to explain.

Calculating the upper bound of the bitwise OR of two bounded variables

This post is closely related to the previous one, so I strongly suggest you read that one first.

The only difference with the previous post, is that this time, we're interested in the upper bound instead of the lower bound. In other words, evaluate
\begin{equation} \max _{x \in [a, b], y \in [c, d]} x | y \end{equation} The algorithm given by Warren in Hacker's Delight is

unsigned maxOR(unsigned a, unsigned b, 
               unsigned c, unsigned d) {
   unsigned m, temp; 
 
   m = 0x80000000; 
   while (m != 0) {
      if (b & d & m) {
         temp = (b - m) | (m - 1); 
         if (temp >= a) {b = temp; break;} 
         temp = (d - m) | (m - 1); 
         if (temp >= c) {d = temp; break;} 
      } 
      m = m >> 1; 
   } 
   return b | d; 
}

And it's really the same sort of idea as the algorithm to calculate the minimum, except this time we're looking for a place where both b and d are one, so we can try to reset that bit and set all the bits to the right of it.

Warren notes that m can start at 0x80000000 >> nlz(b & d), and once again the same principle holds: it's enough to only look at those bits which are one in b & d, and they can be visited from high to low with bsr:

static uint maxOR(uint a, uint b, uint c, uint d)
{
    uint bits = b & d;
    while (bits != 0)
    {
        uint m = 1u << bsr(bits);

        uint temp;
        temp = (b - m) | (m - 1);
        if (temp >= a) { b = temp; break; }
        temp = (d - m) | (m - 1);
        if (temp >= c) { d = temp; break; }

        bits ^= m;
    }
    return b | d;
}

And also, again, we can use that the bit we're looking for in b must be at or to the right of the leftmost bit in a ^ b (c ^ d for d), and that the selected bit doesn't actually have to be changed.

static uint maxOR(uint a, uint b, uint c, uint d)
{
    uint resettableb = (a ^ b) == 0 ? 0 : 0xFFFFFFFF >> nlz(a ^ b);
    uint resettabled = (c ^ d) == 0 ? 0 : 0xFFFFFFFF >> nlz(c ^ d);
    uint candidatebits = b & d & (resettableb | resettabled);
    uint target = candidatebits == 0 ? 0 : 1u << bsr(candidatebits);
    uint targetb = target & resettableb;
    uint targetd = target & resettabled & ~resettableb;
    uint newb = b | (targetb == 0 ? 0 : targetb - 1);
    uint newd = d | (targetd == 0 ? 0 : targetd - 1);
    return newb | newd;
}

Most of the code should be obvious after a moment's thought, but something interesting and non-symmetric happens for targetd. There, I had to make sure that a change is not made to both bounds (that would invalidate the whole idea of "being able to make the change without affecting that bit in the result"). In minOR that happened automatically because it looked at positions where the bits were different, so the targets couldn't both be non-zero. Here, one of the bounds has to be explicitly prioritized before the other.

Next post, maybe the same sort of thing but for bitwise AND. Then again, maybe not. I'll see what I can come up with.
edit: bitwise AND it is.

Thursday 13 September 2012

Calculating the lower bound of the bitwise OR of two bounded variables

What does that even mean?

Suppose you have the variables x in [a, b] and y in [c, d]. The question then is: what is the lowest possible value of x | y where x and y are both in their corresponding ranges. In other words, evaluate
\begin{equation} \min _{x \in [a, b], y \in [c, d]} x | y \end{equation} At a maximum of 2⁶⁴ iterations, direct evaluation is clearly not an option for 32-bit integers.

Fortunately, there is an algorithm that has a complexity linear in the number of bits, given by Warren in Hacker's Delight, Propagating Bounds through Logical Operations, which the license permits me to show here:

unsigned minOR(unsigned a, unsigned b, 
               unsigned c, unsigned d) {
   unsigned m, temp; 
 
   m = 0x80000000; 
   while (m != 0) {
      if (~a & c & m) {
         temp = (a | m) & -m; 
         if (temp <= b) {a = temp; break;} 
      } 
      else if (a & ~c & m) {
         temp = (c | m) & -m; 
         if (temp <= d) {c = temp; break;} 
      } 
      m = m >> 1; 
   } 
   return a | c; 
}

So let's break down what it's doing. It starts at the MSB, and then it searches for either the highest bit that is zero in a and one in c such that changing a to have that bit set and all bits to the right of it unset would not make the new a higher than b, or the highest bit that is zero in c and one in a such that changing c to have that bit set and all bits to the right of it unset would not make the new c higher than d, whichever one comes first.

That's literally easier to code than to explain, and I haven't even explained yet why it works.
Suppose the highest such bit is found in a. Setting that bit in a does not affect the value of a | c: that bit must have been set in c already, so it was already set in a | c, too. Resetting the bits to the right of that bit, however, can lower a | c. Notice that it is pointless to continue looking at lower bits - in a there are no more bits to reset, and for c there are no more bits that have the corresponding bit in a set.

Warren notes that m could start at 0x80000000 >> nlz(a ^ c) (where nlz is the "number of leading zeros" function), meaning it starts looking at the first bit that is different in a and c. But we can do better. Not only can we start at the first bit which is different in a and c, we could look at only those bits. That requires frequent invocation of the nlz function (or bsr, bit scan reverse, giving the index of the leftmost bit), but it maps to a fast instruction on many platforms.
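
Since nlz and bsr are used throughout but never defined here, this is a portable C sketch of both. The nlz is the usual binary-search formulation; on real hardware you would normally reach for an intrinsic instead, which I'm leaving out because those are compiler-specific.

```c
#include <assert.h>
#include <stdint.h>

/* Number of leading zeros of a 32-bit value; returns 32 for zero. */
static unsigned nlz(uint32_t x)
{
    unsigned n = 0;
    if (x == 0) return 32;
    if ((x >> 16) == 0) { n += 16; x <<= 16; }
    if ((x >> 24) == 0) { n += 8;  x <<= 8;  }
    if ((x >> 28) == 0) { n += 4;  x <<= 4;  }
    if ((x >> 30) == 0) { n += 2;  x <<= 2;  }
    if ((x >> 31) == 0) { n += 1; }
    return n;
}

/* Index of the leftmost set bit. Undefined for zero, like the instruction. */
static unsigned bsr(uint32_t x)
{
    assert(x != 0);
    return nlz(x) ^ 31u;   /* same as 31 - nlz(x) when nlz(x) is in 0..31 */
}
```

The XOR with 31 is the same conversion between "count from the left" and "index from the right" that the loop version of minOR uses when it computes m.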

uint minOR(uint a, uint b, uint c, uint d)
{
    uint bits = a ^ c;
    while (bits != 0)
    {
        // get the highest bit
        uint m = 1u << (nlz(bits) ^ 31);
        // remove the bit
        bits ^= m;
        if ((a & m) == 0)
        {
            uint temp = (a | m) & (0 - m);
            if (temp <= b) { a = temp; break; }
        }
        else
        {
            uint temp = (c | m) & (0 - m);
            if (temp <= d) { c = temp; break; }
        }
    }
    return a | c;
}

One interesting consequence of looking only at the bits that are different is that the second if disappears - the case where the bits are equal is ruled out by looking only at the different bits in the first place.

But that is not all. The bit positions at which the <= operators could return true are precisely all those at and to the right of one important point: the highest set bit in a ^ b (or c ^ d for the other bound). Why? Well, the upper bounds are not lower than the lower bounds, so the first bit at which they differ must be the first position at which the lower bound has a zero where the upper bound has a one. Setting that bit to one and all bits to the right to zero in the lower bound is clearly valid (ie doesn't make it higher than the upper bound), but whether that bit can actually be set depends on the other lower bound as well.

What that means in practical terms, is that the value of m that first passes the tests is directly computable. No loops required. Also, because the test to check whether the new bound is still less than or equal to the upper bound isn't necessary anymore (by construction, that test always passes), the bit doesn't even have to be set anymore - without the test the new value isn't really needed, and the entire idea was that setting that bit would not change the result, so setting it is pointless.

uint minOR(uint a, uint b, uint c, uint d)
{
    uint settablea = (a ^ b) == 0 ? 0 : 0xFFFFFFFF >> nlz(a ^ b);
    uint settablec = (c ^ d) == 0 ? 0 : 0xFFFFFFFF >> nlz(c ^ d);
    uint candidatebitsa = (~a & c) & settablea;
    uint candidatebitsc = (a & ~c) & settablec;
    uint candidatebits = candidatebitsa | candidatebitsc;

    uint target = candidatebits == 0 ? 0 : 1u << bsr(candidatebits);
    uint targeta = c & target;
    uint targetc = a & target;

    uint newa = a & ~(targeta == 0 ? 0 : targeta - 1);
    uint newc = c & ~(targetc == 0 ? 0 : targetc - 1);
    return newa | newc;
}
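
If you don't trust the reasoning, the loopless version is easy to check against direct evaluation for small ranges. Below is a C port with a brute-force comparison; the names min_or_brute and min_or_agrees are made up, and nlz and bsr are plain loop versions since they are not defined in this post.

```c
#include <assert.h>
#include <stdint.h>

/* Number of leading zeros, simple loop version. */
static unsigned nlz(uint32_t x)
{
    unsigned count = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; count++; }
    return count;
}

/* Index of the leftmost set bit; x must not be zero. */
static unsigned bsr(uint32_t x) { assert(x != 0); return 31u - nlz(x); }

static uint32_t min_or(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    uint32_t settablea = (a ^ b) == 0 ? 0 : 0xFFFFFFFFu >> nlz(a ^ b);
    uint32_t settablec = (c ^ d) == 0 ? 0 : 0xFFFFFFFFu >> nlz(c ^ d);
    uint32_t candidatebitsa = (~a & c) & settablea;
    uint32_t candidatebitsc = (a & ~c) & settablec;
    uint32_t candidatebits = candidatebitsa | candidatebitsc;

    uint32_t target = candidatebits == 0 ? 0 : 1u << bsr(candidatebits);
    uint32_t targeta = c & target;
    uint32_t targetc = a & target;

    uint32_t newa = a & ~(targeta == 0 ? 0 : targeta - 1u);
    uint32_t newc = c & ~(targetc == 0 ? 0 : targetc - 1u);
    return newa | newc;
}

/* Direct evaluation; only sensible for small ranges. */
static uint32_t min_or_brute(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    uint32_t best = 0xFFFFFFFFu;
    for (uint32_t x = a; x <= b; x++)
        for (uint32_t y = c; y <= d; y++)
            if ((x | y) < best) best = x | y;
    return best;
}

/* Returns 1 if both agree for all a <= b < limit and c <= d < limit. */
static int min_or_agrees(uint32_t limit)
{
    for (uint32_t a = 0; a < limit; a++)
        for (uint32_t b = a; b < limit; b++)
            for (uint32_t c = 0; c < limit; c++)
                for (uint32_t d = c; d < limit; d++)
                    if (min_or(a, b, c, d) != min_or_brute(a, b, c, d))
                        return 0;
    return 1;
}
```

For example, with x in [5, 6] and y in [2, 3], the target bit is bit 1 (zero in a, one in c), so a is lowered to 4 and the result is 4 | 2 = 6.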

Sadly, there's an awful lot of conditionals in there, which could be branches. But they could also be conditional moves. And on x86 at least, both bsr and lzcnt set a nice condition flag if the input was zero, so it's really not too bad in practice. It is, in my opinion, a pity that there aren't more instructions to deal with leftmost bits, while instructions that deal with the rightmost bit are being added. They are nice, I will admit, but the rightmost bit could already be efficiently dealt with, while the leftmost bit is somewhat problematic.

Next post, the same thing but for the upper bound. This post is the start of a series of posts that address the propagation of intervals through bitwise operations.

Divisibility and modular multiplication, even divisors

As promised, I will now expand the divisibility testing by modular multiplication algorithm to handle even divisors.

Recall that a number y that has a rightmost bit can be written as y = d * 2ⁿ where d is odd. A number x is divisible by y = d * 2ⁿ iff it is divisible by 2ⁿ and by d. And both of those problems have already been solved in earlier posts, so:

bool IsDivisibleBy(uint32_t x, uint32_t divisor)
{
    uint32_t poweroftwo = divisor & -divisor;
    uint32_t d = divisor >> bsf(divisor);
    return (x & (poweroftwo - 1)) == 0 && 
        (d == 1 || IsDivisibleByOdd(x, d));
}

Pretty straightforward. Except perhaps the d == 1 bit. Recall that IsDivisibleByOdd doesn't want the divisor to be one, so that case has to be avoided. And if d is one, that means the divisor was a power of two. It even works if divisor is one; poweroftwo would also be one, and x & 0 is clearly always zero.

And bsf is not defined. The implementation would be strongly platform dependent, and not particularly enlightening.
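For readers who want something they can compile anyway, here is a rough portable sketch in C. The loop-based bsf is only a stand-in of my own for whatever intrinsic your platform offers, the whole thing assumes divisor is not zero, and inv and IsDivisibleByOdd are C ports of the versions from the earlier post (without the argument check).

```c
#include <stdbool.h>
#include <stdint.h>

// Portable stand-in for bsf: index of the lowest set bit.
// Assumes x != 0 (the loop would not terminate otherwise).
static uint32_t bsf(uint32_t x)
{
    uint32_t n = 0;
    while ((x & 1) == 0)
    {
        x >>= 1;
        n++;
    }
    return n;
}

// Multiplicative inverse of an odd d modulo 2^32 (Newton's method).
static uint32_t inv(uint32_t d)
{
    uint32_t x = (d * d) + d - 1;  // correct in the low 4 bits
    x *= 2 - d * x;                // 8 bits
    x *= 2 - d * x;                // 16 bits
    x *= 2 - d * x;                // 32 bits
    return x;
}

// Assumes d is odd.
static bool IsDivisibleByOdd(uint32_t x, uint32_t d)
{
    return x * inv(d) <= UINT32_MAX / d;
}

// Assumes divisor != 0.
static bool IsDivisibleBy(uint32_t x, uint32_t divisor)
{
    uint32_t poweroftwo = divisor & (0u - divisor);
    uint32_t d = divisor >> bsf(divisor);
    return (x & (poweroftwo - 1)) == 0 &&
        (d == 1 || IsDivisibleByOdd(x, d));
}
```

For example, IsDivisibleBy(12, 6) splits 6 into 2 * 3, checks that the lowest bit of 12 is zero, and then tests 12 against the odd part 3.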

Now, on to the performance part. Does this help? The answer is the same as last time - no, usually not. Except in some select cases, such as when you need to test a whole array for divisibility by the same number, which is not a compile-time constant.

Added on 29 January 2014: as Jasper Neumann pointed out, if right-rotate can be used, the part that tests whether the lower bits are zero can be merged with the other test, as described in, for example, Hacker's Delight chapter 10-16 (Test for Zero Remainder after Division by a Constant) and gmplib.org/~tege/divcnst-pldi94.pdf.
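As I understand the merged test, it can be sketched in C as follows. The names rotr32 and IsDivisibleByRot are mine, bsf is again a slow portable stand-in, and divisor is assumed to be non-zero. The idea is that after multiplying by the inverse of the odd part, any non-zero low bits get rotated into the top of the word, which makes the result too big to pass the comparison.

```c
#include <stdbool.h>
#include <stdint.h>

// Portable stand-in for bsf; assumes x != 0.
static uint32_t bsf(uint32_t x)
{
    uint32_t n = 0;
    while ((x & 1) == 0)
    {
        x >>= 1;
        n++;
    }
    return n;
}

// Multiplicative inverse of an odd d modulo 2^32 (Newton's method).
static uint32_t inv(uint32_t d)
{
    uint32_t x = (d * d) + d - 1;
    x *= 2 - d * x;
    x *= 2 - d * x;
    x *= 2 - d * x;
    return x;
}

// Rotate right by k, for k in 0..31. The "& 31" avoids a shift by 32 when k is 0.
static uint32_t rotr32(uint32_t v, uint32_t k)
{
    return (v >> k) | (v << ((32 - k) & 31));
}

// Assumes divisor != 0.
static bool IsDivisibleByRot(uint32_t x, uint32_t divisor)
{
    uint32_t n = bsf(divisor);       // divisor = d * 2^n
    uint32_t d = divisor >> n;       // odd part
    uint32_t q = rotr32(x * inv(d), n);
    return q <= UINT32_MAX / divisor;
}
```

Note that d == 1 needs no special case here, since inv(1) is simply 1.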

That concludes the divisibility series (for now, anyway). Next post, something completely different.

The basics of working with the rightmost bit

note: I rewrote this post because of its popularity.

In this post I will assume that the reader knows the bitwise operators and what they do (if not, see Low Level Bit Hacks You Absolutely Must Know), to avoid having to cover the basics.

The rightmost bit (not to be confused with the least-significant bit), also called "rightmost 1", is the lowest bit that is set (rightmost zero is the lowest bit that is not set). So zero, unlike all other numbers, doesn't have a rightmost bit. The rightmost bit is interesting because surprisingly many useful operations can be done on it.

Here's a small selection of potentially useful basic "rightmost bit/zero operations":

`x - 1`	Remove the rightmost bit and smear it to the right. (zero is interpreted as having a rightmost bit just beyond the msb)
`x & (x - 1)`	Remove the rightmost bit.
`x & -x`	Isolate the rightmost bit.
`x | (x + 1)`	Set the rightmost zero.
`x | ~(x + 1)`	Isolate the rightmost zero (as an inverted mask).
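To make the table a bit more concrete, here are the five operations as small C functions on 32-bit unsigned integers. The function names are just ones I made up for illustration.

```c
#include <stdint.h>

// x - 1: clears the rightmost 1 and sets every bit below it.
static uint32_t SmearRightmostBit(uint32_t x)
{
    return x - 1;
}

// x & (x - 1): clears the rightmost 1, leaves everything else alone.
static uint32_t RemoveRightmostBit(uint32_t x)
{
    return x & (x - 1);
}

// x & -x: keeps only the rightmost 1 (written as 0u - x to stay unsigned).
static uint32_t IsolateRightmostBit(uint32_t x)
{
    return x & (0u - x);
}

// x | (x + 1): sets the rightmost 0.
static uint32_t SetRightmostZero(uint32_t x)
{
    return x | (x + 1);
}

// x | ~(x + 1): all ones, except for a 0 at the position of the rightmost 0.
static uint32_t IsolateRightmostZeroInverted(uint32_t x)
{
    return x | ~(x + 1);
}
```

For x = 0x58 (binary 0101 1000) the first three give 0x57, 0x50 and 0x08. For x = 0x57 (binary 0101 0111) the last two give 0x5F and 0xFFFFFFF7.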

How it works

Manipulation of the rightmost bit makes use of the properties of the carry/borrow process, namely that it propagates from the lsb to the msb, changes the bits it touches, and can be stopped. For example, the simplest operation, x - 1, just runs a borrow through the zeroes on the right, changing all of them to ones; the borrow is stopped by the rightmost one (which is changed to a zero). Effectively it inverts the part of the number that includes the rightmost bit and spans to the right, and leaves the rest alone. ANDing that number with x (as in the "remove the rightmost bit" operation) then sets that rightmost part to all zeroes (because x & ~x = 0) and leaves the rest of the number alone (because x & x = x).

Since -x = ~(x - 1), clearly negation is a kind of "opposite" of subtracting 1; the rightmost part (including the rightmost 1) is not changed, and the leftmost part is changed. So x & -x also gives the opposite of x & (x - 1), namely just the rightmost bit (instead of everything else).

The other two operations from the table can be derived by taking ~operation(~x) and simplifying it:

~(~x & (~x - 1)) =
// use the definition of subtraction: a - b = ~(~a + b)
~(~x & ~(x + 1)) =
// use De Morgan's law
x | (x + 1)

~(~x & -~x) =
// use the definition of negation: -a = ~(a - 1)
~(~x & ~(~x - 1)) =
// use De Morgan's law
x | (~x - 1) =
// use the definition of subtraction: a - b = ~(~a + b)
x | ~(x + 1)

Using the same principles, more complicated operations can be constructed by chaining two operations on the rightmost bit. For example, first smearing the rightmost 1 to the right fills in the zeroes below it, so that adding one carries through the whole rightmost run of ones; ANDing with x then removes that run:
(x | (x - 1)) + 1 & x

The same can also be accomplished differently (no better, just different): instead of smearing the rightmost bit to the right so that the +1 can reach it, add a number that is big enough to reach it directly. That number is, of course, the isolated rightmost bit, so we get this:
x + (x & -x) & x
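Both variants, written out in C with explicit parentheses (addition binds tighter than & in C, so the expressions above already mean this). The function names are made up for illustration.

```c
#include <stdint.h>

// Smear the rightmost 1 to the right, add one to carry through the
// rightmost run of ones, then AND with x to drop that run.
static uint32_t RemoveRightmostRunSmear(uint32_t x)
{
    return ((x | (x - 1)) + 1) & x;
}

// Add the isolated rightmost bit instead, so the carry starts at the
// bottom of the run directly.
static uint32_t RemoveRightmostRunAdd(uint32_t x)
{
    return (x + (x & (0u - x))) & x;
}
```

For x = 0x6C (binary 0110 1100), both return 0x60: the run 11 at bits 2 and 3 is gone, the run at bits 5 and 6 is untouched.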

A good overview of the basic rightmost-bit operations can be found here.

Next time, why rightmost bits are relevant in testing divisibility by even numbers.

Divisibility and modular multiplication

Another post from CodeProject (with small changes). I promise I'll post something new next time.

Typically, all computer math is done modulo 2³² (or 2^something, which the rest of my post trivially generalizes to). This leads to sometimes surprising effects, such as that the sum of a and b, while both positive, can end up being lower than min(a, b). That effect is rather well known as "overflow" (often treated like something bad, which it can be). Less well known is that it also means that some numbers have a multiplicative inverse, i.e. a number x⁻¹ such that x⁻¹ * x = 1. As mentioned in the Wikipedia article, the numbers which have multiplicative inverses are precisely those coprime to the modulus. The modulus is a power of two, which means that a number has a multiplicative inverse iff it is odd.

And it turns out to be actually useful, too. One application is, as you might guess from the title, divisibility testing.

Multiplying a number by an odd number is reversible - you can take the multiplicative inverse of the odd number and then multiply by it to get the original number back. Put differently, the function f(x) = x * k (for k odd) is a bijection.

Modular multiplication is associative, so a multiple of k, say n * k multiplied by inv(k) (the multiplicative inverse of k), is n, because

(n * k) * inv(k) =
// use associativity
n * (k * inv(k)) =
// use definition of multiplicative inverse
n * 1 =
// multiplicative identity
n

That means that in x * inv(k), the multiples of k "use up" the results from 0 through (2³²-1)/k (rounded down). Those results can't be used twice because the function is a bijection, which leaves just the numbers bigger than (2³²-1)/k for non-multiples of k. This suggests a very simple divisibility test:

static bool IsDivisibleByOdd(uint x, uint divisor)
{
    if ((divisor & 1) == 0)
        throw new ArgumentException("divisor must be odd");
    uint d_inv = inv(divisor);
    uint biggest = uint.MaxValue / divisor;  // problem right here
    return (x * d_inv) <= biggest;
}
 
static uint inv(uint d)
{
    // see Hacker's Delight,
    // Computing the Multiplicative Inverse by Newton's Method
    // use extra iteration when extending to 64 bits
    uint x = (d * d) + d - 1;
    uint t = d * x;
    x *= 2 - t;
    t = d * x;
    x *= 2 - t;
    t = d * x;
    x *= 2 - t;
    return x;
}

This may seem at first glance not to help at all, but for a constant divisor all the scary operations can be precomputed. It even has some use for unknown divisors, as long as the inverse and upper limit can be reused often enough.
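As a rough sketch of what "precomputed" could look like for a divisor that is only known at runtime, here is a C version that does the expensive part once and then tests a whole array. The OddDivisor struct and all function names are hypothetical, and the divisor is assumed to be odd.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Multiplicative inverse of an odd d modulo 2^32 (Newton's method).
static uint32_t inv(uint32_t d)
{
    uint32_t x = (d * d) + d - 1;
    x *= 2 - d * x;
    x *= 2 - d * x;
    x *= 2 - d * x;
    return x;
}

// Everything that depends only on the divisor (hypothetical helper type).
typedef struct
{
    uint32_t inverse;  // inv(divisor)
    uint32_t limit;    // UINT32_MAX / divisor
} OddDivisor;

// The division happens here, once. Assumes divisor is odd.
static OddDivisor PrepareOddDivisor(uint32_t divisor)
{
    OddDivisor p = { inv(divisor), UINT32_MAX / divisor };
    return p;
}

// Per element: one multiplication and one comparison.
static bool IsMultiple(uint32_t x, OddDivisor p)
{
    return x * p.inverse <= p.limit;
}

static size_t CountMultiples(const uint32_t *xs, size_t n, uint32_t divisor)
{
    OddDivisor p = PrepareOddDivisor(divisor);
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += IsMultiple(xs[i], p) ? 1 : 0;
    return count;
}
```

Whether this actually beats a plain remainder depends on how many elements share the same divisor, so measure before relying on it.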

New enough versions of GCC and Clang can perform this optimization when the divisor is a constant, but not yet in cases where the same (but unknown) divisor is re-used often.

This method can be extended to even divisors, with some complications.

Wednesday 12 September 2012

Divisibility and digital roots

To start off this blog, I'll start with a subject that I wrote about on CodeProject, divisibility and digital roots.

It is well known that a number is divisible by 9 if and only if its digital root is 9. Less well known is that a similar trick kind of applies to numbers other than 9, but doesn't really work out.

In order to make this trick "work" (I'll get to why it sometimes doesn't) for number k, the digit at position i has to be multiplied by base^i - [the biggest multiple of k <= base^i] before adding it to the (modified) digital sum.

For example for k = 7, base = 10, counting positions from 1, you'd multiply the ones position by 3 (10 - 7), the tens position by 2 (100 - 98), the hundreds position by 6 (1000 - 994), and so forth (3, 2, 6, 4, 5, 1, then it repeats).

It does transform every multiple of 7 into a multiple of 7 (and every non-multiple-of-7 into a non-multiple-of-7), but it can be the same number, for example 14: 3 * 4 + 2 * 1 = 14, or it can even be a bigger number, for example 9, which turns into 3 * 9 = 27.
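Here is a small C sketch of that modified digit sum for k = 7 in base 10, with the ones digit getting weight 3 as in the example above. The function name is made up, and the code happily uses % and / itself, so it only illustrates the transformation and is of no use as an actual divisibility test.

```c
#include <stdint.h>

// Weighted digit sum for k = 7, base = 10.
// The digit at position i (counting from 1) is multiplied by 10^i mod 7,
// which gives the weights 3, 2, 6, 4, 5, 1, 3, 2, ...
static uint32_t WeightedDigitSum7(uint32_t x)
{
    uint32_t weight = 10 % 7;  // weight of the ones digit: 3
    uint32_t sum = 0;
    while (x != 0)
    {
        sum += (x % 10) * weight;    // add the weighted lowest digit
        x /= 10;                     // move on to the next digit
        weight = (weight * 10) % 7;  // next power of ten, reduced modulo 7
    }
    return sum;
}
```

WeightedDigitSum7(14) is 14 again, and WeightedDigitSum7(9) is 27, which shows the problem: the result is not guaranteed to be any smaller than the input.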

But we're programmers, so the base isn't 10. It can be 16. 6 * 6 = 36, so every (positive integer) power of 16 ends in a 6, which means that the nearest lower multiple of 5 is only 1 away. So for k = 5, it works out to a factor of 1 at every position.

Even better, 16^n-1 is divisible by 15, so for base 16, k = 15 works out well too, with a factor of 1 at every position. This leads to the following algorithm:

static bool IsMultipleOf15(int x)
{
    // lookup table to speed up last step
    const ulong lookuptable = 0x1000200040008001;
    int t = (x & 0x0F0F0F0F) + ((x & unchecked((int)0xF0F0F0F0)) >> 4);
    t = (t & 0x001F001F) + ((t & 0x1F001F00) >> 8);
    t = (t & 0x0000003F) + ((t & 0x003F0000) >> 16);
    t = (t & 0xF) + ((t & 0x70) >> 4);
    return ((lookuptable >> t) & 1) != 0;
}

15 has factors 3 and 5, so the same code works to test for divisibility by 3 or 5 just by changing the lookup table to 0x9249249249249249 or 0x1084210842108421, respectively (the two of those ANDed together give the lookup table for 15, of course). I haven't encountered a situation where this is useful; modulo by a constant is optimized by every sane compiler so this is never an optimization, just a curiosity (or perhaps something to torture interviewees with).
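In case the lookup tables look like magic numbers: bit i of a table is set exactly when i is a multiple of the number being tested. The C sketch below builds them with a loop and plugs them into an unsigned variant of the folding steps. That variant is my own adaptation for uint32_t, not a line-by-line port of the code above, and all names are made up.

```c
#include <stdbool.h>
#include <stdint.h>

// Bit i of the result is set iff i is a multiple of k. Assumes k != 0.
static uint64_t MultiplesTable(uint32_t k)
{
    uint64_t table = 0;
    for (uint32_t i = 0; i < 64; i += k)
        table |= UINT64_C(1) << i;
    return table;
}

// Adds up the hex digits of x, then folds once more.
// The result is congruent to x modulo 15 and at most 22.
static uint32_t NibbleFold(uint32_t x)
{
    uint32_t t = (x & 0x0F0F0F0Fu) + ((x >> 4) & 0x0F0F0F0Fu);  // four sums, each <= 30
    t = (t & 0x001F001Fu) + ((t >> 8) & 0x001F001Fu);           // two sums, each <= 60
    t = (t & 0x3Fu) + ((t >> 16) & 0x3Fu);                      // one sum, <= 120
    t = (t & 0xFu) + (t >> 4);                                  // <= 15 + 7 = 22
    return t;
}

// table must come from MultiplesTable(k) with k equal to 3, 5 or 15.
static bool IsMultipleByTable(uint32_t x, uint64_t table)
{
    return ((table >> NibbleFold(x)) & 1) != 0;
}
```

MultiplesTable(3) and MultiplesTable(5) reproduce the two constants mentioned above, and ANDing them gives the table for 15.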

In the next post, I'll cover a divisibility testing algorithm that is actually useful.