Thursday, 24 November 2016

Parallel prefix/suffix operations

After a "brief" hiatus..

Parallel prefix/suffix operations are a family of operations that follow either this template:

x = x ⊕ (x >> 1);
x = x ⊕ (x >> 2);
x = x ⊕ (x >> 4);
x = x ⊕ (x >> 8);
x = x ⊕ (x >> 16);
Or this template:
x = x ⊕ (x << 1);
x = x ⊕ (x << 2);
x = x ⊕ (x << 4);
x = x ⊕ (x << 8);
x = x ⊕ (x << 16);
Where ⊕ is either OR or XOR. AND is pointless, but perhaps there are undiscovered useful variations.

There is some disagreement (stemming from the usual "which end of an integer is the front" question) about which of them should be called the "parallel prefix" and which the "parallel suffix", for the purpose of this post I'll be calling the first one (with the shifts to the right) "parallel prefix". Either way, these operations are bit-level analogues to the more well known prefix sum, prefix max, prefix product, etc. The word "parallel" in the name refers to the bit-level parallelism, which has the same structure as the simple (not work-efficient) parallel prefix sum algorithm.

Some members of the family

The best-known member of parallel prefix/suffix family is the parallel prefix with ⊕ = XOR (PP-XOR), which converts Gray code back to normally-ordered integers and produces the parity of an integer in the lowest bit.
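As a concrete sketch (function names are mine), PP-XOR undoes the usual binary-to-Gray conversion x ^ (x >> 1), and its lowest bit ends up holding the parity of the input:

```c
#include <stdint.h>

// Parallel prefix with XOR: bit i of the result is the XOR of bits i..31.
static uint32_t pp_xor(uint32_t x)
{
    x ^= x >> 1;
    x ^= x >> 2;
    x ^= x >> 4;
    x ^= x >> 8;
    x ^= x >> 16;
    return x;
}

// Binary-to-Gray conversion, which pp_xor inverts.
static uint32_t to_gray(uint32_t x)
{
    return x ^ (x >> 1);
}
```

So pp_xor(to_gray(x)) == x for any x, and pp_xor(x) & 1 is the parity of x.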

The parallel suffix with ⊕ = XOR (PS-XOR) is a carryless multiplication by -1, which isn't very interesting by itself (probably), but gives a hint about the algebraic structure that these operations give rise to.

The parallel prefix with ⊕ = OR (PP-OR) often shows up whenever bitwise operations are interacting with ranges/intervals, since PP-OR(low ^ high) gives a mask of the bits that are not constrained by the interval. I have used this operation (though I did not name it then) several times in my series on propagating bounds through bitwise operations.
This operation lends itself to some optimizations (I put "performance" in the title, which I admit I've mostly ignored), for example on x64 you could implement it as

   lzcnt rax, rax  ; input in rax
   sbb rdx, rdx
   not rdx
   shrx rax, rdx, rax
Or:
   lzcnt rax, rax
   mov edx, 64
   sub edx, eax
   or rax, -1
   bzhi rax, rax, rdx
They each have their pros and cons, and hopefully there's a better way to do it, but I couldn't really find one. I made sure to avoid the infamous false dependency that popcnt, tzcnt and lzcnt all share on every Intel processor that implements them to date. Probably the biggest problem here is that both versions require BMI2; that can be avoided, e.g. (suggested by Jasper Neumann):
   xor edx, edx    // provide a zero register for cmov
   bsr ecx, eax
   mov eax, -1
   not ecx         // flags not modified
   cmovz eax, edx
   shr eax, cl
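For reference, here is a plain portable version of PP-OR in C (no BMI tricks), together with the interval use mentioned above; the helper names are mine:

```c
#include <stdint.h>

// Parallel prefix with OR: bit i of the result is set iff any bit <= i is set.
static uint32_t pp_or(uint32_t x)
{
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return x;
}

// Mask of bits not constrained by the interval [low, high]: everything from
// the highest differing bit downwards can vary.
static uint32_t free_bits(uint32_t low, uint32_t high)
{
    return pp_or(low ^ high);
}
```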

Properties/structure

To start with the obvious, the prefix and suffix versions are exactly each other's mirror images. So I'm going to look just at the suffix part of the family; for the prefix part, just mirror everything.

PS-XOR(x) is clmul(x, -1)

Since every step is a clmul by 1 + 2^k, and if you clmul those constants together you get -1 (the two's complement -1, not the -1 in the ring formed by XOR and clmul over bitvectors of length k, which would just be 1 again). Or, from a more intuitive angle, what the PS-XOR is supposed to do in the first place is XOR each bit into all higher bits. So it inherits the properties, such as invertibility (-1 is odd). The inverse of PS-XOR is x ^ (x << 1).

It also inherits that PS-XOR(x) ^ PS-XOR(y) == PS-XOR(x ^ y) from the distributivity of clmul.

Since clmul is commutative and associative, the steps in PS-XOR can be reordered.
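A small C sketch of the round trip (function names mine): the five-step PS-XOR composed with x ^ (x << 1) gives back the input, and the XOR-homomorphism holds:

```c
#include <stdint.h>

// Parallel suffix with XOR: bit i of the result is the XOR of bits 0..i.
static uint32_t ps_xor(uint32_t x)
{
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    return x;
}

// Inverse of ps_xor: a carryless multiplication by 3 undoes the clmul by -1,
// since clmul(-1, 3) == 1.
static uint32_t ps_xor_inv(uint32_t x)
{
    return x ^ (x << 1);
}
```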

PS-OR(x) is or_mul(x, -1)

This isn't as nice as the XOR case (XOR and clmul form a ring structure): OR doesn't form a group because it has no negation (this is just a fancy way of saying that you can't un-OR numbers, in the way you can un-ADD by subtraction or un-XOR with XOR again), so it doesn't extend to a ring. But it can be extended into a semiring by defining the following multiplication:

static uint or_mul(uint a, uint b)
{
    uint r = 0;
    for (int i = 0; i < 32; i++)
    {
        if ((a & 1) == 1)
            r |= b; // the only difference is here
        a >>= 1;
        b <<= 1;
    }
    return r;
}
It's not fancy math notation but you get the idea.

This (with OR) forming a commutative semiring (proof "left as an exercise to the reader"), it has some nice properties:

  • or_mul(a, b) == or_mul(b, a) commutativity
  • or_mul(a, or_mul(b, c)) == or_mul(or_mul(a, b), c) associativity
  • or_mul(a, b | c) == or_mul(a, b) | or_mul(a, c) left distributivity
  • or_mul(a | b, c) == or_mul(a, c) | or_mul(b, c) right distributivity
But, of course, no multiplicative inverses. Except when multiplying by 1, but that doesn't count since it's the multiplicative identity.

PS-OR(x) == or_mul(x, -1), and the individual steps of the PS-OR are of the form x = or_mul(x, 1 + 2^k). The or_mul of all those constants together is, unsurprisingly, -1 (though this notation is now slightly odd since this is all a semiring; I mean the element that has all bits set).

And again, PS-OR inherits PS-OR(x) | PS-OR(y) = PS-OR(x | y) from distributivity.

And again, since or_mul is commutative and associative, the steps in PS-OR can be reordered.

PS-OR(x) is also x | -x

This one doesn't have a nice mirrored equivalent, just the obvious one where you insert a bit-reversal before and after. It also doesn't have an obvious relation to or_mul. As for why it holds: in string notation, -(a10^k) = (~a)10^k, so a10^k | -(a10^k) = (~a|a)10^k = 1…10^k. That is, x | -x copies the lowest set bit into all higher bits, just as PS-OR does.
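Both identities are easy to spot-check in C (a sketch; or_mul transcribed from the C# above):

```c
#include <stdint.h>

// Parallel suffix with OR: bit i of the result is set iff any bit <= i is set.
static uint32_t ps_or(uint32_t x)
{
    x |= x << 1;
    x |= x << 2;
    x |= x << 4;
    x |= x << 8;
    x |= x << 16;
    return x;
}

// or_mul from the post, transcribed to C.
static uint32_t or_mul(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++)
    {
        if ((a & 1) == 1)
            r |= b;
        a >>= 1;
        b <<= 1;
    }
    return r;
}
```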

Sunday, 13 April 2014

A bitwise relational domain

This is the thing that I was referring to in my previous post where I said it didn't work out as well as I'd hoped. That's still the case. But it's not a complete failure, it's just not as good as I had hoped it might be, and it's still interesting (and, as far as I know, new).

The basic idea is really the same one as in Miné's A Few Graph-Based Relational Numerical Abstract Domains, namely a matrix where the element i,j says something about the relation between the variables i and j, and there's a closure operator that very closely resembles the Floyd–Warshall algorithm (or matrix multiplication, which comes down to the same thing). Of course, saying that the difference of two variables must be in some set isn't great for bitwise operations, so my first thought was to change the difference to an XOR, with Miné's bitfield domain (the one I've written several other posts about) as the basis. For the closure operator, the multiplication is represented by the bitwise-XOR from the bitfield domain, and the addition by the bitwise-AND from the bitfield domain. The rest is similarly easy.

The problem with that idea is that it tends to find almost no relations. It can represent some non-trivial relations, for example that one variable is (partially) equal to another variable xor-ed with a constant, but it's just not enough.

The problem is that xor just doesn't remember enough. So the next idea I had was to abandon xor, and go with an operator that doesn't throw away anything, and use 4-tuples to represent that set instead of 2-tuples. Due to lack of inspiration, I named the elements of the tuple (a, b, c, d), and they work like this:

a: x = 0, y = 0
b: x = 0, y = 1
c: x = 1, y = 0
d: x = 1, y = 1
Meaning that if, say, bit 0 of a is set in the element at (x,y), then bit 0 of x and bit 0 of y can both be zero simultaneously - that is, x and y can be even simultaneously. On the diagonal of the matrix, where x = y, b and c must both be zero, because a variable cannot be unequal to itself.

The "multiplication" operator for the closure works like this:

a[i,j] &= a[i,k] & a[k,j] | b[i,k] & c[k,j]
b[i,j] &= a[i,k] & b[k,j] | b[i,k] & d[k,j]
c[i,j] &= d[i,k] & c[k,j] | c[i,k] & a[k,j]
d[i,j] &= d[i,k] & d[k,j] | c[i,k] & b[k,j]
For all k. This essentially "chains" i,k with k,j to get an implied constraint on i,j.
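As a minimal sketch of that chaining step (the struct and names are mine), applied to one triple (i, k, j): starting from an unconstrained element and chaining two equality relations, x = y and y = z, derives x = z:

```c
#include <stdint.h>

// One 4-tuple relation between two variables; each field is a bitmask,
// e.g. bit n of b set means "bit n can be 0 in the first variable while
// being 1 in the second".
typedef struct { uint32_t a, b, c, d; } Rel;

// The closure "multiplication": strengthen r[i,j] using the chain r[i,k], r[k,j].
static Rel chain(Rel ij, Rel ik, Rel kj)
{
    Rel r;
    r.a = ij.a & (ik.a & kj.a | ik.b & kj.c);
    r.b = ij.b & (ik.a & kj.b | ik.b & kj.d);
    r.c = ij.c & (ik.d & kj.c | ik.c & kj.a);
    r.d = ij.d & (ik.d & kj.d | ik.c & kj.b);
    return r;
}
```

Chaining the equality element (-1, 0, 0, -1) with itself through an unconstrained (-1, -1, -1, -1) element yields (-1, 0, 0, -1) again: equality is transitive, and the closure discovers that.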

To show how powerful this is, here is what happens after asserting that z = x & y

       x              y              z
x   -1, 0, 0,-1   -1,-1,-1,-1   -1,-1, 0,-1
y   -1,-1,-1,-1   -1, 0, 0,-1   -1,-1, 0,-1
z   -1, 0,-1,-1   -1, 0,-1,-1   -1, 0, 0,-1
And now some cool things can happen. For example, if you assert that z is odd, the closure operation will propagate that back to both x and y. Or if you assert that w = x | z, it will deduce that w == x.

There is a symmetry between the upper triangle and the lower triangle. They're not the same, but an element (a, b, c, d) is mirrored by an element (a, c, b, d). One of those triangles can easily be left out, to save some space. Elements that are (-1, -1, -1, -1) could also be left out, but that doesn't typically save many elements - if something non-relational is known about a variable (that is, there is an element on the diagonal that isn't (-1, 0, 0, -1)), then everything in the same row and everything in the same column will have that information in it as well. But an element (i,j) exactly equal to (a[i,i] & a[j,j], a[i,i] & d[j,j], d[i,i] & a[j,j], d[i,i] & d[j,j]) can be left out. That removes all elements that describe a trivial relation of the kind that says that the combination of x and y must be in the Cartesian product of the sets that x and y can individually be in, which is completely redundant.

But it does have problems. If you xor (or add or subtract) two unconstrained variables, nothing happens. That sort of 3-variable relation isn't handled at all. This led me to explore a 3D variant of this domain (with 8-tuples), which is really great if you only consider its expressive power, but it likes to take a large amount of space, and its closure operator is very inefficient.

The algorithms I presented earlier for addition in the bitfield domain still work in this relational domain, and with just a couple of simple changes they can make use of some relational information. They want to know something about p = x ^ y and g = x & y, relational information can be available for both of them. For example, if it is known that x & y == 0 (indicated by a 4-tuple (a, b, c, 0)), g will be known to be 0, all carries will be known to be 0, and two relations will be made that look exactly the same as if you had asserted z = x | y. Of course that only works if x & y == 0 is known before the addition.

Assertions such as x & y == 0 (more generally: x ⊗ y is in (z, o), where (z, o) is an element of the non-relational bitfield domain) can be done directly (without inventing a temporary to hold x ⊗ y and asserting on it), and actually that's simpler and far more efficient.

Tuesday, 4 February 2014

Addition in the bitfield domain, alternative algorithm

This is obviously a continuation of this post, I could have published this one first but it didn't seem that interesting to me at the time. But it's simpler to understand, it's (as far as I know) "new" in the sense that it hasn't been written down anywhere public, and the thing I really wanted to post about didn't really work out as well as I had hoped.. so here it is.

The idea behind this algorithm is to take the software version of a Kogge-Stone adder, and replace all the variables by z/o pairs and all the operators by their "lifted" equivalents. So just take this:

uint p = x ^ y;
uint g = x & y;

g |= p & (g << 1);
p &= p << 1;

g |= p & (g << 2);
p &= p << 2;

g |= p & (g << 4);
p &= p << 4;

g |= p & (g << 8);
p &= p << 8;

g |= p & (g << 16);

uint result = x ^ y ^ (g << 1);
Do the replacement:
// p = x ^ y
uint pz = xz & yz | xo & yo;
uint po = xz & yo | xo & yz;
// g = x & y
uint gz = xz | yz;
uint go = xo & yo;
// g |= p & (g << 1)
gz = (gz << 1 | 1 | pz) & gz;
go = (go << 1) & po | go;
// p &= p << 1
pz = pz << 1 | 1 | pz;
po = po << 1 & po;
// g |= p & (g << 2)
gz = (~(~gz << 2) | pz) & gz;
go = (go << 2) & po | go;
// p &= p << 2
pz = ~(~pz << 2) | pz;
po = po << 2 & po;
// g |= p & (g << 4)
gz = (~(~gz << 4) | pz) & gz;
go = (go << 4) & po | go;
// p &= p << 4
pz = ~(~pz << 4) | pz;
po = po << 4 & po;
// g |= p & (g << 8)
gz = (~(~gz << 8) | pz) & gz;
go = (go << 8) & po | go;
// p &= p << 8
pz = ~(~pz << 8) | pz;
po = po << 8 & po;
// g |= p & (g << 16)
gz = (~(~gz << 16) | pz) & gz;
go = (go << 16) & po | go;
// g <<= 1
gz = ~(~gz << 1);
go = go << 1;
// z = x ^ y
uint zz = xz & yz | xo & yo;
uint zo = xz & yo | xo & yz;
// result = z ^ g
uint resz = zz & gz | zo & go;
uint reso = zz & go | zo & gz;

It got longer, but not actually more complicated - it's still the same algorithm, and you could even write it exactly the same way if you used operator overloading.

This algorithm doesn't need a special case for "empty" inputs, because the two lifted XORs at the end preserve emptiness (though the lifted left shifts in the calculation of g do not).
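As a sanity check, here is the same lifted algorithm transcribed to C and wrapped in a function (the struct and names are my packaging). Feeding it fully-known inputs (z = ~x, o = x) must reproduce ordinary addition exactly:

```c
#include <stdint.h>

typedef struct { uint32_t z, o; } ZO; // z: bits that can be 0, o: bits that can be 1

// Lifted Kogge-Stone addition in the z/o bitfield domain, as derived above.
static ZO lifted_add(ZO x, ZO y)
{
    uint32_t pz = x.z & y.z | x.o & y.o;   // p = x ^ y
    uint32_t po = x.z & y.o | x.o & y.z;
    uint32_t gz = x.z | y.z;               // g = x & y
    uint32_t go = x.o & y.o;

    gz = (gz << 1 | 1 | pz) & gz;          // g |= p & (g << 1)
    go = (go << 1) & po | go;
    pz = pz << 1 | 1 | pz;                 // p &= p << 1
    po = po << 1 & po;

    gz = (~(~gz << 2) | pz) & gz;          // g |= p & (g << 2)
    go = (go << 2) & po | go;
    pz = ~(~pz << 2) | pz;                 // p &= p << 2
    po = po << 2 & po;

    gz = (~(~gz << 4) | pz) & gz;          // g |= p & (g << 4)
    go = (go << 4) & po | go;
    pz = ~(~pz << 4) | pz;                 // p &= p << 4
    po = po << 4 & po;

    gz = (~(~gz << 8) | pz) & gz;          // g |= p & (g << 8)
    go = (go << 8) & po | go;
    pz = ~(~pz << 8) | pz;                 // p &= p << 8
    po = po << 8 & po;

    gz = (~(~gz << 16) | pz) & gz;         // g |= p & (g << 16)
    go = (go << 16) & po | go;

    gz = ~(~gz << 1);                      // g <<= 1
    go = go << 1;

    uint32_t zz = x.z & y.z | x.o & y.o;   // z = x ^ y
    uint32_t zo = x.z & y.o | x.o & y.z;
    ZO r = { zz & gz | zo & go, zz & go | zo & gz }; // result = z ^ g
    return r;
}
```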

Saturday, 28 September 2013

Determining the cardinality of a set described by a mask and bounds

In other words, calculating this \begin{equation} \left| \left\{ x \mid (x \& m) = v \land x \ge a \land x \le b \right\} \right| \end{equation}

A BDD (which haroldbot uses to solve this) has no trouble with that at all (form the query like unsigned solve x & m = v && x >= a && x <= b), but the structure of the problem is so nice that I thought it should be possible to do better. And it is, though the solution I'm about to present is probably far from optimal. I don't know how to significantly improve on it, but I just have a very non-scientific gut feeling that there should be a fundamentally better way to do it.

The cardinality of the set without the bounds (only the mask) is obviously trivial to compute (as 1 << popcnt(~m)), and the bounds divide that set in three parts: items that are lower than the lower bound ("left part"), items that are higher than the upper bound ("right part"), and items that are in the actual set of interest ("middle part"). The main idea it is built on is that it's relatively easy to solve the problem if it didn't have a lower bound and the upper bound was a power of two, with a formula roughly similar to the one above. Using that, all the powers of two that fit before the lower bound can be counted from high to low, giving the cardinality of the "left part". The same can be done for the "right part", actually in exactly the same way, by complementing the mask and the upper bound. Obviously the cardinality of the "middle part" can be computed from this.

And here's the code. It doesn't like it when the cardinality is 2^32, and watch out for weird corner cases such as when the lower bound is bigger than the upper bound (why would you even do that?). It usually works, that's about as much as I can say - I didn't prove it correct.

static uint Cardinality(uint m, uint v, uint a, uint b)
{
    // count x such that a <= x <= b && (x & m) == v
    // split in three parts:
    // left  = 0 <= x < a
    // right = b < x <= -1
    // the piece in the middle is (1 << popcnt(~m)) - left - right
    uint left = 0;
    uint x = 0;
    for (int i = 32 - 1; i >= 0; i--)
    {
        uint mask = 1u << i;
        if (x + mask <= a)
        {
            x += mask;
            uint mask2 = 0 - mask;
            if ((x - 1 & m & mask2) == (v & mask2))
            {                        
                uint amount = 1u << popcnt(~(m | mask2));
                left += amount;
            }
        }
    }
    uint right = 0;
    uint y = 0;
    for (int i = 32 - 1; i >= 0; i--)
    {
        uint mask = 1u << i;
        if (y + mask <= ~b)
        {
            y += mask;
            uint mask2 = 0 - mask;
            if ((y - 1 & m & mask2) == (~v & m & mask2))
            {
                uint amount = 1u << popcnt(~(m | mask2));
                right += amount;
            }
        }
    }
    uint res = (uint)((1UL << popcnt(~m)) - (left + right));
    return res;
}
The loops can be merged of course, but for clarity they're separate here.

If you have any improvements, please let me know.
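To convince myself the construction is right, here's a reduced 8-bit port in C (my own transcription; popcnt8 is a trivial helper) that can be cross-checked against a naive count:

```c
#include <stdint.h>

static int popcnt8(uint32_t x) { int c = 0; for (; x; x >>= 1) c += x & 1; return c; }

// 8-bit port of the Cardinality routine from the post.
static uint32_t cardinality8(uint32_t m, uint32_t v, uint32_t a, uint32_t b)
{
    uint32_t left = 0, x = 0;
    for (int i = 7; i >= 0; i--)
    {
        uint32_t mask = 1u << i;
        if (x + mask <= a)
        {
            x += mask;
            uint32_t mask2 = (0u - mask) & 0xFFu;
            if (((x - 1) & m & mask2) == (v & mask2))
                left += 1u << popcnt8(~(m | mask2) & 0xFFu);
        }
    }
    uint32_t right = 0, y = 0, nb = ~b & 0xFFu;
    for (int i = 7; i >= 0; i--)
    {
        uint32_t mask = 1u << i;
        if (y + mask <= nb)
        {
            y += mask;
            uint32_t mask2 = (0u - mask) & 0xFFu;
            if (((y - 1) & m & mask2) == (~v & m & mask2))
                right += 1u << popcnt8(~(m | mask2) & 0xFFu);
        }
    }
    return (1u << popcnt8(~m & 0xFFu)) - left - right;
}

// Brute force over all 256 values, for comparison.
static uint32_t naive8(uint32_t m, uint32_t v, uint32_t a, uint32_t b)
{
    uint32_t n = 0;
    for (uint32_t x = 0; x < 256; x++)
        if ((x & m) == v && x >= a && x <= b)
            n++;
    return n;
}
```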

Thursday, 22 August 2013

Addition in the bitfield domain

Bitfield domains are abstract domains that track the possible values of every bit independently. For example, they can express things like "all values where the third bit is set".

There are two obvious "simplest useful" bitfield domains:

  1. The first one is probably the most obvious one: a bitvector m that has a 1 at every position that has a known value, and a bitvector v with the values of the known bits (ones at unknown positions are ignored). This domain turns out to be largely inconvenient, for several reasons.
  2. The second one is perhaps slightly less intuitive: the one defined in Miné-WING12, section 2.8, namely a bitvector z that has a 1 at every position that can be zero, and a bitvector o with a 1 for every position that can be one. This domain turns out to be pretty good, but the "o" naming is sometimes confusing.

So, what's so bad about the first domain?

Firstly, this:

         Empty set            Universal set
first    not representable    m = 0 (r)
second   (z | o) != -1 (r)    (z & o) == -1
The entries marked (r) have redundant representations, which is usually not a problem, but it can be a little annoying at times. This table means that in general, they're not just two different representations of the same thing. Still, every value of the first domain can be converted to the second domain by z = ~m | ~v; o = ~m | (m & v), and most values of the second domain can be converted to the first domain by m = z ^ o; v = o or m = z ^ o; v = ~z, though that converts empty sets into non-empty sets.

But that's not all. The other problem with the first domain is the considerably more annoying implementation of some abstract operations. For example

bitwise AND (first domain)
m = (a.m & b.m) | (a.m & ~a.v) | (b.m & ~b.v);
v = a.v & b.v;
bitwise AND (second domain)
z = a.z | b.z;
o = a.o & b.o;
bitwise OR (first domain)
m = (a.m & b.m) | (a.m & a.v) | (b.m & b.v);
v = a.v | b.v;
bitwise OR (second domain)
z = a.z & b.z;
o = a.o | b.o;
Union (first domain)
m = a.m & b.m & (~a.v ^ b.v);
v = a.v;
Union (second domain)
z = a.z | b.z;
o = a.o | b.o;
Intersect (first domain)
if ((a.v ^ b.v) & a.m & b.m) Empty
m = a.m | b.m;
v = a.v | b.v;
Intersect (second domain)
z = a.z & b.z;
o = a.o & b.o;
But then, sometimes it doesn't matter at all, for example
left shift by a constant (first domain)
m = ~(~m << n);
v = v << n;
left shift by a constant (second domain)
z = ~(~z << n);
o = o << n;
And rarely, the first domain is clearly better
bitwise XOR (first domain)
m = a.m & b.m;
v = a.v ^ b.v;
bitwise XOR (second domain)
z = (a.z & b.z) | (a.o & b.o);
o = (a.o & b.z) | (a.z & b.o);
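These transfer functions can be checked exhaustively at a small width. A C sketch (names mine), assuming the concretization of (z, o) is the set of x whose 1-bits are all allowed by o and whose 0-bits are all allowed by z, verifying the second domain's AND at 4 bits:

```c
#include <stdint.h>

#define FULL 0xFu // 4-bit universe

// x is in the concretization of (z, o) iff every 1-bit of x is allowed by o
// and every 0-bit of x is allowed by z.
static int member(uint32_t x, uint32_t z, uint32_t o)
{
    return (x & ~o) == 0 && (~x & ~z & FULL) == 0;
}

// Exhaustively verify that the second domain's AND transfer function
// (z = a.z | b.z, o = a.o & b.o) covers every concrete outcome.
static int and_transfer_is_sound(void)
{
    for (uint32_t az = 0; az <= FULL; az++)
    for (uint32_t ao = 0; ao <= FULL; ao++)
    for (uint32_t bz = 0; bz <= FULL; bz++)
    for (uint32_t bo = 0; bo <= FULL; bo++)
    {
        uint32_t rz = az | bz, ro = ao & bo;
        for (uint32_t x = 0; x <= FULL; x++)
        for (uint32_t y = 0; y <= FULL; y++)
            if (member(x, az, ao) && member(y, bz, bo) && !member(x & y, rz, ro))
                return 0;
    }
    return 1;
}
```

The same harness works for OR and XOR by swapping in the corresponding pair of formulas.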
And then there's bitwise complement, which is special. In the first domain, it only costs a single "not". In the second domain, you just swap the z with the o. So which domain is better for bitwise complement? It depends. If you really have to do the swap, that's probably worse (hey Intel and AMD, please implement xchg with register renaming like you do with fxch). But you may be able to do a "virtual swap", where you just swap the meanings of the fields instead of their contents. That's free, but it's usually not applicable. However, you can use it to define subtraction in terms of addition at no extra cost, in the worst case by swapping a couple of things manually.

Incidentally, one interesting thing about the first domain is that the operation for v is the same as the concrete operation that you're abstracting. That's because whatever garbage happens to be in the unknown bits, by definition it can not affect bits that are known in the result, so the whole "bit is unknown"-business can be ignored for v and instead all the difficulty is moved to the computation of m.

That's all very nice, but the interesting part is:

Addition, as promised.

All abstract operations up till now were very simple. Addition, however, is a little tricky. First I'll describe the addition algorithm for the first domain .. and then I won't describe one for the second domain (it can easily be defined in terms of the addition algorithm for the first domain though).

It's based on the concepts of a carry look-ahead adder, with "propagates" and "generates" information, but of course it's not actually a CLA because the input isn't fully known and the result is a mask of where the result of the addition would be known.

The basic idea that the entire algorithm is based around is that if a carry is known somewhere, a whole stretch of positions that would propagate it can be selected from the point of the known carry upwards. That seems very easy: (p + g & ~p) - g, but that's not quite right, and most of the algorithm has to do with fixing that up.

For example, if there is more than one generating bit in the same run of propagators, things go south. But that can be fixed: just select only the rightmost ends of all runs of generators:
g & ~(g << 1)

That actually doesn't even solve the problem though. The runs of generators are broken up now, but a single run of propagators could still span all of them. So instead of checking whether p + g has bits that weren't in p, check whether it has bits that are not in p, or that are both in p and in g (the new g).

The fixups just keep piling up really, so I'm almost certain that at least a couple of operations can be shaved off with some further cleverness. Anyway, here's what I have now:

av &= am;
bv &= bm;
uint an = ~av & am;      // not set in a
uint bn = ~bv & bm;      // not set in b
uint g0 = an & bn;       // generates a 0-carry
uint g1 = av & bv;       // generates a 1-carry
uint p = an ^ bn;            // propagates a carry
uint pg1 = av | bv;          // propagates or generates a 1-carry
uint g0l = ~g0 & ((g0 << 1) | 1); // the bit directly to the left of every run of 1s
uint g1f = g1 & ~(g1 << 1);    // the lowest bit of every run of 1s
uint nc = (p + g0l & ~p) - g0l | g0; // bits that have a carry-out of 0
uint m1 = ~pg1 | (pg1 & g1f);
uint c = (pg1 + g1f & m1) - g1f;     // bits that have a carry-out of 1
uint cm = (c | nc) << 1 | 1; // bits that have a known carry-in
uint m = cm & am & bm;     // result is known if carry-in and both inputs are known
uint v = av + bv;

Subtraction (for both discussed bitfield domains) can be implemented using the identity
a - b = ~(~a + b), either by actually doing the complements (which are trivial and don't lose any information) or by rewriting a couple of things in the addition algorithm.
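The algorithm above, transcribed to C and packaged as a function (struct and names mine), passes a couple of quick sanity checks: fully-known inputs give a fully-known sum, and a known carry-free low bit stays known:

```c
#include <stdint.h>

typedef struct { uint32_t m, v; } MV; // m: known-bits mask, v: their values

// Addition in the first (m/v) bitfield domain, as derived in the post.
// Parentheses added per C precedence; semantics unchanged from the C#.
static MV bitfield_add(MV a, MV b)
{
    uint32_t av = a.v & a.m, bv = b.v & b.m;
    uint32_t an = ~av & a.m;                    // known-zero bits of a
    uint32_t bn = ~bv & b.m;                    // known-zero bits of b
    uint32_t g0 = an & bn;                      // generates a 0-carry
    uint32_t g1 = av & bv;                      // generates a 1-carry
    uint32_t p = an ^ bn;                       // propagates a carry
    uint32_t pg1 = av | bv;                     // propagates or generates a 1-carry
    uint32_t g0l = ~g0 & ((g0 << 1) | 1);       // bit directly left of every run of 1s
    uint32_t g1f = g1 & ~(g1 << 1);             // lowest bit of every run of 1s
    uint32_t nc = (((p + g0l) & ~p) - g0l) | g0; // carry-out known to be 0
    uint32_t m1 = ~pg1 | (pg1 & g1f);
    uint32_t c = ((pg1 + g1f) & m1) - g1f;      // carry-out known to be 1
    uint32_t cm = (c | nc) << 1 | 1;            // carry-in known
    MV r = { cm & a.m & b.m, av + bv };         // known iff carry-in and both inputs known
    return r;
}
```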

Saturday, 10 August 2013

Announcing haroldbot

haroldbot started out as an ircbot (hence the name) that solves some bitmath problems. The title is actually a lie - haroldbot has been around for a while now. But now it finally got its own website.
haroldbot.nl
Check it out, if you work with bits you will probably find this useful.

Thursday, 30 May 2013

Carryless multiplicative inverse

Note: this post is neither about the normal multiplicative inverse, nor the modular multiplicative inverse. This other post has information about the modular multiplicative inverse, which might be what you were looking for.

Mathematicians may call carryless multiplication "multiplication of polynomials over GF(2)", but that doesn't explain how it works - recall the shift-and-add algorithm for multiplication:
static uint mul(uint a, uint b)
{
    uint r = 0;
    while (b != 0)
    {
        if ((a & 1) != 0)
            r += b;
        a >>= 1;
        b <<= 1;
    }
    return r;
}
Carryless multiplication is a very simple variation on that: do the addition without carries. That's just a XOR.
static uint cl_mul(uint a, uint b)
{
    uint r = 0;
    while (b != 0)
    {
        if ((a & 1) != 0)
            r ^= b;      // carryless addition is xor
        a >>= 1;
        b <<= 1;
    }
    return r;
}
It has some applications in complicated cryptography-related algorithms, but it also seems like it should be an interesting and powerful operation when working with bits, and it may well be, but I know of almost no uses for it (besides, Intel's implementation used to be so slow that it often wouldn't have helped, though Intel made it over twice as fast in Haswell). But anyway, let's just start with its basic properties: like normal multiplication, it's commutative and associative. It's also distributive, but over xor instead of over addition. None of this is very surprising.

As an aside, using associativity, it can be shown that the parallel suffix with XOR (which does have some known uses in bitmath, for example in implementing compress_right in software), code shown below, is equivalent to a carryless multiplication by -1.
// parallel suffix with XOR
x ^= x << 1;
x ^= x << 2;
x ^= x << 4;
x ^= x << 8;
x ^= x << 16;
Every step is clearly a carryless multiplication, by 3, 5, 17, 257, and 65537 respectively. So it's equivalent to:
clmul(clmul(clmul(clmul(clmul(x, 3), 5), 17), 257), 65537) which can be rearranged (using associativity) to:
clmul(x, clmul(clmul(clmul(clmul(3, 5), 17), 257), 65537)) which works out to clmul(x, -1). Of course it was supposed to work out that way, because every bit of the result should be the XOR of all bits up to (and including) that bit, but it's nice that it also follows easily from a basic property. Incidentally if you have a full-width carryless multiplication, multiplying by -1 also computes the parallel prefix with XOR in the upper bits (the upper bit of the low word, which is the parity of the input, is shared by the suffix and the prefix.)
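This equivalence is easy to check directly (a C sketch; cl_mul transcribed from above, the suffix function name is mine):

```c
#include <stdint.h>

// Carryless (polynomial) multiplication, low 32 bits.
static uint32_t cl_mul(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    while (b != 0)
    {
        if ((a & 1) != 0)
            r ^= b;      // carryless addition is xor
        a >>= 1;
        b <<= 1;
    }
    return r;
}

// Parallel suffix with XOR, as in the snippet above.
static uint32_t ps_xor(uint32_t x)
{
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    return x;
}
```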

Carryless multiplication also shares another property with normal multiplication: there are multiplicative inverses modulo 2^n (and also modulo other numbers, but 2^n is of particular interest since we're working in that by default anyway). Again there are only inverses for odd numbers, and it's equally obvious (as for normal multiplication) why that should be so - an even multiplier will throw out at least one high-order bit. First, here's an example of carrylessly multiplying x by -1 and then carrylessly multiplying that by 3.
x = {d}{c}{b}{a}    // the letters are bits
y = cl_mul(x, -1) = {d^c^b^a}{c^b^a}{b^a}{a}
z = cl_mulinv(-1) = 0011
cl_mul(y, z) = {d^c^b^a ^ c^b^a}{c^b^a ^ b^a}{b^a ^ a}{a} = {d}{c}{b}{a}
Ok, so that worked out well, and it also gives part of the answer to exercise 3 in chapter 5 of Hacker's Delight (about whether parallel prefix/suffix with XOR is invertible and how) because a carryless multiplication by -1 is the same as the parallel suffix with XOR. A carryless multiplication of y by 3 is of course just y ^ (y << 1).

But back to actually computing the inverse. The inverse had better be odd, so bit 0 is already known, and for all the other bits, follow these steps, starting at k = 1:
  1. if the remainder is 1, stop
  2. if bit k is 0, go to step 5
  3. set bit k in the inverse
  4. xor the remainder with input << k
  5. increment k and go to step 1
Step 4 always resets the offending bit because the input had to be odd, so it's obvious that the remainder always ends up being 1 eventually, and so the algorithm always terminates. Moreover, even in the worst case it only has to process every bit but one, and continuing after the remainder becomes 1 simply does nothing, so step 1 could read "if k is 32" (or some other number, depending on how many bits your ints are wide), which is easier to unroll and better suited for a hardware implementation (not that I've seen this operation implemented in hardware anywhere).
For example, in C# it could look like this:
static uint clmulinv(uint x)
{
    uint inv = 1;
    uint rem = x;
    for (int i = 1; i < 32; i++)
    {
        if (((rem >> i) & 1) != 0)
        {
            rem ^= x << i;
            inv |= 1u << i;
        }
    }
    return inv;
}

Some sample inverses
1  0x00000001
3  0xFFFFFFFF
5  0x55555555
7  0xDB6DB6DB
9  0x49249249
11 0x72E5CB97
13 0xD3A74E9D
15 0x33333333
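The table can be verified mechanically: for each odd x, cl_mul(x, clmulinv(x)) should come out as 1 (a C transcription of the two functions above):

```c
#include <stdint.h>

// Carryless (polynomial) multiplication, low 32 bits.
static uint32_t cl_mul(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    while (b != 0)
    {
        if ((a & 1) != 0)
            r ^= b;      // carryless addition is xor
        a >>= 1;
        b <<= 1;
    }
    return r;
}

// Carryless multiplicative inverse modulo 2^32; x must be odd.
static uint32_t clmulinv(uint32_t x)
{
    uint32_t inv = 1;
    uint32_t rem = x;
    for (int i = 1; i < 32; i++)
    {
        if (((rem >> i) & 1) != 0)
        {
            rem ^= x << i;
            inv |= 1u << i;
        }
    }
    return inv;
}
```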