Optimization history

The cuda implementation started out from a straightforward port of the C code with each thread processing 1 A5/1 state with 64 loop iterations. The speed on a GTX280 was around 32M A5/1 rounds per second (64 bits keystream generated per round).

The first optimization was to compute 4 keystream bits together, using lookup tables to emulate the logic operations. The number of instructions in the loop body went up a bit, but because only 16 loop iterations would be executed, speed reached 140-150M on a GTX280. This implementation was hitech during the HAR2009 talk and the basis of our estimation of 80 cuda cards required for 3 months. It was also the first to be used for table generation.

The next incarnation was the bitslice implementation. Each cuda thread would now compute 64 instead of 1 A5/1 engine. Later it turned out that the 64bit assembly language support was not a real hardware mode and the 64bit implementation was really emulated with 32bit instructions. Due to the new vertical layout of the data, one half of the operations needed for A5/1 were 32 times as fast as before (including the computation of the majority clocking, the computation of the keystream output and the computation of the feedback). The downside was that the A5/1 register shift, which was 3 hardware shift instructions in the old code, now had to be done with 64 * (2 and + 1 or instruction). Put another way, the virtual number of instructions per A5/1 engine per bit produced was N * 1/32 + 6, where N is around 20 (majority, output, feedback ops) and the 6 is 192 / 32. The speed of the bitslice code on nvidia hardware was 500M or 3 times faster than the old 4bit lookup code.

The fact that our bitslice implementation has plenty of independent instructions meant that we could use the 5 VLIW instruction slots of ati hardware up to nearly 100%. And when the algorithm permits this, then ATI hardware is faster than nvidia hardware. A single HD5870 produces around 1600M per second, and the 4870, comparable to the GTX280, would therefore produce 800M per second.

Optimization of the algorithm brought us from 32M to 500M, the port to ATI hardware to an estimated 800M and the availability of a new GPU generation in late 2009 to 1600M. Together with the updated assumption on the amount of known plaintext and the realization that 1/8 of the keyspace is enough if you do 250% more work per round in the table, our initial estimate of 240 GPU months now shrunk to 240 * 160M / 1600M / 8 * 2.5 = 8 GPU month. We are so fast now that it takes less time to generate the tables yourself if you have a HD5870 but less than 1Mbit of internet connectivity to download them.