Changes between Version 2 and Version 3 of A5/1
- Timestamp: 03/03/10 14:20:23
A5/1
--- v2
+++ v3
 The A5/1 algorithm has these implementations:
+
+The implementation is specified as an argument to the device. This makes it
+possible to use different implementations concurrently.
+A practical application of this feature is during lookup, where the very last
+steps in the chain computation are handled by the sharedmem implementation,
+while the bulk of the chains are produced with the bitslice code.
 
 == NVIDIA CUDA ==
 
- * bitslice
+||implementation||concurrent chains per SMP||A5/1 rounds per second per chain||total number of rounds per second per SMP||
+||bitslice||2048||6800||14M||
+||bitslice2||4096||4500||18M||
+||sharedmem||256||20000||5M||
 
-currently not maintained:
+bitslice2 is only available on GT200 class GPUs and is the same as bitslice except that 2 times as many threads are started.
+The GT200 GPUs have twice as many registers per SMP.
+
+currently not maintained or for testing purposes only:
 
  * simple
- * sharedmem
  * mixedmem
  * interleaved
…
 == bitslice ==
 
-uses a vertical arrangement of the data.
+uses a vertical arrangement of the data. This implementation achieves the highest throughput.
+These options are given to the device option:
+For example:
+
+{{{ --device cuda:blocks=4:implementation=bitslice }}}
 
 blocks=integer::
   the number of blocks to use. should be the number of Streaming Multiprocessors (= 8 cores) for
   devices before Compute capability 1.0 and twice that number for more sophisticated hardware.
+  Best value chosen automatically.
 threads=integer::
   must be 128
 
-== simple ==
+== simple (for testing only) ==
 
 straightforward one bit at a time and very slow. do not use it.
…
 == sharedmem ==
 
-uses the shared memory of the GPU. this is the fastest for more than 1 or 2 blocks.
+uses the shared memory of the GPU. this is the fastest in terms of operations per second.
+These options can be given to the device option:
 
-== mixedmem ==
+blocks=integer::
+  number should be the number of SMPs
+threads=integer::
+  Defaults to 256. By changing this one can trade throughput for latency. Values should
+  be multiples of 32.
+
+== mixedmem (deprecated) ==
 
 uses both the shared memory and the global memory. this one is the fastest for a single block
 but suffers from memory bus shortage for any significant number of blocks.
 
-== interleaved ==
+== interleaved (deprecated) ==
 
 one can mix sharedmem and mixedmem, so that most cores use shared memory only and some use global
