Changes between Version 2 and Version 3 of A5/1

Timestamp:
03/03/10 14:20:23 (6 months ago)
Author:
sascha
Comment:

--

  • A5/1

    v2 v3  
    11The A5/1 algorithm has these implementations: 
     2 
     3The implementation is specified as an argument to the device. This makes it 
     4possible to use different implementations concurrently. 
     5A practical application of this feature is during lookup, where the very last 
     6steps in the chain computation are handled by the sharedmem implementation, 
     7while the bulk of the chains are produced with the bitslice code. 
    28 
    39== NVIDIA CUDA == 
    410 
    5  * bitslice 
     11||implementation||concurrent chains per SMP||number of A5/1 rounds per second||total number of rounds per second per SMP|| 
     12||bitslice||2048||6800||14M|| 
     13||bitslice2||4096||4500||18M|| 
     14||sharedmem||256||20000||5M|| 
    615 
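
The last column of the table appears to be the product of the two columns before it. The following minimal sketch checks that arithmetic. It assumes that "number of A5/1 rounds per second" is a per-chain rate, which the page does not state explicitly; the dictionary and function names are illustrative only and are not taken from the project.

```python
# Figures copied from the table: (concurrent chains per SMP, A5/1 rounds per second).
IMPLEMENTATIONS = {
    "bitslice": (2048, 6800),
    "bitslice2": (4096, 4500),
    "sharedmem": (256, 20000),
}


def total_rounds_per_second(chains: int, rounds_per_second: int) -> int:
    """Total rounds per second per SMP, assuming each chain advances at the listed rate."""
    return chains * rounds_per_second


totals = {
    name: total_rounds_per_second(chains, rate)
    for name, (chains, rate) in IMPLEMENTATIONS.items()
}

for name, total in totals.items():
    # Print in millions to compare against the rounded values in the table.
    print(f"{name}: {total / 1e6:.1f}M rounds/s per SMP")
```

Under that assumption the products round to the 14M, 18M and 5M given in the table, which is consistent with bitslice2 having the highest total throughput while sharedmem has the highest per-chain rate.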
    7 currently not maintained: 
     16bitslice2 is only available on GT200 class GPUs and is the same as bitslice except that 2 times as many threads are started. 
     17The GT200 GPUs have twice as many registers per SMP. 
     18 
     19currently not maintained or for testing purposes only: 
    820 
    921 * simple 
    10  * sharedmem 
    1122 * mixedmem 
    1223 * interleaved 
     
    1425== bitslice == 
    1526 
    16 uses a vertical arrangement of the data. 
     27uses a vertical arrangement of the data. This implementation achieves the highest throughput. 
     28Options are appended to the device argument, separated by colons. 
     29For example: 
     30 
     31{{{ --device cuda:blocks=4:implementation=bitslice }}} 
    1732 
    1833 blocks=integer:: 
    1934  the number of blocks to use. should be the number of Streaming Multiprocessors (= 8 cores) for 
    2035  devices before Compute capability 1.0 and twice that number for more sophisticated hardware. 
     36  Best value chosen automatically. 
    2137 threads=integer:: 
    2238  must be 128 
    2339 
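
To make the structure of the device argument concrete, here is a minimal sketch that splits a string such as `cuda:blocks=4:implementation=bitslice` into a backend name and a set of key/value options. The function `parse_device` is hypothetical and is not the project's actual parser; it only assumes the colon-separated `key=value` layout visible in the example above.

```python
def parse_device(spec: str) -> tuple[str, dict[str, str]]:
    """Split 'backend:key=value:key=value' into (backend, options).

    Hypothetical helper for illustration; the real tool may parse differently.
    """
    backend, *pairs = spec.split(":")
    options: dict[str, str] = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            # Reject fragments such as 'blocks' or '=4' that are not key=value.
            raise ValueError(f"malformed option: {pair!r}")
        options[key] = value
    return backend, options


backend, options = parse_device("cuda:blocks=4:implementation=bitslice")
print(backend, options)
```

Values are kept as strings here; converting `blocks` or `threads` to integers is left to the caller.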
    24 == simple == 
     40== simple (for testing only) == 
    2541 
    2642straightforward one bit at a time and very slow. do not use it. 
     
    2844== sharedmem == 
    2945 
    30 uses the shared memory of the GPU. this is the fastest for more than 1 or 2 blocks. 
     46uses the shared memory of the GPU. this is the fastest in terms of operations per second. 
     47These options can be given to the device option: 
    3148 
    32 == mixedmem == 
     49 blocks=integer:: 
     50  number should be the number of SMPs 
     51 threads=integer:: 
     52  Defaults to 256. By changing this one can trade throughput for latency. Values should 
     53  be multiples of 32. 
     54 
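
The constraint on `threads` for sharedmem can be expressed as a small check. This is a sketch only: `check_sharedmem_threads` is a hypothetical helper, and the page does not say whether the tool itself validates the value. The multiple of 32 matches the warp size of NVIDIA GPUs.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs
DEFAULT_THREADS = 256  # default stated for the sharedmem implementation


def check_sharedmem_threads(threads: int = DEFAULT_THREADS) -> int:
    """Return threads unchanged if it is a positive multiple of 32, else raise."""
    if threads <= 0 or threads % WARP_SIZE != 0:
        raise ValueError(f"threads must be a positive multiple of {WARP_SIZE}, got {threads}")
    return threads


print(check_sharedmem_threads())     # uses the default
print(check_sharedmem_threads(128))  # a smaller multiple of 32
```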
     55== mixedmem (deprecated) == 
    3356 
    3457uses both the shared memory and the global memory. this one is the fastest for a single block 
    3558but suffers from memory bus shortage for any significant number of blocks. 
    3659 
    37 == interleaved == 
     60== interleaved (deprecated) == 
    3861 
    3962one can mix sharedmem and mixedmem, so that most cores use shared memory only and some use global