The TMTO library in its current form uses NVIDIA GPUs for the chain generation. In NVIDIA terms, each chain is computed by one thread and currently the maximum number of threads per physical core is about 32. The chains are completely independent, but use the same memory for lookup tables, so that at some points they have to be executed serially. Depending on the particular algorithm, lookup tables still improve performance.

The library also uses  stxxl which is a library for external memory containers and algorithms, vulgo: STL with disk backend for handling of data sets that are too large to fit into ram. stxxl is only used for sorting.

The language is C++. The CUDA kernels are also written in C++. The different aspects of the table handling are implemented in seperate classes and combined to a single template class and properly inlined. That means that a cross product of those aspects, namely the algorithm implementations, the conditions, the round function conditions and the round function generators has to be made at compile time, because one cannot use virtual functions or function pointers on CUDA GPUs and for speed. The already implemented conditions and generators should suffice for a range of TMTO table concepts (hellman, distinguished point, oechslin rainbows and distinguished point/rainbow mixtures)

The basic mode of operation is a flow of chains from a work generator to a work consumer. In between the chains in progress are copied to the device, computed and copied back each kernel call. The interface to the device is generic, and the only requirements is that the device can provide information on the degree of paralellism and accept an array of values as input.

This slide shows how the table lookups are performed

Here you can see a list of all table lookups performed

This is the reduction function used

Attachments