How to speed up random number creation

I have code that does a lot of log, tan and cos operations on doubles in a tight loop. I need it to be as fast as possible. I am currently using code like

 #include <stdio.h>
 #include <stdlib.h>
 #include <math.h>
 #include "mtwist.h"

 int main(void)
 {
     int i;
     double x;
     double u1, u2, w1, w2;

     mt_seed();
     x = 0;
     for (i = 0; i < 100000000; ++i) {
         u1 = mt_drand();
         u2 = mt_drand();
         w1 = M_PI*(u1 - 1/2.0);
         w2 = -log(u2);
         x += tan(w1)*(M_PI_2 - w1) + log(w2*cos(w1)/(M_PI_2 - w1));
     }
     printf("%f\n", x);
     return EXIT_SUCCESS;
 }

I am using gcc.

There are two obvious ways to speed this up. The first is to choose a faster RNG. The second is to speed up the transcendental functions.
For this, I would like to know:

  • How are tan and cos implemented in x86 builds? My processor is an AMD FX-8350, if that matters. (The answer turned out to be fcos for cos and fptan for tan.)
  • How can a lookup table be used to speed up the calculations? I need only 32 bits of accuracy. Could you use a table of size 2^16, for example, to speed up tan and cos?

The Intel Optimization Guide reports:

If there is no critical need to evaluate the transcendental functions using the extended precision of 80 bits, applications should consider an alternative, software-based approach, such as a look-up-table-based algorithm using interpolation techniques. It is possible to improve transcendental performance with these techniques by choosing the desired numeric precision and the size of the look-up table, and by taking advantage of the parallelism of the SSE and SSE2 instructions.

According to this very useful table, fcos has a latency of 154 cycles and fptan a latency of 166-231 cycles.


You can compile my code using

 gcc -O3 -Wall random.c mtwist-1.5/mtwist.c -lm -o random

My C code uses the Mersenne Twister RNG C code from here. You can run my code to check it. If you cannot, please let me know.


Update: @rhashimoto's answer accelerated my code from 20 seconds to 6 seconds!

It seems the RNG is now what needs to be accelerated. However, in my tests, http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/index.html#dSFMT takes exactly the same amount of time (does anyone see otherwise?). If anyone can find a faster RNG (one that still passes all the usual statistical tests), I would be very grateful.

Please show real timings for any improvement you propose, as it really helps determine what does or does not work.

8 answers

You can try this log(x) replacement that I wrote using SSE2 intrinsics:

 #include <assert.h>
 #include <immintrin.h>

 static __m128i EXPONENT_MASK;
 static __m128i EXPONENT_BIAS;
 static __m128i EXPONENT_ZERO;
 static __m128d FIXED_SCALE;
 static __m128d LOG2ERECIP;
 static const int EXPONENT_SHIFT = 52;

 // Required to initialize constants.
 void sselog_init()
 {
     EXPONENT_MASK = _mm_set1_epi64x(0x7ff0000000000000UL);
     EXPONENT_BIAS = _mm_set1_epi64x(0x00000000000003ffUL);
     EXPONENT_ZERO = _mm_set1_epi64x(0x3ff0000000000000UL);
     FIXED_SCALE = _mm_set1_pd(9.31322574615478515625e-10);      // 2^-30
     LOG2ERECIP = _mm_set1_pd(0.693147180559945309417232121459); // 1/log2(e)
 }

 // Extract IEEE754 double exponent as integer.
 static inline __m128i extractExponent(__m128d x)
 {
     return _mm_sub_epi64(
         _mm_srli_epi64(
             _mm_and_si128(_mm_castpd_si128(x), EXPONENT_MASK),
             EXPONENT_SHIFT),
         EXPONENT_BIAS);
 }

 // Set IEEE754 double exponent to zero.
 static inline __m128d clearExponent(__m128d x)
 {
     return _mm_castsi128_pd(
         _mm_or_si128(
             _mm_andnot_si128(EXPONENT_MASK, _mm_castpd_si128(x)),
             EXPONENT_ZERO));
 }

 // Compute log(x) using SSE2 intrinsics to >= 30 bit precision, except denorms.
 double sselog(double x)
 {
     assert(x >= 2.22507385850720138309023271733e-308); // no denormalized

     // Two independent logarithms could be computed by initializing
     // base with two different values, either with independent
     // arguments to _mm_set_pd() or from contiguous memory with
     // _mm_load_pd(). No other changes should be needed other than to
     // extract both results at the end of the function (or just return
     // the packed __m128d).
     __m128d base = _mm_set_pd(x, x);
     __m128i iLog = extractExponent(base);
     __m128i fLog = _mm_setzero_si128();

     base = clearExponent(base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     // fLog = _mm_slli_epi64(fLog, 10); // Not needed first time through.
     fLog = _mm_or_si128(extractExponent(base), fLog);

     base = clearExponent(base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     fLog = _mm_slli_epi64(fLog, 10);
     fLog = _mm_or_si128(extractExponent(base), fLog);

     base = clearExponent(base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     base = _mm_mul_pd(base, base);
     fLog = _mm_slli_epi64(fLog, 10);
     fLog = _mm_or_si128(extractExponent(base), fLog);

     // No _mm_cvtepi64_pd() exists so use _mm_cvtepi32_pd() conversion.
     iLog = _mm_shuffle_epi32(iLog, 0x8);
     fLog = _mm_shuffle_epi32(fLog, 0x8);
     __m128d result = _mm_mul_pd(_mm_cvtepi32_pd(fLog), FIXED_SCALE);
     result = _mm_add_pd(result, _mm_cvtepi32_pd(iLog));

     // Convert from base 2 logarithm and extract result.
     result = _mm_mul_pd(result, LOG2ERECIP);
     return ((double *)&result)[0]; // other output in ((double *)&result)[1]
 }

The code implements the algorithm described in this Texas Instruments document: repeatedly square the argument and concatenate the exponent bits. It will not work with denormalized inputs. It provides at least 30 bits of accuracy.

This runs faster than log() on one of my machines but slower on another, so your mileage may vary; I am not claiming this is the best possible approach. However, this code actually computes two logarithms in parallel using both halves of the 128-bit SSE2 register (even though the function as written returns only one result), so it could be adapted into one building block of a SIMD computation of your entire expression (and I think log is the hard part, as cos is quite well behaved). In addition, your processor supports AVX, which packs 4 doubles into a 256-bit register, and extending this code to AVX should be straightforward.

If you decide not to go full SIMD, you can still use both logarithm slots by software pipelining - i.e., compute log(w2*cos(w1)/(M_PI_2-w1)) for the current iteration together with log(u2) for the next iteration.

Even if this function benchmarks slower than log() in isolation, it may still be worth measuring inside your actual program. This code does not touch the data cache at all, so it may coexist better with other code that does (for example, a cos that uses lookup tables). Also, micro-op dispatch may improve (or not) in combination with the surrounding code, depending on whether that code also uses SSE.

My other advice (repeated from comments):

  • Try -march=native -mtune=native to get gcc to optimize your processor.
  • Avoid calling both tan and cos on the same argument - use sincos or a trig identity.
  • Consider using a GPU (e.g., via OpenCL).

It seems better to calculate sin instead of cos - the reason is that you can use it for tan_w1 = sin_w1/sqrt(1.0 - sin_w1*sin_w1) . Using cos , which I originally proposed, loses the correct sign when calculating tan . And it looks like you can get good acceleration using the minimax polynomial over [-pi / 2, pi / 2], as the other responders said. The 7-term function in this link (make sure you got minimaxsin and not taylorsin ) seems to work very well.

So, here is your program rewritten with all-SSE2 transcendental approximations:

 #include <math.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <immintrin.h>
 #include "mtwist.h"

 #if defined(__AVX__)
 #define VECLEN 4
 #elif defined(__SSE2__)
 #define VECLEN 2
 #else
 #error // No SIMD available.
 #endif

 #if VECLEN == 4
 #define VBROADCAST(K) { K, K, K, K }
 typedef double vdouble __attribute__((vector_size(32)));
 typedef long vlong __attribute__((vector_size(32)));
 #elif VECLEN == 2
 #define VBROADCAST(K) { K, K }
 typedef double vdouble __attribute__((vector_size(16)));
 typedef long vlong __attribute__((vector_size(16)));
 #endif

 static const vdouble FALLBACK_THRESHOLD = VBROADCAST(1.0 - 0.001);

 vdouble sse_sin(vdouble x)
 {
     static const vdouble a0 = VBROADCAST(1.0);
     static const vdouble a1 = VBROADCAST(-1.666666666640169148537065260055e-1);
     static const vdouble a2 = VBROADCAST( 8.333333316490113523036717102793e-3);
     static const vdouble a3 = VBROADCAST(-1.984126600659171392655484413285e-4);
     static const vdouble a4 = VBROADCAST( 2.755690114917374804474016589137e-6);
     static const vdouble a5 = VBROADCAST(-2.502845227292692953118686710787e-8);
     static const vdouble a6 = VBROADCAST( 1.538730635926417598443354215485e-10);

     vdouble xx = x*x;
     return x*(a0 + xx*(a1 + xx*(a2 + xx*(a3 + xx*(a4 + xx*(a5 + xx*a6))))));
 }

 static inline vlong shiftRight(vlong x, int bits)
 {
 #if VECLEN == 4
     __m128i lo = (__m128i)_mm256_extractf128_si256((__m256i)x, 0);
     __m128i hi = (__m128i)_mm256_extractf128_si256((__m256i)x, 1);
     return (vlong)_mm256_insertf128_si256(
         _mm256_castsi128_si256(_mm_srli_epi64(lo, bits)),
         _mm_srli_epi64(hi, bits), 1);
 #elif VECLEN == 2
     return (vlong)_mm_srli_epi64((__m128i)x, bits);
 #endif
 }

 static inline vlong shiftLeft(vlong x, int bits)
 {
 #if VECLEN == 4
     __m128i lo = (__m128i)_mm256_extractf128_si256((__m256i)x, 0);
     __m128i hi = (__m128i)_mm256_extractf128_si256((__m256i)x, 1);
     return (vlong)_mm256_insertf128_si256(
         _mm256_castsi128_si256(_mm_slli_epi64(lo, bits)),
         _mm_slli_epi64(hi, bits), 1);
 #elif VECLEN == 2
     return (vlong)_mm_slli_epi64((__m128i)x, bits);
 #endif
 }

 static const vlong EXPONENT_MASK = VBROADCAST(0x7ff0000000000000L);
 static const vlong EXPONENT_BIAS = VBROADCAST(0x00000000000003ffL);
 static const vlong EXPONENT_ZERO = VBROADCAST(0x3ff0000000000000L);
 static const vdouble FIXED_SCALE = VBROADCAST(9.31322574615478515625e-10); // 2^-30
 static const vdouble LOG2ERECIP = VBROADCAST(0.6931471805599453094172);
 static const int EXPONENT_SHIFT = 52;

 // Extract IEEE754 double exponent as integer.
 static inline vlong extractExponent(vdouble x)
 {
     return shiftRight((vlong)x & EXPONENT_MASK, EXPONENT_SHIFT) - EXPONENT_BIAS;
 }

 // Set IEEE754 double exponent to zero.
 static inline vdouble clearExponent(vdouble x)
 {
     return (vdouble)(((vlong)x & ~EXPONENT_MASK) | EXPONENT_ZERO);
 }

 // Compute log(x) using SSE2 intrinsics to >= 30 bit precision, except denorms.
 vdouble sse_log(vdouble base)
 {
     vlong iLog = extractExponent(base);
     vlong fLog = VBROADCAST(0);

     base = clearExponent(base);
     base = base*base; base = base*base; base = base*base; base = base*base;
     base = base*base; base = base*base; base = base*base; base = base*base;
     base = base*base; base = base*base;
     fLog = shiftLeft(fLog, 10);
     fLog |= extractExponent(base);

     base = clearExponent(base);
     base = base*base; base = base*base; base = base*base; base = base*base;
     base = base*base; base = base*base; base = base*base; base = base*base;
     base = base*base; base = base*base;
     fLog = shiftLeft(fLog, 10);
     fLog |= extractExponent(base);

     base = clearExponent(base);
     base = base*base; base = base*base; base = base*base; base = base*base;
     base = base*base; base = base*base; base = base*base; base = base*base;
     base = base*base; base = base*base;
     fLog = shiftLeft(fLog, 10);
     fLog |= extractExponent(base);

     // No _mm_cvtepi64_pd() exists so use 32-bit conversion to double.
 #if VECLEN == 4
     __m128i iLogLo = _mm256_extractf128_si256((__m256i)iLog, 0);
     __m128i iLogHi = _mm256_extractf128_si256((__m256i)iLog, 1);
     iLogLo = _mm_srli_si128(_mm_shuffle_epi32(iLogLo, 0x80), 8);
     iLogHi = _mm_slli_si128(_mm_shuffle_epi32(iLogHi, 0x08), 8);
     __m128i fLogLo = _mm256_extractf128_si256((__m256i)fLog, 0);
     __m128i fLogHi = _mm256_extractf128_si256((__m256i)fLog, 1);
     fLogLo = _mm_srli_si128(_mm_shuffle_epi32(fLogLo, 0x80), 8);
     fLogHi = _mm_slli_si128(_mm_shuffle_epi32(fLogHi, 0x08), 8);
     vdouble result = _mm256_cvtepi32_pd(iLogHi | iLogLo)
                    + FIXED_SCALE*_mm256_cvtepi32_pd(fLogHi | fLogLo);
 #elif VECLEN == 2
     iLog = (vlong)_mm_shuffle_epi32((__m128i)iLog, 0x8);
     fLog = (vlong)_mm_shuffle_epi32((__m128i)fLog, 0x8);
     vdouble result = _mm_cvtepi32_pd((__m128i)iLog)
                    + FIXED_SCALE*_mm_cvtepi32_pd((__m128i)fLog);
 #endif

     // Convert from base 2 logarithm and extract result.
     return LOG2ERECIP*result;
 }

 // Original computation.
 double fallback(double u1, double u2)
 {
     double w1 = M_PI*(u1 - 1/2.0);
     double w2 = -log(u2);
     return tan(w1)*(M_PI_2 - w1) + log(w2*cos(w1)/(M_PI_2 - w1));
 }

 int main()
 {
     static const vdouble ZERO = VBROADCAST(0.0);
     static const vdouble ONE = VBROADCAST(1.0);
     static const vdouble ONE_HALF = VBROADCAST(0.5);
     static const vdouble PI = VBROADCAST(M_PI);
     static const vdouble PI_2 = VBROADCAST(M_PI_2);

     int i, j;
     vdouble x = ZERO;
     for (i = 0; i < 100000000; i += VECLEN) {
         vdouble u1, u2;
         for (j = 0; j < VECLEN; ++j) {
             ((double *)&u1)[j] = mt_drand();
             ((double *)&u2)[j] = mt_drand();
         }

         vdouble w1 = PI*(u1 - ONE_HALF);
         vdouble w2 = -sse_log(u2);

         vdouble sin_w1 = sse_sin(w1);
         vdouble sin2_w1 = sin_w1*sin_w1;
 #if VECLEN == 4
         int nearOne = _mm256_movemask_pd(sin2_w1 >= FALLBACK_THRESHOLD);
 #elif VECLEN == 2
         int nearOne = _mm_movemask_pd(sin2_w1 >= FALLBACK_THRESHOLD);
 #endif
         if (!nearOne) {
 #if VECLEN == 4
             vdouble cos_w1 = _mm256_sqrt_pd(ONE - sin2_w1);
 #elif VECLEN == 2
             vdouble cos_w1 = _mm_sqrt_pd(ONE - sin2_w1);
 #endif
             vdouble tan_w1 = sin_w1/cos_w1;
             x += tan_w1*(PI_2 - w1) + sse_log(w2*cos_w1/(PI_2 - w1));
         } else {
             vdouble result;
             for (j = 0; j < VECLEN; ++j)
                 ((double *)&result)[j] =
                     fallback(((double *)&u1)[j], ((double *)&u2)[j]);
             x += result;
         }
     }

     double sum = 0.0;
     for (i = 0; i < VECLEN; ++i)
         sum += ((double *)&x)[i];
     printf("%lf\n", sum);
     return 0;
 }

I ran into one annoying problem: the sin approximation error near ±pi/2 can put the value slightly outside [-1, 1], and that (1) breaks the tan calculation and (2) produces negative values when the log argument is close to 0. To avoid this, the code checks whether sin(w1)^2 is "close" to 1 and, if so, falls back to the original full double-precision path. The definition of "close" is FALLBACK_THRESHOLD at the top of the program - I arbitrarily set it to 0.999, which still returns values in the range of the OP's original program but has little effect on performance.

I have edited the code to use gcc-specific vector extensions for SIMD. If your compiler does not have these extensions, you can go back to the edit history. The code now uses AVX, processing 4 doubles at a time (instead of 2 with SSE2), if the compiler is allowed to generate it.

Results on my machine, without calling mt_seed() so the result is reproducible:

 Version     Time          Result
 original    14.653 secs   -1917488837.945067
 SSE          7.380 secs   -1917488837.396841
 AVX          6.271 secs   -1917488837.422882

It stands to reason that the SSE/AVX results differ from the original result due to the transcendental approximations. I think you should be able to tune FALLBACK_THRESHOLD to trade accuracy against speed. I am not sure why the SSE and AVX results differ slightly from each other.


You can rewrite

 tan(w1)*(M_PI_2-w1)+log(w2*cos(w1)/(M_PI_2-w1)) 

as

 tan(w1)*(M_PI_2-w1) + log(cos(w1)/(M_PI_2-w1)) + log(w2). 

Perhaps you can get away with minimax polynomials for the parts depending on w1 here. Make 64 of them or so, each covering 1/64th of the range, and you will probably only need degree 3 or 4.

You calculated w2 as

 w2 = -log(u2); 

for uniform u2 in (0,1). So you are really computing log(log(1/u2)). I bet you can use a similar trick to get piecewise polynomial approximations of log(log(1/x)) on chunks of (0,1). (The function behaves wildly near 0 and 1, so you may need to do something unusual there.)


I like @tmyklebu's suggestion of building a minimax approximation of the overall computation. There are several good tools for this, including the Remez function approximation toolkit.

You can do much better than MT for speed; see, for example, this Dr. Dobb's article: Fast, High-Quality, Parallel Random Number Generators.

Also consider ACML - the AMD Core Math Library - to take advantage of SSE and SSE2.


First, a small transformation. This is the original sum:

 for (i = 0; i < 100000000; ++i) {
     u1 = mt_drand();
     u2 = mt_drand();
     w1 = M_PI*(u1 - 1/2.0);
     w2 = -log(u2);
     x += tan(w1)*(M_PI_2 - w1) + log(w2*cos(w1)/(M_PI_2 - w1));
 }

This sum is mathematically equivalent:

 for (i = 0; i < 100000000; ++i) {
     u1 = M_PI - mt_drand()*M_PI;
     u2 = mt_drand();
     x += u1/tan(u1) + log(sin(u1)/u1) + log(-log(u2));
 }

And since replacing mt_drand() with 1.0 - mt_drand() yields the same distribution, we can simplify to u1 = mt_drand()*M_PI:

 for (i = 0; i < 100000000; ++i) {
     u1 = mt_drand()*M_PI;
     u2 = mt_drand();
     x += u1/tan(u1) + log(sin(u1)/u1) + log(-log(u2));
 }

So this splits neatly into two functions of one random variable each, which can be handled separately: x += f(u1) + g(u2). Both functions are quite smooth over most of their range: f is smooth enough for, say, u1 > 0.03, and 1/f is smooth enough for smaller values; g is quite smooth except for values close to 0 or 1. Thus we could use, say, 100 different approximations, one for each of the intervals [0..0.01], [0.01..0.02], and so on - except that choosing the right approximation on every call takes time.

To get around that: a uniform random variable on [0..1] will land a certain number of times in [0..0.01], another number of times in [0.01..0.02], and so on. I think you can calculate how many of the 100,000,000 random numbers fall into [0..0.01] by using a normal approximation to the binomial distribution. Then calculate how many of the remainder fall into [0.01..0.02], and so on. If you calculate that 999,123 numbers fall into [0.00..0.01], you then generate that many random numbers inside that interval and use the same approximation for all numbers in the interval.

To find an approximation of f(x) on the interval [0.33..0.34], as an example, you approximate f(0.335 + x/200) for x in [-1..1]. You will get quite good results by taking the interpolating polynomial of degree n at the Chebyshev nodes xk = cos(pi*(2k - 1)/(2n)).

By the way, the old x87 trigonometric and logarithm instructions are slow - nowhere near the speed of a low-degree polynomial approximation. And on sufficiently short intervals, you do not need a high-degree polynomial.

  • How does C compute sin () and other math functions?
  • Unrealistic. A table for 32 bits of precision (which implies you want fixed-point math, not double) would have to be (2^32)*4 bytes long. You could perhaps cut that down if your "32 bits of precision" refers to the output rather than the input (i.e., the input range from 0 to 2*pi is represented in fewer than 32 bits - which is the only way you could represent angles outside 0 to 2*pi anyway). Either way, that is beyond the address space of non-64-bit machines, and a huge amount of RAM regardless.

As you said, some transcendental functions like sine, cosine and tangent are available as x86 assembly instructions, and that is probably how the C library implements sin(), cos(), tan() and friends.

However, I played with these instructions a while back, redefining the functions as macros and removing every safety and error check to leave only the bare minimum. Testing against the C library, I remember my macro functions being quite a bit faster. Here is an example of my custom tangent function (forgive the Visual Studio assembly syntax):

 #define machine_tan_d(result, x)\
     __asm {\
         fld qword ptr [x]\
         fptan\
         fstp st(0)\
         fstp qword ptr [result]\
     }

So, if you are willing to make some assumptions, drop the error handling/checking, and make your code platform-specific, then you can squeeze out a few cycles with a macro function like mine.

Now, on the second topic, the lookup table: I would not be so sure it is faster just because you would be using integer operations. An integer table puts extra pressure on the data cache, which can lead to more frequent cache misses and worse execution times than the float operations. But of course, only careful profiling and benchmarking can settle that.


The processor probably implements tan() and cos() as native (hardwired or microcoded) instructions: FPTAN (8087 onwards) and FCOS (387 onwards) for x86/x87 (the "87" coming from the original math coprocessor, the Intel 8087).

Ideally, your toolchain should generate and execute native x87 instructions, namely FCOS and FPTAN (partial tangent). You can inspect the generated assembly by compiling with gcc's -S flag and searching for these instructions. If they are not there, check that you are passing flags that enable code generation for the correct processor submodel (or the closest available one) to gcc.

I do not believe there are SIMD instruction sets (MMX, SSE, 3DNow!, etc.) that handle functions like log(), tan(), cos(), so they are no (direct) help - but SIMD instructions are great for interpolating from previously computed results or from a table.

Another tack would be to try some of the math optimization options available in gcc. For example, -ffast-math, which can be dangerous if you do not understand its consequences. The rounding difference alone may be sufficient if the speed problem comes only from the difference between 80-bit extended-precision x87 arithmetic and standard 64-bit IEEE 754 doubles.

I do not expect that you can easily write a suitable and correct approximation for a 32-bit floating-point or fixed-point value and make it faster than the native FPU instructions.

As for approximating the elementary functions over short intervals, the minimax-polynomial approaches are the way to go, as tmyklebu and gnasher729 have suggested.

As @tmyklebu mentioned, the Remez algorithm is the standard tool for deriving such minimax approximations of the functions involved (log, cos, etc.), possibly in single precision.

For the theory, Jean-Michel Muller's book on the evaluation of elementary functions is the standard reference.

Hart's Computer Approximations (1968, reprinted 1978) is the classic collection of function approximations, and Jack Crenshaw's Math Toolkit treats the subject in a more practical, real-time-oriented way.

There is also a paper (PDF) on 32-bit (and better) approximations that can outperform the FPU instructions.


1) "It depends"... and it depends more on the compiler than on the chip architecture. 2) In the old days, it was popular to use CORDIC methods to implement trig functions: http://en.wikipedia.org/wiki/CORDIC


Source: https://habr.com/ru/post/969772/

