I made some more improvements on top of Mike's good idea and code.
I also made a vectorized version (requiring SSE4.1). It's more likely to have bugs, but it's worth trying because you should get a significant speedup from doing the multiplies with packed instructions. Porting it to AVX should give another big speedup.
See all the code on godbolt, including a vectorized conversion from ASCII to 0..3 bases (using a pshufb LUT).
My changes:
- Translate on the fly instead of in advance: the translation work should overlap nicely with the FP loop's work, rather than making the FP work wait for a tiny translation loop to finish before it can start.
- Simplify the counter variables (gcc makes better code this way: it was actually keeping j in a register rather than optimizing it away, or else fully unrolling the inner loop into one giant loop).
- Pull the scaling by (count + 4*alpha) out of the loop entirely: instead of dividing by it (or multiplying by its inverse), subtract its logarithm. Since log() grows extremely slowly, we can probably defer this indefinitely without losing too much precision in the final score (see the small sketch after this list).
- An alternative would be to only subtract every N iterations, but then the loop would have to figure out whether it terminated early. At the very least we could multiply by 1.0 / (count + 4*alpha) instead of dividing; without -ffast-math, the compiler can't do that for you.
- Have the caller compute pwcluster for us: it probably computes it for its own use anyway, and we can drop one of the function args (cluster).
- row_offset made slightly worse code than just writing i*cols. If you like pointer increments as an alternative to array indexing, gcc makes even better code in the inner loop by incrementing pwcluster directly.
- Rename l to len: single-letter variable names are bad style except in very small scopes (e.g. a loop, or a very small function that only does one thing), and even then only when there is no good short but meaningful name. e.g. p is no more meaningful than ptr, but len tells you what it means, not just what it is.
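To make the deferred-scaling algebra concrete, here is a tiny self-contained sketch (with made-up numbers; c stands in for count + 4*alpha): dividing every factor by c inside the product gives the same score as subtracting len * log(c) once at the end.

#include <cassert>
#include <cmath>
#include <cstdio>

int main() {
    const double c = 123.0 + 4 * 0.5;                // stand-in for count + 4*alpha
    const double terms[] = { 7.5, 12.5, 3.5, 9.5 };  // stand-in for pwcluster[...] + alpha
    const int len = 4;

    double naive = 0, product = 1;
    for (int i = 0; i < len; i++) {
        naive += std::log(terms[i] / c);   // divide (or multiply by 1.0/c) every iteration
        product *= terms[i];               // deferred version: no scaling inside the loop
    }
    double deferred = std::log(product) - len * std::log(c);  // one subtraction at the end

    std::printf("%.15g %.15g\n", naive, deferred);   // agree to within rounding error
    assert(std::fabs(naive - deferred) < 1e-9);
}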
Further observations:
- Storing sequences in the translated format throughout your program would be better for this and any other code that wants to use a DNA base as an array index or counter (a simple one-time translation pass is sketched after this list).
- You can also vectorize the translation of nucleotide numbers (0..3) to/from ASCII using SSSE3 pshufb. (See my godbolt code.)
- Storing your matrix as float instead of int might be better: since your code spends most of its time in this function now, it would run faster if it didn't have to keep converting from int to float. On Haswell, cvtss2sd (single->double) apparently has better throughput than cvtsi2sd (int->double), but not on Skylake. (ss2sd is slower on SKL than on HSW.)
- Storing your matrix as double might be faster, but the doubled cache footprint could be a killer. Doing this calculation with float instead of double would also avoid the conversion cost. But you can defer log() for more iterations with double.
- Using multiple product variables (p1, p2, etc.) in a manually unrolled inner loop would expose more parallelism. Multiply them together at the end of the loop. (I ended up making a vectorized version with two vector accumulators.)
- On Skylake, or maybe Broadwell, you could vectorize the pwcluster lookups with VPGATHERDD. A vectorized translation from ASCII to 0..3 would be useful here.
- Even without using a gather instruction, loading two integers into a vector and using a packed conversion instruction would be good. Packed conversion instructions are faster than scalar ones. We have plenty of multiplies to do, and we can certainly do two or four at once with SIMD vectors. See below.
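To illustrate the first point above (keeping sequences in translated form program-wide), a one-time translation pass could look like the following. This is only a hypothetical sketch, not taken from the question's code: it assumes the usual A=0, C=1, G=2, T=3 mapping and a transTable shaped like the one the scoring function indexes.

#include <array>
#include <cstddef>
#include <string>
#include <vector>

// 256-entry ASCII -> 0..3 table (assumed mapping: A=0, C=1, G=2, T=3).
// Unknown characters silently map to 0; no error checking here.
static const std::array<unsigned char, 256> transTable = [] {
    std::array<unsigned char, 256> t{};
    t['A'] = 0; t['C'] = 1; t['G'] = 2; t['T'] = 3;
    return t;
}();

// Translate once when a sequence is read in; everything downstream then
// works with 0..3 values it can use directly as array indices.
static std::vector<unsigned char> translate_sequence(const std::string &ascii) {
    std::vector<unsigned char> out(ascii.size());
    for (std::size_t i = 0; i < ascii.size(); ++i)
        out[i] = transTable[(unsigned char)ascii[i]];
    return out;
}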
Simple version with my improvements:
See the full godbolt code, link at the top of this answer.
double cal_score_simple(
    int len                              // one-letter variable names are only good in the smallest scopes, like a loop
    , const unsigned char* seq           // if you want to pass a std::string, do it by const &, not by value
    , const int *__restrict__ pwcluster  // have the caller do the address math for us, since it probably already does it anyway
    , double alpha )
{
    // note that __restrict__ isn't needed because we don't write into any pointers
    const int cols = 4;
    const int logdelay_factor = 4;       // accumulate products for this many iterations before doing a log()

    int count = 0;      // count the first row of pwcluster
    for (int k = 0; k < cols; k++)
        count += pwcluster[k];
    const double log_c4a = log(count + 4*alpha);

    double score = 0;
    for (int i = 0; i < len;) {
        double product = 1;
        int inner_bound = std::min(len, i + logdelay_factor);

        while (i < inner_bound) {
            unsigned int k = transTable[seq[i]];          // translate on the fly
            product *= (pwcluster[i*cols + k] + alpha);   // * count4alpha_inverse;  // scaling deferred
            // TODO: unroll this with two or four product accumulators to allow parallelism
            i++;
        }
        score += log(product);   // - log_c4a * j;
    }

    score -= log_c4a * len;      // might be ok to defer this subtraction indefinitely, since log() accumulates very slowly
    return score;
}
This compiles to pretty good asm, with a fairly compact inner loop:
.L6:
    movzx   esi, BYTE PTR [rcx]             # D.74129, MEM[base: _127, offset: 0B]
    vxorpd  xmm1, xmm1, xmm1                # D.74130
    add     rcx, 1                          # ivtmp.44,
    movzx   esi, BYTE PTR transTable[rsi]   # k, transTable
    add     esi, eax                        # D.74133, ivtmp.45
    add     eax, 4                          # ivtmp.45,
    vcvtsi2sd xmm1, xmm1, DWORD PTR [r12+rsi*4]   # D.74130, D.74130, *_38
    vaddsd  xmm1, xmm1, xmm2                # D.74130, D.74130, alpha
    vmulsd  xmm0, xmm0, xmm1                # product, product, D.74130
    cmp     eax, r8d                        # ivtmp.45, D.74132
    jne     .L6                             #,
Using a pointer increment instead of indexing with i*cols removes one add from the loop, getting it down to 10 fused-domain uops (vs. 11 in this loop). So it doesn't matter for front-end throughput from the loop buffer, but it does mean fewer uops for the execution ports. Resource stalls can make that matter, even when total uop throughput isn't the bottleneck.
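For reference, the pointer-increment variant would change the inner loop to something like this (same behaviour, just advancing a row pointer instead of computing i*cols each iteration; a sketch, not taken from the godbolt code):

        const int *row = pwcluster + i*cols;        // row pointer for the current position
        while (i < inner_bound) {
            unsigned int k = transTable[seq[i]];
            product *= (row[k] + alpha);
            row += cols;                            // step to the next row instead of re-doing the index math
            i++;
        }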
Manually vectorized SSE version:
Not tested, and not written very carefully. I could easily have made some mistakes here. If you're running this on machines with AVX, you should definitely make an AVX version. Use vextractf128 as the first step of a horizontal product or sum, then the same as what I have here.
With a vectorized log() function that computes two (or four with AVX) log() results in parallel in a vector, you could just do a horizontal sum at the end, instead of more frequent horizontal products before each scalar log(). I'm sure someone has written one, but I'm not going to take the time to search for it right now.
// TODO: AVX version
double cal_score_SSE(
    int len                              // one-letter variable names are only good in the smallest scopes, like a loop
    , const unsigned char* seq           // if you want to pass a std::string, do it by const &, not by value
    , const int *__restrict__ pwcluster  // have the caller do the address math for us, since it probably already does it anyway
    , double alpha )
{
    const int cols = 4;
    const int logdelay_factor = 16;      // accumulate products for this many iterations before doing a log()

    int count = 0;      // count the first row of pwcluster
    for (int k = 0; k < cols; k++)
        count += pwcluster[k];

    //const double count4alpha_inverse = 1.0 / (count + 4*alpha);
    const double log_c4a = log(count + 4*alpha);
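The function body is cut off here; the full version is in the godbolt link at the top of this answer. Purely to illustrate the shape of such a loop (this is NOT the godbolt version: it only needs SSE2, keeps the pwcluster lookups scalar, and uses a single 2-lane product accumulator instead of two), a stripped-down helper could look like this:

#include <emmintrin.h>   // SSE2 intrinsics

extern const unsigned char transTable[256];   // ASCII -> 0..3, as in the scalar version

// Multiply the (count + alpha) factors for positions [i, inner_bound) together,
// two at a time, and return the scalar product of this chunk.
static double product_chunk_sse(const unsigned char *seq, const int *pwcluster,
                                double alpha, int i, int inner_bound, int cols)
{
    __m128d prod   = _mm_set1_pd(1.0);
    __m128d valpha = _mm_set1_pd(alpha);
    for (; i + 1 < inner_bound; i += 2) {
        int c0 = pwcluster[(i+0)*cols + transTable[seq[i+0]]];   // scalar table lookups
        int c1 = pwcluster[(i+1)*cols + transTable[seq[i+1]]];
        __m128d counts = _mm_cvtepi32_pd(_mm_set_epi32(0, 0, c1, c0));  // packed int->double
        prod = _mm_mul_pd(prod, _mm_add_pd(counts, valpha));
    }
    __m128d hi = _mm_unpackhi_pd(prod, prod);                    // horizontal product of the two lanes
    double p = _mm_cvtsd_f64(_mm_mul_sd(prod, hi));
    if (i < inner_bound)                                         // leftover odd element
        p *= pwcluster[i*cols + transTable[seq[i]]] + alpha;
    return p;
}

The caller would add log() of each chunk's product to the running score, exactly as in the scalar version.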
Vectorized ASCII -> integer base translation:
Ideally, do the translation once when you read in a sequence, and internally store it as arrays of 0/1/2/3 instead of ASCII A/C/G/T strings.
It can be manually vectorized with pshufb, if we don't need to check for errors (invalid characters). In Mike's code, where the whole input is translated ahead of the FP loop, this can give a big speedup to that part of the code.
To translate "on the fly" we can use the vector for:
- translate a block of 16 input characters into an external loop,
- save it to a 16-byte buffer,
- then scalar loads from this in the inner loop.
Since gcc seems to fully expand the vector loop, this will replace 16 movzx instructions with 6 vector instructions (including loading and saving).
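A minimal sketch of that pshufb translation (again an assumption-laden illustration, not the exact godbolt code): it assumes the A=0, C=1, G=2, T=3 mapping, does no error checking, and relies on 'A' (0x41), 'C' (0x43), 'G' (0x47) and 'T' (0x54) having distinct low nibbles, which pshufb uses as LUT indices when a byte's high bit is clear.

#include <tmmintrin.h>   // SSSE3, for _mm_shuffle_epi8 (pshufb)

// Translate 16 ASCII bases to 0..3 values in one shuffle.
// LUT is indexed by the low 4 bits of each byte: 1->'A'=0, 3->'C'=1, 7->'G'=2, 4->'T'=3.
static inline __m128i translate16(__m128i ascii)
{
    const __m128i lut = _mm_setr_epi8(0, 0, 0, 1,  3, 0, 0, 2,
                                      0, 0, 0, 0,  0, 0, 0, 0);
    return _mm_shuffle_epi8(lut, ascii);   // other index values are don't-care (no error checking)
}

In the outer loop you would load 16 bytes of the sequence, call this, and store the result to a small buffer that the scalar FP inner loop then reads, as described above.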