A faster way to calculate the probability of a sequence?

This is a follow-up to my previous question, Faster way to add a multidimensional matrix? Following @Peter Cordes's advice, I inlined my code, and the speed increased by 50x. Then I ran gprof again and found that this function takes up most of the time.

 Each sample counts as 0.01 seconds.
   %   cumulative   self              self     total
  time   seconds   seconds    calls  Ts/call  Ts/call  name
  69.97      1.53      1.53                            cal_score(int, std::string, int const*, int, double)
 double cal_score(int l, string seq, const int *__restrict__ pw, int cluster, double alpha)
 {
     const int cols = 4;
     const int *__restrict__ pwcluster = pw + ((long)cluster) * l * cols;
     double score = 0;
     char s;
     string alphabet = "ACGT";

     int count = 0;
     for (int k = 0; k < cols; k++)
         count = count + pwcluster[k];

     for (int i = 0; i < l; i++) {
         long row_offset = cols * i;
         s = seq[i];
         //#pragma omp simd
         for (int k = 0; k < cols; k++) {
             if (s == alphabet[k])
                 score = score + log((pwcluster[row_offset + k] + alpha) / (count + 4 * alpha));
         }
     }
     return score;
 }

This is my first time doing code optimization, so I'm not sure how to go about it. Is there a better way to write this function so I can get even more speed? The input seq is a sequence of ACGT characters of length l. pw is a one-dimensional array of size 2 * l * 4, conceptually indexed as [p][q][r], and cluster is p.
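
In case the layout isn't obvious, here is a minimal sketch of the indexing as I intend it (the helper name is just for illustration, it is not in my real code):

 // Sketch: pw is conceptually pw[p][q][r] with p = cluster (0 or 1),
 // q = position in [0, l), r = base in [0, 4), flattened row-major.
 inline int pw_at(const int *pw, int l, int p, int q, int r) {
     return pw[((long)p * l + q) * 4 + r];   // matches pwcluster[row_offset + k] above
 }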

+2
c++ optimization string gprof
May 16 '16 at 12:31
2 answers

Here is another way to rewrite it. This translates the string with a lookup table instead of a search, and calls log() about a tenth as often.

This also changes seq to a const char* passed by reference, instead of a std::string passed by value (which would copy the entire string).

 #include <cstring>     // memset
 #include <cmath>       // log
 #include <algorithm>   // std::max

 unsigned char transTable[128];

 void InitTransTable(){
     memset(transTable, 0, sizeof(transTable));
     transTable['A'] = 0;
     transTable['C'] = 1;
     transTable['G'] = 2;
     transTable['T'] = 3;
 }

 static int tslen = 0;              // static instead of global lets the compiler keep tseq in a register inside the loop
 static unsigned char* tseq = NULL; // reusable buffer for translations.  Not thread-safe

 double cal_score(
     int l
   , const unsigned char* seq       // if you want to pass a std::string, do it by const &, not by value
   , const int *__restrict__ pw
   , int cluster
   , double alpha )
 {
     int i, j, k;

     // make sure tseq is big enough
     if (tseq == NULL){
         tslen = std::max(4096, l+1024);
         tseq = new unsigned char[tslen];
         memset(tseq, 0, tslen);
     } else if (l > tslen-1){
         delete[] tseq;
         tslen = l + 4096;
         tseq = new unsigned char[tslen];
         memset(tseq, 0, tslen);
     }

     // translate seq into tseq
     // (decrementing i so the beginning of tseq will be hot in cache when we're done)
     for (i = l; --i >= 0;)
         tseq[i] = transTable[seq[i]];

     const int cols = 4;
     const int *__restrict__ pwcluster = pw + ((long)cluster) * l * cols;
     double score = 0;

     // count up pwcluster
     int count = 0;
     for(k = 0; k < cols; k++) count += pwcluster[k];
     double count4alpha = (count + 4*alpha);

     long row_offset = 0;
     for (i = 0; i < l;){
         double product = 1;
         for (j = 0; j < 10 && i < l; j++, i++, row_offset += cols){
             k = tseq[i];
             product *= (pwcluster[row_offset + k] + alpha) / count4alpha;
         }
         score += log(product);
     }

     return score;
 }
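
If most call sites already hold a std::string, a thin forwarding overload keeps them unchanged while still avoiding the copy. This is an untested sketch, not part of the rewrite above:

 #include <string>

 // Sketch: overload taking const std::string& and forwarding to the pointer version
 double cal_score(int l, const std::string &seq,
                  const int *__restrict__ pw, int cluster, double alpha)
 {
     return cal_score(l, reinterpret_cast<const unsigned char*>(seq.data()),
                      pw, cluster, alpha);
 }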

The rewrite compiles into pretty good code, but without -ffast-math the division cannot be replaced by a multiplication.

It doesn't auto-vectorize, because we only load one of every four pwcluster elements.
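
If enabling -ffast-math isn't an option, one alternative (a sketch against the variables above, at the cost of slightly different rounding, which is exactly why the compiler won't do it on its own) is to hoist the reciprocal by hand:

 // Sketch: one divide up front instead of one per element
 double count4alpha_inv = 1.0 / (count + 4*alpha);
 for (j = 0; j < 10 && i < l; j++, i++, row_offset += cols){
     k = tseq[i];
     product *= (pwcluster[row_offset + k] + alpha) * count4alpha_inv;
 }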

+1
May 16 '16 at 19:03

I made some more improvements on top of Mike's good idea and code.

I also made a vectorized version (requiring SSE4.1). It most likely has bugs, but it's worth trying, because doing the multiplies with SIMD should give a significant speedup. Porting it to AVX should give another big boost.

See all the code on godbolt, including a vectorized conversion from ASCII to 0..3 bases (using a pshufb LUT).




My changes:

  • Don't translate ahead of time. The translation should overlap well with the FP loop's work, instead of making the FP loop wait for a tiny translation loop to finish before it can start.

  • Simplify the counter variables. (gcc makes better code this way: it was actually keeping j in a register rather than optimizing it away, or else fully unrolling the inner loop into a giant loop.)

  • Pull the scaling by (count + 4*alpha) out of the loop entirely: instead of dividing by it (or multiplying by its inverse), subtract its logarithm at the end. Since log() grows extremely slowly, we can probably defer this indefinitely without losing much precision in the final score. (See the sketch after this list.)

    The alternative would be to subtract only every N iterations, but then the loop would have to figure out whether it terminated early. At the very least, we could multiply by 1.0 / (count + 4*alpha) instead of dividing; without -ffast-math, the compiler can't do that for you.

  • Compute pwcluster in the caller: it probably computes it for its own use anyway, and we can remove one of the function args (cluster).

  • row_offset was making slightly worse code than just writing i*cols. If you like pointer increments as an alternative to array indexing, gcc makes even better code in the inner loop when incrementing pwcluster directly.

  • Rename l to len: single-letter variable names are bad style except in very small scopes (e.g. a loop, or a very small function that only does one thing), and even then only if there is no good short but meaningful name. For example, p is no more meaningful than ptr, but len tells you what it means, not just what it is.
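
To make the deferred-scaling point concrete, the rearrangement relies only on the log-of-a-product/quotient identities. A small untested sanity-check sketch (the function name is mine):

 #include <cmath>

 // log((x0+a)/c) + ... + log((xN+a)/c)  ==  log((x0+a) * ... * (xN+a)) - N*log(c)
 double check_deferred_log(const double *x, int n, double alpha, double c)
 {
     double direct = 0, batched = 0;
     for (int i = 0; i < n; i++) {
         direct  += std::log((x[i] + alpha) / c);   // divide + log on every element
         batched += std::log(x[i] + alpha);         // scaling deferred
     }
     batched -= n * std::log(c);                    // subtract the scale once
     return direct - batched;                       // ~0, up to rounding error
 }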




Further observations:

  • Storing sequences in the translated format throughout your program would be better for this and for any other code that wants to use a DNA base as an array index or counter.

    You can also vectorize the translation of nucleotide numbers (0..3) to/from ASCII using SSSE3 pshufb. (See my godbolt code.)

  • Storing your matrix in float instead of int might be better. Since your code spends most of its time in this function now, it would run faster if it didn't have to keep converting from int to double. On Haswell, cvtss2sd (single -> double) seems to have better throughput than cvtsi2sd (int -> double), but not on Skylake (ss2sd is slower on SKL than on HSW).

    Storing your matrix as double might be even faster, but the doubled cache footprint could be a killer. Doing this calculation with float instead of double would also avoid the conversion cost, but then you can't defer log() for as many iterations as with double.

  • Using multiple product variables (p1, p2, etc.) in a manually unrolled inner loop would expose more parallelism. Multiply them together at the end of the loop. (I ended up making a vectorized version with two vector accumulators; a scalar sketch follows this list.)

  • For Skylake, or maybe Broadwell, you could vectorize the pwcluster loads with VPGATHERDD. The vectorized translation from ASCII to 0..3 would be useful there.

  • Even without a gather instruction, loading two integers into a vector and using a packed conversion instruction would be good. The packed conversion instructions are faster than the scalar ones. We have a lot of multiplies to do, and we can certainly do two or four at once with SIMD vectors. See below.
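
A scalar sketch of the multiple-accumulator idea (untested; written as a drop-in for the inner loop of the simple version below):

 // Sketch: two product accumulators expose more multiply parallelism
 double p1 = 1, p2 = 1;
 int inner_bound = std::min(len, i + logdelay_factor);
 for ( ; i + 1 < inner_bound; i += 2) {
     unsigned int k0 = transTable[seq[i]];
     unsigned int k1 = transTable[seq[i+1]];
     p1 *= (pwcluster[ i   *cols + k0] + alpha);
     p2 *= (pwcluster[(i+1)*cols + k1] + alpha);
 }
 if (i < inner_bound) {                         // odd leftover element
     p1 *= (pwcluster[i*cols + transTable[seq[i]]] + alpha);
     i++;
 }
 score += log(p1 * p2);                         // combine, then one log()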




Simple version with my improvements:

See the full godbolt code, link at the top of this answer.

 double cal_score_simple(
     int len                              // one-letter variable names are only good in the smallest scopes, like a loop
   , const unsigned char* seq             // if you want to pass a std::string, do it by const &, not by value
   , const int *__restrict__ pwcluster    // have the caller do the address math for us, since it probably already does it anyway
   , double alpha )
 {
     // note that __restrict__ isn't needed because we don't write into any pointers
     const int cols = 4;
     const int logdelay_factor = 4;   // accumulate products for this many iterations before doing a log()

     int count = 0;    // count the first row of pwcluster
     for(int k = 0; k < cols; k++)
         count += pwcluster[k];

     const double log_c4a = log(count + 4*alpha);

     double score = 0;
     for (int i = 0; i < len;){
         double product = 1;
         int inner_bound = std::min(len, i+logdelay_factor);

         while (i < inner_bound){
             unsigned int k = transTable[seq[i]];          // translate on the fly
             product *= (pwcluster[i*cols + k] + alpha);   // * count4alpha_inverse;  // scaling deferred
             // TODO: unroll this with two or four product accumulators to allow parallelism
             i++;
         }
         score += log(product);  // - log_c4a * j;
     }

     score -= log_c4a * len;   // might be ok to defer this subtraction indefinitely, since log() accumulates very slowly
     return score;
 }
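
A call site would do the cluster address math itself, along these lines (a sketch; translated_seq stands for a buffer of 0..3 codes):

 // Sketch: the caller selects the cluster's block of rows
 const int cols = 4;
 const int *pwcluster = pw + (long)cluster * len * cols;
 double s = cal_score_simple(len, translated_seq, pwcluster, alpha);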

cal_score_simple compiles to pretty good asm, with a fairly compact inner loop:

 .L6:
     movzx   esi, BYTE PTR [rcx]                    # D.74129, MEM[base: _127, offset: 0B]
     vxorpd  xmm1, xmm1, xmm1                       # D.74130
     add     rcx, 1                                 # ivtmp.44,
     movzx   esi, BYTE PTR transTable[rsi]          # k, transTable
     add     esi, eax                               # D.74133, ivtmp.45
     add     eax, 4                                 # ivtmp.45,
     vcvtsi2sd  xmm1, xmm1, DWORD PTR [r12+rsi*4]   # D.74130, D.74130, *_38
     vaddsd  xmm1, xmm1, xmm2                       # D.74130, D.74130, alpha
     vmulsd  xmm0, xmm0, xmm1                       # product, product, D.74130
     cmp     eax, r8d                               # ivtmp.45, D.74132
     jne     .L6                                    #,

Using a pointer increment instead of indexing with i*cols removes one add from the loop, getting it down to 10 fused-domain uops (versus 11 in this loop). That doesn't matter for front-end throughput from the loop buffer, but it does mean fewer uops for the execution ports. Resource stalls can make that matter, even when total uop throughput isn't the immediate bottleneck.
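
For reference, the pointer-increment form of the inner loop would look roughly like this (a sketch in the style of cal_score_simple, not the exact code the asm above was generated from):

 // Sketch: bump pwcluster directly instead of computing i*cols every iteration
 const int *pwrow = pwcluster;
 for (int i = 0; i < len; ) {
     double product = 1;
     int inner_bound = std::min(len, i + logdelay_factor);
     for ( ; i < inner_bound; i++, pwrow += cols) {
         unsigned int k = transTable[seq[i]];
         product *= (pwrow[k] + alpha);
     }
     score += log(product);
 }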




Manually vectorized SSE version:

Not tested, and not written all that carefully. I could easily have made some mistakes here. If you're running this on machines with AVX, you should definitely make an AVX version: use vextractf128 as the first step of a horizontal product or sum, then the same as what I do here. (A rough sketch of that ending follows the code below.)

With a vectorized log() function that computes two (or four, with AVX) log() results in parallel in a vector, you could just do a horizontal sum at the end, instead of more frequent horizontal products before each scalar log(). I'm sure someone has written one, but I'm not going to take the time to search for it right now.

 #include <immintrin.h>
 // branch-prediction hint used below (GNU C); defined here so the snippet stands alone
 #define likely(x)  __builtin_expect(!!(x), 1)

 // TODO: AVX version
 double cal_score_SSE(
     int len                              // one-letter variable names are only good in the smallest scopes, like a loop
   , const unsigned char* seq             // if you want to pass a std::string, do it by const &, not by value
   , const int *__restrict__ pwcluster    // have the caller do the address math for us, since it probably already does it anyway
   , double alpha )
 {
     const int cols = 4;
     const int logdelay_factor = 16;   // accumulate products for this many iterations before doing a log()

     int count = 0;    // count the first row of pwcluster
     for(int k = 0; k < cols; k++)
         count += pwcluster[k];

     //const double count4alpha_inverse = 1.0 / (count + 4*alpha);
     const double log_c4a = log(count + 4*alpha);

 #define COUNTER_TYPE int

     //// HELPER FUNCTION: make a vector of two (pwcluster[i*cols + k] + alpha)
     auto lookup_two_doublevec = [&pwcluster, &seq, &alpha](COUNTER_TYPE pos) {
         unsigned int k0 = transTable[seq[pos]];
         unsigned int k1 = transTable[seq[pos+1]];
         __m128i pwvec = _mm_cvtsi32_si128( pwcluster[cols*pos + k0] );
         pwvec = _mm_insert_epi32(pwvec, pwcluster[cols*(pos+1) + k1], 1);
         // for AVX: repeat the previous lines, and _mm_unpack_epi32 into one __m128i,
         // then use _mm256_cvtepi32_pd (__m128i src)

         __m128d alphavec = _mm_set1_pd(alpha);
         return _mm_cvtepi32_pd(pwvec) + alphavec;
         //p1d = _mm_add_pd(p1d, _mm_set1_pd(alpha));
     };

     double score = 0;
     for (COUNTER_TYPE i = 0; i < len;){
         double product = 1;
         COUNTER_TYPE inner_bound = i+logdelay_factor;
         if (inner_bound >= len) inner_bound = len;
         // possibly do a whole vector of transTable translations; probably doesn't matter

         if (likely(inner_bound < len)) {
             // We can do 8 or 16 elements without checking the loop counter
             __m128d p1d = lookup_two_doublevec(i+0);
             __m128d p2d = lookup_two_doublevec(i+2);

             i += 4;  // start with four elements loaded into two vectors, not multiplied by anything
             static_assert(logdelay_factor % 4 == 0, "logdelay_factor must be a multiple of 4 for vectorization");

             while (i < inner_bound) {
                 // The *= syntax requires GNU C vector extensions, which is how __m128d is defined in gcc
                 p1d *= lookup_two_doublevec(i+0);
                 p2d *= lookup_two_doublevec(i+2);
                 i += 4;
             }
             // we have two vector accumulators, holding two products each
             p1d *= p2d;   // combine to one vector

             //p2d = _mm_permute_pd(p1d, 1);  // if you have AVX.  It's no better than movhlps, though.
             // movhlps p2d, p1d   // extract the high double, using p2d as a temporary
             p2d = _mm_castps_pd( _mm_movehl_ps(_mm_castpd_ps(p2d), _mm_castpd_ps(p1d) ) );

             p1d = _mm_mul_sd(p1d, p2d);   // multiply the last two elements, now that we have them extracted to separate vectors
             product = _mm_cvtsd_f64(p1d);
             // TODO: find a vectorized log() function for use here, and do a horizontal add down to a scalar outside the outer loop.
         } else {
             // Scalar for the last unknown number of iterations
             while (i < inner_bound){
                 unsigned int k = transTable[seq[i]];
                 product *= (pwcluster[i*cols + k] + alpha);   // * count4alpha_inverse;  // scaling deferred
                 i++;
             }
         }

         score += log(product);  // - log_c4a * j;  // deferred
     }

     score -= log_c4a * len;   // May be ok to defer this subtraction indefinitely, since log() accumulates very slowly
     // if not, subtract log_c4a * logdelay_factor in the vector part,
     // and (len & 15) * log_c4a out here at the end  (i.e. len % 16)
     return score;
 }
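
For the AVX port mentioned above, the horizontal-product ending could look roughly like this (untested sketch, assuming a single __m256d accumulator named acc, per the vextractf128 suggestion):

 __m128d hi  = _mm256_extractf128_pd(acc, 1);            // vextractf128: upper two doubles
 __m128d lo  = _mm256_castpd256_pd128(acc);              // lower two doubles (free, no instruction)
 __m128d p2  = _mm_mul_pd(lo, hi);                       // down to two partial products
 __m128d swp = _mm_castps_pd(_mm_movehl_ps(_mm_castpd_ps(p2), _mm_castpd_ps(p2)));
 double  product = _mm_cvtsd_f64(_mm_mul_sd(p2, swp));   // final scalar product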



Vectorized ASCII -> integer DNA base conversion

Ideally, do the translation once when you read in a sequence, and internally store sequences as arrays of 0/1/2/3 instead of ASCII A/C/G/T strings.
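
A minimal scalar sketch of translating once at load time, reusing Mike's transTable (the function name is mine):

 #include <string>
 #include <vector>

 // Sketch: keep sequences as 0..3 codes from the moment they are read in
 std::vector<unsigned char> translate_sequence(const std::string &ascii)
 {
     std::vector<unsigned char> codes(ascii.size());
     for (size_t i = 0; i < ascii.size(); ++i)
         codes[i] = transTable[(unsigned char)ascii[i]];
     return codes;
 }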

The translation can also be manually vectorized with pshufb, if we don't need to check for errors (invalid characters). In Mike's code, where we translate the whole input ahead of the FP loop, this could give a big speedup to that part of the code.

To translate "on the fly", we could use a vector to:

  • translate a block of 16 input characters in an outer loop,
  • store it to a 16-byte buffer,
  • then do scalar loads from that buffer in the inner loop.

Since gcc seems to fully unroll the vector loop, this would replace 16 movzx instructions with 6 vector instructions (including the load and store). (A sketch of this buffering follows the translation function below.)

 #include <immintrin.h>

 __m128i nucleotide_ASCII_to_number(__m128i input) {

     // map A->0, C->1, G->2, T->3 (lower-case too).
     //              ASCII                      (x>>1) & 0x0F, used as LUT index
     /*  'A' = 65 = 0b100 0001             >>1 : 0b10 0000
      *  'C' = 67 = 0b100 0011             >>1 : 0b10 0001
      *  'G' = 71 = 0b100 0111             >>1 : 0b10 0011
      *  'T' = 84 = 0b101 0100             >>1 : 0b10 1010   // same low 4 bits for lower-case
      *
      * We right-shift by one, mask, and use that as indices into a LUT.
      * We can use pshufb as a 4-bit LUT, to map all 16 chars in parallel.
      */
     __m128i LUT = _mm_set_epi8(0xff, 0xff, 0xff, 0xff, 0xff,    3, 0xff, 0xff,
                                0xff, 0xff, 0xff, 0xff,    2, 0xff,    1,    0);
     // Not all "bogus" characters map to 0xFF, but 0xFF in the output only happens on invalid input

     __m128i shifted = _mm_srli_epi32(input, 1);
     // And then mask, to emulate srli_epi8 (which doesn't exist)
     __m128i masked = _mm_and_si128(shifted, _mm_set1_epi8(0x0F));
     __m128i nucleotide_codes = _mm_shuffle_epi8(LUT, masked);
     return nucleotide_codes;
 }

 // compiles to:
     vmovdqa xmm1, XMMWORD PTR .LC2[rip]         # the lookup table
     vpsrld  xmm0, xmm0, 1                       # tmp96, input,
     vpand   xmm0, xmm0, XMMWORD PTR .LC1[rip]   # D.74111, tmp96,
     vpshufb xmm0, xmm1, xmm0                    # tmp100, tmp101, D.74111
     ret
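
The buffering described in the list above could be used roughly like this (untested sketch; in real code the final partial chunk would need a bounds-safe load instead of reading 16 bytes past the end):

 alignas(16) unsigned char tbuf[16];
 for (int i = 0; i < len; i += 16) {
     __m128i chunk = _mm_loadu_si128((const __m128i*)(seq + i));    // 16 ASCII bases
     _mm_store_si128((__m128i*)tbuf, nucleotide_ASCII_to_number(chunk));
     for (int j = 0; j < 16 && i + j < len; j++) {
         unsigned int k = tbuf[j];                                  // 0..3 code
         // ... FP work using pwcluster[(i+j)*cols + k], as in the scalar loop
     }
 }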
+1
May 17 '16 at 15:21


