A faster way to check if xmm / ymm is zero?

It would be nice if PTEST left the carry flag alone and only set the (rather inconvenient) ZF, but it in fact affects both CF and ZF: ZF is set if `dst AND src` is all zero, and CF is set if `src AND NOT dst` is all zero.

I came up with the following sequence to check a large number of values, but I am not happy with the poor running time.

              Latency / rThroughput
setup:
  xor eax,eax           ; na
  vpxor xmm0,xmm0,xmm0  ; na     ;all-zero mask for the ANDN part of ptest
work:
  vptest xmm0,xmm4  ; 3   1      ;CF = ((xmm4 AND NOT 0) == 0), i.e. CF=1 iff xmm4 is zero
  adc eax,eax       ; 1   1      ;shift CF into the bottom of eax
  vptest xmm0,xmm5  ; 3   1      ;same test for the next register
  adc eax,eax       ; 1   1      ;shift consecutive bits into eax

I want to end up with a bitmap of all non-zero registers in eax (obviously, I can collect multiple bitmaps in multiple general-purpose registers).

Thus, each test has a latency of 3 + 1 = 4 cycles.
Some of the tests can run in parallel by alternating between eax, ecx, etc.,
but it is still pretty slow.
Is there a faster way to do this?
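For concreteness, the desired end result can be pinned down with a tiny Python model (a hypothetical helper, not from the original post, just to define the bitmap's semantics):

```python
# Hypothetical model: each "register" is a 16-byte value; the goal is a
# bitmap in which bit i tells whether register i is non-zero.
def nonzero_bitmap(regs):
    """regs: list of 16-byte values (bytes objects)."""
    bitmap = 0
    for i, r in enumerate(regs):
        if any(r):                 # register has at least one non-zero byte
            bitmap |= 1 << i
    return bitmap

regs = [bytes(16), b"\x01" + bytes(15), bytes(16), b"\xff" * 16]
print(nonzero_bitmap(regs))  # registers 1 and 3 are non-zero -> 0b1010 = 10
```

Note that the adc sequence above shifts bits in from the bottom, so the first register tested ends up in the most significant position; the model just fixes which bit belongs to which register.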

Answer (+4):

You can test 8 xmm/ymm registers at once, at a cost of about 2 instructions per register.

In the code below, the "word origin" comments track which source register each word (or byte) of the intermediate result came from.

Your sequence pays 4 cycles of latency per register and bounces between the vector flags and the integer carry chain every time; on top of that, vptest itself decodes to 2 uops on Intel. The approach below instead does every per-register test with a cheap single-uop compare, keeps all the combining in the vector domain, and pays for one vpmovmskb (latency 3) only once at the very end.

The trick, at least on Intel, is to use PCMPEQ against an all-zero register (here PCMPEQQ, which writes an all-ones qword for every qword that compared equal, with single-cycle latency). A tree of blends then funnels the 8 compare results, xmm1..xmm8 each tested against a zeroed xmm0, down into one register, using xmm11..xmm14 as temporaries and finishing with a pblendvb whose constant byte-select mask lives in xmm15.

# test the 2 qwords in each vector against zero
vpcmpeqq xmm11, xmm1, xmm0
vpcmpeqq xmm12, xmm3, xmm0
vpcmpeqq xmm13, xmm5, xmm0
vpcmpeqq xmm14, xmm7, xmm0

# blend the results down into xmm10   word origin
vpblendw xmm10, xmm11, xmm12, 0xAA   # 3131 3131
vpblendw xmm13, xmm13, xmm14, 0xAA   # 7575 7575
vpblendw xmm10, xmm10, xmm13, 0xCC   # 7531 7531

# test the 2 qwords in each vector against zero
vpcmpeqq xmm11, xmm2, xmm0
vpcmpeqq xmm12, xmm4, xmm0
vpcmpeqq xmm13, xmm6, xmm0
vpcmpeqq xmm14, xmm8, xmm0

# blend the results down into xmm11   word origin
vpblendw xmm11, xmm11, xmm12, 0xAA   # 4242 4242
vpblendw xmm13, xmm13, xmm14, 0xAA   # 8686 8686
vpblendw xmm11, xmm11, xmm13, 0xCC   # 8642 8642

# blend xmm10 and xmm11 together into xmm10, byte-wise
#         origin bytes
# xmm10 77553311 77553311
# xmm11 88664422 88664422
# res   87654321 87654321 
vpblendvb xmm10, xmm10, xmm11, xmm15

# move the mask bits into eax
vpmovmskb eax, xmm10
and al, ah

After the compares, each qword of the result registers is all-ones where the corresponding source qword was zero. The blends then pack those 16 results (8 registers x 2 qwords) into xmm10 so that byte i of the low half holds the low-qword result of register i+1, and byte i of the high half holds its high-qword result. vpmovmskb moves the 16 byte-MSBs into eax, and `and al, ah` combines the low-qword and high-qword bits, leaving in al a bitmap whose bit i is set iff register i+1 is entirely zero.
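The blend network is easy to sanity-check with a byte-level emulation in Python (an illustrative model of the instructions used above, not real SIMD code; register lists play the role of xmm values):

```python
# Emulate the vector ops at byte level; each register is a list of 16 ints.
ZERO = [0] * 16

def pcmpeqq(a, b):
    """vpcmpeqq: per-qword compare, all-ones qword where equal."""
    out = []
    for q in range(2):
        eq = a[8*q:8*q+8] == b[8*q:8*q+8]
        out += [0xFF if eq else 0x00] * 8
    return out

def pblendw(a, b, imm):
    """vpblendw: word i comes from b where imm bit i is set, else from a."""
    return [b[j] if (imm >> (j // 2)) & 1 else a[j] for j in range(16)]

def pblendvb(a, b, mask):
    """vpblendvb: byte comes from b where the mask byte's MSB is set."""
    return [b[j] if mask[j] & 0x80 else a[j] for j in range(16)]

def pmovmskb(a):
    """vpmovmskb: collect the MSB of every byte."""
    return sum(((a[j] >> 7) & 1) << j for j in range(16))

def zero_bitmap(regs):                   # regs: 8 registers, xmm1..xmm8
    c = [pcmpeqq(r, ZERO) for r in regs]
    t10 = pblendw(c[0], c[2], 0xAA)      # 3131 3131
    t13 = pblendw(c[4], c[6], 0xAA)      # 7575 7575
    t10 = pblendw(t10, t13, 0xCC)        # 7531 7531
    t11 = pblendw(c[1], c[3], 0xAA)      # 4242 4242
    t13 = pblendw(c[5], c[7], 0xAA)      # 8686 8686
    t11 = pblendw(t11, t13, 0xCC)        # 8642 8642
    sel = [0x00, 0xFF] * 8               # odd bytes from t11 -> 8765 4321
    t10 = pblendvb(t10, t11, sel)
    m = pmovmskb(t10)
    return (m & 0xFF) & (m >> 8)         # and al, ah

regs = [[0]*16 for _ in range(8)]
regs[2][5] = 7                           # make xmm3 non-zero
print(bin(zero_bitmap(regs)))            # bit 2 clear: 0b11111011
```

With all 8 registers zero the function returns 0xFF; making one register non-zero clears exactly its bit, which confirms the byte lanes line up as the "word origin" comments claim.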

That is 16 vector instructions to handle 8 registers, i.e. 2 per register, with only one trip out of the vector domain at the end. The main caveat is that the 6 vpblendw instructions all compete for port 5 on Intel. The blends with immediate 0xCC operate at dword granularity, so they can be rewritten as VPBLENDD, which runs on any of p0/p1/p5.

One wrinkle remains: `and al, ah` is a partial-register operation, and reading the full eax afterwards can cost a merging uop on some Intel CPUs; a full-width shift-and-AND combine avoids that if it matters.

The same approach extends to ymm registers: vpmovmskb then delivers 32 bits, and the four 8-bit groups are ANDed together in eax the same way.
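Extending the model to 256-bit width, a hypothetical Python sketch of the four-way fold (assuming 32-byte registers with 4 qword lanes each, and a ymm vpmovmskb that yields 32 bits):

```python
# Hypothetical ymm-width model: 32-byte registers, 4 qword lanes each.
def ymm_zero_bitmap(regs):
    """regs: up to 8 registers as 32-byte values; returns a zero-register bitmap."""
    mask32 = 0
    for i, r in enumerate(regs):
        # Each qword that compares equal to zero contributes 8 set mask
        # bits; after the blends, register i owns byte lane i of each
        # 8-byte group of the movmskb result.
        for q in range(4):
            if r[8*q:8*q+8] == bytes(8):
                mask32 |= 1 << (8*q + i)
    # Fold the four 8-bit groups, like chaining the and of al/ah:
    return mask32 & (mask32 >> 8) & (mask32 >> 16) & (mask32 >> 24) & 0xFF

regs = [bytes(32) for _ in range(8)]
print(ymm_zero_bitmap(regs))  # all 8 registers zero -> 255
```

A register counts as zero only when all four of its qword lanes are zero, which is exactly what the four-way AND enforces.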

As an alternative, the final byte-wise blend and the partial-register AND can both be replaced with shifts and an OR:

;combine bytes of xmm10 and xmm11 together into xmm10, byte wise
; xmm10 77553311 77553311
; xmm11 88664422 88664422   before shift
; xmm10 07050301 07050301
; xmm11 80604020 80604020   after shift
;result 87654321 87654321   combined
vpsrlw xmm10,xmm10,8
vpsllw xmm11,xmm11,8
vpor xmm10,xmm10,xmm11

;combine the low and high qwords to make sure both halves of each result are zero
vpsrldq xmm12,xmm10,8      ;shift count is in bytes: move the high qword down
vpand xmm10,xmm10,xmm12
vpmovmskb eax,xmm10

This costs a couple of extra instructions, but it replaces both the vpblendvb (2 uops on Intel) and the partial-register `and al, ah`, and the high/low combine now happens in the vector domain before the vpmovmskb.
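A quick Python model of the shift/or combine (the helpers mimic vpsrlw/vpsllw/vpor/vpsrldq/vpand at byte level; the all-ones stand-in vectors represent the "all registers zero" case, so every mask bit should survive):

```python
def psrlw(a, n):   # shift each 16-bit word right by n bits
    out = []
    for j in range(0, 16, 2):
        w = (a[j] | (a[j+1] << 8)) >> n
        out += [w & 0xFF, (w >> 8) & 0xFF]
    return out

def psllw(a, n):   # shift each 16-bit word left by n bits
    out = []
    for j in range(0, 16, 2):
        w = ((a[j] | (a[j+1] << 8)) << n) & 0xFFFF
        out += [w & 0xFF, (w >> 8) & 0xFF]
    return out

def por(a, b):  return [x | y for x, y in zip(a, b)]
def pand(a, b): return [x & y for x, y in zip(a, b)]

def psrldq(a, nbytes):  # byte shift toward lane 0, zero-fill at the top
    return a[nbytes:] + [0] * nbytes

t10 = [0xFF] * 16          # stand-in for the "77553311" result vector
t11 = [0xFF] * 16          # stand-in for the "88664422" result vector
t10 = por(psrlw(t10, 8), psllw(t11, 8))   # interleave the result bytes
t10 = pand(t10, psrldq(t10, 8))           # AND low half with high half
mask = sum(((t10[j] >> 7) & 1) << j for j in range(16))
print(mask & 0xFF)         # all registers zero -> 255
```

After the vpsrldq/vpand pair the upper 8 mask bits are guaranteed zero, so only the low byte of the vpmovmskb result carries information.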


Answer (+6):

Note that on Skylake PTEST is 2 uops, so the vptest/adc sequence costs more than it looks. As an alternative to `adc eax, eax` for shifting CF into the result, `rcl eax, 1` also works, but according to Agner Fog's instruction tables it is 3 uops with a latency of 2 cycles on Intel, so adc is the better choice.

Source: https://habr.com/ru/post/1670177/

