C# vectorized code with SIMD using Vector<T> is slower than a classic loop
I have seen several articles describing how Vector<T> is SIMD-enabled and implemented using JIT intrinsics, so the compiler will correctly emit AVX/SSE/... instructions when it is used, allowing much faster code than classic, linear loops (example here).
I decided to try rewriting one of my methods to see whether I could get a speedup, but so far I have not succeeded: the vectorized code runs 3 times slower than the original, and I'm not quite sure why. Here are two versions of a method that checks whether, in two Span<float> instances, every pair of elements at the same index lies on the same side of a threshold value.
// Classic implementation
public static unsafe bool MatchElementwiseThreshold(this Span<float> x1, Span<float> x2, float threshold)
{
    fixed (float* px1 = &x1.DangerousGetPinnableReference(), px2 = &x2.DangerousGetPinnableReference())
        for (int i = 0; i < x1.Length; i++)
            if (px1[i] > threshold != px2[i] > threshold)
                return false;
    return true;
}
// Vectorized
public static unsafe bool MatchElementwiseThresholdSIMD(this Span<float> x1, Span<float> x2, float threshold)
{
    // Setup the test vector
    int l = Vector<float>.Count;
    float* arr = stackalloc float[l];
    for (int i = 0; i < l; i++)
        arr[i] = threshold;
    Vector<float> cmp = Unsafe.Read<Vector<float>>(arr);
    fixed (float* px1 = &x1.DangerousGetPinnableReference(), px2 = &x2.DangerousGetPinnableReference())
    {
        // Iterate in chunks
        int
            div = x1.Length / l,
            mod = x1.Length % l,
            i = 0,
            offset = 0;
        for (; i < div; i += 1, offset += l)
        {
            Vector<float>
                v1 = Unsafe.Read<Vector<float>>(px1 + offset),
                v1cmp = Vector.GreaterThan<float>(v1, cmp),
                v2 = Unsafe.Read<Vector<float>>(px2 + offset),
                v2cmp = Vector.GreaterThan<float>(v2, cmp);
            float*
                pcmp1 = (float*)Unsafe.AsPointer(ref v1cmp),
                pcmp2 = (float*)Unsafe.AsPointer(ref v2cmp);
            for (int j = 0; j < l; j++)
                if (pcmp1[j] == 0 != (pcmp2[j] == 0))
                    return false;
        }
        // Test the remaining items, if any
        if (mod == 0) return true;
        for (i = x1.Length - mod; i < x1.Length; i++)
            if (px1[i] > threshold != px2[i] > threshold)
                return false;
    }
    return true;
}
As I said, I tested both versions using BenchmarkDotNet, and the one that uses Vector<T> runs 3 times slower than the other. I tried running the tests with spans of different lengths (from 100 to 2000), but the vectorized method is consistently much slower than the other one.
Did I miss something obvious here?
Thanks!
public static bool MatchElementwiseThresholdSIMD(ReadOnlySpan<float> x1, ReadOnlySpan<float> x2, float threshold)
{
    if (x1.Length != x2.Length) throw new ArgumentException("x1.Length != x2.Length");
    if (Vector.IsHardwareAccelerated)
    {
        var vx1 = x1.NonPortableCast<float, Vector<float>>();
        var vx2 = x2.NonPortableCast<float, Vector<float>>();
        var vthreshold = new Vector<float>(threshold);
        for (int i = 0; i < vx1.Length; ++i)
        {
            var v1cmp = Vector.GreaterThan(vx1[i], vthreshold);
            var v2cmp = Vector.GreaterThan(vx2[i], vthreshold);
            if (Vector.Xor(v1cmp, v2cmp) != Vector<int>.Zero)
                return false;
        }
        x1 = x1.Slice(Vector<float>.Count * vx1.Length);
        x2 = x2.Slice(Vector<float>.Count * vx2.Length);
    }
    for (var i = 0; i < x1.Length; i++)
        if (x1[i] > threshold != x2[i] > threshold)
            return false;
    return true;
}
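The listing above relies on NonPortableCast, which as far as I know comes from an early version of the Span API. The following is only a minimal sketch of the same idea, assuming a current .NET runtime where MemoryMarshal.Cast can reinterpret a ReadOnlySpan<float> as a ReadOnlySpan<Vector<float>>. The class name ThresholdMatch and the method names MatchScalar and MatchSimd are hypothetical and not part of the original post. Compared with the vectorized version in the question, the point is that the two comparison masks are compared as whole vectors instead of being read back lane by lane.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

// Hypothetical helper class, for illustration only.
public static class ThresholdMatch
{
    // Scalar reference: true if every pair of elements lies on the same side of the threshold.
    public static bool MatchScalar(ReadOnlySpan<float> x1, ReadOnlySpan<float> x2, float threshold)
    {
        if (x1.Length != x2.Length) throw new ArgumentException("x1.Length != x2.Length");
        for (int i = 0; i < x1.Length; i++)
            if ((x1[i] > threshold) != (x2[i] > threshold))
                return false;
        return true;
    }

    // Vectorized sketch: processes Vector<float>.Count elements per iteration, then the tail.
    public static bool MatchSimd(ReadOnlySpan<float> x1, ReadOnlySpan<float> x2, float threshold)
    {
        if (x1.Length != x2.Length) throw new ArgumentException("x1.Length != x2.Length");

        // Assumption: MemoryMarshal.Cast stands in for NonPortableCast here.
        // Elements that do not fill a whole vector are left out of the cast spans.
        ReadOnlySpan<Vector<float>> v1 = MemoryMarshal.Cast<float, Vector<float>>(x1);
        ReadOnlySpan<Vector<float>> v2 = MemoryMarshal.Cast<float, Vector<float>>(x2);
        var vthreshold = new Vector<float>(threshold);

        for (int i = 0; i < v1.Length; i++)
        {
            // Each lane of a mask is all ones where the element is above the threshold, else zero.
            var mask1 = Vector.GreaterThan(v1[i], vthreshold);
            var mask2 = Vector.GreaterThan(v2[i], vthreshold);
            // Compare the masks as whole vectors, without a per-lane loop.
            if (!Vector.EqualsAll(mask1, mask2))
                return false;
        }

        // Handle the remaining elements with the scalar loop.
        int done = v1.Length * Vector<float>.Count;
        return MatchScalar(x1.Slice(done), x2.Slice(done), threshold);
    }
}
```

Both methods accept arrays directly, for example ThresholdMatch.MatchSimd(a, b, 0.5f) with two float[] of equal length, since float[] converts implicitly to ReadOnlySpan<float>. Whether this is actually faster than the scalar loop depends on the runtime and hardware, so it should be measured rather than assumed.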
public static bool MatchElementwiseThreshold<T>(ReadOnlySpan<T> x1, ReadOnlySpan<T> x2, T threshold)
    where T : struct
{
    if (x1.Length != x2.Length)
        throw new ArgumentException("x1.Length != x2.Length");
    if (Vector.IsHardwareAccelerated)
    {
        var vx1 = x1.NonPortableCast<T, Vector<T>>();
        var vx2 = x2.NonPortableCast<T, Vector<T>>();
        var vthreshold = new Vector<T>(threshold);
        for (int i = 0; i < vx1.Length; ++i)
        {
            var v1cmp = Vector.GreaterThan(vx1[i], vthreshold);
            var v2cmp = Vector.GreaterThan(vx2[i], vthreshold);
            if (Vector.AsVectorInt32(Vector.Xor(v1cmp, v2cmp)) != Vector<int>.Zero)
                return false;
        }
        // Slice the spans to handle the remaining elements
        x1 = x1.Slice(Vector<T>.Count * vx1.Length);
        x2 = x2.Slice(Vector<T>.Count * vx1.Length);
    }
    var comparer = System.Collections.Generic.Comparer<T>.Default;
    for (int i = 0; i < x1.Length; i++)
        if ((comparer.Compare(x1[i], threshold) > 0) != (comparer.Compare(x2[i], threshold) > 0))
            return false;
    return true;
}
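Both listings branch on Vector.IsHardwareAccelerated, so when comparing timings it may help to confirm which path actually runs in the benchmarked process. The following is a small diagnostic sketch. The class SimdInfo and its Describe method are hypothetical names introduced here for illustration, while the properties they read are standard .NET APIs.

```csharp
using System;
using System.Numerics;

// Hypothetical diagnostic helper, for illustration only.
public static class SimdInfo
{
    // Reports whether Vector<T> operations are hardware accelerated in this process,
    // how many floats fit in one Vector<float>, and whether the process is 64-bit.
    public static string Describe()
    {
        return $"IsHardwareAccelerated={Vector.IsHardwareAccelerated}, " +
               $"Vector<float>.Count={Vector<float>.Count}, " +
               $"Is64BitProcess={Environment.Is64BitProcess}";
    }
}
```

Calling Console.WriteLine(SimdInfo.Describe()) from the same build configuration and process as the benchmark shows whether the vectorized branch is taken at all. The values depend on the machine and runtime.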
See also the SIMD-enabled vector types, such as System.Numerics.Vector2:
https://docs.microsoft.com/en-us/dotnet/standard/numerics#simd-enabled-vector-types