Critical LockBits Performance Code

I have a method that should be as fast as possible. It uses unsafe memory pointers, and it's my first foray into this kind of coding, so I know it can be made faster.

    /// <summary>
    /// Copies bitmap data from one bitmap to another at a specified point on the output bitmap
    /// </summary>
    /// <param name="sourcebtmpdata">The source bitmap; must be smaller than the dest bitmap</param>
    /// <param name="destbtmpdata"></param>
    /// <param name="point">The point on the destination bitmap to draw at</param>
    private static unsafe void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
    {
        //TODO: It is expected that the bitmap PixelFormat is Format24bppRgb but this could change in the future

        // calculate total number of rows to draw.
        var totalRow = Math.Min(
            destbtmpdata.Height - point.Y,
            sourcebtmpdata.Height);

        // loop through each row on the source bitmap and get mem pointers
        // to the source bitmap and dest bitmap
        for (int i = 0; i < totalRow; i++)
        {
            int destRow = point.Y + i;

            // get the pointer to the start of the current pixel "row" on the output image
            byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride);
            // get the pointer to the start of the FIRST pixel row on the source image
            byte* srcRowPtr = (byte*)sourcebtmpdata.Scan0 + (i * sourcebtmpdata.Stride);

            int pointX = point.X;
            // the rowSize is pre-computed before the loop to improve performance
            int rowSize = Math.Min(destbtmpdata.Width - pointX, sourcebtmpdata.Width);

            // for each row, set each pixel
            for (int j = 0; j < rowSize; j++)
            {
                int firstBlueByte = (pointX + j) * 3;
                int srcByte = j * 3;
                destRowPtr[firstBlueByte] = srcRowPtr[srcByte];
                destRowPtr[firstBlueByte + 1] = srcRowPtr[srcByte + 1];
                destRowPtr[firstBlueByte + 2] = srcRowPtr[srcByte + 2];
            }
        }
    }

So what can be done to make it faster? Ignore the TODO for now; I'll fix it later, once I have some baseline performance measurements.

UPDATE: Sorry, I should have mentioned that the reason I use this instead of Graphics.DrawImage is that I'm implementing multithreading, and because of that I cannot use DrawImage.

UPDATE 2: I'm still not happy with the performance, and I'm sure there are a few more ms that can be squeezed out.

+3
10 answers

There was something fundamentally wrong in the code, and I can't believe I hadn't noticed it.

 byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride); 

This gets a pointer to the destination row, but not to the column it is copying to; in the old code, the column offset was computed inside the rowSize loop. Now it looks like this:

 byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride) + pointX * 3; 

So now we have the correct pointer to the destination data. Using Vilx-'s and Rob's suggestions, we can also get rid of the inner loop. The code now looks like this:

    private static unsafe void CopyBitmapToDestSuperFast(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
    {
        // calculate total number of rows to copy.
        // using ternary operator instead of Math.Min, few ms faster
        int totalRows = (destbtmpdata.Height - point.Y < sourcebtmpdata.Height)
            ? destbtmpdata.Height - point.Y
            : sourcebtmpdata.Height;

        // calculate the width of the image to draw, this cuts off the image
        // if it goes past the width of the destination image
        int rowWidth = (destbtmpdata.Width - point.X < sourcebtmpdata.Width)
            ? destbtmpdata.Width - point.X
            : sourcebtmpdata.Width;

        // loop through each row on the source bitmap and get mem pointers
        // to the source bitmap and dest bitmap
        for (int i = 0; i < totalRows; i++)
        {
            int destRow = point.Y + i;

            // get the pointer to the start of the current pixel "row" and column on the output image
            byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride) + point.X * 3;
            // get the pointer to the start of the current pixel row on the source image
            byte* srcRowPtr = (byte*)sourcebtmpdata.Scan0 + (i * sourcebtmpdata.Stride);

            // RtlMoveMemory function
            CopyMemory(new IntPtr(destRowPtr), new IntPtr(srcRowPtr), (uint)rowWidth * 3);
        }
    }

Copying a 500×500 image into a 5000×5000 image on a grid 50 times took 00:00:07.9948993. With the changes above it now takes 00:00:01.8714263. Much better.

+4

Well... I'm not sure whether the .NET bitmap formats are fully compatible with the Windows GDI32 functions...

But one of the first few Win32 APIs that I learned was BitBlt:

    BOOL BitBlt(
        HDC hdcDest,
        int nXDest,
        int nYDest,
        int nWidth,
        int nHeight,
        HDC hdcSrc,
        int nXSrc,
        int nYSrc,
        DWORD dwRop
    );

And that was a quick way to copy data, if I remember correctly.

The P/Invoke signature for BitBlt for use from C#, along with the corresponding usage information, is great for anyone working with high-performance graphics in C#.

Definitely worth a look.

+2

The inner loop is where you want to concentrate your effort (but take measurements to make sure):

    for (int j = 0; j < sourcebtmpdata.Width; j++)
    {
        destRowPtr[(point.X + j) * 3] = srcRowPtr[j * 3];
        destRowPtr[((point.X + j) * 3) + 1] = srcRowPtr[(j * 3) + 1];
        destRowPtr[((point.X + j) * 3) + 2] = srcRowPtr[(j * 3) + 2];
    }
  • Get rid of the multiplications and the array indexing (which is a multiplication under the hood) and replace them with a pointer that you increment.

  • Same for the +1 and +2: fold them into the incremented pointer.

  • Your compiler probably won't keep recomputing point.X (check), but make it a local variable just in case. It won't matter for a single iteration, but it could if it happens every iteration.

+1

You could take a look at Eigen.

It is a C++ template library that uses the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code.

  • Fast. (See the benchmark.)

  • Expression templates allow temporaries to be removed intelligently and evaluation to be done lazily when appropriate; Eigen takes care of this automatically and handles aliasing in most cases.

  • Explicit vectorization is performed for the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code. Expression templates allow these optimizations to be applied globally to whole expressions.

  • With fixed-size objects, dynamic memory allocation is avoided, and loops are unrolled when it makes sense.

  • For large matrices, special attention is paid to cache friendliness.

You could implement your function in C++ and then call it from C#.

+1

You do not always need to use pointers to get good speed. This should be within a few ms of the original:

    private static void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
    {
        byte[] src = new byte[sourcebtmpdata.Height * sourcebtmpdata.Width * 3];
        int maximum = src.Length;
        byte[] dest = new byte[maximum];
        Marshal.Copy(sourcebtmpdata.Scan0, src, 0, src.Length);
        int pointX = point.X * 3;
        int copyLength = destbtmpdata.Width * 3 - pointX;
        int k = pointX + point.Y * sourcebtmpdata.Stride;
        int rowWidth = sourcebtmpdata.Stride;
        while (k < maximum)
        {
            Array.Copy(src, k, dest, k, copyLength);
            k += rowWidth;
        }
        Marshal.Copy(dest, 0, destbtmpdata.Scan0, dest.Length);
    }
+1

Unfortunately, I don't have time to write a complete solution, but I would consider using the platform's RtlMoveMemory() to move whole rows at once, rather than individual bytes. That should be much faster.

+1

I think the stride and the number of rows can be calculated in advance.

And I pre-calculated all the multiplications, which resulted in the following code:

    private static unsafe void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
    {
        //TODO: It is expected that the bitmap PixelFormat is Format24bppRgb but this could change in the future
        const int pixelSize = 3;

        // calculate total number of rows to draw.
        var totalRow = Math.Min(
            destbtmpdata.Height - point.Y,
            sourcebtmpdata.Height);

        var rowSize = Math.Min(
            (destbtmpdata.Width - point.X) * pixelSize,
            sourcebtmpdata.Width * pixelSize);

        // starting point of copy operation
        byte* srcPtr = (byte*)sourcebtmpdata.Scan0;
        byte* destPtr = (byte*)destbtmpdata.Scan0 + point.Y * destbtmpdata.Stride;

        // loop through each row
        for (int i = 0; i < totalRow; i++)
        {
            // draw the entire row (note: the column offset must be in bytes,
            // i.e. point.X * pixelSize, not point.X)
            for (int j = 0; j < rowSize; j++)
                destPtr[point.X * pixelSize + j] = srcPtr[j];

            // advance each pointer by 1 row
            destPtr += destbtmpdata.Stride;
            srcPtr += sourcebtmpdata.Stride;
        }
    }

I haven't tested it completely, but you should be able to get this to work.

I removed the multiplication operations from the loop (pre-calculated instead) and removed most of the branches so that it would be slightly faster.

Let me know if this helps :-)

0

I'm looking at your C# code and I can't recognize anything familiar; it all looks like a ton of C++. Incidentally, it looks like DirectX/XNA should be your new friend. Just my 2 cents. Don't kill the messenger.

If you have to rely on the CPU to do this: I've done some 24-bit pixel-format optimization myself, and I can tell you that memory access speed should be your bottleneck. Use SSE3 instructions for the fastest access. That means C++ and inline assembler. In pure C you will be about 30% slower on most machines.

Keep in mind that modern GPUs are much faster than CPUs in such operations.

0

I'm not sure whether this will gain you extra performance, but it's a pattern I see a lot in Reflector.

So:

    int srcByte = j * 3;
    destRowPtr[firstBlueByte] = srcRowPtr[srcByte];
    destRowPtr[firstBlueByte + 1] = srcRowPtr[srcByte + 1];
    destRowPtr[firstBlueByte + 2] = srcRowPtr[srcByte + 2];

becomes:

    *destRowPtr++ = *srcRowPtr++;
    *destRowPtr++ = *srcRowPtr++;
    *destRowPtr++ = *srcRowPtr++;

It probably needs a few more braces.

If the width is fixed, you could probably unroll the entire row into several hundred lines. :)

Update

You can also try using a larger type, such as Int32 or Int64, to improve performance.

0

Well, this is borderline in terms of how many ms you can squeeze out of the algorithm, but get rid of the Math.Min call and replace it with a ternary operator instead.

In general, making a library call takes longer than doing the work yourself, and I made a simple test driver to confirm this for Math.Min.

    using System;
    using System.Diagnostics;

    namespace TestDriver
    {
        class Program
        {
            static void Main(string[] args)
            {
                // Start the stopwatch
                if (Stopwatch.IsHighResolution)
                {
                    Console.WriteLine("Using high resolution timer");
                }
                else
                {
                    Console.WriteLine("High resolution timer unavailable");
                }

                // Test Math.Min for 10000 iterations
                Stopwatch sw = Stopwatch.StartNew();
                for (int ndx = 0; ndx < 10000; ndx++)
                {
                    int result = Math.Min(ndx, 5000);
                }
                Console.WriteLine(sw.Elapsed.TotalMilliseconds.ToString("0.0000"));

                // Test ternary operator for 10000 iterations
                sw = Stopwatch.StartNew();
                for (int ndx = 0; ndx < 10000; ndx++)
                {
                    int result = (ndx < 5000) ? ndx : 5000;
                }
                Console.WriteLine(sw.Elapsed.TotalMilliseconds.ToString("0.0000"));

                Console.ReadKey();
            }
        }
    }

The results from running the above on my computer (Intel T2400 @ 1.83 GHz) are below. Note that there is some variation in the results, but in general the ternary operator is faster by about 0.01 ms. That's not much, but it adds up over a sufficiently large data set.

    Using high resolution timer
    0.0539
    0.0402

0

Source: https://habr.com/ru/post/1335315/

