Looking for ways to improve code speed

I have a camera that streams video at 720p and 24 fps. I am trying to capture this stream in code and eventually build a video from it, dumping compressed JPEGs into an MJPEG or the like. The problem I am facing is that the code below is not fast enough to produce a frame every 1/24 of a second, i.e. roughly 0.04 seconds per image.

Using

 Stopwatch();

I found that each iteration of the inner loop takes .000000000022 seconds.

Each iteration of the outer loop takes .0000077 seconds.

And the whole function, from start to saving the image, takes 21 seconds per call.

Extrapolating from the inner loop to a complete image:

 .000000000022 x 640 = .00000001408 seconds
 .00000001408 x 360 = .0000050688 seconds

Extrapolating from the outer loop to a complete image:

 .0000077 x 360 = .002772 seconds 

If an image really took the times measured above I would be well within budget, but the routine as a whole takes 21 seconds to complete.

 temp_byte1 = main_byte1;
 temp_byte2 = main_byte2;
 timer1.Reset();
 timer1.Start();

 // destination bitmap and pointers into its locked pixel buffer
 Bitmap mybmp = new Bitmap(1280, 720);
 BitmapData BPD = mybmp.LockBits(new Rectangle(0, 0, 1280, 720), ImageLockMode.WriteOnly, mybmp.PixelFormat);
 IntPtr xptr = BPD.Scan0;                                   // top half (rows 0..359)
 IntPtr yptr = BPD.Scan0;
 yptr = new IntPtr(yptr.ToInt64() + (1280 * 720 * 2));      // bottom half (rows 360..719)
 int bytes = Math.Abs(BPD.Stride);

 byte[][] rgb = new byte[720][];
 int Y1, Y2, Y3, Y4, Y5, Y6, Y7, Y8;
 int U1, U2, V1, V2, U3, U4, V3, V4;

 for (int one = 0; one < 360; one++)
 {
     timer2.Reset();
     timer2.Start();
     rgb[one] = new byte[bytes];
     rgb[360 + one] = new byte[bytes];

     for (int two = 0; two < 640; two++)
     {
         timer3.Reset();
         timer3.Start();

         // one YUV422 macropixel = U, Y, V, Y (4 bytes -> 2 output pixels)
         U1 = temp_byte1[one * 2560 + 4 * two + 0];
         Y1 = temp_byte1[one * 2560 + 4 * two + 1];
         V1 = temp_byte1[one * 2560 + 4 * two + 2];
         Y2 = temp_byte1[one * 2560 + 4 * two + 3];

         U2 = temp_byte2[one * 2560 + 4 * two + 0];
         Y3 = temp_byte2[one * 2560 + 4 * two + 1];
         V2 = temp_byte2[one * 2560 + 4 * two + 2];
         Y4 = temp_byte2[one * 2560 + 4 * two + 3];

         RGB_Conversion(Y1, U1, V1, two * 8 + 0, rgb[one]);
         RGB_Conversion(Y2, U1, V1, two * 8 + 4, rgb[one]);
         RGB_Conversion(Y3, U2, V2, two * 8 + 0, rgb[360 + one]);
         RGB_Conversion(Y4, U2, V2, two * 8 + 4, rgb[360 + one]);

         timer3.Stop();
         timer3_[two] = timer3.Elapsed;
     }

     // copy the finished rows into the locked bitmap
     Marshal.Copy(rgb[one], 0, xptr, 5120);
     xptr = new IntPtr(xptr.ToInt64() + 5120);
     Marshal.Copy(rgb[360 + one], 0, yptr, 5120);
     yptr = new IntPtr(yptr.ToInt64() + 5120);

     timer2.Stop();
     timer2_[one] = timer2.Elapsed;
 }

 mybmp.UnlockBits(BPD);
 mybmp.Save(GetDateTimeString("IP Pictures") + ".jpg", ImageFormat.Jpeg);

The code works and converts the incoming YUV422 byte arrays into a full-size JPEG, but I cannot understand why there is such a discrepancy between the measured loop times and the total run time.

I moved

 byte[][] rgb = new byte[720][];
 rgb[x] = new byte[bytes];

to global scope, initialized once when the program starts instead of on every function call; it did not produce a measurable increase in speed.

UPDATE

RGB_Conversion: accepts Y, U, V values, converts them to RGB, and writes the result into the supplied byte array:

 public void RGB_Conversion(int Y, int U, int V, int MULT, byte[] rgb)
 {
     int C, D, E;
     int R, G, B;

     // create the params for rgb conversion
     C = Y - 16;
     D = U - 128;
     E = V - 128;

     //R = clamp((298 x C + 409 x E + 128)>>8)
     //G = clamp((298 x C - 100 x D - 208 x E + 128)>>8)
     //B = clamp((298 x C + 516 x D + 128)>>8)
     R = (298 * C + 409 * E + 128) / 256;
     G = (298 * C - 100 * D - 208 * E + 128) / 256;
     B = (298 * C + 516 * D + 128) / 256;

     if (R > 255) R = 255;
     if (R < 0) R = 0;
     if (G > 255) G = 255;
     if (G < 0) G = 0;
     if (B > 255) B = 255;
     if (B < 0) B = 0;

     rgb[MULT + 3] = 255;
     rgb[MULT + 0] = (byte)B;
     rgb[MULT + 1] = (byte)G;
     rgb[MULT + 2] = (byte)R;
 }
+6
7 answers

First:

You need to remove the Start/Stop calls and the stopwatches from inside the loops.

Resetting a stopwatch 640 times in a tight loop will distort the numbers. It is better to use a profiler, or to measure at a much coarser granularity.

In addition, the presence of these calls can interfere with compiler optimizations (loop tiling and loop reversal look like very good candidates here, but the JITter may not be able to apply them, since registers have to be given up to invoke the stopwatch methods).
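As an aside, here is a minimal sketch of what coarse-grained measurement could look like: one Stopwatch around the whole frame instead of per-pixel timers. ConvertFrame is a placeholder, not a method from the question.

 using System;
 using System.Diagnostics;

 static class CoarseTiming
 {
     public static TimeSpan MeasureFrame(Action convertFrame)
     {
         var sw = Stopwatch.StartNew();
         convertFrame();              // the full YUV -> RGB -> save pass, run once
         sw.Stop();
         return sw.Elapsed;           // compare against the ~0.04 s frame budget
     }
 }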

Data Structures:

I have a feeling you should be able to use a "flat" data structure instead of creating all those jagged arrays. However, I don't know which API you are feeding the data into, and you haven't specified it in much detail.

I also feel that making RGB_Conversion simply return the RGB components, instead of having it write into an array, could really give the compiler an edge to optimize things.

Other thoughts:

  • Look at RGB_Conversion (where/how is it defined?). Perhaps you can inline it.

  • Use an unchecked block so the array index arithmetic is not checked for overflow.

  • Consider using /unsafe code to avoid bounds checking; a rough sketch of both ideas follows.
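For illustration only (not the poster's code), here is an unchecked/unsafe variant of one row of the conversion. It assumes the same UYVY layout and 2560-byte input rows as the question, requires compiling with /unsafe, and the method names are made up.

 // Rough sketch: raw pointers avoid array bounds checks, and the unchecked
 // block removes overflow checks on the index arithmetic.
 static unsafe class Yuv422RowUnsafe
 {
     public static void ConvertRow(byte[] src, byte[] dst, int row)
     {
         fixed (byte* pSrc = src)
         fixed (byte* pDst = dst)
         {
             byte* s = pSrc + row * 2560;     // 640 macropixels * 4 bytes (U Y V Y)
             byte* d = pDst;
             unchecked
             {
                 for (int x = 0; x < 640; x++)
                 {
                     int u = s[0] - 128, y0 = s[1] - 16, v = s[2] - 128, y1 = s[3] - 16;
                     WritePixel(d, y0, u, v);
                     WritePixel(d + 4, y1, u, v);
                     s += 4;
                     d += 8;
                 }
             }
         }
     }

     static void WritePixel(byte* d, int c, int du, int ev)
     {
         // same formulas as RGB_Conversion in the question
         int r = (298 * c + 409 * ev + 128) >> 8;
         int g = (298 * c - 100 * du - 208 * ev + 128) >> 8;
         int b = (298 * c + 516 * du + 128) >> 8;
         d[0] = (byte)(b < 0 ? 0 : b > 255 ? 255 : b);   // B
         d[1] = (byte)(g < 0 ? 0 : g > 255 ? 255 : g);   // G
         d[2] = (byte)(r < 0 ? 0 : r > 255 ? 255 : r);   // R
         d[3] = 255;                                     // A
     }
 }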

+3

There are tons of things you can do:

  • Remove the "new" allocations from the outer loop.
  • Pre-allocate and pin all the buffers.
  • Get rid of Marshal.Copy and replace it with either an unsafe word-by-word copy or the Win32 RtlCopyMemory.
  • Inline RGB_Conversion.
  • Do not construct a new IntPtr in the outer loop; simply increment a pointer into the pinned buffer (see the sketch below).

I am sure there is more, but this is what I saw at first glance. I think you would be better off restructuring or rewriting the whole routine, perhaps even rewriting it as a C++ .NET library, or at the very least using unsafe code in the current version to avoid the .NET overhead.
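A rough sketch of the pin-and-increment idea, not a drop-in replacement: rgbRows and the BitmapData stand in for the question's variables, the 5120-byte row size comes from the question, and Buffer.MemoryCopy assumes .NET 4.6 or later.

 using System.Drawing.Imaging;

 static class PinnedRowCopy
 {
     // Copy each finished RGB row into the locked bitmap by walking one raw
     // pointer instead of building a new IntPtr per row and calling Marshal.Copy.
     public static unsafe void CopyRows(byte[][] rgbRows, BitmapData bpd)
     {
         byte* dst = (byte*)bpd.Scan0;            // start of the locked pixel buffer
         for (int row = 0; row < rgbRows.Length; row++)
         {
             fixed (byte* src = rgbRows[row])     // pins the managed row during the copy
             {
                 System.Buffer.MemoryCopy(src, dst, 5120, 5120);
             }
             dst += bpd.Stride;                   // advance one scanline, no new IntPtr
         }
     }
 }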

+3

First, I would make sure you are not running this under the debugger, otherwise optimizations are completely disabled and many NOP opcodes are inserted to give the debugger anchor points at curly braces, etc.

Second, you are writing to disk. That will be fast when the write is buffered and very slow at other times, when the write causes a flush. It may not be CPU usage that is killing you here. Could you confirm by opening Task Manager and telling us what your CPU usage is?

If you still want to write intermediate JPGs to disk, then I would recommend doing it with two threads and a thread-safe ring queue between them. Thread one is the code you have above that does all the processing; as soon as a frame is done it puts the bitmap object into the queue and immediately moves on to the next iteration. Thread two reads the bitmap objects from the queue and writes them to disk.

I would recommend a bounded blocking queue (or building your own from a queue plus a counting semaphore) in case the writes end up taking longer than a frame.

Next: do you have a multi-core machine? You could parallelize the computation. The following is a crude outline, since there are many considerations with this approach (much more locking is required, finding a good multi-reader/multi-writer queue implementation, handling out-of-order completion, and dealing with large jitter in the rate at which JPGs are produced, which gives the pipeline as a whole more lag but higher throughput).

Thread A: reads YUV frames as arrays from the video source, assigns a sequence number to each array, and pushes the array plus its sequence number into queue A.

Threads B, C, D: read objects from queue A, compute a bitmap, and push the bitmap with the same sequence number into queue B. Queue B will hold bitmaps out of order, for example 0, 5, 6, 2, 3, 9, 4, ..., because you have several workers, but with the sequence numbers you can reorder them later.

Thread E: reads from queue B, reorders the frames, and writes them to disk.

All queues, of course, must be thread safe.
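Here is a minimal sketch of the simpler two-thread variant, assuming a bounded BlockingCollection (available since .NET 4) as the queue; ConvertNextFrame is a placeholder for the conversion code from the question.

 using System.Collections.Concurrent;
 using System.Drawing;
 using System.Drawing.Imaging;
 using System.Threading.Tasks;

 static class FramePipeline
 {
     public static void Run(int frameCount)
     {
         // Bounded capacity: the converter blocks if the disk writer falls behind.
         var queue = new BlockingCollection<Bitmap>(boundedCapacity: 8);

         Task writer = Task.Run(() =>
         {
             int n = 0;
             foreach (Bitmap bmp in queue.GetConsumingEnumerable())
             {
                 bmp.Save("frame_" + (n++) + ".jpg", ImageFormat.Jpeg);  // disk I/O off the hot path
                 bmp.Dispose();
             }
         });

         for (int i = 0; i < frameCount; i++)
         {
             Bitmap frame = ConvertNextFrame();   // the YUV -> RGB work from the question
             queue.Add(frame);                    // hand off and start the next frame immediately
         }

         queue.CompleteAdding();
         writer.Wait();
     }

     static Bitmap ConvertNextFrame() { return new Bitmap(1280, 720); }  // placeholder
 }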

Going one step further, why not get rid of the intermediate JPG files altogether? Writing them to disk only to read them back in some other program or at a later stage is a big chunk of extra work, and it is probably your performance bottleneck. Why not generate the video stream entirely in memory?

Other performance considerations: are you walking your arrays in the "right" order? This is a CPU-cache issue. The quick test: swap which for-loop is the inner one and see whether you get better performance.

Long answer: caches work much better when you read memory linearly. Take an example: you have a rectangular 1000x1000 array laid out row by row, so row zero is the first 1000 bytes, row one is the next 1000, and so on. If you read the array column-wise instead of row-wise, you read the bytes in the order 0, 1000, 2000, ..., 999000, 1, 1001, 2001, ..., 999001, and so on. The CPU will not like that, because each read lands on a different cache line (and often a different page), which means many more cache misses. You end up thrashing memory instead of reading it linearly.
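A small illustration of the access-order point (same data, same work, only the loop nesting differs); the class and method names here are made up:

 static class CacheOrderDemo
 {
     // Both methods sum the same 1000 x 1000 bytes. C# 2D arrays are row-major,
     // so the first version reads memory sequentially while the second strides
     // 1000 bytes between reads and misses the cache far more often.
     static long SumRowMajor(byte[,] a)
     {
         long sum = 0;
         for (int row = 0; row < 1000; row++)
             for (int col = 0; col < 1000; col++)
                 sum += a[row, col];      // consecutive addresses
         return sum;
     }

     static long SumColumnMajor(byte[,] a)
     {
         long sum = 0;
         for (int col = 0; col < 1000; col++)
             for (int row = 0; row < 1000; row++)
                 sum += a[row, col];      // jumps a full row between reads
         return sum;
     }
 }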

+2

Some thoughts here:

1) Make sure there are no memory allocations in the per-frame path. Otherwise you will trigger garbage collections and drop frames. I think the rest of your code is clean, but I seriously doubt the JPEG save routine is. You may need to move the real-time part of the code to another language.

2) Threads. I would move this into its own thread. Give it a pool of buffers it can fill, and do the compression and saving in another thread. That gives you some headroom.

3) The inputs to the RGB conversion are effectively 3 bytes. That means there are only 16 million possible input values, and I gather you produce a uint32 from them. Precomputing all of them takes only 64 MB. That would remove most of the work from the most time-critical section and eliminate the six clamping branches; a sketch of such a table follows.
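A rough sketch of that lookup table, reusing the conversion formulas from RGB_Conversion and packing the result as BGRA in a uint; the names and packing here are illustrative, not from the question.

 static class YuvLookup
 {
     // 2^24 entries * 4 bytes = 64 MB, built once at startup.
     static readonly uint[] Table = BuildTable();

     static uint[] BuildTable()
     {
         var table = new uint[1 << 24];
         for (int y = 0; y < 256; y++)
         for (int u = 0; u < 256; u++)
         for (int v = 0; v < 256; v++)
         {
             int c = y - 16, d = u - 128, e = v - 128;
             int r = Clamp((298 * c + 409 * e + 128) >> 8);
             int g = Clamp((298 * c - 100 * d - 208 * e + 128) >> 8);
             int b = Clamp((298 * c + 516 * d + 128) >> 8);
             table[(y << 16) | (u << 8) | v] =
                 0xFF000000u | (uint)(r << 16) | (uint)(g << 8) | (uint)b;
         }
         return table;
     }

     static int Clamp(int x) { return x < 0 ? 0 : x > 255 ? 255 : x; }

     // In the pixel loop the arithmetic and the six clamping branches collapse
     // into a single indexed read.
     public static uint Lookup(int y, int u, int v)
     {
         return Table[(y << 16) | (u << 8) | v];
     }
 }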

+1

Assuming RGB_Conversion is very fast, I would expect the main bottleneck here to be the JPEG save. If so, try a different (faster) JPEG library. Also be sure to measure how long new Bitmap(1280, 720) takes, and consider reusing the bitmap between frames.
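A quick way to test both suggestions is a small probe like the following; the class, method, and file names are arbitrary, and the bitmap is allocated once and reused.

 using System;
 using System.Diagnostics;
 using System.Drawing;
 using System.Drawing.Imaging;

 static class SaveProbe
 {
     static readonly Bitmap Reused = new Bitmap(1280, 720);   // allocated once, not per frame

     public static void MeasureOnce()
     {
         var sw = Stopwatch.StartNew();
         using (var bmp = new Bitmap(1280, 720)) { }
         Console.WriteLine("allocate: " + sw.Elapsed);

         sw.Restart();
         Reused.Save("probe.jpg", ImageFormat.Jpeg);           // isolates the JPEG encode + write
         Console.WriteLine("jpeg save: " + sw.Elapsed);
     }
 }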

0

Have you considered using the Task Parallel Library and the pipeline pattern to parallelize this code? You could pipeline the image processing so that the disk write for image N runs in parallel with the computation for image N+1. That may give you some speedup, but fundamentally your problem looks disk-bound.

There is an example of using the TPL for parallel image processing at the links below; it includes a sample application and a discussion of the trade-offs.

http://msdn.microsoft.com/en-us/library/ff963548.aspx (discussion)

http://parallelpatterns.codeplex.com/releases/view/50473 (code)

I also agree with the comments about using a profiler to measure this. It is likely to be more accurate and will not distort the results.

Incidentally, I wrote that example in both C# and C++, and the C++ version is much faster, mainly because of the direct memory access available to you. If you can combine the byte operations into something wider, that is likely to give you significant improvements.
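A minimal sketch of that pipelining idea with plain Tasks; ConvertFrame is a placeholder, and the linked sample shows a fuller version.

 using System.Drawing;
 using System.Drawing.Imaging;
 using System.Threading.Tasks;

 static class OverlappedSave
 {
     public static void Run(int frameCount)
     {
         Task pendingSave = Task.FromResult(0);          // nothing in flight yet
         for (int i = 0; i < frameCount; i++)
         {
             Bitmap bmp = ConvertFrame(i);               // CPU work for frame i
             pendingSave.Wait();                         // make sure frame i-1 reached disk
             int n = i;
             pendingSave = Task.Run(() =>
             {
                 bmp.Save("frame_" + n + ".jpg", ImageFormat.Jpeg);
                 bmp.Dispose();
             });
         }
         pendingSave.Wait();
     }

     static Bitmap ConvertFrame(int i) { return new Bitmap(1280, 720); }  // placeholder
 }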

0

As Ben Jackson noted, the color-space conversion is completely unnecessary. Admittedly, I did not see a way to save YUV image data through the MSDN-documented APIs, but the libjpeg library can work with YUV (YCbCr) data directly, and there is a .NET version at http://bitmiracle.com/libjpeg/

Given your performance requirements, the libjpeg-turbo library at http://www.libjpeg-turbo.org/ may be the better choice, although using a C-based DLL from C# code can be cumbersome.

0

Source: https://habr.com/ru/post/896011/

