Designing a fast rolling-window file reader

I am writing an algorithm in C++ that scans a file with a "sliding window", i.e. it scans bytes 0 to n, does something, then scans bytes 1 to n+1, does something, and so on until the end of the file is reached.

My first algorithm was to read the first n bytes, do something, discard one byte, read one new byte, and repeat. This was very slow, because calling ReadFile on the HDD one byte at a time was inefficient (about 100 kB/s).
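Roughly, that first version looked like the following (a simplified sketch, not my exact code; names and error handling are illustrative):

    // Simplified sketch of the slow byte-at-a-time approach (illustrative
    // names, error handling trimmed). Every iteration issues a ReadFile()
    // for a single byte, which is what made it so slow.
    #include <windows.h>
    #include <cstring>
    #include <vector>

    void ScanSlow(HANDLE hFile, size_t n)
    {
        std::vector<BYTE> window(n);
        DWORD got = 0;

        // fill the initial window
        if (!ReadFile(hFile, window.data(), (DWORD)n, &got, NULL) || got < n)
            return;

        for (;;)
        {
            // DoSomething(window.data(), n);   // "do something" on the current window

            // slide by one: drop the first byte, append one new byte
            memmove(window.data(), window.data() + 1, n - 1);

            BYTE next;
            if (!ReadFile(hFile, &next, 1, &got, NULL) || got != 1)
                break;                          // EOF or read error
            window[n - 1] = next;
        }
    }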

My second algorithm reads a chunk of the file (possibly n * 1000 bytes, which can mean the entire file if it is not too large) into a buffer and then reads individual bytes from that buffer. Now I get about 10 MB/s (decent SSD + Core i5, 1.6 GHz laptop).

My question is: do you have any suggestions for a faster approach?

edit: My large buffer (relative to the window size) is implemented as follows (a minimal sketch of this scheme follows the list):
- for a 5 kB scan window, the buffer is initialized to 5 MB
- the first 5 MB of the file are read into the buffer
- the window pointer starts at the beginning of the buffer
- as the window moves, the window pointer advances
- when the window pointer approaches the end of the 5 MB buffer (say, at 4.99 MB), the remaining 0.01 MB is copied to the beginning of the buffer, the window pointer is reset to the beginning, and an additional 4.99 MB is read into the buffer
- repeat
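Here is that sketch (the DoSomething step and the exact sizes are illustrative placeholders):

    // Sketch of the large-buffer scheme described above (sizes and names
    // are illustrative). A 5 MB buffer backs a 5 kB window; when the
    // window nears the end of the buffer, the unread tail is copied to
    // the front and the rest of the buffer is refilled from the file.
    #include <windows.h>
    #include <cstring>
    #include <vector>

    static const size_t WINDOW = 5 * 1024;          // 5 kB scan window
    static const size_t BUFSZ  = 5 * 1024 * 1024;   // 5 MB read buffer

    void ScanBuffered(HANDLE hFile)
    {
        std::vector<BYTE> buf(BUFSZ);
        DWORD got = 0;

        if (!ReadFile(hFile, buf.data(), (DWORD)BUFSZ, &got, NULL))
            return;
        size_t valid = got;                         // bytes currently in the buffer
        size_t pos = 0;                             // window pointer

        while (pos + WINDOW <= valid)
        {
            // DoSomething(&buf[pos], WINDOW);      // process the current window
            ++pos;

            // near the end of the buffer: keep the unread tail, refill the rest
            if (pos + WINDOW > valid && valid == BUFSZ)
            {
                size_t tail = valid - pos;
                memmove(buf.data(), buf.data() + pos, tail);
                if (!ReadFile(hFile, buf.data() + tail, (DWORD)(BUFSZ - tail), &got, NULL))
                    break;
                valid = tail + got;
                pos = 0;
            }
        }
    }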

edit 2 - actual implementation (deleted)

Thanks for the many insightful answers. It was hard to choose a "best answer"; they were all excellent and all helped with my coding.

+4
3 answers

I use a sliding window in one of my applications (actually, several layers of sliding windows operating on top of each other, but that is beyond the scope of this discussion). The window uses a memory-mapped view of the file via CreateFileMapping() and MapViewOfFile(), and I have an abstraction layer on top of that. I ask the abstraction layer for any range of bytes I need, and it ensures that the file mapping and file view are adjusted accordingly so that those bytes are in memory. Every time a new range of bytes is requested, the file view is adjusted only if necessary.

The file view is positioned and sized on page boundaries that are even multiples of the system allocation granularity, as reported by GetSystemInfo(). Just because a scan reaches the end of a given byte range does not necessarily mean it has reached the end of a page boundary, so the next scan may not need to alter the file view at all; the next bytes are already in memory. If the first requested byte of a range crosses the right-hand boundary of a mapped page, the left edge of the file view is adjusted to the left-hand boundary of the requested page and any pages to the left are unmapped. If the last requested byte of the range crosses the right-hand boundary of the right-most mapped page, a new page is mapped and added to the file view.
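In isolation, the alignment rule is just this rounding arithmetic (a sketch assuming <windows.h>; the same calculation appears inside the full class further down):

    // The alignment rule in isolation (the same math appears in
    // EnsureView() in the class below): round the requested offset down
    // to the allocation granularity and round the requested size up to
    // the next page boundary.
    #include <windows.h>

    void AlignView(unsigned __int64 reqOffset, DWORD reqSize,
                   DWORD allocGran, DWORD pageSize,
                   unsigned __int64 &viewOffset, DWORD &viewSize)
    {
        viewOffset = (reqOffset / allocGran) * allocGran;
        viewSize   = (DWORD)(((reqOffset - viewOffset) + reqSize + (pageSize - 1))
                             & ~(unsigned __int64)(pageSize - 1));
    }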

This sounds more complicated than it really is to implement once you get the coding of it down:

Creating a View Within a File

It sounds like you are scanning bytes in fixed-size blocks, so this approach is very fast and very efficient for that. Using this technique, I can sequentially scan multi-GIGABYTE files from start to finish fairly quickly, usually a minute or less on my slowest machine. If your files are only as large as the system granularity, or even just a few megabytes, you will hardly notice any time elapsed at all (unless your scans themselves are slow).

Update: here is a simplified variation of what I'm using:

    class FileView
    {
    private:
        DWORD m_AllocGran;
        DWORD m_PageSize;
        HANDLE m_File;
        unsigned __int64 m_FileSize;
        HANDLE m_Map;
        unsigned __int64 m_MapSize;
        LPBYTE m_View;
        unsigned __int64 m_ViewOffset;
        DWORD m_ViewSize;

        void CloseMap()
        {
            CloseView();
            if (m_Map != NULL)
            {
                CloseHandle(m_Map);
                m_Map = NULL;
            }
            m_MapSize = 0;
        }

        void CloseView()
        {
            if (m_View != NULL)
            {
                UnmapViewOfFile(m_View);
                m_View = NULL;
            }
            m_ViewOffset = 0;
            m_ViewSize = 0;
        }

        bool EnsureMap(unsigned __int64 Size)
        {
            // do not exceed EOF or else the file on disk will grow!
            Size = min(Size, m_FileSize);

            if ((m_Map == NULL) || (m_MapSize != Size))
            {
                // a new map is needed...
                CloseMap();

                ULARGE_INTEGER ul;
                ul.QuadPart = Size;

                m_Map = CreateFileMapping(m_File, NULL, PAGE_READONLY, ul.HighPart, ul.LowPart, NULL);
                if (m_Map == NULL)
                    return false;

                m_MapSize = Size;
            }

            return true;
        }

        bool EnsureView(unsigned __int64 Offset, DWORD Size)
        {
            if ((m_View == NULL) || (Offset < m_ViewOffset) || ((Offset + Size) > (m_ViewOffset + m_ViewSize)))
            {
                // the requested range is not already in view...

                // round the offset down to the nearest allocation boundary
                unsigned __int64 ulNewOffset = ((Offset / m_AllocGran) * m_AllocGran);

                // round the size up to the next page boundary
                DWORD dwNewSize = (DWORD) ((((Offset - ulNewOffset) + Size) + (m_PageSize - 1)) & ~(unsigned __int64)(m_PageSize - 1));

                // if the new view would exceed EOF, truncate it
                unsigned __int64 ulOffsetInFile = (ulNewOffset + dwNewSize);
                if (ulOffsetInFile > m_FileSize)
                    dwNewSize -= (DWORD) (ulOffsetInFile - m_FileSize);

                if ((m_View == NULL) || (m_ViewOffset != ulNewOffset) || (m_ViewSize != dwNewSize))
                {
                    // a new view is needed...
                    CloseView();

                    // make sure the memory map is large enough to contain the entire view
                    if (!EnsureMap(ulNewOffset + dwNewSize))
                        return false;

                    ULARGE_INTEGER ul;
                    ul.QuadPart = ulNewOffset;

                    m_View = (LPBYTE) MapViewOfFile(m_Map, FILE_MAP_READ, ul.HighPart, ul.LowPart, dwNewSize);
                    if (m_View == NULL)
                        return false;

                    m_ViewOffset = ulNewOffset;
                    m_ViewSize = dwNewSize;
                }
            }

            return true;
        }

    public:
        FileView() :
            m_AllocGran(0), m_PageSize(0),
            m_File(INVALID_HANDLE_VALUE), m_FileSize(0),
            m_Map(NULL), m_MapSize(0),
            m_View(NULL), m_ViewOffset(0), m_ViewSize(0)
        {
            // map views need to be positioned on even multiples of the
            // system allocation granularity, so size them on even
            // multiples of the system page size...
            SYSTEM_INFO si = {0};
            GetSystemInfo(&si);
            m_AllocGran = si.dwAllocationGranularity;
            m_PageSize = si.dwPageSize;
        }

        ~FileView()
        {
            CloseFile();
        }

        bool OpenFile(LPCTSTR FileName)
        {
            CloseFile();

            if ((m_AllocGran == 0) || (m_PageSize == 0))
                return false;

            HANDLE hFile = CreateFile(FileName, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, 0, NULL);
            if (hFile == INVALID_HANDLE_VALUE)
                return false;

            ULARGE_INTEGER ul;
            ul.LowPart = GetFileSize(hFile, &ul.HighPart);
            if ((ul.LowPart == INVALID_FILE_SIZE) && (GetLastError() != 0))
            {
                CloseHandle(hFile);
                return false;
            }

            m_File = hFile;
            m_FileSize = ul.QuadPart;

            return true;
        }

        void CloseFile()
        {
            CloseMap();
            if (m_File != INVALID_HANDLE_VALUE)
            {
                CloseHandle(m_File);
                m_File = INVALID_HANDLE_VALUE;
            }
            m_FileSize = 0;
        }

        // accessor used by the usage example below
        unsigned __int64 FileSize() const
        {
            return m_FileSize;
        }

        bool AccessBytes(unsigned __int64 Offset, DWORD Size, LPBYTE *Bytes, DWORD *Available)
        {
            if (Bytes) *Bytes = NULL;
            if (Available) *Available = 0;

            if ((m_FileSize != 0) && (Offset < m_FileSize))
            {
                // make sure the requested range is in view
                if (!EnsureView(Offset, Size))
                    return false;

                // near EOF, the available bytes may be less than requested
                DWORD dwOffsetInView = (DWORD) (Offset - m_ViewOffset);

                if (Bytes) *Bytes = &m_View[dwOffsetInView];
                if (Available) *Available = min(m_ViewSize - dwOffsetInView, Size);
            }

            return true;
        }
    };

Then use it something like this:

    FileView fv;

    if (fv.OpenFile(TEXT("C:\\path\\file.ext")))
    {
        LPBYTE data;
        DWORD len;
        unsigned __int64 offset = 0, filesize = fv.FileSize();

        while (offset < filesize)
        {
            if (!fv.AccessBytes(offset, some size here, &data, &len))
                break; // error

            if (len == 0)
                break; // unexpected EOF

            // use data up to len bytes as needed...

            offset += len;
        }

        fv.CloseFile();
    }

This code is designed to allow random jumps to anywhere in the file for any data size. Since you are reading bytes sequentially, some of it can be simplified if needed.

+3

Your new algorithm only pays 0.1% of the I/O inefficiency penalty ... not worth worrying about.

To get a further increase in throughput, you should study the "do something" step more carefully. See whether you can reuse part of the result from one window in the next, overlapping window. Check the cache behaviour. Check whether there is a better algorithm for the same computation.
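For example, if the per-window computation can be updated incrementally, each one-byte slide only has to account for the byte leaving the window and the byte entering it. A generic sketch using a rolling sum (substitute your own computation):

    // Generic illustration of reusing work between overlapping windows:
    // rather than recomputing the whole window, update the previous
    // result with the byte that left and the byte that entered. Shown
    // with a simple rolling sum; the same idea applies to rolling hashes
    // and similar computations.
    #include <cstddef>
    #include <cstdint>

    uint64_t InitialSum(const unsigned char *window, size_t n)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; ++i)
            sum += window[i];
        return sum;
    }

    // O(1) update when the window slides one byte to the right
    uint64_t SlideSum(uint64_t prevSum, unsigned char leaving, unsigned char entering)
    {
        return prevSum - leaving + entering;
    }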

+2

You have the basic I/O technique covered. The easiest improvement you can make now is to pick a good buffer size. With some experimentation, you will find that read performance increases rapidly with buffer size until you hit about 16k, after which performance starts to level off.
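One quick way to run that experiment is to time plain sequential reads with several buffer sizes and compare throughput (a sketch; the file name and size list are placeholders, and repeated runs are served from the OS file cache, so use a large file or clear the cache between runs):

    // Sketch of a buffer-size experiment: time sequential reads of the
    // same file with several buffer sizes and compare throughput. The
    // file name and the size list are placeholders.
    #include <windows.h>
    #include <cstdio>
    #include <vector>

    void TimeReads(LPCTSTR fileName)
    {
        const DWORD sizes[] = { 1024, 4096, 16 * 1024, 64 * 1024, 1024 * 1024 };

        for (DWORD bufSize : sizes)
        {
            HANDLE h = CreateFile(fileName, GENERIC_READ, FILE_SHARE_READ, NULL,
                                  OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
            if (h == INVALID_HANDLE_VALUE)
                return;

            std::vector<BYTE> buf(bufSize);
            unsigned __int64 total = 0;
            DWORD got = 0;

            DWORD start = GetTickCount();
            while (ReadFile(h, buf.data(), bufSize, &got, NULL) && got > 0)
                total += got;
            DWORD elapsed = GetTickCount() - start;

            printf("buffer %7lu bytes: %llu bytes read in %lu ms\n",
                   (unsigned long)bufSize, (unsigned long long)total, (unsigned long)elapsed);
            CloseHandle(h);
        }
    }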

Your next task is probably to profile your code and see where it spends its time. When dealing with performance, it is always better to measure than to guess. You do not mention which OS you are using, so I will not make any profiler recommendations.

You can also try to reduce the amount of data copying between your buffer and your work area. Less copying is generally better. If you can process your data in place instead of moving it to a new location, that is a win. (I can see from your edits that you are already doing this.)
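For completeness, the contrast looks roughly like this (illustrative names; ProcessWindow stands in for the "do something" step):

    // Illustration of avoiding an extra copy: instead of copying each
    // window into a separate work area, hand the processing code a
    // pointer directly into the read buffer.
    #include <cstddef>
    #include <cstring>

    void ProcessWindow(const unsigned char *window, size_t n);  // your per-window work

    void ScanWithCopy(const unsigned char *buf, size_t valid, size_t n, unsigned char *workArea)
    {
        for (size_t pos = 0; pos + n <= valid; ++pos)
        {
            memcpy(workArea, buf + pos, n);      // one extra copy per window
            ProcessWindow(workArea, n);
        }
    }

    void ScanInPlace(const unsigned char *buf, size_t valid, size_t n)
    {
        for (size_t pos = 0; pos + n <= valid; ++pos)
            ProcessWindow(buf + pos, n);         // no copy: point into the buffer
    }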

Finally, if you are processing many gigabytes of archived data, you should consider keeping your data compressed. It comes as a surprise to many people that it is often faster to read compressed data and then decompress it than it is to just read the uncompressed data. My favourite algorithm for this purpose is LZO, which does not compress as well as some other algorithms but decompresses very quickly (a sketch of the read-and-decompress step follows the list below). This kind of setup is only worth the engineering effort if:

  • your job is I/O bound,
  • you are reading many GB of data, and
  • you run the program often, so making it faster saves you a lot of time.
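For illustration, a read-and-decompress loop might look roughly like this. It assumes the miniLZO flavour of LZO (lzo_init(), lzo1x_decompress(); check the library headers for the exact signatures) and a hypothetical block format of compressed length, original length, then compressed bytes; adapt it to however you actually store your data:

    // Sketch of a read-and-decompress loop using miniLZO. The block
    // format assumed here -- a 4-byte compressed length and a 4-byte
    // original length followed by the compressed bytes -- is purely
    // illustrative, not part of LZO.
    #include <windows.h>
    #include <vector>
    #include "minilzo.h"   // or <lzo/lzo1x.h> with the full LZO library

    bool ReadNextBlock(HANDLE hFile, std::vector<unsigned char> &out)
    {
        DWORD got = 0;
        unsigned int compLen = 0, origLen = 0;

        if (!ReadFile(hFile, &compLen, sizeof(compLen), &got, NULL) || got != sizeof(compLen))
            return false;
        if (!ReadFile(hFile, &origLen, sizeof(origLen), &got, NULL) || got != sizeof(origLen))
            return false;

        std::vector<unsigned char> comp(compLen);
        if (!ReadFile(hFile, comp.data(), compLen, &got, NULL) || got != compLen)
            return false;

        out.resize(origLen);
        lzo_uint outLen = origLen;
        if (lzo1x_decompress(comp.data(), compLen, out.data(), &outLen, NULL) != LZO_E_OK)
            return false;

        out.resize(outLen);
        return true;
    }

    // call once at startup: if (lzo_init() != LZO_E_OK) { /* bail out */ }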
+1

Source: https://habr.com/ru/post/1442436/

