Windows fsync performance (FlushFileBuffers) with large files

From the information on ensuring data durability on disk ( http://winntfs.com/2012/11/29/windows-write-caching-part-2-an-overview-for-application-developers/ ), it appears that on Windows platforms you need to rely on FlushFileBuffers, the Windows equivalent of an "fsync", to get the best guarantee that buffers are actually flushed from the device caches to the storage medium itself. If that information is correct, the combination of FILE_FLAG_NO_BUFFERING with FILE_FLAG_WRITE_THROUGH does not flush the device cache, but merely affects the file system cache.
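
To make the terminology concrete, here is a minimal sketch of what I mean by "fsync" from C#. The file name is only an example; the P/Invoke declaration is the same one used in the test code below, and on .NET 4 and later FileStream.Flush(true) ends up making the same FlushFileBuffers call, so either form should do:

using System;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

static class DurableWriteSketch
{
    // FlushFileBuffers is the Windows counterpart of POSIX fsync: it flushes the file's
    // data from the file system cache and asks the device to flush its write cache too.
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool FlushFileBuffers(SafeFileHandle hFile);

    static void Main()
    {
        // "example.dat" is only an illustrative file name.
        using (var file = new FileStream("example.dat", FileMode.Create, FileAccess.ReadWrite))
        {
            var page = new byte[4096];
            new Random().NextBytes(page);
            file.Write(page, 0, page.Length);

            // Either of the following is sufficient on .NET 4+; both are shown for clarity.
            file.Flush(true); // flushToDisk: true calls FlushFileBuffers internally
            if (!FlushFileBuffers(file.SafeFileHandle))
                throw new IOException("FlushFileBuffers failed", Marshal.GetHRForLastWin32Error());
        }
    }
}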

Given that I will be working with fairly large files that need to be updated "transactionally", this means an "fsync" is performed at the end of each transactional commit. So I created a tiny application to test the performance of doing so. It basically writes a batch of 8 memory-page-sized blocks of random bytes sequentially, using 8 writes, and then flushes. The batch is repeated in a loop, and after every so many written pages it records the performance. In addition, it has two configurable options: whether to fsync on a flush, and whether to write a byte to the last position of the file before starting the page writes.

// Code updated to reflect new results as discussed in answer below.
// 26/Aug/2013: Code updated again to reflect results as discussed in follow up question.
// 28/Aug/2012: Increased file stream buffer to ensure 8 page flushes.
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

class Program
{
    static void Main(string[] args)
    {
        BenchSequentialWrites(reuseExistingFile: false);
    }

    public static void BenchSequentialWrites(bool reuseExistingFile = false)
    {
        Tuple<string, bool, bool, bool, bool>[] scenarios = new Tuple<string, bool, bool, bool, bool>[]
        {
            // output csv, fsync?, fill end?, write through?, mem map?
            Tuple.Create("timing FS-EBF.csv", true, false, false, false),
            Tuple.Create("timing NS-EBF.csv", false, false, false, false),
            Tuple.Create("timing FS-LB-BF.csv", true, true, false, false),
            Tuple.Create("timing NS-LB-BF.csv", false, true, false, false),
            Tuple.Create("timing FS-E-WT-F.csv", true, false, true, false),
            Tuple.Create("timing NS-E-WT-F.csv", false, false, true, false),
            Tuple.Create("timing FS-LB-WT-F.csv", true, true, true, false),
            Tuple.Create("timing NS-LB-WT-F.csv", false, true, true, false),
            Tuple.Create("timing FS-EB-MM.csv", true, false, false, true),
            Tuple.Create("timing NS-EB-MM.csv", false, false, false, true),
            Tuple.Create("timing FS-LB-B-MM.csv", true, true, false, true),
            Tuple.Create("timing NS-LB-B-MM.csv", false, true, false, true),
            Tuple.Create("timing FS-E-WT-MM.csv", true, false, true, true),
            Tuple.Create("timing NS-E-WT-MM.csv", false, false, true, true),
            Tuple.Create("timing FS-LB-WT-MM.csv", true, true, true, true),
            Tuple.Create("timing NS-LB-WT-MM.csv", false, true, true, true),
        };

        foreach (var scenario in scenarios)
        {
            Console.WriteLine("{0,-12} {1,-16} {2,-16} {3,-16} {4:F2}",
                "Total pages", "Interval pages", "Total time", "Interval time", "MB/s");
            CollectGarbage();
            var timingResults = SequentialWriteTest("test.data", !reuseExistingFile, fillEnd: scenario.Item3,
                nPages: 200 * 1000, fSync: scenario.Item2, writeThrough: scenario.Item4, writeToMemMap: scenario.Item5);
            using (var report = File.CreateText(scenario.Item1))
            {
                report.WriteLine("Total pages,Interval pages,Total bytes,Interval bytes,Total time,Interval time,MB/s");
                foreach (var entry in timingResults)
                {
                    Console.WriteLine("{0,-12} {1,-16} {2,-16} {3,-16} {4:F2}",
                        entry.Item1, entry.Item2, entry.Item5, entry.Item6, entry.Item7);
                    report.WriteLine("{0},{1},{2},{3},{4},{5},{6}",
                        entry.Item1, entry.Item2, entry.Item3, entry.Item4,
                        entry.Item5.TotalSeconds, entry.Item6.TotalSeconds, entry.Item7);
                }
            }
        }
    }

    public unsafe static IEnumerable<Tuple<long, long, long, long, TimeSpan, TimeSpan, double>> SequentialWriteTest(
        string fileName,
        bool createNewFile,
        bool fillEnd,
        long nPages,
        bool fSync = true,
        bool writeThrough = false,
        bool writeToMemMap = false,
        long pageSize = 4096)
    {
        // create or open file and if requested fill in its last byte.
        var fileMode = createNewFile ? FileMode.Create : FileMode.OpenOrCreate;
        using (var tmpFile = new FileStream(fileName, fileMode, FileAccess.ReadWrite, FileShare.ReadWrite, (int)pageSize))
        {
            Console.WriteLine("Opening temp file with mode {0}{1}", fileMode, fillEnd ? " and writing last byte." : ".");
            tmpFile.SetLength(nPages * pageSize);
            if (fillEnd)
            {
                tmpFile.Position = tmpFile.Length - 1;
                tmpFile.WriteByte(1);
                tmpFile.Position = 0;
                tmpFile.Flush(true);
            }
        }

        // Make sure any flushing / activity has completed
        System.Threading.Thread.Sleep(TimeSpan.FromMinutes(1));
        System.Threading.Thread.SpinWait(50); // warm up.

        var buf = new byte[pageSize];
        new Random().NextBytes(buf);
        var ms = new System.IO.MemoryStream(buf);

        var stopwatch = new System.Diagnostics.Stopwatch();
        var timings = new List<Tuple<long, long, long, long, TimeSpan, TimeSpan, double>>();
        var pageTimingInterval = 8 * 2000;
        var prevPages = 0L;
        var prevElapsed = TimeSpan.FromMilliseconds(0);

        // Open file
        const FileOptions NoBuffering = ((FileOptions)0x20000000);
        var options = writeThrough ? (FileOptions.WriteThrough | NoBuffering) : FileOptions.None;
        using (var file = new FileStream(fileName, FileMode.Open, FileAccess.ReadWrite, FileShare.ReadWrite, (int)(16 * pageSize), options))
        {
            stopwatch.Start();
            if (writeToMemMap)
            {
                // write pages through memory map.
                using (var mmf = MemoryMappedFile.CreateFromFile(file, Guid.NewGuid().ToString(), file.Length,
                    MemoryMappedFileAccess.ReadWrite, null, HandleInheritability.None, true))
                using (var accessor = mmf.CreateViewAccessor(0, file.Length, MemoryMappedFileAccess.ReadWrite))
                {
                    byte* base_ptr = null;
                    accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref base_ptr);
                    var offset = 0L;
                    for (long i = 0; i < nPages / 8; i++)
                    {
                        using (var memStream = new UnmanagedMemoryStream(base_ptr + offset, 8 * pageSize, 8 * pageSize, FileAccess.ReadWrite))
                        {
                            for (int j = 0; j < 8; j++)
                            {
                                ms.CopyTo(memStream);
                                ms.Position = 0;
                            }
                        }
                        FlushViewOfFile((IntPtr)(base_ptr + offset), (int)(8 * pageSize));
                        offset += 8 * pageSize;
                        if (fSync)
                            FlushFileBuffers(file.SafeFileHandle);

                        if (((i + 1) * 8) % pageTimingInterval == 0)
                            timings.Add(Report(stopwatch.Elapsed, ref prevElapsed, (i + 1) * 8, ref prevPages, pageSize));
                    }
                    accessor.SafeMemoryMappedViewHandle.ReleasePointer();
                }
            }
            else
            {
                for (long i = 0; i < nPages / 8; i++)
                {
                    for (int j = 0; j < 8; j++)
                    {
                        ms.CopyTo(file);
                        ms.Position = 0;
                    }
                    file.Flush(fSync);

                    if (((i + 1) * 8) % pageTimingInterval == 0)
                        timings.Add(Report(stopwatch.Elapsed, ref prevElapsed, (i + 1) * 8, ref prevPages, pageSize));
                }
            }
        }
        timings.Add(Report(stopwatch.Elapsed, ref prevElapsed, nPages, ref prevPages, pageSize));
        return timings;
    }

    private static Tuple<long, long, long, long, TimeSpan, TimeSpan, double> Report(
        TimeSpan elapsed, ref TimeSpan prevElapsed, long curPages, ref long prevPages, long pageSize)
    {
        var intervalPages = curPages - prevPages;
        var intervalElapsed = elapsed - prevElapsed;
        var intervalPageSize = intervalPages * pageSize;
        var mbps = (intervalPageSize / (1024.0 * 1024.0)) / intervalElapsed.TotalSeconds;
        prevElapsed = elapsed;
        prevPages = curPages;
        return Tuple.Create(curPages, intervalPages, curPages * pageSize, intervalPageSize, elapsed, intervalElapsed, mbps);
    }

    private static void CollectGarbage()
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
        System.Threading.Thread.Sleep(200);
        GC.Collect();
        GC.WaitForPendingFinalizers();
        System.Threading.Thread.SpinWait(10);
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool FlushViewOfFile(IntPtr lpBaseAddress, int dwNumBytesToFlush);

    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Auto)]
    static extern bool FlushFileBuffers(SafeFileHandle hFile);
}

The performance results I get (64-bit Win 7, slow spindle drive) are not very encouraging. It appears that "fsync" performance depends mostly on the size of the file being flushed, and this dominates the time, rather than the amount of "dirty" data to be flushed. The graph below shows the results for the 4 different settings of the small test application.

Benchmark timing for the 4 scenarios

As you can see, the performance of "fsync" decreases exponentially as the file grows (until it levels off at a few GB). In addition, the disk itself does not seem to be particularly busy (for example, Resource Monitor shows its active time at only around a few percent, and its disk queue as empty most of the time).

I obviously expected "fsync" performance to be somewhat worse than a regular buffered flush, but I expected it to be more or less constant and independent of file size. As it is, it seems it cannot be used in combination with a single large file.

Does anyone have an explanation, different experiences, or a different solution that allows ensuring data is on disk with more or less constant, predictable performance?

UPDATE: See the new information in the answer below.

+6
4 answers

Your test shows an exponential decrease in speed on the synced runs because you recreate the file every time. In this case it is no longer a purely sequential write: every write also grows the file, which requires multiple seeks to update the file metadata in the file system. If you ran all of these jobs using a pre-existing, fully allocated file, you would see a much faster result, because none of those metadata updates would interfere.
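
A minimal C# sketch of that suggestion (path, method name, and size are illustrative): allocate the test file to its full size once, up front, and then reuse it for every run, so the timed writes only overwrite data and never extend the file.

using System.IO;

static class Preallocate
{
    // Create or grow the file to its final size once, so that later sequential
    // writes replace existing pages instead of extending the file.
    public static void EnsureAllocated(string path, long sizeInBytes)
    {
        using (var fs = new FileStream(path, FileMode.OpenOrCreate,
            FileAccess.ReadWrite, FileShare.None))
        {
            if (fs.Length < sizeInBytes)
                fs.SetLength(sizeInBytes);
            fs.Flush(true); // make sure the allocation itself is on disk before timing starts
        }
    }
}

Note that SetLength only reserves the logical size; on NTFS, writing past the current valid data length can still trigger metadata work, which is presumably why the "write a byte to the last position" option in the question makes a difference.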

I did a similar test on my Linux box. The results when re-creating the file each time:

mmap  direct  last  sync  time
0     0       0     0     0.882293s
0     0       0     1     27.050636s
0     0       1     0     0.832495s
0     0       1     1     26.966625s
0     1       0     0     5.775266s
0     1       0     1     22.063392s
0     1       1     0     5.265739s
0     1       1     1     24.203251s
1     0       0     0     1.031684s
1     0       0     1     28.244678s
1     0       1     0     1.031888s
1     0       1     1     29.540660s
1     1       0     0     1.032883s
1     1       0     1     29.408005s
1     1       1     0     1.035110s
1     1       1     1     28.948555s

Results re-using an existing file (obviously the last-byte case is irrelevant here; also, the very first result still had to create the file):

mmap  direct  last  sync  time
0     0       0     0     1.199310s
0     0       0     1     7.858803s
0     0       1     0     0.184925s
0     0       1     1     8.320572s
0     1       0     0     4.047780s
0     1       0     1     4.066993s
0     1       1     0     4.042564s
0     1       1     1     4.307159s
1     0       0     0     3.596712s
1     0       0     1     8.284428s
1     0       1     0     0.242584s
1     0       1     1     8.070947s
1     1       0     0     0.240500s
1     1       0     1     8.213450s
1     1       1     0     0.240922s
1     1       1     1     8.265024s

(Note that I only used 10,000 chunks, not 25,000, so this only writes 320 MB, on an ext2 file system. I didn't have a larger ext2fs handy; my larger fs is XFS, and it refused to allow mmap + direct I/O.)

Here's the code if you're interested:

#define _GNU_SOURCE 1

#include <malloc.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>

#define USE_MMAP    8
#define USE_DIRECT  4
#define USE_LAST    2
#define USE_SYNC    1

#define PAGE    4096
#define CHUNK   (8*PAGE)
#define NCHUNKS 10000
#define STATI   1000

#define FSIZE   (NCHUNKS*CHUNK)

main()
{
    int i, j, fd, rc, stc;
    char *data = valloc(CHUNK);
    char *map, *dst;
    char sfname[8];
    struct timeval start, end, stats[NCHUNKS/STATI+1];
    FILE *sfile;

    printf("mmap\tdirect\tlast\tsync\ttime\n");
    for (i=0; i<16; i++) {
        int oflag = O_CREAT|O_RDWR|O_TRUNC;

        if (i & USE_DIRECT)
            oflag |= O_DIRECT;
        fd = open("dummy", oflag, 0666);
        ftruncate(fd, FSIZE);
        if (i & USE_LAST) {
            lseek(fd, 0, SEEK_END);
            write(fd, data, 1);
            lseek(fd, 0, SEEK_SET);
        }
        if (i & USE_MMAP) {
            map = mmap(NULL, FSIZE, PROT_WRITE, MAP_SHARED, fd, 0);
            if (map == (char *)-1L) {
                perror("mmap");
                exit(1);
            }
            dst = map;
        }
        sprintf(sfname, "%x.csv", i);
        sfile = fopen(sfname, "w");
        stc = 1;
        printf("%d\t%d\t%d\t%d\t", (i&USE_MMAP)!=0, (i&USE_DIRECT)!=0, (i&USE_LAST)!=0, i&USE_SYNC);
        fflush(stdout);
        gettimeofday(&start, NULL);
        stats[0] = start;
        for (j = 1; j<=NCHUNKS; j++) {
            if (i & USE_MMAP) {
                memcpy(dst, data, CHUNK);
                if (i & USE_SYNC)
                    msync(dst, CHUNK, MS_SYNC);
                dst += CHUNK;
            } else {
                write(fd, data, CHUNK);
                if (i & USE_SYNC)
                    fdatasync(fd);
            }
            if (!(j % STATI)) {
                gettimeofday(&end, NULL);
                stats[stc++] = end;
            }
        }
        end.tv_usec -= start.tv_usec;
        if (end.tv_usec < 0) {
            end.tv_sec--;
            end.tv_usec += 1000000;
        }
        end.tv_sec -= start.tv_sec;
        printf(" %d.%06ds\n", (int)end.tv_sec, (int)end.tv_usec);
        if (i & USE_MMAP)
            munmap(map, FSIZE);
        close(fd);
        for (j=NCHUNKS/STATI; j>0; j--) {
            stats[j].tv_usec -= stats[j-1].tv_usec;
            if (stats[j].tv_usec < 0) {
                stats[j].tv_sec--;
                stats[j].tv_usec += 1000000;
            }
            stats[j].tv_sec -= stats[j-1].tv_sec;
        }
        for (j=1; j<=NCHUNKS/STATI; j++)
            fprintf(sfile, "%d\t%d.%06d\n", j*STATI*CHUNK, (int)stats[j].tv_sec, (int)stats[j].tv_usec);
        fclose(sfile);
    }
}
+4

Here is the Windows port of my sync test code. I only ran it inside a VirtualBox VM, so I don't think I have useful numbers to compare, but you could give it a try on your machine to compare against your C# numbers. I am passing OPEN_ALWAYS to CreateFile, so it will reuse the existing file. Change that flag to CREATE_ALWAYS if you want to test with an empty file every time.

One thing I noticed is that the results were much faster the first time I ran this program. Perhaps NTFS is not very efficient at overwriting existing data, and file fragmentation effects showed up in subsequent runs.

#include <windows.h>
#include <stdio.h>

#define USE_MMAP    8
#define USE_DIRECT  4
#define USE_LAST    2
#define USE_SYNC    1

#define PAGE    4096
#define CHUNK   (8*PAGE)
#define NCHUNKS 10000
#define STATI   1000

#define FSIZE   (NCHUNKS*CHUNK)

static LARGE_INTEGER cFreq;

int gettimeofday(struct timeval *tv, void *unused)
{
    LARGE_INTEGER count;
    if (!cFreq.QuadPart) {
        QueryPerformanceFrequency(&cFreq);
    }
    QueryPerformanceCounter(&count);
    tv->tv_sec = count.QuadPart / cFreq.QuadPart;
    count.QuadPart %= cFreq.QuadPart;
    count.QuadPart *= 1000000;
    tv->tv_usec = count.QuadPart / cFreq.QuadPart;
    return 0;
}

main()
{
    int i, j, rc, stc;
    HANDLE fd;
    char *data = _aligned_malloc(CHUNK, PAGE);
    char *map, *dst;
    char sfname[8];
    struct timeval start, end, stats[NCHUNKS/STATI+1];
    FILE *sfile;
    DWORD len;

    printf("mmap\tdirect\tlast\tsync\ttime\n");
    for (i=0; i<16; i++) {
        int oflag = FILE_ATTRIBUTE_NORMAL;

        if (i & USE_DIRECT)
            oflag |= FILE_FLAG_NO_BUFFERING|FILE_FLAG_WRITE_THROUGH;
        fd = CreateFile("dummy", GENERIC_READ|GENERIC_WRITE, 0, NULL, OPEN_ALWAYS, oflag, NULL);
        SetFilePointer(fd, FSIZE, NULL, FILE_BEGIN);
        SetEndOfFile(fd);
        if (i & USE_LAST)
            WriteFile(fd, data, 1, &len, NULL);
        SetFilePointer(fd, 0, NULL, FILE_BEGIN);
        if (i & USE_MMAP) {
            HANDLE mh;
            mh = CreateFileMapping(fd, NULL, PAGE_READWRITE, 0, FSIZE, NULL);
            map = MapViewOfFile(mh, FILE_MAP_WRITE, 0, 0, FSIZE);
            CloseHandle(mh);
            dst = map;
        }
        sprintf(sfname, "%x.csv", i);
        sfile = fopen(sfname, "w");
        stc = 1;
        printf("%d\t%d\t%d\t%d\t", (i&USE_MMAP)!=0, (i&USE_DIRECT)!=0, (i&USE_LAST)!=0, i&USE_SYNC);
        fflush(stdout);
        gettimeofday(&start, NULL);
        stats[0] = start;
        for (j = 1; j<=NCHUNKS; j++) {
            if (i & USE_MMAP) {
                memcpy(dst, data, CHUNK);
                FlushViewOfFile(dst, CHUNK);
                dst += CHUNK;
            } else {
                WriteFile(fd, data, CHUNK, &len, NULL);
            }
            if (i & USE_SYNC)
                FlushFileBuffers(fd);
            if (!(j % STATI)) {
                gettimeofday(&end, NULL);
                stats[stc++] = end;
            }
        }
        end.tv_usec -= start.tv_usec;
        if (end.tv_usec < 0) {
            end.tv_sec--;
            end.tv_usec += 1000000;
        }
        end.tv_sec -= start.tv_sec;
        printf(" %d.%06ds\n", (int)end.tv_sec, (int)end.tv_usec);
        if (i & USE_MMAP)
            UnmapViewOfFile(map);
        CloseHandle(fd);
        for (j=NCHUNKS/STATI; j>0; j--) {
            stats[j].tv_usec -= stats[j-1].tv_usec;
            if (stats[j].tv_usec < 0) {
                stats[j].tv_sec--;
                stats[j].tv_usec += 1000000;
            }
            stats[j].tv_sec -= stats[j-1].tv_sec;
        }
        for (j=1; j<=NCHUNKS/STATI; j++)
            fprintf(sfile, "%d\t%d.%06d\n", j*STATI*CHUNK, (int)stats[j].tv_sec, (int)stats[j].tv_usec);
        fclose(sfile);
    }
}
+2

I have experimented and tested some more and found a solution that may be acceptable for me (although at the moment I have only tested sequential writes). In the process I discovered some unexpected behavior that raises a number of new questions. I will post a new SO question for those (Explanation/Information sought: Windows write I/O performance with "fsync" (FlushFileBuffers)).

I added the following two additional options to my test:

  • Using unbuffered/write-through writes (i.e. specifying the FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH flags); see the sketch right after this list.
  • Writing to the file indirectly through a memory-mapped file.
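
As a minimal sketch of the first option (the class and method names are made up for illustration; the flags and constant mirror what the updated code in my original question does): FILE_FLAG_NO_BUFFERING has no named FileOptions member, so its raw value 0x20000000 is cast, and with it in effect every write must cover a whole number of sectors, which the 4096-byte pages used in the test satisfy on typical drives.

using System.IO;

static class UnbufferedOpenSketch
{
    // FILE_FLAG_NO_BUFFERING is not exposed by FileOptions; 0x20000000 is its raw value.
    const FileOptions NoBuffering = (FileOptions)0x20000000;

    public static FileStream OpenUnbufferedWriteThrough(string path, int pageSize = 4096)
    {
        // With buffering disabled, all I/O must be done in whole sectors,
        // hence the page-sized writes used throughout the benchmark.
        return new FileStream(path, FileMode.Open, FileAccess.ReadWrite,
            FileShare.ReadWrite, 16 * pageSize,
            FileOptions.WriteThrough | NoBuffering);
    }
}

When the "fsync" option is enabled, FlushFileBuffers is then still called on the stream's SafeFileHandle after each batch of 8 page writes.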

This gave me some unexpected results, one of which provides a more or less acceptable solution to my problem. When "fsyncing" is combined with unbuffered/write-through I/O, I do not observe an exponential decline in write throughput. Thus (although it is not very fast), this gives me a solution that guarantees the data is on disk and that has constant, predictable performance unaffected by file size.

A few other unexpected results were as follows:

  • If a byte is written to the last position in the file before the page writes start, write throughput almost doubles with the "fsync" and "unbuffered/write-through" options.
  • Unbuffered/write-through performance with or without "fsync" is almost identical, except when a byte was written to the last position of the file. The throughput of the "unbuffered/write-through" scenario without "fsync" on an empty file is about 12.5 MB/s, whereas the same scenario on a file with a byte written at its last position reaches three times that, around 37 MB/s.
  • Writing to the file indirectly through a memory-mapped file, combined with "fsync", shows the same exponential decrease in throughput as seen with buffered direct writes to the file, even when "unbuffered/write-through" is specified for the file.

I added the updated code that I used for the test to my original question.

The following graph shows some additional new results.

Throughput for different combinations of options

+1

[Wrong; see comments.]

I believe the article you reference is incorrect in stating that FlushFileBuffers has any beneficial effect on unbuffered I/O. It refers to a Microsoft paper, but the paper in question makes no such claim.

According to the documentation, using unbuffered I/O has the same effect as, but is more efficient than, calling FlushFileBuffers after each write. So the practical solution is to use unbuffered I/O rather than calling FlushFileBuffers.

Note, however, that using a memory-mapped file defeats the buffering settings. I would not recommend using a memory-mapped file if you are trying to push data to disk as quickly as possible.

0
