Save a specific part of a huge text file (more than 2 GB)

I have large log files that contain a timestamp every second. I need to cut a user-defined part out of such a file and save it to another text file. I am stuck because, as far as I know, the fstream class can only handle files up to 2 GB, and reading every line costs too much time and memory.

Timestamp format: !<<dd.mm.yyyy hh:min:sec>, one entry per second and one entry per line. Someone experienced suggested using LINQ and ReadLine().

sample file:

!<<14.12.2012 16:20:03> some text some text some some text some text some some text some text some
!<<14.12.2012 16:20:04> some text some text some some text some text some some text some text some some text some text some some text some text some
!<<14.12.2012 16:20:05> some text some text some
!<<14.12.2012 16:20:06> some text some text some some text some text some

and so on until EOF.

2 answers

ReadLine is not what you want here. Open a file stream, seek to the desired position, and read the data you need (writing it to another file stream).

ReadLine actually has to read the data, whereas a seek (myStream.Position = whereIWantToGo) is essentially instantaneous.
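To make the "seek, then copy into another file stream" idea concrete, here is a minimal sketch. The class and method names (RangeCopy, CopyRange) and the buffer size are my own placeholders, not anything from the question, and it assumes you already know the start and end byte offsets of the part you want.

```csharp
using System;
using System.IO;

static class RangeCopy
{
    // Hypothetical helper: copies bytes [start, end) of sourcePath into destPath.
    // Offsets are long, so positions beyond 2 GB are fine.
    public static void CopyRange(string sourcePath, string destPath, long start, long end)
    {
        using (var source = File.OpenRead(sourcePath))
        using (var dest = File.Create(destPath))
        {
            // Jump straight to the start offset; nothing before it is read.
            source.Seek(start, SeekOrigin.Begin);

            var buffer = new byte[81920];
            long remaining = end - start;
            while (remaining > 0)
            {
                int read = source.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
                if (read == 0)
                {
                    break; // End of file reached before the end offset.
                }
                dest.Write(buffer, 0, read);
                remaining -= read;
            }
        }
    }
}
```

Memory use stays at one buffer regardless of file size, which is the point of working with offsets instead of reading every line.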

Treat the file the way you would a sorted database. Finding a record among 1,000,000 takes only about 20 probes: start halfway. Too high? You have just ruled out 500,000 records. Go halfway back. Too high again? That rules out another 250,000. Rinse and repeat.
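If the "about 20 probes" figure sounds too good, here is a small in-memory sketch that counts the probes. It searches a sorted array rather than a file, and the names (BinarySearchDemo, LowerBound) are made up for illustration.

```csharp
using System;

static class BinarySearchDemo
{
    // Returns the index of the first element >= target in a sorted array
    // and reports how many elements had to be inspected along the way.
    public static int LowerBound(long[] sorted, long target, out int probes)
    {
        int lo = 0;
        int hi = sorted.Length;
        probes = 0;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2; // Midpoint of the remaining range.
            probes++;
            if (sorted[mid] < target)
            {
                lo = mid + 1;   // Too low: discard the lower half.
            }
            else
            {
                hi = mid;       // Too high or equal: discard the upper half.
            }
        }
        return lo;
    }
}
```

Each pass halves the remaining range, so for 1,000,000 sorted values the loop runs at most 20 times. The same idea carries over to a log file, with byte offsets taking the place of array indexes.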

If you find funny characters (bad encoding)

In reply to your email (by the way, you really should keep using SO rather than email, so that other people can benefit): you need to try different encodings. Your file may not be UTF-8 encoded (which is what my code below expects). So use new StreamReader("MyLogFile.txt", Encoding.ASCII) or some other encoding until it works for you.

C# console application to get you started

Disclaimer: this code is rough and may have bugs, possibly even an infinite loop :) But here is a console application that should work for you.
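One way to take the guesswork out of "try encodings until it works" is to let the program try them. This is only a sketch: the names (EncodingProbe, Detect) and the list of candidates are my assumptions, and it relies on the first line of the log starting with a known prefix such as "!<<".

```csharp
using System;
using System.IO;
using System.Text;

static class EncodingProbe
{
    // Hypothetical helper: returns the first candidate encoding under which the
    // first line of the file starts with expectedPrefix, or null if none does.
    public static Encoding Detect(string path, string expectedPrefix)
    {
        // Candidate list is an assumption; extend it with whatever your logs might use.
        var candidates = new[]
        {
            Encoding.UTF8, Encoding.Unicode, Encoding.BigEndianUnicode, Encoding.ASCII
        };

        foreach (var candidate in candidates)
        {
            // Turn off automatic BOM detection so each candidate is really tested.
            using (var reader = new StreamReader(path, candidate, false))
            {
                string first = reader.ReadLine();
                if (first != null && first.StartsWith(expectedPrefix, StringComparison.Ordinal))
                {
                    return candidate;
                }
            }
        }
        return null;
    }
}
```

A plain ASCII file will be reported as UTF-8 here, which is harmless because ASCII text decodes identically under UTF-8.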

    using System;
    using System.Collections.Generic;
    using System.Globalization;
    using System.IO;
    using System.Text;

    namespace ConsoleApplication1
    {
        class Program
        {
            static void Main(string[] args)
            {
                // Example dates.
                var lookFor = new DateTime(2012, 12, 14, 16, 20, 2);
                var readUntilDate = new DateTime(2012, 12, 14, 16, 20, 5);

                using (var stream = File.OpenText("MyLogFile.txt"))
                {
                    if (SeekToEntry(stream, lookFor) == false)
                    {
                        Console.WriteLine("No entry at or after {0}", lookFor);
                        return;
                    }

                    foreach (var line in ReadEntriesUntil(stream, readUntilDate))
                    {
                        Console.WriteLine("Line: {0}", line);
                    }
                }
            }

            // Parses the timestamp of a line such as "!<<14.12.2012 16:20:03> some text".
            // Returns null if the line is not the start of an entry.
            static DateTime? ParseEntryDate(string line)
            {
                DateTime date;
                if (line.StartsWith("!<<", StringComparison.Ordinal) && line.Length >= 22 &&
                    DateTime.TryParseExact(line.Substring(3, 19), "dd.MM.yyyy HH:mm:ss",
                        CultureInfo.InvariantCulture, DateTimeStyles.None, out date))
                {
                    return date;
                }
                return null;
            }

            // Yields one line at a time until it reaches an entry at or after
            // the cut-off.
            static IEnumerable<string> ReadEntriesUntil(StreamReader stream, DateTime target)
            {
                string line;
                while ((line = stream.ReadLine()) != null)
                {
                    DateTime? entryDate = ParseEntryDate(line);
                    if (entryDate != null && entryDate.Value >= target)
                    {
                        yield break;
                    }
                    yield return line;
                }
            }

            // Binary-searches the file by byte offset and leaves the reader at the
            // first entry whose timestamp is >= target. Returns false if there is none.
            static bool SeekToEntry(StreamReader stream, DateTime target)
            {
                Stream raw = stream.BaseStream;
                long from = 0;
                long to = raw.Length;

                while (from < to)
                {
                    // Midpoint of the remaining range (not just half its width).
                    long testIndex = from + (to - from) / 2;
                    long entryIndex;
                    DateTime? entryDate = GetNextEntryDate(raw, testIndex, out entryIndex);
                    if (entryDate == null || entryDate.Value >= target)
                    {
                        to = testIndex;         // Too high (or no more entries): look earlier.
                    }
                    else
                    {
                        from = entryIndex + 1;  // Too low: look after this entry.
                    }
                }

                long start;
                bool found = GetNextEntryDate(raw, from, out start) != null;
                raw.Seek(start, SeekOrigin.Begin);
                stream.DiscardBufferedData();   // The reader's buffer is stale after a seek.
                return found;
            }

            // Finds the first entry whose line starts at or after byte offset index.
            // actualIndex receives the byte offset of that line (or the file length).
            static DateTime? GetNextEntryDate(Stream raw, long index, out long actualIndex)
            {
                // Start one byte early and skip through the next newline, so a line
                // we landed in the middle of is never parsed.
                raw.Seek(Math.Max(index - 1, 0), SeekOrigin.Begin);
                if (index > 0)
                {
                    ReadRawLine(raw);
                }

                while (true)
                {
                    actualIndex = raw.Position;
                    string line = ReadRawLine(raw);
                    if (line == null)
                    {
                        return null;
                    }
                    DateTime? entryDate = ParseEntryDate(line);
                    if (entryDate != null)
                    {
                        return entryDate;
                    }
                }
            }

            // Reads one line as raw bytes so that Stream.Position stays exact
            // (StreamReader reads ahead into its own buffer). Assumes the timestamp
            // prefix is ASCII-compatible: ASCII, UTF-8 or a single-byte code page.
            static string ReadRawLine(Stream raw)
            {
                var bytes = new List<byte>();
                int b;
                while ((b = raw.ReadByte()) != -1 && b != '\n')
                {
                    bytes.Add((byte)b);
                }
                if (b == -1 && bytes.Count == 0)
                {
                    return null;
                }
                return Encoding.ASCII.GetString(bytes.ToArray()).TrimEnd('\r');
            }
        }
    }

Start with an educated guess about how far into the file the timestamp is. If you cannot make one, start in the middle; essentially, do a binary search.

Once you have seeked to a position, read a few lines (*) until you find a timestamp. At that point you either have your timestamp or can tell whether it lies before or after the current position. If it is not your timestamp, seek backward or forward by a sensible amount and repeat until you find the timestamp you need.

With this technique, you can probably find your timestamp in just a few dozen seeks or so.

You might want to read about Seek on MSDN.

* Remember that after a seek, the file pointer may not be at the beginning of a line. The method still works, but you need to keep this in mind once you have narrowed your search to a very small range.
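A rough sketch of how the footnote can be handled: after jumping to an arbitrary offset, skip forward to the next line break before reading anything. The names (LineAlign, NextLineStart) are placeholders, and the code assumes lines end with '\n' (optionally preceded by '\r') in an ASCII-compatible encoding.

```csharp
using System;
using System.IO;

static class LineAlign
{
    // Hypothetical helper: returns the byte offset of the first line that
    // starts at or after the given offset, or the stream length if none does.
    public static long NextLineStart(Stream stream, long offset)
    {
        if (offset <= 0)
        {
            return 0; // The beginning of the file is always a line start.
        }

        // Step back one byte: if that byte is '\n', offset is already a line start.
        stream.Seek(offset - 1, SeekOrigin.Begin);
        int b;
        while ((b = stream.ReadByte()) != -1 && b != '\n')
        {
            // Keep skipping the remainder of the partial line.
        }
        return stream.Position;
    }
}
```

For example, calling LineAlign.NextLineStart(stream, stream.Length / 2) on an open FileStream gives a safe place to start reading whole lines near the middle of the file.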


Source: https://habr.com/ru/post/1386349/

