Parsing an extremely large single-line file

I download data from a site, and the site delivers it in very large blocks. Within each block there are "chunks" that I need to parse individually. These chunks begin with "<ClinicalData>" and end with "</ClinicalData>", so an example line looks something like this:

<ClinicalData><ID="1"></ClinicalData><ClinicalData><ID="2"></ClinicalData><ClinicalData><ID="3"></ClinicalData><ClinicalData><ID="4"></ClinicalData><ClinicalData><ID="5"></ClinicalData>

Under "ideal" circumstances each block is meant to be a single line of data, but sometimes stray newline characters appear. Since I want to parse the <ClinicalData> chunks within a block, I want to be able to read my data line by line. So I take the text file, read it all into a StringBuilder, strip the newlines (just in case), and then insert my own newline characters so that I can read it line by line.

StringBuilder dataToWrite = new StringBuilder(File.ReadAllText(filepath), Int32.MaxValue);

// Need to clear newline characters just in case they exist.
dataToWrite.Replace("\n", "");

// set my own newline characters so the data becomes parse-able by line 
dataToWrite.Replace("<ClinicalData", "\n<ClinicalData");

// set the data back into a file, which is then used in a StreamReader to parse by lines.
File.WriteAllText(filepath, dataToWrite.ToString());

This worked fine (inefficient perhaps, but at least it was easy for me :)) until I ran into a chunk of data that is delivered to me as a 280 MB file.

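Now I get a System.OutOfMemoryException, and I cannot find a way around it. I suspect the problem is that StringBuilder cannot handle 280 MB of text? I have tried string splits, regex.match splits on "<ClinicalData", and various other approaches, but I keep getting the exception. I also had no luck reading the file in predefined chunks (for example, with .ReadBytes).

Any suggestions on how to handle a 280 MB file that may well be a single line of text would be great!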

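That is an inefficient way to read a text file, let alone a large one. If you only need a single pass that replaces or inserts individual characters, use a StreamReader. If one character of lookahead is enough, you only need to keep a single intermediate state, something like this: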

enum ReadState
{
    Start,
    SawOpen
}


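using (var sr = new StreamReader(@"path\to\clinic.txt"))
using (var sw = new StreamWriter(@"path\to\output.txt"))
{
    var rs = ReadState.Start;
    while (true)
    {
        var r = sr.Read();
        if (r < 0)
        {
            // End of input: flush a '<' that was held back for lookahead.
            if (rs == ReadState.SawOpen)
                sw.Write('<');
            break;
        }

        char c = (char) r;
        // Drop any line breaks already present in the input.
        if ((c == '\r') || (c == '\n'))
            continue;

        if (rs == ReadState.SawOpen)
        {
            // "<C" is treated as the start of <ClinicalData>, so begin a new line.
            // Note that any other tag starting with 'C' would match as well.
            if (c == 'C')
                sw.WriteLine();

            sw.Write('<');
            rs = ReadState.Start;
        }

        if (c == '<')
        {
            // Hold the '<' back until the next character has been seen.
            rs = ReadState.SawOpen;
            continue;
        }

        sw.Write(c);
    }
}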

First of all, there is no need for a StringBuilder here. The following is equivalent:

File.ReadAllText(filepath).Replace("\n", "").Replace("<ClinicalData", "\n<ClinicalData");

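Have you considered using a StreamReader instead? You could read the file in "pieces" through a buffer and parse out everything between <ClinicalData> and </ClinicalData>. Something like this: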

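        char[] buffer = new char[1024];
        string remainder = string.Empty;
        List<ClientData> list = new List<ClientData>();

        using (StreamReader reader = File.OpenText(@"source.txt"))
        {
            int read;
            // Read returns the number of characters actually placed in the buffer.
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Use only the characters just read; the rest of the buffer is stale.
                remainder = Parse(remainder + new string(buffer, 0, read), list);
            }
        }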

where Parse is:

string Parse(string value, List<ClientData> list)
{
    string[] parts = value.Split(new string[1] { "</ClinicalData>" }, StringSplitOptions.None);
    for (int i = 0; i < parts.Length - 1; i++)
        list.Add(new ClientData(parts[i]));

    return parts[parts.Length - 1];
}

and ClientData is some class, for example:

class ClientData
{
    public ClientData(string value)
    {
        // fill in however you are already parsing out ID, and other info
    }
}

It is a bit rough, but it should give you the idea.
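
The buffered approach above can also be written as a lazy iterator, so that chunks are handed to the caller one at a time instead of being collected in a list. This is only a rough sketch: ChunkReader, ReadChunks, and the bufferSize parameter are names invented for illustration, and it assumes that each individual chunk is small enough to hold in memory.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class ChunkReader
{
    // Hypothetical helper: yields each piece of text up to and including closeTag.
    public static IEnumerable<string> ReadChunks(TextReader reader, string closeTag, int bufferSize = 4096)
    {
        var buffer = new char[bufferSize];
        var pending = new StringBuilder();
        int read;

        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Append only the characters actually read, dropping stray line breaks.
            for (int i = 0; i < read; i++)
            {
                char c = buffer[i];
                if (c != '\r' && c != '\n')
                    pending.Append(c);
            }

            // Emit every complete chunk currently sitting in the pending text.
            string text = pending.ToString();
            int start = 0;
            int end;
            while ((end = text.IndexOf(closeTag, start, StringComparison.Ordinal)) >= 0)
            {
                int next = end + closeTag.Length;
                yield return text.Substring(start, next - start);
                start = next;
            }

            // Keep the incomplete tail for the next read.
            pending.Clear();
            pending.Append(text, start, text.Length - start);
        }
        // Trailing text without a closing tag is ignored in this sketch.
    }
}
```

It could be used with something like foreach (string chunk in ChunkReader.ReadChunks(File.OpenText(@"source.txt"), "</ClinicalData>")), so that only the current chunk and one buffer are in memory at any time.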

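StreamReader has a ReadLine() method that reads one line at a time, which lets you handle each ClinicalData entry separately instead of loading the whole file. See http://msdn.microsoft.com/en-us/library/9kstw824%28v=vs.110%29.aspx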

Also, if the data is XML, XmlReader is a better fit. See http://msdn.microsoft.com/en-us/library/system.xml.xmlreader%28v=vs.110%29.aspx
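
If the real data is well-formed XML, the XmlReader suggestion might look roughly like the sketch below. Note that the sample line in the question (<ID="1">) is not well-formed, so this assumes the actual file is. ClinicalXml and ReadElements are names made up for this example. ConformanceLevel.Fragment is used because the file contains many top-level elements rather than a single root.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class ClinicalXml
{
    // Hypothetical helper: yields the outer XML of each matching element, one at a time.
    public static IEnumerable<string> ReadElements(TextReader input, string elementName)
    {
        var settings = new XmlReaderSettings
        {
            // Allow several top-level elements instead of a single root.
            ConformanceLevel = ConformanceLevel.Fragment,
            // Skip stray line breaks between elements.
            IgnoreWhitespace = true
        };

        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                {
                    // ReadOuterXml already advances past the element,
                    // so Read() must not be called again in this branch.
                    yield return reader.ReadOuterXml();
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}
```

The reader streams through the input, so only the element currently being returned is materialized as a string.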

Source: https://habr.com/ru/post/1568806/

