How to get the line number in an XML file when it exceeds int.MaxValue

I cannot get the line number in an XML file that is about 300 GB. IXmlLineInfo.LineNumber is an Int32, and once the line count exceeds int.MaxValue, a negative number is returned. It makes no difference whether I store the line number in an int or a long; the result is the same. XmlReader is able to read to the end of the file. I am using .NET 2.0, and the latest version also uses Int32.

    public void ReadLines()
    {
        long readcounter = 0;
        long linenumber = 0;
        fname = "I:\\XML Files\\europe-latest.osm";
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ProhibitDtd = false;
        settings.XmlResolver = null;
        XmlReader reader = XmlReader.Create(fname, settings);
        IXmlLineInfo lineInfo = ((IXmlLineInfo)reader);
        try
        {
            while (reader.Read())
            {
                linenumber = lineInfo.LineNumber;
                readcounter++;
                if (readcounter % 1000000 == 0)
                    Console.WriteLine(linenumber.ToString());
            }
        }
        catch (XmlException ex)
        {
            Console.WriteLine(ex.Message);
            Console.ReadLine();
        }
        finally
        {
            reader.Close();
            Console.WriteLine(DateTime.Now.ToLongTimeString());
        }
    }
+6
2 answers

You can try one of the following:

1) Use System.Numerics.BigInteger to keep the actual line number. After each read operation, check whether the reported line number is less than it was before, and accumulate the actual line number in a BigInteger. Be aware that in a very large file the counter can wrap all the way around and end up larger than before (for example, after reading an element that spans 5 billion lines, which causes several internal overflows), and this check cannot detect that case:

    Int32 lastInt32Line = lineInfo.LineNumber;
    var actualLine = new System.Numerics.BigInteger(lastInt32Line);

    // ... some XML reading ...

    Int32 currentLine = lineInfo.LineNumber;
    // Subtract in 64 bits so the subtraction itself cannot overflow.
    Int64 diff = (Int64)currentLine - lastInt32Line;
    if (diff < 0)
        // An overflow has happened: the Int32 counter wrapped, so add back 2^32.
        actualLine += diff + 4294967296L;
    else
        // Everything is normal: add the difference.
        actualLine += diff;
    lastInt32Line = currentLine;

The real potential problem is that even if you track the line number correctly, the internals of XmlReader may start to misbehave once its own counter overflows. In my opinion, integer arithmetic should be checked by default rather than unchecked, as it is now: with unchecked arithmetic, an overflow silently corrupts the state of the class unless checking is explicitly requested.

2) Reorganize the data storage so that the data can be processed in smaller fragments.
3) Write your own XmlReader that uses BigInteger.
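
Option 1 could be packaged as a small helper along these lines. This is only a sketch: `WrappingLineCounter` is a hypothetical name, not a .NET type, and it uses a long instead of BigInteger on the assumption that 2^63 lines is more than any real file holds. It also assumes that fewer than 2^31 lines pass between two consecutive calls to `Update`, so the caveat about very large elements still applies.

```csharp
using System;

// Hypothetical helper (not part of .NET): widens a wrapping Int32 line number to 64 bits.
public sealed class WrappingLineCounter
{
    private int lastReported;

    // The widened line number accumulated so far.
    public long ActualLine { get; private set; }

    // firstReported: the value of IXmlLineInfo.LineNumber when tracking starts
    // (assumed not to have wrapped yet).
    public WrappingLineCounter(int firstReported)
    {
        lastReported = firstReported;
        ActualLine = firstReported;
    }

    // Call after every reader.Read() with the current lineInfo.LineNumber.
    public long Update(int reported)
    {
        // Unchecked Int32 subtraction wraps modulo 2^32, so the delta is still
        // correct across an overflow as long as fewer than 2^31 lines were
        // consumed since the previous call.
        int delta = unchecked(reported - lastReported);
        ActualLine += delta;
        lastReported = reported;
        return ActualLine;
    }
}
```

In the question's loop, the direct assignment would then become something like `linenumber = counter.Update(lineInfo.LineNumber);`, with `counter` created from the reader's line number before the first `Read()`.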

+1

After a little research with dotPeek, the problem seems to be deeply rooted in the internal class XmlTextReaderImpl (this should be the actual type of the reader you are using) and the internal types it uses:

    internal struct LineInfo
    {
        internal int lineNo;
        internal int linePos;
        // ...
    }

If you want to approach this with the minimum required work, I recommend that you get the .NET source code and create your own XML reader by copying XmlTextReaderImpl (and all related internal types), replacing all int line numbers with BigIntegers. If you want to hide the concrete type, you can create an IXmlBigLineInfo or similar interface and use it instead of IXmlLineInfo.
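
As a rough illustration of what such an interface might look like, here is a minimal sketch. Both `IXmlBigLineInfo` and `BigLineInfo` are hypothetical: the interface simply mirrors the three members of System.Xml.IXmlLineInfo with a BigInteger line number, and the class stands in for a widened version of the internal LineInfo struct. The real work of copying and adapting XmlTextReaderImpl is not shown, and the line position is left as an int on the assumption that only line numbers overflow.

```csharp
using System.Numerics;

// Hypothetical interface modeled on System.Xml.IXmlLineInfo,
// but with a line number that cannot overflow.
public interface IXmlBigLineInfo
{
    bool HasLineInfo();
    BigInteger LineNumber { get; }
    int LinePosition { get; }
}

// Hypothetical widened counterpart of the internal LineInfo struct.
public sealed class BigLineInfo : IXmlBigLineInfo
{
    // Line numbers and positions start at 1, as they do in IXmlLineInfo.
    private BigInteger lineNo = BigInteger.One;
    private int linePos = 1;

    public bool HasLineInfo() { return true; }
    public BigInteger LineNumber { get { return lineNo; } }
    public int LinePosition { get { return linePos; } }

    // A copied reader would call this whenever it consumes a line break.
    public void NewLine()
    {
        lineNo += BigInteger.One;
        linePos = 1;
    }
}
```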

Hope this helps.

+1

Source: https://habr.com/ru/post/971555/

