Parsing without splitting the string

This is a spin-off from a discussion on another question.

Suppose I need to parse a huge number of very long lines. Each line contains a sequence of doubles (in a textual representation, of course), separated by spaces. I need to parse the doubles into a List<double>.

The standard parsing method (using string.Split + double.TryParse) seems rather slow: for each of the numbers we need to allocate a string.
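
For reference, a minimal sketch of that baseline, assuming the numbers use invariant-culture formatting. The class and method names are illustrative and do not come from the original code; tokens that fail to parse are silently skipped here, which may or may not be what you want.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class SplitBaseline
{
    // Parses space-separated doubles. string.Split allocates one temporary
    // string per number, which is the cost the question wants to avoid.
    public static List<double> ParseLine(string s)
    {
        var result = new List<double>();
        string[] tokens = s.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string token in tokens)
        {
            // Assumption: '.' is the decimal separator (invariant culture).
            if (double.TryParse(token, NumberStyles.Float,
                                CultureInfo.InvariantCulture, out double value))
            {
                result.Add(value);
            }
        }
        return result;
    }
}
```

Calling SplitBaseline.ParseLine on a line yields the parsed values in order; the per-token allocation is what the rest of this thread tries to eliminate.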

I tried to do it the old C-like way: compute the start and end indices of the substrings containing the numbers, and parse them "in place" without creating an extra string. (See http://ideone.com/Op6h0 , the relevant part is shown below.)

    int startIdx, endIdx = 0;
    while (true)
    {
        startIdx = endIdx;
        // no find_first_not_of in C#
        while (startIdx < s.Length && s[startIdx] == ' ') startIdx++;
        if (startIdx == s.Length) break;
        endIdx = s.IndexOf(' ', startIdx);
        if (endIdx == -1) endIdx = s.Length;
        // how to extract a double here?
    }

There is an overload of string.IndexOf that searches only within a given substring, but I could not find a method for parsing a double from a substring without first extracting that substring.

Does anyone have any ideas?

+6
2 answers

There is no managed API to parse a double from a substring. My guess is that the string allocation will be insignificant compared to all the floating-point operations in double.Parse.

In any case, you can save the allocation by creating a string "buffer" once, of length 100, consisting only of spaces. Then, for each substring you want to parse, you copy its characters into this buffer string using unsafe code. You fill the rest of the buffer string with spaces. And for parsing, you can use NumberStyles.AllowTrailingWhite, which causes the trailing spaces to be ignored.

Getting a pointer to a string is actually a fully supported operation:

    string l_pos = new string(' ', 100); // don't write to a shared string!
    unsafe
    {
        fixed (char* l_pSrc = l_pos)
        {
            // do some work
        }
    }

C# has special syntax for binding a string to a char*.
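
A note for newer runtimes: the statement that no managed API can parse a double from a substring reflects the .NET Framework of the time. .NET Core 2.1 and later provide a double.Parse overload that takes a ReadOnlySpan<char>, so the index-based loop from the question can be completed without allocating a string per number. The following is a sketch under that assumption; the class and method names are illustrative, and invariant-culture formatting is assumed.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class SpanParsing
{
    // Same scanning loop as in the question, but each number is parsed
    // from a span over the original string instead of a new substring.
    // Throws FormatException if a token is not a valid number.
    public static List<double> ParseLine(string s)
    {
        var result = new List<double>();
        int endIdx = 0;
        while (true)
        {
            int startIdx = endIdx;
            // Skip the separating spaces.
            while (startIdx < s.Length && s[startIdx] == ' ') startIdx++;
            if (startIdx == s.Length) break;

            endIdx = s.IndexOf(' ', startIdx);
            if (endIdx == -1) endIdx = s.Length;

            // AsSpan creates a view into s; no characters are copied.
            ReadOnlySpan<char> token = s.AsSpan(startIdx, endIdx - startIdx);
            result.Add(double.Parse(token, NumberStyles.Float,
                                    CultureInfo.InvariantCulture));
        }
        return result;
    }
}
```

Whether this is measurably faster than string.Split plus double.TryParse depends on the input and the runtime, so it is worth profiling before committing to it.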

+7

If you want this to be really fast, I would use a state machine.

It might look like this:

    enum State { Separator, Sign, Mantissa /* etc. */ }

    State CurrentState = State.Separator;
    int Prefix, Exponent, Mantissa;
    foreach (var ch in InputString)
    {
        switch (CurrentState)
        {
            // set the new CurrentState depending on ch and CurrentState
            case State.Separator:
                GotNewDouble(Prefix, Exponent, Mantissa);
                break;
        }
    }
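
To make the outline above more concrete, here is one possible way to fill it in. This is a sketch, not the answerer's actual implementation: the state names, the StateMachineParser class, and the validation rules are assumptions. It does only minimal validation (a lone sign or a dangling "e" is not rejected), ignores exponent overflow, and computes the value with Math.Pow, so results may differ from double.Parse in the last bits for long or extreme inputs.

```csharp
using System;
using System.Collections.Generic;

static class StateMachineParser
{
    enum State { Separator, Integer, Fraction, ExponentStart, Exponent }

    static bool IsDigit(char ch) => ch >= '0' && ch <= '9';

    static FormatException Bad(char ch, int i) =>
        new FormatException("Unexpected '" + ch + "' at index " + i);

    public static List<double> ParseLine(string s)
    {
        var result = new List<double>();
        State state = State.Separator;
        bool negative = false, expNegative = false;
        double mantissa = 0;            // all digits read so far, as a whole number
        int fractionDigits = 0, exponent = 0;

        // One extra iteration with a virtual trailing space flushes the last number.
        for (int i = 0; i <= s.Length; i++)
        {
            char ch = i < s.Length ? s[i] : ' ';
            if (ch == ' ')
            {
                if (state != State.Separator)
                {
                    // Combine the explicit exponent with the decimal point shift.
                    int e = (expNegative ? -exponent : exponent) - fractionDigits;
                    double value = e >= 0 ? mantissa * Math.Pow(10, e)
                                          : mantissa / Math.Pow(10, -e);
                    result.Add(negative ? -value : value);
                }
                state = State.Separator;
                negative = expNegative = false;
                mantissa = 0; fractionDigits = 0; exponent = 0;
                continue;
            }
            switch (state)
            {
                case State.Separator:
                    if (ch == '-') { negative = true; state = State.Integer; }
                    else if (ch == '+') state = State.Integer;
                    else if (ch == '.') state = State.Fraction;
                    else if (IsDigit(ch)) { mantissa = ch - '0'; state = State.Integer; }
                    else throw Bad(ch, i);
                    break;
                case State.Integer:
                    if (IsDigit(ch)) mantissa = mantissa * 10 + (ch - '0');
                    else if (ch == '.') state = State.Fraction;
                    else if (ch == 'e' || ch == 'E') state = State.ExponentStart;
                    else throw Bad(ch, i);
                    break;
                case State.Fraction:
                    if (IsDigit(ch)) { mantissa = mantissa * 10 + (ch - '0'); fractionDigits++; }
                    else if (ch == 'e' || ch == 'E') state = State.ExponentStart;
                    else throw Bad(ch, i);
                    break;
                case State.ExponentStart:
                    if (ch == '-') { expNegative = true; state = State.Exponent; }
                    else if (ch == '+') state = State.Exponent;
                    else if (IsDigit(ch)) { exponent = ch - '0'; state = State.Exponent; }
                    else throw Bad(ch, i);
                    break;
                case State.Exponent:
                    if (IsDigit(ch)) exponent = exponent * 10 + (ch - '0');
                    else throw Bad(ch, i);
                    break;
            }
        }
        return result;
    }
}
```

The parser makes a single pass over the line and allocates nothing per number apart from growth of the result list. The trade-off is that correctness of rounding and input validation become your responsibility rather than the framework's.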
+2

Source: https://habr.com/ru/post/913236/

