Improving parsing performance

Question

Before you start: I know the term "premature optimization". However, the following snippets have proven to be an area where improvements can be made.

Alright. We currently have some network code that works with string-based packets. I know that using strings for packets is stupid, crazy and slow. Unfortunately, we have no control over the client and so have to use strings.

Each packet is terminated by \0\r\n, and we currently use a StreamReader/StreamWriter to read individual packets off the stream. Our main bottleneck comes from two places.

First: we need to trim that nasty little null byte off the end of the line. We currently use the following code:

    line = await reader.ReadLineAsync();
    line = line.Replace("\0", ""); // PERF this allocates a new string
    if (string.IsNullOrWhiteSpace(line))
        return null;
    var packet = ClientPacket.Parse(line, cl.Client.RemoteEndPoint);

As you can see from that cute little comment, we have a GC performance issue when trimming the "\0". There are many different ways to cut a "\0" off the end of a string, but all of them result in the same GC hit we get now. Because strings are immutable, every string operation creates a new string object. Since our server handles 1000+ connections, each communicating at around 25-40 packets per second (it is a game server), this GC matter is becoming an issue. So here is my first question: what is a more efficient way to trim that "\0" off the end of our string? By efficient I mean not only speed but also GC-wise (ultimately I would like a way to get rid of it without creating a new string object!).

Our second problem also stems from GC land. Our code looks something like this:

    private static string[] emptyStringArray = new string[] { }; // so we dont need to allocate this

    public static ClientPacket Parse(string line, EndPoint from)
    {
        const char seperator = '|';

        var first_seperator_pos = line.IndexOf(seperator);
        if (first_seperator_pos < 1)
        {
            return new ClientPacket(NetworkStringToClientPacketType(line), emptyStringArray, from);
        }
        var name = line.Substring(0, first_seperator_pos);
        var type = NetworkStringToClientPacketType(name);

        if (line.IndexOf(seperator, first_seperator_pos + 1) < 1)
            return new ClientPacket(type, new string[] { line.Substring(first_seperator_pos + 1) }, from);

        return new ClientPacket(type, line.Substring(first_seperator_pos + 1).Split(seperator), from);
    }

(Where NetworkStringToClientPacketType is just a big switch-case block.)

As you can see, we already do a few things to keep the GC in check. We reuse a single "empty" array and we check for packets without parameters. My only problem here is that we use Substring a lot, and even chain a Split onto the end of a Substring. For an average packet this results in almost 20 new string objects being created and 12 being disposed of, for EVERY PACKET. This causes a lot of performance problems once the load rises above roughly 400 users (and we have fast RAM :3).

Has anyone had experience with this sort of thing before, or could you give us some pointers on what to look into next? Maybe some magic classes or some nifty pointer wizardry?

(PS. StringBuilder does not help, since we are not building strings; we are generally splitting them.)

We currently have some ideas around an index-based system, where we store the index and length of each parameter rather than splitting them into separate strings. Thoughts?
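To make the index-based idea concrete, here is a minimal sketch of how it might look. The names `Segment`, `PacketIndexer`, `IndexFields` and `ParseNonNegativeInt` are hypothetical and not part of the code above; the sketch assumes the caller owns a reusable `Segment[]` buffer and that numeric parameters are plain non-negative decimal digits. It also skips a trailing '\0' by shrinking the scanned range instead of creating a trimmed copy of the string.

```csharp
using System;

// Hypothetical type: a (start, length) window into the packet line.
public struct Segment
{
    public int Start;
    public int Length;
}

public static class PacketIndexer
{
    // Fills 'segments' with the packet name (index 0) followed by its parameters
    // and returns how many fields were recorded. No strings are allocated here.
    // Fields beyond segments.Length are silently ignored in this sketch.
    public static int IndexFields(string line, Segment[] segments, char separator = '|')
    {
        int end = line.Length;
        // Ignore trailing '\0' terminators rather than building a trimmed string.
        while (end > 0 && line[end - 1] == '\0') end--;

        int count = 0;
        int start = 0;
        for (int i = 0; i <= end && count < segments.Length; i++)
        {
            if (i == end || line[i] == separator)
            {
                segments[count].Start = start;
                segments[count].Length = i - start;
                count++;
                start = i + 1;
            }
        }
        return count;
    }

    // Reads a parameter as an integer straight from the line, with no Substring.
    // Assumption: the field holds only decimal digits; overflow is not checked.
    public static int ParseNonNegativeInt(string line, Segment seg)
    {
        if (seg.Length == 0) throw new FormatException("Empty field.");
        int value = 0;
        for (int i = seg.Start; i < seg.Start + seg.Length; i++)
        {
            char c = line[i];
            if (c < '0' || c > '9') throw new FormatException("Not a decimal digit.");
            value = value * 10 + (c - '0');
        }
        return value;
    }
}
```

With this shape, a string only needs to be materialized (via `line.Substring(seg.Start, seg.Length)`) for the parameters that are genuinely consumed as text, and one `Segment[]` per connection can be reused across packets.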

A couple of other things. From decompiling mscorlib and looking at the string class code, it seems to me that IndexOf calls are made via P/Invoke, which would mean added overhead on every call; correct me if I am wrong. Wouldn't it be faster to implement IndexOf manually over a char[] array?
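For reference, a hand-rolled search over a char[] could be as small as the sketch below. `CharSearch` is a hypothetical name, and whether this actually beats the built-in `string.IndexOf(char)` is not something this sketch claims; that would have to be measured on the target runtime.

```csharp
using System;

public static class CharSearch
{
    // Returns the index of the first 'value' in buffer[start .. start + count),
    // or -1 when it does not occur in that range.
    public static int IndexOf(char[] buffer, int start, int count, char value)
    {
        int end = start + count;
        for (int i = start; i < end; i++)
        {
            if (buffer[i] == value) return i;
        }
        return -1;
    }
}
```

The practical benefit of this form is less the loop itself and more that it works on a reusable char[] buffer, so the data never has to become a string before it is searched.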

    public int IndexOf(string value, int startIndex, int count, StringComparison comparisonType)
    {
        ...
        return TextInfo.IndexOfStringOrdinalIgnoreCase(this, value, startIndex, count);
        ...
    }

    internal static int IndexOfStringOrdinalIgnoreCase(string source, string value, int startIndex, int count)
    {
        ...
        if (TextInfo.TryFastFindStringOrdinalIgnoreCase(4194304, source, startIndex, value, count, ref result))
        {
            return result;
        }
        ...
    }

    ...

    [DllImport("QCall", CharSet = CharSet.Unicode)]
    [return: MarshalAs(UnmanagedType.Bool)]
    private static extern bool InternalTryFindStringOrdinalIgnoreCase(int searchFlags, string source, int sourceCount, int startIndex, string target, int targetCount, ref int foundIndex);

Then we get to String.Split, which ends up calling Substring (somewhere down the line):

    // string
    private string[] InternalSplitOmitEmptyEntries(int[] sepList, int[] lengthList, int numReplaces, int count)
    {
        int num = (numReplaces < count) ? (numReplaces + 1) : count;
        string[] array = new string[num];
        int num2 = 0;
        int num3 = 0;
        int i = 0;
        while (i < numReplaces && num2 < this.Length)
        {
            if (sepList[i] - num2 > 0)
            {
                array[num3++] = this.Substring(num2, sepList[i] - num2);
            }
            num2 = sepList[i] + ((lengthList == null) ? 1 : lengthList[i]);
            if (num3 == count - 1)
            {
                while (i < numReplaces - 1)
                {
                    if (num2 != sepList[++i])
                    {
                        break;
                    }
                    num2 += ((lengthList == null) ? 1 : lengthList[i]);
                }
                break;
            }
            i++;
        }
        if (num2 < this.Length)
        {
            array[num3++] = this.Substring(num2);
        }
        string[] array2 = array;
        if (num3 != num)
        {
            array2 = new string[num3];
            for (int j = 0; j < num3; j++)
            {
                array2[j] = array[j];
            }
        }
        return array2;
    }

Fortunately, Substring looks fast (and efficient!):

    private unsafe string InternalSubString(int startIndex, int length, bool fAlwaysCopy)
    {
        if (startIndex == 0 && length == this.Length && !fAlwaysCopy)
        {
            return this;
        }
        string text = string.FastAllocateString(length);
        fixed (char* ptr = &text.m_firstChar)
        {
            fixed (char* ptr2 = &this.m_firstChar)
            {
                string.wstrcpy(ptr, ptr2 + (IntPtr)startIndex, length);
            }
        }
        return text;
    }

After reading this answer here, I think a pointer-based solution could be found... Thoughts?

Thanks.

performance garbage-collection string c# parsing

jduncanator Sep 09 '13 at 6:54

1 answer

xanatos · Answer 1 · 2013-09-09T07:10:29+0000

You could "cheat" and work at the Encoder level...

    public class UTF8NoZero : UTF8Encoding
    {
        public override Decoder GetDecoder()
        {
            return new MyDecoder();
        }
    }

    public class MyDecoder : Decoder
    {
        // Note: Encoding.GetChars is stateless, so a multi-byte sequence split
        // across two reads is not handled by this class.
        public Encoding UTF8 = new UTF8Encoding();

        public override int GetCharCount(byte[] bytes, int index, int count)
        {
            return UTF8.GetCharCount(bytes, index, count);
        }

        public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex)
        {
            int count2 = UTF8.GetChars(bytes, byteIndex, byteCount, chars, charIndex);
            int i, j;

            // Compact the decoded chars in place, skipping every '\0'.
            for (i = charIndex, j = charIndex; i < charIndex + count2; i++)
            {
                if (chars[i] != '\0')
                {
                    chars[j] = chars[i];
                    j++;
                }
            }

            // Clear the now-unused tail of the decoded range.
            for (int k = j; k < charIndex + count2; k++)
            {
                chars[k] = '\0';
            }

            // (i - j) is the number of '\0' chars removed, so the number of
            // chars actually written is count2 minus that.
            return count2 - (i - j);
        }
    }

Note that this cheat relies on the fact that StreamReader.ReadLineAsync only uses GetChars(). We remove the '\0' in the temporary char[] buffer used by StreamReader.ReadLineAsync.
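As a rough illustration of how such an encoding might be plugged into a reader, here is a self-contained sketch. `NullStrippingEncoding`, `NullStrippingDecoder` and `DecoderDemo` are hypothetical names, and the sketch makes two assumptions of its own: it wraps a stateful `Decoder` rather than an `Encoding`, and it uses the synchronous `ReadLine` over a `MemoryStream` purely to keep the demo short. Whether a given runtime's `StreamReader` routes every read through this `GetChars` overload should be verified rather than taken for granted.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical encoding that hands out a decoder which drops '\0' characters.
public sealed class NullStrippingEncoding : UTF8Encoding
{
    public override Decoder GetDecoder()
    {
        return new NullStrippingDecoder();
    }
}

public sealed class NullStrippingDecoder : Decoder
{
    // Stateful inner decoder (an assumption of this sketch), so a multi-byte
    // sequence split across two reads can still be decoded.
    private readonly Decoder inner = new UTF8Encoding().GetDecoder();

    public override int GetCharCount(byte[] bytes, int index, int count)
    {
        // Upper bound: the filtered output is never longer than this.
        return inner.GetCharCount(bytes, index, count);
    }

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex)
    {
        int decoded = inner.GetChars(bytes, byteIndex, byteCount, chars, charIndex);
        int write = charIndex;
        // Compact in place, keeping everything except '\0'.
        for (int read = charIndex; read < charIndex + decoded; read++)
        {
            if (chars[read] != '\0') chars[write++] = chars[read];
        }
        return write - charIndex;
    }
}

public static class DecoderDemo
{
    // Reads the first line from raw packet bytes through the filtering encoding.
    public static string ReadFirstLine(byte[] wire)
    {
        using (var reader = new StreamReader(new MemoryStream(wire), new NullStrippingEncoding()))
        {
            return reader.ReadLine();
        }
    }
}
```

In the server, the same encoding instance would be passed to the `StreamReader` that wraps the network stream, so the line returned by the reader should already be free of the terminator and the `Replace` call could be dropped.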
