Is my dictionary size normal?

I have a 150 MB file. Each line has the same format, for example:

I,h,q,q,3,A,5,Q,3,[,5,Q,8,c,3,N,3,E,4,F,4,g,4,I,V,9000,0000001-100,G9999999990001800000000000001,G9999999990000001100PDNELKKMMCNELRQNWJ010, , , , , , ,D,Z, 

I have a Dictionary<string, List<string>>

It is filled by opening the file, reading each line, taking elements from the line and adding them to the dictionary; then the file is closed.

 StreamReader s = File.OpenText(file);
 string lineData = null;
 while ((lineData = s.ReadLine()) != null)
 {
     var elements = lineData.Split(',');
     var compareElements = elements.Take(24);                           // the first 24 fields are the compare elements
     FileData.Add(elements[27], new List<string>(compareElements));     // the field at index 27 is the key
 }
 s.Close();

Using the method in this answer, I calculated that my dictionary takes about 600 MB. That is 4 times larger than the file.
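
For reference, the measurement was roughly this (a sketch of the TotalMemory-before-and-after approach from that answer; LoadFileData is just a stand-in name for the loading loop shown above):

 long before = GC.GetTotalMemory(true);   // force a collection, then read the managed heap size
 LoadFileData(file);                      // the read/split/add loop shown above, filling FileData
 long after = GC.GetTotalMemory(true);
 Console.WriteLine("Approximate dictionary size: {0} bytes", after - before);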

Does this sound right?

+4
source share
6 answers

Most of those elements hold only a single character, but you are storing them as strings. The string reference alone takes at least twice as much space (with UTF-8 source data, probably 4-8 times as much). Then there is the overhead of maintaining the hash table structure for the Dictionary.

List<> itself should be a pretty efficient store (it uses an array internally).

Room for improvement:

  • you could use List<char> or char[] instead of List<string> if you know the fields are single characters (see the sketch after this list)
  • you could use a struct Field { char a,b/*,...*/; } and a List<Field> instead of List<string> if you need more than one character per field
  • you could forego the eager field extraction [<- recommended]:

      var dict = File.ReadAllLines(file).ToDictionary(line => line.Split(',')[27]);

    This makes it possible to access the compare elements on demand:

      string[] compareElements = dict["key27"].Split(',')/*.Take(24).ToArray()*/;

    This is a classic example of the trade-off between time and storage.
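
For the first bullet, a rough sketch of what the loading loop could look like with a Dictionary<string, char[]> — assuming every compare field really is exactly one character (the e[0] indexing would fail on an empty field):

 var fileData = new Dictionary<string, char[]>();
 foreach (var line in File.ReadLines(file))
 {
     var elements = line.Split(',');
     // keep only the single character of each of the 24 compare fields
     var compare = elements.Take(24).Select(e => e[0]).ToArray();
     fileData.Add(elements[27], compare);
 }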

Edit: an obvious hybrid:

 struct AllCompareElements
 {
     public char field1, field2, /* ... */ field24;
     // perhaps: public char[] field13; // for the exceptional field that is longer than 1 character
 }

Happily employ Resharper to implement Equals, GetHashCode, IEquatable<AllCompareElements> and IComparable<AllCompareElements>.
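
For illustration, a trimmed-down sketch of that hybrid — only three fields instead of 24, Equals/GetHashCode written out by hand instead of generated, and IComparable omitted:

 struct AllCompareElements : IEquatable<AllCompareElements>
 {
     public char Field1, Field2, Field3;   // ... up to Field24 in the real thing

     public bool Equals(AllCompareElements other)
     {
         return Field1 == other.Field1 && Field2 == other.Field2 && Field3 == other.Field3;
     }

     public override bool Equals(object obj)
     {
         return obj is AllCompareElements && Equals((AllCompareElements)obj);
     }

     public override int GetHashCode()
     {
         // combine the character values into one hash
         return Field1 ^ (Field2 << 8) ^ (Field3 << 16);
     }
 }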

+1
source

Besides the fact that the measuring method is not very reliable, there is even more overhead in your case. Did you notice that every iteration of the loop creates a new instance of the elements array, and that lineData and elements.Take also involve internal objects created on every call? Since you probably have plenty of RAM, the .NET garbage collector does not collect them, so when you measure TotalMemory before and after the loop you measure all of those objects as well, not just your dictionary, even though the dictionary may be the only thing that remains in scope afterwards.
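
In other words, to get a more meaningful number, force a collection before taking each reading — a sketch of the usual pattern:

 GC.Collect();
 GC.WaitForPendingFinalizers();
 GC.Collect();
 long before = GC.GetTotalMemory(true);

 // ... build the dictionary here ...

 GC.Collect();
 GC.WaitForPendingFinalizers();
 GC.Collect();
 long after = GC.GetTotalMemory(true);   // now only objects that are still reachable (the dictionary) are counted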

+3
source

Yes, because you are turning characters into string references, each of which takes 4 or 8 bytes.

+1
source

I assume your file is UTF-8 encoded and contains mostly ASCII. Strings in C# are UTF-16, which explains most of the size difference (a factor of 2). Of course, there is also some overhead for the data structures.
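
A quick way to check that factor of 2 on your own data (a sketch; file is the path from the question, and only the encoding sizes are compared, not object overhead):

 string sampleLine = File.ReadLines(file).First();
 int bytesOnDisk   = Encoding.UTF8.GetByteCount(sampleLine);    // ASCII-heavy UTF-8: about 1 byte per character
 int bytesInMemory = sampleLine.Length * sizeof(char);          // UTF-16 in memory: 2 bytes per character
 Console.WriteLine("{0} bytes on disk vs {1} bytes in memory", bytesOnDisk, bytesInMemory);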

+1
source

If your file is encoded in ANSI, or in UTF-8 without special characters (so its size equals the ANSI size, each char taking 1 byte), while a C# string "Represents text as a sequence of Unicode characters" (UTF-16, so each char takes 2 bytes in memory), then the character data alone doubles in size; the overhead of the string objects and the data structures accounts for the rest of the factor of 4.

+1
source

That 600 MB is what was allocated by the operation of loading the file into the dictionary... It suggests it is an expensive operation, and it would be useful for judging how effective any optimization is, but as a measure of how much memory the dictionary itself takes up it is pretty useless.

I would defer the splitting, as suggested above, for now.

It looks like you have prematurely optimized for speed, and it has cost you dearly in memory footprint.

0
source

Source: https://habr.com/ru/post/1380387/

