I'm currently working on a very large legacy application that processes a lot of string data collected from various sources (i.e. names, identifiers, common business codes, etc.). This data alone can take up to 200 MB of RAM while the application is running.
One of my colleagues suggested a possible strategy to reduce the memory usage (since many individual string values are duplicated across the data sets): "cache" the duplicate strings in a dictionary and reuse them wherever they are needed. For example...
public class StringCacher
{
    private readonly Dictionary<string, string> _stringCache;

    public StringCacher()
    {
        _stringCache = new Dictionary<string, string>();
    }

    public string AddOrReuse(string stringToCache)
    {
        // Add the string only if it hasn't been seen before,
        // then always hand back the instance held by the cache.
        if (!_stringCache.ContainsKey(stringToCache))
            _stringCache[stringToCache] = stringToCache;

        return _stringCache[stringToCache];
    }
}
Then use this cache along these lines (the loop over the actual data source is omitted)...
public IEnumerable<string> IncomingData()
{
    var stringCache = new StringCacher();
    var dataList = new List<string>();

    // ... loop over the incoming rows (source omitted here), pushing each
    //     string through stringCache.AddOrReuse before adding it to dataList ...

    return dataList;
}
Since strings are immutable, and the framework does a lot of work internally to make them behave somewhat like value types, I half suspect that this will simply put a copy of each string into the dictionary and double the amount of memory used, rather than handing back a reference to the string instance stored in the dictionary (which is what my colleague assumes it does).
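In other words, I'm not sure whether a quick check like the following (just a sketch I would use to test the idea, building on the StringCacher class above) would report that both variables point at the same instance or at two separate copies:

var cache = new StringCacher();

// Two equal strings built separately, so they start out as distinct instances.
var first  = new string(new[] { 'a', 'b', 'c' });
var second = new string(new[] { 'a', 'b', 'c' });

var cachedFirst  = cache.AddOrReuse(first);
var cachedSecond = cache.AddOrReuse(second);

// If the dictionary really hands back the stored reference, this prints True
// and only one copy of "abc" needs to stay alive.
Console.WriteLine(object.ReferenceEquals(cachedFirst, cachedSecond));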
So, given that this will be done on a massive set of string data ...
Will this actually save any memory, assuming that 30% of the string values occur two or more times? (My rough arithmetic is sketched below.)
And is the assumption that this will work the way my colleague describes actually correct?
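For scale, here is the kind of rough arithmetic I have in mind, under the reading that 30% of the stored string instances are redundant copies of a value already held elsewhere, and ignoring the dictionary's own overhead:

const double totalStringDataMb = 200;   // approximate footprint of the string data today
const double redundantShare    = 0.30;  // assumed share of instances that duplicate another value

// In this simplified model every redundant instance collapses onto the single
// cached copy, so the saving is just the redundant share of the total.
double estimatedSavingMb = totalStringDataMb * redundantShare;  // 60 MB

Console.WriteLine($"Estimated saving: ~{estimatedSavingMb} MB, before dictionary overhead");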