I'm currently working on a very large legacy application that processes a lot of string data collected from various sources (i.e. names, identifiers, common business codes, etc.). This data alone can take up to 200 MB of RAM while the application is running.
One of my colleagues suggested a possible strategy to reduce the memory usage (since many individual string values are duplicated across the data sets): "cache" the duplicate strings in a dictionary and reuse them wherever they are needed. For example...
public class StringCacher
{
    private readonly Dictionary<string, string> _stringCache;

    public StringCacher()
    {
        _stringCache = new Dictionary<string, string>();
    }

    public string AddOrReuse(string stringToCache)
    {
        // Add the string only if it hasn't been seen before,
        // then always hand back the instance held by the cache.
        if (!_stringCache.ContainsKey(stringToCache))
            _stringCache[stringToCache] = stringToCache;

        return _stringCache[stringToCache];
    }
}
Then use this cache along these lines (the loop over the actual data source is omitted)...
public IEnumerable<string> IncomingData()
{
    var stringCache = new StringCacher();
    var dataList = new List<string>();

    // ... loop over the incoming rows (source omitted here), pushing each
    //     string through stringCache.AddOrReuse before adding it to dataList ...

    return dataList;
}
Since strings are immutable, and the framework does a lot of work internally to make them behave somewhat like value types, I half suspect that this will simply put a copy of each string into the dictionary and double the amount of memory used, rather than handing back a reference to the string instance stored in the dictionary (which is what my colleague assumes it does).
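In other words, I'm not sure whether a quick check like the following (just a sketch I would use to test the idea, building on the StringCacher class above) would report that both variables point at the same instance or at two separate copies:

var cache = new StringCacher();

// Two equal strings built separately, so they start out as distinct instances.
var first  = new string(new[] { 'a', 'b', 'c' });
var second = new string(new[] { 'a', 'b', 'c' });

var cachedFirst  = cache.AddOrReuse(first);
var cachedSecond = cache.AddOrReuse(second);

// If the dictionary really hands back the stored reference, this prints True
// and only one copy of "abc" needs to stay alive.
Console.WriteLine(object.ReferenceEquals(cachedFirst, cachedSecond));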
So, given that this will be done on a massive set of string data ...
Will this actually save any memory, assuming that 30% of the string values occur two or more times? (My rough arithmetic is sketched below.)
And is the assumption that this will work the way my colleague describes actually correct?
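For scale, here is the kind of rough arithmetic I have in mind, under the reading that 30% of the stored string instances are redundant copies of a value already held elsewhere, and ignoring the dictionary's own overhead:

const double totalStringDataMb = 200;   // approximate footprint of the string data today
const double redundantShare    = 0.30;  // assumed share of instances that duplicate another value

// In this simplified model every redundant instance collapses onto the single
// cached copy, so the saving is just the redundant share of the total.
double estimatedSavingMb = totalStringDataMb * redundantShare;  // 60 MB

Console.WriteLine($"Estimated saving: ~{estimatedSavingMb} MB, before dictionary overhead");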