How can I access the reference values of a HashSet<TValue> without enumerating it?
I have a scenario in which saving memory is of the utmost importance. I am trying to read more than 1 GB of peptide sequences into memory and group together the peptide instances that share the same sequence. I store the Peptide objects in a HashSet so I can quickly check for duplicates, but I found that you cannot retrieve an object from the set, even when you know the set contains it.
Memory is really important, and I don't want to duplicate data if at all possible. (Otherwise I would design my data structure as peptides = Dictionary<string, Peptide>, but that would duplicate the string in both the dictionary and the Peptide class.) Below is the code showing what I would like to do:
    using System;
    using System.Collections.Generic;

    public class SomeClass
    {
        // Main storage of all the Peptide instances (class provided below)
        private HashSet<Peptide> peptides = new HashSet<Peptide>();

        public void SomeMethod(IEnumerable<string> files)
        {
            foreach (string file in files)
            {
                using (PeptideReader reader = new PeptideReader(file))
                {
                    foreach (DataLine line in reader.ReadNextLine())
                    {
                        Peptide testPep = new Peptide(line.Sequence);
                        if (peptides.Contains(testPep))
                        {
                            // ** Problem is here **
                            // I want to get the Peptide object that is in the HashSet
                            // so I can add the DataLine to it. I don't want to use the
                            // testPep object (even though they are considered "equal").
                            peptides[testPep].Add(line); // I know this doesn't work
                            testPep.Add(line); // THIS IS NO GOOD, since it won't be saved in the HashSet, which I use in other methods.
                        }
                        else
                        {
                            // The HashSet doesn't contain this peptide, so we can just add it
                            testPep.Add(line);
                            peptides.Add(testPep);
                        }
                    }
                }
            }
        }
    }

    public class Peptide : IEquatable<Peptide>
    {
        public string Sequence { get; private set; }
        private int hCode = 0;

        public PsmList PSMs { get; set; }

        public Peptide(string sequence)
        {
            // Isoleucine (I) and leucine (L) are treated as the same residue
            Sequence = sequence.Replace('I', 'L');
            hCode = Sequence.GetHashCode();
        }

        public void Add(DataLine data)
        {
            if (PSMs == null)
            {
                PSMs = new PsmList();
            }
            PSMs.Add(data);
        }

        public override int GetHashCode()
        {
            return hCode;
        }

        public bool Equals(Peptide other)
        {
            return other != null && Sequence.Equals(other.Sequence);
        }

        // Keep the object-based overload consistent with IEquatable<Peptide>
        public override bool Equals(object obj)
        {
            return Equals(obj as Peptide);
        }
    }

    public class PsmList : List<DataLine>
    {
        // and some other stuff that is not important
    }

Why does the HashSet not let me get a reference to the object it contains? I know people will say that if HashSet.Contains() returns true, the objects are equivalent. They may be equivalent in terms of values, but I need the references to be the same, since I store additional information in the Peptide class.
The only solution I have come up with is a Dictionary<Peptide, Peptide> in which the key and the value both point to the same reference. But that seems tacky. Is there another data structure for this?
Basically, you could reimplement HashSet<T> yourself, but that's about the only solution I'm aware of. The Dictionary<Peptide, Peptide> or Dictionary<string, Peptide> solution is probably not that inefficient, though: if you are only wasting a single reference per entry, I would imagine that is relatively insignificant.
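To make the Dictionary<Peptide, Peptide> idea concrete, here is a minimal sketch. The names PeptideStore and GetOrAdd are hypothetical, and Peptide is a simplified stand-in for the class in the question (the PSM list is reduced to a list of strings), so treat it as an illustration rather than a drop-in replacement.

```csharp
using System;
using System.Collections.Generic;

// Simplified stand-in for the question's Peptide class (assumption:
// PSMs are modelled as plain strings instead of DataLine objects).
public sealed class Peptide : IEquatable<Peptide>
{
    public string Sequence { get; }
    public List<string> Psms { get; } = new List<string>();

    public Peptide(string sequence)
    {
        Sequence = sequence.Replace('I', 'L');
    }

    public bool Equals(Peptide other)
    {
        return other != null && Sequence == other.Sequence;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as Peptide);
    }

    public override int GetHashCode()
    {
        return Sequence.GetHashCode();
    }
}

// Hypothetical wrapper: a dictionary whose key and value are the same object.
public sealed class PeptideStore
{
    private readonly Dictionary<Peptide, Peptide> peptides =
        new Dictionary<Peptide, Peptide>();

    public int Count
    {
        get { return peptides.Count; }
    }

    // Returns the instance already stored for this sequence,
    // or stores and returns a new one if none exists yet.
    public Peptide GetOrAdd(string sequence)
    {
        Peptide candidate = new Peptide(sequence);
        Peptide existing;
        if (peptides.TryGetValue(candidate, out existing))
        {
            return existing; // the candidate is discarded
        }
        peptides.Add(candidate, candidate);
        return candidate;
    }
}
```

With this, the loop body in the question would shrink to something like store.GetOrAdd(line.Sequence).Add(line), and the extra cost over a HashSet is the one additional reference per entry mentioned above.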
In fact, if you remove the hCode member from Peptide, that will save you 4 bytes per object, which is the same size as a reference on x86 anyway. As far as I can tell there is no point in caching the hash, since you only compute the hash of each object once, at least in the code you've shown.
If you are really desperate for memory, I suspect you can store the sequence considerably more efficiently than as a string. If you give us more information about what the sequence contains, we can make some suggestions there.
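As one possible direction, and purely as an assumption about the data: if the sequences consist only of one-letter ASCII residue codes, they could be held as one byte per residue instead of the two bytes per character that a .NET string uses. The CompactSequence type below is hypothetical and only sketches the idea; the FNV-1a hash is just one reasonable choice.

```csharp
using System;
using System.Text;

// Hypothetical compact sequence: one byte per residue.
// Assumption: every character in the input is plain ASCII.
public sealed class CompactSequence : IEquatable<CompactSequence>
{
    private readonly byte[] residues;

    public CompactSequence(string sequence)
    {
        residues = Encoding.ASCII.GetBytes(sequence);
    }

    public int Length
    {
        get { return residues.Length; }
    }

    public bool Equals(CompactSequence other)
    {
        if (other == null || other.residues.Length != residues.Length)
        {
            return false;
        }
        for (int i = 0; i < residues.Length; i++)
        {
            if (residues[i] != other.residues[i])
            {
                return false;
            }
        }
        return true;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as CompactSequence);
    }

    // FNV-1a over the bytes, computed on demand rather than cached.
    public override int GetHashCode()
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (byte b in residues)
            {
                hash = (hash ^ b) * 16777619;
            }
            return (int)hash;
        }
    }

    // Rebuilds the string form only when it is actually needed.
    public override string ToString()
    {
        return Encoding.ASCII.GetString(residues);
    }
}
```

Whether this pays off depends on the sequence lengths, since each byte array still carries its own object overhead. A tighter packing would be possible for a 20-letter alphabet, at the cost of more complicated code.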
I don't know of any particularly strong reason why HashSet doesn't permit this, other than that it is a relatively rare requirement, but it is something I have seen requested in Java as well.
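For readers on a runtime that has it (.NET Framework 4.7.2 and later, .NET Core 2.0 and later), HashSet<T> exposes a TryGetValue method that returns the stored instance equal to a probe value. A minimal sketch, assuming such a runtime; HashSetLookup and GetOrAdd are hypothetical names, and Peptide is again a simplified stand-in for the class in the question:

```csharp
using System;
using System.Collections.Generic;

// Simplified stand-in for the question's Peptide class.
public sealed class Peptide : IEquatable<Peptide>
{
    public string Sequence { get; }
    public List<string> Psms { get; } = new List<string>();

    public Peptide(string sequence)
    {
        Sequence = sequence.Replace('I', 'L');
    }

    public bool Equals(Peptide other)
    {
        return other != null && Sequence == other.Sequence;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as Peptide);
    }

    public override int GetHashCode()
    {
        return Sequence.GetHashCode();
    }
}

public static class HashSetLookup
{
    // Returns the instance stored in the set that equals the probe,
    // adding the probe itself when the set has no equal element.
    public static Peptide GetOrAdd(HashSet<Peptide> set, string sequence)
    {
        Peptide probe = new Peptide(sequence);
        Peptide stored;
        if (set.TryGetValue(probe, out stored))
        {
            return stored;
        }
        set.Add(probe);
        return probe;
    }
}
```

On older frameworks this method is not available, so the dictionary-based approaches remain the practical option there.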
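Use Dictionary<string, Peptide>. A string is a reference type, so the dictionary key and Peptide.Sequence can point to the same string instance; the sequence text is not stored twice, and the only extra cost is one reference per entry.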
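A minimal sketch of that suggestion follows. PeptideIndex, GetOrAdd and Entries are hypothetical names, and Peptide is a simplified stand-in for the class in the question that expects an already normalized sequence. The point it tries to show is that the key and the Sequence property share one string object.

```csharp
using System;
using System.Collections.Generic;

// Simplified stand-in for the question's Peptide class. Equality members
// are not needed here because the dictionary is keyed by the string.
public sealed class Peptide
{
    public string Sequence { get; }
    public List<string> Psms { get; } = new List<string>();

    public Peptide(string normalizedSequence)
    {
        Sequence = normalizedSequence;
    }
}

// Hypothetical index keyed by the normalized sequence.
public sealed class PeptideIndex
{
    private readonly Dictionary<string, Peptide> bySequence =
        new Dictionary<string, Peptide>();

    public IEnumerable<KeyValuePair<string, Peptide>> Entries
    {
        get { return bySequence; }
    }

    public Peptide GetOrAdd(string rawSequence)
    {
        // Same I -> L normalization as in the question's constructor.
        string key = rawSequence.Replace('I', 'L');
        Peptide peptide;
        if (!bySequence.TryGetValue(key, out peptide))
        {
            // The same string instance serves as key and as Sequence,
            // so the characters are held in memory only once.
            peptide = new Peptide(key);
            bySequence.Add(key, peptide);
        }
        return peptide;
    }
}
```

A side effect of this layout is that no temporary Peptide is allocated for lines whose sequence is already known; only the short-lived normalized string is created and then discarded.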