I have a list of data in the following form:
[(id_1, description, id_type), (id_2, description, id_type), ..., (id_n, description, id_type)]
The data is loaded from files that belong to the same group. Within each group the same identifier can appear multiple times, each occurrence coming from a different file. I don't care about the duplicates, so I thought a good way to store all of this would be to throw it into a Set type. But there's a problem.
Sometimes the descriptions for the same identifier vary slightly, as follows:
IPI00110753
- Alpha-1A tubulin chain
- Tubulin alpha-1 chain
- Alpha tubulin 1
- Alpha tubulin isotype M-alpha-1
(Note that this example is taken from the UniProt protein database.)
I don't care that the descriptions vary. However, I cannot throw them away, because there is a chance that the protein database I am using will not contain an entry for a particular identifier. If that happens, I want to be able to show the biologists a human-readable description so that they know roughly what protein they are looking at.
I am currently solving this problem with a dictionary type. However, I don't really like this solution because it uses a lot of memory (I have a lot of these identifiers). This is only an intermediate listing of them. The identifiers go through some additional processing before they are placed in the database, so I would like to keep my data structure small.
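A minimal sketch of a dictionary-based approach like the one described above. The post does not say which description is kept when several exist, so this assumes the first one seen wins. The `collect_records` name, the `"IPI"` value for `id_type`, and the second identifier are hypothetical and only there for illustration.

```python
def collect_records(records):
    """Keep one (description, id_type) pair per identifier.

    Assumption: the first description seen for an identifier is kept
    and later variants are ignored.
    """
    by_id = {}
    for identifier, description, id_type in records:
        # setdefault only stores the value if the key is not present yet.
        by_id.setdefault(identifier, (description, id_type))
    return by_id


# Sample data: the "IPI" id_type and the second identifier are made up.
records = [
    ("IPI00110753", "Alpha-1A tubulin chain", "IPI"),
    ("IPI00110753", "Tubulin alpha-1 chain", "IPI"),
    ("IPI00000001", "Hypothetical example protein", "IPI"),
]
by_id = collect_records(records)
```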
I have two questions. First, will I get a smaller memory footprint using the Set type (rather than the dictionary type) for this, or should I use a sorted list where I check on every insert whether the identifier already exists, or is there a third solution that I haven't thought of? Second, if the Set type is the better answer, how do I make it look only at the first element of the tuple instead of the whole thing?
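To make the second question concrete, here is one possible sketch of a set that only looks at the identifier: a small wrapper class whose equality and hash depend on the first field alone. The `Record` class and its attribute names are hypothetical, and the `"IPI"` value for `id_type` is made up.

```python
class Record:
    """Record that compares and hashes on its identifier only."""

    # __slots__ avoids a per-instance __dict__, which keeps instances small.
    __slots__ = ("identifier", "description", "id_type")

    def __init__(self, identifier, description, id_type):
        self.identifier = identifier
        self.description = description
        self.id_type = id_type

    def __eq__(self, other):
        if not isinstance(other, Record):
            return NotImplemented
        return self.identifier == other.identifier

    def __hash__(self):
        return hash(self.identifier)


seen = set()
seen.add(Record("IPI00110753", "Alpha-1A tubulin chain", "IPI"))
# Same identifier, different description: the set keeps the first record.
seen.add(Record("IPI00110753", "Tubulin alpha-1 chain", "IPI"))
```

A set is itself hash-based, so whether this actually uses less memory than a dictionary would need to be measured on real data.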
Thanks for reading my question,
Tim
Update
Based on some of the comments I received, let me clarify a bit. Most of what I do with the data structure is insert into it. I only read it twice: once to annotate it with additional information* and once to insert it into the database. However, further annotation may be added down the line, before the insert into the database. Unfortunately, I don't know yet whether that will happen.
Right now I am looking into storing this data in a structure that is not based on a hash table (i.e., not a dictionary). I would like the new structure to be fairly quick on insertion, but reading it can be linear, since I only really do that twice. I am trying to move away from the hash table to save space. Is there a better structure, or is a hash table about as good as it gets?
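For comparison, a sketch of the sorted-list alternative mentioned earlier, using the standard `bisect` module. The `SortedRecords` class and its method names are hypothetical, and the sample `id_type` value and second identifier are made up. It keeps the identifiers in one sorted list and the remaining fields in a parallel list, with no hash table involved.

```python
import bisect


class SortedRecords:
    """Records kept sorted by identifier; duplicates are ignored on insert."""

    def __init__(self):
        self._ids = []    # sorted identifiers
        self._rows = []   # (description, id_type), parallel to self._ids

    def add(self, identifier, description, id_type):
        """Insert the record unless the identifier is already present."""
        pos = bisect.bisect_left(self._ids, identifier)
        if pos < len(self._ids) and self._ids[pos] == identifier:
            return False  # duplicate identifier: keep the first description
        self._ids.insert(pos, identifier)
        self._rows.insert(pos, (description, id_type))
        return True

    def __iter__(self):
        # Linear read in identifier order.
        for identifier, (description, id_type) in zip(self._ids, self._rows):
            yield identifier, description, id_type


store = SortedRecords()
store.add("IPI00110753", "Alpha-1A tubulin chain", "IPI")
store.add("IPI00000001", "Hypothetical example protein", "IPI")
store.add("IPI00110753", "Tubulin alpha-1 chain", "IPI")  # ignored
```

The binary search is O(log n), but inserting into the middle of a Python list shifts the elements after it, so each insert is O(n) in the worst case. Whether that trade-off is acceptable depends on how many identifiers there are.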
* The additional information is a list of Swiss-Prot protein identifiers that I get by querying UniProt.