What is the best way to store set data in Python?

I have a list of data in the following form:

[(id_1, description, id_type), (id_2, description, id_type), ..., (id_n, description, id_type)]

The data is downloaded from files belonging to the same group. Each group can contain multiple copies of the same identifier, each coming from a different file. I don't care about duplicates, so I thought a good way to store all of this would be to throw it into a set. But there's a problem.

Sometimes the descriptions for the same identifier vary slightly, as follows:

IPI00110753

  • Alpha-1A tubulin chain
  • Tubulin alpha-1 chain
  • Alpha tubulin 1
  • Alpha tubulin isotype M-alpha-1

(Note that this example is taken from the UniProt protein database.)

I don't care that the descriptions are different. I cannot throw them away, because there is a chance that the protein database I use will not contain an entry for a particular identifier. If that happens, I want to be able to display a human-readable description to the biologists so that they have some idea of what they are looking at.

I am currently solving this problem with a dictionary. However, I don't really like this solution because it uses a lot of memory (and I have a lot of these identifiers). This is only an intermediate listing of them; the identifiers go through some additional processing before they are put into the database, so I would like to keep my data structure smaller.

I have two questions. First, will I get a smaller memory footprint using a set (rather than a dictionary) for this, or should I use a sorted list where I check on every insert whether the identifier already exists, or is there a third solution I haven't thought of? Second, if a set is the best answer, how can I make it look only at the first element of the tuple instead of the whole thing?

Thanks for reading my question,
Tim

Update

Based on some of the comments I received, let me clarify a bit. Most of what I do with this data structure is insert into it. I only read it twice: once to annotate it with additional information,* and once to insert it into the database. However, additional annotation steps may be added down the line before I insert into the database; unfortunately, I don't know at this time whether that will happen.

Right now I am looking to store this data in a structure that is not based on a hash table (i.e., not a dictionary). I would like the new structure to be fairly fast on insertion, but reading it can be linear, since I only do that twice. I am trying to move away from the hash table to save space. Is there a better structure, or is a hash table about as good as it gets?

* The information is a list of Swiss-Prot protein identifiers that I obtain by querying UniProt.
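(For reference only: the "sorted list, checking on every insert" option mentioned in the first question could be sketched with the standard bisect module, roughly as follows; the names ids, records and insert are mine, not part of the question.)

import bisect

ids = []       # identifiers, kept sorted; parallel to records
records = []   # (id, description, id_type) tuples, in the same order as ids

def insert(record):
    ident = record[0]
    i = bisect.bisect_left(ids, ident)
    if i < len(ids) and ids[i] == ident:
        return                     # identifier already stored; ignore the duplicate
    ids.insert(i, ident)
    records.insert(i, record)

Note that list.insert is itself O(n), so every insert pays a linear cost; this trades the hash table's memory overhead for slower insertion.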

+4
6 answers

Sets have no keys; the element is the key.

If you think you want keys, you have a mapping. More or less by definition.

Searching a list sequentially can be slow, even with binary search. Mappings use hashes and are fast.

Is this the kind of dictionary you're talking about?

{'id1': [('description1a', 'type1'),
         ('description1b', 'type1')],
 'id2': [('description2', 'type2')],
 ...
}

That seems minimal. The IDs appear only once.

Or perhaps you have something more like this?

{'id1': (('description1a', 'description1b'), 'type1'),
 'id2': (('description2',), 'type2'),
 ...
}

I'm not sure you can find anything more compact without resorting to the struct module.
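A rough sketch of how that second layout might be built up from the raw (id, description, id_type) tuples (the variable names here are mine, and a list is used for the descriptions while accumulating):

compact = {}
for ident, description, id_type in raw_tuples:   # raw_tuples: the downloaded list of tuples
    descriptions, _ = compact.setdefault(ident, ([], id_type))
    if description not in descriptions:
        descriptions.append(description)
# compact now maps each id to (list_of_descriptions, id_type)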

+1

I assume that the problem you are trying to solve by reducing memory usage is the limited address space of your process. You are also looking for a data structure that allows fast inserts and reasonable sequential reads.

Use as little structure as possible other than strings (str)

The question you are asking is how to structure your data within a single process so that it uses less memory. The one canonical answer to this (as long as you still need associative lookup) is to use as little structure as possible other than Python strings (str, not unicode). A Python hash (dictionary) stores references to your strings fairly efficiently (it is not a b-tree implementation).
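A minimal sketch of that advice (the function and variable names are mine, not the answerer's): keep one str key per identifier and pack everything else into a single str value.

records = {}   # maps an identifier string to one packed string of "description\ttype" lines

def add_record(ident, description, id_type):
    packed = description + '\t' + id_type           # one plain str per (description, type) pair
    existing = records.get(ident)
    if existing is None:
        records[ident] = packed
    elif packed not in existing.split('\n'):
        records[ident] = existing + '\n' + packed    # append only new description/type pairs

add_record('IPI00110753', 'Alpha-1A tubulin chain', 'protein')   # 'protein' is a made-up id_type
add_record('IPI00110753', 'Tubulin alpha-1 chain', 'protein')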

However, I don't think you will get very far with that approach, since what you are facing are huge data sets that may ultimately simply exceed the address space of the process and the physical memory of the machine you are working with altogether.

Alternative solution

I would suggest a different solution, one that does not involve changing your data structure into something that is harder to insert into or to interpret.

  • Split your data across several processes, each holding whatever data structure is convenient for it.
  • Implement inter-process communication using sockets, so that the processes can even live on different machines.
  • Try to partition your data so as to minimize inter-process communication (I/O is slow compared to CPU cycles).

The advantages of the approach I am suggesting are:

  • You get to use two or more cores of a machine for performance.
  • You are not limited by the address space of one process or even the physical memory of one computer.

There are many packages and approaches for distributed processing.
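As a minimal single-machine sketch of the partition-by-identifier idea, using the standard multiprocessing module instead of raw sockets (NUM_WORKERS, incoming_records and the worker's shard layout are all assumptions of mine):

from multiprocessing import Process, Queue

NUM_WORKERS = 4                                 # tune to the available cores/machines

def worker(inbox):
    shard = {}                                  # this worker's own {id: (descriptions, id_type)} shard
    for ident, description, id_type in iter(inbox.get, None):   # None is the shutdown sentinel
        descriptions, _ = shard.setdefault(ident, (set(), id_type))
        descriptions.add(description)
    # ... annotate this shard and insert it into the database here ...

if __name__ == '__main__':
    queues = [Queue() for _ in range(NUM_WORKERS)]
    workers = [Process(target=worker, args=(q,)) for q in queues]
    for w in workers:
        w.start()
    for record in incoming_records:             # incoming_records: the parsed (id, description, id_type) tuples
        queues[hash(record[0]) % NUM_WORKERS].put(record)   # partition by identifier to minimize cross-talk
    for q in queues:
        q.put(None)
    for w in workers:
        w.join()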

+1

If you are doing an n-way merge with duplicate removal, the following may be what you need.

This generator will merge any number of sources. Each source must be a sequence, and the key must be in position 0. It yields the merged sequence one item at a time.

def merge(*sources):
    keyPos = 0
    for s in sources:
        s.sort()
    while any([len(s) > 0 for s in sources]):
        topEnum = enumerate([s[0][keyPos] if len(s) > 0 else None for s in sources])
        top = [t for t in topEnum if t[1] is not None]
        top.sort(key=lambda a: a[1])
        src, key = top[0]
        # print src, key
        yield sources[src].pop(0)

This generator removes duplicates from the sequence.

def unique(sequence):
    keyPos = 0
    seqIter = iter(sequence)
    curr = seqIter.next()
    for next in seqIter:
        if next[keyPos] == curr[keyPos]:
            # might want to create a sub-list of matches
            continue
        yield curr
        curr = next
    yield curr

Here is a script which uses these functions to produce the resulting sequence: the union of all the sources, with duplicates removed.

for u in unique(merge(source1, source2, source3, ...)):
    print u

The complete set of data in each source sequence must exist in memory once, because we sort in memory. The resulting sequence, however, does not actually exist in memory; it is produced by consuming the other sequences.
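The comment inside unique hints at keeping the differing descriptions rather than discarding them; a possible variant (my naming, same Python 2 style as the answer), which yields (id, [descriptions...], type) tuples:

def unique_merged(sequence):
    keyPos = 0
    seqIter = iter(sequence)
    curr = list(seqIter.next())
    curr[1] = [curr[1]]                    # collect the descriptions in a list
    for item in seqIter:
        if item[keyPos] == curr[keyPos]:
            if item[1] not in curr[1]:
                curr[1].append(item[1])    # keep the alternative description
            continue
        yield tuple(curr)
        curr = list(item)
        curr[1] = [curr[1]]
    yield tuple(curr)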

+1

How about using a dictionary {id: (description, id_type)}? Or {(id, id_type): description}, if (id, id_type) together form the key.

0

Sets in Python are implemented using hash tables. In earlier versions they were actually implemented using dictionaries, but that has changed AFAIK. The only thing you would save by using a set is the size of one pointer per entry (the pointer to the value).

To use only part of the tuple for the hash code, you would have to subclass tuple and override the __hash__ method:

class ProteinTuple(tuple):
    def __new__(cls, m1, m2, m3):
        return tuple.__new__(cls, (m1, m2, m3))

    def __hash__(self):
        return hash(self[0])
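One caveat (my addition, not part of the original answer): overriding __hash__ alone is not enough for a set to treat two ProteinTuples with the same identifier as duplicates, because equality still compares the whole tuple. Equality would also have to be based on the first element, roughly like this:

class ProteinTuple(tuple):
    def __new__(cls, m1, m2, m3):
        return tuple.__new__(cls, (m1, m2, m3))

    def __hash__(self):
        return hash(self[0])

    def __eq__(self, other):
        return self[0] == other[0]         # compare identifiers only

    def __ne__(self, other):               # Python 2 does not derive __ne__ from __eq__
        return not self == other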

Keep in mind that you pay for an extra function call to __hash__ in that case, because otherwise it would be a C method.

I would go with Konstantin's suggestion, take the identifier out of the tuple, and see how much that helps.

0

It's still a bit murky, but it sounds like you have several lists of [(id, description, type), ...].

The ids are unique within a list and consistent between lists.

You want to create a UNION: a single list where each id occurs once, possibly with multiple descriptions.

For some reason, you think a mapping might be too big. Do you have any evidence of this? Don't over-optimize without actual measurements.

This may be (if I'm guessing correctly) the standard merge operation over multiple sources.

source1.sort()
source2.sort()
result = []
while len(source1) > 0 or len(source2) > 0:
    if len(source1) == 0:
        result.append(source2.pop(0))
    elif len(source2) == 0:
        result.append(source1.pop(0))
    elif source1[0][0] < source2[0][0]:
        result.append(source1.pop(0))
    elif source2[0][0] < source1[0][0]:
        result.append(source2.pop(0))
    else:
        # keys are equal
        result.append(source1.pop(0))
        # check for source2, to see if the description is different.

This merges the two lists by sorting and merging them. No mapping, no hashing.
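Following the comment in the else branch, here is one guess (mine, not the answerer's) at how the matching source2 item could also be consumed and a differing description kept:

def merge_pair(source1, source2):
    # source1 and source2 are lists of (id, description, type); both are sorted in place.
    source1.sort()
    source2.sort()
    result = []
    while source1 or source2:
        if not source1:
            result.append(source2.pop(0))
        elif not source2:
            result.append(source1.pop(0))
        elif source1[0][0] < source2[0][0]:
            result.append(source1.pop(0))
        elif source2[0][0] < source1[0][0]:
            result.append(source2.pop(0))
        else:                               # keys are equal
            kept = source1.pop(0)
            other = source2.pop(0)          # consume the duplicate as well
            if other[1] != kept[1]:
                # keep both descriptions, e.g. as (id, [desc1, desc2], type)
                kept = (kept[0], [kept[1], other[1]], kept[2])
            result.append(kept)
    return result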

0

Source: https://habr.com/ru/post/1277116/

