Efficient data structure for linking data to file system paths?

I need to store some data in memory of as many files and directories as possible (usually up to several hundred thousand). The obvious approach is to use Dictionary<string, Something> with the outline as the key, but there are two problems with this:

  • many of the files will have most of the common path, so it’s probably a waste of memory to store the full path to each file.
  • I need to be able to quickly access data about all descendants of the directory; with a dictionary, the only way to do this is to check each key and see if it starts from the specified path, which is pretty inefficient.

This problem seems to be a good candidate for using the prefix tree (or trie ), and the path segments are like “characters”. I tried to implement it, and the performance is not so bad for searching by prefix (about 4 times faster than the dictionary), but he has two problems:

  • memory consumption does not decrease, probably due to the overhead of the list of children for each node
  • build time is much worse than with the dictionary (about 4 times slower to fill the collection).

I'm sure this should be a very common problem, so maybe there are well-known solutions that I don't know about?

+4
source share
2 answers

A few general ideas:

Firstly, Patricia trie is probably the best known approach for improving memory consumption by trying - it combines paths in which all nodes have one child in one node and combine characters along the path. There is also a version in which you view data as a sequence of binary digits, which has the advantage that you always have no more than 2 child nodes, and is also easier to implement.

Secondly, memory consumption really depends on how you store the children of this node - do you support an array of 256 nodes? This is usually the most efficient direct search method, but also consumes most of the memory and slow if you need to sort through all the children. Other options:

  • Saving an array of pairs (letter, child node) is perhaps the most efficient memory, since it only stores those objects that you are really interested in, and also has good performance for iterating through all the children. However, you should check all pairs for a direct search - which is usually different from the root, but can be a problem near the root.

  • Keep some kind of dictionary inside each node that maps the letter to the child element of the node. This is the most balanced in terms of performance - it gives you good enough speeds for all operations and has some efficiency in working with memory.

In addition, if you build the entire collection forward and then simply query it, there is an approach for storing child links based on Tarjan tables , which will probably increase the build time, but later save the memory and query time.

+2
source

How about a prefix tree similar to an approach. If you want to save

 /root/x /root/a/b /root/a/c /root/a/d /root/a/e /root/a/c/e /root/a/c/f Here is how your tree will look like. root / \ x __ a __ / / \ \ bcde / \ ef 

This will be effective since each directory name will be stored only once. Also the search as well as the insert will be O (log (n))

-1
source

Source: https://habr.com/ru/post/1444079/


All Articles