Search and add to a file of concatenated strings

I have a file containing concatenated strings.

find_or_add(string):

  • returns the offset of the string in the file (not necessarily the first occurrence)
  • appends as much of the string's tail to the file as is necessary for the file to contain the string (and then returns the string's offset in the file).

pseudo code:

 file.init() // file == "" file.find_or_add("cat") // file == "cat", returns 0 file.find_or_add("able") // file == "catable", returns 3 file.find_or_add("table") // file == "catable", returns 2 file.find_or_add("tables") // file == "catables", returns 2 file.find_or_add("spigot") // file == "catablespigot", returns 7 file.find_or_add("pig") // file == "catablespigot", returns 8 

What algorithm / data structure can be used to index ("summarize") this file in memory so that the necessary operations take no more than O(log N)?

Suppose the file is larger than RAM.

The language is not important, but I can read Pseudocode, C, Java, Python, Javascript and Haskell.

+4
3 answers

If your inserted strings are small, you can build a suffix tree or a suffix array (using a lazy implementation). Since the inserted strings are shorter than some bound k, you only need to build the tree to depth k, and the structure will occupy a bounded amount of memory.

edit: if you need to store the suffix identifiers (= integers), they will, unfortunately, not fit into memory if the text itself does not fit.

The suffix tree (or the suffix array, which is more compact) then represents all the substrings of your text, and the search becomes simple:

Is the string a substring, i.e. in the tree?

Yes → return its offset (stored in the leaves of the tree).

No → insert it and append the necessary tail to the source file.

I would like to go into more detail, but first I need to know the size of the patterns.

EDIT: note that an insert then takes only O(k) time!

EDIT2: If the patterns are not bounded in length, you may need to build the complete tree, which is O(N) in space and time; the problem is that you usually pay a factor of > 10 bytes per character. Regards, irW
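
A minimal Python sketch of the depth-limited suffix trie this answer describes, assuming every queried string has length at most K (the value of K, the node layout, and the find_or_add wiring are my assumptions, not the answerer's code):

    K = 8  # assumed upper bound on pattern length (not from the answer)

    class Node:
        __slots__ = ("children", "offset")

        def __init__(self, offset):
            self.children = {}    # char -> Node
            self.offset = offset  # start of one suffix whose path passes here

    class DepthLimitedSuffixTrie:
        def __init__(self):
            self.root = Node(-1)
            self.text = ""

        def find(self, s):
            # Walk down the trie: O(len(s)). Assumes len(s) <= K.
            node = self.root
            for ch in s:
                node = node.children.get(ch)
                if node is None:
                    return -1
            return node.offset  # some position where s occurs

        def append(self, tail):
            old = len(self.text)
            self.text += tail
            # (Re-)index every suffix whose first K characters changed.
            for start in range(max(0, old - K + 1), len(self.text)):
                node = self.root
                for ch in self.text[start:start + K]:
                    node = node.children.setdefault(ch, Node(start))

        def find_or_add(self, s):
            pos = self.find(s)
            if pos != -1:
                return pos
            # Longest suffix of the file that is a prefix of s
            # (naive scan here; the trie could answer this as well).
            overlap = 0
            for k in range(min(len(self.text), len(s)), 0, -1):
                if self.text.endswith(s[:k]):
                    overlap = k
                    break
            self.append(s[overlap:])
            return len(self.text) - len(s)

As the answer claims, find and insert cost O(k) per query; the real constraint is memory for the trie nodes and stored offsets, which is exactly what the answer's edits concede.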

+1

A suffix array or suffix tree is likely to cause memory problems: they are always larger than the text itself, even if you cut them off at a certain depth, since you need to store all the suffix identifiers in the structure.

Instead, you can create a set of files holding the suffix identifiers for particular prefixes. Say we store all suffixes beginning with a given prefix of length 2 in their own file, and keep each file sorted; each file then contains on average 1/26^2 of the suffix identifiers. So we have aa.txt, ab.txt, and so on, with the entries in each file kept sorted (a suffix array per prefix). Every time you want to perform a search, you load only the one small file for the query's prefix, which is already sorted, and binary-search it. The complexity is still O(N) (you must load a file whose size is a constant fraction of your text), but you can tune the prefix length for better performance. For example, with a 5 GB file and prefixes of length 2, you get a set of files of about 8 MB each; with prefix length 3, about 320 KB, etc.
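
A toy Python version of this scheme, with an in-memory dict standing in for the per-prefix files (PREFIX_LEN, the class name, and the binary search are my reading of the answer; patterns shorter than the prefix are omitted for brevity):

    PREFIX_LEN = 2  # the answer's suggested prefix length

    class BucketedSuffixIndex:
        # One sorted bucket per 2-char prefix; on disk each bucket
        # would be its own small file (aa.txt, ab.txt, ...).
        def __init__(self, text):
            self.text = text
            self.buckets = {}  # prefix -> suffix offsets, sorted by suffix
            for i in range(len(text)):
                self.buckets.setdefault(text[i:i + PREFIX_LEN], []).append(i)
            for offsets in self.buckets.values():
                offsets.sort(key=lambda i: self.text[i:])

        def find(self, s):
            # Load only the bucket for s's prefix, then binary-search it.
            # Assumes len(s) >= PREFIX_LEN; shorter patterns would need
            # to scan several buckets.
            bucket = self.buckets.get(s[:PREFIX_LEN], [])
            lo, hi = 0, len(bucket)
            while lo < hi:
                mid = (lo + hi) // 2
                if self.text[bucket[mid]:bucket[mid] + len(s)] < s:
                    lo = mid + 1
                else:
                    hi = mid
            if lo < len(bucket) and self.text[bucket[lo]:bucket[lo] + len(s)] == s:
                return bucket[lo]
            return -1

    assert BucketedSuffixIndex("catablespigot").find("pig") == 8

On disk, a lookup would read one file of roughly |text| / 26^PREFIX_LEN bytes, which is the tunable-constant O(N) behaviour the answer describes.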

+1

Perhaps this is not applicable, but this kind of technology and algorithm offers O(log N) search, fast inserts, and is heavily optimized for efficient I/O on large data sets. I could be wrong, but it seems like a good balance between insertion and search. What do you think?

0

Source: https://habr.com/ru/post/1492036/

