Need help developing a more efficient search algorithm

I have a problem from the biology domain. Right now I have 4 VERY BIG files (each with 0.1 billion lines, i.e. 100 million), but the structure is pretty simple: each line of these files has only 2 fields, both of which identify a gene type.

My goal: develop an efficient algorithm that can do the following: find a cycle across the contents of these 4 files. The cycle is defined as:

field #1 of a line in file 1 == field #1 of a line in file 2, and
field #2 of that line in file 2 == field #1 of a line in file 3, and
field #2 of that line in file 3 == field #1 of a line in file 4, and
field #2 of that line in file 4 == field #2 of the original line in file 1
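For example (with made-up gene names): the lines (X, W) in file 1, (X, Y) in file 2, (Y, Z) in file 3 and (Z, W) in file 4 would form one such cycle.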

I can't come up with a decent way to solve this, so for now I have just written a dumb brute-force solution with 4 nested loops. I am thinking of sorting the files alphabetically, which might help a little, but it is also obvious that memory will not let me load everything at once. Can someone suggest a good way to solve this problem in both time and space? Thanks!!

+6
4 answers

First of all, note that you can sort a file without holding all of it in memory, and that most operating systems have a program that does this, usually just called "sort". You can usually get it to sort on a particular field within each line, but if not, you can rewrite each line so that it sorts the way you want.

Given this, you can join two files by sorting one on field #1 and the other on field #2. You can then produce one record for each match, combining all the fields, while keeping in memory only the chunk of each file in which the sort field has the same value. This lets you join the result with yet another file; four such joins should solve your problem.
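For concreteness, here is a minimal sketch of one such join in Python, assuming tab-separated fields and that both inputs have already been sorted on their join fields (for example with the OS sort program); the file names and separator are placeholders:

```python
import itertools

def join_sorted(path_a, key_a, path_b, key_b, sep="\t"):
    """Join two files that are already sorted on the given key columns.
    Only one group of equal-keyed lines per file is held in memory at a
    time, so this works on files much larger than RAM."""
    def groups(path, key):
        with open(path) as f:
            rows = (line.rstrip("\n").split(sep) for line in f)
            for k, grp in itertools.groupby(rows, key=lambda r: r[key]):
                yield k, list(grp)

    ga, gb = groups(path_a, key_a), groups(path_b, key_b)
    ka, rows_a = next(ga, (None, None))
    kb, rows_b = next(gb, (None, None))
    while rows_a is not None and rows_b is not None:
        if ka < kb:
            ka, rows_a = next(ga, (None, None))
        elif ka > kb:
            kb, rows_b = next(gb, (None, None))
        else:
            for ra in rows_a:            # matching key: emit the cross product
                for rb in rows_b:
                    yield ra + rb
            ka, rows_a = next(ga, (None, None))
            kb, rows_b = next(gb, (None, None))

# First join from the question: field #1 of file 1 against field #1 of file 2.
# for rec in join_sorted("file1.sorted", 0, "file2.sorted", 0):
#     ...
```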

Depending on your data, the time required to solve your problem may depend on the order in which you do the joins. One fairly naive way to exploit this is, at each step, to take a small random sample from each file and use it to estimate how many results each possible join would produce, then pick the join that yields the fewest results. One way to take a random sample of N items from a large file is to keep the first N lines, and then, when you have read m lines so far, read the next line and, with probability N / (m + 1), swap it in for one of the N lines you are holding; otherwise discard it. Continue until you have read the entire file.
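That sampling procedure is commonly known as reservoir sampling; a minimal sketch, with N and m as in the description above:

```python
import random

def reservoir_sample(path, n):
    """Uniform random sample of n lines from a file of unknown length
    (the procedure described above)."""
    sample = []
    with open(path) as f:
        for m, line in enumerate(f):
            if m < n:
                sample.append(line)          # keep the first n lines
            else:
                j = random.randrange(m + 1)  # swap in with probability n/(m+1)
                if j < n:
                    sample[j] = line
    return sample
```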

+1

Here is one algorithm:

  • Choose an appropriate lookup structure: if field #1 is an integer, use a bit field; if it is a string, use a dictionary (or set). Use one lookup structure per file, i.e. 4 in your case.
  • Initialization phase: for each file, parse it line by line and either set the corresponding bit in the bit field or add the field to the dictionary, in that file's lookup structure.
  • Once the lookup structures are initialized, check the condition from your question against them (see the sketch after this list).
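For string fields, a dictionary-based sketch of this idea might look like the following (the tab separator, file names, and the choice of mapping field #1 to the set of field #2 values are my assumptions, and it requires the three dictionaries to fit in memory):

```python
from collections import defaultdict

def load_map(path, sep="\t"):
    """Lookup structure for one file: field #1 -> set of field #2 values."""
    lookup = defaultdict(set)
    with open(path) as f:
        for line in f:
            a, b = line.rstrip("\n").split(sep)
            lookup[a].add(b)
    return lookup

def find_cycles(file1, file2, file3, file4, sep="\t"):
    d2, d3, d4 = load_map(file2, sep), load_map(file3, sep), load_map(file4, sep)
    with open(file1) as f:
        for line in f:
            a, b = line.rstrip("\n").split(sep)   # line (a, b) in file 1
            for x in d2.get(a, ()):               # (a, x) in file 2
                for y in d3.get(x, ()):           # (x, y) in file 3
                    if b in d4.get(y, set()):     # (y, b) in file 4
                        yield (a, x, y, b)
```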

The complexity of this depends on how the lookup structure is implemented: O(1) per lookup for a bit field, and O(lg n) for a set or dictionary, since those are usually implemented as balanced search trees. The total complexity is then O(n) or O(n lg n), versus the O(n^4) of the brute-force solution in the question.

You can get the code and solution for bit fields from here

HTH

0

Here is one approach:

We will use the notation Fxy, where x = field number and y = file number (so F11 is field #1 of file 1).

Sort each of the 4 files on its first field.

For each value of F11, find a match in file 2. This will be linear. Save these matches, with all four fields, in a new file. Now take the corresponding field of this new file and find all matches in file 3. Continue on to file 4 and then back to file 1.

This way, as you move to each new file, you deal with fewer and fewer lines. And since you have sorted the files, the matching is a linear scan that can be done while reading from disk.

Here the complexity is O(n log n) for the sorting and O(m log n) for the searching, assuming m < n.
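A sketch of one narrowing step; for brevity it filters with an in-memory key set instead of the sorted linear scan described above, and the driver lines are hypothetical:

```python
def matches_for_keys(path, keys, key_field=0, sep="\t"):
    """One narrowing step: stream a file from disk and keep only the lines
    whose key field appears in `keys`. Each step of the chain
    file1 -> file2 -> file3 -> file4 shrinks the candidate set further."""
    hits = []
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split(sep)
            if fields[key_field] in keys:
                hits.append(fields)
    return hits

# Hypothetical driver for the chain described above (intermediate results are
# kept in memory here; the answer writes them to a new file instead):
# step2 = matches_for_keys("file2", {f[0] for f in file1_rows})  # F11 == F12
# step3 = matches_for_keys("file3", {f[1] for f in step2})       # F22 == F13
# step4 = matches_for_keys("file4", {f[1] for f in step3})       # F23 == F14
```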

0

It is a little easier to explain if your file 1 is written the other way around (so that every second field points to a first field in the next file).

  • Start with file 1: copy it to a new file, writing each pair A,B as B,A,'REV'

  • Append the contents of file 2, writing each pair A,B as A,B,'FWD'

  • Sort the file

  • Process the file in chunks of lines that share the same initial value

    • Within each chunk, split the lines into REV and FWD groups
    • Take the Cartesian product of the REVs and FWDs (a nested loop)
    • For each pair, write out reverse(fwd) concatenated with rev, dropping the duplicated key
    • e.g. B,A,'REV' and B,C,'FWD' → C,B,A,'REV'
  • Append the next file to this new output file (adding 'FWD' to each line)

  • Repeat from the sorting step

Essentially, you build the chains in reverse order and use a file-based sort to merge matching sequences.
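A sketch of the chunk-processing step, assuming comma-separated lines and that the mixed REV/FWD file has already been sorted on its first field; the function and file handling are my own:

```python
import itertools

def extend_chains(sorted_path, out_path, sep=","):
    """One pass over a sorted file that mixes 'REV' chain lines and 'FWD'
    pair lines, producing the extended 'REV' chains for the next round."""
    with open(sorted_path) as fin, open(out_path, "w") as fout:
        rows = (line.rstrip("\n").split(sep) for line in fin)
        for key, group in itertools.groupby(rows, key=lambda r: r[0]):
            group = list(group)
            revs = [r for r in group if r[-1] == "REV"]
            fwds = [r for r in group if r[-1] == "FWD"]
            for rev in revs:            # Cartesian product of REVs and FWDs
                for fwd in fwds:
                    # rev = [B, A, ..., 'REV'], fwd = [B, C, 'FWD']
                    # prepend C, dropping the shared key B -> C,B,A,...,'REV'
                    fout.write(sep.join([fwd[1]] + rev) + "\n")
```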

Of course, it would be even easier to just load these files into a database and let it do the work...
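For the database route, a sketch using Python's built-in sqlite3 (table and column names are made up; the indexes on the join columns matter at this scale):

```python
import sqlite3

con = sqlite3.connect("genes.db")  # file-backed, so not limited by RAM
con.executescript("""
    CREATE TABLE IF NOT EXISTS f1(a TEXT, b TEXT);
    CREATE TABLE IF NOT EXISTS f2(a TEXT, b TEXT);
    CREATE TABLE IF NOT EXISTS f3(a TEXT, b TEXT);
    CREATE TABLE IF NOT EXISTS f4(a TEXT, b TEXT);
    CREATE INDEX IF NOT EXISTS i2 ON f2(a);
    CREATE INDEX IF NOT EXISTS i3 ON f3(a);
    CREATE INDEX IF NOT EXISTS i4 ON f4(a);
""")

# (load each file with executemany("INSERT INTO f1 VALUES (?, ?)", rows) -- not shown)

cycles = con.execute("""
    SELECT f1.a, f2.b, f3.b, f4.b
    FROM f1
    JOIN f2 ON f2.a = f1.a     -- field #1 of file 1 == field #1 of file 2
    JOIN f3 ON f3.a = f2.b     -- field #2 of file 2 == field #1 of file 3
    JOIN f4 ON f4.a = f3.b     -- field #2 of file 3 == field #1 of file 4
    WHERE f4.b = f1.b          -- field #2 of file 4 == field #2 of file 1
""")
for row in cycles:
    print(row)
```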

0

Source: https://habr.com/ru/post/896939/

