Binary search of an alphabetically sorted .txt file in C

I am working on a binary search algorithm in C that searches for a string in a .txt file, where each line contains one stock ticker. I am not very familiar with C and it has been a long time since I used it. I have a few questions:

1.) Once I have opened the file with fopen, does it make sense to run the search directly against the file, stepping through it with the file-scanning functions in the C standard library and comparing as I go, or should I copy each line into an array and run the search algorithm on the array?

2.) If I do compare directly against the file, what is the best way to move through it? Assuming I know the number of lines in the file, is there a way to jump straight to the middle line, read it, and do the comparison?

I apologize if this is too vague; I am not sure how best to explain it. Thank you for your time.

+4
5 answers

Unless your file is extremely large (> 2 GB), loading it into memory before searching is the way to go. If you cannot load the file into memory, you can store the offset of each line in an int[] or (if the file contains too many lines for that) create a second binary file and write the offset of each line into it as an integer...

Having everything in memory is certainly preferable.

+2

You cannot binary search the lines of a text file without knowing the length of each line in advance, so you will most likely want to read each line into memory first (unless the file is very large).

But if your goal is to search for a single given string as quickly as possible, you could simply do a linear search directly over the file. There is no point in getting O(log n) lookups at the cost of O(n) setup if the search is performed only once.

+2

Reading the whole thing in one bulk read and walking through it with pointers (into memory) is very fast. Avoid making multiple I/O calls if you can.

I should also mention that memory-mapped files are very well suited to something like this. See mmap() if you are on Unix. It is by far the best choice for really large files.

+1

This is a great question!

The trick with binary search is that its advantage comes from being able to skip half of the remaining elements at each step in O(1). That guarantees that, since you make only O(log n) probes, the runtime is O(log n). This is why, for example, you can binary search an array quickly but not a linked list: in a linked list, finding the midpoint of the elements takes linear time, which dominates the search time.

When binary searching a file, you are in a similar position. Since the lines of the file may not all have the same length, you cannot easily jump to the nth line of the file for a given n, and so implementing a good, fast binary search over a file will be a bit harder. One way or another, you will need to know where each line starts and ends so that you can jump around the file efficiently.

There are many ways to do this. First, you could load all the lines of the file into an array, as you suggested. This takes linear time, but once the array of strings is in memory, all future binary searches are very fast. The catch is that for a very large file the array can take up a great deal of memory and may be prohibitively expensive. An alternative is to store in the array not the actual bytes of each line but the offset in the file at which each line begins. This still lets you binary search quickly (you fseek to the appropriate offset whenever you need to make a comparison), and for large files it can be much more economical than the approach above. And if all the lines have the same length, you can simply pad each line to a fixed size, which lets you compute the starting position of any line directly.

If you are willing to spend some time on a more complex solution, you could preprocess the file so that, instead of containing just one entry per line, it begins with a list of fixed-width integers holding the offset of each line in the file. This essentially does the work described above, but then saves the result into the file so that future binary searches are much faster. I have some experience with this kind of file structure, and it can be quite fast.

If you are REALLY up for a challenge, you could alternatively store the lines of the file as a B-tree, which would give you extremely fast lookups for each line while minimizing the number of disk reads you need to do.

Hope this helps!

+1

I do not see how you could compare directly from the file. You must have a buffer to hold the data read from disk and compare against that buffer, so the question is moot: it is simply impossible.

You also cannot jump to a specific line in a file unless you know the byte offset of the beginning of that line relative to the beginning of the file.

I would recommend using mmap to map the file directly into memory and working with it as an array of characters. The operating system makes the file operations (seeking, reading, writing) transparent to you, and you simply work with the mapping as if it were an in-memory buffer. Note that mmap is limited to about 4 GB of address space on 32-bit systems. But if your file is larger than that, you should probably ask a different question: why is such a large file not in an indexed database?

0

Source: https://habr.com/ru/post/1338154/

