Faster file searches in Perl

My current algorithm uses a naive linear search to extract data from several data files by matching strings.

It looks something like this (pseudocode):

    while count < total number of files
        open current file
        extract line from this file
        build an arrayofStrings from this line
        foreach string in arrayofStrings
            foreach file in arrayofDataReferenceFiles
                search in these files
        close file
        increment count

For a big real-life job, the process can take about 6 hours.

Basically, I have a large set of strings that the program searches for in the same set of files (for example, 10 files in one run and maybe 3 in the next). Since the reference data files are subject to change, I don't think it is wise to build a permanent index of these files.

I'm pretty much a beginner and am not aware of any faster methods for unsorted data.

Since the searches become repetitive after a while, I was wondering: once the file array has been built (the files are known), is it possible to prebuild an index of the locations of specific lines in the data files without using any external Perl libraries? This script will be ported to a server that probably has only standard Perl installed.

I figure it would take 3-5 minutes to build some kind of index for the search before processing the job.

Is there a specific indexing / search concept that applies to my situation?

Thanks everyone!

+4
3 answers

It is hard to understand what exactly you are trying to achieve.

I assume that the dataset does not fit into RAM.

If you are trying to match each line in many files against a set of patterns, it is best to read each line once and match it against all the patterns while it is in memory before moving on. This avoids a separate I/O pass over the files for each pattern.
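A minimal sketch of that idea in core Perl, assuming the search strings are plain literals and fit in memory. The name `search_files_once` and the shape of the result are made up for illustration; they are not from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: scan each data file exactly once and test every
# line against all search strings, which are held in memory.
# Returns a hash ref: search string => list of [file, line number, line].
sub search_files_once {
    my ($strings, $files) = @_;

    # Precompile one literal-match regex per search string.
    my %regex_for = map { $_ => qr/\Q$_\E/ } @$strings;
    my %hits;

    for my $file (@$files) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        while (my $line = <$fh>) {
            chomp $line;
            # The line is already in memory, so trying every string
            # here costs CPU time but no additional disk reads.
            for my $string (@$strings) {
                push @{ $hits{$string} }, [ $file, $., $line ]
                    if $line =~ $regex_for{$string};
            }
        }
        close $fh;
    }
    return \%hits;
}
```

Compared with the pseudocode in the question, the file loop is now on the outside and the string loop on the inside, so each file is read once per run rather than once per string.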

On the other hand, if the matching itself is what takes the time, then you are probably better off using a library that can match multiple patterns simultaneously.
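If installing a library is not an option, one possible substitute (not what the answer literally suggests) is to join all the search strings into a single alternation and let Perl's own regex engine test them in one pass over each line. As far as I know, Perl 5.10 and later optimize alternations of literal strings internally, but treat that as something to benchmark rather than a guarantee. The helper names below are hypothetical, and the string list is assumed to be non-empty.

```perl
use strict;
use warnings;

# Hypothetical helper: build one regex that matches any of the strings.
sub build_matcher {
    my (@strings) = @_;
    # Longest first, so a longer string wins over its own prefix.
    my $alternation = join '|',
        map { quotemeta($_) }
        sort { length($b) <=> length($a) } @strings;
    return qr/($alternation)/;
}

# Return the distinct search strings found in a line, sorted.
# Note: matches that overlap an earlier match are not reported.
sub strings_in_line {
    my ($matcher, $line) = @_;
    my %seen;
    $seen{$1}++ while $line =~ /$matcher/g;
    return sort keys %seen;
}
```

Build the matcher once before the file loop and reuse it for every line; rebuilding it per line would throw away the benefit.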

+3

Perhaps you can replace this:

    foreach file in arrayofDataReferenceFiles
        search in these files

with a preprocessing step that builds a DBM file (i.e. an on-disk hash) as an inverted index mapping each word in your reference files to a list of the files that contain that word (or whatever you need). The Perl core includes DBM support:

dbmopen HASH, DBNAME, MASK

This binds a dbm(3), ndbm(3), sdbm(3), gdbm(3), or Berkeley DB file to a hash.

You would normally access this through tie, but that does not matter here; every Perl should support at least one on-disk hash library without needing any non-core packages installed.
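Here is a rough sketch of such an inverted index using tie with SDBM_File, which ships with Perl. The function names, the whitespace-based word splitting, and the comma-separated value format are all assumptions made for illustration. SDBM also limits the size of each key/value pair (roughly 1 KB), so a word that occurs in many files would need a different DBM backend or shorter file identifiers.

```perl
use strict;
use warnings;
use Fcntl;        # O_RDWR, O_CREAT, O_RDONLY
use SDBM_File;    # bundled with Perl

# Hypothetical helper: build an on-disk inverted index that maps each
# whitespace-separated word to a comma-separated list of files.
# Assumes a fresh $db_path and file names that contain no commas.
sub build_word_index {
    my ($db_path, @files) = @_;
    tie my %index, 'SDBM_File', $db_path, O_RDWR | O_CREAT, 0666
        or die "Cannot tie $db_path: $!";

    for my $file (@files) {
        my %seen;    # record each word only once per file
        open my $fh, '<', $file or die "Cannot open $file: $!";
        while (my $line = <$fh>) {
            for my $word (split ' ', $line) {
                next if $seen{$word}++;
                $index{$word} = defined $index{$word}
                    ? "$index{$word},$file"
                    : $file;
            }
        }
        close $fh;
    }
    untie %index;
    return;
}

# Look up the files containing a word without rescanning any data file.
sub files_containing {
    my ($db_path, $word) = @_;
    tie my %index, 'SDBM_File', $db_path, O_RDONLY, 0666
        or die "Cannot tie $db_path: $!";
    my $list = $index{$word};
    untie %index;
    return defined $list ? split(/,/, $list) : ();
}
```

With this in place, the inner "foreach file" loop only has to open the files that `files_containing` returns for the current string.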

+1

As MarkR noted, you want to read each line of each file no more than once. The pseudocode you posted appears to read each line of each file multiple times (once for each string you are searching for), which will slow things down significantly, especially for large searches. Swapping the order of the two inner loops should (judging by the posted pseudocode) fix this.

But you also said, "Since the referenced data files are subject to change, I don't think it is wise to create a permanent index of these files." This is most likely incorrect. If performance is a concern (and with a 6-hour run time I would say it probably is) and, on average, each file is read more than once between changes to that particular file, then building an index on disk (or even... using a database!) would be a very sensible thing to do. Disk space is very cheap these days; the time people spend waiting for results is not.

Even if files often go through several changes without being read, on-demand indexing (when you want to check a file, first see whether an index exists, and if not, build it before searching) would be an excellent approach. When a file is searched more than once, you benefit from the index; when it is not, building the index and then searching through it will be slower than a linear search by such a small margin that it hardly matters.
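A sketch of on-demand indexing with core modules only. The ".idx" suffix, the function names, and the word-to-byte-offset layout are assumptions for illustration. Staleness is judged by comparing modification times, which normally have one-second resolution, so a data file rewritten within the same second as its index would not be detected.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);    # core module

# Hypothetical helper: return an index for $file (word => byte offsets
# of the lines containing it), rebuilding it only when it is missing
# or older than the data file.
sub load_or_build_index {
    my ($file) = @_;
    my $index_file = "$file.idx";    # assumed naming convention

    # Reuse the stored index if it is at least as new as the data file.
    if (-e $index_file && (stat $index_file)[9] >= (stat $file)[9]) {
        return retrieve($index_file);
    }

    my %index;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $offset = tell $fh;           # start of the line about to be read
    while (my $line = <$fh>) {
        my %seen;
        for my $word (split ' ', $line) {
            push @{ $index{$word} }, $offset unless $seen{$word}++;
        }
        $offset = tell $fh;
    }
    close $fh;
    store(\%index, $index_file);
    return \%index;
}

# Fetch matching lines by seeking straight to the recorded offsets.
sub lines_containing {
    my ($file, $word) = @_;
    my $index = load_or_build_index($file);
    my @lines;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    for my $offset (@{ $index->{$word} || [] }) {
        seek $fh, $offset, 0;
        my $line = <$fh>;
        chomp $line;
        push @lines, $line;
    }
    close $fh;
    return @lines;
}
```

The first call for a file pays for one full scan plus writing the index; later calls only load the index and seek to the lines that matter, until the data file changes again.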

+1

Source: https://habr.com/ru/post/1383382/

