Improving Python os.walk + regex algorithm performance

I use os.walk to select files from a specific folder that match a regular expression.

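    import os
    import re

    # basedir and regex are defined elsewhere
    for dirpath, dirs, files in os.walk(str(basedir)):
        # keep only the files whose full path matches the regex
        files[:] = [f for f in files if re.match(regex, os.path.join(dirpath, f))]
        print(dirpath, dirs, files)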

But this has to process every file and folder under the base directory, which is quite time consuming. I am looking for a way to use the same regular expression that I use for the files to filter out unwanted directories at each step of the walk, or a way to match only part of a regular expression.

For example, given a directory structure like

 /data/2013/07/19/file.dat 

using for example the following regular expression

 /data/(?P<year>2013)/(?P<month>07)/(?P<day>19)/(?P<filename>.*\.dat) 

I want to find all .dat files without having to search, for example, /data/2012.

2 answers

If, for example, you only want to process files in /data/2013/07/19, just start os.walk() at the /data/2013/07/19 directory. This is similar to Tommi Komulainen's suggestion, but you don't need to change the loop code.
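A minimal sketch of that idea, assuming the part of the tree you care about is known in advance. The helper name walk_from and its parameters are made up for illustration; the loop body is the same filter as in the question, only the starting directory changes.

```python
import os
import re

def walk_from(start_dir, pattern):
    """Yield full paths of files under start_dir whose full path matches pattern.

    start_dir and pattern are hypothetical parameter names: start_dir is the
    deepest directory you already know you need (e.g. /data/2013/07/19).
    """
    compiled = re.compile(pattern)
    for dirpath, dirs, files in os.walk(start_dir):
        for name in files:
            full_path = os.path.join(dirpath, name)
            # Same test as in the question: match the regex against the full path.
            if compiled.match(full_path):
                yield full_path
```

Called as walk_from("/data/2013/07/19", regex), it never visits /data/2012 at all, because the walk does not start above the directory you are interested in.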


I ran into this same problem (it is pretty clear what is being asked, even though no explicit question is stated). Since no one has answered, I think this can still be useful, even if it is quite late.

You need to split the source regex into segments so that you can filter the intermediate directories inside the loop. Filter the directories first, and only then match the files.

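    import os
    import re

    regex_parts = regex.split("/")
    del regex_parts[0]  # [0] is "" because the regex starts with "/"; it is not needed
    for dirpath, dirs, files in os.walk(root):
        # how deep dirpath is below root decides which regex part applies here
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if depth < len(regex_parts) - 1:
            # prune subdirectories that cannot lead to a match, then go deeper
            dirs[:] = [d for d in dirs if re.match(regex_parts[depth], d)]
            continue
        files[:] = [f for f in files if re.match(regex, os.path.join(dirpath, f))]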

Since you are matching files (the last part of the path), there is no reason to run the full match until you have filtered out as many directories as possible. The len check makes sure that the last part, which describes the file name, is never used to filter directories. It could be made more efficient, but it worked for me (I only ran into a similar problem today).
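For what it's worth, here is one way the same idea could be tightened up. This is only a sketch under my own assumptions, not a tested drop-in: the name walk_matching is made up, the pattern is taken relative to root (so no leading "/"), no part of the pattern may contain a literal "/", and files are expected exactly at the depth of the last part. It compiles each part once and uses fullmatch, so that "2013" does not also let "2013_old" through.

```python
import os
import re

def walk_matching(root, pattern):
    """Yield (path, groups) for files under root whose relative path matches.

    Hypothetical helper: pattern is a "/"-separated regex relative to root.
    """
    parts = [re.compile(p) for p in pattern.split("/")]
    dir_parts = parts[:-1]          # one compiled regex per directory level
    full = re.compile(pattern)      # the whole pattern, for the final file match
    for dirpath, dirs, files in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if depth < len(dir_parts):
            # Editing dirs in place tells os.walk which subdirectories to visit.
            dirs[:] = [d for d in dirs if dir_parts[depth].fullmatch(d)]
            continue
        dirs[:] = []                # assumption: nothing of interest lies deeper
        for name in files:
            path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(path, root).replace(os.sep, "/")
            m = full.fullmatch(rel_path)
            if m:
                yield path, m.groupdict()
```

With root set to /data and the pattern from the question minus the leading /data/, each result comes with the year, month, day and filename groups already extracted, and directories such as /data/2012 are dropped at the first level.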


Source: https://habr.com/ru/post/1492308/

