Parsing large, possibly compressed files in Python

I am trying to parse a large file, line by line, to extract relevant information. The file may arrive either uncompressed or gzipped (and I may have to edit the compressed file at a later stage).

I use the following code, but I suspect that because the loop runs outside the with statement, I am not actually parsing the file line by line; instead readlines loads the entire file content into memory.

 if ".gz" in FILE_LIST['INPUT_FILE']: with gzip.open(FILE_LIST['INPUT_FILE']) as input_file: file_content = input_file.readlines() else: with open(FILE_LIST['INPUT_FILE']) as input_file: file_content = input_file.readlines() for line in file_content: # do stuff 

Any suggestions on how I should handle this? I would prefer not to unzip the file outside the script, since the code will be shared and I would then have to clean up several files.

1 answer

readlines reads the entire file into memory, so it is not suitable for large files.

Opening two context blocks as you do, and then using the input_file handle outside of them, does not work either (you get an I/O operation on closed file error).
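As a minimal illustration of that failure mode (a sketch assuming a hypothetical plain-text file data.txt):

    with open("data.txt") as input_file:
        pass  # the handle is only usable inside this block

    # Iterating afterwards raises:
    # ValueError: I/O operation on closed file.
    for line in input_file:
        print(line)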

To get the best of both worlds, I would use a ternary conditional to pick the function for the context block (open or gzip.open), and then iterate through the lines inside that single block.

    open_function = gzip.open if ".gz" in FILE_LIST['INPUT_FILE'] else open

    with open_function(FILE_LIST['INPUT_FILE'], "rt") as input_file:
        for line in input_file:
            # do stuff

Notice that I added the "rt" mode to get text lines rather than bytes: gzip.open defaults to binary, and even "r" still means binary there, while the built-in open accepts "rt" as well.
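A quick sketch of the difference, assuming a hypothetical gzipped text file data.txt.gz:

    import gzip

    with gzip.open("data.txt.gz") as f:        # default mode is "rb"
        print(next(f))                         # a bytes line, e.g. b"first line\n"

    with gzip.open("data.txt.gz", "rt") as f:  # explicit text mode
        print(next(f))                         # a str line, e.g. "first line\n"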

Alternative: open_function can be made generic, so that it no longer depends on FILE_LIST['INPUT_FILE']:

    open_function = lambda f: gzip.open(f, "rt") if ".gz" in f else open(f)

Once defined, you can reuse it on any file:

    with open_function(FILE_LIST['INPUT_FILE']) as input_file:
        for line in input_file:
            # do stuff
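One caveat, not from the original answer: ".gz" in f matches the substring anywhere in the path (a directory such as logs.gz.d would also match), so checking the extension is stricter. A sketch of the same helper with that change:

    import gzip

    # Treat the file as gzipped only when its name actually ends in ".gz"
    open_function = lambda f: gzip.open(f, "rt") if f.endswith(".gz") else open(f)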

Source: https://habr.com/ru/post/1271057/

