How to get random string from gzip compressed file in python without reading it in memory

Question

How to get random string from gzip compressed file in python without reading it in memory

Let's say I have a text file with a size of 531 gzipped with exactly 512 548 457 601 475 lines separated by "\ n", and wanted to get an arbitrary line from it without files. (Don’t worry, it’s not so much, I just wanted to say that it is a huge file, and I know how many lines it has.)

As I usually do this with a smaller compressed file:

import fileinput import gzip import random list = [] for line in fileinput.input(file, openhook=gzip.open): list.append(line) listLength = len(list) randomListLineOne = line[random.randint(0, listLength)] randomListLineTwo = line[random.randint(0, listLength)] ...

What I found on the topic:

How to read a random line from a single file in python?

 import random def random_line(afile): line = next(afile) for num, aline in enumerate(afile): if random.randrange(num + 2): continue line = aline return line

Waterman "Reservoir Algorithm", translated by Alex Martelli from Knut, "The Art of Computer Programming"

Could you adapt this for compressed files? I tried to configure my compressed file as a file, but that did not work. Or is there another (simpler) way to achieve this?

+4

python memory random gzip line

James Rather Feb 16 '12 at 18:53

source share

3 answers

You can just use "read a random line from a single file in Python", but open the file as a gzip file rather than a regular file using gzip .

 import gzip import random def random_line(afile): line = next(afile) for num, aline in enumerate(afile): if random.randrange(num + 2): continue line = aline return line afile = gzip.open("myfile.zip") print random_line(afile) afile.close()

+2

David robinson Feb 16 '12 at 19:05

source share

Forgive the (very) late answer, but you can use the seek() method to place in the file if you know the file size from gunzip -l .
Then discard the next read, as it is likely to be a partial line and will use the subsequent read as your random data.

Print 10 random lines from a gzipped-enabled text file.

 import random import gzip, os f = gzip.open("some.txt.gz","r") unc_size = os.popen('gunzip -lq some.txt.gz').read() unc_size = unc_size.strip().split(" ",1) unc_size = unc_size[1].strip().split(" ",1) for x in range(1,11): f.seek(random.randint(0,int(unc_size[0]))) dump = next(f) print "Random line from byte pos ",f.tell(), next(f) f.close()

0

Rolf of saxony Feb 06 '16 at 16:31

source share

Hooked · Accepted Answer · 2012-02-16T22:55:21+0000

Monte Carlo

Alternatively, reading line by line line by line *

(* use the David Robinson method to read the gzip file as a standard file):

If all lines are approximately the same size, you can go to an arbitrary position in the file, return a character by character until you reach a new line and read the full line from this point. If the lines are exactly the same size, this method is accurate.

If, however, the lines do not have the same size, but you know the distribution of length with length x - you can make the method as described above, but reject overabundant x with probability P(x) so that the probability of capturing a random line in the file is constant.

Example:

To make this simple, let's say you have a 5-line file with a length of X={2,3,5,5,5} . When choosing a random point in a file, you have a 10% chance (2 / (+ + + 5 + 5 + 5 + 5)) to get x1 , 15% to get x2 , 50% chance to x3 . What you need is a 20%/20%/60% chance. The corresponding weights are W=(3/2, 1, 6/5) , these are numbers such that x1*w1 = 20% , x2*w2 = 20% , x3*w3=60% . The normalizing factor is the sum of these weights Z = w1+w2+w3 = 37/10 . From here we know the probability of each of the lines:

  P(w1) = w1/Z = 30/68 P(w2) = w2/Z = 20/68 P(w3) = w3/Z = 18/68

Note that P(w1)+P(w2)+3*P(w3)=1 , as it should be.

For your algorithm, select a random point in the file. If the corresponding line has a length of 2, select a random number between q=[0,1] . If q>(30/68) reject this place and try again. If it stops less and returns this string.

When do you know X(w) ?

I agree that knowing the exact distribution of line lengths may seem restrictive, but there are many files processed by the procedure (log files, reading hardware data, etc.) where the distribution is known for sure. In addition, if the distribution is only known approximately, we can use the method above to determine the rejection criteria of the sample as the best assumption and move on from there.

Monte Carlo?

This may not be the best method (who can compete with Knut?), But it may offer some insight to solve the problem in a completely different way. For those who are unfamiliar, the method above is a form of sampling importance, a method

This has a conclusion for running the sample as:

 Position 8, char: ['c'] Position 23, char: ['e'] Position 15, char: ['d'] Position 10, char: ['c'] Position 4, char: ['b'] Position 16, char: ['d'] Position 2, char: ['\n']

How to get random string from gzip compressed file in python without reading it in memory

Monte Carlo

More articles: