Monte Carlo
Alternatively, reading line by line line by line *
(* use the David Robinson method to read the gzip file as a standard file):
If all lines are approximately the same size, you can go to an arbitrary position in the file, return a character by character until you reach a new line and read the full line from this point. If the lines are exactly the same size, this method is accurate.
If, however, the lines do not have the same size, but you know the distribution of length with length x - you can make the method as described above, but reject overabundant x with probability P(x) so that the probability of capturing a random line in the file is constant.
Example:
To make this simple, let's say you have a 5-line file with a length of X={2,3,5,5,5} . When choosing a random point in a file, you have a 10% chance (2 / (+ + + 5 + 5 + 5 + 5)) to get x1 , 15% to get x2 , 50% chance to x3 . What you need is a 20%/20%/60% chance. The corresponding weights are W=(3/2, 1, 6/5) , these are numbers such that x1*w1 = 20% , x2*w2 = 20% , x3*w3=60% . The normalizing factor is the sum of these weights Z = w1+w2+w3 = 37/10 . From here we know the probability of each of the lines:
P(w1) = w1/Z = 30/68 P(w2) = w2/Z = 20/68 P(w3) = w3/Z = 18/68
Note that P(w1)+P(w2)+3*P(w3)=1 , as it should be.
For your algorithm, select a random point in the file. If the corresponding line has a length of 2, select a random number between q=[0,1] . If q>(30/68) reject this place and try again. If it stops less and returns this string.
When do you know X(w) ?
I agree that knowing the exact distribution of line lengths may seem restrictive, but there are many files processed by the procedure (log files, reading hardware data, etc.) where the distribution is known for sure. In addition, if the distribution is only known approximately, we can use the method above to determine the rejection criteria of the sample as the best assumption and move on from there.
Monte Carlo?
This may not be the best method (who can compete with Knut?), But it may offer some insight to solve the problem in a completely different way. For those who are unfamiliar, the method above is a form of sampling importance, a method
This has a conclusion for running the sample as:
Position 8, char: ['c'] Position 23, char: ['e'] Position 15, char: ['d'] Position 10, char: ['c'] Position 4, char: ['b'] Position 16, char: ['d'] Position 2, char: ['\n']