Python 3: best way to iterate over all the lines of a large file (1 million+ lines) in random order

Ok, so I have several text files, each of which contains more than 500,000 or even 1,000,000 lines.

I am currently doing something like this:

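import random

def line_function(line):
    # Do something with the given line
    pass

def random_itteration(filepath):
    # Read every line into memory, then close the file before processing
    with open(filepath) as f:
        lines = f.readlines()
    # Shuffle the list in place and visit each line once
    random.shuffle(lines)
    for line in lines:
        result = line_function(line)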

The Python docs on random.shuffle() clearly state (emphasis added by me):

Note that even for small len(x), the total number of permutations of x can quickly grow larger than the period of most random number generators. This implies that most permutations of a long sequence can never be generated. For example, a sequence of length 2080 is the largest that can fit within the period of the Mersenne Twister random number generator.
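The 2080 figure in the quoted note can be checked with a little arithmetic: a list of n items has n! orderings, and the Mersenne Twister period is 2**19937 - 1. The sketch below is only an illustration; the function name largest_fully_shufflable_length is made up for this example and is not part of the random module.

```python
# Period of the Mersenne Twister generator used by the random module.
MT_PERIOD = 2**19937 - 1

def largest_fully_shufflable_length(period):
    """Return the largest n such that n! <= period.

    Hypothetical helper for illustration: for longer sequences there are
    more permutations than generator states, so some can never appear.
    """
    n = 1
    perms = 1  # perms always holds n!
    while perms * (n + 1) <= period:
        n += 1
        perms *= n
    return n

print(largest_fully_shufflable_length(MT_PERIOD))  # 2080
```

For a file with a million lines the number of permutations is vastly larger than the period, so the note applies. Whether that matters in practice depends on what the shuffled order is used for.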

, :

?

:

, line_function() , , . , .

, , , . , .


Thanks!

+4
3

, , , . .

, - . , , . random.shuffle Mersenne Twister, .

, " ". random.shuffle .

+5

, .
( , / )
- :

import random

def line_function(line):
    # Do something with the given line
    pass

def random_itteration(filepath):
    with open(filepath) as f:
        lines = f.readlines()
    count = len(lines)
    # Sampling all of range(count) yields every valid index exactly once,
    # in random order. range(count + 1) could also return the index
    # `count`, which is out of range for `lines`.
    random_index_list = random.sample(range(count), count)
    for index in random_index_list:
        result = line_function(lines[index])

    # Alternative: shuffle the lines in place. random.shuffle returns None,
    # so its result must not be assigned.
    # random.shuffle(lines)
    # for line in lines:
    #     result = line_function(line)
0

" " Python, , , Fisher-Yates.
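Since Fisher-Yates is mentioned here, a minimal sketch of that shuffle may help. This is not the standard library's implementation; the function name fisher_yates_shuffle and the rng parameter are assumptions made for this example. By default it draws from random.SystemRandom, which uses the operating system's entropy source rather than the Mersenne Twister.

```python
import random

def fisher_yates_shuffle(items, rng=None):
    """Shuffle `items` in place with the Fisher-Yates algorithm.

    Hypothetical helper: `rng` is any object with a randrange() method.
    When omitted, random.SystemRandom() is used (an assumption of this
    sketch, chosen to avoid the Mersenne Twister period limit).
    """
    if rng is None:
        rng = random.SystemRandom()
    # Walk from the last position down to index 1, swapping each element
    # with a uniformly chosen element at or before it.
    for i in range(len(items) - 1, 0, -1):
        j = rng.randrange(i + 1)  # 0 <= j <= i
        items[i], items[j] = items[j], items[i]
    return items

lines = ["a\n", "b\n", "c\n", "d\n"]
fisher_yates_shuffle(lines)
```

The built-in equivalent would be random.SystemRandom().shuffle(lines), at the cost of slower random number generation and no reproducible seeding.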

, , . , , .

, , , lines = f.readlines() , , , , .
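The lines = f.readlines() call reads the whole file into memory. If that is a concern, one option is to record only the byte offset of each line, shuffle the offsets, and seek to each one in turn. The sketch below is an assumption-laden illustration and not something taken from the thread: the name iter_lines_in_random_order is made up, the file is assumed to be UTF-8, and random seeks are slower than a sequential read.

```python
import random

def iter_lines_in_random_order(filepath, rng=random):
    """Yield each line of `filepath` exactly once, in random order.

    Hypothetical helper: keeps one integer offset per line in memory
    instead of the line text. Assumes UTF-8 encoded text.
    """
    offsets = []
    # Binary mode so that tell() and seek() work on exact byte positions.
    with open(filepath, "rb") as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            offsets.append(pos)
        rng.shuffle(offsets)
        for pos in offsets:
            f.seek(pos)
            yield f.readline().decode("utf-8")
```

Usage would look like `for line in iter_lines_in_random_order(filepath): line_function(line)`. The shuffle itself still goes through whatever generator rng provides, so this changes memory use, not the randomness properties.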

, , , ( ).

-1
Source: https://habr.com/ru/post/1694195/

