How to work with generators from a file for tokenization, rather than materializing a list of lines?

Question

I have 2 files:

hyp.txt

    It is a guide to action which ensures that the military always obeys the commands of the party
    he read the book because he was interested in world history

ref.txt

    It is a guide to action that ensures that the military will forever heed Party commands
    he was interested in world history because he read the book

I also have a function that does some calculations to compare lines of text pairwise, e.g. line 1 of hyp.txt with line 1 of ref.txt.

    def scorer(list_of_tokenized_hyp, list_of_tokenized_ref):
        """
        :type list_of_tokenized_hyp: iter(iter(str))
        :type list_of_tokenized_ref: iter(iter(str))
        """
        for hypline, refline in zip(list_of_tokenized_hyp, list_of_tokenized_ref):
            # do something with the iter(str)
        return score

This function cannot be changed. However, I can control what I pass to it. So I am currently feeding the files into the function as follows:

    with open('hyp.txt', 'r') as hypfin, open('ref.txt', 'r') as reffin:
        # Both list comprehensions read and split every line up front.
        hyp = [line.split() for line in hypfin]
        ref = [line.split() for line in reffin]
        scorer(hyp, ref)

But in doing so, I load both files in full, along with all of the split lines, into memory before passing them to scorer().

Knowing that scorer() processes the files line by line, is there a way to avoid materializing the split lines before passing them to the function, without changing scorer()?

Is there a way to feed in some kind of generator instead?

I tried this:

    with open('hyp.txt', 'r') as hypfin, open('ref.txt', 'r') as reffin:
        hyp = (h.split() for h in hypfin)
        ref = (r.split() for r in reffin)
        scorer(hyp, ref)

but I'm not sure whether h.split() is materialized up front. If it is, why? If not, why not?

If I could change the scorer() function, I could simply add these lines inside the for loop:

    def scorer(list_of_tokenized_hyp, list_of_tokenized_ref):
        for hypline, refline in zip(list_of_tokenized_hyp, list_of_tokenized_ref):
            hypline = hypline.split()
            refline = refline.split()
            # do something with the iter(str)
        return score

But this is not possible in my case, since I cannot change this function.

python generator string list split

alvas Jan 03 '15 at 10:55

2 answers

Yes, your example (with the variable names corrected) defines two generators:

    with open('hyp.txt', 'r') as hypfin, open('ref.txt', 'r') as reffin:
        hyp = (h.split() for h in hypfin)
        ref = (r.split() for r in reffin)
        scorer(hyp, ref)

The split, together with the read of the next line, is performed once per iteration of the loop rather than up front.
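To see the difference between the list comprehension and the generator expression in isolation, here is a minimal sketch. It assumes an in-memory io.StringIO object as a stand-in for the open files, and logged_split() is a hypothetical helper that records each call; neither comes from the question.

```python
import io

calls = []  # every line handed to logged_split() is recorded here


def logged_split(line):
    """Hypothetical wrapper around str.split() that records each call."""
    calls.append(line)
    return line.split()


text = 'a b\nc d\ne f\n'  # stand-in for the contents of a three-line file

# List comprehension: every line is read and split immediately.
eager = [logged_split(line) for line in io.StringIO(text)]
eager_calls = len(calls)

calls.clear()

# Generator expression: nothing is read or split yet.
lazy = (logged_split(line) for line in io.StringIO(text))
lazy_calls_before = len(calls)

# Asking for one item reads and splits exactly one line.
first = next(lazy)
lazy_calls_after = len(calls)

print(eager_calls, lazy_calls_before, lazy_calls_after)
```

A real file object is iterated line by line in the same way, so the same reasoning should carry over to hypfin and reffin.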

Daniel Jan 03 '15 at 23:31

mhawke · Accepted Answer · 2016-01-04T01:30:32+0000

Your generator expressions, combined with Python 3's zip() (in Python 2, use itertools.izip() instead), behave as you need: they do not read the entire file or build all of the split lists at once.

You can get an idea of what is going on by substituting a version of str.split() that logs each call:

    def my_split(s):
        print('my_split(): {!r}'.format(s))
        return s.split()

    >>> hypfin = open('hyp.txt', 'r')
    >>> reffin = open('ref.txt', 'r')
    >>> hyp = (my_split(h) for h in hypfin)    # NB my_split() not called here
    >>> hyp
    <generator object <genexpr> at 0x7fa89ad16b40>
    >>> ref = (my_split(r) for r in reffin)    # NB my_split() not called here
    >>> ref
    <generator object <genexpr> at 0x7fa89ad16bd0>
    >>> z = zip(hyp, ref)    # NB my_split() not called here
    >>> z
    <zip object at 0x7fa89ad15cc8>
    >>> hypline, refline = next(z)
    my_split(): 'It is a guide to action which ensures that the military always obeys the commands of the party\n'
    my_split(): 'It is a guide to action that ensures that the military will forever heed Party commands\n'
    >>> hypline, refline = next(z)
    my_split(): 'he read the book because he was interested in world history\n'
    my_split(): 'he was interested in world history because he read the book\n'
    >>> hypline, refline = next(z)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    StopIteration

From the output of my_split() you can see that hyp and ref are indeed generators that do not consume any input until it is needed. z is a zip object that likewise does not consume any input until it is accessed. The for loop is simulated here with next() to show that only one line of input from each file is consumed per iteration.
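The same lazy behaviour can also be packaged as a generator function, which some may find easier to reuse than inline generator expressions. The sketch below is only an illustration under stated assumptions: tokenized_lines() and demo_scorer() are hypothetical names (demo_scorer() merely stands in for the real scorer(), which is not shown in the question), and io.StringIO replaces the real files.

```python
import io


def tokenized_lines(lines):
    """Yield one list of tokens per input line (hypothetical helper).

    `lines` can be any iterable of strings, such as an open file object.
    """
    for line in lines:
        yield line.split()


def demo_scorer(list_of_tokenized_hyp, list_of_tokenized_ref):
    """Stand-in for scorer(): counts the tokens shared by each line pair."""
    score = 0
    for hypline, refline in zip(list_of_tokenized_hyp, list_of_tokenized_ref):
        score += len(set(hypline) & set(refline))
    return score


# In-memory stand-ins for hyp.txt and ref.txt (assumed contents, two lines each).
hypfin = io.StringIO('he read the book\nhe was interested\n')
reffin = io.StringIO('he read a book\nshe was interested\n')

# Only one split line per input is alive at a time inside demo_scorer().
total = demo_scorer(tokenized_lines(hypfin), tokenized_lines(reffin))
print(total)
```

With real files, the two calls to tokenized_lines() would go inside the with open(...) block so that the files are still open while the scorer iterates.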
