How do I use generators over a file for tokenization instead of materializing a list of lines?

I have 2 files:

hyp.txt

    It is a guide to action which ensures that the military always obeys the commands of the party
    he read the book because he was interested in world history

ref.txt

    It is a guide to action that ensures that the military will forever heed Party commands
    he was interested in world history because he read the book

I also have a function that does some calculations to compare lines of text, for example line 1 of hyp.txt with line 1 of ref.txt:

    def scorer(list_of_tokenized_hyp, list_of_tokenized_ref):
        """
        :type list_of_tokenized_hyp: iter(iter(str))
        :type list_of_tokenized_ref: iter(iter(str))
        """
        for hypline, refline in zip(list_of_tokenized_hyp, list_of_tokenized_ref):
            # do something with the iter(str)
        return score

This function cannot be changed. However, I can control what I pass to it, so I am currently feeding the files into it as follows:

    with open('hyp.txt', 'r') as hypfin, open('ref.txt', 'r') as reffin:
        hyp = [line.split() for line in hypfin]
        ref = [line.split() for line in reffin]
        scorer(hyp, ref)  # pass the tokenized lists, not the file objects

But in doing so, I load the entire file and all of the split lines into memory before passing them to scorer().

Knowing that scorer() processes the files line by line, is there a way to avoid materializing the split lines before passing them to the function, without changing scorer()?

Is there a way to feed in some kind of generator instead?

I tried this:

    with open('hyp.txt', 'r') as hypfin, open('ref.txt', 'r') as reffin:
        hyp = (h.split() for h in hypfin)
        ref = (r.split() for r in reffin)
        scorer(hyp, ref)

but I'm not sure whether the results of h.split() are materialized. If they are, why? If not, why not?

If I could change the scorer() function, I could simply add these lines inside the for loop:

    def scorer(list_of_tokenized_hyp, list_of_tokenized_ref):
        for hypline, refline in zip(list_of_tokenized_hyp, list_of_tokenized_ref):
            hypline = hypline.split()
            refline = refline.split()
            # do something with the iter(str)
        return score

But this is not possible in my case, since I cannot change this function.

2 answers

Your generator expressions, combined with Python 3's zip() (use itertools.izip() instead in Python 2), behave as you need: they do not read the entire file and build all the split lists at once.

You can see what is going on by substituting a logging version of str.split():

    def my_split(s):
        print('my_split(): {!r}'.format(s))
        return s.split()

    >>> hypfin = open('hyp.txt', 'r')
    >>> reffin = open('ref.txt', 'r')
    >>> hyp = (my_split(h) for h in hypfin)  # NB my_split() not called here
    >>> hyp
    <generator object <genexpr> at 0x7fa89ad16b40>
    >>> ref = (my_split(r) for r in reffin)  # NB my_split() not called here
    >>> ref
    <generator object <genexpr> at 0x7fa89ad16bd0>
    >>> z = zip(hyp, ref)  # NB my_split() not called here
    >>> z
    <zip object at 0x7fa89ad15cc8>
    >>> hypline, refline = next(z)
    my_split(): 'It is a guide to action which ensures that the military always obeys the commands of the party\n'
    my_split(): 'It is a guide to action that ensures that the military will forever heed Party commands\n'
    >>> hypline, refline = next(z)
    my_split(): 'he read the book because he was interested in world history\n'
    my_split(): 'he was interested in world history because he read the book\n'
    >>> hypline, refline = next(z)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    StopIteration

From the output of my_split() you can see that hyp and ref are indeed generators that do not consume any input until it is needed. z is a zip object that likewise consumes no input until it is iterated. Here next() simulates the for loop, showing that only one line from each file is consumed per iteration.
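The same behaviour can be checked without touching the real files. The following is a minimal sketch, assuming in-memory io.StringIO objects in place of hyp.txt and ref.txt, and a hypothetical toy_scorer() standing in for the real scorer() (whose body is not shown in the question). It records the order of events to show that each pair of lines is split only when the loop asks for it.

```python
import io

events = []  # records the order in which things happen


def logging_split(line):
    """Like str.split(), but records each call."""
    events.append(('split', line.strip()))
    return line.split()


def toy_scorer(list_of_tokenized_hyp, list_of_tokenized_ref):
    """Hypothetical stand-in for scorer(): counts tokens shared by each pair."""
    score = 0
    for hypline, refline in zip(list_of_tokenized_hyp, list_of_tokenized_ref):
        events.append(('score', len(hypline), len(refline)))
        score += len(set(hypline) & set(refline))
    return score


# In-memory stand-ins for the two open files (an assumption for this demo).
hypfin = io.StringIO("a b c\nd e\n")
reffin = io.StringIO("a b x\nd y\n")

hyp = (logging_split(h) for h in hypfin)
ref = (logging_split(r) for r in reffin)
assert events == []  # creating the generators reads and splits nothing

score = toy_scorer(hyp, ref)

for event in events:
    print(event)
```

The recorded events alternate between a pair of splits and one scoring step, so at no point are all the split lines held in memory together.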


Yes, your example defines two generators:

    with open('hyp.txt', 'r') as hypfin, open('ref.txt', 'r') as reffin:
        hyp = (h.split() for h in hypfin)
        ref = (r.split() for r in reffin)
        scorer(hyp, ref)

The split, together with the corresponding read of the next line, is performed on each iteration of the loop.
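One caveat that follows from this laziness: scorer() has to be called while the files are still open, that is, inside the with block. A minimal sketch, using a throwaway temporary file and the hypothetical name tokens purely for illustration, shows what happens when a generator is consumed after its file has been closed:

```python
import os
import tempfile

# Create a small throwaway file just for this demonstration.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("he read the book\nbecause he was interested\n")
    path = tmp.name

try:
    with open(path, 'r') as fin:
        tokens = (line.split() for line in fin)
        first = next(tokens)  # fine: the file is still open

    # The with block has closed the file, so the generator cannot read on.
    try:
        next(tokens)
        error = None
    except ValueError as exc:  # "I/O operation on closed file"
        error = exc
finally:
    os.remove(path)

print(first)
print(type(error).__name__)
```

Keeping the scorer() call indented under the with statement, as in the snippet above, avoids this problem.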


Source: https://habr.com/ru/post/1239748/

