Reading a text file in Perl word by word instead of line by line

I have a large (300KB) text file containing words separated by spaces. Now I want to open this file and process each word in it one by one.

The problem is that Perl reads the file line by line, and since the file appears to have no line breaks it reads the whole thing at once, which gives me strange results. I know the usual way is to do something like:

    open($inFile, 'tagged.txt') or die $!;
    $_ = <$inFile>;
    @splitted = split(' ', $_);
    print $#splitted;

But that gives me the wrong number of words (is the array too large?).

Can I read the file word by word instead of line by line?

+6

4 answers

To read a file one word at a time, change the input record separator ($/) to a space:

 local $/ = ' '; 

Example:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature 'say';

    {
        local $/ = ' ';
        while (<DATA>) {
            say;
        }
    }

    __DATA__
    one two three four five

Output:

    one
    two
    three
    four
    five
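One caveat worth noting: with $/ = ' ', each token keeps its trailing separator, runs of spaces produce empty tokens, and any newline stays attached to a word. Here is a sketch of applying the same trick to the asker's tagged.txt (the file name comes from the question) with that cleanup added:

    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $in, '<', 'tagged.txt' or die $!;
    {
        local $/ = ' ';                # read space-separated chunks
        while (my $word = <$in>) {
            $word =~ s/\s+\z//;        # trim the separator and any newline
            next unless length $word;  # skip empty tokens from repeated spaces
            print "$word\n";           # process the word here
        }
    }
    close $in;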
+4

Instead of reading it in one fell swoop, try a line-by-line approach, which is easier on your computer's memory (although 300KB is not too large for modern computers):

    use strict;
    use warnings;

    my @words;
    open(my $inFile, '<', 'tagged.txt') or die $!;
    while (<$inFile>) {
        chomp;
        @words = split(' ');
        foreach my $word (@words) {
            # process
        }
    }
    close($inFile);
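As a sketch of how the "# process" step might be filled in, here is the same loop counting the words, which is the number the asker was trying to get:

    use strict;
    use warnings;

    my $count = 0;
    open(my $inFile, '<', 'tagged.txt') or die $!;
    while (<$inFile>) {
        # split ' ' in list context; assigning to () and reading that
        # assignment in scalar context yields the number of words
        $count += () = split ' ';
    }
    close($inFile);
    print "total words: $count\n";

Note that scalar(@words) gives the number of elements, whereas the asker's $#splitted is the last index, which is one less than the count.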
+5

It's not clear what your input file looks like, but you imply that it contains just a single line of many words.

300KB is far from a big text file. It is simplest to read it into memory in its entirety and pull the words out of it one at a time. This program demonstrates:

    use strict;
    use warnings;

    my $data = do {
        open my $fh, '<', 'data.txt' or die $!;
        local $/;
        <$fh>;
    };

    my $count = 0;
    while ($data =~ /(\S+)/g) {
        my $word = $1;
        ++$count;
        printf "%2d: %s\n", $count, $word;
    }

Output

     1: alpha
     2: beta
     3: gamma
     4: delta
     5: epsilon

Without an explanation of what you mean by an "erroneous number of words" it is hard to help further, but it is certain that the problem is not the size of your array: if there were a problem, Perl would raise an exception and die.

But if you are comparing the result against statistics from a word processor, the discrepancy is probably because the two definitions of a "word" differ. For example, a word processor might count a hyphenated word as two words.
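To make the difference concrete, here is a minimal sketch (the sample string is invented for illustration) using the same \S+ definition of a word as the program above:

    use strict;
    use warnings;

    # "well-known" is one \S+ token, but a word processor
    # may count it as two words
    my $text  = "a well-known example";
    my $count = () = $text =~ /\S+/g;
    print "$count\n";    # prints 3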

+2

300K doesn't seem big, so you can try:

    my $text = `cat t.txt` or die $!;
    my @words = split /\s+/, $text;
    foreach my $word (@words) {
        # process
    }

or a slightly modified version of squiguy's solution:

    use strict;
    use warnings;

    my @words;
    open(my $inFile, '<', 'tagged.txt') or die $!;
    while (<$inFile>) {
        push(@words, split /\s+/);
    }
    close($inFile);

    foreach my $word (@words) {
        # process
    }
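One portability note on the first snippet: the backticks run an external cat process, which assumes a Unix-like system. A pure-Perl slurp of the same t.txt, using the do-block idiom from the previous answer, avoids that:

    use strict;
    use warnings;

    my $text = do {
        open my $fh, '<', 't.txt' or die $!;
        local $/;    # slurp mode: read the whole file in one go
        <$fh>;
    };
    my @words = split /\s+/, $text;
    print scalar(@words), " words\n";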
+1

Source: https://habr.com/ru/post/953370/

