Reading a text file in Perl word by word instead of line by line

I have a large (300KB) text file containing words separated by spaces. Now I want to open this file and process each word in it one by one.

The problem is that Perl reads the file line by line, and since the file appears to have no line breaks it reads the whole thing at once, which gives me strange results. I know the usual way is to do something like:

    open($inFile, 'tagged.txt') or die $!;
    $_ = <$inFile>;
    @splitted = split(' ', $_);
    print $#splitted;

But that gives me the wrong number of words (is the array too large?).

Can I read the file word by word instead of line by line?

+6

4 answers

To read a file one word at a time, change the input record separator ($/) to a space:

 local $/ = ' '; 

Example:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature 'say';

    {
        local $/ = ' ';
        while (<DATA>) {
            say;
        }
    }

    __DATA__
    one two three four five

Output:

    one
    two
    three
    four
    five
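One caveat worth noting: with $/ = ' ', each token keeps its trailing separator, runs of spaces produce empty tokens, and any newline stays attached to a word. Here is a sketch of applying the same trick to the asker's tagged.txt (the file name comes from the question) with that cleanup added:

    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $in, '<', 'tagged.txt' or die $!;
    {
        local $/ = ' ';                # read space-separated chunks
        while (my $word = <$in>) {
            $word =~ s/\s+\z//;        # trim the separator and any newline
            next unless length $word;  # skip empty tokens from repeated spaces
            print "$word\n";           # process the word here
        }
    }
    close $in;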
+4

Instead of reading it in one fell swoop, try a line-by-line approach, which is easier on your computer's memory (although 300KB is not too large for modern computers):

    use strict;
    use warnings;

    my @words;
    open(my $inFile, '<', 'tagged.txt') or die $!;
    while (<$inFile>) {
        chomp;
        @words = split(' ');
        foreach my $word (@words) {
            # process
        }
    }
    close($inFile);
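As a sketch of how the "# process" step might be filled in, here is the same loop counting the words, which is the number the asker was trying to get:

    use strict;
    use warnings;

    my $count = 0;
    open(my $inFile, '<', 'tagged.txt') or die $!;
    while (<$inFile>) {
        # split ' ' in list context; assigning to () and reading that
        # assignment in scalar context yields the number of words
        $count += () = split ' ';
    }
    close($inFile);
    print "total words: $count\n";

Note that scalar(@words) gives the number of elements, whereas the asker's $#splitted is the last index, which is one less than the count.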
+5

It's not clear what your input file looks like, but you imply that it contains just a single line of many words.

300KB is far from a big text file. It is simplest to read it into memory in its entirety and pull the words out of it one at a time. This program demonstrates:

    use strict;
    use warnings;

    my $data = do {
        open my $fh, '<', 'data.txt' or die $!;
        local $/;
        <$fh>;
    };

    my $count = 0;
    while ($data =~ /(\S+)/g) {
        my $word = $1;
        ++$count;
        printf "%2d: %s\n", $count, $word;
    }

Output

     1: alpha
     2: beta
     3: gamma
     4: delta
     5: epsilon

Without an explanation of what you mean by an "erroneous number of words" it is hard to help further, but it is certain that the problem is not the size of your array: if there were a problem, Perl would raise an exception and die.

But if you are comparing the result against statistics from a word processor, the discrepancy is probably because the two definitions of a "word" differ. For example, a word processor might count a hyphenated word as two words.
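To make the difference concrete, here is a minimal sketch (the sample string is invented for illustration) using the same \S+ definition of a word as the program above:

    use strict;
    use warnings;

    # "well-known" is one \S+ token, but a word processor
    # may count it as two words
    my $text  = "a well-known example";
    my $count = () = $text =~ /\S+/g;
    print "$count\n";    # prints 3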

+2

300K doesn't seem big, so you can try:

    my $text = `cat t.txt` or die $!;
    my @words = split /\s+/, $text;
    foreach my $word (@words) {
        # process
    }

or a slightly modified version of squiguy's solution:

    use strict;
    use warnings;

    my @words;
    open(my $inFile, '<', 'tagged.txt') or die $!;
    while (<$inFile>) {
        push(@words, split /\s+/);
    }
    close($inFile);

    foreach my $word (@words) {
        # process
    }
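One portability note on the first snippet: the backticks run an external cat process, which assumes a Unix-like system. A pure-Perl slurp of the same t.txt, using the do-block idiom from the previous answer, avoids that:

    use strict;
    use warnings;

    my $text = do {
        open my $fh, '<', 't.txt' or die $!;
        local $/;    # slurp mode: read the whole file in one go
        <$fh>;
    };
    my @words = split /\s+/, $text;
    print scalar(@words), " words\n";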
+1

Source: https://habr.com/ru/post/953370/

