The outermost pair of parentheses around your entire pattern captures into $1, which is clearly not intended. In addition, the greedy .*\/ matches everything up to the last / in the line. Similarly, .*\s+ runs up to the very last stretch of whitespace.
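As a small illustration of the greedy behavior described above, here is a minimal sketch. The sample line and the variable names are made up for the demonstration and are not taken from the question.

```perl
use strict;
use warnings;

# Hypothetical sample line in word/tag format (an assumption)
my $line = "the/DT cat/NN sat/VBD";

# Greedy .* first grabs the whole line, then backs off only as far
# as needed, so the slash that matches is the LAST one in the line
my ($up_to_slash) = $line =~ m{ (.*) / }x;

# Same effect with whitespace: the match ends at the LAST space
my ($up_to_space) = $line =~ m{ (.*) \s+ }x;

print "before last slash: '$up_to_slash'\n";   # 'the/DT cat/NN sat'
print "before last space: '$up_to_space'\n";   # 'the/DT cat/NN'
```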
One way to fix this is to use a negated character class
my ($word, $tag) = m{ ([^/\s]+) / ([^/\s]+) }x;
The pattern [^/\s]+ matches a run of one or more consecutive characters, each of which is anything except / or whitespace. Thus you get one "word" before the / and one after it. If you instead took "everything after the slash", as the text says, it would be unclear what should come before the next slash.
Your code may then look like
    while (my $line = <$fh>) {
        while ( $line =~ m{ ([^/\s]+) / ([^/\s]+) }gx ) {
            $tagHash{$2}{$1}++;
        }
    }
The other counting seems unrelated, so I left it out to focus on the question.
However, there is a piece missing.
This approach cannot detect when a line has a different format than expected. For instance,
word1/tag1 word2/tag2/tag3/word4/tag4
produces wrong results, silently. Some malformed pieces simply get skipped, but there are many cases that go wrong.
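To see the silent failure, the matching loop from above can be run on a malformed line. This is only a sketch: the input is assumed to have no spaces around the slashes, since the pattern requires the word and the tag to touch the slash.

```perl
use strict;
use warnings;

my %tagHash;

# Assumed malformed input: "tag2/tag3/word4" breaks the word/tag pairing
my $line = "word1/tag1 word2/tag2/tag3/word4/tag4";

while ( $line =~ m{ ([^/\s]+) / ([^/\s]+) }gx ) {
    print "word='$1' tag='$2'\n";
    $tagHash{$2}{$1}++;
}
# Prints:
#   word='word1' tag='tag1'
#   word='word2' tag='tag2'
#   word='tag3' tag='word4'
```

No warning is issued: tag3 is recorded as a word with the tag word4, and tag4 is dropped.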
One way to catch this is to pre-process the line, checking that there are at least two words between any two consecutive slashes and at least one word before the first slash and after the last. This means that each line is processed twice, and the code gets messier. For instance,
    while (my $line = <$fh>) {
        my @parts = split '/', $line;
        if (not shift @parts or not pop @parts or grep { 2 > split } @parts) {
            warn "Unexpected format: $line";
            next;
        }
        $tagHash{$2}{$1}++ while $line =~ m{ ([^/\s]+) / ([^/\s]+) }gx;
    }
This check modifies the @parts array, so if this array is needed later, you'd better use
if (!$parts[0] or !$parts[-1] or grep { 2 > split } @parts[ 1..@parts-2 ]) { ...
where, instead of grep, you can also use the short-circuiting any from List::Util
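A minimal sketch of the non-destructive check with any follows. The helper name is_malformed and the sample lines are assumptions made for illustration, and the word count is written out with an explicit array rather than the terse 2 > split.

```perl
use strict;
use warnings;
use List::Util qw(any);

# Hypothetical helper: true if the line fails the format check.
# It leaves @parts intact, unlike the shift/pop version.
sub is_malformed {
    my ($line) = @_;
    my @parts = split '/', $line;

    # Something must precede the first slash and follow the last one
    return 1 if !$parts[0] or !$parts[-1];

    # any() stops at the first inner piece with fewer than two words
    return any { my @words = split ' ', $_; @words < 2 }
           @parts[ 1 .. $#parts - 1 ];
}

for my $line ( "word1/tag1 word2/tag2 word3/tag3",
               "word1/tag1 word2/tag2/tag3/word4/tag4" ) {
    print is_malformed($line) ? "bad:  $line\n" : "good: $line\n";
}
```

Unlike grep, any returns as soon as one offending piece is found, which matters little here but avoids scanning the rest of a long line.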
Another way is to change the approach and parse the line carefully, rather than blindly hopping from one regex match to the next. Since the first and last pieces may contain only one word, this can be tricky to do with a regex. It is probably clearer and more practical to split the line and work with the array.
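One possible shape for such a split-based parser is sketched below. The name parse_line, the decision to require exactly one word at each end and exactly two in every inner piece, and the sample lines are all assumptions, so adjust them to what the data really allows.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [word, tag] pairs,
# or an empty list if the line is not in the expected format.
sub parse_line {
    my ($line) = @_;
    my @parts = split '/', $line;
    return if @parts < 2;                     # no slash at all

    # First and last pieces must each hold exactly one word
    my ($first) = $parts[0]  =~ /^\s*(\S+)\s*$/ or return;
    my ($last)  = $parts[-1] =~ /^\s*(\S+)\s*$/ or return;

    my @words = ($first);
    my @tags;
    for my $piece ( @parts[ 1 .. $#parts - 1 ] ) {
        my @pair = split ' ', $piece;
        return unless @pair == 2;             # expect "tag nextword"
        push @tags,  $pair[0];
        push @words, $pair[1];
    }
    push @tags, $last;

    return map { [ $words[$_], $tags[$_] ] } 0 .. $#words;
}

my %tagHash;
for my $line ( "word1/tag1 word2/tag2\n",
               "word1/tag1 word2/tag2/tag3/word4/tag4\n" ) {
    my @pairs = parse_line($line);
    if (!@pairs) {
        warn "Unexpected format: $line";
        next;
    }
    $tagHash{ $_->[1] }{ $_->[0] }++ for @pairs;
}
```

Each line is examined only once here, and a malformed line is rejected as a whole instead of contributing partial, wrong pairs.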
It is hard to imagine data that always conforms to the expected format, so I would suggest considering some of these checks.