Grep / regex cannot find accented word

Question

I am trying to write a regular expression that finds the words in a file whose letters all come from a given set of letters.

My problem is that the regular expression cannot find words with an accent, but there are a lot of accented words in my text file.

My command line is:

 cat input/words.txt | grep '^[éra]\{1,4\}$' > output/words_era.txt
 cat input/words.txt | grep '^[carroça]\{1,7\}$' > output/words_carroca.txt

And the contents of the file:

 carroça
 éra
 éssa
 roça
 roco
 rato
 onça
 orça
 roca

How can I fix this?

+4

regex grep unicode cat non-ascii-characters

Godfather Jan 19 '11 at 19:02


4 answers

I found a related question here whose answer seems to work.

So, if you try something like:

 cat input/words.txt | LANG=C grep '^[éra]\{1,4\}$' > output/words_era.txt

Does this do what you expect?
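
One caveat is worth checking before relying on the C locale. The sketch below assumes a POSIX shell and a grep that treats every byte as one character in the C locale (GNU grep does). In that locale `\{1,4\}` counts bytes, not letters, so a UTF-8 `é` uses up two of the allowed repetitions. The variable names are illustrative, and the octal escapes simply spell out the UTF-8 bytes of `é` so that the script does not depend on its own file encoding. It sets LC_ALL instead of LANG because LC_ALL overrides the other locale variables.

```shell
#!/bin/sh
# The word "éra" as UTF-8 bytes, written with octal escapes (é = 0xC3 0xA9).
word=$(printf '\303\251ra')

# Bracket expressions holding the same UTF-8 bytes, with two repeat limits.
pat3=$(printf '^[\303\251ra]\\{1,3\\}$')
pat4=$(printf '^[\303\251ra]\\{1,4\\}$')

# In the C locale every byte is one character, so "éra" is 4 units long, not 3.
# "|| true" keeps going when grep finds nothing and exits with status 1.
hits3=$(printf '%s\n' "$word" | LC_ALL=C grep -c "$pat3" || true)
hits4=$(printf '%s\n' "$word" | LC_ALL=C grep -c "$pat4" || true)

echo "limit 3: $hits3 matching line(s)"
echo "limit 4: $hits4 matching line(s)"
```

With GNU grep this should report 0 matching lines for the limit of 3 and 1 for the limit of 4, so repeat limits may need to be raised when matching UTF-8 text byte by byte.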

+1

dule Jan 19 '11 at 19:18


Assuming everything is UTF-8, I'd usually use something like

 perl -CSAD -Mutf8 -nle 'print if /^carroça{1,3}$/' filenames

because then I know what it is doing.

+1

tchrist Jan 19 '11 at 9:51


Try it like @dule, but with LANG=en_US.iso88591:

 cat input/words.txt | LANG=en_US.iso88591 grep '^[éra]\{1,4\}$' > output/words_era.txt

0

Unclezeiv Jan 19 '11 at 19:24


ephemient · Accepted Answer · 2011-01-19T19:26:52+0000

If your file is encoded in ISO-8859-1 but your system locale is UTF-8, grep will not match the accented characters.

Either convert the file to UTF-8, or change the locale of your system to ISO-8859-1.

 # convert from ISO-8859-1 to the environmental locale before grepping
 # output will be in the current locale
 $ iconv -f 8859_1 input/words.txt | grep ...

 # run grep with an ISO-8859-1 locale
 # output will be in ISO-8859-1 encoding
 $ cat input/words.txt | env LC_ALL=en_US grep ...
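
To see why the encodings cannot match each other, here is a minimal sketch, assuming a POSIX shell and an `iconv` that knows the names `ISO-8859-1` and `UTF-8`. It builds the word "éra" in both encodings from octal escapes and compares the bytes. The variable names are illustrative only.

```shell
#!/bin/sh
# Build the same word, "éra", in both encodings using octal escapes
# so the script itself stays pure ASCII.
latin1_word=$(printf '\351ra')       # é is the single byte 0xE9 in ISO-8859-1
utf8_word=$(printf '\303\251ra')     # é is the two bytes 0xC3 0xA9 in UTF-8

# Count the bytes of each form (tr strips the padding some wc versions add).
latin1_len=$(printf '%s' "$latin1_word" | wc -c | tr -d ' ')
utf8_len=$(printf '%s' "$utf8_word" | wc -c | tr -d ' ')
echo "ISO-8859-1 bytes: $latin1_len"
echo "UTF-8 bytes: $utf8_len"

# Converting the ISO-8859-1 bytes should give exactly the UTF-8 bytes.
converted=$(printf '%s' "$latin1_word" | iconv -f ISO-8859-1 -t UTF-8)
if [ "$converted" = "$utf8_word" ]; then
    echo "conversion matches"
fi
```

The same word is 3 bytes in ISO-8859-1 and 4 bytes in UTF-8, so a pattern typed in a UTF-8 terminal never lines up with ISO-8859-1 data until one side is converted.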
