Grep / regex cannot find accented word

I am trying to write a regular expression that finds words in a file, where every letter of a matching word comes from a given set of letters.

My problem is that the regular expression does not find words with accents, and my text file contains a lot of accented words.

My command line is:

 cat input/words.txt | grep '^[éra]\{1,4\}$' > output/words_era.txt
 cat input/words.txt | grep '^[carroça]\{1,7\}$' > output/words_carroca.txt

And the contents of the file:

 carroça
 éra
 éssa
 roça
 roco
 rato
 onça
 orça
 roca

How can I fix this?

4 answers

If your file is encoded in ISO-8859-1 but your system locale uses UTF-8, this will not work.

Either convert the file to UTF-8, or run grep under an ISO-8859-1 locale.

 # convert from ISO-8859-1 to the environment's locale before grepping
 # output will be in the current locale
 $ iconv -f 8859_1 input/words.txt | grep ...

 # run grep with an ISO-8859-1 locale
 # output will be in ISO-8859-1 encoding
 $ cat input/words.txt | env LC_ALL=en_US grep ...
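
If you are not sure which encoding the file actually uses, one way to check is to ask iconv to validate it as UTF-8. This is only a sketch: `is_utf8` and `to_utf8` are hypothetical helper names, and it assumes that anything that is not valid UTF-8 is ISO-8859-1, which may not hold for your data.

```shell
# Hypothetical helper: succeeds only if the file is valid UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Hypothetical helper: print the file as UTF-8.
# Assumption: a file that is not valid UTF-8 is taken to be ISO-8859-1.
to_utf8() {
    if is_utf8 "$1"; then
        cat "$1"
    else
        iconv -f ISO-8859-1 -t UTF-8 "$1"
    fi
}
```

With these defined, something like `to_utf8 input/words.txt | grep '^[éra]\{1,4\}$'` should see UTF-8 input either way, provided the current locale is a UTF-8 one.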

I found a related question here, and the approach from it seems to work.

So, if you try something like:

 cat input/words.txt | LANG=C grep '^[éra]\{1,4\}$' > output/words_era.txt 

Does this do what you expect?
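
One caveat, as far as I understand it: in the C locale grep works on bytes rather than characters, so with a UTF-8 file each accented letter counts as two bytes against the `\{1,4\}` limit. Also note that `LC_ALL`, if set, overrides `LANG`. The sketch below assumes GNU grep and UTF-8 input; `count_c_locale` is a hypothetical helper name.

```shell
# Hypothetical helper: count the input lines that match a pattern
# when grep runs byte-wise in the C locale.
count_c_locale() {
    # grep -c exits non-zero when the count is 0; ignore that status.
    LC_ALL=C grep -c -- "$1" || true
}

# "éra" is 3 characters but 4 bytes in UTF-8 (é is 0xC3 0xA9).
printf 'éra\n' | count_c_locale '^[éra]\{1,4\}$'   # 4 bytes fit in {1,4}
printf 'éra\n' | count_c_locale '^[éra]\{1,3\}$'   # 4 bytes do not fit in {1,3}
```

So the limits in the pattern may need to be raised if the words contain many accented letters.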


Assuming everything is UTF-8, I usually use something like

 perl -CSAD -Mutf8 -nle 'print if /^carroça{1,3}$/' filenames

because then I know what it is doing.


Try the same approach as @dule, but with LANG=en_US.iso88591:

 cat input/words.txt | LANG=en_US.iso88591 grep '^[éra]\{1,4\}$' > output/words_era.txt 
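
This only helps if that locale is actually installed; if it is not, programs generally fall back to the C locale without complaining. A rough way to check first, assuming the `locale` utility is available (`has_locale` and `grep_in_locale` are hypothetical helper names):

```shell
# Hypothetical helper: succeeds if the name appears in `locale -a`.
# The comparison is literal (case-insensitive), so use the spelling
# that `locale -a` itself prints, e.g. en_US.iso88591.
has_locale() {
    locale -a 2>/dev/null | grep -qixF -- "$1"
}

# Hypothetical wrapper: run grep under the given locale, refusing to
# run if that locale is not installed.
grep_in_locale() {
    loc=$1
    shift
    has_locale "$loc" || { echo "locale $loc is not installed" >&2; return 1; }
    LC_ALL="$loc" grep "$@"
}
```

For example, `grep_in_locale en_US.iso88591 '^[éra]\{1,4\}$' input/words.txt` would fail loudly instead of silently matching nothing. Keep in mind that the pattern typed in a UTF-8 terminal is itself UTF-8, so it may also need converting.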

Source: https://habr.com/ru/post/1336196/