Using special characters in a string argument for the awk match function. Current locale settings

I have a problem using the match function in awk on a string containing special characters. Consider the test.awk file:

 { match($0,"(^.*)kon",a); print a[1]; } 

and the corresponding test file "test.txt" containing "Testing Håkon" (note the Norwegian letter "å"). The file is encoded in "iso-8859-1" and is 14 bytes long. The hexadecimal dump of the file, as printed by xxd -p test.txt, is

 54657374696e672048e56b6f6e0a 

From this it can be seen that the Norwegian character "å" is stored as the single byte "e5", which confirms that the file uses the iso-8859-1 encoding.
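For anyone who wants to reproduce the problem, the test file can be recreated byte for byte. This is a minimal sketch assuming a POSIX shell; it uses od instead of xxd only because od is more widely installed (that substitution is an assumption for portability, not part of the question).

```shell
# Write "Testing Håkon" with å as the single byte 0xE5 (octal 345),
# which is how ISO-8859-1 encodes it, followed by a newline.
printf 'Testing H\345kon\n' > test.txt

# Byte count of the file (the question states 14 bytes).
wc -c < test.txt

# Hex dump of every byte; the byte after 48 ("H") should be e5.
od -An -tx1 test.txt
```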

Running

 awk -f test.awk test.txt 

prints only an empty line on the terminal, whereas the correct output should be "Testing Hå".

The output of the locale command:

 LANG=en_DK.UTF-8
 LANGUAGE=en_US:
 LC_CTYPE="en_DK.UTF-8"
 LC_NUMERIC="en_DK.UTF-8"
 LC_TIME="en_DK.UTF-8"
 LC_COLLATE="en_DK.UTF-8"
 LC_MONETARY="en_DK.UTF-8"
 LC_MESSAGES="en_DK.UTF-8"
 LC_PAPER="en_DK.UTF-8"
 LC_NAME="en_DK.UTF-8"
 LC_ADDRESS="en_DK.UTF-8"
 LC_TELEPHONE="en_DK.UTF-8"
 LC_MEASUREMENT="en_DK.UTF-8"
 LC_IDENTIFICATION="en_DK.UTF-8"
 LC_ALL=

which shows that "LANG" selects a locale that uses the UTF-8 encoding.

2 answers

This is not an awk issue. Your locale expects UTF-8 encoding, but your file uses iso-8859-1, so either set your locale to match your file or convert the file to match your locale.

Note: the second argument to match() should be a regular expression rather than a string, and the final ; is not required:

 { match($0,/(^.*)kon/,a); print a[1] } 
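One way to make the file match the locale is to convert the data to UTF-8 before awk sees it. The sketch below is an illustration, not the answerer's own command: it assumes iconv is available, recreates the question's file so that it runs on its own, and uses the two-argument form of match() with RLENGTH because the three-argument form is a gawk extension that other awk implementations reject.

```shell
# Recreate the question's file: å stored as the single ISO-8859-1 byte 0xE5.
printf 'Testing H\345kon\n' > test.txt

# Convert the data to UTF-8 so that it agrees with a UTF-8 locale.
# match() sets RLENGTH to the length of the text matched by /^.*kon/;
# dropping the last three characters ("kon") leaves what (^.*) captures
# in the question's program.
iconv -f ISO-8859-1 -t UTF-8 test.txt |
  awk '{ if (match($0, /^.*kon/)) print substr($0, 1, RLENGTH - 3) }' > out.txt

cat out.txt
```

The output is already UTF-8, so it displays correctly on a UTF-8 terminal without further conversion.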

I changed your code as:

 { match($0,"(^.*)kon",a); print ">>>" a[1] "<<<"; } 

The result with GNU Awk 3.1.6 on Windows 7:

 >>>Hå<<< 

On Ubuntu running GNU Awk 3.1.8 I get:

 >>><<< 

To get the desired result, I had to override the locale for the awk command only and then convert its output to UTF-8:

 LC_ALL=ISO_8859-1 awk -f test.awk test.txt | iconv -f ISO_8859-1 -t UTF-8 
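Whether a locale named ISO_8859-1 exists depends on the system (locale -a lists the installed names), so a variant that relies only on the always-present C locale may be more portable. This is a sketch under that assumption: in the C locale awk treats each byte as one character, so the e5 byte no longer breaks the match, and iconv then converts the result for a UTF-8 terminal. The two-argument match() is used here so that the example also runs on awk implementations other than gawk.

```shell
# Recreate the question's file: å stored as the single ISO-8859-1 byte 0xE5.
printf 'Testing H\345kon\n' > test.txt

# Run awk in the C locale so the input is processed byte by byte,
# then convert the ISO-8859-1 output to UTF-8 for display.
LC_ALL=C awk '{ if (match($0, /^.*kon/)) print substr($0, 1, RLENGTH - 3) }' test.txt |
  iconv -f ISO-8859-1 -t UTF-8 > out_c.txt

cat out_c.txt
```

Setting LC_ALL on the command line affects only that awk invocation, so the rest of the session keeps its UTF-8 locale.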

Source: https://habr.com/ru/post/1482828/

