Comparing Linux sort and Perl string comparison

Since I was dealing with very large files, I sorted the base and candidate files before comparing them to see which lines were missing from each. I did this to avoid holding all the records in memory. The sorting was done with the Linux command-line tool sort.

In my Perl script, I would check whether a line in one file was lt, gt or eq to a line in the other file, advancing the file pointers where necessary. However, I ran into a problem: for lines containing special characters, my string comparison disagreed with the order that sort had produced for the base and candidate files.

Is there any surefire way to make sure the Linux sort and my Perl string comparisons use the same string ordering?

1 answer

The sort command uses the current locale, as indicated by the LC_ALL environment variable, to determine the sort order for characters. Usually the easiest way to fix sorting problems is to set it explicitly to the C locale, which treats each 8-bit byte as a single character and compares bytes by their numeric value. In most shells, you can do this for a single command by prefixing it like this:

 LC_ALL=C sort < infile > outfile 
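As a small illustration of the byte-order behavior described above, the following sketch sorts four made-up sample lines under the C locale. Only the `LC_ALL=C sort` invocation comes from the answer; the sample data and the variable name `sorted` are assumptions for the example.

```shell
# Sort a few made-up sample lines by raw byte value.
# In the C locale, uppercase ASCII letters (65-90) sort before
# lowercase ones (97-122), so case is never interleaved.
sorted=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort)
printf '%s\n' "$sorted"
```

This prints A, B, a, b on separate lines. Under many UTF-8 locales the same input would instead come out with upper and lower case interleaved, which is the kind of mismatch that a byte-wise comparison in another program would trip over.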

Setting LC_ALL=C also solves similar problems with some other text-processing programs. (For example, I recall problems with CSV files on a computer configured for a German locale - they arose because German uses a comma as the decimal separator instead of a point. Putting LC_ALL=C in front of the relevant commands fixed that too.)

[EDIT] Although Perl can be told to treat some strings as Unicode, by default it treats input and output as 8-bit byte streams, so the approach above should produce the same ordering as Perl's sort() function. (Thanks to Ven'Tatsu for this nugget.)


Source: https://habr.com/ru/post/1336413/

