Comparing Linux strings and Perl String

Question


Since I was dealing with very large files, I sorted the base and candidate files before comparing them to see which lines were missing from the other. I did this so that I would not have to hold the records in memory. The sorting was done with the Linux command-line tool sort.

In my Perl script, I check whether the current line of one file is lt, gt or eq the current line of the other file, advancing the file pointers where necessary. However, I ran into a problem: my Perl string comparison ordered a line in the base file differently from sort when the line in the candidate file contained special characters.

Is there any surefire way to make sure that my Linux and Perl string comparisons use the same type of string comparator?

+4

string sorting perl

syker Jan 21 '11 at 3:34


1 answer

j_random_hacker · Answer 1 · 2011-01-21T04:02:23+0000

The sort command uses the current locale, as indicated by the LC_ALL environment variable, to determine the sort order for characters. Usually the easiest way to fix sorting problems is to set it manually to the C locale, which treats each 8-bit byte as a single character and compares bytes by their numeric value. In most shells this can be done for a single command only, by prefixing the command like this:

 LC_ALL=C sort < infile > outfile

This will also solve similar problems with some other text-processing programs. (For example, I recall problems with CSV files on a German person's computer, caused by the German convention of using a comma instead of a decimal point. Putting LC_ALL=C in front of the relevant commands fixed that problem too.)
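As a rough illustration of the byte-order behaviour described above, the sketch below sorts four made-up sample lines under the C locale and then under whatever locale the shell currently has. The sample data and the variable name c_sorted are assumptions for the example, not part of the original answer. What the second command prints depends on the locale in effect.

```shell
# Sort four sample lines by raw byte value. In the C locale, uppercase ASCII
# letters (bytes 65-90) come before lowercase ones (bytes 97-122).
c_sorted=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
# prints: A, B, a, b (one per line)

# For comparison: the same input under the shell's current locale.
# The resulting order depends on that locale and may differ from the above.
printf 'b\nA\na\nB\n' | sort
```

If the two outputs differ, that is the kind of mismatch a byte-wise comparison in another program would trip over.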

[EDIT] Although Perl can be directed to treat some strings as Unicode, by default it still treats input and output as 8-bit byte streams, so the above approach should produce an order that matches the one used by Perl's sort() function. (Thanks to Ven'Tatsu for this nugget.)
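One way to guard against a mismatch before running a byte-wise comparison is to check that each input file really is in C-locale order. The sketch below uses sort -c, which only checks the order and writes no sorted output. The helper name is_byte_sorted, the file names and the sample data are hypothetical.

```shell
# is_byte_sorted FILE: succeed only if FILE is already in C-locale (byte) order.
# `sort -c` checks the order without producing sorted output.
is_byte_sorted() {
    LC_ALL=C sort -c "$1" 2>/dev/null
}

# Hypothetical sample data, written to a temporary directory.
tmpdir=$(mktemp -d)
printf 'A\nB\na\nb\n' > "$tmpdir/base.txt"   # already in byte order
printf 'a\nA\nb\nB\n' > "$tmpdir/cand.txt"   # typical result of a locale-aware sort

if is_byte_sorted "$tmpdir/base.txt"; then base_ok=yes; else base_ok=no; fi
if is_byte_sorted "$tmpdir/cand.txt"; then cand_ok=yes; else cand_ok=no; fi
echo "base: $base_ok, candidate: $cand_ok"
# prints: base: yes, candidate: no

rm -rf "$tmpdir"
```

A file that fails this check would need to be re-sorted with LC_ALL=C sort before a line-by-line comparison that relies on byte order.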
