I needed to do this recursively, and here is what I came up with:
find -type f | while read l; do iconv -s -f utf-16le -t utf-8 "$l" | nl -s "$l: " | cut -c7- | grep 'somestring'; done
It is absolutely terrible and very slow; I'm sure there is a better way and I hope someone can improve on it, but I was in a hurry :P
What the pieces do:
find -type f
gives a recursive list of file names, with paths relative to the current directory
while read l; do ... done
Bash loop; for each line in the list of file paths, put the path in $l and run the loop body. (Why I used a shell loop instead of xargs, which would be much faster: I need each line of output to be prefixed with the name of the current file. I couldn't think of a way to do that if I fed several files to iconv at once, and since I'm handling one file at a time anyway, the shell loop is easier in terms of syntax and escaping.)
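If the file names might contain leading whitespace or backslashes, a slightly more robust skeleton for the same loop (just a sketch, not what I actually ran) would be:

find . -type f -print0 | while IFS= read -r -d '' l; do
    # "$l" holds one file path per iteration, even if the name contains spaces or newlines
    printf 'would process: %s\n' "$l"
done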
iconv -s -f utf-16le -t utf-8 "$l"
Convert the file whose name is in $l: assume the input is utf-16 little-endian and convert it to utf-8. The -s makes iconv keep quiet about conversion errors (there will be a lot of them, because some files in this directory tree are not utf-16). The result of the conversion goes to standard output.
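As a stand-alone example of just this step (the file name somefile.txt is made up), you could run:

iconv -s -f utf-16le -t utf-8 somefile.txt | grep 'somestring'
# -s suppresses iconv's complaints about files that aren't really utf-16le;
# the decoded utf-8 text goes to stdout, so it can be piped straight into grep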
nl -s "$l: " | cut -c7-
This is a hack: nl inserts line numbers, but it has a "use this arbitrary string to separate the number from the line" option (-s), so I put the file name (followed by a colon and a space) there, then use cut to strip off the line number, leaving only the file name prefix. (Why I didn't use sed: the escaping is much simpler this way. If I had used sed, I'd have to worry about regex metacharacters in the file names, and there were a lot of those in my case. nl is much dumber than sed: it just takes the -s argument literally, and the shell handles the quoting of $l for me.)
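To see the trick in isolation, try something like this (the sample text and the myfile name are made up):

printf 'first line\nsecond line\n' | nl -s 'myfile: ' | cut -c7-
# nl prints a 6-character number field, then the -s separator, then the line;
# cut -c7- strips off the number field, so you should get:
#   myfile: first line
#   myfile: second line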
So, by the end of this pipeline, I've turned a bunch of files into lines of utf-8 text prefixed with the file name, which I then grep. If there is a match, I can tell which file it came from by looking at the prefix.
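For example, a match in a hypothetical file ./notes/readme.txt would come out looking roughly like:

./notes/readme.txt: ...text around somestring...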
Warnings
- This is much slower than grep -R, because it spawns a new copy of iconv, nl, cut and grep for every single file. This is terrible.
- Anything whose input isn't utf-16le will come out as complete garbage, so if there is an ordinary ASCII file containing "somestring", this command will not report it. You need to do a normal grep -R in addition to this command (and if you have several Unicode encodings present, for example some big-endian and some little-endian files, you need to adjust this command and run it again for each encoding; see the sketch after this list).
- Files whose name contains "somestring" will show up in the output even if their contents don't match.
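For the multi-encoding case, a rough sketch of the same approach, run once per encoding and combined with a plain grep -R for ordinary ASCII/UTF-8 files, might look like this (it inherits all the slowness and caveats above):

grep -R 'somestring' .   # catches plain ASCII/UTF-8 files
for enc in utf-16le utf-16be; do
    find . -type f | while read l; do
        iconv -s -f "$enc" -t utf-8 "$l" | nl -s "$l: " | cut -c7- | grep 'somestring'
    done
done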