Grepping and UTF-16 binaries

grep / pcregrep etc. are convenient to use on binary files containing ASCII or UTF-8 data. Is there an easy way to get them to try UTF-16 too (preferably at the same time, but instead of would also do)?

The data I'm trying to find is all ASCII anyway (names of links in libraries, etc.); it just isn't found, because sometimes there is a 00 byte between any two characters, and sometimes there isn't.

I see no way to do this semantically, but those 00 bytes should do the trick; I just can't easily use them on the command line.
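To illustrate the 00-byte layout the question describes, here is a small demo (hypothetical, using iconv and od; not part of the original question): ASCII text stored as UTF-16LE carries a zero byte after every character.

```shell
# ASCII "abc" encoded as UTF-16LE: every character is followed by 00.
printf 'abc' | iconv -f utf-8 -t utf-16le | od -An -tx1
# prints: 61 00 62 00 63 00   (spacing may vary by od implementation)
```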

+61
grep unicode utf-16
Sep 20 '10 at 15:25
10 answers

The easiest way is to simply convert the text file to UTF-8 and pipe it to grep:

 iconv -f utf-16 -t utf-8 file.txt | grep query 
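A quick self-contained check of this approach (the file name and query below are illustrative, not from the answer):

```shell
# Create a UTF-16 sample file, then grep it via on-the-fly conversion.
printf 'hello query world\n' | iconv -f utf-8 -t utf-16 > /tmp/file16.txt
iconv -f utf-16 -t utf-8 /tmp/file16.txt | grep query
# prints: hello query world
```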

I tried to do the opposite (converting my query to UTF-16), but grep doesn't seem to like that. I think it might be related to the encoding, but I'm not sure.

It seems grep converts a UTF-16 query to UTF-8/ASCII. Here is what I tried:

 grep `echo -n query | iconv -f utf-8 -t utf-16 | sed 's/..//'` test.txt 

If test.txt is a UTF-16 file this won't work, but it does work if test.txt is ASCII. I can only conclude that grep is converting my query to ASCII.

EDIT: Here's a really, really crazy one that works, but doesn't give you very useful output:

 hexdump -e '/1 "%02x"' test.txt | grep -P `echo -n Test | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "%02x"'` 

How does it work? It converts your file to hexadecimal (without the extra formatting hexdump usually applies) and pipes that into grep. The grep query is built by echoing your query (without a trailing newline) into iconv, which converts it to UTF-16. That is then piped to sed to strip the BOM (the first two bytes of a UTF-16 file, used to determine its byte order). Finally it is piped to hexdump so that the query and the input are in the same form.

Unfortunately, I think this will print the ENTIRE file if there is even one match. Also, it will not work if the UTF-16 in your binary is stored with a different byte order than your machine's.

EDIT 2: Got it!!!!

 grep -P `echo -n "Test" | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "x%02x"' | sed 's/x/\\\\x/g'` test.txt 

This searches the file test.txt for the hex-encoded UTF-16 version of the string Test.
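For reference, the backticked pipeline just manufactures a NUL-padded \xNN escape string. Run standalone (a sketch: printf instead of echo -n, and one backslash level removed since there is no backtick substitution here), it should produce something like:

```shell
# Build the \xNN pattern for "Test": strip the 2-byte BOM, hex-dump
# each byte as xNN, then turn each x into \x.
printf 'Test' | iconv -f utf-8 -t utf-16 | sed 's/..//' \
  | hexdump -e '/1 "x%02x"' | sed 's/x/\\x/g'
# on a little-endian machine: \x54\x00\x65\x00\x73\x00\x74\x00
```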

+65
Sep 23 '10 at 18:01

You can explicitly include the zero bytes (00s) in the search string, although the results will then contain zero bytes as well, so you may want to redirect the output to a file to view it in a reasonable editor, or pipe it through sed to strip the zeros. To search for "bar" in *.utf16.txt:

 grep -Pa "b\x00a\x00r" *.utf16.txt | sed 's/\x00//g' 

"-P" tells grep to accept Perl regexp syntax, which lets \x00 expand to a NUL byte, and -a tells it to treat the file as text even though it looks like binary.
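A minimal reproduction of this answer (assuming GNU grep built with PCRE support; the file name and string are illustrative):

```shell
# BOM-less UTF-16LE file containing "foobar"; search with explicit
# \x00 bytes between the pattern's characters, strip NULs for display.
printf 'foobar\n' | iconv -f utf-8 -t utf-16le > /tmp/u16.txt
grep -aP 'b\x00a\x00r' /tmp/u16.txt | sed 's/\x00//g'
# prints: foobar
```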

+14
Nov 10 '15 at 2:28

I found that the solution below works best for me, from https://www.splitbits.com/2015/11/11/tip-grep-and-unicode/

Grep does not work well with UTF-16, but you can work around it. For example, to find

 Some Search Term 

in a UTF-16 file, use a regular expression that allows for the extra (zero) byte after each character,

 S.o.m.e. .S.e.a.r.c.h. .T.e.r.m 

Also tell grep to treat the file as text using '-a'; the final command looks like this:

 grep -a 'S.o.m.e. .S.e.a.r.c.h. .T.e.r.m' utf-16-file.txt 
+9
Mar 01 '18 at 22:09

I use this all the time after dumping the Windows registry, since regedit's export output is Unicode. It works under Cygwin.

 $ regedit /e registry.data.out
 $ file registry.data.out
 registry.data.out: Little-endian UTF-16 Unicode text, with CRLF line terminators
 $ sed 's/\x00//g' registry.data.out | egrep "192\.168"
 "Port"="192.168.1.5"
 "IPSubnetAddress"="192.168.189.0"
 "IPSubnetAddress"="192.168.102.0"
 [HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\Print\Monitors\Standard TCP/IP Port\Ports\192.168.1.5]
 "HostName"="192.168.1.5"
 "Port"="192.168.1.5"
 "LocationInformation"="http://192.168.1.28:1215/"
 "LocationInformation"="http://192.168.1.5:80/WebServices/Device"
 "LocationInformation"="http://192.168.1.5:80/WebServices/Device"
 "StandaloneDhcpAddress"="192.168.173.1"
 "ScopeAddressBackup"="192.168.137.1"
 "ScopeAddress"="192.168.137.1"
 "DhcpIPAddress"="192.168.1.24"
 "DhcpServer"="192.168.1.1"
 "0.0.0.0,0.0.0.0,192.168.1.1,-1"=""
 [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Print\Monitors\Standard TCP/IP Port\Ports\192.168.1.5]
 "HostName"="192.168.1.5"
 "Port"="192.168.1.5"
 "LocationInformation"="http://192.168.1.28:1215/"
 "LocationInformation"="http://192.168.1.5:80/WebServices/Device"
 "LocationInformation"="http://192.168.1.5:80/WebServices/Device"
 "StandaloneDhcpAddress"="192.168.173.1"
 "ScopeAddressBackup"="192.168.137.1"
 "ScopeAddress"="192.168.137.1"
 "DhcpIPAddress"="192.168.1.24"
 "DhcpServer"="192.168.1.1"
 "0.0.0.0,0.0.0.0,192.168.1.1,-1"=""
 "MRU0"="192.168.16.93"
 [HKEY_USERS\S-1-5-21-2054485685-3446499333-1556621121-1001\Software\Microsoft\Terminal Server Client\Servers\192.168.16.93]
 "A"="192.168.1.23"
 "B"="192.168.1.28"
 "C"="192.168.1.200:5800"
 "192.168.254.190::5901/extra"=hex:02,00
 "00"="192.168.254.190:5901"
 "ImagePrinterPort"="192.168.1.5"
+5
Aug 29 '14 at 23:11

I needed to do this recursively, and here is what I came up with:

 find -type f | while read l; do iconv -s -f utf-16le -t utf-8 "$l" | nl -s "$l: " | cut -c7- | grep 'somestring'; done 

It is absolutely terrible and very slow; I am sure there is a better way, and I hope someone can improve it, but I was in a hurry :P

What the pieces do:

 find -type f 

gives a recursive list of file names, with paths relative to the current directory

 while read l; do ... done 

Bash loop: for each line in the list of file paths, put the path in $l and execute the loop body. (Why I used a shell loop instead of xargs, which would be much faster: I need to prefix each line of output with the name of the current file. I couldn't think of a way to do that if I fed several files at once into iconv, and since I'm doing one file at a time anyway, the shell loop has simpler syntax/escaping.)

 iconv -s -f utf-16le -t utf-8 "$l" 

Convert the file named in $l: assume the input file is UTF-16 little-endian and convert it to UTF-8. -s makes iconv silent about conversion errors (there will be a lot of them, because some files in this directory tree are not UTF-16). The result of the conversion goes to standard output.

 nl -s "$l: " | cut -c7- 

This is a hack: nl inserts line numbers, but it has a "use this arbitrary string to separate the number from the line" option (-s), so I put the file name (followed by a colon and a space) there, then use cut to remove the line number, leaving only the file-name prefix. (Why I didn't use sed: escaping is much simpler this way. With sed I would have to worry about regular-expression metacharacters in the file names, of which there were plenty in my case. nl is much dumber than sed: it just takes the -s string literally, and the shell handles the escaping for me.)

So, by the end of this pipeline, I have converted a bunch of files into UTF-8 lines prefixed with their file names, which I then grep. If there is a match, the prefix tells me which file it is in.

Warnings

  • This is much slower than grep -R, because a new copy of iconv, nl, cut and grep is spawned for every single file. It is terrible.
  • Anything that is not UTF-16LE input comes out as complete garbage, so if a plain ASCII file contains "somestring", this command will not report it; you need to run an ordinary grep -R in addition to this command (and if several Unicode encodings are present, e.g. some big-endian and some little-endian files, you need to adjust the command and run it once per encoding).
  • Files whose names contain "somestring" will show up in the output even if their contents do not match.
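One possible cleanup, offered as a sketch (it assumes GNU grep, whose -H together with --label prefixes matches read from stdin with an arbitrary name): push the per-file loop into find -exec and drop the nl/cut hack entirely. The demo directory and search string below are illustrative.

```shell
# Make a small demo tree with one UTF-16LE file.
mkdir -p /tmp/u16demo
printf 'has somestring here\n' | iconv -f utf-8 -t utf-16le > /tmp/u16demo/a.txt

# Still one iconv+grep pair per file, but no per-line postprocessing:
# grep -H --label="$f" prints each match prefixed with the file name.
find /tmp/u16demo -type f -exec sh -c '
  for f; do
    iconv -s -f utf-16le -t utf-8 "$f" 2>/dev/null |
      grep -H --label="$f" "somestring"
  done
' sh {} +
# prints: /tmp/u16demo/a.txt:has somestring here
```

This removes the need to smuggle the file name through nl's separator string, since grep itself attaches the label.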
+4
Dec 11 '15 at 21:38

ripgrep

Use the ripgrep utility to grep UTF-16 files.

ripgrep supports searching files in text encodings other than UTF-8, such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (ripgrep has some support for automatically detecting UTF-16. Other text encodings must be specified explicitly with the -E/--encoding flag.)

Syntax Example:

 rg sometext file 

To dump all the lines, run: rg -N . file
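Since ripgrep's auto-detection keys on a BOM, BOM-less UTF-16 files need the encoding forced explicitly; a hypothetical example:

```shell
# ripgrep auto-detects UTF-16 only via a BOM; for BOM-less UTF-16LE,
# pass the encoding explicitly with -E/--encoding.
printf 'needle here\n' | iconv -f utf-8 -t utf-16le > /tmp/nobom.txt
rg -E utf-16le needle /tmp/nobom.txt
```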

+2
Jan 17 '19 at 12:55

The sed statement is more than I can wrap my head around. I have a simplified, far-from-ideal Tcl script which I think does an OK job on my test cases:

 #!/usr/bin/tclsh
 set insearch [lindex $argv 0]
 set search ""
 for {set i 0} {$i<[string length $insearch]-1} {incr i} {
     set search "${search}[string range $insearch $i $i]."
 }
 set search "${search}[string range $insearch $i $i]"
 for {set i 1} {$i<$argc} {incr i} {
     set file [lindex $argv $i]
     set status 0
     if {! [catch {exec grep -a $search $file} results options]} {
         puts "$file: $results"
     }
 }
0
Jul 15 '13 at 19:53

I added this as a comment on the accepted answer above, but I am reposting it to make it easier to read. It lets you search for text in a bunch of files and also displays the names of the files it finds the text in. All of my files have a .reg extension, since I am searching exported Windows registry files; just replace .reg with whatever file extension you need.

 # Define grepreg in bash by pasting at the bash command prompt
 grepreg () {
     find -name '*.reg' -exec echo {} \; -exec iconv -f utf-16 -t utf-8 {} \; | grep "$1\|\.reg"
 }
 # Sample usage
 grepreg SampleTextToSearch
0
Oct 16 '15 at 13:52

You can use the following Ruby single-line:

 ruby -e "puts File.open('file.txt', mode:'rb:BOM|UTF-16LE').readlines.grep(Regexp.new 'PATTERN'.encode(Encoding::UTF_16LE))" 



For simplicity, this can be defined as a shell function, for example:

 grep-utf16() { ruby -e "puts File.open('$2', mode:'rb:BOM|UTF-16LE').readlines.grep(Regexp.new '$1'.encode(Encoding::UTF_16LE))"; } 

Then it can be used just like grep:

 grep-utf16 PATTERN file.txt 

Source: How to use Ruby's readlines.grep for UTF-16 files?

0
May 20 '19 at 23:17

ugrep (Universal grep) supports Unicode and UTF-8/16/32 files, detects invalid Unicode to ensure correct results, searches both text and binary files, and is fast and free:

ugrep searches UTF-8/16/32 input and other formats. The --encoding option lets you search many other encodings, such as ISO-8859-1, EBCDIC, and code pages 437, 850, 858, 1250-1258.

Download ugrep from GitHub

0
Sep 10 '19 at 21:13