Trying to determine whether a file is uuencoded

I am trying to process a large collection of txt files, which are themselves containers for the actual files that I want to process. SGML tags in the txt files mark the boundaries of the individual files. Sometimes the contained files are binaries that were uuencoded. I solved the problem of decoding the uuencoded files, but when I reviewed my solution I decided it was not general enough. That is, I used

if '\nbegin 644 ' in document['document']

to check whether a file was uuencoded. I did some searches and now have a vague idea of what 644 (the file permissions) means, and then I found other examples of uuencoded files that might have

if '\nbegin 642 ' in document['document']

or some other alternative. So my problem is: how can I make sure that I capture/identify all subcontainers with uuencoded files?

One solution is to check each subcontainer:

import codecs

uudecode = codecs.getdecoder("uu")

for document in documents:
    try:
        decoded_document, m = uudecode(document)
    except ValueError:
        decoded_document = ''
    if len(decoded_document) == 0:
        # more stuff
        pass

That's not terrible, and CPU cycles are cheap, but I'm going to process about 8 million documents.

So, is there a more reliable way to find out if a particular string is the result of uuencoding?

2 answers

Wikipedia says that every uuencoded file starts with this line

begin <perm> <name>

So a line matching the regular expression ^begin [0-7]{3} (.*)$ should mark the beginning quite reliably.
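For example, a minimal sketch of that check used as a pre-filter (the documents iterable and the 'document' key are carried over from the question; only matching documents then pay for the actual decode attempt):

import re
import codecs

# "begin <octal perms> <name>" marks the start of uuencoded data; MULTILINE
# lets ^ anchor at every line, so the header can sit anywhere in the container.
uu_begin = re.compile(r'^begin [0-7]{3} .*$', re.MULTILINE)

uudecode = codecs.getdecoder("uu")

for document in documents:
    if uu_begin.search(document['document']):
        try:
            decoded_document, _ = uudecode(document['document'])
        except ValueError:
            decoded_document = ''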


Two ways:

(1) The Unix file command.

http://unixhelp.ed.ac.uk/CGI/man-cgi?file

$ file foo
foo: uuencoded or xxencoded text
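If you want to drive the file command from Python rather than the shell, here is a sketch using subprocess; it assumes file is on PATH and supports the usual '-b' flag (suppress the filename prefix) and '-' (read from stdin), as it does on typical Linux/BSD systems:

import subprocess

def file_type(data):
    # Ask the Unix `file` command to classify in-memory data.
    p = subprocess.Popen(['file', '-b', '-'],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    out, _ = p.communicate(data)   # pass bytes under Python 3
    return out.strip()

# e.g. file_type(document['document']) -> 'uuencoded or xxencoded text'

Spawning one process per document gets expensive at 8 million documents, which is why the libmagic bindings in (2) may be a better fit.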

(2) From within Python, the magic module (the libmagic bindings), as discussed here: http://ubuntuforums.org/archive/index.php/t-1304548.html

#!/usr/bin/env python
import magic
import sys

filename = sys.argv[1]

ms = magic.open(magic.MAGIC_NONE)   # create a libmagic cookie
ms.load()                           # load the default magic database
ftype = ms.file(filename)           # classify the named file
print ftype                         # e.g. "uuencoded or xxencoded text"
ms.close()
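Since the sub-documents in the question are in-memory strings rather than files on disk, the buffer-based call may be more useful. This sketch assumes the libmagic bindings used above also expose a buffer() method (the bindings distributed with file do, but treat that as an assumption to verify):

ms = magic.open(magic.MAGIC_NONE)
ms.load()
ftype = ms.buffer(document['document'])  # classify in-memory data instead of a file
is_uu = 'uuencoded' in ftype             # matches "uuencoded or xxencoded text"
ms.close()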
