I am trying to process a large collection of txt files, which are themselves containers for the actual files I want to process. SGML tags in the txt files mark the boundaries of the individual files. Sometimes a contained file is a binary that was uuencoded. I solved the problem of decoding the uuencoded files, but when I reviewed my solution I decided it was not general enough. That is, I used
if '\nbegin 644 ' in document['document']
to check whether a file was uuencoded. I did some searching and vaguely understand what 644 (the file permissions) means, and then I found other examples of uuencoded files that might have
if '\nbegin 642 ' in document['document']
or even other alternatives. So my problem is: how can I make sure that I capture/identify all subcontainers with uuencoded files?
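For context, the uuencode header line always has the form `begin <octal mode> <filename>`, so instead of testing for one hard-coded mode like 644, a regex can match any octal mode. This is a sketch of that idea (the names `UU_BEGIN` and `looks_uuencoded` are my own, not from the question):

```python
import re

# A uuencode header looks like "begin <octal mode> <filename>",
# e.g. "begin 644 photo.jpg" or "begin 642 a.out".
# Match any 3- or 4-digit octal mode rather than a fixed one.
UU_BEGIN = re.compile(r'^begin [0-7]{3,4} \S+', re.MULTILINE)

def looks_uuencoded(text):
    """Cheap check: does the text contain a plausible uuencode header?"""
    return UU_BEGIN.search(text) is not None
```

This only confirms that a header-shaped line is present; a document could still fail to decode, so it works best as a pre-filter before a real decode attempt.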
One solution is to attempt a decode on each subcontainer:

import codecs

uudecode = codecs.getdecoder("uu")
for document in documents:
    try:
        decoded_document, m = uudecode(document)
    except ValueError:
        decoded_document = ''
    if len(decoded_document) == 0:
        # more stuff
It's not terrible, and CPU cycles are cheap, but I'm going to process about 8 million documents.
So, is there a more reliable way to find out if a particular string is the result of uuencoding?
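One way to combine the two approaches above is to use a cheap header check as a pre-filter and only fall back to the full decode when it matches. A sketch (the function name is my own; note that in Python 3 the "uu" codec is a bytes-to-bytes transform, so this assumes the documents are bytes, or are encoded first):

```python
import codecs
import re

uudecode = codecs.getdecoder("uu")

# Pre-filter: a plausible "begin <octal mode> " header line.
UU_BEGIN = re.compile(rb'^begin [0-7]{3,4} ', re.MULTILINE)

def decode_if_uuencoded(data):
    """Return the decoded payload if `data` is uuencoded, else None."""
    # Skip the expensive decode attempt unless a header is present.
    if not UU_BEGIN.search(data):
        return None
    try:
        decoded, _length = uudecode(data)
        return decoded
    except ValueError:
        # Header-shaped line present but the body didn't decode.
        return None
```

The regex rejects the vast majority of plain-text documents with a single scan, so over millions of documents the try/except cost is only paid for the candidates that actually look uuencoded.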