I am trying to process a large collection of txt files, which are themselves containers for the actual files I want to process. SGML tags in the txt files mark the boundaries of the individual files. Sometimes a contained file is a binary that was uuencoded. I solved the problem of decoding the uuencoded files, but when I reviewed my solution I decided it was not general enough. That is, I used
if '\nbegin 644 ' in document['document']
to check whether a file was uuencoded. I did some searching and vaguely understand what 644 (the file permissions) means, and then I found other examples of uuencoded files that might have
if '\nbegin 642 ' in document['document']
or even other alternatives. So my problem is: how can I make sure that I capture/identify all subcontainers with uuencoded files?
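For context, the uuencode header line always has the form `begin <octal mode> <filename>`, so instead of testing for one hard-coded mode like 644, a regex can match any octal mode. This is a sketch of that idea (the names `UU_BEGIN` and `looks_uuencoded` are my own, not from the question):

```python
import re

# A uuencode header looks like "begin <octal mode> <filename>",
# e.g. "begin 644 photo.jpg" or "begin 642 a.out".
# Match any 3- or 4-digit octal mode rather than a fixed one.
UU_BEGIN = re.compile(r'^begin [0-7]{3,4} \S+', re.MULTILINE)

def looks_uuencoded(text):
    """Cheap check: does the text contain a plausible uuencode header?"""
    return UU_BEGIN.search(text) is not None
```

This only confirms that a header-shaped line is present; a document could still fail to decode, so it works best as a pre-filter before a real decode attempt.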
One solution is to attempt a decode on each subcontainer:

import codecs

uudecode = codecs.getdecoder("uu")
for document in documents:
    try:
        decoded_document, m = uudecode(document)
    except ValueError:
        decoded_document = ''
    if len(decoded_document) == 0:
        # more stuff
It's not terrible, and CPU cycles are cheap, but I'm going to process about 8 million documents.
So, is there a more reliable way to find out if a particular string is the result of uuencoding?
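One way to combine the two approaches above is to use a cheap header check as a pre-filter and only fall back to the full decode when it matches. A sketch (the function name is my own; note that in Python 3 the "uu" codec is a bytes-to-bytes transform, so this assumes the documents are bytes, or are encoded first):

```python
import codecs
import re

uudecode = codecs.getdecoder("uu")

# Pre-filter: a plausible "begin <octal mode> " header line.
UU_BEGIN = re.compile(rb'^begin [0-7]{3,4} ', re.MULTILINE)

def decode_if_uuencoded(data):
    """Return the decoded payload if `data` is uuencoded, else None."""
    # Skip the expensive decode attempt unless a header is present.
    if not UU_BEGIN.search(data):
        return None
    try:
        decoded, _length = uudecode(data)
        return decoded
    except ValueError:
        # Header-shaped line present but the body didn't decode.
        return None
```

The regex rejects the vast majority of plain-text documents with a single scan, so over millions of documents the try/except cost is only paid for the candidates that actually look uuencoded.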