Background:
Python 2.6.6 on Linux. This is the first step of a DNA sequence analysis pipeline.
I want to read a possibly gzipped file from mounted remote storage (LAN). If it is gzipped, I want to gunzip it into a stream (i.e. using gunzip FILENAME -c). If the first character of the stream (file) is "@", the entire stream should be piped into a filter program that accepts input on standard input; otherwise it should just be written directly to a file on the local drive. I would like to minimize the number of reads/seeks on the remote storage (surely a single pass through the file cannot be impossible?).
The contents of a sample input file; the first four lines correspond to one record in FASTQ format:
    @I328_1_FC30MD2AAXX:8:1:1719:1113/1
    GTTATTATTATAATTTTTTACCGCATTTATCATTTCTTCTTTATTTTCATATTGATAATAAATATATGCAATTCG
    +I328_1_FC30MD2AAXX:8:1:1719:1113/1
    hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhahhhhhhfShhhYhhQhh]hhhhffhU\UhYWc
Files that should not be sent to the filtering program contain entries that look like this (first two lines corresponding to one entry in FASTA format):
    >I328_1_FC30MD2AAXX:8:1:1719:1113/1
    GTTATTATTATAATTTTTTACCGCATTTATCATTTCTTCTTTATTTTCATATTGATAATAAATATATGCAATTCG
Here is some made-up semi-pseudocode to visualize what I want to do (I know this is not possible as written). I hope it makes sense:
    if gzipped:
        gunzip = Popen(["gunzip", "-c", "remotestorage/file.gz"], stdout=PIPE)
        if gunzip.stdout.peek(1) == "@":  # This isn't possible
            fastq = True
        else:
            fastq = False
    if fastq:
        filter = Popen(["filter", "localstorage/outputfile.fastq"], stdin=gunzip.stdout).communicate()
    else:
        # Send the gunzipped stream to another file
Please ignore the fact that the code will not run as written here and that it has no error handling etc.; all of that is already in my other code. I just need help with peeking into the stream or finding a way around it. It would be nice if I could call gunzip.stdout.peek(1), but I understand that this is not possible.
What I have tried so far:
I figured subprocess.Popen could help me achieve this, and I have tried many different ideas. Among others, I tried using some kind of io.BufferedRandom() object to write the stream into, but I cannot figure out how that would work. I know that streams are not seekable, but perhaps a workaround could be to read the first character of the gunzip stream, then create a new stream into which I first write "@" or ">" depending on the file contents, and then pipe the rest of gunzip.stdout into that new stream. This new stream would then be fed to the stdin of the filter's Popen.
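To make the workaround idea concrete, here is a rough sketch of what I have in mind. It is only an illustration under assumptions, not working production code: the function name route_stream and the output path localstorage/outputfile.fasta are made up for this example. It reads one byte from the stream, decides where the data should go, writes that byte to the destination and then copies the rest in fixed-size chunks, so the source is read only once and memory use does not grow with file size.

```python
import shutil
import subprocess


def route_stream(source, fastq_cmd, plain_path, chunk_size=1024 * 1024):
    """Forward `source` (a binary file-like object) based on its first byte.

    If the first byte is "@" (FASTQ), the whole stream is piped to the
    stdin of `fastq_cmd`; otherwise it is written to `plain_path`.
    Returns True if the stream was treated as FASTQ.
    (Hypothetical helper, named only for this sketch.)
    """
    first = source.read(1)          # consume exactly one byte to "peek"
    is_fastq = (first == b"@")

    if is_fastq:
        sink_proc = subprocess.Popen(fastq_cmd, stdin=subprocess.PIPE)
        sink = sink_proc.stdin
    else:
        sink_proc = None
        sink = open(plain_path, "wb")

    try:
        sink.write(first)           # put the consumed byte back in front
        shutil.copyfileobj(source, sink, chunk_size)  # chunked copy of the rest
    finally:
        sink.close()                # signals end of input to the filter
        if sink_proc is not None:
            sink_proc.wait()
    return is_fastq


# Intended use (paths as in the pseudocode; the .fasta path is an assumption):
# gunzip = subprocess.Popen(["gunzip", "-c", "remotestorage/file.gz"],
#                           stdout=subprocess.PIPE)
# route_stream(gunzip.stdout,
#              ["filter", "localstorage/outputfile.fastq"],
#              "localstorage/outputfile.fasta")
# gunzip.wait()
```

The drawback of this approach is that all data passes through the Python process instead of gunzip's stdout being connected directly to the filter's stdin, which is part of why I am asking whether there is a better way.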
Please note that the file size can be several times larger than the available memory. I do not want to read the source file from remote storage more than once, and I want to avoid any unnecessary file access.
Any ideas are welcome! Please ask questions if anything is unclear so that I can clarify.