Memory-mapped file

I will explain what my problem is, because it is important for understanding what I want :-).

I am working on a pipeline written in Python that uses several external tools to perform several analyses of genomics data. One of these tools works with very large FASTQ files, which in the end are nothing more than regular text files.

Usually these FASTQ files are gzipped, and since they are plain text, the compression ratio is very high. Most data analysis tools can work with gzipped files, but a few of ours cannot. So we unzip the files, work with them, and finally re-compress them.

As you can imagine, this process:

  • Is slower
  • Consumes a lot of disk space
  • Consumes network bandwidth (when working on an NFS file system)

So I'm trying to find a way to "trick" these tools into working directly with gzipped files without having to touch their source code.

I was thinking about using FIFO files, and I tried this, but it doesn't work if the tool reads the file more than once or if the tool seeks within the file.
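For the record, the FIFO idea can be sketched in Python like this (the file names are placeholders and `wc -c` stands in for the external tool; a real tool that re-reads or seeks the file would fail here, which is exactly the problem described above):

```python
import gzip
import os
import shutil
import subprocess
import tempfile

workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "large_file.gz")
with gzip.open(gz_path, "wb") as f:           # toy stand-in for a real FASTQ
    f.write(b"@read1\nACGT\n+\n!!!!\n")

fifo = os.path.join(workdir, "mapped_file")   # the named pipe the tool reads
os.mkfifo(fifo)

# "wc -c" stands in for the external tool; note it reads the pipe only once
# and never seeks, which is why this works for it but not for every tool.
proc = subprocess.Popen(["wc", "-c", fifo], stdout=subprocess.PIPE)

# The writer must run concurrently with the reader, or open() blocks forever.
with gzip.open(gz_path, "rb") as src, open(fifo, "wb") as dst:
    shutil.copyfileobj(src, dst)

out, _ = proc.communicate()
print(out.split()[0].decode())                # uncompressed byte count seen by the tool
os.unlink(fifo)
```

The tool sees a normal-looking path, but the data behind it is streamed and decompressed on the fly, so nothing uncompressed ever touches the disk.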

So basically I have two questions:

  • Is there a way to map the file into memory so that I can do something like:

    ./tool mapped_file

    (where mapped_file is not a regular file, but some kind of link to a memory-mapped view of the file)?

  • Do you have any other suggestions on how I can achieve my goal?

Thanks everyone!

+4
4 answers

If your tool can read from standard input, then one possibility is to decompress the stream with zcat and pipe it into the tool.

Something like this:

 zcat large_file.gz | ./tool 

If you want your results compressed as well, you can simply pipe the output through gzip again:

 zcat large_file.gz | ./tool | gzip - > output.gz 
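Since the pipeline itself is written in Python, the same shell pipeline can also be driven with `subprocess` (a sketch: `tr` is a placeholder for the actual tool, the file names are illustrative, and `gzip -cd` is equivalent to `zcat`):

```python
import gzip
import os
import subprocess
import tempfile

workdir = tempfile.mkdtemp()
gz_in = os.path.join(workdir, "large_file.gz")
gz_out = os.path.join(workdir, "output.gz")
with gzip.open(gz_in, "wb") as f:     # toy input standing in for real data
    f.write(b"hello fastq\n")

# Equivalent of: zcat large_file.gz | ./tool | gzip - > output.gz
# ("tr a-z A-Z" stands in for ./tool; any stdin/stdout filter works the same way)
with open(gz_out, "wb") as out:
    zcat = subprocess.Popen(["gzip", "-cd", gz_in], stdout=subprocess.PIPE)
    tool = subprocess.Popen(["tr", "a-z", "A-Z"],
                            stdin=zcat.stdout, stdout=subprocess.PIPE)
    gz = subprocess.Popen(["gzip"], stdin=tool.stdout, stdout=out)
    zcat.stdout.close()   # allow SIGPIPE to propagate if a later stage exits
    tool.stdout.close()
    gz.wait()
    tool.wait()
    zcat.wait()

with gzip.open(gz_out, "rb") as f:
    result = f.read()
print(result)             # the tool's output, round-tripped through gzip
```

The data is decompressed, processed, and re-compressed in a single streaming pass, so no uncompressed copy ever lands on disk.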

Otherwise, you could look at Python's support for memory mapping:

http://docs.python.org/library/mmap.html
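For completeness, here is what Python's `mmap` looks like. Note that it maps the file's on-disk bytes, so mapping a gzipped file gives you the compressed bytes; it does not transparently decompress anything (the file name and contents below are illustrative):

```python
import mmap
import os
import tempfile

# Create a small uncompressed FASTQ-like file to map.
path = os.path.join(tempfile.mkdtemp(), "reads.fastq")
with open(path, "wb") as f:
    f.write(b"@read1\nACGT\n+\n!!!!\n")

with open(path, "rb") as f:
    # Length 0 maps the whole file; ACCESS_READ keeps it read-only.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Random access without reading the whole file into memory:
    first_line = mm[:mm.find(b"\n")]
    mm.close()

print(first_line)   # b'@read1'
```

This helps a Python script avoid reading huge files into memory, but it cannot by itself make an external tool see a gzipped file as uncompressed.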

Finally, you could convert your ASCII FASTQ files to BAM format, which is a more compact, compressed binary representation of the same data, so it will save you space. See the following:

http://picard.sourceforge.net/command-line-overview.shtml#FastqToSam

+2

Take a look at the winning entries in the Pistoia Alliance Sequence Squeeze competition, which evaluated FASTQ compression tools. You may find a tool that reduces I/O overhead by supporting random access and faster decompression.

+2

As described in this answer, you can load the entire uncompressed file into RAM:

 mkdir /mnt/ram
 mount -t ramfs ram /mnt/ram
 # uncompress your file to that directory
 ./tool /mnt/ram/yourdata

This, however, has the disadvantage of loading everything into RAM: you need to have enough memory to hold the uncompressed data!

Use umount /mnt/ram when you are done.
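A variation that needs no root privileges, assuming a typical Linux system: `/dev/shm` is usually a tmpfs that is already mounted and writable, so you can decompress into it directly from Python (the paths and toy input below are illustrative):

```python
import gzip
import os
import shutil
import tempfile

# /dev/shm is a RAM-backed tmpfs on most Linux systems (no root or mount
# needed), unlike ramfs, which requires a privileged mount.
ramdir = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()

gz_path = os.path.join(tempfile.mkdtemp(), "large_file.gz")
with gzip.open(gz_path, "wb") as f:           # toy input standing in for real data
    f.write(b"@read1\nACGT\n+\n!!!!\n")

# Decompress into the RAM-backed directory.
fd, plain = tempfile.mkstemp(dir=ramdir, suffix=".fastq")
os.close(fd)
with gzip.open(gz_path, "rb") as src, open(plain, "wb") as dst:
    shutil.copyfileobj(src, dst)

# "./tool <plain>" would now see a normal, seekable, re-readable file.
size = os.path.getsize(plain)
print(size)                                   # uncompressed size, held in RAM
os.unlink(plain)                              # free the RAM when done
```

The same memory caveat applies: the uncompressed data must fit in RAM, but tools that re-read or seek the file work unmodified.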

+2

You could write a FUSE file system driver if you are on Linux: http://pypi.python.org/pypi/fuse-python

The FUSE driver would need to compress and decompress the files transparently. Perhaps something like this already exists.

0

Source: https://habr.com/ru/post/1439370/

