Disclosure: I work full time on the Hail project.
Hello! Welcome to programming and bioinformatics!
amirouche correctly identifies that you need some kind of "streaming" or "linear" algorithm to process data that is too large to fit in your machine's RAM. Unfortunately, if you are limited to Python without libraries, you have to manually chunk the file and handle the VCF parsing yourself.
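To make that concrete, here is a minimal sketch of such a streaming approach in plain Python (no libraries). It reads the VCF line by line, so memory use stays constant, and tallies per-sample genotype counts. The simple GT parsing (diploid calls in the first FORMAT sub-field) and the file name are illustrative assumptions, not Hail code:

```python
import gzip

def tally_samples(lines):
    """Stream over VCF lines, returning per-sample (samples, nCalled, nHom, nHet)."""
    samples, n_called, n_hom, n_het = [], [], [], []
    for line in lines:
        if line.startswith("##"):
            continue  # meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            n_called = [0] * len(samples)
            n_hom = [0] * len(samples)
            n_het = [0] * len(samples)
            continue
        for i, call in enumerate(fields[9:]):
            gt = call.split(":", 1)[0]  # GT is first, per VCF convention
            alleles = gt.replace("|", "/").split("/")
            if "." in alleles:
                continue  # missing genotype: not called
            n_called[i] += 1
            if len(set(alleles)) == 1:
                n_hom[i] += 1  # both alleles equal: homozygous
            else:
                n_het[i] += 1
    return samples, n_called, n_hom, n_het

# Streaming usage on a gzipped VCF (constant memory):
#   with gzip.open("YOUR_FILE.vcf.gz", "rt") as vcf:
#       samples, n_called, n_hom, n_het = tally_samples(vcf)
```

This works, but you are re-implementing a VCF parser by hand and paying the full parsing cost on every run, which motivates the tools below.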
The Hail project is a free, open-source tool for scientists with genetic data that is too large to fit in RAM, all the way up to too large to fit on one machine (i.e. dozens of terabytes of compressed VCF data). Hail can use all the cores on one machine or all the cores on a cloud of machines. Hail runs on Mac OS X and most flavors of GNU/Linux. Hail provides a statistical genetics language that makes your question much shorter to express.
Easiest answer
The most faithful translation of your Python code to Hail is:
/path/to/hail importvcf -f YOUR_FILE.vcf.gz \
  annotatesamples expr -c \
    'sa.nCalled = gs.filter(g => g.isCalled).count(),
     sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
     sa.nHet = gs.filter(g => g.isHet).count()' \
  annotatesamples expr -c \
    'sa.pHom = sa.nHom / sa.nCalled,
     sa.pHet = sa.nHet / sa.nCalled' \
  exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
I ran the above command on my dual-core laptop against a 2.0GB file:
Wow! Seven minutes for 2GB, that's slow! Unfortunately, this is because VCFs aren't a great format for data analysis!
Storage Format Optimization
Let's convert to Hail's optimized storage format, the VDS, and re-run the command:
About five times faster! At a larger scale, running the same command on Google Cloud against the VDS representing the full VCF of the 1000 Genomes Project (2535 whole genomes, roughly 315GB compressed) took 3m42s using 328 worker cores.
Using Hail's built-ins
Hail also has a sampleqc command that computes most of what you want (and more):
../hail/build/install/hail/bin/hail read -i profile225.vds \
  sampleqc \
  annotatesamples expr -c \
    'sa.myqc.pHomRef = (sa.qc.nHomRef + sa.qc.nHomVar) / sa.qc.nCalled,
     sa.myqc.pHet = sa.qc.nHet / sa.qc.nCalled' \
  exportsamples -c 'sample = s, sa.myqc.*, nHom = sa.qc.nHomRef + sa.qc.nHomVar, nHet = sa.qc.nHet, nCalled = sa.qc.nCalled' -o sampleInfo.tsv

hail: info: running: read -i profile225.vds
[Stage 0:> (0 + 0) / 4]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html
Hail installation
Installing Hail is quite simple, and we have docs to help you. Need more help? You can get real-time support in the Hail users chat room or, if you prefer forums, the Hail discourse (both are linked from the home page; unfortunately I don't have enough reputation to create real links).
Near future
In the very near future (less than one month from today), the Hail command line will be replaced by a Python API, which will allow you to express the first snippet as:
result = importvcf("YOUR_FILE.vcf.gz")
  .annotatesamples(
    'sa.nCalled = gs.filter(g => g.isCalled).count(),
     sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
     sa.nHet = gs.filter(g => g.isHet).count()')
  .annotatesamples(
    'sa.pHom = sa.nHom / sa.nCalled,
     sa.pHet = sa.nHet / sa.nCalled')

for (x in result.sampleannotations):
  print("Sample " + x +
        " nCalled " + x.nCalled +
        " nHom " + x.nHom +
        " nHet " + x.nHet +
        " percent Hom " + x.pHom * 100 +
        " percent Het " + x.pHet * 100)

result.sampleannotations.write("sampleInfo.tsv")
EDIT: Added header output to the TSV file.
EDIT2: The latest Hail no longer requires biallelic variants for sampleqc.
EDIT3: Noted cloud scaling with hundreds of worker cores.