How to aggregate values from a gzip'ed CSV file that is larger than RAM?

To begin with, I am new to bioinformatics and to programming in particular, but I have created a script that goes through a so-called VCF file (only the individuals are included, one column = one person) and uses a search string to find out, for each variant (line), whether the person is homozygous or heterozygous.

This script works, at least on small subsets, but I know it stores everything in memory. I would like to run it on very large gzip'ed files (even whole genomes), but I don't know how to turn it into a script that does everything line by line (because I want to count over whole columns, I just don't see how to solve this).

So the output is 5 things per person: total variants, the number of homozygotes, the number of heterozygotes, and the proportions of homozygotes and heterozygotes. See the code below:

    #!/usr/bin/env python
    import re
    import gzip

    subset_cols = 'subset_cols_chr18.vcf.gz'
    #nuc_div = 'nuc_div_chr18.txt'

    gz_infile = gzip.GzipFile(subset_cols, "r")
    #gz_outfile = gzip.GzipFile(nuc_div, "w")

    # make a dictionary of the header line for easy retrieval of elements later on
    headers = gz_infile.readline().rstrip().split('\t')
    print headers

    column_dict = {}
    for header in headers:
        column_dict[header] = []

    # store every column of every line in memory, keyed by its header
    for line in gz_infile:
        columns = line.rstrip().split('\t')
        for i in range(len(columns)):
            c_header = headers[i]
            column_dict[c_header].append(columns[i])
    #print column_dict

    for key in column_dict:
        number_homozygotes = 0
        number_heterozygotes = 0
        for values in column_dict[key]:
            # this search string contains the regexp (this regexp was tested)
            SearchStr = '(\d)/(\d):\d+,\d+:\d+:\d+:\d+,\d+,\d+'
            Result = re.search(SearchStr, values)
            if Result:  # here, it will skip the missing genotypes ./.
                variant_one = int(Result.group(1))
                variant_two = int(Result.group(2))
                if variant_one == 0 and variant_two == 0:
                    continue
                elif variant_one == variant_two:
                    # count +1 in case variant one and two are equal (so 1/1, 2/2, etc.; 0/0 was skipped above)
                    number_homozygotes += 1
                elif variant_one != variant_two:
                    # count +1 in case variant one is not equal to variant two (so 1/0, 0/1, etc.)
                    number_heterozygotes += 1

        print "%s homozygotes %s" % (number_homozygotes, key)
        print "%s heterozygotes %s" % (number_heterozygotes, key)
        variants = number_homozygotes + number_heterozygotes
        print "%s variants" % variants
        prop_homozygotes = (1.0 * number_homozygotes / variants) * 100
        prop_heterozygotes = (1.0 * number_heterozygotes / variants) * 100
        print "%s %% homozygous %s" % (prop_homozygotes, key)
        print "%s %% heterozygous %s" % (prop_heterozygotes, key)

Any help would be greatly appreciated, so I can continue exploring large datasets, thanks :)

The VCF file, by the way, looks something like this:

    INDIVIDUAL_1             INDIVIDUAL_2            INDIVIDUAL_3
    0/0:9,0:9:24:0,24,221    1/0:5,4:9:25:25,0,26    1/1:0,13:13:33:347,33,0

That is a header line with the individual identifier names (I have 33 people in total, with more complex identifier tags; I simplified here), followed by a great many of these information lines with the same specific pattern. I am only interested in the first part with the slash, hence the regular expression.
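For example, a quick sanity check of that regex against one entry from the example line above (a minimal Python 3 sketch; the two capture groups are the two alleles):

    import re

    # the two capture groups are the alleles on either side of the slash
    genotype_re = re.compile(r'(\d)/(\d):\d+,\d+:\d+:\d+:\d+,\d+,\d+')

    match = genotype_re.search('1/0:5,4:9:25:25,0,26')
    print(match.group(1), match.group(2))  # -> 1 0, i.e. heterozygous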

+6
2 answers

Disclosure: I work full time on the Hail project.

Hello! Welcome to programming and bioinformatics!

amirouche correctly identifies that you need some kind of "streaming" or "line-by-line" algorithm to process data that is too large to fit in your machine's RAM. Unfortunately, if you are limited to Python without libraries, you have to chunk the file manually and handle the VCF parsing yourself.
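For what it's worth, Python's standard-library gzip module can already stream the decompressed lines lazily; a minimal sketch of that idea (the file name is taken from the question, and parsing each genotype entry is still up to you):

    import gzip

    def stream_rows(path):
        # 'rt' decompresses and decodes to text on the fly;
        # only the current line is held in memory.
        with gzip.open(path, 'rt') as handle:
            for line in handle:
                yield line.rstrip('\n').split('\t')

    rows = stream_rows('subset_cols_chr18.vcf.gz')
    headers = next(rows)   # header line with the individual names
    for row in rows:       # every following row is one variant
        pass               # update per-column counters here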

The Hail project is a free, open-source tool for scientists with genetic data ranging from too large to fit in RAM all the way to too large to fit on one machine (i.e. dozens of terabytes of compressed VCF data). Hail can use all the cores on one machine or all the cores on a cloud of machines. Hail runs on Mac OS X and most flavors of GNU/Linux. Hail also exposes a statistical genetics language that makes your question much shorter to express.

Easiest answer

The most faithful translation of your Python code into Hail is:

    /path/to/hail importvcf -f YOUR_FILE.vcf.gz \
      annotatesamples expr -c \
        'sa.nCalled = gs.filter(g => g.isCalled).count(),
         sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
         sa.nHet = gs.filter(g => g.isHet).count()' \
      annotatesamples expr -c \
        'sa.pHom = sa.nHom / sa.nCalled,
         sa.pHet = sa.nHet / sa.nCalled' \
      exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv

I ran the above command on my dual-core laptop against a 2.0 GB file:

    # ls -alh profile225.vcf.bgz
    -rw-r--r--  1 dking  1594166068   2.0G Aug 25 15:43 profile225.vcf.bgz
    # ../hail/build/install/hail/bin/hail importvcf -f profile225.vcf.bgz \
        annotatesamples expr -c \
          'sa.nCalled = gs.filter(g => g.isCalled).count(),
           sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
           sa.nHet = gs.filter(g => g.isHet).count()' \
        annotatesamples expr -c \
          'sa.pHom = sa.nHom / sa.nCalled,
           sa.pHet = sa.nHet / sa.nCalled' \
        exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
    hail: info: running: importvcf -f profile225.vcf.bgz
    [Stage 0:=======================================================> (63 + 2) / 65]
    hail: info: Coerced sorted dataset
    hail: info: running: annotatesamples expr -c 'sa.nCalled = gs.filter(g => g.isCalled).count(), sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(), sa.nHet = gs.filter(g => g.isHet).count()'
    [Stage 1:========================================================>(64 + 1) / 65]
    hail: info: running: annotatesamples expr -c 'sa.pHom = sa.nHom / sa.nCalled, sa.pHet = sa.nHet / sa.nCalled'
    hail: info: running: exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
    hail: info: while importing: file:/Users/dking/projects/hail-data/profile225.vcf.bgz  import clean
    hail: info: timing:
      importvcf: 34.211s
      annotatesamples expr: 6m52.4s
      annotatesamples expr: 21.399ms
      exportsamples: 121.786ms
      total: 7m26.8s
    # head sampleInfo.tsv
    sample   pHomRef      pHet         nHom    nHet   nCalled
    HG00096  9.49219e-01  5.07815e-02  212325  11359  223684
    HG00097  9.28419e-01  7.15807e-02  214035  16502  230537
    HG00099  9.27182e-01  7.28184e-02  211619  16620  228239
    HG00100  9.19605e-01  8.03948e-02  214554  18757  233311
    HG00101  9.28714e-01  7.12865e-02  214283  16448  230731
    HG00102  9.24274e-01  7.57260e-02  212095  17377  229472
    HG00103  9.36543e-01  6.34566e-02  209944  14225  224169
    HG00105  9.29944e-01  7.00564e-02  214153  16133  230286
    HG00106  9.25831e-01  7.41687e-02  213805  17128  230933

Wow! Seven minutes for 2 GB, that's slow! Unfortunately, that is because VCF is not a great format for data analysis!

Storage Format Optimization

Let's convert to Hail's optimized storage format, the VDS, and re-run the command:

    # ../hail/build/install/hail/bin/hail importvcf -f profile225.vcf.bgz write -o profile225.vds
    hail: info: running: importvcf -f profile225.vcf.bgz
    [Stage 0:========================================================>(64 + 1) / 65]
    hail: info: Coerced sorted dataset
    hail: info: running: write -o profile225.vds
    [Stage 1:>                                                          (0 + 4) / 65]
    [Stage 1:========================================================>(64 + 1) / 65]
    # ../hail/build/install/hail/bin/hail read -i profile225.vds \
        annotatesamples expr -c \
          'sa.nCalled = gs.filter(g => g.isCalled).count(),
           sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
           sa.nHet = gs.filter(g => g.isHet).count()' \
        annotatesamples expr -c \
          'sa.pHom = sa.nHom / sa.nCalled,
           sa.pHet = sa.nHet / sa.nCalled' \
        exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
    hail: info: running: read -i profile225.vds
    [Stage 1:>                                                           (0 + 0) / 4]
    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
    SLF4J: Defaulting to no-operation (NOP) logger implementation
    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
    [Stage 1:============================================>               (3 + 1) / 4]
    hail: info: running: annotatesamples expr -c 'sa.nCalled = gs.filter(g => g.isCalled).count(), sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(), sa.nHet = gs.filter(g => g.isHet).count()'
    [Stage 2:========================================================>(64 + 1) / 65]
    hail: info: running: annotatesamples expr -c 'sa.pHom = sa.nHom / sa.nCalled, sa.pHet = sa.nHet / sa.nCalled'
    hail: info: running: exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
    hail: info: timing:
      read: 2.969s
      annotatesamples expr: 1m20.4s
      annotatesamples expr: 21.868ms
      exportsamples: 151.829ms
      total: 1m23.5s

About five times faster! At a larger scale, running the same command on Google Cloud against the VDS representing the full VCF of the 1000 Genomes Project (2,535 whole genomes, about 315 GB compressed) took 3m42s using 328 worker cores.

Using the Hail built-ins

Hail also has a sampleqc command that computes most of what you want (and more):

    ../hail/build/install/hail/bin/hail read -i profile225.vds \
        sampleqc \
        annotatesamples expr -c \
          'sa.myqc.pHomRef = (sa.qc.nHomRef + sa.qc.nHomVar) / sa.qc.nCalled,
           sa.myqc.pHet = sa.qc.nHet / sa.qc.nCalled' \
        exportsamples -c 'sample = s, sa.myqc.*, nHom = sa.qc.nHomRef + sa.qc.nHomVar, nHet = sa.qc.nHet, nCalled = sa.qc.nCalled' -o sampleInfo.tsv
    hail: info: running: read -i profile225.vds
    [Stage 0:>                                                           (0 + 0) / 4]
    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
    SLF4J: Defaulting to no-operation (NOP) logger implementation
    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
    [Stage 1:============================================>               (3 + 1) / 4]
    hail: info: running: sampleqc
    [Stage 2:========================================================>(64 + 1) / 65]
    hail: info: running: annotatesamples expr -c 'sa.myqc.pHomRef = (sa.qc.nHomRef + sa.qc.nHomVar) / sa.qc.nCalled, sa.myqc.pHet= sa.qc.nHet / sa.qc.nCalled'
    hail: info: running: exportsamples -c 'sample = s, sa.myqc.*, nHom = sa.qc.nHomRef + sa.qc.nHomVar, nHet = sa.qc.nHet, nCalled = sa.qc.nCalled' -o sampleInfo.tsv
    hail: info: timing:
      read: 2.928s
      sampleqc: 1m27.0s
      annotatesamples expr: 229.653ms
      exportsamples: 353.942ms
      total: 1m30.5s

Hail installation

Installing Hail is quite simple, and we have docs to help you. Need more help? You can get real-time support in the Hail users chat room or, if you prefer forums, the Hail discourse (both are linked from the homepage; unfortunately, I do not have enough reputation to create real links).

Near future

In the very near future (less than one month from today), Hail will gain a Python API, which will allow you to express the first snippet as:

    result = importvcf("YOUR_FILE.vcf.gz")
        .annotatesamples(
            'sa.nCalled = gs.filter(g => g.isCalled).count(),
             sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
             sa.nHet = gs.filter(g => g.isHet).count()')
        .annotatesamples(
            'sa.pHom = sa.nHom / sa.nCalled,
             sa.pHet = sa.nHet / sa.nCalled')

    for (x in result.sampleannotations):
        print("Sample " + x +
              " nCalled " + x.nCalled +
              " nHom " + x.nHom +
              " nHet " + x.nHet +
              " percent Hom " + x.pHom * 100 +
              " percent Het " + x.pHet * 100)

    result.sampleannotations.write("sampleInfo.tsv")

EDIT: Added the head output of the tsv file.

EDIT2: The latest Hail no longer requires biallelic variants for sampleqc.

EDIT3: Noted the cloud scaling run with hundreds of worker cores.

+7

To be able to process a dataset larger than RAM, you need to rework your algorithm to process the data row by row; right now, you process it column by column.

But before that, you need a way to stream rows out of the gzip'ed file.

The following Python 3 code does this:

    #!/usr/bin/env python3
    import zlib
    from mmap import PAGESIZE

    CHUNKSIZE = PAGESIZE


    # This is a generator that yields *decompressed* chunks from
    # a gzip file. This is also called a stream or lazy list.
    # It is done like this to avoid having the whole file in memory.
    # Read more about Python generators to understand how it works,
    # cf. the `yield` keyword.
    def gzip_to_chunks(filename):
        decompressor = zlib.decompressobj(zlib.MAX_WBITS + 16)

        with open(filename, 'rb') as f:
            chunk = f.read(CHUNKSIZE)
            while chunk:
                out = decompressor.decompress(chunk)
                yield out.decode('utf-8')
                chunk = f.read(CHUNKSIZE)
            out = decompressor.flush()
            yield out.decode('utf-8')


    # Again, the following is a generator (see the `yield` keyword).
    # What it does is iterate over an *iterable* of strings and yield
    # rows from the file.
    # (hint: `gzip_to_chunks(filename)` returns a generator of strings)
    # (hint: a generator is also an iterable)
    # You can verify that by calling `chunks_to_rows` with a list of
    # strings, where every string is a chunk of the VCF file.
    # (hint: a list is also an iterable)
    # inline doc follows
    def chunks_to_rows(chunks):
        row = ''  # we will add the chars making up a single row to this variable
        for chunk in chunks:  # iterate over the strings/chunks yielded by gzip_to_chunks
            for char in chunk:  # iterate over all chars of the string
                if char == '\n':  # hey! this is the end of the row!
                    yield row.split('\t')  # the row is complete, yield!
                    row = ''  # start a new row
                else:
                    row += char  # otherwise we are in the middle of the row
            # at this point the program has read all of the chunk
        # at this point the program has read all of the file without ever loading it fully into memory
        # That said, there may still be something left in row
        if row:
            yield row.split('\t')  # yield the very last row if any


    [print(e) for e in chunks_to_rows(gzip_to_chunks('file.csv.gz'))]

The first function, gzip_to_chunks, is basically boilerplate around the Python zlib API, and IMO something like it should ship with Python itself. The second function takes the output of the first as its input; if you look carefully at the last line above, you can recognize the mathy f ∘ g (function composition) pattern. In any case, to better understand chunks_to_rows, here is an example REPL run:

    chunks = [
        'simple\trow\n',                  # here a full csv row is contained in one chunk
        'a row split into two chunks\t',  # here the chunk starts a row...
        'it still works\n'                # ... and the row ends in the third chunk
    ]

    [print(e) for e in chunks_to_rows(chunks)]

The above will print:

    ['simple', 'row']
    ['a row split into two chunks', 'it still works']

One specific point worth stressing is that the chunks_to_rows algorithm is smart enough not to care about how the rows are spread across the chunks. For example, the following chunks return what you expect:

    >>> chunks = ['A\tSingle\tChunk\nWith\tTwo\tRows']
    >>> [print(e) for e in chunks_to_rows(chunks)]
    ['A', 'Single', 'Chunk']
    ['With', 'Two', 'Rows']

Another contrived example:

    >>> chunks = list('each chunk will be one char\tand it still works')
    >>> [print(e) for e in chunks_to_rows(chunks)]
    ['each chunk will be one char', 'and it still works']

Then I converted your algorithm to read the data row by row, where each column is one individual; the aggregation is still done per column, as the rows stream by:

    #!/usr/bin/env python3
    import re


    # We only need to count `homozygotes` and `heterozygotes` for each column,
    # since the other 3 values are computed from those.
    # (a small mutable class: a namedtuple would not let us increment the counters in place)
    class Result:
        def __init__(self, header):
            self.header = header
            self.homozygotes = 0
            self.heterozygotes = 0


    def filename_to_rows(filename):
        # XXX: functions defined previously
        yield from chunks_to_rows(gzip_to_chunks(filename))


    filename = 'subset_cols_chr18.vcf.gz'
    rows = filename_to_rows(filename)
    headers = next(rows)  # fetch the first row

    # prepare the results where the data will be aggregated
    out = dict()
    for header in headers:
        out[header] = Result(header)

    # sum all rows for each column
    for row in rows:
        for index, column in enumerate(row):
            # retrieve the result object where we aggregate the total
            header = headers[index]
            result = out[header]
            # parse the entry, FIXME: it is faster to use re.compile outside the loops
            regex = r'(\d)/(\d):\d+,\d+:\d+:\d+:\d+,\d+,\d+'
            match = re.search(regex, column)
            if match:
                variant_one = int(match.group(1))
                variant_two = int(match.group(2))
                if variant_one == 0 and variant_two == 0:
                    continue
                elif variant_one == variant_two:
                    result.homozygotes += 1
                else:  # variant_one != variant_two
                    result.heterozygotes += 1

    # print all the numbers
    for header, result in out.items():
        print('* ', header)
        print('** homozygotes ', result.homozygotes)
        print('** heterozygotes ', result.heterozygotes)
        variants = result.homozygotes + result.heterozygotes
        print('** variants ', variants)
        print('** homozygous %s%%' % ((1.0 * result.homozygotes / variants) * 100))
        print('** heterozygous %s%%' % ((1.0 * result.heterozygotes / variants) * 100))
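As the FIXME comment hints, it is cheaper to compile the regex once outside the loops; a minimal sketch of that tweak (the parse_genotype helper is only an illustration, not part of the script above):

    import re

    # compile once, reuse for every entry (addresses the FIXME above)
    GENOTYPE_RE = re.compile(r'(\d)/(\d):\d+,\d+:\d+:\d+:\d+,\d+,\d+')

    def parse_genotype(entry):
        """Return (variant_one, variant_two), or None for missing genotypes like ./."""
        match = GENOTYPE_RE.search(entry)
        if match is None:
            return None
        return int(match.group(1)), int(match.group(2))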

HTH!

0
