Using numpy to filter multiple comment characters

Question


I am looking for a way to extract data from a file with multiple comment characters. The input file looks like this:

    # filename: sample.txt
    # Comment 1
    # Comment 2
    $ Comment 3
    1,10
    2,20
    3,30
    4,40
    # Comment 4

I can strip one type of comment with the following code, but I cannot find documentation on how to strip both.

    import numpy as np
    data = np.loadtxt('sample.txt', comments="#")
    # I need to also filter out '$'

Are there any alternative methods that I could use for this?

+6

python numpy

tirefire Jun 18 '14 at 7:24


6 answers

I would create a generator that skips the comment lines, and then pass it to np.genfromtxt():

    gen = (r for r in open('sample.txt') if r[0] not in ('$', '#'))
    a = np.genfromtxt(gen, delimiter=',')

+3

Saullo castro Jun 18 '14 at 8:07


Just use a list of comments, for example:

    data = np.loadtxt('sample.txt', comments=['#', '$', '@'], delimiter=',')
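To try this without a file on disk, here is a minimal sketch. It assumes a NumPy release whose `loadtxt` accepts a sequence for `comments`, and it uses an in-memory `io.StringIO` as a stand-in for `sample.txt`. The `delimiter=','` argument is included because the sample rows are comma-separated.

```python
import io

import numpy as np

# In-memory stand-in for sample.txt (an assumption for this demo).
sample = io.StringIO(
    "# Comment 1\n"
    "$ Comment 3\n"
    "1,10\n"
    "2,20\n"
    "3,30\n"
    "4,40\n"
    "# Comment 4\n"
)

# A list of comment markers; delimiter="," because the rows are comma-separated.
data = np.loadtxt(sample, comments=["#", "$"], delimiter=",", dtype=int)
print(data.shape)
```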

+2

Vladas O. Feb 02 '16 at 20:54


Since each of your lines contains either a comment or data, I would preprocess the file before handing it to numpy, removing the comment lines with a regular expression.

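    import re
    from io import StringIO  # Python 3; on Python 2 this was the StringIO module
    import numpy as np

    with open('sample.txt', 'r') as f:
        # Drop whole lines whose first non-blank character is '#' or '$'.
        data = re.sub(r'^[ \t]*[#$].*\n?', '', f.read(), flags=re.MULTILINE)
    data = np.genfromtxt(StringIO(data), dtype=int, delimiter=',')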

This gives you the desired numpy array. Note that this approach also works if a comment line happens to begin with whitespace before the comment character.

+1

timgeb Jun 18 '14 at 7:39


I looked at the numpy.loadtxt code (as of v1.8.1), and it cannot take more than one comment string because it uses str.split: https://github.com/numpy/numpy/blob/v1.8.1/numpy/lib/npyio.py#L790

I think you can read the file line by line, check whether each line is a comment, and pass the remaining lines to numpy.fromstring.
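A minimal sketch of that line-by-line idea, assuming comment lines start with `#` or `$` and data lines hold comma-separated integers. The helper name `load_skipping_comments` is hypothetical, and an in-memory `io.StringIO` stands in for `sample.txt`.

```python
import io

import numpy as np


def load_skipping_comments(lines, comment_chars=("#", "$")):
    """Parse comma-separated rows, skipping blank and comment lines."""
    rows = []
    for line in lines:
        stripped = line.strip()
        # str.startswith accepts a tuple of prefixes.
        if not stripped or stripped.startswith(tuple(comment_chars)):
            continue
        # Text-mode fromstring: sep="," splits the line into numbers.
        rows.append(np.fromstring(stripped, dtype=int, sep=","))
    return np.array(rows)


# In-memory stand-in for sample.txt (an assumption for this demo).
sample = io.StringIO(
    "# Comment 1\n$ Comment 3\n1,10\n2,20\n3,30\n4,40\n# Comment 4\n"
)
result = load_skipping_comments(sample)
print(result)
```

Passing an open file object instead of the `StringIO` works the same way, since both yield one line per iteration.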

0

David marek Jun 18 '14 at 7:35


If you want to keep the full power of loadtxt, you can modify its source to suit your needs. As David Marek noted, comments are removed on this line:

 line = asbytes(line).split(comments)[0].strip(asbytes('\r\n'))

becomes:

    for com in comments:
        line = asbytes(line).split(com)[0]
    line = line.strip(asbytes('\r\n'))

You will also need to change L717:

 comments = asbytes(comments)

turns into:

 comments = [asbytes(com) for com in comments]

If you want to stay compatible with callers that pass a single string, wrap it in a list first:

    if isinstance(comments, basestring):
        comments = [comments]
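The effect of the modified loop can be tried without touching NumPy's source. This is only a sketch on plain `str` lines (the `asbytes` calls are left out), and `strip_comments` is a hypothetical helper, not part of NumPy.

```python
def strip_comments(line, comments=("#", "$")):
    """Cut a line at the first occurrence of any comment marker."""
    # Accept a single string as well as a sequence of markers.
    if isinstance(comments, str):
        comments = [comments]
    for com in comments:
        line = line.split(com)[0]
    # Remove only the line ending, as the original loadtxt line does.
    return line.strip("\r\n")


examples = ["# Comment 1\n", "$ Comment 3\n", "1,10\n", "4,40 # trailing\n"]
cleaned = [strip_comments(line) for line in examples]
print(cleaned)
```

Comment-only lines collapse to empty strings, which a caller would then skip. Trailing comments after data are cut off as well.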

0

Davidmh Jun 18 '14 at 13:51


Fredrik pihl · Accepted Answer · 2014-06-18T07:45:41+0000

For this case you need to fall back on a standard Python loop over the input, something like this:

    data = []
    with open("input.txt") as fd:
        for line in fd:
            if line.startswith('#') or line.startswith('$'):
                continue
            data.append([int(x) for x in line.strip().split(',')])
    print(data)

output:

 [[1, 10], [2, 20], [3, 30], [4, 40]]
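Since the question asks for NumPy data, a list of rows like this can be handed to `np.array`. A short sketch, with the output above written as a literal and the variable names chosen only for illustration:

```python
import numpy as np

# Rows as produced by the loop above.
rows = [[1, 10], [2, 20], [3, 30], [4, 40]]

arr = np.array(rows)
# Column slices give each field as a 1-D array.
first_col, second_col = arr[:, 0], arr[:, 1]
print(arr.shape)
```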
