Using numpy to filter multiple comment characters

I am looking for a way to extract data from a file with multiple comment characters. The input file looks like this:

    # filename: sample.txt
    # Comment 1
    # Comment 2
    $ Comment 3
    1,10
    2,20
    3,30
    4,40
    # Comment 4

I can strip one type of comment with the following code, but I cannot find any documentation on how to strip both.

    import numpy as np

    # the values are comma-separated, so a delimiter is needed
    data = np.loadtxt('sample.txt', comments="#", delimiter=',')
    # I need to also filter out '$'

Are there any alternative methods that I could use for this?

+6
6 answers

For this case you need to fall back on a plain Python loop over the input, for example something like this:

    data = []
    with open("sample.txt") as fd:
        for line in fd:
            # skip comment lines starting with '#' or '$'
            if line.startswith(('#', '$')):
                continue
            data.append([int(x) for x in line.strip().split(',')])
    print(data)

output:

 [[1, 10], [2, 20], [3, 30], [4, 40]] 
+2

I would create a generator that skips the comment lines, and then pass it to np.genfromtxt():

    import numpy as np

    # yield only the lines that do not start with a comment character
    gen = (r for r in open('sample.txt') if r[0] not in ('$', '#'))
    a = np.genfromtxt(gen, delimiter=',')
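
A slightly more defensive variant of the same idea, as a sketch only: the helper name `data_lines` and its `markers` parameter are made up for illustration, and it assumes a comment always occupies a whole line. It closes the file once the lines are consumed and also skips blank lines and indented comment lines.

```python
import numpy as np

def data_lines(path, markers=("#", "$")):
    """Yield only the data lines of `path` (hypothetical helper)."""
    with open(path) as fh:
        for line in fh:
            stripped = line.strip()
            # Skip blank lines and lines whose first non-blank
            # character is one of the comment markers.
            if stripped and not stripped.startswith(tuple(markers)):
                yield line

# Usage, assuming sample.txt from the question exists:
# a = np.genfromtxt(data_lines("sample.txt"), delimiter=",")
```

Wrapping the loop in a function keeps the `with` block, so the file handle is not left open the way it is with a bare `open()` inside a generator expression.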
+3

Just use a list of comments, for example:

    data = np.loadtxt('sample.txt', comments=['#', '$', '@'], delimiter=',')
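
To try this without touching the disk, here is a small self-contained sketch. It assumes a NumPy release whose `loadtxt` accepts a sequence for `comments` (older releases such as 1.8.1 only take a single string), and it feeds the sample data in through `StringIO` instead of a real file.

```python
from io import StringIO

import numpy as np

# In-memory copy of the sample data from the question.
text = """# Comment 1
# Comment 2
$ Comment 3
1,10
2,20
3,30
4,40
# Comment 4
"""

# Every marker in the list starts a comment; the values are comma-separated.
data = np.loadtxt(StringIO(text), comments=["#", "$"], delimiter=",")
print(data.shape)  # (4, 2)
```

Note that `loadtxt` returns floats by default; pass `dtype=int` if integer values are wanted.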
+2

Since each of your lines contains either a comment or data, I would read the whole file first and strip the comment lines with a regular expression before handing the rest to numpy.

    import re
    from io import StringIO
    import numpy as np

    with open('sample.txt', 'r') as f:
        # drop every line whose first non-blank character is '#' or '$'
        data = re.sub(r'^[ \t]*[#$].*\n?', '', f.read(), flags=re.MULTILINE)
    data = np.genfromtxt(StringIO(data), dtype=int, delimiter=',')

This will give you the desired numpy data array. Note that this approach also works if a line happens to begin with whitespace followed by one of the comment characters.

+1

I looked at the numpy.loadtxt code (v1.8.1), and it cannot handle more than one comment marker because it calls str.split with the single comments string: https://github.com/numpy/numpy/blob/v1.8.1/numpy/lib/npyio.py#L790

I think you can read the file line by line, check whether each line is a comment, and pass the remaining lines to numpy.fromstring.
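
A minimal sketch of that suggestion, under a few assumptions: the function name `load_skipping_comments` and its parameters are hypothetical, comments always occupy a whole line, and every data line has the same number of fields. It uses `np.fromstring` in text mode (with an explicit `sep`) to parse each data line.

```python
import numpy as np

def load_skipping_comments(path, markers=("#", "$"), sep=","):
    """Parse `path` line by line, skipping comment lines (hypothetical helper)."""
    rows = []
    with open(path) as fh:
        for line in fh:
            stripped = line.strip()
            # Ignore blank lines and lines that start with a comment marker.
            if not stripped or stripped.startswith(tuple(markers)):
                continue
            # Text-mode fromstring: split on `sep` and convert to floats.
            rows.append(np.fromstring(stripped, dtype=float, sep=sep))
    # Stack the per-line 1-D arrays into one 2-D array
    # (np.vstack raises ValueError if no data lines were found).
    return np.vstack(rows)

# Usage, assuming sample.txt from the question exists:
# data = load_skipping_comments("sample.txt")
```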

0

If you want to keep the full power of loadtxt, you can simply patch it to suit your needs. As David Marek noted, the line where comments are removed is this:

 line = asbytes(line).split(comments)[0].strip(asbytes('\r\n')) 

becomes:

    for com in comments:
        line = asbytes(line).split(com)[0]
    line = line.strip(asbytes('\r\n'))

You will also need to change line 717:

 comments = asbytes(comments) 

turns into:

 comments = [asbytes(com) for com in comments] 

If you want to keep full compatibility with a single comment string, wrap it in a list before that conversion:

 if isinstance(comments, basestring): comments = [comments] 
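
To see what the patched logic does without editing NumPy itself, here is a standalone sketch of the same comment-stripping step. The function name `strip_comments` is made up for illustration, and it works on `str` rather than on the bytes that `loadtxt` used internally in that version.

```python
def strip_comments(line, comments=("#", "$")):
    """Cut `line` at the first occurrence of any comment marker."""
    if isinstance(comments, str):
        # Accept a single marker as well as a sequence of markers.
        comments = [comments]
    for com in comments:
        # Keep only the part of the line before this marker.
        line = line.split(com)[0]
    # Remove the line ending, as the original loadtxt code does.
    return line.strip("\r\n")

lines = ["# Comment 1\n", "$ Comment 3\n", "1,10\n", "2,20 # trailing note\n"]
cleaned = [strip_comments(raw) for raw in lines]
# Lines that were entirely a comment are now empty and can be dropped.
data_rows = [c for c in cleaned if c.strip()]
```

Because each marker is applied with `split`, this also handles trailing comments after data on the same line, which the line-skipping approaches do not.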
0

Source: https://habr.com/ru/post/970987/

