A specific way to parse alphanumeric CSVs in Python with scipy / numpy

I am trying to find a good, flexible way to parse CSV files in Python, but none of the standard options seem to fit the bill. I am tempted to write my own parser, but I suspect that some combination of what already exists in numpy / scipy and the csv module can do what I need, and I do not want to reinvent the wheel.

I would like the standard features: specifying the delimiter, saying whether there is a header line, how many lines to skip, the comment character, which columns to ignore, and so on. The central feature I am missing is parsing CSV files in a way that gracefully handles both string data and numeric data. Many of my CSV files have columns that contain strings (not all of the same length) alongside columns of numeric data. I would like to get the numeric data as a numpy array while still being able to access the strings. For example, suppose my file looks like this (assume the columns are separated by tabs):

# my file
name  favorite_integer  favorite_float1  favorite_float2  short_description
johnny  5  60.2  0.52  johnny likes fruitflies
bob  1  17.52  0.001  bob, bobby, robert

data = loadcsv('myfile.csv', delimiter='\t', parse_header=True, comment='#')

I would like to access data in two ways:

  • As a matrix of the numeric values, i.e. a numpy.array, for example:

    floats_and_ints = data.matrix

    floats_and_ints[:, 0] # access the integers

    floats_and_ints[:, 1:3] # access some of the floats

    transpose(floats_and_ints) # etc..

  • By column name, dictionary-style, for example:

    data['favorite_float1'] # get all the values of the column with header "favorite_float1"

    data['name'] # get all the names of the rows

, favorite_float1 , .

It should also be possible to iterate over the rows, for example:

for row in data:
  # print the name and favorite integer of each row
  print("Name:", row["name"], row["favorite_integer"])

(1) numpy.array, , , , , .

(2) , , . csv, , . - numpy.array.

Is there a way to do this with the existing csv/numpy/scipy tools?

, :

  • , , ..
  • numpy.array/matrix , .
  • ( )
+3
4

Have a look at pandas, which is built on top of numpy. For example:

In [7]: df = pd.read_csv('data.csv', sep='\t', index_col='name')
In [8]: df
Out[8]: 
        favorite_integer  favorite_float1  favorite_float2        short_description
name                                                                               
johnny                 5            60.20            0.520  johnny likes fruitflies
bob                    1            17.52            0.001       bob, bobby, robert
In [9]: df.describe()
Out[9]: 
       favorite_integer  favorite_float1  favorite_float2
count          2.000000         2.000000         2.000000
mean           3.000000        38.860000         0.260500
std            2.828427        30.179317         0.366988
min            1.000000        17.520000         0.001000
25%            2.000000        28.190000         0.130750
50%            3.000000        38.860000         0.260500
75%            4.000000        49.530000         0.390250
max            5.000000        60.200000         0.520000
In [13]: df.loc['johnny', 'favorite_integer']
Out[13]: 5
In [15]: df['favorite_float1'] # or attribute: df.favorite_float1
Out[15]: 
name
johnny    60.20
bob       17.52
Name: favorite_float1
In [16]: df['mean_favorite'] = df.mean(axis=1, numeric_only=True)
In [17]: df.iloc[:, 3:]
Out[17]: 
              short_description  mean_favorite
name                                          
johnny  johnny likes fruitflies      21.906667
bob          bob, bobby, robert       6.173667
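
The two access patterns from the question (a numeric matrix and row-wise iteration) can be sketched with pandas as follows. This is a minimal sketch, not part of the session above: the sample text is copied from the question, while `TEXT` and the variable names are assumptions made for illustration.

```python
import io

import pandas as pd

# Sample data copied from the question (tab-separated, with a comment line).
TEXT = (
    "# my file\n"
    "name\tfavorite_integer\tfavorite_float1\tfavorite_float2\tshort_description\n"
    "johnny\t5\t60.2\t0.52\tjohnny likes fruitflies\n"
    "bob\t1\t17.52\t0.001\tbob, bobby, robert\n"
)

# comment="#" drops the "# my file" line; the next line is used as the header.
df = pd.read_csv(io.StringIO(TEXT), sep="\t", comment="#", index_col="name")

# Numeric columns only, as a plain 2-D numpy array (ints are upcast to float).
floats_and_ints = df.select_dtypes(include="number").to_numpy()
integers = floats_and_ints[:, 0]
some_floats = floats_and_ints[:, 1:3]

# Row-wise iteration, with each row addressable by column name.
for name, row in df.iterrows():
    print("Name:", name, row["favorite_integer"])
```

The string columns stay available through `df["short_description"]` and the index, so nothing is lost by pulling the numeric part out as an array.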
+4

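matplotlib.mlab.csv2rec returns a numpy recarray, so you can do everything with it that you would do with any numpy array. The individual rows, being record instances, can be indexed by position, by attribute, or by column name: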

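import matplotlib.mlab  # csv2rec lives here in older matplotlib releases; it has since been removed

rows = matplotlib.mlab.csv2rec('data.csv')
row = rows[0]

print(row[0])        # access by position
print(row.name)      # access by attribute
print(row['name'])   # access by column name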

csv2rec " ", numpy.genfromtxt.

, , csv2rec csv.reader numpy.genfromtxt.

+2

numpy.genfromtxt()
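
A minimal sketch of how `numpy.genfromtxt` might be applied to the file from the question. The sample text is taken from the question; `TEXT`, `numeric_names` and the other variable names are assumptions for illustration. Note that `skip_header=1` is used to step over the `# my file` line so that `names=True` picks up the real header row.

```python
import io

import numpy as np

# Sample data copied from the question (tab-separated, with a comment line).
TEXT = (
    "# my file\n"
    "name\tfavorite_integer\tfavorite_float1\tfavorite_float2\tshort_description\n"
    "johnny\t5\t60.2\t0.52\tjohnny likes fruitflies\n"
    "bob\t1\t17.52\t0.001\tbob, bobby, robert\n"
)

data = np.genfromtxt(
    io.StringIO(TEXT),
    delimiter="\t",
    skip_header=1,     # skip the "# my file" line
    names=True,        # take field names from the header row
    dtype=None,        # infer a type per column
    encoding="utf-8",  # return str rather than bytes for text columns
)

# Column access by header name on the structured array.
names = data["name"]
float1 = data["favorite_float1"]

# Stack the numeric fields into an ordinary 2-D float array.
numeric_names = [n for n in data.dtype.names
                 if np.issubdtype(data.dtype[n], np.number)]
floats_and_ints = np.column_stack([data[n] for n in numeric_names]).astype(float)
```

The structured array keeps strings and numbers side by side, and iterating over `data` yields one record per row, each indexable by field name.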

0

stdlib csv.DictReader?
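
For comparison, a rough sketch of the `csv.DictReader` route using only the standard library. The `CONVERTERS` mapping is a hypothetical helper introduced here, since `DictReader` itself returns every field as a string; the sample text is copied from the question.

```python
import csv
import io

# Sample data copied from the question (tab-separated, with a comment line).
TEXT = (
    "# my file\n"
    "name\tfavorite_integer\tfavorite_float1\tfavorite_float2\tshort_description\n"
    "johnny\t5\t60.2\t0.52\tjohnny likes fruitflies\n"
    "bob\t1\t17.52\t0.001\tbob, bobby, robert\n"
)

# Hypothetical per-column converters; columns not listed stay strings.
CONVERTERS = {
    "favorite_integer": int,
    "favorite_float1": float,
    "favorite_float2": float,
}

# DictReader has no comment handling, so filter comment lines beforehand.
lines = (line for line in io.StringIO(TEXT) if not line.startswith("#"))

rows = []
for raw in csv.DictReader(lines, delimiter="\t"):
    rows.append({key: CONVERTERS.get(key, str)(value)
                 for key, value in raw.items()})

for row in rows:
    print("Name:", row["name"], row["favorite_integer"])
```

This gives row-wise access by column name, but the numeric columns would still have to be collected into a numpy array by hand.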

0

Source: https://habr.com/ru/post/1748677/

