Load a CSV file into NumPy and access columns by name

I have a CSV file with headers. This test.csv file:

```
"A","B","C","D","E","F","timestamp"
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291111964948E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291113113366E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291120650486E12
```

I want to load it as an ndarray matrix with 3 rows and 7 columns, and I also want to access the column vectors by column name. If I use genfromtxt (as shown below), I get a structured ndarray with 3 rows (one per line) and no columns.

```
r = np.genfromtxt('test.csv', delimiter=',', dtype=None, names=True)
print r
print r.shape

[(611.88243, 9089.5601000000006, 5133.0, 864.07514000000003, 1715.3747599999999, 765.22776999999996, 1291111964948.0)
 (611.88243, 9089.5601000000006, 5133.0, 864.07514000000003, 1715.3747599999999, 765.22776999999996, 1291113113366.0)
 (611.88243, 9089.5601000000006, 5133.0, 864.07514000000003, 1715.3747599999999, 765.22776999999996, 1291120650486.0)]
(3,)
```

I can get column vectors from column names as follows:

```
print r['A']
[ 611.88243  611.88243  611.88243]
```

If I use loadtxt, I get an array with 3 rows and 7 columns, but cannot access columns by column name (as shown below).

```
numpy.loadtxt(open("test.csv","rb"), delimiter=",", skiprows=1)
```

I get

```
[[611.88243, 9089.5601, 5133.0, 864.07514, 1715.37476, 765.22777, 1.291111964948e+12]
 [611.88243, 9089.5601, 5133.0, 864.07514, 1715.37476, 765.22777, 1.291113113366e+12]
 [611.88243, 9089.5601, 5133.0, 864.07514, 1715.37476, 765.22777, 1.291120650486e+12]]
```

Is there any approach in Python that achieves both requirements together (access columns by column name like np.genfromtxt, and have a matrix like np.loadtxt)?

2 answers

Using only numpy, the options you show are your only options: either an ndarray of uniform dtype with shape (3, 7), or a structured array of (potentially) heterogeneous dtype with shape (3,).

If you really need a data structure with labeled columns and shape (3, 7) (and many other useful properties), you can use a pandas DataFrame:

```
In [67]: import pandas as pd

In [68]: df = pd.read_csv('data'); df
Out[68]:
           A          B     C          D           E          F     timestamp
0  611.88243  9089.5601  5133  864.07514  1715.37476  765.22777  1.291112e+12
1  611.88243  9089.5601  5133  864.07514  1715.37476  765.22777  1.291113e+12
2  611.88243  9089.5601  5133  864.07514  1715.37476  765.22777  1.291121e+12

In [70]: df['A']
Out[70]:
0    611.88243
1    611.88243
2    611.88243
Name: A, dtype: float64

In [71]: df.shape
Out[71]: (3, 7)
```
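If you also need a plain (rows, cols) ndarray alongside the labeled access, the DataFrame can hand one back. A minimal sketch, with the data inlined here (a subset of the columns) rather than read from test.csv:

```python
import pandas as pd

# Inline stand-in for pd.read_csv('test.csv'); only three columns for brevity.
df = pd.DataFrame({
    "A": [611.88243, 611.88243, 611.88243],
    "B": [9089.5601, 9089.5601, 9089.5601],
    "timestamp": [1.291111964948e12, 1.291113113366e12, 1.291120650486e12],
})

col_a = df["A"].to_numpy()  # column vector by name, a plain 1-D ndarray
mat = df.to_numpy()         # plain 2-D ndarray; shape (3, 3) for this subset

print(mat.shape)
print(col_a)
```

So you get named-column access through the DataFrame and a uniform matrix through `to_numpy()`, from the same object.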

A clean plain-NumPy/Python alternative is to use a dict to map column names to indices:

```
import numpy as np
import csv

with open(filename) as f:
    reader = csv.reader(f)
    columns = next(reader)
colmap = dict(zip(columns, range(len(columns))))

arr = np.matrix(np.loadtxt(filename, delimiter=",", skiprows=1))
print(arr[:, colmap['A']])
```

gives

```
[[ 611.88243]
 [ 611.88243]
 [ 611.88243]]
```

So arr is a NumPy matrix whose columns can be accessed by label using the syntax

```
arr[:, colmap[column_name]]
```
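Note that np.matrix is deprecated in modern NumPy; the same name-to-index lookup works with a plain 2-D array. A sketch with the loadtxt result inlined (two columns only) so it is self-contained:

```python
import numpy as np

# Inline stand-in for np.loadtxt("test.csv", delimiter=",", skiprows=1),
# restricted to the first two columns.
columns = ["A", "B"]
colmap = {name: i for i, name in enumerate(columns)}

arr = np.array([[611.88243, 9089.5601],
                [611.88243, 9089.5601],
                [611.88243, 9089.5601]])

col_a = arr[:, colmap["A"]]  # 1-D column vector selected by name
print(col_a)
```

With a plain 2-D array the selected column comes back 1-D, rather than as the (3, 1) column matrix np.matrix produces.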

Since your data is homogeneous (all elements are floating-point values), you can create a view of the data returned by genfromtxt that is a 2D array. For instance,

```
In [42]: r = np.genfromtxt("test.csv", delimiter=',', names=True)
```

Create a numpy array that is a "view" of r. This is a regular numpy array, but it shares its data with r:

```
In [43]: a = r.view(np.float64).reshape(len(r), -1)

In [44]: a.shape
Out[44]: (3, 7)

In [45]: a[:, 0]
Out[45]: array([ 611.88243,  611.88243,  611.88243])

In [46]: r['A']
Out[46]: array([ 611.88243,  611.88243,  611.88243])
```

r and a refer to the same memory block:

```
In [47]: a[0, 0] = -1

In [48]: r['A']
Out[48]: array([  -1.     ,  611.88243,  611.88243])
```
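On NumPy 1.16 and later, numpy.lib.recfunctions.structured_to_unstructured performs the same structured-to-2D conversion without the manual view/reshape. A minimal sketch with a small structured array built inline (note it may return a copy rather than a view, so in-place edits need not propagate back):

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Structured array like the one genfromtxt returns, built inline
# (two rows and three fields for brevity).
r = np.array(
    [(611.88243, 9089.5601, 1.291111964948e12),
     (611.88243, 9089.5601, 1.291113113366e12)],
    dtype=[('A', 'f8'), ('B', 'f8'), ('timestamp', 'f8')],
)

a = rfn.structured_to_unstructured(r)  # plain 2-D float array

print(a.shape)   # (2, 3)
print(r['A'])    # named access still works on r
```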

Source: https://habr.com/ru/post/970629/

