Data Sharing for ML

Question

I imported a dataset for a machine learning project. I need each "neuron" in my first input layer to hold one numeric value from the dataset, but I could not get this to work. Here is my code:

    import math
    import numpy as np
    import pandas as pd

    v = pd.read_csv('atestred.csv', error_bad_lines=False).values
    rw = 1
    print(v)
    for x in range(0,10):
        rw += 1
        s = (v[rw])
        list(s)
        #s is one row of the dataset
        print(s)#Just a debug.
        myvar = s

    class l1neuron(object):
        def gi():
            for n in range(0, len(s)):
                x = (s[n])
                print(x)#Just another debug

    n11 = l1neuron
    n11.gi()

Ideally, the code would create a new variable for each row it extracts from the data (which I try to do in the first loop), and a new variable for each value extracted from each row (which I try to do in the class and in the second loop).

If I have completely missed the point with my code, feel free to point me in the right direction for a complete rewrite.

Here are the first few lines of my dataset:

    fixed acidity;"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
    7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
    7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
    7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5

Thanks in advance.

python python-3.x numpy pandas

3141 Dec 30 '17 at 20:14

2 answers

user508402 · Answer 1 · 2018-02-14T19:08:30+0000

If I understand your problem correctly, you would like to convert each row of your CSV table into a separate variable that holds all the values of that row. Here is an example of how you can do this. There are many ways to do it, and others may be more efficient, faster, or more Pythonic, but the code below was written to help you understand how to store tabular data in named variables.

Two points:

If reading the data is the only thing you need pandas for, you may want to look for a simpler solution.
The L1Neuron class is not very transparent: its members cannot be read from the code, because they are created at runtime from the names passed in attr. You might want to take a look at namedtuple (in the collections module) for better readability.

    import pandas as pd
    from io import StringIO
    import numbers

    # example data:
    atestred = StringIO("""fixed acidity;volatile acidity;citric acid;\
    residual sugar;chlorides;free sulfur dioxide;total sulfur dioxide;\
    density;pH;sulphates;alcohol;quality
    7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
    7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
    7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
    """)

    # read example data into dataframe 'data'; extract values and column names.
    # on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False,
    # which was removed in pandas 2.0.
    data = pd.read_csv(atestred, sep=';', on_bad_lines='skip')
    colNames = list(data)

    class L1Neuron(object):
        "neuron class that holds the variables of one data line"

        def __init__(self, **attr):
            """
            attr is a dict (like {'alcohol': 12, 'pH':7.4}); every pair in attr will
            result in a member variable of this object with that name and value"""
            for name, value in attr.items():
                # spaces are not allowed in attribute names, so use underscores
                setattr(self, name.replace(" ", "_"), value)

        def gi(self):
            "print all numeric member variables whose names don't start with an underscore:"
            for v in sorted(dir(self)):
                if not v.startswith('_'):
                    value = getattr(self, v)
                    if isinstance(value, numbers.Number):
                        print("%-20s = %5.2f" % (v, value))
            print('-'*50)

    # read csv into variables (one for each line):
    neuronVariables = []
    for s in data.values:
        variables = dict(zip(colNames, s))
        neuron = L1Neuron(**variables)
        neuronVariables.append(neuron)

    # now the variables in neuronVariables are ready to be used:
    for n11 in neuronVariables:
        print("free sulphur dioxide in this variable:", n11.free_sulfur_dioxide, end=" of ")
        print(n11.total_sulfur_dioxide, "total sulphur dioxide")
        n11.gi()
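
For comparison, the two points above (reading the file without pandas, and using a namedtuple) could be combined roughly as follows. This is only a sketch: the name WineRow is made up for illustration, the StringIO object stands in for the real file, and it assumes every field in every row is numeric.

```python
import csv
from collections import namedtuple
from io import StringIO

# Stand-in for the real file: the header and the first two rows from the question.
sample = StringIO(
    "fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;"
    "free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality\n"
    "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5\n"
    "7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5\n"
)

reader = csv.reader(sample, delimiter=";")
header = next(reader)  # first line holds the column names

# namedtuple field names must be valid identifiers, so spaces become underscores.
# WineRow is a hypothetical name chosen for this example.
WineRow = namedtuple("WineRow", [name.replace(" ", "_") for name in header])

# Assumption: every field is numeric, so each one is converted to float.
rows = [WineRow(*(float(field) for field in line)) for line in reader]

print(rows[0].pH)
print(rows[1].free_sulfur_dioxide)
```

Unlike L1Neuron, the available fields are fixed when WineRow is created, so a misspelled attribute name fails immediately instead of silently creating a new member.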

mr_snuffles · Answer 2 · 2018-02-17T08:42:07+0000

If this is for a machine learning project, I would recommend loading the CSV into a numpy array for easy manipulation. You could store each value of the table in its own variable, but that would hurt performance by preventing you from using vectorized operations, and it would make your data harder to work with. I would suggest the following:

    from numpy import genfromtxt
    my_data = genfromtxt('my_file.csv', delimiter=',')
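
Note that the sample in the question is semicolon-delimited and starts with a header row, so the call would probably need adjusting for that file. A minimal sketch, assuming the file looks like the rows quoted in the question (the StringIO object stands in for the real file):

```python
from io import StringIO
import numpy as np

# Stand-in for the real file: the header and first rows quoted in the question.
csv_text = StringIO(
    "fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;"
    "free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality\n"
    "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5\n"
    "7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5\n"
    "7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5\n"
)

# delimiter=';' matches the sample; skip_header=1 drops the column-name row.
my_data = np.genfromtxt(csv_text, delimiter=";", skip_header=1)
print(my_data.shape)  # (number of rows, number of columns)

# Vectorized operation: subtract each column's mean from every row at once,
# with no explicit Python loop over rows or values.
centered = my_data - my_data.mean(axis=0)
print(centered[:, 0])
```

The whole table lives in one 2-D array, so an operation like the centering above touches every value in a single expression.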

If your machine learning problem is supervised, you will also want to split your labels off into a separate data structure. If you are doing unsupervised learning, a single data structure is enough. If you provide more context about the problem you are trying to solve, we can give you more specific guidance.
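
For the supervised case, the split could look something like this. It is a sketch that assumes the last column ("quality" in the sample from the question) is the label; the names X and y are just the usual convention, not something from the question.

```python
import numpy as np

# The three sample rows from the question, already parsed into an array.
my_data = np.array([
    [7.4, 0.70, 0.00, 1.9, 0.076, 11, 34, 0.9978, 3.51, 0.56, 9.4, 5],
    [7.8, 0.88, 0.00, 2.6, 0.098, 25, 67, 0.9968, 3.20, 0.68, 9.8, 5],
    [7.8, 0.76, 0.04, 2.3, 0.092, 15, 54, 0.9970, 3.26, 0.65, 9.8, 5],
])

# Assumption: the last column ("quality") is the label to predict.
X = my_data[:, :-1]  # features: one row per sample, one column per input value
y = my_data[:, -1]   # labels: one value per sample

print(X.shape, y.shape)
```

With this layout, each row of X is one sample, and each of its columns would presumably feed one neuron of the input layer.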
