Pandas: Build a data frame containing a column of tuples

I have a custom CSV file that looks something like this:

x,y
1,"(5, 27, 4)"
2,"(3, 1, 6, 2)"
3,"(4, 5)"

Using pd.read_csv() gives a result that is not very useful, because the tuples are not parsed and stay as strings. There are existing answers on related problems (1, 2), but they assume tuples of equal length, and the tuples here differ in length, so they do not fully apply to my case.

What I would like to do is plot x vs y using pandas routines. The naive approach raises an error because the tuples are stored as strings:

>>> # df = pd.read_csv('data.csv')
>>> df = pd.DataFrame({'x': [1, 2, 3],
                       'y': ["(5, 27, 4)","(3, 1, 6, 2)","(4, 5)"]})
>>> df.plot.scatter('x', 'y')
[...]
ValueError: scatter requires y column to be numeric

The result I hope for looks something like this:

import numpy as np
import matplotlib.pyplot as plt
for x, y in zip(df['x'], df['y']):
    y = eval(y)
    plt.scatter(x * np.ones_like(y), y, color='blue')


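Is there a straightforward way to create this plot directly with Pandas functions such as df.plot.scatter() (and preferably without using eval())?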


You could expand df to one row per tuple element and then plot:

In [3129]: s = df.y.map(ast.literal_eval)

In [3130]: dff = pd.DataFrame({'x': df.x.repeat(s.str.len()).values,
                               'y': np.concatenate(s.values)})

In [3131]: dff
Out[3131]:
   x   y
0  1   5
1  1  27
2  1   4
3  2   3
4  2   1
5  2   6
6  2   2
7  3   4
8  3   5

And then plot:

dff.plot.scatter('x', 'y')
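
For reference, the same reshaping can be sketched as a self-contained script. This is only one possible variant, not the answer's own code: it assumes pandas 0.25 or newer (for DataFrame.explode), parses the tuples while reading via the converters argument of pd.read_csv, and uses an inline csv_text string and the name long_df purely for illustration.

```python
import ast
import io

import pandas as pd

# Inline copy of the CSV from the question (stands in for 'data.csv').
csv_text = 'x,y\n1,"(5, 27, 4)"\n2,"(3, 1, 6, 2)"\n3,"(4, 5)"\n'

# Parse each 'y' cell into a real tuple while reading, without eval().
df = pd.read_csv(io.StringIO(csv_text), converters={'y': ast.literal_eval})

# One row per tuple element; the matching 'x' value is repeated.
# explode() leaves 'y' with object dtype, so cast it back to integers.
long_df = df.explode('y').reset_index(drop=True)
long_df['y'] = long_df['y'].astype('int64')

print(long_df)
```

With matplotlib installed, long_df.plot.scatter('x', 'y') should then work the same way as for dff.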

Alternatively, you can use the .str accessor to extract the integers, specifically .str.extractall:

# Index by 'x' to retain its values once we extract from 'y'
df = df.set_index('x')

# Extract integers from 'y'
df = df['y'].str.extractall(r'(\d+)')[0].astype('int64')

# Rename and reset the index (remove 'match' level, get 'x' as column)
df = df.rename('y').reset_index(level='match', drop=True).reset_index()

If y holds floats rather than ints, adjust the regex and the astype call accordingly.

The resulting DataFrame looks like this:

   x   y
0  1   5
1  1  27
2  1   4
3  2   3
4  2   1
5  2   6
6  2   2
7  3   4
8  3   5

From here, df.plot.scatter('x', 'y') should produce the expected plot.
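
The note about floats above could look roughly like the sketch below. The sample data and the pattern are assumptions made for illustration, not part of the original answer; the pattern handles an optional sign and an optional fractional part, but not exponent notation such as 1e-3.

```python
import pandas as pd

# Hypothetical float-valued variant of the question's data.
df = pd.DataFrame({'x': [1, 2],
                   'y': ["(5.5, 27, 4.25)", "(3, -1.5)"]})

# Assumed pattern: optional sign, digits, optional fractional part.
# The inner group is non-capturing, so extractall yields a single column 0.
pattern = r'([-+]?\d+(?:\.\d+)?)'

out = (
    df.set_index('x')['y']                       # keep 'x' as the index
      .str.extractall(pattern)[0]                # one row per matched number
      .astype('float64')                         # float instead of int64
      .rename('y')
      .reset_index(level='match', drop=True)     # drop the 'match' level
      .reset_index()                             # restore 'x' as a column
)

print(out)
```

The only changes from the integer version are the regular expression and the target dtype passed to astype.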


Source: https://habr.com/ru/post/1687574/

