Convert pandas.DataFrame to bytes

I need to convert the data stored in a pandas.DataFrame to a byte string, where each column can have its own data type (integer or floating point). Here is a simple data set:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame([10, 15, 20], dtype='u1', columns=['a'])
    df['b'] = np.array([np.iinfo('u8').max, 230498234019, 32094812309], dtype='u8')
    df['c'] = np.array([1.324e10, 3.14159, 234.1341], dtype='f8')

and df looks something like this:

        a                     b             c
    0  10  18446744073709551615  1.324000e+10
    1  15          230498234019  3.141590e+00
    2  20           32094812309  2.341341e+02

The DataFrame knows the type of each column (via df.dtypes), so I would like to do something like this:

    data_to_pack = [tuple(record) for _, record in df.iterrows()]
    data_array = np.array(data_to_pack, dtype=zip(df.columns, df.dtypes))
    data_bytes = data_array.tostring()

This usually works fine, but in this case (because of the maximum value stored in df['b'][0]) the second line above, which converts the list of tuples to an np.array with the given set of types, raises the following error:

 OverflowError: Python int too large to convert to C long 

The error originates (I believe) in the first line, which extracts each record as a Series with a single data type (float64 here), and the float64 representation chosen for the maximum uint64 value cannot be converted directly back to uint64.
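The suspected cause can be checked directly. The following is a minimal sketch, assuming a reasonably recent pandas and NumPy; the variable names (`first_row`, `as_float`, `round_trips`) are illustrative and not part of the original question. It shows that `iterrows()` yields a row with one common dtype and that the maximum uint64 value does not survive a round trip through float64.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([10, 15, 20], dtype='u1', columns=['a'])
df['b'] = np.array([np.iinfo('u8').max, 230498234019, 32094812309], dtype='u8')
df['c'] = np.array([1.324e10, 3.14159, 234.1341], dtype='f8')

# iterrows() returns each row as a Series, which has a single dtype,
# so the mixed u1/u8/f8 columns are upcast to a common type.
_, first_row = next(df.iterrows())
row_dtype = first_row.dtype

# float64 has a 53-bit significand, so 2**64 - 1 is rounded to 2**64,
# which no longer fits in an unsigned 64-bit integer.
u8_max = np.iinfo('u8').max
as_float = float(u8_max)
round_trips = int(as_float) == u8_max

print(row_dtype, round_trips)
```

Because the rounded value is 2**64, one past the largest uint64, converting it back to an integer type overflows.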

1) Since the DataFrame already knows the type of each column, is there a way to avoid creating a list of tuples as input to the typed numpy.array constructor? Or is there a better way than the one outlined above to preserve the type information in such a conversion?

2) Is there a way to go directly from the DataFrame to a byte string representing the data, using the type information for each column?

1 answer

You can use df.to_records() to convert your DataFrame to a NumPy record array, then call .tostring() to convert it to a byte string:

    rec = df.to_records(index=False)

    print(repr(rec))
    # rec.array([(10, 18446744073709551615, 13240000000.0), (15, 230498234019, 3.14159),
    #            (20, 32094812309, 234.1341)],
    #           dtype=[('a', '|u1'), ('b', '<u8'), ('c', '<f8')])

    s = rec.tostring()
    rec2 = np.fromstring(s, rec.dtype)

    print(np.all(rec2 == rec))
    # True
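In newer NumPy releases, `tostring()` and binary-mode `np.fromstring()` are deprecated in favor of `tobytes()` and `np.frombuffer()`. The sketch below shows the same round trip with those calls, assuming a current NumPy and pandas; the names `data_bytes`, `restored`, and `fields_match` are illustrative and not from the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([10, 15, 20], dtype='u1', columns=['a'])
df['b'] = np.array([np.iinfo('u8').max, 230498234019, 32094812309], dtype='u8')
df['c'] = np.array([1.324e10, 3.14159, 234.1341], dtype='f8')

# Each column keeps its own dtype in the record array, so no value
# passes through float64 on the way to bytes.
rec = df.to_records(index=False)
data_bytes = rec.tobytes()

# Reinterpret the raw bytes with the same record dtype to get the rows back.
restored = np.frombuffer(data_bytes, dtype=rec.dtype)

# Compare field by field rather than relying on whole-record equality.
fields_match = all(
    np.array_equal(restored[name], rec[name]) for name in rec.dtype.names
)
print(len(data_bytes), fields_match)
```

Note that `np.frombuffer` returns a read-only view over the bytes object; call `.copy()` on the result if the array needs to be modified.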

Source: https://habr.com/ru/post/1240124/

