Python Pandas: create an empty DataFrame with specified column dtypes

There is one thing that I have to do quite often, and it surprises me how difficult it is to achieve in Pandas. Suppose I need to create an empty DataFrame with a specified index type and name, as well as specified column types and names. (Maybe I want to fill it later, for example in a loop.) The easiest way I have found is to create an empty pandas.Series object for each column, specifying its dtype, put them in a dictionary that defines their names, and pass the dictionary to the DataFrame constructor. Something like the following.

import pandas


def create_empty_dataframe():
    index = pandas.Index([], name="id", dtype=int)
    column_names = ["name", "score", "height", "weight"]
    series = [pandas.Series(dtype=str), pandas.Series(dtype=int), pandas.Series(dtype=float), pandas.Series(dtype=float)]
    columns = dict(zip(column_names, series))
    # columns=column_names is required because the dictionary will, in general, put the columns in arbitrary order.
    return pandas.DataFrame(columns, index=index, columns=column_names)

First question. Is this really the easiest way to do this? There are so many things that are confusing about this. What I really want to do, and what I'm sure many people really want to do, is something like the following.

df = pandas.DataFrame(columns=["id", "name", "score", "height", "weight"], dtypes=[int, str, int, float, float], index_column="id") 

Second question. Is this kind of syntax possible at all in Pandas? If not, are the developers willing to support something like this? It seems to me that it really should be as simple as that (the above syntax).

4 answers

Unfortunately, the DataFrame constructor accepts only a single dtype descriptor; however, you can cheat a little by using read_csv:

In [143]:
import pandas as pd
import io
cols=["id", "name", "score", "height", "weight"]
df = pd.read_csv(io.StringIO(""), names=cols, dtype=dict(zip(cols,[int, str, int, float, float])), index_col=['id']) 
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 4 columns):
name      0 non-null object
score     0 non-null int32
height    0 non-null float64
weight    0 non-null float64
dtypes: float64(2), int32(1), object(1)
memory usage: 0.0+ bytes

As you can see, the dtypes are as desired, and the index is set as well:

In [145]:

df.index
Out[145]:
Int64Index([], dtype='int64', name='id')

You can set the dtype of a DataFrame column by reassigning it:

df['column_name'] = df['column_name'].astype(float)
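
The same idea can be applied to all columns at once, since DataFrame.astype also accepts a dictionary mapping column names to dtypes. The following is a minimal sketch, not a definitive recipe: the schema dictionary and its column names are taken from the question for illustration, and the exact integer width you get may depend on the platform.

```python
import pandas as pd

# Hypothetical schema for illustration: column name -> dtype.
schema = {"id": int, "name": str, "score": int, "height": float, "weight": float}

# Create an empty frame with the right column names (all object dtype at first),
# cast every column in one call, then promote "id" to the index.
df = pd.DataFrame(columns=list(schema)).astype(schema).set_index("id")

print(df.dtypes)
print(df.index)
```

Because the schema is an ordinary dictionary, the column order follows its insertion order on Python 3.7 and later.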

You can simplify this by pairing each column name with its dtype:

import pandas


def create_empty_dataframe():
    index = pandas.Index([], name="id", dtype=int)
    # specify each column name and its data type
    columns = [('name', str),
               ('score', int),
               ('height', float),
               ('weight', float)]
    # create the dataframe from a dict of empty, typed Series;
    # pass the index so that it is actually used
    return pandas.DataFrame({k: pandas.Series(dtype=t) for k, t in columns}, index=index)
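
A list of (name, dtype) pairs like the one above is also the form NumPy uses to describe a structured dtype, so another possible route is to build a zero-length structured array and hand it to the DataFrame constructor. This is only a sketch under some assumptions: the names fields and empty_records are made up for illustration, object is used for the string column instead of str, and the index is attached in a separate step.

```python
import numpy as np
import pandas as pd

# (name, dtype) pairs; object is used for the string column here.
fields = [("name", object),
          ("score", np.int64),
          ("height", np.float64),
          ("weight", np.float64)]

# A zero-length structured array has one typed field per future column.
empty_records = np.empty(0, dtype=np.dtype(fields))

# pandas turns each field of the structured array into a column.
df = pd.DataFrame(empty_records)

# Attach an empty, named, typed index afterwards.
df.index = pd.Index([], dtype="int64", name="id")

print(df.dtypes)
```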


Building on the answer by @Elliot:

import pandas as pd


def create_empty_DataFrame(columns, index_col):
    df = pd.DataFrame({name: pd.Series(dtype=t) for name, t in columns}).set_index(index_col)
    cols = [name for name, _ in columns]
    cols.remove(index_col)
    return df[cols]

If column order does not matter, you can replace return df[cols] with return df. Example usage:

columns = [
    ('id', str),
    ('primary', bool),
    ('side', str),
    ('quantity', int),
    ('price', float)]

table = create_empty_DataFrame(columns, 'id')

Checking the dtypes:

table.info()

<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 4 columns):
primary     0 non-null bool
side        0 non-null object
quantity    0 non-null int64
price       0 non-null float64
dtypes: bool(1), float64(1), int64(1), object(1)
memory usage: 0.0+ bytes

table.index

Index([], dtype='object', name='id')

Source: https://habr.com/ru/post/1648763/

