I am very new to Python and pandas. Any recommendations, comments, and suggestions are appreciated!
Here is my problem: it takes several minutes to return the result after calling df.shape or df.dtypes. The DataFrame has 1,610,658 rows and 5 columns. Three columns are stored as int64, one as float64, and one as datetime64.
I used the following code to load and transform the data. Both the loading and the transformation perform well, but I ran into this problem when I checked the output.
Update 1:
After setting some columns as the index, the time for df.shape drops from 80+ s to 1.7 s, but df.dtypes still takes 80+ s.
import pandas as pd

# Load the zipped CSV.
raw = pd.read_csv("data.zip", compression="zip")

# Map payment labels to integer codes; unknown labels become 0.
payment_method = {
    "Cash": 1,
    "Card": 2,
}

df = raw.assign(
    site=(raw.site == "A").astype(int),
    payment=[payment_method.get(k, 0) for k in raw.payment],
    amount=raw.amount / 1e6,
    sold_date=pd.to_datetime(
        [str(dt) for dt in raw.sold_date],
        format="%Y%m%d"),
)

# Both of these calls take a long time to return.
df.shape
df.dtypes
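For comparison, the same transformation can probably be written with column-wise operations instead of Python list comprehensions. This is only a sketch on a tiny made-up sample: the values are invented, and it assumes sold_date is stored as integers such as 20170101 (suggested by the str(dt) call above, but not confirmed). Whether this changes the slow df.shape and df.dtypes calls is not something I have verified.

```python
import pandas as pd

# Tiny made-up sample using the column names from the code above.
# Assumption: sold_date holds integers in YYYYMMDD form.
raw = pd.DataFrame({
    "site": ["A", "B", "A"],
    "payment": ["Cash", "Card", "Cheque"],
    "amount": [1500000, 2500000, 750000],
    "sold_date": [20170101, 20170215, 20170330],
})

payment_method = {"Cash": 1, "Card": 2}

df = raw.assign(
    site=(raw["site"] == "A").astype(int),
    # Series.map yields NaN for labels missing from the dict, so fill those with 0.
    payment=raw["payment"].map(payment_method).fillna(0).astype(int),
    amount=raw["amount"] / 1e6,
    # Convert the whole column to strings, then parse with an explicit format.
    sold_date=pd.to_datetime(raw["sold_date"].astype(str), format="%Y%m%d"),
)

print(df.dtypes)
```

Every column assigned here is a pandas Series rather than a plain Python list, which is the only intended difference from the original code.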
If I convert the data frame to a numpy.ndarray, I get the result immediately. I think I must be missing something. Please point me in the right direction.
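In case it helps anyone reproduce this, here is a minimal sketch of how the calls could be timed on a synthetic frame with the same number of rows and the same dtypes. The data is random, the column name store is a hypothetical stand-in for the fifth column, and the timed helper is my own, so the timings will not necessarily match what I see on the real data.

```python
import time

import numpy as np
import pandas as pd

n_rows = 1_610_658
rng = np.random.RandomState(0)

# Synthetic stand-in: three int64 columns, one float64, one datetime64.
df = pd.DataFrame({
    "site": rng.randint(0, 2, size=n_rows).astype("int64"),
    "payment": rng.randint(0, 3, size=n_rows).astype("int64"),
    "store": rng.randint(0, 100, size=n_rows).astype("int64"),  # hypothetical column
    "amount": rng.rand(n_rows),
    "sold_date": pd.Timestamp("2017-01-01")
    + pd.to_timedelta(rng.randint(0, 365, size=n_rows), unit="D"),
})


def timed(label, fn):
    """Run fn once, print the elapsed wall-clock time, and return its result."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f} s")
    return result


shape = timed("df.shape", lambda: df.shape)
dtypes = timed("df.dtypes", lambda: df.dtypes)
# Converting the whole mixed-dtype frame produces one object ndarray.
values_shape = timed("df.values.shape", lambda: df.values.shape)
```

If the synthetic frame returns quickly while the real one does not, that would suggest the problem lies in how the real frame is built rather than in its size.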
Thank you so much!
System: OS X 10.12 Python: 3.6.1 NumPy: 1.12 Pandas: 0.20.2 Jupyter Console: 5.1.0