I have a table like the one below:
                               saleid             upc
0  155_02127453_20090616_135212_0021  02317639000000
1  155_02127453_20090616_135212_0021  00000000000888
2  155_01605733_20090616_135221_0016  00264850000000
3  155_01072401_20090616_135224_0010  02316877000000
4  155_01072401_20090616_135224_0010  05051969277205
Each row pairs one customer transaction (saleid) with one of the items bought in it (upc).
I want to collapse this table into the following form:
                                   02317639000000  00000000000888  00264850000000  02316877000000  05051969277205
155_02127453_20090616_135212_0021               1               1               0               0               0
155_01605733_20090616_135221_0016               0               0               1               0               0
155_01072401_20090616_135224_0010               0               0               0               1               1
So the columns are the unique UPCs and the rows are the unique SALEIDs, with a 1 wherever that sale contains that item.
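On a small in-memory sample this is exactly what pd.crosstab produces (a sketch; note pandas sorts the resulting columns lexicographically, so the column order differs from the display above):

```python
import pandas as pd

# Reproduce the sample rows from the question
df = pd.DataFrame({
    'saleid': ['155_02127453_20090616_135212_0021',
               '155_02127453_20090616_135212_0021',
               '155_01605733_20090616_135221_0016',
               '155_01072401_20090616_135224_0010',
               '155_01072401_20090616_135224_0010'],
    'upc': ['02317639000000', '00000000000888', '00264850000000',
            '02316877000000', '05051969277205'],
})

# One row per saleid, one column per upc, counts at the intersections
out = pd.crosstab(df.saleid, df.upc)
print(out)
```

Since each (saleid, upc) pair occurs at most once here, the counts are already the 0/1 indicators you want.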
I read the file like this:
import pandas as pd

tbl = pd.read_csv('tbl_sale_items.csv', sep=';', dtype={'saleid': str, 'upc': str})
tbl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18570726 entries, 0 to 18570725
Data columns (total 2 columns):
saleid object
upc object
dtypes: object(2)
memory usage: 283.4+ MB
I tried a few things, but none of them was right:
tbl.pivot_table(columns=['upc'],aggfunc=pd.Series.nunique)
upc     00000000000000  00000000000109  00000000000116  00000000000123  00000000000130  00000000000147  00000000000154  00000000000161  ...
saleid           44950             287           26180            4881            1839             623            3347               7  ...
EDIT: I am currently using the approach below:
chunksize = 1000000
f = 0
for chunk in pd.read_csv('tbl_sale_items.csv', sep=';',
                         dtype={'saleid': str, 'upc': str},
                         chunksize=chunksize):
    print(f)
    t = pd.crosstab(chunk.saleid, chunk.upc)
    t.head(3)
    t.to_csv('tbl_sales_index_converted_' + str(f) + '.csv.bz2',
             header=True, sep=';', compression='bz2')
    f = f + 1
The source file is far too large to fit in memory after conversion, which is why I process it in chunks. The problem with the solution above is that each output file only contains the columns (UPCs) that happen to appear in its own chunk, so the pieces end up with different columns.
Question 2: is there a way to make all the pieces have the same columns?