Keeping rows in Pandas where the same identifier appears more than n times, and converting them to a list per identifier

Question


I have a DataFrame as shown below:

        VID value
    1     1    xx
    2     2   xx1
    3     2   xx2
    4     2   xx3
    5     2   xx4
    6     3    xx
    7     3    xx
    8     3    xx
    9     4   zz1
    10    4   zz2
    11    4   zz3
    12    4   zz4
    13    4   zz5

I want to keep only the rows whose VID appears more than n = 3 times. In the example above, that means keeping only rows 2-5 and 9-13 (since only VIDs 2 and 4 appear more than 3 times). The value field does not matter.

After keeping the necessary rows, I would like to convert my data to a list of lists, one list per VID, appending the value 'end' to each:

 [[xx1, xx2, xx3, xx4, 'end'], [zz1, zz2, zz3, zz4, zz5, 'end']]

Is it possible to get the above without a for loop?


python pandas

Stergios Oct 20 '15 at 15:38


1 answer

Anand s kumar · Accepted Answer · 2015-10-20T15:46:41+0000

You can group by the VID column and take the count of each group. Then use those counts to index your original df, so that only rows whose VID has a count greater than 3 are kept. Example -

    countdf = df.groupby('VID').count()
    result = df.loc[df['VID'].isin(countdf[countdf['value'] > 3].index)]

Demo -

    In [49]: df
    Out[49]:
        VID value
    1     1    xx
    2     2   xx1
    3     2   xx2
    4     2   xx3
    5     2   xx4
    6     3    xx
    7     3    xx
    8     3    xx
    9     4   zz1
    10    4   zz2
    11    4   zz3
    12    4   zz4
    13    4   zz5

    In [51]: df.groupby('VID').count()
    Out[51]:
         value
    VID
    1        1
    2        4
    3        3
    4        5

    In [52]: countdf = df.groupby('VID').count()

    In [53]: df.loc[df['VID'].isin(countdf[countdf['value'] > 3].index)]
    Out[53]:
        VID value
    2     2   xx1
    3     2   xx2
    4     2   xx3
    5     2   xx4
    9     4   zz1
    10    4   zz2
    11    4   zz3
    12    4   zz4
    13    4   zz5
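The same filter can also be written without the intermediate count frame. The following is a minimal sketch of that alternative, not the approach used in this answer: it relies on `groupby(...).transform('size')`, which gives every row the size of its own VID group. The names `n` and `group_sizes` are chosen here for illustration, and the frame is rebuilt from the question's example.

```python
import pandas as pd

# Rebuild the example frame from the question (index runs 1..13).
df = pd.DataFrame(
    {
        "VID": [1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4],
        "value": ["xx", "xx1", "xx2", "xx3", "xx4", "xx", "xx", "xx",
                  "zz1", "zz2", "zz3", "zz4", "zz5"],
    },
    index=range(1, 14),
)

n = 3  # keep VIDs that appear more than n times

# For each row, the number of rows sharing that row's VID.
group_sizes = df.groupby("VID")["VID"].transform("size")

# Boolean mask aligned with df's index; keeps rows 2-5 and 9-13.
result = df[group_sizes > n]
print(result)
```

One difference worth knowing: `count()` skips missing values in the counted column, while `size` counts every row, so the two can disagree if `value` contains NaN.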

After that, you can group by VID again, turn each group's values into a list with apply(list), and then convert the resulting Series into a plain Python list with tolist(). Example -

 resultlist = result.groupby('VID')['value'].apply(list).tolist()

Demo -

    In [54]: result = df.loc[df['VID'].isin(countdf[countdf['value'] > 3].index)]

    In [55]: result.groupby('VID')['value'].apply(list).tolist()
    Out[55]: [['xx1', 'xx2', 'xx3', 'xx4'], ['zz1', 'zz2', 'zz3', 'zz4', 'zz5']]

Please note: the lists above do not contain the 'end' value. I assume it is not necessary, but if you really want it, you can append it manually after getting the list. Example -

 resultlist = [elem + ['end'] for elem in resultlist]
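Putting the steps of this answer together, a self-contained sketch might look like the following. The function name `vid_lists` and its parameters `n` and `end_marker` are hypothetical, introduced only to bundle the count, filter, group and append steps; the data is the example frame from the question.

```python
import pandas as pd


def vid_lists(df, n=3, end_marker="end"):
    """Return one list of values per VID that appears more than n times."""
    # Step 1: count rows per VID (non-null entries in 'value').
    counts = df.groupby("VID")["value"].count()
    frequent_vids = counts[counts > n].index

    # Step 2: keep only rows whose VID is among the frequent ones.
    kept = df.loc[df["VID"].isin(frequent_vids)]

    # Step 3: one list per VID, then a plain list of lists.
    lists = kept.groupby("VID")["value"].apply(list).tolist()

    # Step 4: append the end marker to each inner list.
    return [values + [end_marker] for values in lists]


df = pd.DataFrame(
    {
        "VID": [1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4],
        "value": ["xx", "xx1", "xx2", "xx3", "xx4", "xx", "xx", "xx",
                  "zz1", "zz2", "zz3", "zz4", "zz5"],
    },
    index=range(1, 14),
)

resultlist = vid_lists(df)
print(resultlist)
# [['xx1', 'xx2', 'xx3', 'xx4', 'end'], ['zz1', 'zz2', 'zz3', 'zz4', 'zz5', 'end']]
```

The final step is still a list comprehension, so strictly speaking one small Python-level loop remains, but it runs once per group rather than once per row.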
