I have a data frame and I need to filter it according to the following conditions:
CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ACTION' & count_GENRE >= 1
CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ROMANCE' & count_GENRE >= 1
CITY == 'Mumbai' & LANGUAGE == 'Hindi' & count_LANGUAGE >= 1 & GENRE == 'ACTION'
When I try to do it with
df1 = df.query(condition1)
df2 = df.query(condition2)
I get a memory error (since my data is huge).
So I planned to apply the main condition first and then the sub-conditions, so that each later query runs on a smaller frame and performs better.
By analyzing the above conditions, I managed to get:
main_filter = "CITY == 'Mumbai'"
sub_cond1 = "LANGUAGE == 'English'"
sub_cond1_cond1 = "GENRE == 'ACTION' & count_GENRE >= 1"
sub_cond1_cond2 = "GENRE == 'ROMANCE' & count_GENRE >= 1"
sub_cond2 = "LANGUAGE == 'Hindi' & count_LANGUAGE >= 1"
sub_cond2_cond1 = "GENRE == 'COMEDY'"
So think of this as a tree structure (not a binary one, of course, and strictly speaking not a tree at all).
Now I want to use multiprocessing for this (subprocesses nested under subprocesses).
I want something like this:
On level 1:
df = df_main.query(main_filter)
On level 2:
df1 = df.query(sub_cond1)
df2 = df.query(sub_cond2)
On level 3:
df11 = df1.query(sub_cond1_cond1)
df12 = df1.query(sub_cond1_cond2)
df21 = df2.query(sub_cond2_cond1)  # and so on, like this
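To make the level-by-level idea concrete, here is a minimal single-process sketch. It assumes the conditions are stored as nested `(condition, children)` tuples; the names `tree` and `filter_tree` and the tiny sample frame are hypothetical, and the Hindi branch assumes the column is spelled `count_LANGUAGE` as in the first list of conditions. Each level queries only the subset produced by its parent, so the big frame is scanned once.

```python
import pandas as pd

# Hypothetical layout: each node is (condition, list_of_child_nodes).
tree = ("CITY == 'Mumbai'", [
    ("LANGUAGE == 'English'", [
        ("GENRE == 'ACTION' & count_GENRE >= 1", []),
        ("GENRE == 'ROMANCE' & count_GENRE >= 1", []),
    ]),
    ("LANGUAGE == 'Hindi' & count_LANGUAGE >= 1", [
        ("GENRE == 'COMEDY'", []),
    ]),
])


def filter_tree(df, node, path=()):
    """Yield (full_condition, filtered_frame) for every leaf below node."""
    condition, children = node
    subset = df.query(condition)      # filter only the parent's subset
    path = path + (condition,)
    if not children:                  # leaf: report the accumulated condition
        yield " & ".join(path), subset
        return
    for child in children:
        yield from filter_tree(subset, child, path)


# Made-up sample data, only to show the call.
df_main = pd.DataFrame({
    "CITY": ["Mumbai", "Mumbai", "Mumbai", "Delhi"],
    "LANGUAGE": ["English", "English", "Hindi", "English"],
    "GENRE": ["ACTION", "ROMANCE", "COMEDY", "ACTION"],
    "count_GENRE": [2, 0, 1, 3],
    "count_LANGUAGE": [1, 1, 1, 1],
})

results = dict(filter_tree(df_main, tree))
```

`results` maps each full leaf condition to its filtered frame, which is what `df11`, `df12` and `df21` stand for above. Whether spreading the branches over processes helps is a separate question, since every worker would need its own copy of the parent subset.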
NB: The results need to be saved as csvs.
For example:
df11.to_csv("CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ACTION' & count_GENRE >= 1")
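A condition string contains quotes and characters such as `>` that some file systems reject in file names, so one option is to derive a safer name from it. This is only a sketch: `condition_to_filename` is a hypothetical helper, and the one-row `df11` and the temporary output directory are made up for the example.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd


def condition_to_filename(condition: str) -> str:
    """Hypothetical helper: turn each run of non-alphanumeric characters into '_'."""
    return re.sub(r"[^A-Za-z0-9]+", "_", condition).strip("_") + ".csv"


condition = "CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ACTION' & count_GENRE >= 1"

# Made-up stand-in for the filtered frame.
df11 = pd.DataFrame({
    "CITY": ["Mumbai"],
    "LANGUAGE": ["English"],
    "GENRE": ["ACTION"],
    "count_GENRE": [2],
})

out_dir = Path(tempfile.mkdtemp())               # throwaway directory for the example
out_path = out_dir / condition_to_filename(condition)
df11.to_csv(out_path, index=False)
```

Note that this drops the operators, so two conditions that differ only in `==` versus `>=` would map to the same file name; if that can happen, the operators would need to be encoded as well.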