I have a lot of CSV data, about 40 GB in size, that I need to process (let's call it the "body"). Each file in this body is a CSV with a single column, where each line is a keyword made up of a word or a short phrase, for example:
Dog
Feeding cat
used cars in Brighton
trips to London
.....
This data must be compared against another set of files (about 7 GB in size, which I will call "Removals"); any keyword that appears in Removals must be identified and removed from the body. The data in Removals looks just like the data in the body, that is:
Guns
priceless Ming vases
trips to London
pasta recipes
........
For each file in the body, my current script loads each of the 7 GB of Removals files in turn and filters out any matching keywords. The worker function looks like this:
import glob
import pandas as pd

def thread_worker(file_):
    removal_path = "removal_files"
    allFiles_removals = glob.glob(removal_path + "/*.csv", recursive=True)
    print(allFiles_removals)
    print(file_)
    # Read one body file (a single unnamed column of keywords)
    file_df = pd.read_csv(file_, header=None)
    file_df.columns = ['Keyword']
    # Filter the body file against every removal file in turn
    for removal_file_ in allFiles_removals:
        print(removal_file_)
        removal_df = pd.read_csv(removal_file_, header=None)
        removal_df.columns = ['Keyword']
        removal_keyword_list = removal_df['Keyword'].values.tolist()
        file_df = file_df[~file_df['Keyword'].isin(removal_keyword_list)]
    # Append the surviving keywords to the combined output
    file_df.to_csv('output.csv', index=False, header=False, mode='a')
This works, but it is extremely slow. Is Pandas the right tool for this job, or is there a faster way to process CSV files at this scale?
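One direction I have considered (a sketch only, not my current code) is to load all the Removals keywords into a Python set once, assuming the deduplicated keywords fit in RAM, and then stream each body file line by line, avoiding re-reading the 7 GB for every body file. The directory names and the tiny sample files below are placeholders standing in for the real data:

```python
import glob
import os

# Hypothetical paths used for illustration only
removal_path = "removal_files"
body_path = "body_files"
os.makedirs(removal_path, exist_ok=True)
os.makedirs(body_path, exist_ok=True)

# Tiny sample data standing in for the real 7 GB / 40 GB sets
with open(os.path.join(removal_path, "r1.csv"), "w", encoding="utf-8") as f:
    f.write("trips to London\npasta recipes\n")
with open(os.path.join(body_path, "b1.csv"), "w", encoding="utf-8") as f:
    f.write("Dog\ntrips to London\nused cars in Brighton\n")

# Load every removal keyword into a set ONCE (O(1) membership tests)
removals = set()
for path in glob.glob(removal_path + "/*.csv"):
    with open(path, encoding="utf-8") as f:
        removals.update(line.rstrip("\n") for line in f)

# Stream each body file line by line; write only the survivors
with open("output.csv", "w", encoding="utf-8") as out:
    for path in glob.glob(body_path + "/*.csv"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.rstrip("\n") not in removals:
                    out.write(line)
```

Would this kind of set-based streaming be a better fit than Pandas here, or does Pandas have a comparable bulk approach?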