I have a lot of CSV data, about 40 GB in size, that I need to process (let's call it the "body"). Each file in this body is a CSV with a single column, where each line is a keyword made up of a word or a short phrase, for example:
Dog
Feeding cat
used cars in Brighton
trips to London
.....
This data must be compared against another set of files (about 7 GB in size, which I will call "Removals"); any keyword that appears in Removals must be identified and removed from the body. The data in Removals looks just like the data in the body, that is:
Guns
priceless Ming vases
trips to London
pasta recipes
........
For each file in the body, my current script loads each of the 7 GB of Removals files in turn and filters out any matching keywords. The worker function looks like this:
import glob
import pandas as pd

def thread_worker(file_):
    removal_path = "removal_files"
    allFiles_removals = glob.glob(removal_path + "/*.csv", recursive=True)
    print(allFiles_removals)
    print(file_)
    # Read one body file (a single unnamed column of keywords)
    file_df = pd.read_csv(file_, header=None)
    file_df.columns = ['Keyword']
    # Filter the body file against every removal file in turn
    for removal_file_ in allFiles_removals:
        print(removal_file_)
        removal_df = pd.read_csv(removal_file_, header=None)
        removal_df.columns = ['Keyword']
        removal_keyword_list = removal_df['Keyword'].values.tolist()
        file_df = file_df[~file_df['Keyword'].isin(removal_keyword_list)]
    # Append the surviving keywords to the combined output
    file_df.to_csv('output.csv', index=False, header=False, mode='a')
This works, but it is extremely slow. Is Pandas the right tool for this job, or is there a faster way to process CSV files at this scale?
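One direction I have considered (a sketch only, not my current code) is to load all the Removals keywords into a Python set once, assuming the deduplicated keywords fit in RAM, and then stream each body file line by line, avoiding re-reading the 7 GB for every body file. The directory names and the tiny sample files below are placeholders standing in for the real data:

```python
import glob
import os

# Hypothetical paths used for illustration only
removal_path = "removal_files"
body_path = "body_files"
os.makedirs(removal_path, exist_ok=True)
os.makedirs(body_path, exist_ok=True)

# Tiny sample data standing in for the real 7 GB / 40 GB sets
with open(os.path.join(removal_path, "r1.csv"), "w", encoding="utf-8") as f:
    f.write("trips to London\npasta recipes\n")
with open(os.path.join(body_path, "b1.csv"), "w", encoding="utf-8") as f:
    f.write("Dog\ntrips to London\nused cars in Brighton\n")

# Load every removal keyword into a set ONCE (O(1) membership tests)
removals = set()
for path in glob.glob(removal_path + "/*.csv"):
    with open(path, encoding="utf-8") as f:
        removals.update(line.rstrip("\n") for line in f)

# Stream each body file line by line; write only the survivors
with open("output.csv", "w", encoding="utf-8") as out:
    for path in glob.glob(body_path + "/*.csv"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.rstrip("\n") not in removals:
                    out.write(line)
```

Would this kind of set-based streaming be a better fit than Pandas here, or does Pandas have a comparable bulk approach?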