Low resource usage when using the Python dedupe library

Question

I need to find duplicates in a large dataset, so I am testing the Python dedupe library.

I know it is recommended for small datasets, so I thought a powerful machine might improve performance. I have a machine with 56 GB of RAM, and I ran a test based on "csv_example" on a dataset with 200,000 rows. It works, but memory usage is very low, and so is CPU usage.

The blocking stage seems to take too much time:

INFO:dedupe.blocking:10000, 110.6458142 seconds
INFO:dedupe.blocking:20000, 300.6112282 seconds
INFO:dedupe.blocking:30000, 557.1010122 seconds
INFO:dedupe.blocking:40000, 915.3087222 seconds

Can someone help me improve this, or tell me whether there is a library or setting that makes the program use more of the available resources?

python pyspark record-linkage python-dedupe

mjimcua Jun 01 '17 at 13:15

1 answer

fgregg · Accepted Answer · 2017-06-12T00:52:01+0000

What version of dedupe are you using? As of 1.6.8, it should easily handle a record set of this size.
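One way to check the installed version is sketched below. It assumes Python 3.8+ (for `importlib.metadata`) and that the package was installed under the distribution name `dedupe`; the helper names `installed_version` and `at_least` are made up for this illustration, and the comparison only looks at purely numeric version parts.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def at_least(found, minimum):
    """Compare dotted versions numerically, ignoring non-numeric parts."""
    def parse(v):
        return tuple(int(part) for part in v.split(".") if part.isdigit())
    return parse(found) >= parse(minimum)


found = installed_version("dedupe")  # assumed distribution name
if found is None:
    print("dedupe is not installed in this environment")
elif at_least(found, "1.6.8"):
    print(f"dedupe {found} is at or above 1.6.8")
else:
    print(f"dedupe {found} is older than 1.6.8; consider upgrading")
```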

However, the general guideline is that when you run into memory problems, you should switch to doing the blocking in a database such as Postgres.

(I am the main author of dedupe.)
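To illustrate what moving this step into a database means, here is a minimal sketch of the idea, not dedupe's own API. It swaps in the standard-library `sqlite3` module for Postgres so that it runs anywhere, and the toy records, the `block_keys` function, and the `blocking_map` table name are all hypothetical stand-ins (dedupe learns its blocking rules during training). The point is only that the (block key, record id) pairs live in an indexed table and the candidate pairs come from a self-join rather than from Python data structures held in memory.

```python
import sqlite3

# Hypothetical toy records; a real run would load these from the CSV.
records = {
    1: {"name": "Acme Corp", "zip": "60601"},
    2: {"name": "ACME Corporation", "zip": "60601"},
    3: {"name": "Bolt Ltd", "zip": "10001"},
    4: {"name": "Bolt Limited", "zip": "10001"},
    5: {"name": "Cedar Inc", "zip": "94105"},
}


def block_keys(record):
    """Hand-written stand-in for learned blocking rules (an assumption)."""
    yield "zip:" + record["zip"]
    yield "name3:" + record["name"][:3].lower()


conn = sqlite3.connect(":memory:")  # SQLite used here in place of Postgres
conn.execute("CREATE TABLE blocking_map (block_key TEXT, record_id INTEGER)")
conn.executemany(
    "INSERT INTO blocking_map VALUES (?, ?)",
    ((key, rid) for rid, rec in records.items() for key in block_keys(rec)),
)
# The index lets the database group records by block key efficiently.
conn.execute("CREATE INDEX idx_block_key ON blocking_map (block_key)")

# Records sharing at least one block key become candidate pairs.
candidate_pairs = conn.execute(
    """
    SELECT DISTINCT a.record_id, b.record_id
    FROM blocking_map AS a
    JOIN blocking_map AS b
      ON a.block_key = b.block_key AND a.record_id < b.record_id
    ORDER BY a.record_id, b.record_id
    """
).fetchall()

print(candidate_pairs)
```

Only the pairs returned by the join would then need to be scored, which keeps the comparison set far smaller than all possible record pairs.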
