My code processes ~1000 HTML files, extracts the relevant information from each, and stores it in a MySQL TEXT field (a full run usually takes quite a while). I am looking for a way to prevent duplicate entries in the database.
My first idea is to add a HASH column to the table (probably MD5), fetch the list of existing hashes at the start of each run, and check each new item against it before inserting into the database.
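Roughly what I mean, as a minimal sketch (the `pages` table, `content_hash` column, and connection details are all placeholders):

```php
<?php
// Minimal sketch of the hash idea; assumes a CHAR(32) `content_hash`
// column on a hypothetical `pages` table.
$pdo = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass');

// Load all existing hashes once at the start of the run.
$known = $pdo->query('SELECT content_hash FROM pages')
             ->fetchAll(PDO::FETCH_COLUMN);
$known = array_flip($known); // key lookups instead of in_array() scans

$insert = $pdo->prepare(
    'INSERT INTO pages (content, content_hash) VALUES (?, ?)'
);

foreach ($htmlContents as $content) { // $htmlContents: the extracted texts
    $hash = md5($content);
    if (!isset($known[$hash])) {
        $insert->execute(array($content, $hash));
        $known[$hash] = true;
    }
}
// (A UNIQUE index on content_hash would let MySQL enforce this on its
// own, e.g. via INSERT IGNORE, without loading the list up front.)
```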
The second idea is to store the file length (bytes, characters, or whatever), index it, check for entries with the same length, and only double-check the actual contents when a length collision occurs.
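A sketch of that, again with made-up names, assuming an indexed INT column `content_length` on the same table:

```php
<?php
// Length-first duplicate check: only rows whose indexed length matches
// are fetched and compared byte for byte.
function isDuplicate(PDO $pdo, $content)
{
    $stmt = $pdo->prepare(
        'SELECT content FROM pages WHERE content_length = ?'
    );
    $stmt->execute(array(strlen($content)));

    // Lengths collide fairly often, so verify the actual bytes.
    while (($existing = $stmt->fetchColumn()) !== false) {
        if ($existing === $content) {
            return true;
        }
    }
    return false;
}
```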
I'm not sure which of these performs best. Perhaps there is a better way altogether?
If there were an efficient way to detect files that are, say, 95% identical, that would be ideal, but I doubt there is?
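For illustration, PHP's `similar_text()` does report a similarity percentage, but it is expensive on large strings and would need pairwise comparisons against every stored file, so I assume it's impractical at this scale:

```php
<?php
// Illustration only: similar_text() computes a similarity percentage,
// but it's far too slow to run against ~1000 large TEXT blobs.
$stored = file_get_contents('stored_page.html'); // placeholder paths
$fresh  = file_get_contents('fresh_page.html');
similar_text($stored, $fresh, $percent);
if ($percent >= 95.0) {
    echo "Near-duplicate: {$percent}% similar\n";
}
```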
Thanks for any help!
BTW I am using PHP5 / Kohana
EDIT:
An idea occurred to me for the similarity check: I could read all the alphanumeric characters and record how often each one appears,
e.g. a file would reduce to a signature like 1a, 7b, 10c, ... (one 'a', seven 'b's, ten 'c's, and so on).
The only potential problem is the number of distinct characters to track (62 alphanumerics: a-z, A-Z, 0-9).
I believe false positives would still be rare.
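A quick sketch of what I mean (names are placeholders):

```php
<?php
// Character-count signature over the alphanumeric characters only.
function alnumSignature($content)
{
    $counts = array();
    // count_chars(..., 1) returns byte-value => count for bytes that occur.
    foreach (count_chars($content, 1) as $byte => $count) {
        $char = chr($byte);
        if (ctype_alnum($char)) {
            $counts[$char] = $count;
        }
    }
    ksort($counts); // canonical order: same counts => same signature

    $signature = '';
    foreach ($counts as $char => $count) {
        $signature .= $count . $char; // e.g. "1a7b10c..."
    }
    return $signature;
}

// Files whose signatures match would then be candidates for a full
// byte-for-byte comparison.
```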
good idea / bad idea?