My code processes ~1000 HTML files, extracts the relevant information from each, and stores it in a MySQL TEXT field (a full run usually takes quite a while). I am looking for a way to prevent duplicate entries in the database.
My first idea is to add a HASH column to the table (probably MD5), fetch the list of existing hashes at the start of each run, and check each new item against it before inserting into the database.
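Roughly what I mean, as a minimal sketch (the `pages` table, `content_hash` column, and connection details are all placeholders):

```php
<?php
// Minimal sketch of the hash idea; assumes a CHAR(32) `content_hash`
// column on a hypothetical `pages` table.
$pdo = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass');

// Load all existing hashes once at the start of the run.
$known = $pdo->query('SELECT content_hash FROM pages')
             ->fetchAll(PDO::FETCH_COLUMN);
$known = array_flip($known); // key lookups instead of in_array() scans

$insert = $pdo->prepare(
    'INSERT INTO pages (content, content_hash) VALUES (?, ?)'
);

foreach ($htmlContents as $content) { // $htmlContents: the extracted texts
    $hash = md5($content);
    if (!isset($known[$hash])) {
        $insert->execute(array($content, $hash));
        $known[$hash] = true;
    }
}
// (A UNIQUE index on content_hash would let MySQL enforce this on its
// own, e.g. via INSERT IGNORE, without loading the list up front.)
```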
The second idea is to store the file length (bytes, characters, or whatever), index it, check for entries with the same length, and only double-check the actual contents when a length collision occurs.
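A sketch of that, again with made-up names, assuming an indexed INT column `content_length` on the same table:

```php
<?php
// Length-first duplicate check: only rows whose indexed length matches
// are fetched and compared byte for byte.
function isDuplicate(PDO $pdo, $content)
{
    $stmt = $pdo->prepare(
        'SELECT content FROM pages WHERE content_length = ?'
    );
    $stmt->execute(array(strlen($content)));

    // Lengths collide fairly often, so verify the actual bytes.
    while (($existing = $stmt->fetchColumn()) !== false) {
        if ($existing === $content) {
            return true;
        }
    }
    return false;
}
```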
I'm not sure which of these performs best. Perhaps there is a better way altogether?
If there were an efficient way to detect files that are, say, 95% identical, that would be ideal, but I doubt there is?
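For illustration, PHP's `similar_text()` does report a similarity percentage, but it is expensive on large strings and would need pairwise comparisons against every stored file, so I assume it's impractical at this scale:

```php
<?php
// Illustration only: similar_text() computes a similarity percentage,
// but it's far too slow to run against ~1000 large TEXT blobs.
$stored = file_get_contents('stored_page.html'); // placeholder paths
$fresh  = file_get_contents('fresh_page.html');
similar_text($stored, $fresh, $percent);
if ($percent >= 95.0) {
    echo "Near-duplicate: {$percent}% similar\n";
}
```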
Thanks for any help!
BTW I am using PHP5 / Kohana
EDIT:
An idea occurred to me for the similarity check: I could read all the alphanumeric characters and record how often each one appears,
e.g. a file would reduce to a signature like 1a, 7b, 10c, ... (one 'a', seven 'b's, ten 'c's, and so on).
The only potential problem is the number of distinct characters to track (62 alphanumerics: a-z, A-Z, 0-9).
I believe false positives would still be rare.
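A quick sketch of what I mean (names are placeholders):

```php
<?php
// Character-count signature over the alphanumeric characters only.
function alnumSignature($content)
{
    $counts = array();
    // count_chars(..., 1) returns byte-value => count for bytes that occur.
    foreach (count_chars($content, 1) as $byte => $count) {
        $char = chr($byte);
        if (ctype_alnum($char)) {
            $counts[$char] = $count;
        }
    }
    ksort($counts); // canonical order: same counts => same signature

    $signature = '';
    foreach ($counts as $char => $count) {
        $signature .= $count . $char; // e.g. "1a7b10c..."
    }
    return $signature;
}

// Files whose signatures match would then be candidates for a full
// byte-for-byte comparison.
```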
good idea / bad idea?