Combining duplicates in a list? - The question is more complicated than it seems

So, I have a huge list of records in a database (MySQL).

I use Python and Django when building my web application.

This is the basic Django model that I use:

    from django.db import models

    class DJ(models.Model):
        alias = models.CharField(max_length=255)
        # other fields...

In my DB I now have duplicates,

e.g. Above and Beyond, above & beyond, above, DJ Above and Beyond, Disk Jokey Above and Beyond, ...

This is a problem because it eats a big hole in my DB and therefore in my application.


I am sure that other people have encountered this problem and thought about it.

My ideas are as follows:

  • Create a rule set that prevents a new record from being created?

    e.g. "DJ Above and Beyond" cannot be created because "Above and Beyond" is already in the DB

  • Link these aliases to each other in some way?

    e.g. tie "DJ Above and Beyond" to "Above and Beyond"


I literally have no clue how to go about this; even if someone could just point me in a direction, that would be very helpful.

Any help would be greatly appreciated! Thanks guys.

+4
9 answers

I think you could do something based on Levenshtein distance, but there is no real way to do this automatically without building a fairly complex rule-based system.

If you cannot define a rule system that can decide, for any x and y, whether x is a duplicate of y, you will have to deal with this in a fuzzy, human way.

Stack Overflow has a pretty decent way of handling this: warn users when something might be a duplicate, based on something like Levenshtein distance (and maybe some kind of rule engine), and then allow a subset of your users to merge items as duplicates if other users ignore the warnings.
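A minimal sketch of the warning step, assuming plain Levenshtein distance over lower-cased aliases. The function names and the `max_distance=3` threshold are illustrative choices, not something from the question:

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def possible_duplicates(new_alias, existing_aliases, max_distance=3):
    """Return existing aliases close enough to warn the user about.

    max_distance is an assumed threshold; tune it against real data.
    """
    needle = new_alias.lower()
    return [
        alias for alias in existing_aliases
        if levenshtein(needle, alias.lower()) <= max_distance
    ]
```

For example, `possible_duplicates("DJ Above and Beyond", ["Above and Beyond", "Armin van Buuren"])` returns `["Above and Beyond"]`. The result is only a list of candidates to show the user; the merge decision stays with a human.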

+4

From the examples you gave, it looks like you have more of a natural language problem than an exact matching problem. Given that natural language matching is inherently inexact, you are unlikely to come up with a perfect solution.

  • String distance doesn't really work, since strings that are algorithmically close may not be semantically close (for example, "DJ Above and Beyond" should match "Above and Beyond", but not "DJ Above and Beyond 2", which is closer by Levenshtein distance).
  • Some cheap alternatives to full natural language analysis are Soundex, which matches on phonetic sound, and stemming, which removes prefixes and suffixes to normalize on word stems. I suppose you could create a linked list of word stems, but that would not be very accurate.
  • If this is a program that interacts with the user, you can echo back to the user, for example: "Is this one of the ones you meant to enter?"
  • You can normalize the records in some way, so that different records map to the same normalized value (for example, normalize case, "&" → "AND", etc.; some of the suggestions above may be a step in that direction), to find near misses or to map multiple inputs to a single value.

One caveat: my experience only applies to English. For example, an English Porter stemmer would not recognize any French names you have in there.
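The normalization idea from the last bullet can be sketched roughly as follows. The prefix list and the helper names are assumptions made up for this illustration (including the "disk jokey" spelling taken from the question), and a real rule set would have to grow from the actual data:

```python
import re
from collections import defaultdict

# Hypothetical noise prefixes; extend after looking at real records.
NOISE_PREFIXES = ("disk jokey ", "disk jockey ", "disc jockey ", "dj ")


def normalize_alias(alias: str) -> str:
    """Map spelling variants of an alias to one comparable key."""
    text = alias.lower()
    text = text.replace("&", " and ")
    text = re.sub(r"[^\w\s]+", " ", text)  # drop punctuation, keep letters/digits
    text = " ".join(text.split())          # collapse runs of whitespace
    for prefix in NOISE_PREFIXES:
        if text.startswith(prefix):
            text = text[len(prefix):]
            break
    return text


def group_by_normalized(aliases):
    """Group raw aliases that share the same normalized key."""
    groups = defaultdict(list)
    for alias in aliases:
        groups[normalize_alias(alias)].append(alias)
    return dict(groups)
```

With this, "Above and Beyond", "above & beyond" and "DJ Above and Beyond" all collapse to the key `above and beyond`, while "Above and Beyond 2" stays separate. Groups with more than one member are candidates for a merge, not proof of a duplicate.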

+3

I think this is more of a social problem than a programming problem. Any software solution based on natural language processing like this would be buggy and error-prone. It is very difficult to tell close but legitimately distinct entries apart from the unwanted duplicates you are talking about.

As Dominic mentioned, Stack Overflow's tagging system is a pretty good model for this. It gives users hints that encourage them to reuse existing tags where appropriate (drop-down suggestions as the user types), it lets trusted users retag individual questions, and it lets moderators do mass retagging.

This is really a process that has to have a human involved.

+2

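This is not a complete solution, but one thought:

    class DJ(models.Model):
        # other fields, no alias!
        pass


    class DJAlias(models.Model):
        # on_delete is required by Django 2.0+; CASCADE is an assumed choice
        dj = models.ForeignKey(DJ, on_delete=models.CASCADE)
        # assumed field holding the alias text itself
        alias = models.CharField(max_length=255)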

This will allow you to have multiple aliases for the same dj.

But you will still need a good way to make sure aliases are added to the right dj. See Dominic's post.

But if you check the alias against several other aliases pointing to the same dj, the algorithms may work better.
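As a rough illustration of that last point, here is a sketch that scores a candidate alias against every known alias of each dj and keeps the best hit. It works on a plain dict instead of the Django models, uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure, and the `0.7` threshold is an arbitrary assumption:

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_matching_dj(candidate, aliases_by_dj, threshold=0.7):
    """Return (dj_id, score) for the dj whose aliases fit the candidate best.

    aliases_by_dj maps a dj id to the list of aliases already linked to it.
    dj_id is None when no alias reaches the (assumed) threshold.
    """
    best_id, best_score = None, 0.0
    for dj_id, aliases in aliases_by_dj.items():
        # A dj scores as well as its closest alias does.
        score = max((similarity(candidate, alias) for alias in aliases), default=0.0)
        if score > best_score:
            best_id, best_score = dj_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score
```

The more aliases a dj has accumulated, the more chances a new spelling has to land near one of them, which is why matching against the whole alias set may work better than matching against a single name.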

+1

You can try to solve this problem only for this specific case (replacing "&" with "and" and "DJ" with "Disk Jokey", or ignoring "DJ", etc.). If your table contains only DJs, you can create a bunch of such rules. If your table contains more varied data, you will have to go with a more structured approach. Could you give a sample of your data set?

+1

First of all, the programming problem (NLP, etc.) is certainly interesting. But, as already mentioned, it is an excessive pursuit of perfection.

The other point of view, as mentioned, is the "social" one: who enters the data, who reviews it, how long does that take, and how correct does it have to be? This naming problem reminds me of the great musicbrainz.org project. Either your site should "just work" or you prefer to follow standards; in the latter case I would follow the MusicBrainz project, where this is already done. For example, see their entry for Above and Beyond: they store specific aliases and use them to match user searches. http://musicbrainz.org/show/artist/aliases.html?artistid=58438 Also check the Artist_Alias wiki page.

The data model is worth a look, and there are even several API bindings for syncing the data, including in Python.

+1

How about changing the model so that the "alias" is a list of keys into another table that looks like this (skipping small words like "and", etc.): 1 => Above; 2 => Beyond; 3 => Disk; 4 => Jokey;

Then, when you want to insert a new record, simply check how many significant words from the title are already in the table and which existing model objects they match. If it is more than 50% (for example), you probably have a match, and you can show the list to the visitor and ask: "Did you mean one of these?"
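A small sketch of that word-overlap check, done in memory rather than against a keyword table. The stop-word set, the helper names and the tokenization are assumptions for illustration; only the "more than 50%" rule comes from the suggestion above:

```python
import re

# Assumed list of "small words" to skip; adjust for the real data.
STOP_WORDS = {"a", "an", "and", "the", "of"}


def significant_words(title: str) -> set:
    """Lower-case the title and keep only words that are not stop words."""
    words = re.findall(r"\w+", title.lower())
    return {word for word in words if word not in STOP_WORDS}


def word_overlap(new_title: str, existing_title: str) -> float:
    """Fraction of the new title's significant words found in the existing one."""
    new_words = significant_words(new_title)
    if not new_words:
        return 0.0
    shared = new_words & significant_words(existing_title)
    return len(shared) / len(new_words)


def suggest_matches(new_title, existing_titles, min_overlap=0.5):
    """Return existing titles sharing more than min_overlap of the new words."""
    return [
        title for title in existing_titles
        if word_overlap(new_title, title) > min_overlap
    ]
```

Here "DJ Above and Beyond" shares 2 of its 3 significant words with "Above and Beyond", so it is suggested. "Disk Jokey Above and Beyond" shares exactly 2 of 4, which is not more than 50%, so the threshold clearly needs tuning against real records.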

+1

It seems fuzzywuzzy is perfect for your needs.

This article explains why it was created, which matches your requirements very closely: mainly to handle situations in which the same thing has been named in slightly different ways:

One of our most frustrating problems is figuring out whether two ticket listings are for the same real-life event (i.e. without involving our army of interns) ....
To achieve this, we built a library of "fuzzy" string matching routines to help us.

0

If you are only after artist names, or more generally things associated with the artists, you would be much better off using the last.fm or Echo Nest APIs, since they already have a huge rule set and a huge database to draw on.

0

Source: https://habr.com/ru/post/1299007/