List similarity in Python - comparing customers according to their features

I have a list of customers and features in the following format:

UserID, Feature1, Feature2, Feature3, Feature4 

So, I have a list - "Customers" - and it looks like this:

 [
     ['975676924', '1345207523', '-1953633084', '-2041119774', '587903155'],
     ['1619201613', '-1384105381', '1433106581', '1445361759', '587903155'],
     ['-1470352544', '-1068707556', '-1002282042', '-563691616', '587903155'],
     ['-1958275692', '-739953679', '69580355', '-481818422', '587903155'],
     ['1619201613', '-739953679', '-1002282042', '-481818422', '587903155'],
 ]

Each row represents a transaction with specific features. The first element of each row is the UserID (customer) who made the transaction. Therefore, Customers[1] gives the second row, and Customers[1][0] gives the user ID of that row (1619201613).

User IDs can repeat on other rows (new transactions), since returning customers are added to the list. For example, note that Customers[4][0] gives the same user ID (1619201613), but the features in Customers[4] differ from those in Customers[1]; that is, the customer returned and bought another product with different features.

So, the main question: how do I efficiently calculate the similarity between two distinct customers in my list?
I think the question really splits into two separate tasks:

  • Grouping by user ID. So, the first question: how do I efficiently collect all the features of one UserID, so that, for example, Customers[1] and Customers[4] are merged into one new row (a new list?) of the form:
    ['1619201613', '-1384105381', '1433106581', '1445361759', '587903155', '-739953679', '-1002282042', '-481818422']

  • Finding similarities between customers through their transactions. So, the second question: how do I efficiently compute a similarity score in [0,1] that tells me whether two distinct customers are interested in the same things?


PS. Some additional notes:

  • The order of the features does not matter, since they are hashed and uniquely identified.
  • The multiplicity of the elements does not matter, i.e. it doesn't matter whether the same feature appears two or three times for the same UserID.
  • The end result of all this is a customer network where user IDs are nodes and the edges between them are weighted by a similarity score.
  • I prefer cosine similarity or the Jaccard index, but I am open to alternatives.
  • I need speed and scalability, even at the cost of some accuracy (within reason, of course).
  • I have carefully studied previous questions; for example, the following do not apply: Calculation of the similarity of the two lists ; Python Check multiple lists for similarities ; How to calculate the similarity between function lists?
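
For reference, the two measures named in the notes above can be sketched for feature sets as follows. This is only an illustration under stated assumptions: the function names jaccard and cosine are made up for this example, each customer's features are treated as a plain set (so order and multiplicity are ignored), and two empty sets are given a score of 0.

```python
import math


def jaccard(a, b):
    """Jaccard index of two feature collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets count as not similar
    return len(a & b) / len(union)


def cosine(a, b):
    """Cosine similarity of two feature collections seen as binary vectors."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0  # assumption: an empty set is similar to nothing
    return len(a & b) / math.sqrt(len(a) * len(b))
```

Both functions return a value in [0,1]; for sets, the cosine score is never smaller than the Jaccard score.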
2 answers

This answers the first of the two questions (grouping by user ID):

 import collections

 raw_data = [
     ['975676924', '1345207523', '-1953633084', '-2041119774', '587903155'],
     ['1619201613', '-1384105381', '1433106581', '1445361759', '587903155'],
     ['-1470352544', '-1068707556', '-1002282042', '-563691616', '587903155'],
     ['-1958275692', '-739953679', '69580355', '-481818422', '587903155'],
     ['1619201613', '-739953679', '-1002282042', '-481818422', '587903155'],
 ]

 # map each user ID to the features from all of its rows
 data = collections.defaultdict(list)
 for line in raw_data:
     data[line[0]].extend(line[1:])

Now you have a dictionary keyed by user ID:

 defaultdict(<type 'list'>, {
     '1619201613': ['-1384105381', '1433106581', '1445361759', '587903155', '-739953679', '-1002282042', '-481818422', '587903155'],
     '-1470352544': ['-1068707556', '-1002282042', '-563691616', '587903155'],
     '975676924': ['1345207523', '-1953633084', '-2041119774', '587903155'],
     '-1958275692': ['-739953679', '69580355', '-481818422', '587903155']})

You can get the desired list by rebuilding it from the dictionary:

 data_list = [[key] + value for key, value in data.items()] 
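
Since the question says that repeated features do not matter, a variant that collects the features into sets may be closer to what is wanted, because a feature seen in several transactions is then stored only once. The following is a minimal sketch under that assumption; the helper name group_features is made up for this example, and it uses Python 3 unpacking syntax.

```python
import collections


def group_features(rows):
    """Map each user ID (first element of a row) to the set of its features."""
    grouped = collections.defaultdict(set)
    for user_id, *features in rows:
        grouped[user_id].update(features)  # a set drops repeated features
    return dict(grouped)


rows = [
    ['1619201613', '-1384105381', '1433106581', '1445361759', '587903155'],
    ['1619201613', '-739953679', '-1002282042', '-481818422', '587903155'],
]
grouped = group_features(rows)
print(len(grouped['1619201613']))  # number of distinct features for this user
```

Sets have no order, so if a list per user is needed it has to be rebuilt afterwards, for example with [key] + sorted(value).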

Step 1: Group the features by user, assuming your list is named l

 summary = {}  # maps each user ID to the list of its features
 for entry in l:
     if entry[0] in summary:
         summary[entry[0]] += entry[1:]
     else:
         summary[entry[0]] = entry[1:]
 # remove duplicate features and convert them to int
 for s in summary:
     summary[s] = [int(x) for x in set(summary[s])]

Step 2: Create the network, which is really a two-dimensional array, and calculate the similarity between each pair of users.

 # the number of rows and columns of this array
 cnt = len(summary)
 # build each row separately; [[0] * cnt] * cnt would repeat one shared row
 network = [[0] * cnt for _ in range(cnt)]
 index = list(summary)
 for x, xvalue in enumerate(index):
     for y, yvalue in enumerate(index):
         common = len(set(summary[xvalue]) & set(summary[yvalue]))
         network[x][y] = common

Now network is a two-dimensional array containing the number of features shared by each pair of UserIDs.

For example, given the list:

 [['100', '2', '3','4'], ['110', '2', '5', '6'], ['120', '6', '3', '4']] 

the network is:

 [[3, 1, 2], [1, 3, 1], [2, 1, 3]] 

Some code is taken from this question.
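
The counts of shared features are not yet scores in [0,1]. One possible way to get there, sketched here under assumptions, is to divide each count by the size of the union, which gives the Jaccard index, and to keep the result as a list of weighted edges. The function name jaccard_edges and its threshold parameter are made up for this example, and the input is assumed to map each user ID to a set of features.

```python
import itertools


def jaccard_edges(features_by_user, threshold=0.0):
    """Return (user_a, user_b, score) for every pair scoring above threshold."""
    edges = []
    for (u, fu), (v, fv) in itertools.combinations(features_by_user.items(), 2):
        common = len(fu & fv)
        union = len(fu) + len(fv) - common
        score = common / union if union else 0.0
        if score > threshold:
            edges.append((u, v, score))
    return edges


# the same three users as in the small example of this answer
features_by_user = {
    '100': {'2', '3', '4'},
    '110': {'2', '5', '6'},
    '120': {'6', '3', '4'},
}
edges = jaccard_edges(features_by_user)
for edge in edges:
    print(edge)
```

This still compares every pair of users, so the work grows quadratically with the number of users; for large data some form of candidate filtering would probably be needed first.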


Source: https://habr.com/ru/post/1483341/

