List Similarity in Python - Comparing Customers According to Their Features

Question

I have a list of customers and features in the following format:

UserID, Feature1, Feature2, Feature3, Feature4

So I have a list, "Customers", that looks like this:

    [
        ['975676924', '1345207523', '-1953633084', '-2041119774', '587903155'],
        ['1619201613', '-1384105381', '1433106581', '1445361759', '587903155'],
        ['-1470352544', '-1068707556', '-1002282042', '-563691616', '587903155'],
        ['-1958275692', '-739953679', '69580355', '-481818422', '587903155'],
        ['1619201613', '-739953679', '-1002282042', '-481818422', '587903155']
    ]

Each row represents a transaction with specific features. The first element of each row is the UserID (the customer) that performed the transaction. Therefore, Customers[1] gives the second row, and Customers[1][0] gives the user ID of that row (1619201613).

User IDs can be repeated in other rows (new transactions), because returning customers are added to the list again. For example, note that Customers[4][0] gives the same user ID (1619201613), but the features in Customers[4] do not match those in Customers[1]; that is, the customer came back and bought another product with different features.

So, the main question: how can I efficiently calculate the similarity between two distinct customers in my list?
I think the question should really be split into two separate tasks:

Grouping individual user IDs. The first question: how can I efficiently collect all the distinct features of one UserID, so that, for example, Customers[1] and Customers[4] are merged into one new row (a new list?) of the form:
    ['1619201613', '-1384105381', '1433106581', '1445361759', '587903155', '-739953679', '-1002282042', '-481818422']
Finding similarities between customers through their transactions. The second question: how can I efficiently compute a similarity score in [0, 1] that tells me whether two distinct customers are interested in the same things?

PS. Some additional notes:

The order of the features does not matter, since they are hashed and uniquely identified.
The multiplicity of the elements does not matter, i.e. it does not matter whether the same feature appears two or three times for the same UserID.
The end goal of all this is to build a customer network in which user IDs are the nodes and the edges between them are weighted by a similarity score.
I would prefer cosine similarity or the Jaccard index, but I am open to alternatives.
I need speed and scalability, even at the cost of some accuracy (within reason, of course).
I have looked carefully at previous questions; for example, the following are not relevant: Calculation of the similarity of the two lists; Python Check multiple lists for similarities; How to calculate the similarity between function lists?

+4

python list machine-learning similarity

mmScript May 29 '13 at 11:26

2 answers

Step 1: Group the rows by user ID, assuming your list is named l

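    summary = {}  # maps each user ID to its collected features
    for entry in l:
        # test membership with "in"; summary[entry[0]] raises KeyError for a new ID
        if entry[0] in summary:
            summary[entry[0]] += entry[1:]
        else:
            summary[entry[0]] = entry[1:]

    # remove duplicate features and convert them to integers
    for s in summary:
        summary[s] = [int(x) for x in set(summary[s])]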

Step 2: Create a network, in fact a two-dimensional array, and calculate the similarities between different users.

    # the number of rows and columns of this array
    cnt = len(summary)
    # build each row separately; [[0] * cnt] * cnt would repeat one shared row
    network = [[0] * cnt for _ in range(cnt)]
    index = list(summary)
    for x, xvalue in enumerate(index):
        for y, yvalue in enumerate(index):
            common = len(set(summary[xvalue]) & set(summary[yvalue]))
            network[x][y] = common

Now network is a two-dimensional array containing the number of features shared by each pair of user IDs.

For example, if your list is:

 [['100', '2', '3','4'], ['110', '2', '5', '6'], ['120', '6', '3', '4']]

then the network is:

 [[3, 1, 2], [1, 3, 1], [2, 1, 3]]

Some code is taken from this question.
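The matrix above holds raw overlap counts, while the question asks for a score in [0, 1]. One way to get there, sketched here rather than taken from the original answer, is to divide each overlap by the size of the union, which gives the Jaccard index. The helper name jaccard is hypothetical, and returning 0.0 for two empty feature sets is an assumption:

```python
def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty feature sets count as dissimilar
    return len(a & b) / len(union)


# Grouped features for the small example above (user ID -> features).
summary = {
    '100': ['2', '3', '4'],
    '110': ['2', '5', '6'],
    '120': ['6', '3', '4'],
}

users = list(summary)
# similarity[x][y] is the Jaccard index between users[x] and users[y].
similarity = [[jaccard(summary[u], summary[v]) for v in users] for u in users]
```

With this example, users '100' and '120' share two of four distinct features, so their score is 0.5, and every user has a score of 1.0 with itself.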

+1

Roger May 29 '13 at 12:43

Mike müller · Accepted Answer · 2013-05-29T12:55:07+0000

This answers the first of your two questions:

    import collections

    raw_data = [
        ['975676924', '1345207523', '-1953633084', '-2041119774', '587903155'],
        ['1619201613', '-1384105381', '1433106581', '1445361759', '587903155'],
        ['-1470352544', '-1068707556', '-1002282042', '-563691616', '587903155'],
        ['-1958275692', '-739953679', '69580355', '-481818422', '587903155'],
        ['1619201613', '-739953679', '-1002282042', '-481818422', '587903155']
    ]

    # collect the features of every row under the row's user ID
    data = collections.defaultdict(list)
    for line in raw_data:
        data[line[0]].extend(line[1:])

Now you have a dictionary keyed by user ID:

    defaultdict(<type 'list'>, {
        '1619201613': ['-1384105381', '1433106581', '1445361759', '587903155', '-739953679', '-1002282042', '-481818422', '587903155'],
        '-1470352544': ['-1068707556', '-1002282042', '-563691616', '587903155'],
        '975676924': ['1345207523', '-1953633084', '-2041119774', '587903155'],
        '-1958275692': ['-739953679', '69580355', '-481818422', '587903155']})

You get the desired list by prepending each key to its values:

 data_list = [[key] + value for key, value in data.items()]
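The grouping above still contains repeated features (for example, '587903155' appears twice for '1619201613') and does not yet address the similarity score. The following is a minimal sketch, not part of the original answer, that groups into sets instead of lists and then scores every pair of users. The names features, jaccard, cosine and edges are hypothetical; the cosine formula assumes each user is a binary feature vector, and both functions return 0.0 for empty sets by assumption:

```python
import collections
import itertools
import math

raw_data = [
    ['975676924', '1345207523', '-1953633084', '-2041119774', '587903155'],
    ['1619201613', '-1384105381', '1433106581', '1445361759', '587903155'],
    ['-1470352544', '-1068707556', '-1002282042', '-563691616', '587903155'],
    ['-1958275692', '-739953679', '69580355', '-481818422', '587903155'],
    ['1619201613', '-739953679', '-1002282042', '-481818422', '587903155'],
]

# Group features per user ID; a set drops repeated features automatically.
features = collections.defaultdict(set)
for user_id, *row_features in raw_data:
    features[user_id].update(row_features)


def jaccard(a, b):
    """Jaccard index of two sets: shared features over all distinct features."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(a, b):
    """Cosine similarity of two sets viewed as binary (0/1) feature vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Weighted edge list of the customer network: (user, user) -> similarity.
edges = {
    (u, v): jaccard(features[u], features[v])
    for u, v in itertools.combinations(features, 2)
}
```

Here edges[('1619201613', '-1958275692')] is 0.375, because the two users share 3 of 8 distinct features. Comparing every pair grows quadratically with the number of users, so for large data an approximate technique would likely be needed; that is outside the scope of this sketch.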
