Optimize the algorithm to create a list of items evaluated together in Python

Question

Optimize the algorithm to create a list of items evaluated together in Python

taking into account the list of purchase events (customer_id, item)

1-hammer
1-screwdriver
1-nails
2-hammer
2-nails
3-screws
3-screwdriver
4-nails
4-screws

I am trying to create a data structure that reports how many times an item has been purchased with another item. I didn’t buy it at the same time, but I bought it, since I started saving data. the result will look like

{
       hammer : {screwdriver : 1, nails : 2}, 
  screwdriver : {hammer : 1, screws : 1, nails : 1}, 
       screws : {screwdriver : 1, nails : 1}, 
        nails : {hammer : 1, screws : 1, screwdriver : 1}
}

indicating that the hammer was bought twice with nails (person 1.3) and with a screwdriver once (person 1), the screws were bought with a screwdriver once (person 3), etc.

my current approach

users = dict, where userid is the key, and the list of items purchased is the value

usersForItem = dict, where itemid is the key, and the list of users who bought the item is the value

userlist = temporary list of users who rated the current item

pseudo:
for each event(customer,item)(sorted by item):
  add user to users dict if not exists, and add the items
  add item to items dict if not exists, and add the user
----------

for item,user in rows:

  # add the user to the users dict if they don't already exist.
  users[user]=users.get(user,[])

  # append the current item_id to the list of items rated by the current user
  users[user].append(item)

  if item != last_item:
    # we just started a new item which means we just finished processing an item
    # write the userlist for the last item to the usersForItem dictionary.
    if last_item != None:
      usersForItem[last_item]=userlist

    userlist=[user]

    last_item = item
    items.append(item)
  else:
    userlist.append(user)

usersForItem[last_item]=userlist

, 2 - . , . , userForItem , , , , . , - , ( ), Python.

relatedItems = {}
for key,listOfUsers in usersForItem.iteritems():
  relatedItems[key]={}
  related=[]

  for ux in listOfReaders:
    for itemRead in users[ux]:
      if itemRead != key:
        if itemRead not in related:
          related.append(itemRead)
        relatedItems[key][itemRead]= relatedItems[key].get(itemRead,0) + 1    

  calc jaccard/tanimoto similarity between relatedItems[key] and its values

, ? , , .

edit: , , , . .

+3

optimization python algorithm similarity

Neil Kodner 24 . '10 12:09

4

? , , .. ?

. , .

0, 1, , , , , .

( 5000) 0s 1s, , , !

, , dot- , .

:

0s 1s , .

5000 79 64- .

, , 1s, .

, , , 1, .

( , python): http://graphics.stanford.edu/~seander/bithacks.html

, :

79 64- .
.
, , , , dot-, .

.

+3

Aryabhatta 24 . '10 14:03

, (, , ). / . - MongoDB, NoSQL, , ( map/reduce all)

# assuming events is a dictionary of id keyed to item bought...
user = {}
for (cust_id, item) in events:
    if not cust_id in users:
        user[cust_id] = set()
    user[cust_id].add(item)
# now we have a dictionary of cust_ids keyed to a set of every item
# they've ever bought (given that repeats don't matter)
# now we construct a dict of items keyed to a dictionary of other items
# which are in turn keyed to num times present
items = {}
def insertOrIter(d, k, v):
    if k in d:
        d[k] += v
    else:
        d[k] = v
for key in user:
    # keep track of items bought with each other
    itemsbyuser = []
    for item in user[key]:
        # make sure the item with dict is set up
        if not item in items:
            items[item] = {}
        # as we see each item, add to it others and others to it
        for other in itemsbyuser:
            insertOrIter(items[other], item, 1)
            insertOrIter(items[item], other, 1)
        itemsbyuser.append(item)
# now, unless i've screwed up my logic, we have a dictionary of items keyed
# to a dictionary of other items keyed to how many times they've been
# bought with the first item. *whew* 
# If you want something more (potentially) useful, we just turn that around to be a
# dictionary of items keyed to a list of tuples of (times seen, other item) and
# you're good to go.
useful = {}
for i in items:
    temp = []
    for other in items[i]:
        temp[].append((items[i][other], other))
    useful[i] = sorted(temp, reverse=True)
# Now you should have a dictionary of items keyed to tuples of
# (number times bought with item, other item) sorted in descending order of
# number of times bought together

+1

nearlymonolith 24 . '10 17:15

, , , , , .

, , . , .

from collections import defaultdict
from itertools import groupby

class myDB:
    '''Example of "indexed" "database" of orders <-> items on order'''
    def __init__(self):
        self.id_based_index = defaultdict(set) 
        self.item_based_index = defaultdict(set)

    def add(self, order_data):
        for id, item in order_data:
            self.id_based_index[id].add(item)
            self.item_based_index[item].add(id)

    def get_compliments(self, item):
        all_items = []
        for id in self.item_based_index[item]:
            all_items.extend(self.id_based_index[id])
        gi = groupby(sorted(all_items), lambda x: x)
        return dict([(k, len(list(g))) for k, g in gi])

:

events = """1-hammer 
    1-screwdriver 
    1-nails 
    2-hammer 
    2-nails 
    3-screws 
    3-screwdriver 
    4-nails 
    4-screws"""

db = myDB()
db.add(
    [ map(str.strip,e.split('-')) for e in events.splitlines() ]
    )
# index is incrementally increased 
db.add([['5','plunger'],['5','beer']])

# this scans and counts only needed items
assert db.get_compliments('NotToBeFound') == {}
assert db.get_compliments('hammer') == {'nails': 2, 'hammer': 2, 'screwdriver': 1}
# you get back the count for the requested product as well. Discard if not needed.

It's fun, but seriously, just go for a real database repository. Since indexing is already built into any database engine, all of the above code in SQL will only:

select
    p_others.product_name,
    count(1) cnt
from products p
join order_product_map opm
    on p.product_id = opm.product_id
join products p_others
    on opm.product_id = p_others.product_id
where p.product_name in ('hammer')
group by p_others.product_name

+1

ddotsenko Jun 24 '10 at 20:26

source share

PaulMcG · Accepted Answer · 2010-06-24T16:27:57+0000

events = """\
1-hammer 
1-screwdriver 
1-nails 
2-hammer 
2-nails 
3-screws 
3-screwdriver 
4-nails 
4-screws""".splitlines()
events = sorted(map(str.strip,e.split('-')) for e in events)

from collections import defaultdict
from itertools import groupby

# tally each occurrence of each pair of items
summary = defaultdict(int)
for val,items in groupby(events, key=lambda x:x[0]):
    items = sorted(it[1] for it in items)
    for i,item1 in enumerate(items):
        for item2 in items[i+1:]:
            summary[(item1,item2)] += 1
            summary[(item2,item1)] += 1

# now convert raw pair counts into friendlier lookup table
pairmap = defaultdict(dict)
for k,v in summary.items():
    item1, item2 = k
    pairmap[item1][item2] = v

# print the results    
for k,v in sorted(pairmap.items()):
    print k,':',v

:

hammer : {'nails': 2, 'screwdriver': 1}
nails : {'screws': 1, 'hammer': 2, 'screwdriver': 1}
screwdriver : {'screws': 1, 'nails': 1, 'hammer': 1}
screws : {'nails': 1, 'screwdriver': 1}

( . , .)

Optimize the algorithm to create a list of items evaluated together in Python

More articles: