日子有点无聊,照着<Programming collective Intelligence>写了一个delicious推荐的小模型(代码),就当作是再学python的hello, world! :p 以后无聊了就跑一下代码,看看有什么值得看的, HoHo~

运行结果如下:

>>> from delicious import *
>>> table=initializeUserDict('programming')
>>> table['myself']={}
>>> fillItems(table) 考虑到国内的网络,冲杯咖啡先吧...
>>> from recommendations import *
# 对用户username推荐n条网页链接
>>> getRecommendations(table,"username")[0:n]
# 输出结果前者为推荐值,后者为链接
[(0.61988282180389409, 'http://html5boilerplate.com/mobile/')]
...

deliciousapi

deliciousapi是非官方的api,用于从delicious上拿数据,它的接口很简洁,作者博客上有相应的demo,这里不废话.

initializeUserDict and fillItems

initializeUserDict(tag, count)首先通过deliciousapi.DeliciousAPI().get_urls(‘tag’)获得某个tag下的count链接,然后通过.get_url(‘url’)获取该链接相关的信息,e.g. bookmarks,tags.最后收集bookmarks对应的用户到字典user_dict,将其返回.代码如下:

def initializeUserDict(tag, count=1):
   user_dict={}
   dapi = deliciousapi.DeliciousAPI()

   for url in dapi.get_urls(tag=tag)[0:count]:
      for item in dapi.get_url(url).bookmarks:
         user_dict[item[0]]={}
   return user_dict

fillIteams(user_dict)遍历字典user_dict,通过.get_user(user)获取user收集的bookmarks.如果某个url被所有user收藏,那么将user_dict[user][url]置位,否则清零.代码如下:

def fillItems(user_dict):
   all_items={}
   dapi = deliciousapi.DeliciousAPI()

   for user in user_dict:
      for i in range(3):
         try:
            posts=dapi.get_user(user)
            break
         except:
            print "Fail user "+user+", retrying"
            time.sleep(4)
      for item in posts.bookmarks:
      	 url=item[0]
      	 user_dict[user][url]=1.0
      	 all_items[url]=1

   for ratings in user_dict.values():
      for item in all_items:
         if item not in ratings:
         	ratings[item]=0.0

经过initializeUserDict()和fillItems()两步后,得到的user_dict形如:

{'username': {'http://url': 0.0}, {'http://lru': 1.0},
 'nameuser': {'http://url': 0.0}, {'http://lru': 1.0},
  ...
}

getRecommendations

现在看看最核心的推荐算法,其实很简单:

def sim_distance(prefs, person1, person2):
   si={}
   for item in prefs[person1]:
      if item in prefs[person2]:
           si[item]=1

   if len(si)==0: return 0


   sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item], 2)
                     for item in prefs[person1] if item in prefs[person2]])

   return 1/(1+sqrt(sum_of_squares))

# gets recommendations for a person by using a weighted average
# of every other user's rankings
def getRecommendations(prefs, person, similarity=sim_distance):
   totals={}
   simSums={}

   for other in prefs:
      if other==person: continue
      sim=similarity(prefs,person,other)

      if sim<=0: continue

      for item in prefs[other]:
         # only score those haven't seen yet
         if item not in prefs[person] or prefs[person][item]==0:
            #similarity * score
            totals.setdefault(item,0)
            totals[item]+=prefs[other][item]*sim
            # sum of similarities
            simSums.setdefault(item,0)
            simSums[item]+=sim

   #create the normolized list
   rankings=[(total/simSums[item],item) for item,total in totals.items()]

   # return the sorted list
   rankings.sort()
   rankings.reverse()
   return rankings

参数person表示被推荐人,而参数similarity表示计算Correlation and dependence的函数,默认值为sim_distance,其实也就是个Euclidean Distance.

Conclusion and next steps

小模型很简单,有空时可以完善一下算法.( 如果让C程序员看见如此大的稀疏矩阵,那得有多纠结 :( )