日子有点无聊,照着<Programming collective Intelligence>写了一个delicious推荐的小模型(代码),就当作是再学python的hello, world! :p 以后无聊了就跑一下代码,看看有什么值得看的, HoHo~
运行结果如下:
>>> from delicious import *
>>> table=initializeUserDict('programming')
>>> table['myself']={}
>>> fillItems(table) 考虑到国内的网络,冲杯咖啡先吧...
>>> from recommendations import *
# 对用户username推荐n条网页链接
>>> getRecommendations(table,"username")[0:n]
# 输出结果前者为推荐值,后者为链接
[(0.61988282180389409, 'http://html5boilerplate.com/mobile/')]
...
deliciousapi
deliciousapi是非官方的api,用于从delicious上拿数据,它的接口很简洁,作者博客上有相应的demo,这里不废话.
initializeUserDict and fillItems
initializeUserDict(tag, count)首先通过deliciousapi.DeliciousAPI().get_urls('tag')获得某个tag下的count链接,然后通过.get_url('url')获取该链接相关的信息,e.g. bookmarks,tags.最后收集bookmarks对应的用户到字典user_dict,将其返回.代码如下:
def initializeUserDict(tag, count=1): user_dict={} dapi = deliciousapi.DeliciousAPI() for url in dapi.get_urls(tag=tag)[0:count]: for item in dapi.get_url(url).bookmarks: user_dict[item[0]]={} return user_dict
fillIteams(user_dict)遍历字典user_dict,通过.get_user(user)获取user收集的bookmarks.如果某个url被所有user收藏,那么将user_dict[user][url]置位,否则清零.代码如下:
def fillItems(user_dict): all_items={} dapi = deliciousapi.DeliciousAPI() for user in user_dict: for i in range(3): try: posts=dapi.get_user(user) break except: print "Fail user "+user+", retrying" time.sleep(4) for item in posts.bookmarks: url=item[0] user_dict[user][url]=1.0 all_items[url]=1 for ratings in user_dict.values(): for item in all_items: if item not in ratings: ratings[item]=0.0
经过initializeUserDict()和fillItems()两步后,得到的user_dict形如:
{'username': {'http://url': 0.0}, {'http://lru': 1.0}, 'nameuser': {'http://url': 0.0}, {'http://lru': 1.0}, ... }
getRecommendations
现在看看最核心的推荐算法,其实很简单:
def sim_distance(prefs, person1, person2): si={} for item in prefs[person1]: if item in prefs[person2]: si[item]=1 if len(si)==0: return 0 sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item], 2) for item in prefs[person1] if item in prefs[person2]]) return 1/(1+sqrt(sum_of_squares)) # gets recommendations for a person by using a weighted average # of every other user's rankings def getRecommendations(prefs, person, similarity=sim_distance): totals={} simSums={} for other in prefs: if other==person: continue sim=similarity(prefs,person,other) if sim<=0: continue for item in prefs[other]: # only score those haven't seen yet if item not in prefs[person] or prefs[person][item]==0: #similarity * score totals.setdefault(item,0) totals[item]+=prefs[other][item]*sim # sum of similarities simSums.setdefault(item,0) simSums[item]+=sim #create the normolized list rankings=[(total/simSums[item],item) for item,total in totals.items()] # return the sorted list rankings.sort() rankings.reverse() return rankings
参数person表示被推荐人,而参数similarity表示计算Correlation and dependence的函数,默认值为sim_distance,其实也就是个Euclidean Distance.
Conclusion and next steps
小模型很简单,有空时可以完善一下算法.( 如果让C程序员看见如此大的稀疏矩阵,那得有多纠结 :( )