日子有点无聊,照着<Programming collective Intelligence>写了一个delicious推荐的小模型(代码),就当作是再学python的hello, world! :p 以后无聊了就跑一下代码,看看有什么值得看的, HoHo~
运行结果如下:
>>> from delicious import *
>>> table=initializeUserDict('programming')
>>> table['myself']={}
>>> fillItems(table) 考虑到国内的网络,冲杯咖啡先吧...
>>> from recommendations import *
# 对用户username推荐n条网页链接
>>> getRecommendations(table,"username")[0:n]
# 输出结果前者为推荐值,后者为链接
[(0.61988282180389409, 'http://html5boilerplate.com/mobile/')]
...
deliciousapi
deliciousapi是非官方的api,用于从delicious上拿数据,它的接口很简洁,作者博客上有相应的demo,这里不废话.
initializeUserDict and fillItems
initializeUserDict(tag, count)首先通过deliciousapi.DeliciousAPI().get_urls('tag')获得某个tag下的count链接,然后通过.get_url('url')获取该链接相关的信息,e.g. bookmarks,tags.最后收集bookmarks对应的用户到字典user_dict,将其返回.代码如下:
def initializeUserDict(tag, count=1):
user_dict={}
dapi = deliciousapi.DeliciousAPI()
for url in dapi.get_urls(tag=tag)[0:count]:
for item in dapi.get_url(url).bookmarks:
user_dict[item[0]]={}
return user_dict
fillIteams(user_dict)遍历字典user_dict,通过.get_user(user)获取user收集的bookmarks.如果某个url被所有user收藏,那么将user_dict[user][url]置位,否则清零.代码如下:
def fillItems(user_dict):
all_items={}
dapi = deliciousapi.DeliciousAPI()
for user in user_dict:
for i in range(3):
try:
posts=dapi.get_user(user)
break
except:
print "Fail user "+user+", retrying"
time.sleep(4)
for item in posts.bookmarks:
url=item[0]
user_dict[user][url]=1.0
all_items[url]=1
for ratings in user_dict.values():
for item in all_items:
if item not in ratings:
ratings[item]=0.0
经过initializeUserDict()和fillItems()两步后,得到的user_dict形如:
{'username': {'http://url': 0.0}, {'http://lru': 1.0},
'nameuser': {'http://url': 0.0}, {'http://lru': 1.0},
...
}
getRecommendations
现在看看最核心的推荐算法,其实很简单:
def sim_distance(prefs, person1, person2):
si={}
for item in prefs[person1]:
if item in prefs[person2]:
si[item]=1
if len(si)==0: return 0
sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item], 2)
for item in prefs[person1] if item in prefs[person2]])
return 1/(1+sqrt(sum_of_squares))
# gets recommendations for a person by using a weighted average
# of every other user's rankings
def getRecommendations(prefs, person, similarity=sim_distance):
totals={}
simSums={}
for other in prefs:
if other==person: continue
sim=similarity(prefs,person,other)
if sim<=0: continue
for item in prefs[other]:
# only score those haven't seen yet
if item not in prefs[person] or prefs[person][item]==0:
#similarity * score
totals.setdefault(item,0)
totals[item]+=prefs[other][item]*sim
# sum of similarities
simSums.setdefault(item,0)
simSums[item]+=sim
#create the normolized list
rankings=[(total/simSums[item],item) for item,total in totals.items()]
# return the sorted list
rankings.sort()
rankings.reverse()
return rankings
参数person表示被推荐人,而参数similarity表示计算Correlation and dependence的函数,默认值为sim_distance,其实也就是个Euclidean Distance.
Conclusion and next steps
小模型很简单,有空时可以完善一下算法.( 如果让C程序员看见如此大的稀疏矩阵,那得有多纠结 :( )