The requirements are as follows:
Raw word counts are dominated by very common words, so the method above is sensitive to sheer magnitude; we therefore normalize the counts (this is what TF-IDF does).
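The normalization in question is TF-IDF: each raw count is scaled by how rare the word is across the corpus, so words that appear everywhere are pushed toward zero. A minimal plain-Python sketch on a toy three-document corpus (hypothetical data; GraphLab's exact scaling may differ slightly):

```python
import math
from collections import Counter

# Toy corpus: three tiny "documents" (hypothetical example data)
docs = [
    "the president spoke to the senate",
    "the player scored in the match",
    "the president met the player",
]

# Term frequency: raw word counts per document
counts = [Counter(d.split()) for d in docs]

# Document frequency: in how many documents each word appears
df = Counter(w for c in counts for w in c)

# TF-IDF: count * log(N / df) -- words in every document are scaled to 0
N = len(docs)
tfidf = [{w: c * math.log(N / df[w]) for w, c in doc.items()}
         for doc in counts]

print(tfidf[0]["the"])                             # 0.0: "the" is in every document
print(tfidf[0]["senate"] > tfidf[0]["president"])  # rarer word scores higher
```

Note how "the", despite being the most frequent word, gets weight 0, while "senate" (unique to one document) outweighs "president" (shared by two).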
Clustering is a form of unsupervised learning: the input data carries no labels.
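As a concrete instance of unsupervised learning, here is a minimal 1-D k-means sketch in plain Python (toy values, not the course dataset): the algorithm groups points purely by distance, with no labels involved.

```python
# Minimal 1-D k-means (k=2) on unlabeled toy data
points = [1.0, 1.2, 0.8, 8.0, 8.5, 7.9]
centers = [points[0], points[3]]  # naive initialization: two arbitrary points

for _ in range(10):  # a few fixed iterations converge on this tiny example
    # Assignment step: each point joins its nearest center's cluster
    clusters = [[], []]
    for p in points:
        i = min(range(2), key=lambda i: abs(p - centers[i]))
        clusters[i].append(p)
    # Update step: each center moves to the mean of its cluster
    centers = [sum(c) / len(c) for c in clusters]

print(sorted(centers))  # ~[1.0, 8.13]: the two natural groups
```

The same assignment/update loop underlies real clustering libraries; only the distance function and dimensionality change.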
import graphlab
people = graphlab.SFrame('people_wiki.sframe')
Get the word counts for Obama article
obama['word_count'] = graphlab.text_analytics.count_words(obama['text'])  # requires obama = people[people['name']=='Barack Obama'] to have been selected first
Sort the word counts for the obama article
obama_word_count_table = obama[['word_count']].stack('word_count', new_column_name=['word','count']).sort('count', ascending=False)
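`count_words` plus `stack` turns a text column into (word, count) rows; the same transformation can be sketched with the standard library (toy sentence standing in for the article text):

```python
from collections import Counter

text = "obama was elected and obama served"  # toy stand-in for the article

# Equivalent of count_words: a dict of word -> count
word_count = Counter(text.split())

# Equivalent of stack(...) + sort: (word, count) rows, most frequent first
table = sorted(word_count.items(), key=lambda kv: kv[1], reverse=True)
print(table[0])  # ('obama', 2)
```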
Compute TF-IDF for the corpus
people['word_count'] = graphlab.text_analytics.count_words(people['text'])  # first count the words in every article
tfidf = graphlab.text_analytics.tf_idf(people['word_count'])  # compute TF-IDF from the word counts
people['tfidf'] = tfidf  # attach the result as a column so individual articles can be examined
Examine the TF-IDF for the Obama article
obama = people[people['name']=='Barack Obama']  # select Obama's row from the corpus
obama[['tfidf']].stack('tfidf',new_column_name=['word','tfidf']).sort('tfidf',ascending=False)  # stack the tfidf dict into (word, tfidf) rows, highest weight first
Is Obama closer to Clinton or to Beckham?
clinton = people[people['name']=='Bill Clinton']  # select the two comparison articles
beckham = people[people['name']=='David Beckham']
graphlab.distances.cosine(obama['tfidf'][0],clinton['tfidf'][0])  # cosine distance: lower means more similar
graphlab.distances.cosine(obama['tfidf'][0],beckham['tfidf'][0])
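`graphlab.distances.cosine` returns 1 minus the cosine similarity of two sparse word-weight dictionaries, so 0 means identical direction and values near 1 mean little overlap. A plain-Python sketch with toy vectors:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity for sparse dicts of word -> weight."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (na * nb)

u = {"president": 1.0, "senate": 2.0}
v = {"president": 1.0, "match": 2.0}
print(cosine_distance(u, u))  # ~0.0: identical documents
print(cosine_distance(u, v))  # grows toward 1 as the overlap shrinks
```

Because the distance depends only on direction, it is insensitive to document length, which is why it pairs well with TF-IDF vectors.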
Build a nearest neighbors model for document retrieval
knn_model = graphlab.nearest_neighbors.create(people,features=['tfidf'],label='name')  # build a k-nearest-neighbors model over the tfidf features
Query the model directly with its query method
knn_model.query(beckham)
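Under the hood, the query scans the corpus and returns the rows with the smallest distance to the query article. A brute-force sketch of that retrieval over tfidf dictionaries, using cosine distance as above (toy names and weights, purely illustrative):

```python
import math

def cosine_distance(a, b):
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (na * nb)

# Toy corpus of tfidf dicts keyed by article name (hypothetical weights)
corpus = {
    "Barack Obama":  {"president": 3.0, "senate": 2.0},
    "Bill Clinton":  {"president": 2.5, "arkansas": 1.0},
    "David Beckham": {"football": 4.0, "goal": 2.0},
}

def query(name, k=2):
    """Return the k articles nearest to `name`, nearest first."""
    q = corpus[name]
    ranked = sorted(corpus, key=lambda n: cosine_distance(q, corpus[n]))
    return ranked[:k]

print(query("Barack Obama"))  # the article itself, then its nearest neighbor
```

The real model avoids this O(n) scan with smarter index structures, but the ranking it returns is the same.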