PySpark对电影和用户进行聚类分析

Felcia ·
更新时间:2024-11-13
· 506 次阅读

之前的博文使用pyspark.mllib.recommendation做推荐案例,代码实现了如何为用户做电影推荐和为电影找到潜在的感兴趣用户。本篇博文介绍如何利用因子分解出的用户特征、电影特征做用户和电影的聚类分析,以看能否找到不同于已知的、有趣的新信息。

第一步:获取用户评分数据显式因式分解后的movieFactors、userFactors。

from pyspark.mllib.recommendation import ALS,Rating #用户评分数据 rawData = sc.textFile("/Users/gao/data/ml-100k/u.data") rawRatings = rawData.map(lambda line:line.split("\t")[:3]) #构造user-item-rating 数据 ratings = rawRatings.map(lambda line:Rating(user=int(line[0]),product=int(line[1]),rating=float(line[2]))) #评分数据放入缓存中 ratings.cache() #模型训练 alsModel = ALS.train(ratings, rank=50, iterations=10, lambda_=0.1) #根据因式分解出的 productFeatures\userFeatures from pyspark.mllib.linalg import Vectors movieFactors = alsModel.productFeatures().map(lambda product_features:(product_features[0],Vectors.dense(product_features[1]))) movieVectors = movieFactors.map(lambda line:line[1]) userFactors = alsModel.userFeatures().map(lambda user_features:(user_features[0],Vectors.dense(user_features[1]))) userVectors = userFactors.map(lambda line:line[1])

第二步,读入电影题材数据

#电影题材数据 genres = sc.textFile("/Users/gao/Desktop/Toby/5Spark-JDK/data/ml-100k/u.genre") #电影题材数据按照 | 分割 genreMap = genres.filter(lambda line:line!='' and len(line)>4).map(lambda line:line.split('|')).map(lambda arr:(arr[1],arr[0])).collectAsMap() print(genreMap) #电影名称和题材标注数据 movies = sc.textFile("/Users/gao/data/ml-100k/u.item") #获取电影名称和题材名对应数据 import numpy as np def titles_and_genres(line,genreMap): arr = line.split('|') idx = int(arr[0]) title = str(arr[1]) genre_list = np.array([int(i) for i in arr[5:]]) index_list = [index for index in np.where(genre_list==1)[0]] genresAssigned = [genreMap[str(index)] for index in index_list] print(genresAssigned) return (idx,(title,genresAssigned)) genreMap_bcast = sc.broadcast(genreMap) titlesAndGenres = movies.map(lambda line:titles_and_genres(line,genreMap_bcast.value))

第三步,查看特征数据是否需要规范化

#查看特征数据是否需要规范化 from pyspark.mllib.linalg.distributed import RowMatrix movieMatrix = RowMatrix(movieVectors) movieMatrixSummary =movieMatrix.computeColumnSummaryStatistics() userMatrix = RowMatrix(userVectors) userMatrixSummary =userMatrix.computeColumnSummaryStatistics() print("Movie factors mean: {}".format(movieMatrixSummary.mean())) print("Movie factors variance: {}".format(movieMatrixSummary.variance())) print("User factors mean: {}".format(userMatrixSummary.mean())) print("User factors variance: {}".format(userMatrixSummary.variance()))

第四步,训练模型-KMeans聚类

from pyspark.mllib.clustering import KMeans numClusters = 5 numIterations = 10 #训练产品特征数据的聚类 ,查看下聚类中心 movieClusterModel = KMeans.train(rdd=movieVectors, k=numClusters,maxIterations=numIterations) movieClusterModel.clusterCenters #预测聚类中心 predictions = movieClusterModel.predict(movieVectors)

第五步,解释聚类含义

#解释聚类的含义 import math def computeDistance(v1,v2): v = v1-v2 return v.dot(v) titlesWithFactors = titlesAndGenres.join(movieFactors) #查看电影的聚类情况 def movieAssignedCluster(line): (id,((title,genres), vector)) = line pred = movieClusterModel.predict(vector) clusterCentre = movieClusterModel.clusterCenters[pred] dist = computeDistance(Vectors.dense(clusterCentre),Vectors.dense(vector)) return (id, title, genres, pred, dist) moviesAssigned = titlesWithFactors.map(lambda line:movieAssignedCluster(line)) clusterAssignments = moviesAssigned.groupBy(lambda x:x[3]).collectAsMap() #输出每个聚类中,距离聚类中心最近的Top电影 for (k,v) in clusterAssignments.items(): print('Cluster %d:'%k) m = sorted([(x[1],x[2],x[4]) for x in v], key=lambda x:x[-1], reverse=False) print([x for x in m[:20]]) print("==========")

聚类0中的电影基本是Drama类型的,偏向爱情、浪漫的题材。

第六步,拆分训练集、测试集,调聚类中心参数K,并评价聚类模型

#拆分训练集、测试集 trainTestSplitMovies = movieVectors.randomSplit(np.array([0.6, 0.4]),123) trainMovies = trainTestSplitMovies[0] testMovies = trainTestSplitMovies[1] print('Movie clustering cross-validation:') #调参-聚类中心K for k in [2, 3, 4, 5, 10, 20,30,40,50]: model = KMeans.train(rdd=trainMovies, k=k,maxIterations=numIterations) cost = model.computeCost(testMovies) print('WCSS for k={} is {:.4f}'.format(k,cost))

随着K的增大,WCSS减小,但到K=20后,WCSS的减小趋势变缓。所以,结合需求的同时,K取在20以内比较好。

Done


作者:Just Jump



电影 聚类分析 pyspark 聚类

需要 登录 后方可回复, 如果你还没有账号请 注册新账号