pandas合并、处理csv文件、随机抽样

Judith ·

更新时间:2024-11-14

· 659 次阅读

Merge/Join .csv (合并csv文件)

import pandas as pd
ratings = pd.read_csv('C:/Users/Can/Desktop/RS Project/ratings.csv')
movies= pd.read_csv('C:/Users/Can/Desktop/RS Project/movies.csv')

ratings.csv: users’ rating of movies
ratings
movies.csv: movies description
movies

Merge合并
把两个csv文件合并，显示出每位用户给电影的打分以及电影详情。

data = pd.merge(movies, ratings, on=['movieId'], how='left')
data = data[['userId', 'movieId', 'rating', 'timestamp', 'title', 'genres']]
# check if there is null value and list the row
# 检查是否有null值，并列出null值行
data[data.isnull().values==True]
# delete null value row -- 删除有null值行
data.dropna(axis=0, how='any', inplace=True)
# convert datatype of a single row -- 转换datatype至原文件格式
data['userId'] = data['userId'].astype('int')
data['timestamp'] = data['timestamp'].astype('int')
# export combined csv -- 导出合并csv文件
data.to_csv('data.csv')
# read combined csv -- 读取合并csv文件
combined = pd.read_csv('C:/Users/Can/Desktop/RS Project/data.csv')
# delete useless column -- 删除无用列
combined = combined.drop('Unnamed: 0', axis=1) # axis默认为0删除行，1为删除列
# sort by userId
combined.sort_values('userId', inplace=True)

合并后的csv
combined

对数据集随机抽样

# df.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
# n = number of rows(optional, cannot be used with frac) 抽取的行数；
# frac = fraction/proportion(optional, cannot be used with n) 抽取的比例；
# replace = Allow or disallow sampling of the same row more than once (boolean, default False) 是否为有放回抽样；
# weights (str or ndarray-like, optional) 权重
# random_state (int to use as interval, or call np.random.get_state(), optional) 整数作为间隔，或者调用np.random.get_state()
# axis = extract row or column (0->row, 1->column) 抽取行还是列（0是行，1是列）
# random select 10% from dataset
sample = combined.sample(frac=0.1, random_state=5, axis=0)
# export to csv file
sample.to_csv('userRating.csv')

作者：Can_9420

csv文件随机抽样 pandas csv

1024 个赞