Used Car Transaction Price Prediction Learning Notes -- Task 4

Tina · Updated 2024-09-21

Competition: Zero-Based Introduction to Data Mining - Used Car Transaction Price Prediction
地址:https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX

## Modeling and Parameter Tuning

### Common Methods

- pandas: check memory usage with `df.memory_usage()`; reduce memory by downcasting numeric columns to narrower dtypes and converting `str` columns to the `category` dtype (a sketch follows this list)
- linear models: intercept `intercept_`, weights `coef_`
- sorting: `sorted()`; choose the field to sort on with `key=lambda x: x[1]`; descending order with `reverse=True`; swap the positions of names and coefs with `dict(zip())`
- matplotlib.pyplot: scatter plot `plt.scatter()`; legend `plt.legend()`; y-axis limits `plt.ylim()`; fill the area between two curves `plt.fill_between()`
- numpy: evenly spaced numbers `np.linspace()`
- regression models: Ridge & Lasso
- python: absolute value `abs()`
- seaborn: line plot `sns.lineplot()`
- common models:

```python
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor
```
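The list above mentions shrinking DataFrame memory; here is a minimal sketch of that idea, assuming a generic DataFrame (the function name `reduce_mem_usage` and the exact dtype checks are my own, not from the notes):

```python
import pandas as pd

def reduce_mem_usage(df):
    """Downcast numeric columns and convert object columns to category."""
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif pd.api.types.is_float_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast='float')
        elif pd.api.types.is_object_dtype(df[col]):
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024 ** 2
    print(f'memory: {start_mem:.2f} MB -> {end_mem:.2f} MB')
    return df
```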

### Summary

The dataset for a linear regression model should be adjusted to be as close to a normal distribution as possible.
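The tuning code later in these notes fits against a log-transformed target called `train_y_ln`, which is the usual way to make the right-skewed price look normal. A minimal sketch, assuming the raw data lives in a DataFrame `train_data` with a `price` column (both names are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

train_y = train_data['price']   # assumed raw, right-skewed target
train_y_ln = np.log1p(train_y)  # log(1 + x); invert with np.expm1

# compare the two distributions side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(train_y, kde=True, ax=axes[0]).set_title('raw price')
sns.histplot(train_y_ln, kde=True, ax=axes[1]).set_title('log1p(price)')
plt.show()
```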

Using the earliest 4/5 of the samples (ordered by time) as the training set and the latest 1/5 as the validation set gives results close to five-fold cross-validation.
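A sketch of that comparison, assuming the rows of `train_X` are already sorted by time and `train_y_ln` is the log target from above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_val_score

# time-ordered split: earliest 4/5 to train, latest 1/5 to validate
split = len(train_X) * 4 // 5
model = LinearRegression().fit(train_X[:split], train_y_ln[:split])
val_mae = mean_absolute_error(train_y_ln[split:], model.predict(train_X[split:]))

# five-fold cross-validation on the same data
cv_mae = np.mean(cross_val_score(LinearRegression(), train_X, train_y_ln,
                                 cv=5, scoring=make_scorer(mean_absolute_error)))
print(val_mae, cv_mae)  # per the notes, the two come out close
```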

Comparing plain linear regression, Ridge, and Lasso, the first two score better, but many of the Ridge model's coef values are still large, so it is weak against perturbations.
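A sketch of inspecting those coefficients, reusing the `sorted()` / `dict(zip())` tricks from the method list (assumes `train_X` is a DataFrame so `.columns` is available):

```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso

for Model in [LinearRegression, Ridge, Lasso]:
    model = Model().fit(train_X, train_y_ln)
    # pair feature names with weights, largest absolute weight first
    coef = dict(zip(train_X.columns, model.coef_))
    top = sorted(coef.items(), key=lambda x: abs(x[1]), reverse=True)[:5]
    print(Model.__name__, 'intercept:', model.intercept_)
    print('  largest weights:', top)
```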

Among the common models, random forest performs best, with LGBM second.
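A sketch of the comparison loop behind that conclusion. SVC from the import list is a classifier, so it is left out here; the hyperparameters (`n_estimators`, `solver`, `max_iter`) are my own choices to keep the run short, not values from the notes:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor

models = [LinearRegression(),
          DecisionTreeRegressor(),
          RandomForestRegressor(n_estimators=100),
          GradientBoostingRegressor(),
          MLPRegressor(solver='lbfgs', max_iter=100),
          XGBRegressor(n_estimators=100, objective='reg:squarederror'),
          LGBMRegressor(n_estimators=100)]

for model in models:
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln,
                                    cv=5, scoring=make_scorer(mean_absolute_error)))
    print(type(model).__name__, 'MAE:', score)
```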

## Parameter Tuning

### LGB parameter sets

```python
objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']
num_leaves = [3, 5, 10, 15, 20, 40, 55]
max_depth = [3, 5, 10, 15, 20, 40, 55]
# left empty in the notes; candidate values were never filled in
bagging_fraction = []
feature_fraction = []
drop_rate = []
```

### Greedy tuning

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, mean_absolute_error
from lightgbm.sklearn import LGBMRegressor

# tune one parameter at a time, freezing the best value found so far:
# objective first, then num_leaves, then max_depth
best_obj = dict()
for obj in objective:
    model = LGBMRegressor(objective=obj)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0,
                                    cv=5, scoring=make_scorer(mean_absolute_error)))
    best_obj[obj] = score

best_leaves = dict()
for leaves in num_leaves:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x: x[1])[0],
                          num_leaves=leaves)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0,
                                    cv=5, scoring=make_scorer(mean_absolute_error)))
    best_leaves[leaves] = score

best_depth = dict()
for depth in max_depth:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x: x[1])[0],
                          num_leaves=min(best_leaves.items(), key=lambda x: x[1])[0],
                          max_depth=depth)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0,
                                    cv=5, scoring=make_scorer(mean_absolute_error)))
    best_depth[depth] = score
```

### Grid Search tuning

```python
from sklearn.model_selection import GridSearchCV

parameters = {'objective': objective, 'num_leaves': num_leaves, 'max_depth': max_depth}
model = LGBMRegressor()
clf = GridSearchCV(model, parameters, cv=5)
clf = clf.fit(train_X, train_y_ln)  # the notes fit on train_y; train_y_ln keeps the target consistent
clf.best_params_

# refit with the best parameters found
model = LGBMRegressor(objective='regression', num_leaves=55, max_depth=15)
np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0,
                        cv=5, scoring=make_scorer(mean_absolute_error)))
```

### Bayesian tuning

```python
from bayes_opt import BayesianOptimization

def rf_cv(num_leaves, max_depth, subsample, min_child_samples):
    # BayesianOptimization maximizes its objective, so return 1 - MAE
    val = cross_val_score(
        LGBMRegressor(objective='regression_l1',
                      num_leaves=int(num_leaves),
                      max_depth=int(max_depth),
                      subsample=subsample,
                      min_child_samples=int(min_child_samples)),
        X=train_X, y=train_y_ln, verbose=0, cv=5,
        scoring=make_scorer(mean_absolute_error)
    ).mean()
    return 1 - val

rf_bo = BayesianOptimization(
    rf_cv,
    {'num_leaves': (2, 100),
     'max_depth': (2, 100),
     'subsample': (0.1, 1),
     'min_child_samples': (2, 100)}
)
rf_bo.maximize()

1 - rf_bo.max['target']  # recover the best MAE
```
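To close the loop, a sketch of refitting with the parameters Bayesian optimization found; `rf_bo.max['params']` is bayes_opt's record of the best point (the rest follows the code above):

```python
best = rf_bo.max['params']
final_model = LGBMRegressor(objective='regression_l1',
                            num_leaves=int(best['num_leaves']),
                            max_depth=int(best['max_depth']),
                            subsample=best['subsample'],
                            min_child_samples=int(best['min_child_samples']))
score = np.mean(cross_val_score(final_model, X=train_X, y=train_y_ln,
                                cv=5, scoring=make_scorer(mean_absolute_error)))
print('tuned MAE:', score)
```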