二手车交易价格预测学习笔记 -- Task4

Tina ·

更新时间:2024-11-14

· 891 次阅读

赛题：零基础入门数据挖掘 - 二手车交易价格预测
地址：https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX

建模与调参 常用方法 pandas 减少内存用量 df.memory_usage() 将数值型列类型细分，str列转换为category类型截距 intercept 权重 coef 排序 sorted() 指定排序的列 key=lambda x:x[1] 降序 reverse=True 互换names和coef的位置 dict(zip()) matplotlib.pyplot 散点图 plt.scatter() 图例 plt.legend numpy 返回均匀间隔的数字 np.linspace() y轴的最大最小坐标 plt.ylim() 填充两个函数曲线之间的部分 plt.fill_between() 回归模型 Ridge & Lasso python 绝对值 abs() seaborn 画线段(折线图) sns.lineplot() 常用模型：

from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor

小结

线性回归模型的数据集要尽量调整成正态分布；

用时间靠前的4/5样本当作训练集，时间靠后的1/5当作验证集，结果和五折交叉验证差距不大；

线性回归，Ridge和Lasso模型对比，前两个较好，但Ridge的coef参数很多较大，抗扰动弱；

常用模型中，随机森林模型表现最好，LGBM二好；

调参

 ## LGB的参数集合：
objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']
num_leaves = [3,5,10,15,20,40, 55]
max_depth = [3,5,10,15,20,40, 55]
bagging_fraction = []
feature_fraction = []
drop_rate = []

贪心调参

best_obj = dict()
for obj in objective:
    model = LGBMRegressor(objective=obj)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_obj[obj] = score
best_leaves = dict()
for leaves in num_leaves:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0], num_leaves=leaves)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_leaves[leaves] = score
best_depth = dict()
for depth in max_depth:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0],
                          num_leaves=min(best_leaves.items(), key=lambda x:x[1])[0],
                          max_depth=depth)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_depth[depth] = score

Grid Search调参

from sklearn.model_selection import GridSearchCV
parameters = {'objective': objective , 'num_leaves': num_leaves, 'max_depth': max_depth}
model = LGBMRegressor()
clf = GridSearchCV(model, parameters, cv=5)
clf = clf.fit(train_X, train_y)
clf.best_params_
model = LGBMRegressor(objective='regression',
                          num_leaves=55,
                          max_depth=15)
np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))

贝叶斯调参

from bayes_opt import BayesianOptimization
def rf_cv(num_leaves, max_depth, subsample, min_child_samples):
    val = cross_val_score(
        LGBMRegressor(objective = 'regression_l1',
            num_leaves=int(num_leaves),
            max_depth=int(max_depth),
            subsample = subsample,
            min_child_samples = int(min_child_samples)
        ),
        X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)
    ).mean()
    return 1 - val
rf_bo = BayesianOptimization(
    rf_cv,
    {
    'num_leaves': (2, 100),
    'max_depth': (2, 100),
    'subsample': (0.1, 1),
    'min_child_samples' : (2, 100)
    }
)
rf_bo.maximize()
1 - rf_bo.max['target']

作者：weixin_45727892

学习笔记二手车学习

1024 个赞