XGBoost Usage Guide, Part 2

Izellah ·

Updated: 2024-11-13


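The difference from XGBoost Usage Guide, Part 1 is:
1. The data is a pandas DataFrame (which has to be converted to a DMatrix).
Note the overall workflow: read the data with pd.read_csv() -> split it with train_test_split() -> convert it to DMatrix format with xgb.DMatrix() -> set the parameters -> set up a watchlist to monitor the model and train it with train() -> predict with predict() -> evaluate the accuracy -> save the model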

'''Building a model from data in pandas DataFrame format'''
import pandas as pd
import numpy as np
import pickle
import xgboost as xgb
from sklearn.model_selection import train_test_split
# Basic example: read the data from a CSV file and do binary classification
# Load the data with pandas
data = pd.read_csv('data/Pima-Indians-Diabetes.csv')
# Split the data into training and test sets
train, test = train_test_split(data)
# Convert to DMatrix format
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
target_column = 'Outcome'
# Extract the numpy arrays and use them to initialize the DMatrix objects
xgtrain = xgb.DMatrix(train[feature_columns].values, label=train[target_column].values)
xgtest = xgb.DMatrix(test[feature_columns].values, label=test[target_column].values)
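# Parameter settings
'''
max_depth: maximum depth of a tree. Default is 6; use a value of 1 or more.
eta: can be thought of as the learning rate.
    It is the step-size shrinkage used in each update to prevent overfitting.
    After each boosting step the algorithm directly obtains the weights of the new features; eta shrinks these weights to make the boosting process more conservative. Default is 0.3.
    Range: [0,1]
verbosity: replaces the older silent flag in recent XGBoost releases. 0 means silent mode, 1 prints warnings, 2 prints info, 3 prints debug output.
subsample: fraction of the observations sampled for training. Setting it to 0.5 means XGBoost randomly picks half of the observations to grow the trees, which helps prevent overfitting. Range: (0,1]
colsample_bytree: fraction of the features (columns) sampled when constructing each tree. Range: (0,1]
objective: defines the loss function to be minimized.
    binary:logistic: logistic regression for binary classification; returns the predicted probability
'''
param = {'max_depth':5, 'eta':0.1, 'verbosity':0, 'subsample':0.7, 'colsample_bytree':0.7, 'objective':'binary:logistic' }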
# Set up a watchlist to monitor the model on the test and training sets
watchlist  = [(xgtest,'eval'), (xgtrain,'train')]
num_round = 10
bst = xgb.train(param, xgtrain, num_round, evals=watchlist)
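# Predict with the trained model (returns probabilities for binary:logistic)
preds = bst.predict(xgtest)
# Evaluate the accuracy: threshold the probabilities at 0.5 and count mismatches
labels = xgtest.get_label()
n_wrong = sum(1 for i in range(len(preds)) if int(preds[i] > 0.5) != labels[i])
print('Error rate: %f' % (n_wrong / float(len(preds))))
# Save the model
bst.save_model('data/0002.model')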
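
The error-rate check in the script packs the thresholding and the counting into one expression. As a minimal sketch of the same idea, the helper below spells it out step by step. The function name `error_rate`, its `threshold` parameter, and the sample numbers are made up for illustration and are not part of XGBoost or of the original script.

```python
def error_rate(probs, labels, threshold=0.5):
    """Fraction of samples whose thresholded prediction differs from the label.

    probs:  predicted probabilities of the positive class
    labels: true labels (0 or 1)
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if len(probs) == 0:
        raise ValueError("cannot compute an error rate on empty input")
    # A probability strictly above the threshold counts as class 1, otherwise class 0
    wrong = sum(1 for p, y in zip(probs, labels) if int(p > threshold) != int(y))
    return wrong / len(probs)


# Made-up probabilities and labels, for illustration only
example_probs = [0.9, 0.2, 0.6, 0.4]
example_labels = [1, 0, 0, 1]
print('Error rate: %f' % error_rate(example_probs, example_labels))
```

Because the helper only iterates over its inputs, it should also accept the `preds` and `labels` arrays from the script above, for example `error_rate(preds, labels)`.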

Author: 小菜鸡一号
