Restaurant Visitor Forecasting: Multi-Table Joins + LightGBM

Ophira · Updated 2024-11-10

A few thoughts:
1. Using pandas feels a lot like writing SQL: on the surface it is all create/read/update/delete, but once joins, grouping, and mixed data types are involved there are plenty of tricks, and those tricks only become second nature through a constant learn-then-apply loop.
2. When the features include a datetime column, you can group on it and construct new time-series features, for example:
(1) Is it a weekend?
(2) Which day of the month is it?
(3) Trend features
(4) Others
3. Code worth forking here:
(1) the outlier detection and handling for numeric features;
(2) the exponentially weighted moving average that captures the time trend;
(3) the time-series summary statistics.
4. Different machine-learning algorithms call for different feature engineering. KNN needs no outlier handling (it is insensitive to outliers), whereas linear regression, SVM, and the like do; decision trees are insensitive to feature scale and need no normalization, whereas KNN needs it; XGBoost needs no special handling of missing values (NaN), because it learns how to route them during training, whereas simpler algorithms force you to drop or impute them. (XGBoost comes pre-installed in Baidu's PaddlePaddle environment, but newer libraries such as LightGBM and CatBoost have to be reinstalled every time the environment is reset. And a question for any expert who happens to read this: is the free GPU Paddle provides really a Tesla V100, as claimed (see Figure 1)? It does not feel that fast to me.)
[Figure 1: the GPU specs shown in the Paddle environment]
5. Hardware really does affect one's confidence to do research in machine learning and deep learning: validating a single model means waiting several hours (and that is in the high-spec Paddle environment; Tianchi's environment always has a queue, and "Tianyuan" (MegEngine) announced it was going open source yesterday, though who knows whether it will provide reliable free compute). Short of better hardware, the only way out is software: learning the tricks of data storage and manipulation.
6. The three sharp blades of machine-learning competitions: feature construction; modeling and tuning (log transforms, plus search methods such as greedy search, grid search, and Bayesian optimization — a toy grid-search sketch follows this list); and model fusion (stacking, voting, random forest + bagging + AdaBoost). Model fusion really eats memory.
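To make the tuning point concrete, here is a minimal grid-search sketch over two LightGBM hyperparameters. The data and the parameter grid are made up for illustration; they are not from this article:

import numpy as np
import lightgbm as lgbm
from sklearn.model_selection import GridSearchCV

X = np.random.rand(200, 5)   # toy feature matrix
y = np.random.rand(200)      # toy target

search = GridSearchCV(
    estimator=lgbm.LGBMRegressor(n_estimators=100),
    param_grid={'num_leaves': [15, 25, 35], 'learning_rate': [0.01, 0.05]},
    scoring='neg_mean_squared_error',
    cv=3
)
search.fit(X, y)
print(search.best_params_)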

Restaurant Visitor Forecasting

Restaurant visit data

import pandas as pd

air_visit = pd.read_csv('air_visit_data.csv')
air_visit.head()
air_store_id visit_date visitors
0 air_ba937bf13d40fb24 2016-01-13 25
1 air_ba937bf13d40fb24 2016-01-14 32
2 air_ba937bf13d40fb24 2016-01-15 29
3 air_ba937bf13d40fb24 2016-01-16 22
4 air_ba937bf13d40fb24 2016-01-18 6
air_visit.index = pd.to_datetime(air_visit['visit_date'])
air_visit.head()
air_store_id visit_date visitors
visit_date
2016-01-13 air_ba937bf13d40fb24 2016-01-13 25
2016-01-14 air_ba937bf13d40fb24 2016-01-14 32
2016-01-15 air_ba937bf13d40fb24 2016-01-15 29
2016-01-16 air_ba937bf13d40fb24 2016-01-16 22
2016-01-18 air_ba937bf13d40fb24 2016-01-18 6

Resampling by day

(1) Resample the time index to daily frequency with resample('1d').sum():

air_visit = air_visit.groupby('air_store_id').apply(lambda g: g['visitors'].resample('1d').sum()).reset_index()
air_visit.head()
air_store_id visit_date visitors
0 air_00a91d42b08b08d9 2016-07-01 35
1 air_00a91d42b08b08d9 2016-07-02 9
2 air_00a91d42b08b08d9 2016-07-03 0
3 air_00a91d42b08b08d9 2016-07-04 20
4 air_00a91d42b08b08d9 2016-07-05 25
air_visit.info()

RangeIndex: 296279 entries, 0 to 296278
Data columns (total 3 columns):
air_store_id    296279 non-null object
visit_date      296279 non-null datetime64[ns]
visitors        296279 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 6.8+ MB
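Why resample matters here, in miniature: it reinserts the calendar days a store has no record for (such as 2016-01-17 in the head() further up). A toy sketch; note that depending on the pandas version an empty day comes back as 0 or as NaN, which is exactly why the next step adds fillna(0) together with a was_nil flag:

import pandas as pd

s = pd.Series([25, 32, 6],
              index=pd.to_datetime(['2016-01-13', '2016-01-14', '2016-01-18']))
print(s.resample('1d').sum())  # 2016-01-15 through 2016-01-17 now appear as rows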

Filling missing values with 0

(2) Normalize the date column with dt.strftime('%Y-%m-%d'):

air_visit['visit_date'] = air_visit['visit_date'].dt.strftime('%Y-%m-%d')
air_visit['was_nil'] = air_visit['visitors'].isnull()
air_visit['visitors'].fillna(0, inplace=True)
air_visit.head()
air_store_id visit_date visitors was_nil
0 air_00a91d42b08b08d9 2016-07-01 35 False
1 air_00a91d42b08b08d9 2016-07-02 9 False
2 air_00a91d42b08b08d9 2016-07-03 0 False
3 air_00a91d42b08b08d9 2016-07-04 20 False
4 air_00a91d42b08b08d9 2016-07-05 25 False

Calendar data

date_info = pd.read_csv('date_info.csv')
date_info.head()
calendar_date day_of_week holiday_flg
0 2016-01-01 Friday 1
1 2016-01-02 Saturday 1
2 2016-01-03 Sunday 1
3 2016-01-04 Monday 0
4 2016-01-05 Tuesday 0
(3) shift() moves the data up or down, which lets us record whether the previous day and the next day are holidays:

date_info.rename(columns={'holiday_flg': 'is_holiday', 'calendar_date': 'visit_date'}, inplace=True)
date_info['prev_day_is_holiday'] = date_info['is_holiday'].shift().fillna(0)
date_info['next_day_is_holiday'] = date_info['is_holiday'].shift(-1).fillna(0)
date_info.head()
visit_date day_of_week is_holiday prev_day_is_holiday next_day_is_holiday
0 2016-01-01 Friday 1 0.0 1.0
1 2016-01-02 Saturday 1 1.0 1.0
2 2016-01-03 Sunday 1 1.0 0.0
3 2016-01-04 Monday 0 1.0 0.0
4 2016-01-05 Tuesday 0 0.0 0.0
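A one-glance toy check of the shift directions: shift() (i.e. shift(1)) puts the previous day's value on today's row, shift(-1) puts the next day's value there:

import pandas as pd

flags = pd.Series([1, 1, 0, 0, 1])
print(flags.shift())    # NaN, 1, 1, 0, 0  -> previous day's flag
print(flags.shift(-1))  # 1, 0, 0, 1, NaN  -> next day's flag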

Store location data

air_store_info = pd.read_csv('air_store_info.csv')
air_store_info.head()
air_store_id air_genre_name air_area_name latitude longitude
0 air_0f0cdeee6c9bf3d7 Italian/French Hyōgo-ken Kōbe-shi Kumoidōri 34.695124 135.197852
1 air_7cc17a324ae5c7dc Italian/French Hyōgo-ken Kōbe-shi Kumoidōri 34.695124 135.197852
2 air_fee8dcf4d619598e Italian/French Hyōgo-ken Kōbe-shi Kumoidōri 34.695124 135.197852
3 air_a17f0778617c76e2 Italian/French Hyōgo-ken Kōbe-shi Kumoidōri 34.695124 135.197852
4 air_83db5aff8f50478e Italian/French Tōkyō-to Minato-ku Shibakōen 35.658068 139.751599

Test set

(4) Slice string features with str.slice(start, stop):

import numpy as np

submission = pd.read_csv('sample_sub.csv')
submission['air_store_id'] = submission['id'].str.slice(0, 20)
submission['visit_date'] = submission['id'].str.slice(21)
submission['is_test'] = True  # flag column
submission['visitors'] = np.nan
submission['test_number'] = range(len(submission))
submission.head()
id visitors air_store_id visit_date is_test test_number
0 air_00a91d42b08b08d9_2017-04-23 NaN air_00a91d42b08b08d9 2017-04-23 True 0
1 air_00a91d42b08b08d9_2017-04-24 NaN air_00a91d42b08b08d9 2017-04-24 True 1
2 air_00a91d42b08b08d9_2017-04-25 NaN air_00a91d42b08b08d9 2017-04-25 True 2
3 air_00a91d42b08b08d9_2017-04-26 NaN air_00a91d42b08b08d9 2017-04-26 True 3
4 air_00a91d42b08b08d9_2017-04-27 NaN air_00a91d42b08b08d9 2017-04-27 True 4
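str.slice(0, 20) relies on every air_store_id being exactly 20 characters long. If you would rather not depend on fixed widths, splitting on the last underscore does the same job (an alternative sketch, not what this article runs):

parts = submission['id'].str.rsplit('_', n=1, expand=True)
# parts[0] is the store id, parts[1] is the date, whatever the id length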

Combining all the tables

data = pd.concat((air_visit, submission.drop('id', axis='columns')))
data.head()

FutureWarning: Sorting because non-concatenation axis is not aligned. A future version of pandas will change to not sort by default. To accept the future behavior, pass 'sort=False'. To retain the current behavior and silence the warning, pass 'sort=True'.
air_store_id is_test test_number visit_date visitors was_nil
0 air_00a91d42b08b08d9 NaN NaN 2016-07-01 35.0 False
1 air_00a91d42b08b08d9 NaN NaN 2016-07-02 9.0 False
2 air_00a91d42b08b08d9 NaN NaN 2016-07-03 0.0 False
3 air_00a91d42b08b08d9 NaN NaN 2016-07-04 20.0 False
4 air_00a91d42b08b08d9 NaN NaN 2016-07-05 25.0 False
data.shape
(328298, 6)

data.isnull().sum()
air_store_id         0
is_test         296279
test_number     296279
visit_date           0
visitors         32019
was_nil          32019
dtype: int64

data['is_test'].fillna(False, inplace=True)
data = pd.merge(left=data, right=date_info, on='visit_date', how='left')
data = pd.merge(left=data, right=air_store_info, on='air_store_id', how='left')
data['visitors'] = data['visitors'].astype(float)
data.head()
air_store_id is_test test_number visit_date visitors was_nil day_of_week is_holiday prev_day_is_holiday next_day_is_holiday air_genre_name air_area_name latitude longitude
0 air_00a91d42b08b08d9 False NaN 2016-07-01 35.0 False Friday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595
1 air_00a91d42b08b08d9 False NaN 2016-07-02 9.0 False Saturday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595
2 air_00a91d42b08b08d9 False NaN 2016-07-03 0.0 False Sunday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595
3 air_00a91d42b08b08d9 False NaN 2016-07-04 20.0 False Monday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595
4 air_00a91d42b08b08d9 False NaN 2016-07-05 25.0 False Tuesday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595
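These merges are ordinary SQL-style left joins: every row of data survives, and any visit_date or air_store_id without a match in the right table would simply get NaN columns. In miniature:

left = pd.DataFrame({'k': ['a', 'b', 'c'], 'v': [1, 2, 3]})
right = pd.DataFrame({'k': ['a', 'b'], 'w': [10, 20]})
print(pd.merge(left=left, right=right, on='k', how='left'))  # row 'c' keeps v=3 and gets w=NaN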
import missingno as msno
msno.bar(data)

[Figure: missingno bar chart of non-null counts per column]

Weather data

import glob

weather_dfs = []

for path in glob.glob('./Weather/*.csv'):
    weather_df = pd.read_csv(path)
    # note: splitting on '\\' only strips Windows-style paths; on macOS/Linux the
    # full relative path survives, as the station_id values below show
    weather_df['station_id'] = path.split('\\')[-1].rstrip('.csv')
    weather_dfs.append(weather_df)

weather = pd.concat(weather_dfs, axis='rows')
weather.rename(columns={'calendar_date': 'visit_date'}, inplace=True)
weather.head()
visit_date avg_temperature high_temperature low_temperature precipitation hours_sunlight solar_radiation deepest_snowfall total_snowfall avg_wind_speed avg_vapor_pressure avg_local_pressure avg_humidity avg_sea_pressure cloud_cover station_id
0 2016-01-01 20.5 22.4 17.5 0.0 0.6 NaN NaN NaN 6.3 NaN NaN NaN NaN NaN ./Weather/okinawa__ohara-kana__oohara
1 2016-01-02 23.5 26.2 21.2 5.0 3.6 NaN NaN NaN 4.7 NaN NaN NaN NaN NaN ./Weather/okinawa__ohara-kana__oohara
2 2016-01-03 21.7 23.7 20.2 11.0 0.0 NaN NaN NaN 2.8 NaN NaN NaN NaN NaN ./Weather/okinawa__ohara-kana__oohara
3 2016-01-04 21.6 23.8 20.4 11.0 0.1 NaN NaN NaN 3.3 NaN NaN NaN NaN NaN ./Weather/okinawa__ohara-kana__oohara
4 2016-01-05 22.1 24.6 20.5 35.5 0.0 NaN NaN NaN 2.4 NaN NaN NaN NaN NaN ./Weather/okinawa__ohara-kana__oohara

Computing average temperatures from the local station data

(5) Group by one column and aggregate the others with groupby()[['...', '...']].mean():

means = weather.groupby('visit_date')[['avg_temperature', 'precipitation']].mean().reset_index()
means.rename(columns={'avg_temperature': 'global_avg_temperature', 'precipitation': 'global_precipitation'}, inplace=True)
means.head()
visit_date global_avg_temperature global_precipitation
0 2016-01-01 2.868353 0.564662
1 2016-01-02 5.279225 2.341998
2 2016-01-03 6.589978 1.750616
3 2016-01-04 5.857883 1.644946
4 2016-01-05 4.556850 3.193625
means.visit_date.nunique()
517

weather.visit_date.nunique()
517

weather = pd.merge(left=weather, right=means, on='visit_date', how='left')
weather['avg_temperature'].fillna(weather['global_avg_temperature'], inplace=True)
weather['precipitation'].fillna(weather['global_precipitation'], inplace=True)
weather[['visit_date', 'avg_temperature', 'precipitation']].head()
visit_date avg_temperature precipitation
0 2016-01-01 20.5 0.0
1 2016-01-02 23.5 5.0
2 2016-01-03 21.7 11.0
3 2016-01-04 21.6 11.0
4 2016-01-05 22.1 35.5
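The merge-then-fillna dance above can also be collapsed into one step with groupby().transform('mean'), which returns a Series already aligned to the original rows (an equivalent sketch, applied to the pre-merge weather frame):

weather['avg_temperature'] = weather['avg_temperature'].fillna(
    weather.groupby('visit_date')['avg_temperature'].transform('mean'))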

Inspecting the combined data

data.info()

DatetimeIndex: 328298 entries, 2016-07-01 to 2017-05-31
Data columns (total 15 columns):
air_store_id           328298 non-null object
is_test                328298 non-null bool
test_number            32019 non-null float64
visit_date             328298 non-null datetime64[ns]
visitors               296279 non-null float64
was_nil                296279 non-null object
day_of_week            328298 non-null object
is_holiday             328298 non-null int64
prev_day_is_holiday    328298 non-null float64
next_day_is_holiday    328298 non-null float64
air_genre_name         328298 non-null object
air_area_name          328298 non-null object
latitude               328298 non-null float64
longitude              328298 non-null float64
is_weekend             328298 non-null int64
dtypes: bool(1), datetime64[ns](1), float64(6), int64(2), object(5)
memory usage: 37.9+ MB

data.reset_index(drop=True, inplace=True)
data.sort_values(['air_store_id', 'visit_date'], inplace=True)
data.head()
air_store_id is_test test_number visit_date visitors was_nil day_of_week is_holiday prev_day_is_holiday next_day_is_holiday air_genre_name air_area_name latitude longitude is_weekend
0 air_00a91d42b08b08d9 False NaN 2016-07-01 35.0 False Friday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 0
1 air_00a91d42b08b08d9 False NaN 2016-07-02 9.0 False Saturday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 1
2 air_00a91d42b08b08d9 False NaN 2016-07-03 0.0 False Sunday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 1
3 air_00a91d42b08b08d9 False NaN 2016-07-04 20.0 False Monday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 0
4 air_00a91d42b08b08d9 False NaN 2016-07-05 25.0 False Tuesday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 0
(6) Outliers: the data contains some anomalous points. Starting from a normality assumption, we treat roughly 95% of the values as normal, hence the 1.96 threshold, and cap the outliers at the largest normal value:

def find_outliers(series):
    return (series - series.mean()) > 1.96 * series.std()

def cap_values(series):
    outliers = find_outliers(series)
    max_val = series[~outliers].max()
    series[outliers] = max_val
    return series

stores = data.groupby('air_store_id')
data['is_outlier'] = stores.apply(lambda g: find_outliers(g['visitors'])).values
data['visitors_capped'] = stores.apply(lambda g: cap_values(g['visitors'])).values
data['visitors_capped_log1p'] = np.log1p(data['visitors_capped'])
data.head()
air_store_id is_test test_number visit_date visitors was_nil day_of_week is_holiday prev_day_is_holiday next_day_is_holiday air_genre_name air_area_name latitude longitude is_weekend is_outlier visitors_capped visitors_capped_log1p
0 air_00a91d42b08b08d9 False NaN 2016-07-01 35.0 False Friday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 0 False 35.0 3.583519
1 air_00a91d42b08b08d9 False NaN 2016-07-02 9.0 False Saturday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 1 False 9.0 2.302585
2 air_00a91d42b08b08d9 False NaN 2016-07-03 0.0 False Sunday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 1 False 0.0 0.000000
3 air_00a91d42b08b08d9 False NaN 2016-07-04 20.0 False Monday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 0 False 20.0 3.044522
4 air_00a91d42b08b08d9 False NaN 2016-07-05 25.0 False Tuesday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 0 False 25.0 3.258097
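The 1.96 is the two-sided 95% point of the standard normal (P(|z| <= 1.96) ≈ 0.95); note the check above is one-sided, so only unusually large visitor counts get capped. A toy run of the two helpers (values chosen by hand):

s = pd.Series([10, 12, 11, 13, 12, 11, 10, 13, 12, 90])
print(find_outliers(s))      # only the 90 is flagged
print(cap_values(s.copy()))  # the 90 is replaced by 13, the largest normal value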
data.isnull().sum()
air_store_id                  0
is_test                       0
test_number              296279
visit_date                    0
visitors                  32019
was_nil                   32019
day_of_week                   0
is_holiday                    0
prev_day_is_holiday           0
next_day_is_holiday           0
air_genre_name                0
air_area_name                 0
latitude                      0
longitude                     0
is_weekend                    0
is_outlier                    0
visitors_capped           32019
visitors_capped_log1p     32019
dtype: int64

Date features

(7) Add an "is it a weekend?" feature and a "which day of the month?" feature:

data['is_weekend'] = data['day_of_week'].isin(['Saturday', 'Sunday']).astype(int)
data['day_of_month'] = data['visit_date'].dt.day
data.head()
air_store_id is_test test_number visit_date visitors was_nil day_of_week is_holiday prev_day_is_holiday next_day_is_holiday air_genre_name air_area_name latitude longitude is_weekend is_outlier visitors_capped visitors_capped_log1p day_of_month
0 air_00a91d42b08b08d9 False NaN 2016-07-01 35.0 False Friday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 0 False 35.0 3.583519 1
1 air_00a91d42b08b08d9 False NaN 2016-07-02 9.0 False Saturday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 1 False 9.0 2.302585 2
2 air_00a91d42b08b08d9 False NaN 2016-07-03 0.0 False Sunday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 1 False 0.0 0.000000 3
3 air_00a91d42b08b08d9 False NaN 2016-07-04 20.0 False Monday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 0 False 20.0 3.044522 4
4 air_00a91d42b08b08d9 False NaN 2016-07-05 25.0 False Tuesday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 0 False 25.0 3.258097 5
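The same .dt accessor yields other calendar features almost for free if you want to experiment further (optional extras, not used in the rest of this article):

data['month'] = data['visit_date'].dt.month             # hypothetical extra feature
data['day_of_year'] = data['visit_date'].dt.dayofyear   # hypothetical extra feature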
(8) The exponentially weighted moving average (EWMA) reflects the trend of the time series. It requires an alpha, which we optimize here instead of guessing:

from scipy import optimize

def calc_shifted_ewm(series, alpha, adjust=True):
    return series.shift().ewm(alpha=alpha, adjust=adjust).mean()

def find_best_signal(series, adjust=False, eps=10e-5):

    def f(alpha):
        shifted_ewm = calc_shifted_ewm(series=series, alpha=min(max(alpha, 0), 1), adjust=adjust)
        corr = np.mean(np.power(series - shifted_ewm, 2))
        return corr

    res = optimize.differential_evolution(func=f, bounds=[(0 + eps, 1 - eps)])
    return calc_shifted_ewm(series=series, alpha=res['x'][0], adjust=adjust)

roll = data.groupby(['air_store_id', 'day_of_week']).apply(lambda g: find_best_signal(g['visitors_capped']))
data['optimized_ewm_by_air_store_id_&_day_of_week'] = roll.sort_index(level=['air_store_id', 'visit_date']).values

roll = data.groupby(['air_store_id', 'is_weekend']).apply(lambda g: find_best_signal(g['visitors_capped']))
data['optimized_ewm_by_air_store_id_&_is_weekend'] = roll.sort_index(level=['air_store_id', 'visit_date']).values

roll = data.groupby(['air_store_id', 'day_of_week']).apply(lambda g: find_best_signal(g['visitors_capped_log1p']))
data['optimized_ewm_log1p_by_air_store_id_&_day_of_week'] = roll.sort_index(level=['air_store_id', 'visit_date']).values

roll = data.groupby(['air_store_id', 'is_weekend']).apply(lambda g: find_best_signal(g['visitors_capped_log1p']))
data['optimized_ewm_log1p_by_air_store_id_&_is_weekend'] = roll.sort_index(level=['air_store_id', 'visit_date']).values

(9) Extract as much time-series information as possible:

def extract_precedent_statistics(df, on, group_by):

    df.sort_values(group_by + ['visit_date'], inplace=True)
    groups = df.groupby(group_by, sort=False)

    stats = {
        'mean': [],
        'median': [],
        'std': [],
        'count': [],
        'max': [],
        'min': []
    }

    exp_alphas = [0.1, 0.25, 0.3, 0.5, 0.75]
    stats.update({'exp_{}_mean'.format(alpha): [] for alpha in exp_alphas})

    for _, group in groups:
        shift = group[on].shift()
        roll = shift.rolling(window=len(group), min_periods=1)

        stats['mean'].extend(roll.mean())
        stats['median'].extend(roll.median())
        stats['std'].extend(roll.std())
        stats['count'].extend(roll.count())
        stats['max'].extend(roll.max())
        stats['min'].extend(roll.min())

        for alpha in exp_alphas:
            exp = shift.ewm(alpha=alpha, adjust=False)
            stats['exp_{}_mean'.format(alpha)].extend(exp.mean())

    suffix = '_&_'.join(group_by)

    for stat_name, values in stats.items():
        df['{}_{}_by_{}'.format(on, stat_name, suffix)] = values

extract_precedent_statistics(df=data, on='visitors_capped', group_by=['air_store_id', 'day_of_week'])
extract_precedent_statistics(df=data, on='visitors_capped', group_by=['air_store_id', 'is_weekend'])
extract_precedent_statistics(df=data, on='visitors_capped', group_by=['air_store_id'])
extract_precedent_statistics(df=data, on='visitors_capped_log1p', group_by=['air_store_id', 'day_of_week'])
extract_precedent_statistics(df=data, on='visitors_capped_log1p', group_by=['air_store_id', 'is_weekend'])
extract_precedent_statistics(df=data, on='visitors_capped_log1p', group_by=['air_store_id'])

data.sort_values(['air_store_id', 'visit_date']).head()
air_store_id is_test test_number visit_date visitors was_nil day_of_week is_holiday prev_day_is_holiday next_day_is_holiday ... visitors_capped_log1p_median_by_air_store_id visitors_capped_log1p_std_by_air_store_id visitors_capped_log1p_count_by_air_store_id visitors_capped_log1p_max_by_air_store_id visitors_capped_log1p_min_by_air_store_id visitors_capped_log1p_exp_0.1_mean_by_air_store_id visitors_capped_log1p_exp_0.25_mean_by_air_store_id visitors_capped_log1p_exp_0.3_mean_by_air_store_id visitors_capped_log1p_exp_0.5_mean_by_air_store_id visitors_capped_log1p_exp_0.75_mean_by_air_store_id
visit_date
2016-07-01 air_00a91d42b08b08d9 False NaN 2016-07-01 35.0 False Friday 0 0.0 0.0 ... NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN
2016-07-02 air_00a91d42b08b08d9 False NaN 2016-07-02 9.0 False Saturday 0 0.0 0.0 ... 3.583519 NaN 1.0 3.583519 3.583519 3.583519 3.583519 3.583519 3.583519 3.583519
2016-07-03 air_00a91d42b08b08d9 False NaN 2016-07-03 0.0 True Sunday 0 0.0 0.0 ... 2.943052 0.905757 2.0 3.583519 2.302585 3.455426 3.263285 3.199239 2.943052 2.622819
2016-07-04 air_00a91d42b08b08d9 False NaN 2016-07-04 20.0 False Monday 0 0.0 0.0 ... 2.302585 1.815870 3.0 3.583519 0.000000 3.109883 2.447464 2.239467 1.471526 0.655705
2016-07-05 air_00a91d42b08b08d9 False NaN 2016-07-05 25.0 False Tuesday 0 0.0 0.0 ... 2.673554 1.578354 4.0 3.583519 0.000000 3.103347 2.596729 2.480984 2.258024 2.447318

5 rows × 89 columns
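The crucial detail in both (8) and (9) is the shift() applied before every ewm/rolling call: the statistic stored on day t is computed only from days strictly before t, so none of these features leak the target. A toy check:

s = pd.Series([1.0, 2.0, 3.0, 4.0])
print(s.shift().rolling(window=len(s), min_periods=1).mean())
# NaN, 1.0, 1.5, 2.0 -- each row sees only the values before it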

(10) One-hot encode selected columns. Because the train and test rows were concatenated into one frame earlier, the dummy columns are guaranteed to line up between the two:

data = pd.get_dummies(data, columns=['day_of_week', 'air_genre_name'])
data.head()

Train/test split

data['visitors_log1p'] = np.log1p(data['visitors'])

train = data[(data['is_test'] == False) & (data['is_outlier'] == False) & (data['was_nil'] == False)]
test = data[data['is_test']].sort_values('test_number')

to_drop = ['air_store_id', 'is_test', 'test_number', 'visit_date', 'was_nil', 'is_outlier',
           'visitors_capped', 'visitors', 'air_area_name', 'latitude', 'longitude',
           'visitors_capped_log1p']
train = train.drop(to_drop, axis='columns')
train = train.dropna()
test = test.drop(to_drop, axis='columns')

X_train = train.drop('visitors_log1p', axis='columns')
X_test = test.drop('visitors_log1p', axis='columns')
y_train = train['visitors_log1p']

X_train.head()
is_holiday prev_day_is_holiday next_day_is_holiday is_weekend day_of_month optimized_ewm_by_air_store_id_&_day_of_week optimized_ewm_by_air_store_id_&_is_weekend optimized_ewm_log1p_by_air_store_id_&_day_of_week optimized_ewm_log1p_by_air_store_id_&_is_weekend visitors_capped_mean_by_air_store_id_&_day_of_week ... air_genre_name_Dining bar air_genre_name_International cuisine air_genre_name_Italian/French air_genre_name_Izakaya air_genre_name_Japanese food air_genre_name_Karaoke/Party air_genre_name_Okonomiyaki/Monja/Teppanyaki air_genre_name_Other air_genre_name_Western food air_genre_name_Yakiniku/Korean food
visit_date
2016-07-15 0 0.0 0.0 0 15 35.000700 31.642520 3.588106 3.425707 38.5 ... 0 0 1 0 0 0 0 0 0 0
2016-07-16 0 0.0 0.0 1 16 9.061831 8.618812 2.302603 2.003579 10.0 ... 0 0 1 0 0 0 0 0 0 0
2016-07-19 0 1.0 0.0 0 19 24.841272 27.988385 3.252832 2.428565 24.5 ... 0 0 1 0 0 0 0 0 0 0
2016-07-20 0 0.0 0.0 0 20 29.198575 27.675525 3.412813 2.667124 32.5 ... 0 0 1 0 0 0 0 0 0 0
2016-07-21 0 0.0 0.0 0 21 32.710972 26.767268 3.537397 2.761626 31.0 ... 0 0 1 0 0 0 0 0 0 0

5 rows × 96 columns

y_train.head()

visit_date
2016-07-15    3.367296
2016-07-16    1.791759
2016-07-19    3.258097
2016-07-20    2.995732
2016-07-21    3.871201
Name: visitors_log1p, dtype: float64

(11) Use assertions to catch any remaining problems:

assert X_train.isnull().sum().sum() == 0
assert y_train.isnull().sum() == 0
assert len(X_train) == len(y_train)
assert X_test.isnull().sum().sum() == 0
assert len(X_test) == 32019

(12) Modeling with LightGBM:

import lightgbm as lgbm
from sklearn import metrics
from sklearn import model_selection

np.random.seed(42)

model = lgbm.LGBMRegressor(
    objective='regression',
    max_depth=5,
    num_leaves=25,
    learning_rate=0.007,
    n_estimators=1000,
    min_child_samples=80,
    subsample=0.8,
    colsample_bytree=1,
    reg_alpha=0,
    reg_lambda=0,
    random_state=np.random.randint(10e6)
)

n_splits = 6
cv = model_selection.KFold(n_splits=n_splits, shuffle=True, random_state=42)
val_scores = [0] * n_splits

sub = submission['id'].to_frame()
sub['visitors'] = 0

feature_importances = pd.DataFrame(index=X_train.columns)

for i, (fit_idx, val_idx) in enumerate(cv.split(X_train, y_train)):

    X_fit = X_train.iloc[fit_idx]
    y_fit = y_train.iloc[fit_idx]
    X_val = X_train.iloc[val_idx]
    y_val = y_train.iloc[val_idx]

    model.fit(
        X_fit, y_fit,
        eval_set=[(X_fit, y_fit), (X_val, y_val)],
        eval_names=('fit', 'val'),
        eval_metric='l2',
        early_stopping_rounds=200,
        feature_name=X_fit.columns.tolist(),
        verbose=False
    )

    val_scores[i] = np.sqrt(model.best_score_['val']['l2'])
    sub['visitors'] += model.predict(X_test, num_iteration=model.best_iteration_)
    feature_importances[i] = model.feature_importances_

    print('Fold {} RMSLE: {:.5f}'.format(i + 1, val_scores[i]))

sub['visitors'] /= n_splits
sub['visitors'] = np.expm1(sub['visitors'])

val_mean = np.mean(val_scores)
val_std = np.std(val_scores)

print('Local RMSLE: {:.5f} (±{:.5f})'.format(val_mean, val_std))

Fold 1 RMSLE: 0.48936
Fold 2 RMSLE: 0.49091
Fold 3 RMSLE: 0.48654
Fold 4 RMSLE: 0.48831
Fold 5 RMSLE: 0.48788
Fold 6 RMSLE: 0.48706
Local RMSLE: 0.48834 (±0.00146)
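The loop also saves each fold's feature_importances_, but the article never displays them; averaging across folds and sorting is the natural next step (a small sketch, one possible way to look at them):

feature_importances['avg'] = feature_importances.mean(axis='columns')
print(feature_importances.sort_values('avg', ascending=False).head(10))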

Writing the results

sub.to_csv('result.csv', index=False)

import pandas as pd

df = pd.read_csv('result.csv')
df.head()
id visitors
0 air_00a91d42b08b08d9_2017-04-23 4.340348
1 air_00a91d42b08b08d9_2017-04-24 22.739363
2 air_00a91d42b08b08d9_2017-04-25 29.535532
3 air_00a91d42b08b08d9_2017-04-26 29.319551
4 air_00a91d42b08b08d9_2017-04-27 31.838669

Code adapted from:
https://edu.aliyun.com/course/1915?spm=a2c6h.12873581.0.0.6d6c56815vyMWI


Author: sapienst



Tags: multi-table joins · lightgbm · traffic
