A few thoughts:
1. Using pandas feels a lot like writing SQL: at its core it is all create/read/update/delete, but once joins, grouping, and mixed data types enter the picture there are many tricks involved, and those tricks only sink in through a continuous cycle of learning and practice.
2. When the features include a datetime column, new time-series features can be derived from it, e.g.:
(1) Is it a weekend?
(2) Which day of the month is it?
(3) Trend features
(4) Others
3. Code worth forking from this notebook:
(1) the outlier detection and capping method for numeric features;
(2) the exponentially weighted moving average that captures the time trend;
(3) the rolling time-series statistics.
4. Different machine learning algorithms call for different feature engineering. KNN is insensitive to outliers, so they need no special treatment, while linear regression, SVM, and the like do require it; tree-based methods are insensitive to feature scale, so no normalization is needed, whereas KNN needs it; XGBoost needs no handling of missing values (NaN) because it imputes them during training, while simpler algorithms force you to drop or fill them. (XGBoost comes pre-installed in Baidu's PaddlePaddle environment, but newer libraries such as LightGBM and CatBoost must be reinstalled every time the environment is reset. And a question for any readers passing by: is the free GPU that Paddle provides really a Tesla V100, as shown in Figure 1? It does not feel that fast to me.)
5. Hardware genuinely affects one's confidence doing research in machine learning and deep learning: validating one model can take hours (and that is in the high-spec Paddle environment; Tianchi's environment always has a queue, and MegEngine ("天元") was announced as open-sourced yesterday, though whether it will offer reliable free compute remains to be seen). The only software-side remedy is to learn the tricks of data storage and manipulation.
6. The three sharp blades of machine learning competitions: feature construction, modeling and tuning (log transforms; search algorithms such as greedy search, grid search, and Bayesian optimization), and model ensembling (stacking, voting, random forest + bagging + AdaBoost; see the sketch below). Ensembling really does eat memory.
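As a small illustration of points 4 and 6, here is a minimal voting-ensemble sketch with scikit-learn. It is not from the original notebook: the toy data, the chosen estimators, and the hyperparameters are placeholders; note how only KNN gets a scaling step, while the tree ensemble does not need one.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# toy regression data standing in for real competition features
X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=42)

ensemble = VotingRegressor(estimators=[
    ('ridge', Ridge()),  # linear model: sensitive to outliers and feature scale
    ('rf', RandomForestRegressor(n_estimators=100, random_state=42)),  # scale-insensitive
    ('knn', make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)))  # KNN needs scaling (point 4)
])

# the averaged prediction usually has lower variance than any single model
scores = cross_val_score(ensemble, X, y, cv=5, scoring='neg_mean_squared_error')
print('CV RMSE: {:.3f}'.format(np.sqrt(-scores.mean())))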
Restaurant traffic prediction
Restaurant visitor data
import pandas as pd
air_visit = pd.read_csv('air_visit_data.csv')
air_visit.head()
| | air_store_id | visit_date | visitors |
|---|---|---|---|
| 0 | air_ba937bf13d40fb24 | 2016-01-13 | 25 |
| 1 | air_ba937bf13d40fb24 | 2016-01-14 | 32 |
| 2 | air_ba937bf13d40fb24 | 2016-01-15 | 29 |
| 3 | air_ba937bf13d40fb24 | 2016-01-16 | 22 |
| 4 | air_ba937bf13d40fb24 | 2016-01-18 | 6 |
air_visit.index = pd.to_datetime(air_visit['visit_date'])
air_visit.head()
| visit_date (index) | air_store_id | visit_date | visitors |
|---|---|---|---|
| 2016-01-13 | air_ba937bf13d40fb24 | 2016-01-13 | 25 |
| 2016-01-14 | air_ba937bf13d40fb24 | 2016-01-14 | 32 |
| 2016-01-15 | air_ba937bf13d40fb24 | 2016-01-15 | 29 |
| 2016-01-16 | air_ba937bf13d40fb24 | 2016-01-16 | 22 |
| 2016-01-18 | air_ba937bf13d40fb24 | 2016-01-18 | 6 |
Aggregate by day
(1) Resample the series to daily frequency with resample('1d').sum()
air_visit = air_visit.groupby('air_store_id').apply(lambda g: g['visitors'].resample('1d').sum()).reset_index()
air_visit.head()
| | air_store_id | visit_date | visitors |
|---|---|---|---|
| 0 | air_00a91d42b08b08d9 | 2016-07-01 | 35 |
| 1 | air_00a91d42b08b08d9 | 2016-07-02 | 9 |
| 2 | air_00a91d42b08b08d9 | 2016-07-03 | 0 |
| 3 | air_00a91d42b08b08d9 | 2016-07-04 | 20 |
| 4 | air_00a91d42b08b08d9 | 2016-07-05 | 25 |
air_visit.info()
RangeIndex: 296279 entries, 0 to 296278
Data columns (total 3 columns):
air_store_id 296279 non-null object
visit_date 296279 non-null datetime64[ns]
visitors 296279 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 6.8+ MB
Fill missing values with 0
(2) Normalize the date column with dt.strftime('%Y-%m-%d')
air_visit['visit_date'] = air_visit['visit_date'].dt.strftime('%Y-%m-%d')
air_visit['was_nil'] = air_visit['visitors'].isnull()
air_visit['visitors'].fillna(0, inplace=True)
air_visit.head()
| | air_store_id | visit_date | visitors | was_nil |
|---|---|---|---|---|
| 0 | air_00a91d42b08b08d9 | 2016-07-01 | 35 | False |
| 1 | air_00a91d42b08b08d9 | 2016-07-02 | 9 | False |
| 2 | air_00a91d42b08b08d9 | 2016-07-03 | 0 | False |
| 3 | air_00a91d42b08b08d9 | 2016-07-04 | 20 | False |
| 4 | air_00a91d42b08b08d9 | 2016-07-05 | 25 | False |
Calendar data
date_info = pd.read_csv('date_info.csv')
date_info.head()
| | calendar_date | day_of_week | holiday_flg |
|---|---|---|---|
| 0 | 2016-01-01 | Friday | 1 |
| 1 | 2016-01-02 | Saturday | 1 |
| 2 | 2016-01-03 | Sunday | 1 |
| 3 | 2016-01-04 | Monday | 0 |
| 4 | 2016-01-05 | Tuesday | 0 |
(3) Use shift() to move the series, so we can record whether the previous day and the next day are holidays.
date_info.rename(columns={'holiday_flg': 'is_holiday', 'calendar_date': 'visit_date'}, inplace=True)
date_info['prev_day_is_holiday'] = date_info['is_holiday'].shift().fillna(0)
date_info['next_day_is_holiday'] = date_info['is_holiday'].shift(-1).fillna(0)
date_info.head()
| | visit_date | day_of_week | is_holiday | prev_day_is_holiday | next_day_is_holiday |
|---|---|---|---|---|---|
| 0 | 2016-01-01 | Friday | 1 | 0.0 | 1.0 |
| 1 | 2016-01-02 | Saturday | 1 | 1.0 | 1.0 |
| 2 | 2016-01-03 | Sunday | 1 | 1.0 | 0.0 |
| 3 | 2016-01-04 | Monday | 0 | 1.0 | 0.0 |
| 4 | 2016-01-05 | Tuesday | 0 | 0.0 | 0.0 |
Store location data
air_store_info = pd.read_csv('air_store_info.csv')
air_store_info.head()
| | air_store_id | air_genre_name | air_area_name | latitude | longitude |
|---|---|---|---|---|---|
| 0 | air_0f0cdeee6c9bf3d7 | Italian/French | Hyōgo-ken Kōbe-shi Kumoidōri | 34.695124 | 135.197852 |
| 1 | air_7cc17a324ae5c7dc | Italian/French | Hyōgo-ken Kōbe-shi Kumoidōri | 34.695124 | 135.197852 |
| 2 | air_fee8dcf4d619598e | Italian/French | Hyōgo-ken Kōbe-shi Kumoidōri | 34.695124 | 135.197852 |
| 3 | air_a17f0778617c76e2 | Italian/French | Hyōgo-ken Kōbe-shi Kumoidōri | 34.695124 | 135.197852 |
| 4 | air_83db5aff8f50478e | Italian/French | Tōkyō-to Minato-ku Shibakōen | 35.658068 | 139.751599 |
Test set
(4) Slice a string column with str.slice(start, stop)
import numpy as np
submission = pd.read_csv('sample_sub.csv')
submission['air_store_id'] = submission['id'].str.slice(0, 20)
submission['visit_date'] = submission['id'].str.slice(21)
submission['is_test'] = True  # flag marking test rows
submission['visitors'] = np.nan
submission['test_number'] = range(len(submission))
submission.head()
| | id | visitors | air_store_id | visit_date | is_test | test_number |
|---|---|---|---|---|---|---|
| 0 | air_00a91d42b08b08d9_2017-04-23 | NaN | air_00a91d42b08b08d9 | 2017-04-23 | True | 0 |
| 1 | air_00a91d42b08b08d9_2017-04-24 | NaN | air_00a91d42b08b08d9 | 2017-04-24 | True | 1 |
| 2 | air_00a91d42b08b08d9_2017-04-25 | NaN | air_00a91d42b08b08d9 | 2017-04-25 | True | 2 |
| 3 | air_00a91d42b08b08d9_2017-04-26 | NaN | air_00a91d42b08b08d9 | 2017-04-26 | True | 3 |
| 4 | air_00a91d42b08b08d9_2017-04-27 | NaN | air_00a91d42b08b08d9 | 2017-04-27 | True | 4 |
Putting all the data together
data = pd.concat((air_visit, submission.drop('id', axis='columns')), sort=False)  # sort=False silences the pandas FutureWarning about sorting the non-concatenation axis
data.head()
| | air_store_id | is_test | test_number | visit_date | visitors | was_nil |
|---|---|---|---|---|---|---|
| 0 | air_00a91d42b08b08d9 | NaN | NaN | 2016-07-01 | 35.0 | False |
| 1 | air_00a91d42b08b08d9 | NaN | NaN | 2016-07-02 | 9.0 | False |
| 2 | air_00a91d42b08b08d9 | NaN | NaN | 2016-07-03 | 0.0 | False |
| 3 | air_00a91d42b08b08d9 | NaN | NaN | 2016-07-04 | 20.0 | False |
| 4 | air_00a91d42b08b08d9 | NaN | NaN | 2016-07-05 | 25.0 | False |
data.shape
(328298, 6)
data.isnull().sum()
air_store_id 0
is_test 296279
test_number 296279
visit_date 0
visitors 32019
was_nil 32019
dtype: int64
data['is_test'].fillna(False, inplace=True)
data = pd.merge(left=data, right=date_info, on='visit_date', how='left')
data = pd.merge(left=data, right=air_store_info, on='air_store_id', how='left')
data['visitors'] = data['visitors'].astype(float)
data.head()
| | air_store_id | is_test | test_number | visit_date | visitors | was_nil | day_of_week | is_holiday | prev_day_is_holiday | next_day_is_holiday | air_genre_name | air_area_name | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | air_00a91d42b08b08d9 | False | NaN | 2016-07-01 | 35.0 | False | Friday | 0 | 0.0 | 0.0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 |
| 1 | air_00a91d42b08b08d9 | False | NaN | 2016-07-02 | 9.0 | False | Saturday | 0 | 0.0 | 0.0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 |
| 2 | air_00a91d42b08b08d9 | False | NaN | 2016-07-03 | 0.0 | False | Sunday | 0 | 0.0 | 0.0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 |
| 3 | air_00a91d42b08b08d9 | False | NaN | 2016-07-04 | 20.0 | False | Monday | 0 | 0.0 | 0.0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 |
| 4 | air_00a91d42b08b08d9 | False | NaN | 2016-07-05 | 25.0 | False | Tuesday | 0 | 0.0 | 0.0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 |
import missingno as msno
msno.bar(data)  # bar chart of non-null counts per column
Load the weather data
import glob

weather_dfs = []
for path in glob.glob('./Weather/*.csv'):
    weather_df = pd.read_csv(path)
    # strip the directory and the '.csv' suffix to get the station id
    # (the original used path.split('\\'), which only works on Windows paths,
    # so the full path leaked into station_id, as the output below shows)
    weather_df['station_id'] = path.split('/')[-1][:-len('.csv')]
    weather_dfs.append(weather_df)
weather = pd.concat(weather_dfs, axis='rows')
weather.rename(columns={'calendar_date': 'visit_date'}, inplace=True)
weather.head()
| | visit_date | avg_temperature | high_temperature | low_temperature | precipitation | hours_sunlight | solar_radiation | deepest_snowfall | total_snowfall | avg_wind_speed | avg_vapor_pressure | avg_local_pressure | avg_humidity | avg_sea_pressure | cloud_cover | station_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 | 20.5 | 22.4 | 17.5 | 0.0 | 0.6 | NaN | NaN | NaN | 6.3 | NaN | NaN | NaN | NaN | NaN | ./Weather/okinawa__ohara-kana__oohara |
| 1 | 2016-01-02 | 23.5 | 26.2 | 21.2 | 5.0 | 3.6 | NaN | NaN | NaN | 4.7 | NaN | NaN | NaN | NaN | NaN | ./Weather/okinawa__ohara-kana__oohara |
| 2 | 2016-01-03 | 21.7 | 23.7 | 20.2 | 11.0 | 0.0 | NaN | NaN | NaN | 2.8 | NaN | NaN | NaN | NaN | NaN | ./Weather/okinawa__ohara-kana__oohara |
| 3 | 2016-01-04 | 21.6 | 23.8 | 20.4 | 11.0 | 0.1 | NaN | NaN | NaN | 3.3 | NaN | NaN | NaN | NaN | NaN | ./Weather/okinawa__ohara-kana__oohara |
| 4 | 2016-01-05 | 22.1 | 24.6 | 20.5 | 35.5 | 0.0 | NaN | NaN | NaN | 2.4 | NaN | NaN | NaN | NaN | NaN | ./Weather/okinawa__ohara-kana__oohara |
Compute a daily average temperature across all local stations
(5) Group by one column and aggregate others with groupby()[['a', 'b']].mean()
means = weather.groupby('visit_date')[['avg_temperature', 'precipitation']].mean().reset_index()
means.rename(columns={'avg_temperature': 'global_avg_temperature', 'precipitation': 'global_precipitation'}, inplace=True)
means.head()
| | visit_date | global_avg_temperature | global_precipitation |
|---|---|---|---|
| 0 | 2016-01-01 | 2.868353 | 0.564662 |
| 1 | 2016-01-02 | 5.279225 | 2.341998 |
| 2 | 2016-01-03 | 6.589978 | 1.750616 |
| 3 | 2016-01-04 | 5.857883 | 1.644946 |
| 4 | 2016-01-05 | 4.556850 | 3.193625 |
means.visit_date.nunique()
517
weather.visit_date.nunique()
517
Both cover the same 517 dates, so every weather row gets a global mean and the fills below leave no gaps.
weather = pd.merge(left=weather, right=means, on='visit_date', how='left')
weather['avg_temperature'].fillna(weather['global_avg_temperature'], inplace=True)
weather['precipitation'].fillna(weather['global_precipitation'], inplace=True)
weather[['visit_date', 'avg_temperature', 'precipitation']].head()
| | visit_date | avg_temperature | precipitation |
|---|---|---|---|
| 0 | 2016-01-01 | 20.5 | 0.0 |
| 1 | 2016-01-02 | 23.5 | 5.0 |
| 2 | 2016-01-03 | 21.7 | 11.0 |
| 3 | 2016-01-04 | 21.6 | 11.0 |
| 4 | 2016-01-05 | 22.1 | 35.5 |
Data overview
data.info()
DatetimeIndex: 328298 entries, 2016-07-01 to 2017-05-31
Data columns (total 15 columns):
air_store_id 328298 non-null object
is_test 328298 non-null bool
test_number 32019 non-null float64
visit_date 328298 non-null datetime64[ns]
visitors 296279 non-null float64
was_nil 296279 non-null object
day_of_week 328298 non-null object
is_holiday 328298 non-null int64
prev_day_is_holiday 328298 non-null float64
next_day_is_holiday 328298 non-null float64
air_genre_name 328298 non-null object
air_area_name 328298 non-null object
latitude 328298 non-null float64
longitude 328298 non-null float64
is_weekend 328298 non-null int64
dtypes: bool(1), datetime64[ns](1), float64(6), int64(2), object(5)
memory usage: 37.9+ MB
data.reset_index(drop=True, inplace=True)
data.sort_values(['air_store_id', 'visit_date'], inplace=True)
data.head()
| | air_store_id | is_test | test_number | visit_date | visitors | was_nil | day_of_week | is_holiday | prev_day_is_holiday | next_day_is_holiday | air_genre_name | air_area_name | latitude | longitude | is_weekend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | air_00a91d42b08b08d9 | False | NaN | 2016-07-01 | 35.0 | False | Friday | 0 | 0.0 | 0.0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | 0 |
| 1 | air_00a91d42b08b08d9 | False | NaN | 2016-07-02 | 9.0 | False | Saturday | 0 | 0.0 | 0.0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | 1 |
| 2 | air_00a91d42b08b08d9 | False | NaN | 2016-07-03 | 0.0 | False | Sunday | 0 | 0.0 | 0.0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | 1 |
| 3 | air_00a91d42b08b08d9 | False | NaN | 2016-07-04 | 20.0 | False | Monday | 0 | 0.0 | 0.0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | 0 |
| 4 | air_00a91d42b08b08d9 | False | NaN | 2016-07-05 | 25.0 | False | Tuesday | 0 | 0.0 | 0.0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | 0 |
(6) Outliers. The data contain some outliers. Assuming an approximately normal distribution, about 95% of values are treated as normal, hence the 1.96 threshold. Outliers are capped: unusually large values are set to the largest non-outlier value.
def find_outliers(series):
    # flag values more than 1.96 standard deviations above the mean
    return (series - series.mean()) > 1.96 * series.std()

def cap_values(series):
    # cap flagged outliers at the largest non-outlier value
    outliers = find_outliers(series)
    max_val = series[~outliers].max()
    series[outliers] = max_val
    return series
stores = data.groupby('air_store_id')
# .values alignment relies on data being sorted by air_store_id (done above)
data['is_outlier'] = stores.apply(lambda g: find_outliers(g['visitors'])).values
data['visitors_capped'] = stores.apply(lambda g: cap_values(g['visitors'])).values
data['visitors_capped_log1p'] = np.log1p(data['visitors_capped'])
data.head()
| | air_store_id | is_test | test_number | visit_date | visitors | was_nil | day_of_week | is_holiday | prev_day_is_holiday | next_day_is_holiday | air_genre_name | air_area_name | latitude | longitude | is_weekend | is_outlier | visitors_capped | visitors_capped_log1p |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | air_00a91d42b08b08d9 | False | NaN | 2016-07-01 | 35.0 | False | Friday | 0 | 0.0 | 0.0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | 0 | False | 35.0 | 3.583519 |
| 1 | air_00a91d42b08b08d9 | False | NaN | 2016-07-02 | 9.0 | False | Saturday | 0 | 0.0 | 0.0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | 1 | False | 9.0 | 2.302585 |
| 2 | air_00a91d42b08b08d9 | False | NaN | 2016-07-03 | 0.0 | False | Sunday | 0 | 0.0 | 0.0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | 1 | False | 0.0 | 0.000000 |
| 3 | air_00a91d42b08b08d9 | False | NaN | 2016-07-04 | 20.0 | False | Monday | 0 | 0.0 | 0.0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | 0 | False | 20.0 | 3.044522 |
| 4 | air_00a91d42b08b08d9 | False | NaN | 2016-07-05 | 25.0 | False | Tuesday | 0 | 0.0 | 0.0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | 0 | False | 25.0 | 3.258097 |
data.isnull().sum()
air_store_id 0
is_test 0
test_number 296279
visit_date 0
visitors 32019
was_nil 32019
day_of_week 0
is_holiday 0
prev_day_is_holiday 0
next_day_is_holiday 0
air_genre_name 0
air_area_name 0
latitude 0
longitude 0
is_weekend 0
is_outlier 0
visitors_capped 32019
visitors_capped_log1p 32019
dtype: int64
Date features
(7) Add two features: "is it a weekend" and "day of the month"
data['is_weekend'] = data['day_of_week'].isin(['Saturday', 'Sunday']).astype(int)
data['day_of_month'] = data['visit_date'].dt.day
data.head()
| | air_store_id | is_test | test_number | visit_date | visitors | was_nil | day_of_week | is_holiday | prev_day_is_holiday | next_day_is_holiday | air_genre_name | air_area_name | latitude | longitude | is_weekend | is_outlier | visitors_capped | visitors_capped_log1p | day_of_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | air_00a91d42b08b08d9 | False | NaN | 2016-07-01 | 35.0 | False | Friday | 0 | 0.0 | 0.0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | 0 | False | 35.0 | 3.583519 | 1 |
| 1 | air_00a91d42b08b08d9 | False | NaN | 2016-07-02 | 9.0 | False | Saturday | 0 | 0.0 | 0.0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | 1 | False | 9.0 | 2.302585 | 2 |
| 2 | air_00a91d42b08b08d9 | False | NaN | 2016-07-03 | 0.0 | False | Sunday | 0 | 0.0 | 0.0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | 1 | False | 0.0 | 0.000000 | 3 |
| 3 | air_00a91d42b08b08d9 | False | NaN | 2016-07-04 | 20.0 | False | Monday | 0 | 0.0 | 0.0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | 0 | False | 20.0 | 3.044522 | 4 |
| 4 | air_00a91d42b08b08d9 | False | NaN | 2016-07-05 | 25.0 | False | Tuesday | 0 | 0.0 | 0.0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | 0 | False | 25.0 | 3.258097 | 5 |
(8) The exponentially weighted moving average (EWMA) captures the trend of a time series. It requires an alpha value; here we optimize to find the most suitable one.
from scipy import optimize

def calc_shifted_ewm(series, alpha, adjust=True):
    # shift by one step so each value only uses strictly earlier observations
    return series.shift().ewm(alpha=alpha, adjust=adjust).mean()

def find_best_signal(series, adjust=False, eps=10e-5):

    def f(alpha):
        # squared error between the series and its shifted EWMA
        shifted_ewm = calc_shifted_ewm(series=series, alpha=min(max(alpha, 0), 1), adjust=adjust)
        corr = np.mean(np.power(series - shifted_ewm, 2))
        return corr

    res = optimize.differential_evolution(func=f, bounds=[(0 + eps, 1 - eps)])
    return calc_shifted_ewm(series=series, alpha=res['x'][0], adjust=adjust)
roll = data.groupby(['air_store_id', 'day_of_week']).apply(lambda g: find_best_signal(g['visitors_capped']))
data['optimized_ewm_by_air_store_id_&_day_of_week'] = roll.sort_index(level=['air_store_id', 'visit_date']).values
roll = data.groupby(['air_store_id', 'is_weekend']).apply(lambda g: find_best_signal(g['visitors_capped']))
data['optimized_ewm_by_air_store_id_&_is_weekend'] = roll.sort_index(level=['air_store_id', 'visit_date']).values
roll = data.groupby(['air_store_id', 'day_of_week']).apply(lambda g: find_best_signal(g['visitors_capped_log1p']))
data['optimized_ewm_log1p_by_air_store_id_&_day_of_week'] = roll.sort_index(level=['air_store_id', 'visit_date']).values
roll = data.groupby(['air_store_id', 'is_weekend']).apply(lambda g: find_best_signal(g['visitors_capped_log1p']))
data['optimized_ewm_log1p_by_air_store_id_&_is_weekend'] = roll.sort_index(level=['air_store_id', 'visit_date']).values
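As an optional sanity check (not part of the original notebook), you can print the optimized signal next to the capped series for a single store; the store id below is simply the one used in the tables above.

g = data[data['air_store_id'] == 'air_00a91d42b08b08d9']
print(g[['visit_date', 'visitors_capped', 'optimized_ewm_by_air_store_id_&_day_of_week']].head(10))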
(9) Extract as much time-series information as possible.
def extract_precedent_statistics(df, on, group_by):
    df.sort_values(group_by + ['visit_date'], inplace=True)
    groups = df.groupby(group_by, sort=False)
    stats = {
        'mean': [],
        'median': [],
        'std': [],
        'count': [],
        'max': [],
        'min': []
    }
    exp_alphas = [0.1, 0.25, 0.3, 0.5, 0.75]
    stats.update({'exp_{}_mean'.format(alpha): [] for alpha in exp_alphas})
    for _, group in groups:
        # shift so that each row only sees strictly earlier observations
        shift = group[on].shift()
        roll = shift.rolling(window=len(group), min_periods=1)
        stats['mean'].extend(roll.mean())
        stats['median'].extend(roll.median())
        stats['std'].extend(roll.std())
        stats['count'].extend(roll.count())
        stats['max'].extend(roll.max())
        stats['min'].extend(roll.min())
        for alpha in exp_alphas:
            exp = shift.ewm(alpha=alpha, adjust=False)
            stats['exp_{}_mean'.format(alpha)].extend(exp.mean())
    suffix = '_&_'.join(group_by)
    for stat_name, values in stats.items():
        df['{}_{}_by_{}'.format(on, stat_name, suffix)] = values
extract_precedent_statistics(
    df=data,
    on='visitors_capped',
    group_by=['air_store_id', 'day_of_week']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped',
    group_by=['air_store_id', 'is_weekend']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped',
    group_by=['air_store_id']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped_log1p',
    group_by=['air_store_id', 'day_of_week']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped_log1p',
    group_by=['air_store_id', 'is_weekend']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped_log1p',
    group_by=['air_store_id']
)
data.sort_values(['air_store_id', 'visit_date']).head()
| visit_date (index) | air_store_id | is_test | test_number | visit_date | visitors | was_nil | day_of_week | is_holiday | prev_day_is_holiday | next_day_is_holiday | ... | visitors_capped_log1p_median_by_air_store_id | visitors_capped_log1p_std_by_air_store_id | visitors_capped_log1p_count_by_air_store_id | visitors_capped_log1p_max_by_air_store_id | visitors_capped_log1p_min_by_air_store_id | visitors_capped_log1p_exp_0.1_mean_by_air_store_id | visitors_capped_log1p_exp_0.25_mean_by_air_store_id | visitors_capped_log1p_exp_0.3_mean_by_air_store_id | visitors_capped_log1p_exp_0.5_mean_by_air_store_id | visitors_capped_log1p_exp_0.75_mean_by_air_store_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016-07-01 | air_00a91d42b08b08d9 | False | NaN | 2016-07-01 | 35.0 | False | Friday | 0 | 0.0 | 0.0 | ... | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2016-07-02 | air_00a91d42b08b08d9 | False | NaN | 2016-07-02 | 9.0 | False | Saturday | 0 | 0.0 | 0.0 | ... | 3.583519 | NaN | 1.0 | 3.583519 | 3.583519 | 3.583519 | 3.583519 | 3.583519 | 3.583519 | 3.583519 |
| 2016-07-03 | air_00a91d42b08b08d9 | False | NaN | 2016-07-03 | 0.0 | True | Sunday | 0 | 0.0 | 0.0 | ... | 2.943052 | 0.905757 | 2.0 | 3.583519 | 2.302585 | 3.455426 | 3.263285 | 3.199239 | 2.943052 | 2.622819 |
| 2016-07-04 | air_00a91d42b08b08d9 | False | NaN | 2016-07-04 | 20.0 | False | Monday | 0 | 0.0 | 0.0 | ... | 2.302585 | 1.815870 | 3.0 | 3.583519 | 0.000000 | 3.109883 | 2.447464 | 2.239467 | 1.471526 | 0.655705 |
| 2016-07-05 | air_00a91d42b08b08d9 | False | NaN | 2016-07-05 | 25.0 | False | Tuesday | 0 | 0.0 | 0.0 | ... | 2.673554 | 1.578354 | 4.0 | 3.583519 | 0.000000 | 3.103347 | 2.596729 | 2.480984 | 2.258024 | 2.447318 |

5 rows × 89 columns
(10) One-hot encode selected columns: data = pd.get_dummies(data, columns=['day_of_week', 'air_genre_name'])
data = pd.get_dummies(data, columns=['day_of_week', 'air_genre_name'])
data.head()
Train/test split
data['visitors_log1p'] = np.log1p(data['visitors'])
train = data[(data['is_test'] == False) & (data['is_outlier'] == False) & (data['was_nil'] == False)]
test = data[data['is_test']].sort_values('test_number')
to_drop = ['air_store_id', 'is_test', 'test_number', 'visit_date', 'was_nil',
           'is_outlier', 'visitors_capped', 'visitors',
           'air_area_name', 'latitude', 'longitude', 'visitors_capped_log1p']
train = train.drop(to_drop, axis='columns')
train = train.dropna()
test = test.drop(to_drop, axis='columns')
X_train = train.drop('visitors_log1p', axis='columns')
X_test = test.drop('visitors_log1p', axis='columns')
y_train = train['visitors_log1p']
X_train.head()
| visit_date (index) | is_holiday | prev_day_is_holiday | next_day_is_holiday | is_weekend | day_of_month | optimized_ewm_by_air_store_id_&_day_of_week | optimized_ewm_by_air_store_id_&_is_weekend | optimized_ewm_log1p_by_air_store_id_&_day_of_week | optimized_ewm_log1p_by_air_store_id_&_is_weekend | visitors_capped_mean_by_air_store_id_&_day_of_week | ... | air_genre_name_Dining bar | air_genre_name_International cuisine | air_genre_name_Italian/French | air_genre_name_Izakaya | air_genre_name_Japanese food | air_genre_name_Karaoke/Party | air_genre_name_Okonomiyaki/Monja/Teppanyaki | air_genre_name_Other | air_genre_name_Western food | air_genre_name_Yakiniku/Korean food |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016-07-15 | 0 | 0.0 | 0.0 | 0 | 15 | 35.000700 | 31.642520 | 3.588106 | 3.425707 | 38.5 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2016-07-16 | 0 | 0.0 | 0.0 | 1 | 16 | 9.061831 | 8.618812 | 2.302603 | 2.003579 | 10.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2016-07-19 | 0 | 1.0 | 0.0 | 0 | 19 | 24.841272 | 27.988385 | 3.252832 | 2.428565 | 24.5 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2016-07-20 | 0 | 0.0 | 0.0 | 0 | 20 | 29.198575 | 27.675525 | 3.412813 | 2.667124 | 32.5 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2016-07-21 | 0 | 0.0 | 0.0 | 0 | 21 | 32.710972 | 26.767268 | 3.537397 | 2.761626 | 31.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 96 columns
y_train.head()
visit_date
2016-07-15 3.367296
2016-07-16 1.791759
2016-07-19 3.258097
2016-07-20 2.995732
2016-07-21 3.871201
Name: visitors_log1p, dtype: float64
(11) Use assert statements to check that nothing is still broken.
assert X_train.isnull().sum().sum() == 0
assert y_train.isnull().sum() == 0
assert len(X_train) == len(y_train)
assert X_test.isnull().sum().sum() == 0
assert len(X_test) == 32019
(12) Modeling with LightGBM
import lightgbm as lgbm
from sklearn import metrics
from sklearn import model_selection
np.random.seed(42)
model = lgbm.LGBMRegressor(
    objective='regression',
    max_depth=5,
    num_leaves=25,
    learning_rate=0.007,
    n_estimators=1000,
    min_child_samples=80,
    subsample=0.8,
    colsample_bytree=1,
    reg_alpha=0,
    reg_lambda=0,
    random_state=np.random.randint(10e6)
)
n_splits = 6
cv = model_selection.KFold(n_splits=n_splits, shuffle=True, random_state=42)
val_scores = [0] * n_splits
sub = submission['id'].to_frame()
sub['visitors'] = 0
feature_importances = pd.DataFrame(index=X_train.columns)
for i, (fit_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    X_fit = X_train.iloc[fit_idx]
    y_fit = y_train.iloc[fit_idx]
    X_val = X_train.iloc[val_idx]
    y_val = y_train.iloc[val_idx]
    model.fit(
        X_fit,
        y_fit,
        eval_set=[(X_fit, y_fit), (X_val, y_val)],
        eval_names=('fit', 'val'),
        eval_metric='l2',
        early_stopping_rounds=200,
        feature_name=X_fit.columns.tolist(),
        verbose=False
    )
    # RMSE on log1p targets is the competition's RMSLE
    val_scores[i] = np.sqrt(model.best_score_['val']['l2'])
    sub['visitors'] += model.predict(X_test, num_iteration=model.best_iteration_)
    feature_importances[i] = model.feature_importances_
    print('Fold {} RMSLE: {:.5f}'.format(i+1, val_scores[i]))
sub['visitors'] /= n_splits
sub['visitors'] = np.expm1(sub['visitors'])
val_mean = np.mean(val_scores)
val_std = np.std(val_scores)
print('Local RMSLE: {:.5f} (±{:.5f})'.format(val_mean, val_std))
Fold 1 RMSLE: 0.48936
Fold 2 RMSLE: 0.49091
Fold 3 RMSLE: 0.48654
Fold 4 RMSLE: 0.48831
Fold 5 RMSLE: 0.48788
Fold 6 RMSLE: 0.48706
Local RMSLE: 0.48834 (±0.00146)
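The feature_importances frame filled inside the loop is never inspected above. As an optional follow-up, averaging it across folds shows which engineered features the model leans on:

# average the per-fold importances and list the ten strongest features
feature_importances['avg'] = feature_importances.mean(axis=1)
print(feature_importances['avg'].sort_values(ascending=False).head(10))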
Write out the results
sub.to_csv('result.csv', index=False)
import pandas as pd
df = pd.read_csv('result.csv')
df.head()
| | id | visitors |
|---|---|---|
| 0 | air_00a91d42b08b08d9_2017-04-23 | 4.340348 |
| 1 | air_00a91d42b08b08d9_2017-04-24 | 22.739363 |
| 2 | air_00a91d42b08b08d9_2017-04-25 | 29.535532 |
| 3 | air_00a91d42b08b08d9_2017-04-26 | 29.319551 |
| 4 | air_00a91d42b08b08d9_2017-04-27 | 31.838669 |
Code adapted from:
https://edu.aliyun.com/course/1915?spm=a2c6h.12873581.0.0.6d6c56815vyMWI
Author: sapienst
Tags: multi-table joins, lightgbm, traffic