When reshaping data with pandas, two functions come up constantly: stack and unstack. stack means to pile things up; unstack is the reverse operation, "un-piling" them.
A table has indexes along both the row and column directions (like a DataFrame), whereas the stacked form has an index along only one direction (like a Series with a hierarchical index), so it looks more like a pile (think "Series-stack" as a mnemonic). stack reshapes data from the table form into the stacked form, pivoting the column labels into the innermost level of the row index; unstack goes the other way, from the stacked form back to the table form, pivoting one level of the row index into column labels.
import numpy as np
import pandas as pd

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df
>>>
A B
first second
bar one 0.040744 -0.114534
two 1.172007 0.314802
baz one 2.609065 1.809331
two 0.998182 0.630713
foo one -1.592057 -0.383721
two 0.547086 0.964051
qux one -1.547477 -0.014280
two 0.193672 1.180379
stack: pivot the column labels into the row index
stacked = df.stack()
stacked
>>>
first second
bar one A 0.040744
B -0.114534
two A 1.172007
B 0.314802
baz one A 2.609065
B 1.809331
two A 0.998182
B 0.630713
foo one A -1.592057
B -0.383721
two A 0.547086
B 0.964051
qux one A -1.547477
B -0.014280
two A 0.193672
B 1.180379
dtype: float64
unstack: pivot the innermost row-index level back into column labels
stacked.unstack()
>>>
A B
first second
bar one 0.040744 -0.114534
two 1.172007 0.314802
baz one 2.609065 1.809331
two 0.998182 0.630713
foo one -1.592057 -0.383721
two 0.547086 0.964051
qux one -1.547477 -0.014280
two 0.193672 1.180379
stacked.unstack().unstack()
>>>
A B
second one two one two
first
bar 0.040744 1.172007 -0.114534 0.314802
baz 2.609065 0.998182 1.809331 0.630713
foo -1.592057 0.547086 -0.383721 0.964051
qux -1.547477 0.193672 -0.014280 1.180379
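By default unstack pivots the innermost row-index level (which is why calling it twice above also pulled the 'second' level into the columns), but it also accepts a level argument, given as a position or a name. A small sketch reusing the stacked series from above:

stacked.unstack('second')   # pivot the 'second' level: columns become 'one' / 'two'
stacked.unstack(0)          # pivot the outermost 'first' level into the columns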
II. Pivot tables (pivot table)
A pivot table rotates (pivots) the data so that chosen columns become the row and column labels of a summary table.
>>> df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
... 'B' : ['A', 'B', 'C'] * 4,
... 'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
... 'D' : np.random.randn(12),
... 'E' : np.random.randn(12)})
>>> df
A B C D E
0 one A foo 0.133675 0.008375
1 one B foo -0.184734 -0.175907
2 two C foo -0.380400 0.020585
3 three A bar 0.447330 0.030309
4 one B bar -0.242254 0.194685
5 one C bar 0.388497 -0.846569
6 two A foo 0.739428 0.737537
7 three B foo 0.171285 -1.558148
8 one C foo -0.458316 -1.048907
9 one A bar -1.319356 1.664459
10 two B bar 0.797775 -0.163211
11 three C bar 2.528107 1.471624
Take column D as the values, using columns A and B as the row labels and column C as the column labels:
>>> pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
C bar foo
A B
one A -1.319356 0.133675
B -0.242254 -0.184734
C 0.388497 -0.458316
three A 0.447330 NaN
B NaN 0.171285
C 2.528107 NaN
two A NaN 0.739428
B 0.797775 NaN
C NaN -0.380400
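pivot_table aggregates with the mean by default; the aggfunc and fill_value parameters change the aggregation and replace the NaN cells seen above. A small sketch with the same df:

>>> pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'],
...                aggfunc='sum',   # sum duplicates instead of averaging them
...                fill_value=0)    # fill combinations that have no data with 0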
Select the rows where column A is 'one', group them by column C, and take the mean of each group:
>>> df[df.A=='one'].groupby('C').mean()
D E
C
bar -0.391038 0.337525
foo -0.169792 -0.405480
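A roughly equivalent way to get the same summary with a pivot table (a sketch, same df as above):

>>> pd.pivot_table(df[df.A == 'one'], index='C', values=['D', 'E'], aggfunc='mean')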
III. Time series
pandas provides powerful time-series functionality, such as converting second-level stock data into 5-minute bars.
1. Time series
Starting from 2020-05-04, record 600 time points at one-second intervals:
>>> rng = pd.date_range('20200504', periods=600, freq='s')
>>> rng
DatetimeIndex(['2020-05-04 00:00:00', '2020-05-04 00:00:01',
'2020-05-04 00:00:02', '2020-05-04 00:00:03',
'2020-05-04 00:00:04', '2020-05-04 00:00:05',
'2020-05-04 00:00:06', '2020-05-04 00:00:07',
'2020-05-04 00:00:08', '2020-05-04 00:00:09',
...
'2020-05-04 00:09:50', '2020-05-04 00:09:51',
'2020-05-04 00:09:52', '2020-05-04 00:09:53',
'2020-05-04 00:09:54', '2020-05-04 00:09:55',
'2020-05-04 00:09:56', '2020-05-04 00:09:57',
'2020-05-04 00:09:58', '2020-05-04 00:09:59'],
dtype='datetime64[ns]', length=600, freq='S')
Attach a random integer in the range [0, 500) to each time point:
>>> ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
>>> ts
2020-05-04 00:00:00 163
2020-05-04 00:00:01 380
2020-05-04 00:00:02 459
2020-05-04 00:00:03 73
...
2020-05-04 00:09:56 189
2020-05-04 00:09:57 242
2020-05-04 00:09:58 398
2020-05-04 00:09:59 281
Freq: S, Length: 600, dtype: int32
Resample in two-minute bins and sum the values that fall inside each bin:
>>> ts.resample('2Min').sum()
2020-05-04 00:00:00 31323
2020-05-04 00:02:00 30679
2020-05-04 00:04:00 29742
2020-05-04 00:06:00 32315
2020-05-04 00:08:00 27771
Freq: 2T, dtype: int32
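Other aggregations hang off the same resampler object; a small sketch tying back to the 5-minute stock-bar example mentioned at the start of this section:

>>> ts.resample('5Min').mean()   # average value per 5-minute bucket
>>> ts.resample('5Min').ohlc()   # open/high/low/close per bucket, like a price bar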
2. Converting between time representations
>>> rng = pd.date_range('20160301', periods=5, freq='M')
>>> ts = pd.Series(np.random.randn(len(rng)), index=rng)
>>> ts
2016-03-31 0.608165
2016-04-30 0.371630
2016-05-31 -0.169328
2016-06-30 0.403367
2016-07-31 0.242917
Freq: M, dtype: float64
2.1 Timestamps to periods: the to_period() function
>>> ps = ts.to_period()  # defaults to the frequency of the index, monthly ('M') here
>>> ps
2016-03 0.608165
2016-04 0.371630
2016-05 -0.169328
2016-06 0.403367
2016-07 0.242917
Freq: M, dtype: float64
2.2 Periods to timestamps: the to_timestamp() function
>>> ps.to_timestamp()  # defaults to how='start', i.e. ps.to_timestamp(how='start')
2016-03-01 0.608165
2016-04-01 0.371630
2016-05-01 -0.169328
2016-06-01 0.403367
2016-07-01 0.242917
Freq: MS, dtype: float64
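Passing how='end' picks the last instant of each period instead of the first; a quick sketch:

>>> ps.to_timestamp(how='end')   # month-end timestamps instead of month-start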
2.3 Fixed-frequency period indexes: the period_range() function
Signature:
pandas.period_range(start=None, end=None, periods=None, freq=None, name=None)
Example:
>>> prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
>>> prng
PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2',
'1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4',
'1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2',
'1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4',
'1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2',
'1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4',
'1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2',
'2000Q3', '2000Q4'],
dtype='period[Q-NOV]', freq='Q-NOV')
A year has four quarters, and with Q-NOV each year ends in November, so September, October and November form one quarter, and so on.
Q: quarter end frequency
(B)Q(S)-NOV: quarterly frequency, year ends in November
>>> ts = pd.Series(np.random.randn(len(prng)), prng)
>>> ts
1990Q1 -0.504704
1990Q2 -0.343573
1990Q3 -1.511160
1990Q4 -0.268746
1991Q1 -0.013206
1991Q2 -1.730078
1991Q3 0.193844
1991Q4 0.732594
1992Q1 -0.656539
1992Q2 -0.577248
1992Q3 1.942460
1992Q4 -1.112426
1993Q1 0.040957
1993Q2 -1.572008
1993Q3 -0.130710
1993Q4 -1.019854
1994Q1 0.817004
1994Q2 2.099237
1994Q3 0.574557
1994Q4 -0.775947
1995Q1 -1.034555
1995Q2 -1.449465
1995Q3 -0.472482
1995Q4 0.750370
1996Q1 -0.081299
1996Q2 -0.214759
1996Q3 -0.141590
1996Q4 -0.397613
1997Q1 -0.771929
1997Q2 0.823948
1997Q3 -0.479001
1997Q4 0.624560
1998Q1 0.956335
1998Q2 0.531748
1998Q3 -0.048368
1998Q4 0.183149
1999Q1 1.060150
1999Q2 1.207727
1999Q3 1.309845
1999Q4 0.817419
2000Q1 0.992689
2000Q2 0.187355
2000Q3 1.014541
2000Q4 0.431878
Freq: Q-NOV, dtype: float64
Before the conversion in the next step, the index of ts is simply the quarterly PeriodIndex prng:
>>> ts.index
PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2',
             '1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4',
             '1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2',
             '1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4',
             '1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2',
             '1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4',
             '1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2',
             '2000Q3', '2000Q4'],
            dtype='period[Q-NOV]', freq='Q-NOV')
Converting between periods and timestamps enables some convenient arithmetic functions. In the following example, we convert the quarterly frequency with year ending in November to 9:00 am on the first day of the month following each quarter's end:
>>> ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
>>> ts
1990-03-01 09:00 -0.504704
1990-06-01 09:00 -0.343573
1990-09-01 09:00 -1.511160
1990-12-01 09:00 -0.268746
1991-03-01 09:00 -0.013206
1991-06-01 09:00 -1.730078
1991-09-01 09:00 0.193844
1991-12-01 09:00 0.732594
1992-03-01 09:00 -0.656539
1992-06-01 09:00 -0.577248
1992-09-01 09:00 1.942460
1992-12-01 09:00 -1.112426
1993-03-01 09:00 0.040957
1993-06-01 09:00 -1.572008
1993-09-01 09:00 -0.130710
1993-12-01 09:00 -1.019854
1994-03-01 09:00 0.817004
1994-06-01 09:00 2.099237
1994-09-01 09:00 0.574557
1994-12-01 09:00 -0.775947
1995-03-01 09:00 -1.034555
1995-06-01 09:00 -1.449465
1995-09-01 09:00 -0.472482
1995-12-01 09:00 0.750370
1996-03-01 09:00 -0.081299
1996-06-01 09:00 -0.214759
1996-09-01 09:00 -0.141590
1996-12-01 09:00 -0.397613
1997-03-01 09:00 -0.771929
1997-06-01 09:00 0.823948
1997-09-01 09:00 -0.479001
1997-12-01 09:00 0.624560
1998-03-01 09:00 0.956335
1998-06-01 09:00 0.531748
1998-09-01 09:00 -0.048368
1998-12-01 09:00 0.183149
1999-03-01 09:00 1.060150
1999-06-01 09:00 1.207727
1999-09-01 09:00 1.309845
1999-12-01 09:00 0.817419
2000-03-01 09:00 0.992689
2000-06-01 09:00 0.187355
2000-09-01 09:00 1.014541
2000-12-01 09:00 0.431878
Freq: H, dtype: float64
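The one-liner above packs several steps together; here is a commented sketch of the same arithmetic (same result, assuming prng as defined earlier):

month_end  = prng.asfreq('M', 'e')        # last month of each quarter ('e' = end)
next_month = month_end + 1                # first month after the quarter ends
hour_zero  = next_month.asfreq('H', 's')  # 00:00 at the start of that month ('s' = start)
nine_am    = hour_zero + 9                # shift by 9 hourly periods -> 09:00 on the 1st
ts.index   = nine_am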
IV. Categorical data
>>> df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
>>> df
id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 a
5 6 e
1. The astype() function
Signature:
DataFrame.astype(self: ~FrameOrSeries, dtype, copy: bool = True, errors: str = 'raise')
Usage: convert the raw_grade column to the category dtype and store it in a new grade column:
>>> df["grade"] = df["raw_grade"].astype("category")
>>> df
id raw_grade grade
0 1 a a
1 2 b b
2 3 b b
3 4 a a
4 5 a a
5 6 e e
View the categories:
>>> df["grade"].cat.categories
Index(['a', 'b', 'e'], dtype='object')
Rename the categories, so that ['a', 'b', 'e'] become the corresponding ["very good", "good", "very bad"]:
>>> df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])
>>> df
id raw_grade grade
0 1 a very good
1 2 b good
2 3 b good
3 4 a very good
4 5 a very good
5 6 e very bad
2. Sorting (sort_values())
Sort by the value of grade (in categorical order, not alphabetical order):
>>> df.sort_values(by='grade', ascending=True)
id raw_grade grade
0 1 a very good
3 4 a very good
4 5 a very good
1 2 b good
2 3 b good
5 6 e very bad
3. Counting (groupby().size())
>>> df.groupby("grade").size()
grade
very good 3
good 2
very bad 1
dtype: int64
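Following the pandas categorical documentation, set_categories can reorder the categories and add ones that never occur in the data; grouping then reports those empty categories too. A hedged sketch (the extra grades "bad" and "medium" are made up for illustration):

>>> df["grade"] = df["grade"].cat.set_categories(
...     ["very bad", "bad", "medium", "good", "very good"])
>>> df.groupby("grade", observed=False).size()   # 'bad' and 'medium' appear with count 0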