数据科学包——pandas基础(整形、透视、时间序列、类别)

Pamela ·
更新时间:2024-11-10
· 699 次阅读

文章目录一、数据整形(stack函数)二、数据透视表(pivot table)三、时间序列1.时间序列2.在不同的时间表达方式间转换2.1 时间戳转为时期 to_period()函数2.2 时期转为时间戳 to_timestamp()函数2.3 返回固定频率的周期索引 period_range()函数2.4 series索引四、类别数据1.astype函数2.排序(sort_values())3.统计个数(groupby().size()) 一、数据整形(stack函数)

在用pandas进行数据重排时,经常用到stackunstack两个函数。stack的意思是堆叠,堆积,unstack即“不要堆叠”
 表格在行列方向上均有索引(类似于DataFrame),花括号结构只有“列方向”上的索引(类似于层次化的Series),结构更加偏向于堆叠(Series-stack,方便记忆)。stack函数会将数据从"表格结构"变成"花括号结构",即将其行索引变成列索引,反之,unstack函数将数据从”花括号结构“变成”表格结构“,即要将其中一层的列索引变成行索引

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']])) index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second']) df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B']) df >>> A B first second bar one 0.040744 -0.114534 two 1.172007 0.314802 baz one 2.609065 1.809331 two 0.998182 0.630713 foo one -1.592057 -0.383721 two 0.547086 0.964051 qux one -1.547477 -0.014280 two 0.193672 1.180379 stack行索引变成列索引 stacked = df.stack() stacked >>> first second bar one A 0.040744 B -0.114534 two A 1.172007 B 0.314802 baz one A 2.609065 B 1.809331 two A 0.998182 B 0.630713 foo one A -1.592057 B -0.383721 two A 0.547086 B 0.964051 qux one A -1.547477 B -0.014280 two A 0.193672 B 1.180379 dtype: float64 unstack列索引变成行索引 stacked.unstack() >>> A B first second bar one 0.040744 -0.114534 two 1.172007 0.314802 baz one 2.609065 1.809331 two 0.998182 0.630713 foo one -1.592057 -0.383721 two 0.547086 0.964051 qux one -1.547477 -0.014280 two 0.193672 1.180379 stacked.unstack().unstack() >>> A B second one two one two first bar 0.040744 1.172007 -0.114534 0.314802 baz 2.609065 0.998182 1.809331 0.630713 foo -1.592057 0.547086 -0.383721 0.964051 qux -1.547477 0.193672 -0.014280 1.180379 二、数据透视表(pivot table)

pivot table/轴向旋转表

>>> df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3, ... 'B' : ['A', 'B', 'C'] * 4, ... 'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2, ... 'D' : np.random.randn(12), ... 'E' : np.random.randn(12)}) >>> df A B C D E 0 one A foo 0.133675 0.008375 1 one B foo -0.184734 -0.175907 2 two C foo -0.380400 0.020585 3 three A bar 0.447330 0.030309 4 one B bar -0.242254 0.194685 5 one C bar 0.388497 -0.846569 6 two A foo 0.739428 0.737537 7 three B foo 0.171285 -1.558148 8 one C foo -0.458316 -1.048907 9 one A bar -1.319356 1.664459 10 two B bar 0.797775 -0.163211 11 three C bar 2.528107 1.471624 选择D列数据,以A、B列为行标签,以C列为列标签 >>> pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C']) C bar foo A B one A -1.319356 0.133675 B -0.242254 -0.184734 C 0.388497 -0.458316 three A 0.447330 NaN B NaN 0.171285 C 2.528107 NaN two A NaN 0.739428 B 0.797775 NaN C NaN -0.380400 选择'one'类数据,按照C列分组并且求各行数据的均值 >>> df[df.A=='one'].groupby('C').mean() D E C bar -0.391038 0.337525 foo -0.169792 -0.405480 三、时间序列

pandas 提供了强大的时间序列功能,比如把秒级的股票数据转换为5分钟周期数据等。

1.时间序列 从20200504开始,以秒为单位记录600个时间节点 >>> rng=pd.date_range('20200504',periods=600,freq='s') >>> rng DatetimeIndex(['2020-05-04 00:00:00', '2020-05-04 00:00:01', '2020-05-04 00:00:02', '2020-05-04 00:00:03', '2020-05-04 00:00:04', '2020-05-04 00:00:05', '2020-05-04 00:00:06', '2020-05-04 00:00:07', '2020-05-04 00:00:08', '2020-05-04 00:00:09', ... '2020-05-04 00:09:50', '2020-05-04 00:09:51', '2020-05-04 00:09:52', '2020-05-04 00:09:53', '2020-05-04 00:09:54', '2020-05-04 00:09:55', '2020-05-04 00:09:56', '2020-05-04 00:09:57', '2020-05-04 00:09:58', '2020-05-04 00:09:59'], dtype='datetime64[ns]', length=600, freq='S') 为每一个时间节点匹配一个0-500的数字 >>> ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng) >>> ts 2020-05-04 00:00:00 163 2020-05-04 00:00:01 380 2020-05-04 00:00:02 459 2020-05-04 00:00:03 73 ... 2020-05-04 00:09:56 189 2020-05-04 00:09:57 242 2020-05-04 00:09:58 398 2020-05-04 00:09:59 281 Freq: S, Length: 600, dtype: int32 以两分钟为一个时间周期,计算每两分钟内所有节点对应的数字的和 >>> ts.resample('2Min', how='sum') __main__:1: FutureWarning: how in .resample() is deprecated the new syntax is .resample(...).sum() 2020-05-04 00:00:00 31323 2020-05-04 00:02:00 30679 2020-05-04 00:04:00 29742 2020-05-04 00:06:00 32315 2020-05-04 00:08:00 27771 Freq: 2T, dtype: int32 2.在不同的时间表达方式间转换 >>> rng = pd.date_range('20160301', periods=5, freq='M') >>> ts = pd.Series(np.random.randn(len(rng)), index=rng) >>> ts 2016-03-31 0.608165 2016-04-30 0.371630 2016-05-31 -0.169328 2016-06-30 0.403367 2016-07-31 0.242917 Freq: M, dtype: float64 2.1 时间戳转为时期 to_period()函数 >>> ps = ts.to_period() #默认按月('M')显示 >>> ps 2016-03 0.608165 2016-04 0.371630 2016-05 -0.169328 2016-06 0.403367 2016-07 0.242917 Freq: M, dtype: float64 2.2 时期转为时间戳 to_timestamp()函数 >>> ps.to_timestamp() #默认how='start',即:ps.to_timestamp(how='start') 2016-03-01 0.608165 2016-04-01 0.371630 2016-05-01 -0.169328 2016-06-01 0.403367 2016-07-01 0.242917 Freq: MS, dtype: float64 2.3 返回固定频率的周期索引 period_range()函数 格式 pandas.period_range(start=None, end=None, periods=None, freq=None, name=None) 实例 >>> prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV') >>> prng PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2', '1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4', '1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2', '1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4', '1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2', '1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4', '1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2', '2000Q3', '2000Q4'], dtype='period[Q-NOV]', freq='Q-NOV')

一年有四个季度,每年都以11月份为结束,即是9,10,11三个月为一个季度,以此类推
Q:quarter end frequency #季度结束频率
(B)Q(S)-NOV :quarterly frequency, year ends in November #季度频率,年度结束于11月份

2.4 series索引 >>> ts = pd.Series(np.random.randn(len(prng)), prng) >>> ts 1990Q1 -0.504704 1990Q2 -0.343573 1990Q3 -1.511160 1990Q4 -0.268746 1991Q1 -0.013206 1991Q2 -1.730078 1991Q3 0.193844 1991Q4 0.732594 1992Q1 -0.656539 1992Q2 -0.577248 1992Q3 1.942460 1992Q4 -1.112426 1993Q1 0.040957 1993Q2 -1.572008 1993Q3 -0.130710 1993Q4 -1.019854 1994Q1 0.817004 1994Q2 2.099237 1994Q3 0.574557 1994Q4 -0.775947 1995Q1 -1.034555 1995Q2 -1.449465 1995Q3 -0.472482 1995Q4 0.750370 1996Q1 -0.081299 1996Q2 -0.214759 1996Q3 -0.141590 1996Q4 -0.397613 1997Q1 -0.771929 1997Q2 0.823948 1997Q3 -0.479001 1997Q4 0.624560 1998Q1 0.956335 1998Q2 0.531748 1998Q3 -0.048368 1998Q4 0.183149 1999Q1 1.060150 1999Q2 1.207727 1999Q3 1.309845 1999Q4 0.817419 2000Q1 0.992689 2000Q2 0.187355 2000Q3 1.014541 2000Q4 0.431878 Freq: Q-NOV, dtype: float64 series.index >>> ts.index PeriodIndex(['1990-03-01 09:00', '1990-06-01 09:00', '1990-09-01 09:00', '1990-12-01 09:00', '1991-03-01 09:00', '1991-06-01 09:00', '1991-09-01 09:00', '1991-12-01 09:00', '1992-03-01 09:00', '1992-06-01 09:00', '1992-09-01 09:00', '1992-12-01 09:00', '1993-03-01 09:00', '1993-06-01 09:00', '1993-09-01 09:00', '1993-12-01 09:00', '1994-03-01 09:00', '1994-06-01 09:00', '1994-09-01 09:00', '1994-12-01 09:00', '1995-03-01 09:00', '1995-06-01 09:00', '1995-09-01 09:00', '1995-12-01 09:00', '1996-03-01 09:00', '1996-06-01 09:00', '1996-09-01 09:00', '1996-12-01 09:00', '1997-03-01 09:00', '1997-06-01 09:00', '1997-09-01 09:00', '1997-12-01 09:00', '1998-03-01 09:00', '1998-06-01 09:00', '1998-09-01 09:00', '1998-12-01 09:00', '1999-03-01 09:00', '1999-06-01 09:00', '1999-09-01 09:00', '1999-12-01 09:00', '2000-03-01 09:00', '2000-06-01 09:00', '2000-09-01 09:00', '2000-12-01 09:00'], dtype='period[H]', freq='H')

在周期和时间戳之间进行转换,可以使用一些方便的算术函数。在下面的例子中,我们将截至11月的季度频率转换为季度结束后的月初的上午9时:

>>> ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9 >>> ts 1990-03-01 09:00 -0.504704 1990-06-01 09:00 -0.343573 1990-09-01 09:00 -1.511160 1990-12-01 09:00 -0.268746 1991-03-01 09:00 -0.013206 1991-06-01 09:00 -1.730078 1991-09-01 09:00 0.193844 1991-12-01 09:00 0.732594 1992-03-01 09:00 -0.656539 1992-06-01 09:00 -0.577248 1992-09-01 09:00 1.942460 1992-12-01 09:00 -1.112426 1993-03-01 09:00 0.040957 1993-06-01 09:00 -1.572008 1993-09-01 09:00 -0.130710 1993-12-01 09:00 -1.019854 1994-03-01 09:00 0.817004 1994-06-01 09:00 2.099237 1994-09-01 09:00 0.574557 1994-12-01 09:00 -0.775947 1995-03-01 09:00 -1.034555 1995-06-01 09:00 -1.449465 1995-09-01 09:00 -0.472482 1995-12-01 09:00 0.750370 1996-03-01 09:00 -0.081299 1996-06-01 09:00 -0.214759 1996-09-01 09:00 -0.141590 1996-12-01 09:00 -0.397613 1997-03-01 09:00 -0.771929 1997-06-01 09:00 0.823948 1997-09-01 09:00 -0.479001 1997-12-01 09:00 0.624560 1998-03-01 09:00 0.956335 1998-06-01 09:00 0.531748 1998-09-01 09:00 -0.048368 1998-12-01 09:00 0.183149 1999-03-01 09:00 1.060150 1999-06-01 09:00 1.207727 1999-09-01 09:00 1.309845 1999-12-01 09:00 0.817419 2000-03-01 09:00 0.992689 2000-06-01 09:00 0.187355 2000-09-01 09:00 1.014541 2000-12-01 09:00 0.431878 Freq: H, dtype: float64 四、类别数据 >>> df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']} >>> df id raw_grade 0 1 a 1 2 b 2 3 b 3 4 a 4 5 a 5 6 e 1.astype函数 格式 DataFrame.astype(self: ~FrameOrSeries, dtype, copy: bool = True, errors: str = 'raise') 功能
将pandas对象转换为指定的dtype类型。 实例 >>> df["grade"] = df["raw_grade"].astype("category") >>> df id raw_grade grade 0 1 a a 1 2 b b 2 3 b b 3 4 a a 4 5 a a 5 6 e e 查看类型 >>> df["grade"].cat.categories Index(['a', 'b', 'e'], dtype='object') 替换数据
['a', 'b', 'e']变成了对应的["very good", "good", "very bad"] >>> df["grade"].cat.categories = ["very good", "good", "very bad"] >>> df id raw_grade grade 0 1 a very good 1 2 b good 2 3 b good 3 4 a very good 4 5 a very good 5 6 e very bad 2.排序(sort_values())

根据grade的值排序

>>> df.sort_values(by='grade', ascending=True) id raw_grade grade 0 1 a very good 3 4 a very good 4 5 a very good 1 2 b good 2 3 b good 5 6 e very bad 3.统计个数(groupby().size()) >>> df.groupby("grade").size() grade very good 3 good 2 very bad 1 dtype: int64 muguangjingkong 原创文章 65获赞 7访问量 3214 关注 私信 展开阅读全文
作者:muguangjingkong



类别 数据 数据科学 时间序列 pandas 科学

需要 登录 后方可回复, 如果你还没有账号请 注册新账号