Data Science Packages: pandas Basics (Core Data Structures)

Durriya ·

Updated: 2024-11-10


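Contents

I. Series
  1. Creation
    1.1 Creating from an ndarray
    1.2 Creating from a dict
    1.3 Creating from a scalar
  2. The Series object
    2.1 Series is ndarray-like
    2.2 Series is dict-like
  3. Label alignment
  4. The name attribute
II. DataFrame
  1. Creation
    1.1 Creating from a dict
    1.2 Creating from structured data
    1.3 Creating from a list of dicts
    1.4 Creating from a dict of tuples
    1.5 Creating from a Series
  2. Selecting, adding, and deleting columns
    2.1 Selecting columns
    2.2 Deleting columns
    2.3 Inserting columns
  3. Indexing and selection
  4. Data alignment
  5. Using NumPy functions
  6. Tab completion
III. Panel

I. Series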

A Series is a one-dimensional labeled array that can hold any data type (integers, floats, strings, Python objects). The basic constructor is:

s = pd.Series(data, index=index)

Here index is a list of labels for the data. data can be any of the following:

- a Python dict
- an ndarray
- a scalar value, such as 5

1. Creation

1.1 Creating from an ndarray

>>> import numpy as np
>>> import pandas as pd
>>> s=pd.Series(np.random.randn(5),index=['a','b','c','d','e'])
>>> s
a   -0.485521
b   -0.286831
c    1.292780
d   -0.625325
e   -0.936284
dtype: float64
>>> s.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

Note that Series is spelled with a capital S. When index is omitted, a default integer index starting at 0 is used:

>>> s=pd.Series(np.random.randn(5))
>>> s
0   -1.657662
1    0.149248
2    1.728224
3    0.058451
4    0.345831
dtype: float64
>>> s.index
RangeIndex(start=0, stop=5, step=1)

1.2 Creating from a dict

Create a dict d and convert it directly to a Series

>>> d = {'a' : 0., 'b' : 1., 'd' : 3}
>>> s=pd.Series(d)
>>> s
a    0.0
b    1.0
d    3.0
dtype: float64

With a custom index, labels that have no matching key in the dict are set to NaN

>>> d = {'a' : 0., 'b' : 1., 'd' : 3}
>>> s=pd.Series(d,index=list('absd'))
>>> s
a    0.0
b    1.0
s    NaN
d    3.0
dtype: float64
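
The lookup-by-key behaviour above has a side effect on the dtype that is easy to miss. The following is a minimal sketch, assuming integer values in the dict (the name counts and its contents are made up for illustration): once a missing label introduces NaN, the Series is upcast to float64.

```python
import numpy as np
import pandas as pd

# Hypothetical data, used only for illustration.
counts = {'a': 1, 'b': 2, 'd': 4}

# Without an index, the dict keys become the labels and the values stay integers.
s_all = pd.Series(counts)

# With an index, values are looked up by key. 's' has no key, so it becomes NaN,
# and because NaN is a float the whole Series is upcast to float64.
s_sel = pd.Series(counts, index=['a', 'b', 's', 'd'])

print(s_all.dtype)
print(s_sel.dtype)
print(s_sel.isna())
```

s_sel.isna() is a convenient way to find which requested labels were absent from the dict.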

1.3 Creating from a scalar

>>> s=pd.Series(3,index=range(5))
>>> s
0    3
1    3
2    3
3    3
4    3
dtype: int64

2. The Series object

2.1 Series is ndarray-like

A Series supports NumPy-style indexing and slicing, and can be passed to NumPy functions

>>> s = pd.Series(np.random.randn(5))
>>> s
0   -0.104885
1    0.375955
2    1.305717
3    0.441162
4   -0.598452
dtype: float64
>>> s[0]
-0.10488490668673565
>>> s[3:]
3    0.441162
4   -0.598452
dtype: float64
>>> np.exp(s)
0    0.900428
1    1.456382
2    3.690336
3    1.554513
4    0.549662
dtype: float64
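
The transcript above uses the default integer index, where s[0] happens to be both the label 0 and the first position. As a hedged sketch with made-up values, the explicit positional accessor iloc keeps the ndarray-like behaviour unambiguous when the labels are strings:

```python
import numpy as np
import pandas as pd

# Illustrative values with string labels (not from the original transcript).
s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=list('abcde'))

first = s.iloc[0]            # positional access, like arr[0]
tail = s.iloc[3:]            # positional slice, like arr[3:]
above = s[s > s.median()]    # boolean mask, like arr[arr > x]
doubled = s * 2              # vectorized arithmetic
roots = np.sqrt(s)           # a NumPy ufunc returns a Series with the same labels

print(first)
print(tail)
print(above)
print(roots)
```

In every case the labels travel with the values, which is the main difference from a plain ndarray.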

2.2 Series is dict-like

>>> s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
>>> s
a    0.184751
b   -0.006316
c   -1.113671
d   -2.804318
e    1.493505
dtype: float64
>>> s['a']
0.18475101331017024
>>> s['e']=3
>>> s
a    0.184751
b   -0.006316
c   -1.113671
d   -2.804318
e    3.000000
dtype: float64
>>> s['g'] = 100
>>> s
a      0.184751
b     -0.006316
c     -1.113671
d     -2.804318
e      3.000000
g    100.000000
dtype: float64
>>> 'e' in s
True
>>> print( s.get('f'))
None
>>> print( s.get('f', np.nan))
nan
>>> print( s.get('f', 5))
5
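
One detail of the dict-like interface is worth a short sketch (the values here are illustrative): like a dict, the in operator tests the labels, not the values.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

has_label = 'a' in s                 # True: 'a' is a label
has_value = 1.0 in s                 # False: membership checks labels, not values
value_present = 1.0 in s.to_numpy()  # True: check the underlying values instead

missing = s.get('z')                 # None, no KeyError
fallback = s.get('z', 0.0)           # explicit default

print(has_label, has_value, value_present, missing, fallback)
```

Using get with a default avoids the KeyError that s['z'] would raise for a missing label.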

3. Label alignment

>>> s1 = pd.Series(np.random.randn(3), index=['a', 'c', 'e'])
>>> s2 = pd.Series(np.random.randn(3), index=['a', 'd', 'e'])
>>> print('{0}\n\n{1}'.format(s1, s2))
a   -0.123366
c   -0.434903
e   -1.064005
dtype: float64
a    0.784026
d   -1.846238
e   -1.247743
dtype: float64
>>> s1 + s2
a    0.660660
c         NaN
d         NaN
e   -2.311748
dtype: float64
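
Labels present in only one operand produce NaN, as shown above. If a missing label should instead count as zero, the add method accepts a fill_value. A minimal sketch with fixed, made-up numbers so the result is easy to check by hand:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'c', 'e'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['a', 'd', 'e'])

# Plain addition: the result index is the union of the labels;
# 'c' and 'd' exist on one side only, so they become NaN.
total = s1 + s2

# add() with fill_value=0 treats a label missing on one side as 0.
filled = s1.add(s2, fill_value=0)

print(total)
print(filled)
```

Whether NaN or a fill value is appropriate depends on what a missing label means in the data at hand.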

4. The name attribute

>>> s = pd.Series(np.random.randn(5), name='Some Thing')
>>> s
0   -0.025971
1    1.427484
2    0.684746
3    0.928511
4    0.097620
Name: Some Thing, dtype: float64
>>> s.name
'Some Thing'
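
The name matters mostly when a Series is turned into, or combined with, a DataFrame. A small sketch under that assumption, with illustrative values:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], name='Some Thing')

# rename() returns a new Series; the original keeps its name.
renamed = s.rename('Other Thing')

# to_frame() uses the Series name as the column label.
frame = s.to_frame()

print(s.name)
print(renamed.name)
print(frame.columns)
```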

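II. DataFrame

A DataFrame is a two-dimensional array with both row labels and column labels. You can think of it as an Excel spreadsheet or a SQL table, or as a dict of Series objects. It is the most commonly used data structure in pandas.

The basic form for creating a DataFrame is: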

df = pd.DataFrame(data, index=index, columns=columns)

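Here index holds the row labels and columns holds the column labels. data can be any of the following:

- a dict of one-dimensional NumPy arrays, lists, or Series
- a two-dimensional NumPy array
- a Series
- another DataFrame

1. Creation

1.1 Creating from a dict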

>>> d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
...      'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
>>> d
{'one': a    1
b    2
c    3
dtype: int64, 'two': a    1
b    2
c    3
d    4
dtype: int64}
>>> pd.DataFrame(d)
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4

When row and column labels are specified, labels with no corresponding data show NaN

>>> pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
   two three
d    4   NaN
b    2   NaN
a    1   NaN
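
The examples above build a DataFrame from a dict of Series. The other inputs named in the list of accepted data types work similarly; here is a minimal sketch with made-up values for a dict of lists and a two-dimensional NumPy array.

```python
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c']

# Dict of equal-length lists: keys become column labels.
by_dict = pd.DataFrame({'one': [1, 2, 3], 'two': [4.0, 5.0, 6.0]}, index=labels)

# 2-D ndarray: rows and columns are labeled through index and columns.
by_array = pd.DataFrame(np.arange(6).reshape(3, 2), index=labels, columns=['one', 'two'])

print(by_dict)
print(by_array)
```

Unlike a dict of Series, plain lists carry no labels, so all lists must have the same length and that length must match index.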

1.2 Creating from structured data

>>> data = [(1, 2.2, 'Hello'), (2, 3., "World")]
>>> data
[(1, 2.2, 'Hello'), (2, 3.0, 'World')]
>>> pd.DataFrame(data)
   0    1      2
0  1  2.2  Hello
1  2  3.0  World
>>> pd.DataFrame(data, index=['first', 'second'], columns=['A', 'B', 'C'])
        A    B      C
first   1  2.2  Hello
second  2  3.0  World

1.3 Creating from a list of dicts

>>> data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
>>> pd.DataFrame(data)
   a   b     c
0  1   2   NaN
1  5  10  20.0
>>> pd.DataFrame(data,index=['first','second'], columns=['a', 'e'])
        a   e
first   1 NaN
second  5 NaN

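1.4 Creating from a dict of tuples

This section is mainly for understanding how construction works. In practice you would usually clean the data into a format that is easy for pandas to import and easy to read, and only then build more complex structures with reindex/groupby and similar operations.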

>>> d = {('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
...      ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
...      ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
...      ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
...      ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}}
>>> d
{('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2}, ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4}, ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6}, ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8}, ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}}
# multi-level labels
>>> pd.DataFrame(d)
       a              b
       b    a    c    a     b
A B  1.0  4.0  5.0  8.0  10.0
  C  2.0  3.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0
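
To make the multi-level result less abstract, the following sketch (using a shortened version of the dict above) shows how such a frame can be queried. It is an illustration of one way to read the structure, not the only one.

```python
import pandas as pd

# A shortened version of the dict of tuples used above.
d = {('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
     ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
     ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8}}
df = pd.DataFrame(d)

# Outer dict keys become a two-level column index,
# inner dict keys become a two-level row index.
n_col_levels = df.columns.nlevels

# Selecting by the first column level returns the sub-frame under 'a'.
sub = df['a']

# A full (row tuple, column tuple) pair addresses a single cell.
cell = df.loc[('A', 'B'), ('a', 'b')]

print(n_col_levels)
print(sub)
print(cell)
```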

1.5 Creating from a Series

>>> s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
>>> pd.DataFrame(s,columns=['A'])
          A
a  0.748728
b -0.119084
c  0.328340
d -1.707235
e  0.205882

2. Selecting, adding, and deleting columns

2.1 Selecting columns

>>> df = pd.DataFrame(np.random.randn(6, 4), columns=['one', 'two', 'three', 'four'])
>>> df
        one       two     three      four
0  0.486625  0.094514  0.733189 -1.137290
1  0.155623 -0.610077  0.424488  0.103686
2 -1.747658 -0.618322 -1.070768  1.638107
3 -0.761408 -0.353779  1.363916  1.663116
4  0.012482  0.385496  0.283480  0.716104
5  0.784946 -0.568144  1.411448  0.187921
>>> df['three'] = df['one'] + df['two']
>>> df
        one       two     three      four
0  0.486625  0.094514  0.581139 -1.137290
1  0.155623 -0.610077 -0.454453  0.103686
2 -1.747658 -0.618322 -2.365981  1.638107
3 -0.761408 -0.353779 -1.115188  1.663116
4  0.012482  0.385496  0.397978  0.716104
5  0.784946 -0.568144  0.216803  0.187921
>>> df['flag'] = df['one'] > 0
>>> df
        one       two     three      four   flag
0  0.486625  0.094514  0.581139 -1.137290   True
1  0.155623 -0.610077 -0.454453  0.103686   True
2 -1.747658 -0.618322 -2.365981  1.638107  False
3 -0.761408 -0.353779 -1.115188  1.663116  False
4  0.012482  0.385496  0.397978  0.716104   True
5  0.784946 -0.568144  0.216803  0.187921   True

2.2 Deleting columns

The del statement

>>> del df['three']
>>> df
        one       two      four   flag
0  0.486625  0.094514 -1.137290   True
1  0.155623 -0.610077  0.103686   True
2 -1.747658 -0.618322  1.638107  False
3 -0.761408 -0.353779  1.663116  False
4  0.012482  0.385496  0.716104   True
5  0.784946 -0.568144  0.187921   True

The pop method

>>> four = df.pop('four')
>>> four
0   -1.137290
1    0.103686
2    1.638107
3    1.663116
4    0.716104
5    0.187921
Name: four, dtype: float64
>>> df
        one       two   flag
0  0.486625  0.094514   True
1  0.155623 -0.610077   True
2 -1.747658 -0.618322  False
3 -0.761408 -0.353779  False
4  0.012482  0.385496   True
5  0.784946 -0.568144   True
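
Both del and pop modify the DataFrame in place. When the original should be kept, drop is a non-mutating alternative. A small sketch with illustrative data comparing the three:

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0], 'two': [3.0, 4.0], 'four': [5.0, 6.0]})

# drop() returns a new DataFrame; df still has all three columns afterwards.
trimmed = df.drop(columns=['four'])
columns_after_drop = list(df.columns)

# pop() removes the column in place and returns it as a Series.
popped = df.pop('two')

# del removes the column in place and returns nothing.
del df['four']

print(trimmed.columns.tolist())
print(popped.tolist())
print(df.columns.tolist())
```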

2.3 Inserting columns

>>> df['five'] = 5
>>> df
        one       two   flag  five
0  0.486625  0.094514   True     5
1  0.155623 -0.610077   True     5
2 -1.747658 -0.618322  False     5
3 -0.761408 -0.353779  False     5
4  0.012482  0.385496   True     5
5  0.784946 -0.568144   True     5
>>> df['one_trunc'] = df['one'][:2]
>>> df
        one       two   flag  five  one_trunc
0  0.486625  0.094514   True     5   0.486625
1  0.155623 -0.610077   True     5   0.155623
2 -1.747658 -0.618322  False     5        NaN
3 -0.761408 -0.353779  False     5        NaN
4  0.012482  0.385496   True     5        NaN
5  0.784946 -0.568144   True     5        NaN

Use the insert method to insert at a specific position

>>> df.insert(1, 'bar', df['one'])
>>> df
        one       bar       two   flag  five  one_trunc
0  0.486625  0.486625  0.094514   True     5   0.486625
1  0.155623  0.155623 -0.610077   True     5   0.155623
2 -1.747658 -1.747658 -0.618322  False     5        NaN
3 -0.761408 -0.761408 -0.353779  False     5        NaN
4  0.012482  0.012482  0.385496   True     5        NaN
5  0.784946  0.784946 -0.568144   True     5        NaN

Use the assign() method to add new columns.
It returns a new DataFrame and leaves df unchanged, which makes it convenient in method chains.

>>> df = pd.DataFrame(np.random.randint(1, 5, (6, 4)), columns=list('ABCD'))
>>> df
   A  B  C  D
0  2  2  4  1
1  2  4  3  1
2  3  1  3  2
3  3  2  4  1
4  2  4  3  2
5  3  4  4  3

Add a new column whose values are column A divided by column B

>>> df.assign(Ratio = df['A'] / df['B'])
   A  B  C  D  Ratio
0  2  2  4  1   1.00
1  2  4  3  1   0.50
2  3  1  3  2   3.00
3  3  2  4  1   1.50
4  2  4  3  2   0.50
5  3  4  4  3   0.75

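Add new columns with callables; each lambda receives the DataFrame (note that CD_Ratio here is the difference C - D, not a ratio)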

>>> df.assign(AB_Ratio = lambda x: x.A / x.B, CD_Ratio = lambda x: x.C - x.D)
   A  B  C  D  AB_Ratio  CD_Ratio
0  2  2  4  1      1.00         3
1  2  4  3  1      0.50         2
2  3  1  3  2      3.00         1
3  3  2  4  1      1.50         3
4  2  4  3  2      0.50         1
5  3  4  4  3      0.75         1
>>> df.assign(AB_Ratio = lambda x: x.A / x.B).assign(ABD_Ratio = lambda x: x.AB_Ratio * x.D)
   A  B  C  D  AB_Ratio  ABD_Ratio
0  2  2  4  1      1.00       1.00
1  2  4  3  1      0.50       0.50
2  3  1  3  2      3.00       6.00
3  3  2  4  1      1.50       1.50
4  2  4  3  2      0.50       1.00
5  3  4  4  3      0.75       2.25
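
Since the transcripts above never print df again after calling assign, here is a short sketch (with a smaller, made-up frame) that confirms the original is left untouched and that the result must be bound to a name to be kept.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 2, 3], 'B': [2, 4, 1]})

# assign() builds and returns a new DataFrame.
out = df.assign(Ratio=lambda x: x.A / x.B)

print(df.columns.tolist())    # the original columns only
print(out.columns.tolist())   # original columns plus Ratio
print(out['Ratio'].tolist())
```

To keep the new column, rebind the name (df = df.assign(...)) or continue the chain.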

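3. Indexing and selection

Operations with their syntax and return types:

- select a column -> df[col] -> Series
- select a row by label -> df.loc[label] -> Series
- select a row by integer position -> df.iloc[loc] -> Series
- slice rows -> df[5:10] -> DataFrame
- select rows with a boolean vector -> df[bool_vector] -> DataFrame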

>>> df = pd.DataFrame(np.random.randint(1, 10, (6, 4)), index=list('abcdef'), columns=list('ABCD'))
>>> df
   A  B  C  D
a  2  8  8  2
b  9  2  8  2
c  7  5  1  2
d  8  3  4  2
e  2  1  2  4
f  8  2  7  3

>>> df['B']
a    8
b    2
c    5
d    3
e    1
f    2
Name: B, dtype: int32
>>> df.loc['B']
KeyError: 'B'
>>> df.loc['b']
A    9
B    2
C    8
D    2
Name: b, dtype: int32
>>> df.iloc[0]
A    2
B    8
C    8
D    2
Name: a, dtype: int32
>>> df[1:4]
   A  B  C  D
b  9  2  8  2
c  7  5  1  2
d  8  3  4  2
# show the rows at the positions where the mask is True
>>> df[[False, True, True, False, True, False]]
   A  B  C  D
b  9  2  8  2
c  7  5  1  2
e  2  1  2  4
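
Two points from the operations above are easy to get wrong: label slices with loc include the end label, while positional slices with iloc exclude the end position, and the boolean vector is usually computed from a column instead of typed by hand. A sketch reusing the values of the frame shown above:

```python
import pandas as pd

df = pd.DataFrame(
    [[2, 8, 8, 2], [9, 2, 8, 2], [7, 5, 1, 2],
     [8, 3, 4, 2], [2, 1, 2, 4], [8, 2, 7, 3]],
    index=list('abcdef'), columns=list('ABCD'))

label_slice = df.loc['b':'d']   # rows b, c, d (end label included)
pos_slice = df.iloc[1:4]        # rows at positions 1, 2, 3 (end excluded)

# A boolean Series derived from a column selects the matching rows.
mask_rows = df[df['A'] > 7]

# loc also takes a row label and a column label together to read one cell.
cell = df.loc['c', 'B']

print(label_slice)
print(mask_rows)
print(cell)
```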

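4. Data alignment

When DataFrames are combined in arithmetic, pandas automatically aligns them on both row and column labels. The result carries the union of the row and column labels of the two DataFrames.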

>>> df1 = pd.DataFrame(np.random.randn(10, 4), index=list('abcdefghij'), columns=['A', 'B', 'C', 'D'])
>>> df1
          A         B         C         D
a -1.862886 -1.547650  0.637708  0.350643
b -0.421221 -1.479398 -0.480860  0.166336
c -0.010406 -0.849795  0.034272 -0.589808
d  0.450138  0.391159  0.914933  0.530649
e  1.036746  0.097552  0.914027  0.570200
f -0.215569  0.461338  0.831485  0.816958
g  0.823373  0.656957 -0.243091 -0.469380
h -0.946946  0.017144 -0.647669 -1.496623
i -1.533835  1.253698 -0.340709 -0.113551
j -0.132444  1.058355  0.038903 -0.072712
>>> df2 = pd.DataFrame(np.random.randn(7, 3), index=list('cdefghi'), columns=['A', 'B', 'C'])
>>> df2
          A         B         C
c -1.391986 -0.219589 -1.144956
d  0.588511  0.567815  0.545037
e  1.981807  0.274164 -0.895879
f  0.209802  0.031883  0.139088
g -0.338254  1.317608  0.156630
h -0.097541  0.312342 -0.217281
i  0.687546 -0.631277  0.577067

For df1 + df2, values with matching row and column labels are added; positions present in only one of the two become NaN

>>> df1 + df2
          A         B         C   D
a       NaN       NaN       NaN NaN
b       NaN       NaN       NaN NaN
c -1.402392 -1.069384 -1.110684 NaN
d  1.038649  0.958975  1.459970 NaN
e  3.018553  0.371716  0.018148 NaN
f -0.005767  0.493221  0.970574 NaN
g  0.485119  1.974565 -0.086460 NaN
h -1.044486  0.329486 -0.864950 NaN
i -0.846289  0.622422  0.236357 NaN
j       NaN       NaN       NaN NaN

# subtracting a row (a Series) aligns it on the column labels and applies it to every row
>>> df1 - df1.iloc[0]
          A         B         C         D
a  0.000000  0.000000  0.000000  0.000000
b  1.441665  0.068252 -1.118567 -0.184308
c  1.852480  0.697855 -0.603436 -0.940452
d  2.313024  1.938809  0.277226  0.180006
e  2.899632  1.645202  0.276319  0.219557
f  1.647317  2.008988  0.193778  0.466314
g  2.686259  2.204607 -0.880798 -0.820024
h  0.915940  1.564794 -1.285376 -1.847267
i  0.329051  2.801349 -0.978417 -0.464194
j  1.730442  2.606005 -0.598804 -0.423355
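
The alignment rules above can be checked by hand on a small case. This sketch uses made-up numbers and also shows two related methods: add with fill_value, which treats labels missing on one side as 0, and sub with axis=0, which aligns a Series on the row labels instead of the column labels.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4.0, 5.0, 6.0]}, index=list('abc'))
df2 = pd.DataFrame({'A': [10.0, 20.0]}, index=list('bc'))

# Union of labels; anything missing on either side becomes NaN.
plain = df1 + df2

# Labels missing on one side are filled with 0 before adding.
filled = df1.add(df2, fill_value=0)

# Subtract the first row from every row (aligned on column labels).
row_centered = df1 - df1.iloc[0]

# Subtract column A from every column (aligned on row labels).
col_centered = df1.sub(df1['A'], axis=0)

print(plain)
print(filled)
print(row_centered)
print(col_centered)
```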

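5. Using NumPy functions

pandas' core data structures interoperate with NumPy: elementwise NumPy functions such as np.exp can be applied to them directly and return a labeled pandas object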

>>> df = pd.DataFrame(np.random.randn(10, 4), columns=['one', 'two', 'three', 'four'])
>>> df
        one       two     three      four
0  1.800023 -0.550830 -1.115527  1.283088
1  0.005457 -0.205792  1.406842  0.253727
2  1.658374  0.220637  0.349239  0.178845
3 -0.087544  0.262716 -0.822376  1.076153
4  0.942431 -1.170636  0.637203 -1.443319
5  0.165776  0.118799  1.792991 -0.923901
6  0.107792 -0.595107  0.090514  0.178640
7 -0.288757  0.414845  0.074528 -2.418104
8  0.082551 -0.935000  0.017684 -0.990776
9 -0.722961  0.816024 -1.634607 -0.774388

Compute the exponential function with base e

>>> np.exp(df)
        one       two     three      four
0  6.049785  0.576471  0.327742  3.607763
1  1.005472  0.814002  4.083041  1.288820
2  5.250764  1.246871  1.417989  1.195835
3  0.916178  1.300457  0.439387  2.933374
4  2.566213  0.310170  1.891183  0.236143
5  1.180309  1.126143  6.007395  0.396968
6  1.113816  0.551504  1.094737  1.195590
7  0.749195  1.514136  1.077376  0.089090
8  1.086054  0.392586  1.017842  0.371288
9  0.485313  2.261491  0.195029  0.460986

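Both np.array and np.asarray convert structured data to an ndarray.
The main difference shows up when the source is already an ndarray:
np.array still makes a copy by default, which uses new memory, whereas np.asarray returns the input without copying (when no dtype conversion is needed).

>>> type(np.asarray(df))
<class 'numpy.ndarray'>

Check that the converted values equal the original ones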

>>> np.asarray(df) == df.values
array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]])
>>> np.asarray(df) == df
    one   two  three  four
0  True  True   True  True
1  True  True   True  True
2  True  True   True  True
3  True  True   True  True
4  True  True   True  True
5  True  True   True  True
6  True  True   True  True
7  True  True   True  True
8  True  True   True  True
9  True  True   True  True
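
The copy behaviour described for np.array and np.asarray can be observed directly on a plain ndarray. A minimal sketch with illustrative data; the last lines show DataFrame.to_numpy() as another way to obtain the values.

```python
import numpy as np
import pandas as pd

a = np.arange(4.0)

same = np.asarray(a)   # already an ndarray: returned as-is, no copy
copy = np.array(a)     # copies by default

print(same is a)                   # True
print(copy is a)                   # False
print(np.shares_memory(copy, a))   # False

# For a DataFrame, to_numpy() also returns the values as an ndarray.
df = pd.DataFrame({'one': [1.0, 2.0], 'two': [3.0, 4.0]})
values = df.to_numpy()
print(type(values), values.shape)
```

Because copy owns its memory, writing to it leaves a unchanged, while writing through same would modify a.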

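6. Tab completion

Tab completion is one of the main improvements of interactive shells such as IPython over the standard Python shell.
While typing an expression in the shell, pressing Tab lists every variable in the current namespace that matches the characters typed so far.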

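III. Panel

A Panel is a three-dimensional labeled array. The name pandas is in fact derived from Panel (panel data), i.e. pan(el)-da(ta)-s. Panel was rarely used; it was deprecated in pandas 0.20 and removed in 0.25, so the examples below only run on older versions.

- items: axis 0; each item is a DataFrame
- major_axis: axis 1; the row labels of each DataFrame
- minor_axis: axis 2; the column labels of each DataFrame

>>> # data is not defined in the original post; this definition is assumed from the shapes shown below
>>> data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
...         'Item2': pd.DataFrame(np.random.randn(4, 2))}
>>> pn = pd.Panel(data)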
sys:1: FutureWarning:
Panel is deprecated and will be removed in a future version.
The recommended way to represent these types of 3-dimensional data are with a MultiIndex on a DataFrame, via the Panel.to_frame() method
Alternatively, you can use the xarray package http://xarray.pydata.org/en/stable/.
Pandas provides a `.to_xarray()` method to help automate this conversion.
>>> pn
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2

View the 'Item1' data

>>> pn['Item1']
          0         1         2
0  1.292038  0.526691 -0.632993
1 -0.400069  0.735345 -0.090232
2  1.912338 -1.056740 -0.140426
3  0.718229 -0.862939 -1.376745

Inspect the axes of pn

>>> pn.items
Index(['Item1', 'Item2'], dtype='object')

>>> pn.major_axis
RangeIndex(start=0, stop=4, step=1)
>>> pn.minor_axis
RangeIndex(start=0, stop=3, step=1)

Method calls

>>> pn.major_xs(pn.major_axis[0])
      Item1     Item2
0  1.292038 -0.072927
1  0.526691  1.713952
2 -0.632993       NaN
>>> pn.minor_xs(pn.minor_axis[1])
      Item1     Item2
0  0.526691  1.713952
1  0.735345  0.062300
2 -1.056740 -0.458656
3 -0.862939  0.759974
>>> pn.to_frame()
                Item1     Item2
major minor
0     0      1.292038 -0.072927
      1      0.526691  1.713952
1     0     -0.400069  1.336408
      1      0.735345  0.062300
2     0      1.912338  1.121212
      1     -1.056740 -0.458656
3     0      0.718229 -0.687525
      1     -0.862939  0.759974
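
The deprecation warning above recommends a MultiIndex on a DataFrame for three-dimensional data. The following is one possible sketch of that approach using pd.concat; the frames dict and the level names item and major are assumptions chosen to mirror the Panel example, not part of the original post.

```python
import numpy as np
import pandas as pd

# Two frames of different widths, mirroring Item1 (4 x 3) and Item2 (4 x 2).
frames = {'Item1': pd.DataFrame(np.arange(12.0).reshape(4, 3)),
          'Item2': pd.DataFrame(np.arange(8.0).reshape(4, 2))}

# Concatenating a dict stacks the frames and uses the keys as the outer index level.
stacked = pd.concat(frames, names=['item', 'major'])

item1 = stacked.loc['Item1']            # comparable to pn['Item1']
row0 = stacked.xs(0, level='major')     # row 0 of every item

print(stacked)
print(item1)
print(row0)
```

Unlike Panel.to_frame(), which put items in columns, this layout keeps items in the row index; column 2 is NaN for Item2 because that frame only has two columns.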