A Series is a one-dimensional labeled array that can hold any kind of data (integers, floats, strings, Python objects). The basic way to create one is:
s = pd.Series(data, index=index)
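Here index is a list of labels for the data, and data can be one of several types, such as an ndarray, a dict, or a scalar. The first example passes an ndarray: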
>>> s=pd.Series(np.random.randn(5),index=['a','b','c','d','e'])
>>> s
a -0.485521
b -0.286831
c 1.292780
d -0.625325
e -0.936284
dtype: float64
>>> s.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
Note that the S in Series must be capitalized. If no index is passed, pandas creates a default integer index:
>>> s=pd.Series(np.random.randn(5))
>>> s
0 -1.657662
1 0.149248
2 1.728224
3 0.058451
4 0.345831
dtype: float64
>>> s.index
RangeIndex(start=0, stop=5, step=1)
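The two cases above can be sketched with fixed values instead of random ones. The variable names here are illustrative and not from the original; the sketch also shows that an index of the wrong length is rejected.

```python
import numpy as np
import pandas as pd

values = np.array([10.0, 20.0, 30.0])  # illustrative data, not from the original

# Explicit labels: one label per element.
labeled = pd.Series(values, index=['a', 'b', 'c'])

# No index given: pandas falls back to the integers 0..n-1.
default = pd.Series(values)

# An index whose length differs from the data is rejected with a ValueError.
try:
    pd.Series(values, index=['a', 'b'])
    length_error = None
except ValueError as exc:
    length_error = exc

print(labeled.index.tolist())
print(default.index.tolist())
print(type(length_error).__name__)
```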
1.2 Creating from a dict
Create a dict d and convert it directly into a Series:
>>> d = {'a' : 0., 'b' : 1., 'd' : 3}
>>> s=pd.Series(d)
>>> s
a 0.0
b 1.0
d 3.0
dtype: float64
With a custom index, any label that has no matching key in the dict is assigned NaN:
>>> d = {'a' : 0., 'b' : 1., 'd' : 3}
>>> s=pd.Series(d,index=list('absd'))
>>> s
a 0.0
b 1.0
s NaN
d 3.0
dtype: float64
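A minimal sketch of the same idea, assuming you want to find out afterwards which requested labels were missing from the dict. The names missing and present are illustrative.

```python
import pandas as pd

d = {'a': 0.0, 'b': 1.0, 'd': 3}

# Request labels in a custom order; 's' is not a key in d.
s = pd.Series(d, index=['d', 's', 'a'])

missing = s[s.isna()].index.tolist()  # labels that had no matching key
present = s.dropna()                  # only the labels that were found in d

print(missing)
print(present)
```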
1.3 Creating from a scalar
A scalar is repeated for every label in the index:
>>> s=pd.Series(3,index=range(5))
>>> s
0 3
1 3
2 3
3 3
4 3
dtype: int64
2. Series objects
2.1 Series is ndarray-like
A Series supports numpy-style indexing and slicing, and it can also be passed to numpy functions.
>>> s = pd.Series(np.random.randn(5))
>>> s
0 -0.104885
1 0.375955
2 1.305717
3 0.441162
4 -0.598452
dtype: float64
>>> s[0]
-0.10488490668673565
>>> s[3:]
3 0.441162
4 -0.598452
dtype: float64
>>> np.exp(s)
0 0.900428
1 1.456382
2 3.690336
3 1.554513
4 0.549662
dtype: float64
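The ndarray-like behaviour can be sketched with fixed values. This sketch uses .iloc to make positional access explicit, because plain s[0] on a default integer index (as in the session above) is really a label lookup. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0, 16.0], index=['a', 'b', 'c', 'd'])

first = s.iloc[0]     # positional access, like arr[0]
tail = s.iloc[2:]     # positional slice; the labels 'c' and 'd' are kept
large = s[s > 3]      # boolean mask, like arr[arr > 3]
roots = np.sqrt(s)    # a numpy ufunc returns a Series with the same index

print(first)
print(tail)
print(large)
print(roots)
```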
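2.2 Series is dict-like
Values can be read and written by label, membership is tested with in, and get() returns a default for a missing label: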
>>> s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
>>> s
a 0.184751
b -0.006316
c -1.113671
d -2.804318
e 1.493505
dtype: float64
>>> s['a']
0.18475101331017024
>>> s['e']=3
>>> s
a 0.184751
b -0.006316
c -1.113671
d -2.804318
e 3.000000
dtype: float64
>>> s['g'] = 100
>>> s
a 0.184751
b -0.006316
c -1.113671
d -2.804318
e 3.000000
g 100.000000
dtype: float64
>>> 'e' in s
True
>>> print( s.get('f'))
None
>>> print( s.get('f', np.nan))
nan
>>> print( s.get('f', 5))
5
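As a small sketch of the difference between [] and get() for a missing label (the label 'z' and the default -1.0 are arbitrary choices for illustration):

```python
import pandas as pd

s = pd.Series([1.0, 2.0], index=['a', 'b'])

s['b'] = 20.0   # overwrite an existing label
s['c'] = 30.0   # assigning to a new label appends it

has_c = 'c' in s              # membership tests look at the index labels
fallback = s.get('z', -1.0)   # get() returns the default instead of raising

try:
    s['z']                    # a plain [] lookup of a missing label raises KeyError
    raised = False
except KeyError:
    raised = True

print(s.tolist(), has_c, fallback, raised)
```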
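3. Label alignment
Arithmetic between two Series aligns them on their labels; a label that appears in only one of them yields NaN.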
>>> s1 = pd.Series(np.random.randn(3), index=['a', 'c', 'e'])
>>> s2 = pd.Series(np.random.randn(3), index=['a', 'd', 'e'])
>>> print('{0}\n\n{1}'.format(s1, s2))
a -0.123366
c -0.434903
e -1.064005
dtype: float64
a 0.784026
d -1.846238
e -1.247743
dtype: float64
>>> s1 + s2
a 0.660660
c NaN
d NaN
e -2.311748
dtype: float64
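The alignment rule is easier to verify with round numbers. The sketch below also uses Series.add with fill_value, which is not covered in the original, as one possible way to avoid NaN when a label exists on only one side.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'c', 'e'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['a', 'd', 'e'])

total = s1 + s2                     # union of labels; unmatched labels become NaN
filled = s1.add(s2, fill_value=0)   # treat a missing operand as 0 instead

print(total)
print(filled)
```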
4. The name attribute
>>> s = pd.Series(np.random.randn(5), name='Some Thing')
>>> s
0 -0.025971
1 1.427484
2 0.684746
3 0.928511
4 0.097620
Name: Some Thing, dtype: float64
>>> s.name
'Some Thing'
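One place where the name matters, shown here as a brief sketch with made-up names: when a Series is converted to a DataFrame, the name becomes the column label. rename() with a scalar returns a renamed copy and leaves the original untouched.

```python
import pandas as pd

s = pd.Series([1, 2, 3], name='price')   # 'price' is an illustrative name

frame = s.to_frame()        # the Series name becomes the column label
renamed = s.rename('cost')  # returns a new Series; s keeps its name

print(frame.columns.tolist(), s.name, renamed.name)
```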
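II. DataFrame
A DataFrame is a two-dimensional array with both row labels and column labels. You can think of it as an Excel spreadsheet or a SQL table, or as a dict of Series objects. It is the most commonly used data structure in pandas.
The basic form for creating a DataFrame is: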
df = pd.DataFrame(data, index=index, columns=columns)
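Here index holds the row labels, columns holds the column labels, and data can take any of the forms shown below, starting with a dict of Series: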
>>> d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
... 'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
>>> d
{'one': a 1
b 2
c 3
dtype: int64, 'two': a 1
b 2
c 3
d 4
dtype: int64}
>>> pd.DataFrame(d)
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
When row and column labels are given explicitly, entries with no matching value show up as NaN:
>>> pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
two three
d 4 NaN
b 2 NaN
a 1 NaN
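The session above can be checked programmatically. This sketch reuses the same dict of Series and inspects the effect of the missing value; the variable subset is an illustrative name.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)   # row labels are the union of both Series indexes

# 'one' has no value for row 'd', so that cell is NaN and the column becomes float.
one_dtype = df['one'].dtype
two_kind = df['two'].dtype.kind   # 'i' means the column stayed integer

# Explicit labels select and reorder rows; an unknown column is all NaN.
subset = pd.DataFrame(d, index=['d', 'b'], columns=['two', 'three'])

print(df)
print(subset)
```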
1.2 Creating from structured data
Here the data is a list of tuples, one tuple per row:
>>> data = [(1, 2.2, 'Hello'), (2, 3., "World")]
>>> data
[(1, 2.2, 'Hello'), (2, 3.0, 'World')]
>>> pd.DataFrame(data)
0 1 2
0 1 2.2 Hello
1 2 3.0 World
>>> pd.DataFrame(data, index=['first', 'second'], columns=['A', 'B', 'C'])
A B C
first 1 2.2 Hello
second 2 3.0 World
1.3 Creating from a list of dicts
Each dict becomes one row, and keys missing from a dict become NaN:
>>> data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
>>> pd.DataFrame(data)
a b c
0 1 2 NaN
1 5 10 20.0
>>> pd.DataFrame(data,index=['first','second'], columns=['a', 'e'])
a e
first 1 NaN
second 5 NaN
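1.4 Creating from a dict of tuples
This is mainly useful for understanding how construction works. In practice, you would first clean the data into a format that is easy for pandas to import and easy to read, and only then turn it into more complex structures with reindex, groupby, and similar methods.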
>>> d = {('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
... ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
... ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
... ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
... ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}}
>>> d
{('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2}, ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4}, ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6}, ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8}, ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}}
# multi-level (hierarchical) labels
>>> pd.DataFrame(d)
a b
b a c a b
A B 1.0 4.0 5.0 8.0 10.0
C 2.0 3.0 6.0 7.0 NaN
D NaN NaN NaN NaN 9.0
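A shortened version of the dict above can be used to confirm that tuple keys produce a MultiIndex on both axes. This is a sketch with only two of the original columns.

```python
import pandas as pd

d = {('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
     ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4}}

df = pd.DataFrame(d)

# Outer dict keys become two-level column labels,
# inner dict keys become two-level row labels.
col_levels = df.columns.nlevels
row_levels = df.index.nlevels

# A single cell is addressed by a (row tuple, column tuple) pair.
cell = df.loc[('A', 'B'), ('a', 'a')]

print(col_levels, row_levels, cell)
```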
1.5 Creating from a Series
>>> s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
>>> pd.DataFrame(s,columns=['A'])
A
a 0.748728
b -0.119084
c 0.328340
d -1.707235
e 0.205882
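2. Selecting, adding, and deleting columns
2.1 Selecting columns
A column is selected with df[col] and can be used in expressions; assigning to df[col] overwrites an existing column or adds a new one: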
>>> df = pd.DataFrame(np.random.randn(6, 4), columns=['one', 'two', 'three', 'four'])
>>> df
one two three four
0 0.486625 0.094514 0.733189 -1.137290
1 0.155623 -0.610077 0.424488 0.103686
2 -1.747658 -0.618322 -1.070768 1.638107
3 -0.761408 -0.353779 1.363916 1.663116
4 0.012482 0.385496 0.283480 0.716104
5 0.784946 -0.568144 1.411448 0.187921
>>> df['three'] = df['one'] + df['two']
>>> df
one two three four
0 0.486625 0.094514 0.581139 -1.137290
1 0.155623 -0.610077 -0.454453 0.103686
2 -1.747658 -0.618322 -2.365981 1.638107
3 -0.761408 -0.353779 -1.115188 1.663116
4 0.012482 0.385496 0.397978 0.716104
5 0.784946 -0.568144 0.216803 0.187921
>>> df['flag'] = df['one'] > 0
>>> df
one two three four flag
0 0.486625 0.094514 0.581139 -1.137290 True
1 0.155623 -0.610077 -0.454453 0.103686 True
2 -1.747658 -0.618322 -2.365981 1.638107 False
3 -0.761408 -0.353779 -1.115188 1.663116 False
4 0.012482 0.385496 0.397978 0.716104 True
5 0.784946 -0.568144 0.216803 0.187921 True
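A compact sketch of the same operations with fixed values. It also shows selecting several columns with a list, which the session above does not cover; the data is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, -2.0, 3.0], 'two': [0.5, 0.5, 0.5]})

col = df['one']            # a single label returns a Series
sub = df[['one', 'two']]   # a list of labels returns a DataFrame

df['three'] = df['one'] + df['two']   # column built from element-wise arithmetic
df['flag'] = df['one'] > 0            # column built from a comparison

print(df)
```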
2.2 Deleting columns
Using the del statement:
>>> del df['three']
>>> df
one two four flag
0 0.486625 0.094514 -1.137290 True
1 0.155623 -0.610077 0.103686 True
2 -1.747658 -0.618322 1.638107 False
3 -0.761408 -0.353779 1.663116 False
4 0.012482 0.385496 0.716104 True
5 0.784946 -0.568144 0.187921 True
Using the pop method, which removes the column and returns it as a Series:
>>> four = df.pop('four')
>>> four
0 -1.137290
1 0.103686
2 1.638107
3 1.663116
4 0.716104
5 0.187921
Name: four, dtype: float64
>>> df
one two flag
0 0.486625 0.094514 True
1 0.155623 -0.610077 True
2 -1.747658 -0.618322 False
3 -0.761408 -0.353779 False
4 0.012482 0.385496 True
5 0.784946 -0.568144 True
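Both del and pop modify the DataFrame in place. As a hedged aside, DataFrame.drop (not used in the original) is one way to get a copy without a column while leaving the source untouched. The sketch below contrasts the three.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2], 'two': [3, 4], 'three': [5, 6]})

trimmed = df.drop(columns=['three'])   # returns a new frame; df still has 3 columns
cols_after_drop = df.columns.tolist()

popped = df.pop('two')   # removes 'two' from df and returns it
del df['three']          # removes 'three' from df and returns nothing

print(trimmed.columns.tolist(), cols_after_drop, df.columns.tolist())
```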
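2.3 Inserting columns
Assigning a scalar fills the whole column. Assigning a Series aligns it on the row labels, so rows without a matching label get NaN: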
>>> df['five'] = 5
>>> df
one two flag five
0 0.486625 0.094514 True 5
1 0.155623 -0.610077 True 5
2 -1.747658 -0.618322 False 5
3 -0.761408 -0.353779 False 5
4 0.012482 0.385496 True 5
5 0.784946 -0.568144 True 5
>>> df['one_trunc'] = df['one'][:2]
>>> df
one two flag five one_trunc
0 0.486625 0.094514 True 5 0.486625
1 0.155623 -0.610077 True 5 0.155623
2 -1.747658 -0.618322 False 5 NaN
3 -0.761408 -0.353779 False 5 NaN
4 0.012482 0.385496 True 5 NaN
5 0.784946 -0.568144 True 5 NaN
Use the insert method to place a new column at a specific position:
>>> df.insert(1, 'bar', df['one'])
>>> df
one bar two flag five one_trunc
0 0.486625 0.486625 0.094514 True 5 0.486625
1 0.155623 0.155623 -0.610077 True 5 0.155623
2 -1.747658 -1.747658 -0.618322 False 5 NaN
3 -0.761408 -0.761408 -0.353779 False 5 NaN
4 0.012482 0.012482 0.385496 True 5 NaN
5 0.784946 0.784946 -0.568144 True 5 NaN
Use the assign() method to add new columns; it returns a new DataFrame and leaves the original unchanged:
>>> df = pd.DataFrame(np.random.randint(1, 5, (6, 4)), columns=list('ABCD'))
>>> df
A B C D
0 2 2 4 1
1 2 4 3 1
2 3 1 3 2
3 3 2 4 1
4 2 4 3 2
5 3 4 4 3
Add a new column whose values are column A divided by column B:
>>> df.assign(Ratio = df['A'] / df['B'])
A B C D Ratio
0 2 2 4 1 1.00
1 2 4 3 1 0.50
2 3 1 3 2 3.00
3 3 2 4 1 1.50
4 2 4 3 2 0.50
5 3 4 4 3 0.75
Add new columns by passing functions (here lambdas) that receive the DataFrame:
>>> df.assign(AB_Ratio = lambda x: x.A / x.B, CD_Ratio = lambda x: x.C - x.D)
A B C D AB_Ratio CD_Ratio
0 2 2 4 1 1.00 3
1 2 4 3 1 0.50 2
2 3 1 3 2 3.00 1
3 3 2 4 1 1.50 3
4 2 4 3 2 0.50 1
5 3 4 4 3 0.75 1
>>> df.assign(AB_Ratio = lambda x: x.A / x.B).assign(ABD_Ratio = lambda x: x.AB_Ratio * x.D)
A B C D AB_Ratio ABD_Ratio
0 2 2 4 1 1.00 1.00
1 2 4 3 1 0.50 0.50
2 3 1 3 2 3.00 6.00
3 3 2 4 1 1.50 1.50
4 2 4 3 2 0.50 1.00
5 3 4 4 3 0.75 2.25
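The key property of assign() can be sketched as follows. The column names are illustrative, and the second keyword referring to the first assumes a reasonably recent pandas (0.23 or later on Python 3.6+); chaining two assign() calls, as in the session above, works regardless.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 3], 'B': [4, 1]})

# Each lambda receives the frame as built so far,
# so 'doubled' can use the 'ratio' column defined just before it.
out = df.assign(ratio=lambda x: x.A / x.B,
                doubled=lambda x: x.ratio * 2)

print(df.columns.tolist())    # the original frame is unchanged
print(out)
```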
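3. Indexing and selection
Operations, their syntax, and what they return:
Select a column -> df[col] -> Series
Select a row by label -> df.loc[label] -> Series
Select a row by integer position -> df.iloc[loc] -> Series
Select multiple rows by slicing -> df[5:10] -> DataFrame
Select rows with a boolean vector -> df[bool_vector] -> DataFrame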
>>> df = pd.DataFrame(np.random.randint(1, 10, (6, 4)), index=list('abcdef'), columns=list('ABCD'))
>>> df
A B C D
a 2 8 8 2
b 9 2 8 2
c 7 5 1 2
d 8 3 4 2
e 2 1 2 4
f 8 2 7 3
>>> df['B']
a 8
b 2
c 5
d 3
e 1
f 2
Name: B, dtype: int32
>>> df.loc['B']
KeyError: 'B'
>>> df.loc['b']
A 9
B 2
C 8
D 2
Name: b, dtype: int32
>>> df.iloc[0]
A 2
B 8
C 8
D 2
Name: a, dtype: int32
>>> df[1:4]
A B C D
b 9 2 8 2
c 7 5 1 2
d 8 3 4 2
# show the rows at the positions where the mask is True
>>> df[[False, True, True, False, True, False]]
A B C D
b 9 2 8 2
c 7 5 1 2
e 2 1 2 4
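The selection rules above can be sketched on a small frame with fixed values. One detail worth noting, which the session does not show, is that a label slice with .loc includes its end label while a positional slice with .iloc excludes its end position.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 9, 7], 'B': [8, 2, 5]}, index=list('abc'))

by_label = df.loc['b']    # row selected by its label
by_pos = df.iloc[1]       # the same row selected by integer position

pos_slice = df.iloc[1:3]        # positions 1 and 2; the end is exclusive
label_slice = df.loc['a':'b']   # labels 'a' through 'b'; the end is inclusive

mask = df['A'] > 5        # boolean Series aligned on the row labels
selected = df[mask]       # keeps the rows where the mask is True

print(by_label)
print(selected)
```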
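4. Data alignment
When computing with DataFrames, pandas automatically aligns the data on both row and column labels. The result carries the union of the two DataFrames' row and column labels.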
>>> df1 = pd.DataFrame(np.random.randn(10, 4), index=list('abcdefghij'), columns=['A', 'B', 'C', 'D'])
>>> df1
A B C D
a -1.862886 -1.547650 0.637708 0.350643
b -0.421221 -1.479398 -0.480860 0.166336
c -0.010406 -0.849795 0.034272 -0.589808
d 0.450138 0.391159 0.914933 0.530649
e 1.036746 0.097552 0.914027 0.570200
f -0.215569 0.461338 0.831485 0.816958
g 0.823373 0.656957 -0.243091 -0.469380
h -0.946946 0.017144 -0.647669 -1.496623
i -1.533835 1.253698 -0.340709 -0.113551
j -0.132444 1.058355 0.038903 -0.072712
>>> df2 = pd.DataFrame(np.random.randn(7, 3), index=list('cdefghi'), columns=['A', 'B', 'C'])
>>> df2
A B C
c -1.391986 -0.219589 -1.144956
d 0.588511 0.567815 0.545037
e 1.981807 0.274164 -0.895879
f 0.209802 0.031883 0.139088
g -0.338254 1.317608 0.156630
h -0.097541 0.312342 -0.217281
i 0.687546 -0.631277 0.577067
In df1 + df2, cells whose row and column labels exist in both frames are added; all other cells are NaN:
>>> df1 + df2
A B C D
a NaN NaN NaN NaN
b NaN NaN NaN NaN
c -1.402392 -1.069384 -1.110684 NaN
d 1.038649 0.958975 1.459970 NaN
e 3.018553 0.371716 0.018148 NaN
f -0.005767 0.493221 0.970574 NaN
g 0.485119 1.974565 -0.086460 NaN
h -1.044486 0.329486 -0.864950 NaN
i -0.846289 0.622422 0.236357 NaN
j NaN NaN NaN NaN
>>> df1 - df1.iloc[0]
A B C D
a 0.000000 0.000000 0.000000 0.000000
b 1.441665 0.068252 -1.118567 -0.184308
c 1.852480 0.697855 -0.603436 -0.940452
d 2.313024 1.938809 0.277226 0.180006
e 2.899632 1.645202 0.276319 0.219557
f 1.647317 2.008988 0.193778 0.466314
g 2.686259 2.204607 -0.880798 -0.820024
h 0.915940 1.564794 -1.285376 -1.847267
i 0.329051 2.801349 -0.978417 -0.464194
j 1.730442 2.606005 -0.598804 -0.423355
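Both behaviours above (frame-to-frame alignment, and subtracting a row from every row) can be verified with small fixed frames. DataFrame.add with fill_value is an addition not found in the original, shown as one possible way to treat a one-sided missing cell as 0.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]}, index=['a', 'b'])
df2 = pd.DataFrame({'A': [10.0, 20.0]}, index=['b', 'c'])

total = df1 + df2                     # only cell (b, A) exists in both frames
filled = df1.add(df2, fill_value=0)   # a cell missing on one side counts as 0;
                                      # a cell missing on both sides stays NaN

# A Series taken from a row is matched against the column labels
# and subtracted from every row.
centered = df1 - df1.iloc[0]

print(total)
print(filled)
print(centered)
```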
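5. Using numpy functions
The core pandas data structures interoperate with numpy: element-wise numpy functions can be applied to a DataFrame directly, and a DataFrame can be converted to an ndarray.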
>>> df = pd.DataFrame(np.random.randn(10, 4), columns=['one', 'two', 'three', 'four'])
>>> df
one two three four
0 1.800023 -0.550830 -1.115527 1.283088
1 0.005457 -0.205792 1.406842 0.253727
2 1.658374 0.220637 0.349239 0.178845
3 -0.087544 0.262716 -0.822376 1.076153
4 0.942431 -1.170636 0.637203 -1.443319
5 0.165776 0.118799 1.792991 -0.923901
6 0.107792 -0.595107 0.090514 0.178640
7 -0.288757 0.414845 0.074528 -2.418104
8 0.082551 -0.935000 0.017684 -0.990776
9 -0.722961 0.816024 -1.634607 -0.774388
Compute the exponential function with base e:
>>> np.exp(df)
one two three four
0 6.049785 0.576471 0.327742 3.607763
1 1.005472 0.814002 4.083041 1.288820
2 5.250764 1.246871 1.417989 1.195835
3 0.916178 1.300457 0.439387 2.933374
4 2.566213 0.310170 1.891183 0.236143
5 1.180309 1.126143 6.007395 0.396968
6 1.113816 0.551504 1.094737 1.195590
7 0.749195 1.514136 1.077376 0.089090
8 1.086054 0.392586 1.017842 0.371288
9 0.485313 2.261491 0.195029 0.460986
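Both np.array and np.asarray convert structured data into an ndarray. The main difference is that when the source is already an ndarray, np.array still makes a copy by default and uses new memory, whereas np.asarray does not.
>>> type(np.asarray(df))
<class 'numpy.ndarray'>
Check whether the converted values equal the original ones: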
>>> np.asarray(df) == df.values
array([[ True, True, True, True],
[ True, True, True, True],
[ True, True, True, True],
[ True, True, True, True],
[ True, True, True, True],
[ True, True, True, True],
[ True, True, True, True],
[ True, True, True, True],
[ True, True, True, True],
[ True, True, True, True]])
>>> np.asarray(df) == df
one two three four
0 True True True True
1 True True True True
2 True True True True
3 True True True True
4 True True True True
5 True True True True
6 True True True True
7 True True True True
8 True True True True
9 True True True True
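The copy behaviour described above can be checked directly on an ndarray. The sketch also uses DataFrame.to_numpy(), which is not in the original, as another way to obtain the underlying values.

```python
import numpy as np
import pandas as pd

arr = np.arange(6.0).reshape(2, 3)

same = np.asarray(arr)   # already an ndarray: the very same object comes back
copy = np.array(arr)     # np.array copies by default

is_same_object = same is arr
shares = np.shares_memory(copy, arr)   # False, because copy owns its data

df = pd.DataFrame(arr, columns=list('xyz'))
as_np = df.to_numpy()    # plain ndarray holding the frame's values

print(is_same_object, shares, type(as_np).__name__)
```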
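6. Tab completion
Tab completion is one of the main improvements over the standard Python shell (for example in IPython).
While you type an expression in the shell, pressing Tab lists every variable in the current namespace that matches the characters typed so far.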
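III. Panel
A Panel is a three-dimensional labeled array. In fact, the name pandas derives from Panel: pan(el)-da(ta)-s. Panel was rarely used, although it was historically one of the basic data structures. It is deprecated, as the FutureWarning in the session below shows, and it was removed in pandas 0.25. A Panel has three axes: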
items: axis 0; each element it indexes is a DataFrame
major_axis: axis 1; the row labels of each DataFrame
minor_axis: axis 2; the column labels of each DataFrame
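>>> # data is not defined in the original; a dict of DataFrames with these shapes matches the output below
>>> data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
...         'Item2': pd.DataFrame(np.random.randn(4, 2))}
>>> pn = pd.Panel(data)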
sys:1: FutureWarning:
Panel is deprecated and will be removed in a future version.
The recommended way to represent these types of 3-dimensional data are with a MultiIndex on a DataFrame, via the Panel.to_frame() method
Alternatively, you can use the xarray package http://xarray.pydata.org/en/stable/.
Pandas provides a `.to_xarray()` method to help automate this conversion.
>>> pn
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
View the 'Item1' data:
>>> pn['Item1']
0 1 2
0 1.292038 0.526691 -0.632993
1 -0.400069 0.735345 -0.090232
2 1.912338 -1.056740 -0.140426
3 0.718229 -0.862939 -1.376745
Inspect the axes of pn:
>>> pn.items
Index(['Item1', 'Item2'], dtype='object')
>>> pn.major_axis
RangeIndex(start=0, stop=4, step=1)
>>> pn.minor_axis
RangeIndex(start=0, stop=3, step=1)
Method calls:
>>> pn.major_xs(pn.major_axis[0])
Item1 Item2
0 1.292038 -0.072927
1 0.526691 1.713952
2 -0.632993 NaN
>>> pn.minor_xs(pn.minor_axis[1])
Item1 Item2
0 0.526691 1.713952
1 0.735345 0.062300
2 -1.056740 -0.458656
3 -0.862939 0.759974
>>> pn.to_frame()
Item1 Item2
major minor
0 0 1.292038 -0.072927
1 0.526691 1.713952
1 0 -0.400069 1.336408
1 0.735345 0.062300
2 0 1.912338 1.121212
1 -1.056740 -0.458656
3 0 0.718229 -0.687525
1 -0.862939 0.759974
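Since the FutureWarning above recommends a MultiIndex on a DataFrame for three-dimensional data, here is a minimal sketch of that approach using pd.concat. The level names 'item' and 'major' and the fixed values are assumptions made for illustration; this is one possible layout, not the exact output of Panel.to_frame().

```python
import numpy as np
import pandas as pd

# Two frames with the same shapes as in the Panel example (4 x 3 and 4 x 2).
frames = {'Item1': pd.DataFrame(np.arange(12.0).reshape(4, 3)),
          'Item2': pd.DataFrame(np.arange(8.0).reshape(4, 2))}

# Stack the frames along the rows; the dict keys become the outer index level.
stacked = pd.concat(frames, names=['item', 'major'])

# Selecting on the outer level recovers the rows that belonged to one item.
item1 = stacked.loc['Item1']
item2 = stacked.loc['Item2']   # column 2 is all NaN here, as Item2 had two columns

print(stacked)
```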
muguangjingkong