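pandas.cut(x, bins, right=True, labels=None)
x: the data to bin
bins: the number of bins, or a sequence of bin edges
labels: labels for the resulting categories
right: whether each interval includes its right edge
import pandas as pd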
import numpy as np
import os
os.getcwd()
'D:\\Jupyter\\notebook\\Python数据清洗实战\\数据'
os.chdir('D:\\Jupyter\\notebook\\Python数据清洗实战\\数据')
df = pd.read_csv('MotorcycleData.csv', encoding='gbk', na_values='Na')
def f(x):
    if '$' in str(x):
        x = str(x).strip('$')
        x = str(x).replace(',', '')
    else:
        x = str(x).replace(',', '')
    return float(x)
df['Price'] = df['Price'].apply(f)
df['Mileage'] = df['Mileage'].apply(f)
df.head(5)
| | Condition | Condition_Desc | Price | Location | Model_Year | Mileage | Exterior_Color | Make | Warranty | Model | ... | Vehicle_Title | OBO | Feedback_Perc | Watch_Count | N_Reviews | Seller_Status | Vehicle_Tile | Auction | Buy_Now | Bid_Count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Used | mint!!! very low miles | 11412.0 | McHenry, Illinois, United States | 2013.0 | 16000.0 | Black | Harley-Davidson | Unspecified | Touring | ... | NaN | FALSE | 8.1 | NaN | 2427 | Private Seller | Clear | True | FALSE | 28.0 |
1 | Used | Perfect condition | 17200.0 | Fort Recovery, Ohio, United States | 2016.0 | 60.0 | Black | Harley-Davidson | Vehicle has an existing warranty | Touring | ... | NaN | FALSE | 100 | 17 | 657 | Private Seller | Clear | True | TRUE | 0.0 |
2 | Used | NaN | 3872.0 | Chicago, Illinois, United States | 1970.0 | 25763.0 | Silver/Blue | BMW | Vehicle does NOT have an existing warranty | R-Series | ... | NaN | FALSE | 100 | NaN | 136 | NaN | Clear | True | FALSE | 26.0 |
3 | Used | CLEAN TITLE READY TO RIDE HOME | 6575.0 | Green Bay, Wisconsin, United States | 2009.0 | 33142.0 | Red | Harley-Davidson | NaN | Touring | ... | NaN | FALSE | 100 | NaN | 2920 | Dealer | Clear | True | FALSE | 11.0 |
4 | Used | NaN | 10000.0 | West Bend, Wisconsin, United States | 2012.0 | 17800.0 | Blue | Harley-Davidson | NO WARRANTY | Touring | ... | NaN | FALSE | 100 | 13 | 271 | OWNER | Clear | True | TRUE | 0.0 |
5 rows × 22 columns
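Before binning the real prices, here is a minimal sketch of how `bins`, `right`, and `labels` interact in `pd.cut`. The toy prices, the edges, and the label names `low`/`mid`/`high` are made up for illustration and do not come from MotorcycleData.csv.

```python
import pandas as pd

# Toy prices (hypothetical values, not from MotorcycleData.csv)
prices = pd.Series([100, 500, 1000, 5000])
edges = [0, 500, 1000, 5000]
names = ['low', 'mid', 'high']  # hypothetical labels, one per interval

# right=True (default): intervals are (0, 500], (500, 1000], (1000, 5000]
right_closed = pd.cut(prices, bins=edges, labels=names)

# right=False: intervals are [0, 500), [500, 1000), [1000, 5000)
left_closed = pd.cut(prices, bins=edges, right=False, labels=names)

print(right_closed.tolist())  # ['low', 'low', 'mid', 'high']
print(left_closed.tolist())   # ['low', 'mid', 'high', nan]
```

With `right=False` the value 5000 sits exactly on the open right edge of the last interval, so it is assigned to no bin and becomes NaN.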
df['Price_bin'] = pd.cut(df['Price'], 5, labels=range(5))
# Count how many rows fall into each bin
df['Price_bin'].value_counts()
0 6762
1 659
2 50
3 20
4 2
Name: Price_bin, dtype: int64
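The counts above are very uneven because passing an integer to `pd.cut` produces equal-width intervals over the range of the data, so a few very large prices stretch the bins. One way to inspect the edges that were actually used is the `retbins=True` option. The sketch below uses made-up toy values rather than the real `Price` column.

```python
import pandas as pd

# Hypothetical skewed prices: most are small, one is much larger
prices = pd.Series([0, 10, 20, 30, 40, 100])

# retbins=True also returns the array of bin edges
binned, edges = pd.cut(prices, 5, labels=range(5), retbins=True)

# The interior edges are equally spaced between min and max (20, 40, 60, 80, 100);
# the first edge is nudged slightly below the minimum so the minimum is included.
print(edges)

# Bin index assigned to each value
print(binned.tolist())  # [0, 0, 0, 1, 1, 4]
```

Bins 2 and 3 stay empty here, which mirrors the long right tail seen in the real counts.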
%matplotlib inline
df['Price_bin'].value_counts().plot(kind='bar')
df['Price_bin'].hist()
w = [100, 1000, 5000, 10000, 20000, 100000]
df['Price_bin'] = pd.cut(df['Price'], bins=w, labels=range(5))
df[['Price', 'Price_bin']].head(5)
| | Price | Price_bin |
---|---|---|
0 | 11412.0 | 3 |
1 | 17200.0 | 3 |
2 | 3872.0 | 1 |
3 | 6575.0 | 2 |
4 | 10000.0 | 2 |
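One thing to keep in mind with hand-written edges: any value that falls outside them, or exactly on the open left edge of the first interval, is not assigned to a bin and becomes NaN. The sketch below reuses the edge list `w` from above on a few made-up prices to show this.

```python
import pandas as pd

# Hypothetical prices chosen to hit the boundaries of the custom edges
prices = pd.Series([0.0, 100.0, 3872.0, 10000.0, 250000.0])
w = [100, 1000, 5000, 10000, 20000, 100000]

binned = pd.cut(prices, bins=w, labels=range(5))

# 0.0 and 250000.0 lie outside the edges; 100.0 sits on the open left edge
print(binned.isna().tolist())      # [True, True, False, False, True]

# Category codes: -1 marks a value that was not assigned to any bin
print(binned.cat.codes.tolist())   # [-1, -1, 1, 2, -1]
```

Checking `df['Price_bin'].isna().sum()` after binning is a quick way to see whether the chosen edges cover the data.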
df['Price_bin'].hist()
# Quantile levels
k = 5
w = [1.0 * i/k for i in range(k+1)]
w
[0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
# Equal-frequency binning into 5 segments
df['Price_bin'] = pd.qcut(df['Price'], q=w, labels=range(5))
df['Price_bin'].hist()
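To see why equal-frequency binning with `pd.qcut` gives a much flatter histogram than equal-width binning with `pd.cut`, the following sketch compares the two on a small made-up series with one outlier. The values are hypothetical and only meant to illustrate the difference.

```python
import pandas as pd

# Hypothetical prices with a single large outlier
prices = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

# Equal-width: 5 intervals of the same length between min and max
equal_width = pd.cut(prices, 5, labels=range(5))

# Equal-frequency: edges are placed at the 20%, 40%, 60%, 80% quantiles
equal_freq = pd.qcut(prices, q=5, labels=range(5))

print(equal_width.value_counts().sort_index().tolist())  # [9, 0, 0, 0, 1]
print(equal_freq.value_counts().sort_index().tolist())   # [2, 2, 2, 2, 2]
```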
# Compute the quantile cut points
k = 5
w1 = df['Price'].quantile([1.0 * i/k for i in range(k+1)])
w1
0.0 0.0
0.2 3500.0
0.4 6491.0
0.6 9777.0
0.8 14999.0
1.0 100000.0
Name: Price, dtype: float64
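# The first cut point should generally be slightly below the actual minimum,
# and the last cut point slightly above the actual maximum
# (scaling leaves a minimum of 0.0 unchanged)
w1[0.0] = w1[0.0] * 0.95
w1[1.0] = w1[1.0] * 1.1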
w1
0.0 0.0
0.2 3500.0
0.4 6491.0
0.6 9777.0
0.8 14999.0
1.0 110000.0
Name: Price, dtype: float64
# Bin again using the adjusted cut points
df['Price_bin'] = pd.cut(df['Price'], bins=w1, labels=range(5))
df['Price_bin'].hist()
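Because the smallest price is 0.0, the lowest cut point printed above is still 0.0 after scaling, and with the default `right=True` the first interval is open on the left. Rows priced at exactly 0.0 would therefore end up as NaN. One possible way around this, sketched here on made-up prices with the edges copied from the output above, is the `include_lowest=True` option of `pd.cut`.

```python
import pandas as pd

# Edges copied from the adjusted cut points shown above
edges = [0.0, 3500.0, 6491.0, 9777.0, 14999.0, 110000.0]

# Hypothetical prices, including one at exactly the lowest edge
prices = pd.Series([0.0, 3500.0, 8000.0, 100000.0])

default = pd.cut(prices, bins=edges, labels=range(5))
inclusive = pd.cut(prices, bins=edges, labels=range(5), include_lowest=True)

# Category codes: -1 marks a value that was not assigned to any bin
print(default.cat.codes.tolist())    # [-1, 0, 2, 4]
print(inclusive.cat.codes.tolist())  # [0, 0, 2, 4]
```

Subtracting a small constant from the first edge instead of multiplying it would be another option when the minimum is 0.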