kaggle入门赛TOP%7:泰坦尼克号(1.数据分析,特征处理)基于百度aistudio平台

Pascall ·
更新时间:2024-09-20
· 873 次阅读

目前排名1420/19000

建模预测在下一篇帖子:
https://blog.csdn.net/QianLong_/article/details/105780130

# 查看当前挂载的数据集目录, 该目录下的变更重启环境后会自动还原 # View dataset directory. This directory will be recovered automatically after resetting environment. !ls /home/aistudio/data data31483 # 查看工作区文件, 该目录下的变更将会持久保存. 请及时清理不必要的文件, 避免加载过慢. # View personal work directory. All changes under this directory will be kept even after reset. Please clean unnecessary files in time to speed up environment loading. !ls /home/aistudio/work gender_submission.csv test.csv train.csv # 如果需要进行持久化安装, 需要使用持久化路径, 如下方代码示例: # If a persistence installation is required, you need to use the persistence path as the following: !mkdir /home/aistudio/external-libraries !pip install seaborn -t /home/aistudio/external-libraries Looking in indexes: https://pypi.mirrors.ustc.edu.cn/simple/ Collecting seaborn [?25l Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/70/bd/5e6bf595fe6ee0f257ae49336dd180768c1ed3d7c7155b2fdf894c1c808a/seaborn-0.10.0-py3-none-any.whl (215kB) [K |████████████████████████████████| 225kB 16.7MB/s eta 0:00:01 [?25hCollecting numpy>=1.13.3 (from seaborn) [?25l Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/e7/38/f14d6706ae4fa327bdb023ef40b4d902bccd314d886fac4031687a8acc74/numpy-1.18.3-cp37-cp37m-manylinux1_x86_64.whl (20.2MB) [K |████████████████████████████████| 20.2MB 21kB/s eta 0:00:0131 [?25hCollecting pandas>=0.22.0 (from seaborn) [?25l Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/4a/6a/94b219b8ea0f2d580169e85ed1edc0163743f55aaeca8a44c2e8fc1e344e/pandas-1.0.3-cp37-cp37m-manylinux1_x86_64.whl (10.0MB) [K |████████████████████████████████| 10.0MB 465kB/s eta 0:00:01 [?25hCollecting matplotlib>=2.1.2 (from seaborn) [?25l Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/b2/c2/71fcf957710f3ba1f09088b35776a799ba7dd95f7c2b195ec800933b276b/matplotlib-3.2.1-cp37-cp37m-manylinux1_x86_64.whl (12.4MB) [K |████████████████████████████████| 12.4MB 23kB/s eta 0:00:015 [?25hCollecting scipy>=1.0.1 (from seaborn) [?25l Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/dd/82/c1fe128f3526b128cfd185580ba40d01371c5d299fcf7f77968e22dfcc2e/scipy-1.4.1-cp37-cp37m-manylinux1_x86_64.whl (26.1MB) [K |████████████████████████████████| 26.1MB 119kB/s eta 0:00:01 [?25hCollecting pytz>=2017.2 (from pandas>=0.22.0->seaborn) [?25l Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/e7/f9/f0b53f88060247251bf481fa6ea62cd0d25bf1b11a87888e53ce5b7c8ad2/pytz-2019.3-py2.py3-none-any.whl (509kB) [K |████████████████████████████████| 512kB 47.9MB/s eta 0:00:01 [?25hCollecting python-dateutil>=2.6.1 (from pandas>=0.22.0->seaborn) [?25l Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/d4/70/d60450c3dd48ef87586924207ae8907090de0b306af2bce5d134d78615cb/python_dateutil-2.8.1-py2.py3-none-any.whl (227kB) [K |████████████████████████████████| 235kB 61.2MB/s eta 0:00:01 [?25hCollecting cycler>=0.10 (from matplotlib>=2.1.2->seaborn) Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/f7/d2/e07d3ebb2bd7af696440ce7e754c59dd546ffe1bbe732c8ab68b9c834e61/cycler-0.10.0-py2.py3-none-any.whl Collecting kiwisolver>=1.0.1 (from matplotlib>=2.1.2->seaborn) [?25l Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/31/b9/6202dcae729998a0ade30e80ac00f616542ef445b088ec970d407dfd41c0/kiwisolver-1.2.0-cp37-cp37m-manylinux1_x86_64.whl (88kB) [K |████████████████████████████████| 92kB 42.9MB/s eta 0:00:01 [?25hCollecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 (from matplotlib>=2.1.2->seaborn) [?25l Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/8a/bb/488841f56197b13700afd5658fc279a2025a39e22449b7cf29864669b15d/pyparsing-2.4.7-py2.py3-none-any.whl (67kB) [K |████████████████████████████████| 71kB 36.7MB/s eta 0:00:01 [?25hCollecting six>=1.5 (from python-dateutil>=2.6.1->pandas>=0.22.0->seaborn) Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/65/eb/1f97cb97bfc2390a276969c6fae16075da282f5058082d4cb10c6c5c1dba/six-1.14.0-py2.py3-none-any.whl [31mERROR: paddlepaddle 1.7.1 has requirement scipy= "3.5", but you'll have scipy 1.4.1 which is incompatible.[0m Installing collected packages: numpy, pytz, six, python-dateutil, pandas, cycler, kiwisolver, pyparsing, matplotlib, scipy, seaborn Successfully installed cycler-0.10.0 kiwisolver-1.2.0 matplotlib-3.2.1 numpy-1.18.3 pandas-1.0.3 pyparsing-2.4.7 python-dateutil-2.8.1 pytz-2019.3 scipy-1.4.1 seaborn-0.10.0 six-1.14.0 # 同时添加如下代码, 这样每次环境(kernel)启动的时候只要运行下方代码即可: # Also add the following code, so that every time the environment (kernel) starts, just run the following code: import sys sys.path.append('/home/aistudio/external-libraries')

请点击此处查看本环境基本用法.
Please click here for more detailed instructions.

项目开始 #数据解压 !unzip /home/aistudio/data/data31483/titanic.zip -d /home/aistudio/work/ Archive: /home/aistudio/data/data31483/titanic.zip inflating: /home/aistudio/work/gender_submission.csv inflating: /home/aistudio/work/test.csv inflating: /home/aistudio/work/train.csv #读取数据 import pandas as pd trainSet = pd.read_csv('work/train.csv') testSet = pd.read_csv('work/test.csv') print(trainSet.shape) trainSet.head() (891, 12)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
print(testSet.shape) testSet.head() (418, 11)
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
trainSet.info() RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): PassengerId 891 non-null int64 Survived 891 non-null int64 Pclass 891 non-null int64 Name 891 non-null object Sex 891 non-null object Age 714 non-null float64 SibSp 891 non-null int64 Parch 891 non-null int64 Ticket 891 non-null object Fare 891 non-null float64 Cabin 204 non-null object Embarked 889 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 83.6+ KB testSet.info() RangeIndex: 418 entries, 0 to 417 Data columns (total 11 columns): PassengerId 418 non-null int64 Pclass 418 non-null int64 Name 418 non-null object Sex 418 non-null object Age 332 non-null float64 SibSp 418 non-null int64 Parch 418 non-null int64 Ticket 418 non-null object Fare 417 non-null float64 Cabin 91 non-null object Embarked 418 non-null object dtypes: float64(2), int64(4), object(5) memory usage: 36.0+ KB 数据分析 trainSet.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
数据可视化分析 import matplotlib.pyplot as plt %matplotlib inline import seaborn as sns #绘制相关矩阵 ax = sns.heatmap(trainSet[["Survived","SibSp","Parch","Age","Fare"]].corr(),annot=True, cmap = "coolwarm")

在这里插入图片描述

由上图可知,sibsp 和 parch 的相关度挺高,在特征处理是可能用到

trainSet.hist(figsize=(15,10)) array([[, , ], [, , ], [, , ]], dtype=object)

在这里插入图片描述

查看各个类别的存活率 print(trainSet[["Pclass","Survived"]].groupby(["Pclass"]).mean()) g = sns.FacetGrid(trainSet,col='Survived').map(sns.distplot,'Pclass') Survived Pclass 1 0.629630 2 0.472826 3 0.242363

在这里插入图片描述

trainSet[['Sex','Survived']].groupby(['Sex']).mean()
Survived
Sex
female 0.742038
male 0.188908
print(trainSet[['SibSp','Survived']].groupby(['SibSp']).mean()) g = sns.FacetGrid(trainSet,col='Survived').map(sns.distplot,'SibSp') Survived SibSp 0 0.345395 1 0.535885 2 0.464286 3 0.250000 4 0.166667 5 0.000000 8 0.000000

在这里插入图片描述

print(trainSet[['Parch','Survived']].groupby(['Parch']).mean()) g = sns.FacetGrid(trainSet, col='Survived').map(plt.hist, 'Parch') Survived Parch 0 0.343658 1 0.550847 2 0.500000 3 0.600000 4 0.000000 5 0.200000 6 0.000000

在这里插入图片描述

print(trainSet[['Embarked','Survived']].groupby(['Embarked']).mean()) g = sns.FacetGrid(trainSet, col='Embarked').map(plt.hist, 'Survived') Survived Embarked C 0.553571 Q 0.389610 S 0.336957

在这里插入图片描述

g = sns.FacetGrid(trainSet,col='Survived').map(sns.distplot,'Age')

在这里插入图片描述

由以上存活率可以看出,这些特征对最后结果的预测还是有点影响的,婴儿和青年存活率最高

数据缺失值处理 print(trainSet.isna().sum()) print('-'*50) print(testSet.isna().sum()) PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 0 Age 177 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 687 Embarked 2 dtype: int64 -------------------------------------------------- PassengerId 0 Pclass 0 Name 0 Sex 0 Age 86 SibSp 0 Parch 0 Ticket 0 Fare 1 Cabin 327 Embarked 0 dtype: int64

由上可知,Age,Embarked,Fare是比较重要的特征,可以填补.
Cabin缺失较多,直接丢弃不使用

#对于测试数据中Fare的缺失值用所在等级的均值填补 testSet.groupby('Pclass').Fare.mean() Pclass 1 94.280297 2 22.202104 3 12.459678 Name: Fare, dtype: float64 #找到Fare缺失的一行数据 testSet[testSet.Fare.isnull().values==True]
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
152 1044 3 Storey, Mr. Thomas male 60.5 0 0 3701 NaN NaN S
#由上图可知,Pclass = 3,故用12.459678填补 testSet.Fare.fillna(12.459678,inplace=True) testSet.Fare[152] 12.459678 #对Embarked特征用众数填补 print(trainSet.Embarked.mode()) trainSet.Embarked.fillna('S',inplace=True) trainSet.Embarked.isnull().sum() 0 S dtype: object 0 print(trainSet.isna().sum()) print('-'*50) print(testSet.isna().sum()) PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 0 Age 177 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 687 Embarked 0 dtype: int64 -------------------------------------------------- PassengerId 0 Pclass 0 Name 0 Sex 0 Age 86 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 327 Embarked 0 dtype: int64

下面对Age特征进行缺失值填补,根据Mr,Miss,Mrs,Masters 等称呼可以划分不同的年龄段,根据各个称呼的年龄均值进行填补。

trainSet['Title'] = trainSet.Name.str.extract(' ([A-Za-z]+)\.', expand=False) pd.crosstab(trainSet['Title'],trainSet['Sex']) trainSet = trainSet.drop('Name',axis=1) trainSet.head()
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked Title
0 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S Mr
1 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C Mrs
2 3 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S Miss
3 4 1 1 female 35.0 1 0 113803 53.1000 C123 S Mrs
4 5 0 3 male 35.0 0 0 373450 8.0500 NaN S Mr
trainSet.Title.value_counts() Mr 517 Miss 182 Mrs 125 Master 40 Dr 7 Rev 6 Major 2 Mlle 2 Col 2 Ms 1 Lady 1 Don 1 Capt 1 Mme 1 Sir 1 Jonkheer 1 Countess 1 Name: Title, dtype: int64 import numpy as np trainSet['Title'] = np.where((trainSet.Title=='Capt') | (trainSet.Title=='Countess') | (trainSet.Title=='Don') | (trainSet.Title=='Dona') | (trainSet.Title=='Jonkheer') | (trainSet.Title=='Sir') | (trainSet.Title=='Major') | (trainSet.Title=='Rev') | (trainSet.Title=='Col'),'Other',trainSet.Title) trainSet['Title'] = trainSet['Title'].replace('Ms','Miss') trainSet['Title'] = trainSet['Title'].replace('Mlle','Miss') trainSet['Title'] = trainSet['Title'].replace('Mme','Mrs') trainSet['Title'] = trainSet['Title'].replace('Lady','Miss') trainSet.Title.value_counts() Mr 517 Miss 186 Mrs 126 Master 40 Other 15 Dr 7 Name: Title, dtype: int64 #看下不同称呼的年龄分布 sns.boxplot(data=trainSet,x='Title',y='Age')

在这里插入图片描述

#计算不同称呼均值 trainSet.groupby('Title').Age.mean() Title Dr 42.000000 Master 4.574167 Miss 22.020000 Mr 32.368090 Mrs 35.788991 Other 46.800000 Name: Age, dtype: float64 #用均值进行填补 trainSet['Age'] = np.where((trainSet.Age.isnull()) & (trainSet.Title=='Master'),5, np.where((trainSet.Age.isnull()) & (trainSet.Title=='Miss'),22, np.where((trainSet.Age.isnull()) & (trainSet.Title=='Mr'),32, np.where((trainSet.Age.isnull()) & (trainSet.Title=='Mrs'),36, np.where((trainSet.Age.isnull()) & (trainSet.Title=='Other'),47, np.where((trainSet.Age.isnull()) & (trainSet.Title=='Dr'),42,trainSet.Age)))))) trainSet.Age.isnull().sum() 0

对测试数据进行填补

testSet['Title'] = testSet.Name.str.extract(' ([A-Za-z]+)\.', expand=False) pd.crosstab(testSet['Title'],testSet['Sex']) testSet = testSet.drop('Name',axis=1) testSet.Title.value_counts() Mr 240 Miss 78 Mrs 72 Master 21 Rev 2 Col 2 Dona 1 Ms 1 Dr 1 Name: Title, dtype: int64 testSet["Title"] = np.where(((testSet["Title"]=='Rev') | (testSet["Title"]=='Col') | (testSet["Title"]=='Dona')),'Others',testSet["Title"]) testSet["Title"] = testSet["Title"].replace('Ms','Miss') testSet.Title.value_counts() Mr 240 Miss 79 Mrs 72 Master 21 Others 5 Dr 1 Name: Title, dtype: int64 testSet.groupby('Title').Age.mean() Title Dr 53.000000 Master 7.406471 Miss 21.774844 Mr 32.000000 Mrs 38.903226 Others 42.000000 Name: Age, dtype: float64 #用均值进行填补 testSet['Age'] = np.where((testSet.Age.isnull()) & (testSet.Title=='Master'),7, np.where((testSet.Age.isnull()) & (testSet.Title=='Miss'),22, np.where((testSet.Age.isnull()) & (testSet.Title=='Mr'),32, np.where((testSet.Age.isnull()) & (testSet.Title=='Mrs'),39, np.where((testSet.Age.isnull()) & (testSet.Title=='Other'),42, np.where((testSet.Age.isnull()) & (testSet.Title=='Dr'),53,testSet.Age)))))) testSet.Age.isnull().sum() 0 #缺失值处理完成 trainSet = trainSet.drop('Cabin',axis=1) testSet = testSet.drop('Cabin',axis=1) print(trainSet.isnull().sum()) print('*'*50) print(testSet.isnull().sum()) PassengerId 0 Survived 0 Pclass 0 Sex 0 Age 0 SibSp 0 Parch 0 Ticket 0 Fare 0 Embarked 0 Title 0 dtype: int64 ************************************************** PassengerId 0 Pclass 0 Sex 0 Age 0 SibSp 0 Parch 0 Ticket 0 Fare 0 Embarked 0 Title 0 dtype: int64 特征处理

新建特征:家庭总成员:Family,是否是母亲:Mother,是否免费上船:Free

trainSet['Family'] = trainSet.SibSp + trainSet.Parch + 1 trainSet['Mother'] = np.where((trainSet.Title=='Mrs') & (trainSet.Parch > 0),1,0) trainSet['Free'] = np.where(trainSet.Fare==0,1,0) trainSet = trainSet.drop(['SibSp','Parch'],axis=1) trainSet.head()
PassengerId Survived Pclass Sex Age Ticket Fare Embarked Title Family Mother Free
0 1 0 3 male 22.0 A/5 21171 7.2500 S Mr 2 0 0
1 2 1 1 female 38.0 PC 17599 71.2833 C Mrs 2 0 0
2 3 1 3 female 26.0 STON/O2. 3101282 7.9250 S Miss 1 0 0
3 4 1 1 female 35.0 113803 53.1000 S Mrs 2 0 0
4 5 0 3 male 35.0 373450 8.0500 S Mr 1 0 0
trainSet = trainSet.drop('Ticket',axis=1) trainSet.head()
PassengerId Survived Pclass Sex Age Fare Embarked Title Family Mother Free
0 1 0 3 male 22.0 7.2500 S Mr 2 0 0
1 2 1 1 female 38.0 71.2833 C Mrs 2 0 0
2 3 1 3 female 26.0 7.9250 S Miss 1 0 0
3 4 1 1 female 35.0 53.1000 S Mrs 2 0 0
4 5 0 3 male 35.0 8.0500 S Mr 1 0 0
testSet['Family'] = testSet.SibSp + testSet.Parch + 1 testSet['Mother'] = np.where((testSet.Title=='Mrs') & (testSet.Parch > 0),1,0) testSet['Free'] = np.where(testSet.Fare==0,1,0) testSet = testSet.drop(['SibSp','Parch','Ticket'],axis=1) testSet.head()
PassengerId Pclass Sex Age Fare Embarked Title Family Mother Free
0 892 3 male 34.5 7.8292 Q Mr 1 0 0
1 893 3 female 47.0 7.0000 S Mrs 2 0 0
2 894 2 male 62.0 9.6875 Q Mr 1 0 0
3 895 3 male 27.0 8.6625 S Mr 1 0 0
4 896 3 female 22.0 12.2875 S Mrs 3 1 0
#保存一下数据,防止丢失 trainSet.to_csv('work/trainSet.csv',index=0) testSet.to_csv('work/testSet.csv',index=0) print(trainSet.groupby('Family').Survived.mean()) g = sns.FacetGrid(trainSet,col='Survived').map(sns.distplot,'Family') Family 1 0.303538 2 0.552795 3 0.578431 4 0.724138 5 0.200000 6 0.136364 7 0.333333 8 0.000000 11 0.000000 Name: Survived, dtype: float64

在这里插入图片描述

trainSet.groupby('Mother').Survived.mean() Mother 0 0.361677 1 0.714286 Name: Survived, dtype: float64 trainSet.groupby('Free').Survived.mean() Free 0 0.389269 1 0.066667 Name: Survived, dtype: float64 #数值化处理 trainSet.head()
PassengerId Survived Pclass Sex Age Fare Embarked Title Family Mother Free
0 1 0 3 male 22.0 7.2500 S Mr 2 0 0
1 2 1 1 female 38.0 71.2833 C Mrs 2 0 0
2 3 1 3 female 26.0 7.9250 S Miss 1 0 0
3 4 1 1 female 35.0 53.1000 S Mrs 2 0 0
4 5 0 3 male 35.0 8.0500 S Mr 1 0 0

Dr 53.000000
Master 7.406471
Miss 21.774844
Mr 32.000000
Mrs 38.903226
Others

trainSet.Embarked = trainSet.Embarked.replace(['S','Q','C'],[1,2,3]) trainSet.Sex = trainSet.Sex.replace(['male','female'],[1,2]) trainSet.Title = trainSet.Title.replace(['Dr','Master','Miss','Mr','Mrs','Others'],[1,2,3,4,5,6]) trainSet.head()
PassengerId Survived Pclass Sex Age Fare Embarked Title Family Mother Free
0 1 0 3 1 22.0 7.2500 1 4 2 0 0
1 2 1 1 2 38.0 71.2833 3 5 2 0 0
2 3 1 3 2 26.0 7.9250 1 3 1 0 0
3 4 1 1 2 35.0 53.1000 1 5 2 0 0
4 5 0 3 1 35.0 8.0500 1 4 1 0 0
testSet.Embarked = testSet.Embarked.replace(['S','Q','C'],[1,2,3]) testSet.Sex = testSet.Sex.replace(['male','female'],[1,2]) testSet.Title = testSet.Title.replace(['Dr','Master','Miss','Mr','Mrs','Others'],[1,2,3,4,5,6]) testSet.head()
PassengerId Pclass Sex Age Fare Embarked Title Family Mother Free
0 892 3 1 34.5 7.8292 2 4 1 0 0
1 893 3 2 47.0 7.0000 1 5 2 0 0
2 894 2 1 62.0 9.6875 2 4 1 0 0
3 895 3 1 27.0 8.6625 1 4 1 0 0
4 896 3 2 22.0 12.2875 1 5 3 1 0
trainSet.Age = trainSet.Age.replace(['Child', 'Young Adult', 'Adult','Older Adult','Senior'],[1,2,3,4,5]) trainSet.head()
PassengerId Survived Pclass Sex Age Fare Embarked Title Family Mother Free
0 1 0 3 1 2 7.2500 1 4 2 0 0
1 2 1 1 2 3 71.2833 3 5 2 0 0
2 3 1 3 2 3 7.9250 1 3 1 0 0
3 4 1 1 2 3 53.1000 1 5 2 0 0
4 5 0 3 1 3 8.0500 1 4 1 0 0
bins = [0,12,24,45,60,testSet.Age.max()] labels = [1,2,3,4,5] testSet["Age"] = pd.cut(testSet["Age"], bins, labels = labels) testSet.head()
PassengerId Pclass Sex Age Fare Embarked Title Family Mother Free
0 892 3 1 3 7.8292 2 4 1 0 0
1 893 3 2 4 7.0000 1 5 2 0 0
2 894 2 1 5 9.6875 2 4 1 0 0
3 895 3 1 3 8.6625 1 4 1 0 0
4 896 3 2 2 12.2875 1 5 3 1 0
#保存一下数据,防止丢失 trainSet.to_csv('work/trainSet1.csv',index=0) /tbody>
```python #保存一下数据,防止丢失 trainSet.to_csv('work/trainSet1.csv',index=0) testSet.to_csv('work/testSet1.csv',index=0) 数据终于处理完了,下面进行建模预测
作者:QianLong_



泰坦尼克号 kaggle 特征 数据 数据分析 top

需要 登录 后方可回复, 如果你还没有账号请 注册新账号