Foreword

In 2003 I watched the first movie of my life, Titanic. Four-year-old me could never have guessed that fifteen years later we would meet again like this.
Titanic is the introductory exercise for many Data Science / Machine Learning newcomers, and plenty of people make it their first Kaggle competition. It provides records for 891 passengers aboard the Titanic, including name, sex, cabin class, fare... and, most importantly, whether they survived the sinking. Our job is to use these 891 rows to study how surviving the wreck relates to the other features, and then use our model to predict whether another 418 passengers made it. Simple and clear.

Overview

First, let's see what the given data contains:

>>> import pandas as pd
>>> train = pd.read_csv('train.csv')
>>> test = pd.read_csv('test.csv')
>>> print(train.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
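
Age (714 non-null of 891), Cabin (204), and Embarked (889) all have gaps that we will have to deal with later. For a quick tally of the missing values (isnull().sum() is the standard pandas idiom):

>>> print(train.isnull().sum())

which works out to 177 missing ages, 687 missing cabins, and 2 missing embarkation ports.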

Next, some visualization to see how the features relate to survival.

>>> import seaborn as sns
>>> import matplotlib.pyplot as plt
>>> sns.barplot(x='Embarked', y='Survived', hue='Sex', data=train)
>>> plt.show()

[Figure: survival rate by Embarked, split by Sex]

>>> sns.pointplot(x='Pclass', y='Survived', hue='Sex', data=train, palette={'male': 'blue', 'female': 'pink'},
  markers=['*', 'o'], linestyles=['-', '--'])
>>> plt.show()

[Figure: survival rate by Pclass, split by Sex]

Feature Engineering

The idea here is:

  • Sex, age, and cabin class "ought to be" important factors in whether a passenger survived.
  • To guard against overfitting, we bin ages into groups: Baby, Child, Teenager, Student, Young Adult, Adult, Senior, and Unknown.
  • For the same reason, we bin fares into four quartiles, extract each passenger's name prefix, and so on..
  • Drop some features we don't need.
def simplify_ages(df):
    # Fill missing ages with -0.5 so they land in the 'Unknown' bin
    df.Age = df.Age.fillna(-0.5)
    bins = (-1, 0, 5, 12, 18, 25, 36, 60, 120)
    group_names = ['Unknown', 'Baby', 'Child', 'Teenager',
                   'Student', 'Young Adult', 'Adult', 'Senior']
    categories = pd.cut(df.Age, bins, labels=group_names)
    df.Age = categories
    return df


def simplify_cabins(df):
    # Keep only the deck letter; 'N' marks a missing cabin
    df.Cabin = df.Cabin.fillna('N')
    df.Cabin = df.Cabin.apply(lambda x: x[0])
    return df


def simplify_fares(df):
    # Fill missing fares with -0.5 ('Unknown'), then bin into quartiles
    df.Fare = df.Fare.fillna(-0.5)
    bins = (-1, 0, 8, 15, 31, 1000)
    group_names = ['Unknown', '1_quartile',
                   '2_quartile', '3_quartile', '4_quartile']
    categories = pd.cut(df.Fare, bins, labels=group_names)
    df.Fare = categories
    return df


def format_name(df):
    # Names look like "Braund, Mr. Owen Harris": the first token is
    # the last name, the second is the title prefix
    df['Lname'] = df.Name.apply(lambda x: x.split(' ')[0])
    df['NamePrefix'] = df.Name.apply(lambda x: x.split(' ')[1])
    return df


def drop_features(df):
    return df.drop(['Ticket', 'Name', 'Embarked'], axis=1)


def transform_features(df):
    df = simplify_ages(df)
    df = simplify_cabins(df)
    df = simplify_fares(df)
    df = format_name(df)
    df = drop_features(df)
    return df

The results after feature transformation:

>>> train = transform_features(train)
>>> test = transform_features(test)
>>> sns.barplot(x='Age', y='Survived', hue='Sex', data=train)
>>> plt.show()

[Figure: survival rate by Age group, split by Sex]

>>> sns.barplot(x='Cabin', y='Survived', hue='Sex', data=train)
>>> plt.show()

[Figure: survival rate by Cabin deck, split by Sex]

>>> sns.barplot(x='Fare', y='Survived', hue='Sex', data=train)
>>> plt.show()

[Figure: survival rate by Fare quartile, split by Sex]

Now let's look at how correlated the features are (note that train.astype(float) requires every column to be numeric, so this heatmap really runs after the label encoding below):

>>> colormap = plt.cm.viridis
>>> plt.figure(figsize=(12, 12))
>>> plt.title('Pearson Correlation', y=1.05, size=15)
>>> sns.heatmap(train.astype(float).corr(), linewidths=0.1, vmax=1.0, square=True,
    cmap=colormap, linecolor='white', annot=True)
>>> plt.show()

[Figure: Pearson correlation heatmap of the features]
This uses the Pearson product-moment correlation coefficient; in a funny twist of fate, I only learned what that was a month ago.
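
For reference, Pearson's r between two variables is their covariance divided by the product of their standard deviations, ranging from -1 to 1. A minimal sanity check of one cell of the heatmap, using NumPy's standard np.corrcoef (Pclass and Survived stay numeric throughout, so this works even before encoding):

>>> import numpy as np
>>> x = train['Pclass'].astype(float)
>>> y = train['Survived'].astype(float)
>>> print(np.corrcoef(x, y)[0, 1])  # should match the Pclass/Survived cell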

Label Encoding

This step encodes the categorical features as integers. The point is simply to turn strings into numeric values that a model can digest.

from sklearn import preprocessing


def encode_features(train, test):
    features = ['Fare', 'Cabin', 'Age', 'Sex', 'Lname', 'NamePrefix']
    # Fit each encoder on train + test combined so both sets share one mapping
    df_combined = pd.concat([train[features], test[features]])

    for feature in features:
        le = preprocessing.LabelEncoder()
        le = le.fit(df_combined[feature])
        train[feature] = le.transform(train[feature])
        test[feature] = le.transform(test[feature])
    return train, test

>>> train, test = encode_features(train, test)

Splitting the Data

As usual, the hold-out fraction is set to 0.2: 80% of the data goes to training, 20% to testing.

>>> from sklearn.model_selection import train_test_split

>>> X_all = train.drop(['Survived', 'PassengerId'], axis=1)
>>> Y_all = train['Survived']

>>> test_amount = 0.2
>>> x_train, x_test, y_train, y_test = train_test_split(
    X_all, Y_all, test_size=test_amount, random_state=23)

Picking a Model

Just throw a model at it and see what happens; if it doesn't work, swap it out.
I tried Naive Bayes and support vector machines, and ended up with a random forest (a quick sketch of the bake-off follows).
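
A minimal sketch of such a comparison, using sklearn's stock GaussianNB and SVC on the same 80/20 split (the exact models and settings tried are assumptions):

from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Fit each candidate on the 80% split and score it on the 20% hold-out
for model in (GaussianNB(), SVC()):
    model.fit(x_train, y_train)
    print(type(model).__name__, accuracy_score(y_test, model.predict(x_test)))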

>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.metrics import make_scorer, accuracy_score

>>> clf = RandomForestClassifier()
>>> parameters = {
    'n_estimators': [4, 6, 9],
    'max_features': ['log2', 'sqrt', 'auto'],
    'criterion': ['entropy', 'gini'],
    'max_depth': [2, 3, 5, 10],
    'min_samples_split': [2, 3, 5],
    'min_samples_leaf': [1, 5, 8]
}
>>> acc_scorer = make_scorer(accuracy_score)
>>> grid_obj = GridSearchCV(clf, parameters, scoring=acc_scorer)
>>> grid_obj = grid_obj.fit(x_train, y_train)
>>> clf = grid_obj.best_estimator_
>>> clf.fit(x_train, y_train)
>>> predictions=clf.predict(x_test)
>>> print(accuracy_score(y_test,predictions))
0.8212290502793296
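
As an aside, the hyperparameter combination that won the grid search can be read off the fitted GridSearchCV object via its best_params_ attribute:

>>> print(grid_obj.best_params_)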

The accuracy comes out around 0.82, but the true performance is not that high: a single 20% hold-out gives a noisy, optimistic estimate.

Cross-Validation

Here we use KFold for cross-validation: the training set is split into K folds, each round one fold is held out while the rest are used for training, and this is repeated K times to produce K scores.

>>> import numpy as np
>>> from sklearn.model_selection import KFold

def run_kfold(clf):
    # Split the 891 training rows into 10 folds; each fold takes one
    # turn as the test set while the other nine train the model
    kf = KFold(n_splits=10)
    outcomes = []
    fold = 0
    for train_index, test_index in kf.split(X_all):
        fold += 1
        x_train, x_test = X_all.values[train_index], X_all.values[test_index]
        y_train, y_test = Y_all.values[train_index], Y_all.values[test_index]
        clf.fit(x_train, y_train)
        predictions = clf.predict(x_test)
        accuracy = accuracy_score(y_test, predictions)
        outcomes.append(accuracy)
        print('Fold {0} accuracy: {1}'.format(fold, accuracy))
    mean_outcome = np.mean(outcomes)
    print('Mean accuracy: {0}'.format(mean_outcome))

>>> run_kfold(clf)

With K=10, the ten rounds of cross-validation give the following results:

Fold 1 accuracy: 0.766666666667
Fold 2 accuracy: 0.865168539326
Fold 3 accuracy: 0.808988764045
Fold 4 accuracy: 0.808988764045
Fold 5 accuracy: 0.808988764045
Fold 6 accuracy: 0.831460674157
Fold 7 accuracy: 0.786516853933
Fold 8 accuracy: 0.808988764045
Fold 9 accuracy: 0.865168539326
Fold 10 accuracy: 0.820224719101
Mean accuracy: 0.817116104869

Output

Time to hand in the homework.
Write the predictions out to titanic_predictions.csv.
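
A minimal sketch of the submission step, assuming the fitted clf and the transformed, encoded test frame from above:

>>> ids = test['PassengerId']
>>> predictions = clf.predict(test.drop('PassengerId', axis=1))
>>> output = pd.DataFrame({'PassengerId': ids, 'Survived': predictions})
>>> output.to_csv('titanic_predictions.csv', index=False)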


Accuracy on the test set is 0.74, which puts me at rank 7858 in the world...


Yzstr Andy

School of Data and Computer Science, SUN YAT-SEN UNIVERSITY