Kaggle Titanic 83%

by Daniel Pollithy

How I achieved 83% accuracy in the Kaggle Titanic challenge

After trying different things (a custom linear model, different cross-validation setups, different random forests, randomized searches, PCA), I found the GradientBoostingClassifier with {'max_depth': 4, 'n_estimators': 200} to perform best, reaching 0.8282 accuracy with stratified cross-validation on the training data. This model resulted in a public score of 0.78947 on Kaggle.

Another interesting result was 0.84 accuracy with a RandomForestClassifier at cv=10 and {'max_features': 5, 'min_samples_leaf': 7, 'min_samples_split': 3, 'n_estimators': 41}. It led to the same public score of 0.78947.
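
For reference, this is roughly how such a model is scored with stratified cross-validation (train and train_labels are the preprocessed features and labels produced further below; the random_state is my addition):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

gbc = GradientBoostingClassifier(max_depth=4, n_estimators=200, random_state=42)
scores = cross_val_score(gbc, train, train_labels,
                         cv=StratifiedKFold(n_splits=10), scoring='accuracy')
print(scores.mean())

With a classifier, passing an integer cv to cross_val_score already implies stratified folds, so plain cv=10 behaves the same.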

I then searched for better parameters for hours using GridSearchCV, but it did not return anything better than what the initial randomized search had found.
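
That initial randomized search was scikit-learn's RandomizedSearchCV; a minimal sketch, where the parameter distributions are my guess rather than the exact ranges I used:

from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'n_estimators': randint(1, 200),
    'max_features': randint(1, 17),
    'min_samples_split': randint(2, 10),
    'min_samples_leaf': randint(1, 10),
}

rnd_search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                                param_distributions, n_iter=50, cv=10,
                                scoring='accuracy', random_state=42)
rnd_search.fit(train, train_labels)
print(rnd_search.best_score_, rnd_search.best_params_)

rnd_search is reused further below to make the final predictions.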

The most work-intensive part of this challenge was the preprocessing. In the end, I had the following pipeline running:

# 'Name' feeds both pipelines: the numerical one turns it into a name
# length, the categorical one extracts the title from it
num_attribs = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Name']
cat_attribs = ['Sex', 'Embarked', 'Name']

class NumAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self):  # no *args or **kwargs
        pass

    def fit(self, X, y=None):
        return self  # nothing else to do

    def transform(self, X, y=None):
        # family size = siblings/spouses + parents/children + the passenger
        X['FamilySize'] = X['SibSp'] + X['Parch'] + 1
        X['IsAlone'] = 1  # initialize to yes/1 (is alone)
        # .loc avoids the chained assignment, which may fail silently
        X.loc[X['FamilySize'] > 1, 'IsAlone'] = 0

        return X
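
A quick check on a toy frame shows the two derived columns:

import pandas as pd

adder = NumAttributesAdder()
demo = pd.DataFrame({'SibSp': [1, 0], 'Parch': [2, 0]})
print(adder.transform(demo))
#    SibSp  Parch  FamilySize  IsAlone
# 0      1      2           4        0
# 1      0      0           1        1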
        
        
class CatAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, stat_min=10):  # no *args or **kwargs
        self.stat_min = stat_min

    def fit(self, X, y=None):
        return self  # nothing else to do

    def transform(self, X, y=None):
        # extract the title ("Mr", "Miss", ...) from "Lastname, Title. Firstname"
        X['Title'] = X['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
        # collapse titles rarer than stat_min occurrences into 'Misc'
        rare_title = X['Title'].value_counts() < self.stat_min
        X['Title'] = X['Title'].apply(lambda t: 'Misc' if rare_title.loc[t] else t)

        X = X.drop(['Name'], axis=1)

        return X
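
The pipelines below also rely on a few helper transformers that I have not shown. Minimal sketches of what they do, where the exact behavior is my best reconstruction:

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    # select a subset of columns from a DataFrame
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].copy()


class NameLengthAdder(BaseEstimator, TransformerMixin):
    # replace the textual Name column with the length of the name
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X['Name'] = X['Name'].str.len()
        return X


class PandasMedianFill(BaseEstimator, TransformerMixin):
    # fill numeric NaNs with the per-column median learned in fit()
    def fit(self, X, y=None):
        self.medians_ = X.median()
        return self

    def transform(self, X):
        return X.fillna(self.medians_)


class PandasModeFill(BaseEstimator, TransformerMixin):
    # fill categorical NaNs with the per-column mode learned in fit()
    def fit(self, X, y=None):
        self.modes_ = X.mode().iloc[0]
        return self

    def transform(self, X):
        return X.fillna(self.modes_)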

num_pipeline = Pipeline([
    ('selector1', DataFrameSelector(num_attribs)),
    ('name_length', NameLengthAdder()),
    ('pandas_median_fill', PandasMedianFill()),
    ('num_attribs_adder', NumAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector2', DataFrameSelector(cat_attribs)),
    ('pandas_mode_fill', PandasModeFill()),
    ('cat_attribs_adder', CatAttributesAdder()),
    ('cat_encoder', CategoricalEncoder(encoding="onehot-dense")),
])

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
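
One caveat: CategoricalEncoder only ever existed in a scikit-learn 0.20 pre-release (it was split into OneHotEncoder and OrdinalEncoder before the final release), so on a current scikit-learn the last step needs a substitute, roughly:

from sklearn.preprocessing import OneHotEncoder

# dense output like encoding="onehot-dense"; on scikit-learn < 1.2 the
# argument is sparse=False instead of sparse_output=False
cat_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

handle_unknown='ignore' also keeps the transform from failing on categories (for example rare titles) that only appear in the test set.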

Loading the data and pushing it through the pipeline is straightforward:

import pandas as pd

# load the training data; work on a copy so data_train stays untouched
data_train = pd.read_csv('../input/train.csv')
train = data_train.copy()
train_labels = data_train['Survived'].copy()
train = full_pipeline.fit_transform(train)

Searching for good RandomForestClassifier hyperparameters with a grid search looked like this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30],
     'max_features': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]},
    {'bootstrap': [False], 'n_estimators': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
     'max_features': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]},
]

forest_clf = RandomForestClassifier(random_state=42)

grid_search = GridSearchCV(forest_clf, param_grid, cv=10,
                           scoring='accuracy', return_train_score=True)
grid_search.fit(train, train_labels)

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(mean_score, params)
    
winner = max(zip(cvres["mean_test_score"], cvres["params"]), key=lambda x: x[0])
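
GridSearchCV already tracks the winner itself, so the manual ranking is mainly useful for inspecting the runners-up:

print(grid_search.best_score_, grid_search.best_params_)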

Finally, predict the test data with the best model and write a Kaggle-compatible submission CSV:

# use the winner from the randomized search on the test data
# (grid_search.best_estimator_ would work the same way)
best_forest = rnd_search.best_estimator_

data_test = pd.read_csv('../input/test.csv')
# transform, not fit_transform: reuse the medians, modes, scaling and
# one-hot categories learned from the training data
test = full_pipeline.transform(data_test.copy())

test_predictions = best_forest.predict(test)

data_test['Survived'] = test_predictions
data_test.to_csv('solution.csv', columns=['PassengerId', 'Survived'], index=False)
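
A quick sanity check that the submission has the shape Kaggle expects (two columns, 418 test rows):

submission = pd.read_csv('solution.csv')
assert list(submission.columns) == ['PassengerId', 'Survived']
assert len(submission) == 418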