Kaggle presentation friday

An analysis of the Titanic dataset to explore whether port of embarkation influenced survival rates.

PETER REYNOLDS, SEAMUS O’ CONGHAILE, DAVID BOURKE

Introduction

What Kaggle is, how it works, and what it asks competitors to do

The Titanic competition and what it broadly asks competitors to do

The data available to competitors

The question within the Titanic dataset that we focused on

Data Mining & Machine Learning

Data mining is a process whereby we try to “discover novel, interesting and potentially useful patterns from large datasets”

Machine Learning, as Lantz (2013) points out, is “interested in the development of computer algorithms for transforming data into intelligent action”

How these processes help us discover patterns within large datasets

Our Approach

We chose a Classification approach as it suited the data we were handling.

The Classification tool we used – Decision Tree

Cross and Split Validation

Our use of RapidMiner: what it is and why we chose it
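The difference between split validation and cross validation can be sketched in a few lines. This is a minimal illustration in plain Python, not the RapidMiner process used for the project; the function names, the 70/30 ratio and the row counts are assumptions chosen for the example.

```python
def split_validation(n_rows, train_ratio=0.7):
    """Single hold-out split: the first part trains, the rest tests."""
    cut = round(n_rows * train_ratio)
    indices = list(range(n_rows))
    return indices[:cut], indices[cut:]


def k_fold_cross_validation(n_rows, k=10):
    """Yield (train, test) index lists; every row is tested exactly once."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Hypothetical 10-row example set.
train_rows, test_rows = split_validation(10, train_ratio=0.7)
folds = list(k_fold_cross_validation(10, k=5))
print(len(train_rows), len(test_rows), len(folds))
```

Split validation scores the model once on a single held-out portion, while cross validation repeats the train/test cycle k times and averages the scores, which usually gives a steadier estimate on a small dataset.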

Implementation

Clean and prepare the data

Build a Decision Tree

Apply the model

Apply the validation model

Different types of validation sampling models

Export results (data file)

Submit findings to Kaggle

Decision Tree
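A decision tree picks splits that make the resulting groups purer with respect to the label. The sketch below measures that for a split on port of embarkation, using the per-port counts reported in the Results slides. Gini impurity is an assumption here: it is one common splitting criterion, and the slides do not say which criterion the RapidMiner tree used.

```python
def gini(survived, died):
    """Gini impurity of a group with two classes (0.0 = pure, 0.5 = evenly mixed)."""
    total = survived + died
    if total == 0:
        return 0.0
    p = survived / total
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)


# (survived, died) per port, taken from the Results slides.
counts = {
    "Southampton": (197, 719),
    "Cherbourg": (83, 187),
    "Queenstown": (41, 82),
}

total_survived = sum(s for s, _ in counts.values())
total_died = sum(d for _, d in counts.values())
n = total_survived + total_died

# Impurity before splitting, and the size-weighted impurity after splitting by port.
parent_impurity = gini(total_survived, total_died)
split_impurity = sum((s + d) / n * gini(s, d) for s, d in counts.values())
impurity_reduction = parent_impurity - split_impurity

print(f"before: {parent_impurity:.4f}  after: {split_impurity:.4f}  "
      f"reduction: {impurity_reduction:.4f}")
```

A tree builder would compute this reduction for every candidate attribute and split on the one with the largest value.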

Cross Validation & Split validation

Linear: Divides the example set into partitions

Shuffled: Builds subsets

Stratified: Builds random subsets

Automatic: Stratified by default

Leave One Out: Applies the model line by line to test set
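The sampling types listed above can be illustrated with a small plain-Python sketch. It is a rough approximation under stated assumptions, not RapidMiner's implementation: the function names, the toy rows and the fixed seed are all hypothetical, and the "Automatic" mode is not modelled.

```python
import random
from collections import defaultdict


def linear_partitions(rows, k):
    """Consecutive partitions that keep the original row order."""
    n = len(rows)
    return [rows[i * n // k:(i + 1) * n // k] for i in range(k)]


def shuffled_partitions(rows, k, seed=0):
    """Random partitions: shuffle a copy, then cut it into k pieces."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return linear_partitions(shuffled, k)


def stratified_partitions(rows, k, label_of, seed=0):
    """Random partitions that keep class proportions roughly equal."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    partitions = [[] for _ in range(k)]
    rng = random.Random(seed)
    position = 0
    for group in by_label.values():
        rng.shuffle(group)
        for row in group:
            # Deal rows of each class round-robin across the partitions.
            partitions[position % k].append(row)
            position += 1
    return partitions


def leave_one_out(rows):
    """Yield (training_rows, single_test_row), one row held out at a time."""
    for i in range(len(rows)):
        yield rows[:i] + rows[i + 1:], rows[i]


# Toy rows of (passenger_id, survived); 4 of the 12 are marked as survivors.
rows = [(i, 1 if i % 3 == 0 else 0) for i in range(12)]
linear = linear_partitions(rows, 4)
shuffled = shuffled_partitions(rows, 4)
stratified = stratified_partitions(rows, 4, label_of=lambda row: row[1])
held_out = list(leave_one_out(rows))
```

With these toy rows, each stratified partition ends up with one survivor and two non-survivors, whereas linear and shuffled partitions give no such guarantee.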

Results

1st Kaggle prediction accuracy of 24.42%

Revised Model for 2nd attempt

2nd Kaggle prediction accuracy of 77.51%

Results

Survived

Southampton: 197 passengers

Cherbourg: 83 passengers

Queenstown: 41 passengers

Died

Southampton: 719 passengers

Cherbourg: 187 passengers

Queenstown: 82 passengers

Survival Rates

Southampton: 22%

Cherbourg: 31%

Queenstown: 33%
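The survival rates can be recomputed from the survived and died counts on the preceding slides, as survived / (survived + died) for each port. A minimal sketch in Python (the variable names are illustrative only):

```python
# (survived, died) per port, taken from the Survived and Died slides.
counts = {
    "Southampton": (197, 719),
    "Cherbourg": (83, 187),
    "Queenstown": (41, 82),
}

# Survival rate = survivors divided by all passengers who embarked at that port.
rates = {port: s / (s + d) for port, (s, d) in counts.items()}

for port, rate in rates.items():
    print(f"{port}: {rate:.1%}")
```

Keeping one decimal place shows how close Cherbourg and Queenstown are to each other, and how far both sit from Southampton.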

Conclusion

Results show no significant evidence of a correlation between port of embarkation and survival

Queenstown had the highest survival rate even though its passengers were predominantly 3rd Class

Other models including Naïve Bayes or Random Forest could possibly yield higher prediction accuracy

Scope for future work

Questions?

Data & Analytics
