An analysis of the Titanic dataset to explore whether port of embarkation influenced survival rates.
PETER REYNOLDS, SEAMUS O’ CONGHAILE, DAVID BOURKE
Introduction
What is Kaggle, its workings & what it asks competitors to do
The Titanic competition and what it broadly asks competitors to do
The data available to competitors
The question within the Titanic dataset that we focused on
Data Mining & Machine Learning
Data mining is a process whereby we try to “discover novel, interesting and potentially useful patterns from large datasets”
Machine Learning, as Lantz (2013) points out, is “interested in the development of computer algorithms for transforming data into intelligent action”
How these processes help us discover patterns within large datasets
Our Approach
We chose a Classification approach as it suited the data we were handling.
The Classification tool we used – Decision Tree
Cross and Split Validation
Our use of RapidMiner: what it is and why we chose it
Implementation
Clean and prepare the data
Build a Decision Tree
Apply the model
Apply the validation model
Compare the different types of validation sampling
Export results (data file)
Submit findings to Kaggle
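The export and submit steps above can be sketched in a few lines. The Kaggle Titanic competition accepts a two-column CSV file with the headers `PassengerId` and `Survived`. The function name `write_submission` and the example predictions are hypothetical and serve only as illustration; they are not the results exported from RapidMiner.

```python
import csv
import io

def write_submission(predictions, out):
    """Write (PassengerId, Survived) pairs as a two-column CSV."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for passenger_id, survived in predictions:
        # Survived is written as 0 (died) or 1 (survived).
        writer.writerow([passenger_id, int(survived)])

# Hypothetical predictions, for illustration only.
example_predictions = [(892, 0), (893, 1), (894, 0)]

buffer = io.StringIO()
write_submission(example_predictions, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

To produce a real file, pass an object opened with `open("submission.csv", "w", newline="")` instead of the in-memory buffer.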
Decision Tree
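To give a feel for how a decision tree picks the attribute to split on, here is a minimal sketch using Gini impurity. This is one common splitting criterion and not necessarily the one configured in our RapidMiner process. The column names `Sex`, `Embarked` and `Survived` follow the Kaggle data file, and the toy rows are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

def split_impurity(rows, attribute, label="Survived"):
    """Weighted Gini impurity after splitting rows on a categorical attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[label])
    total = len(rows)
    return sum(len(group) / total * gini(group) for group in groups.values())

def best_attribute(rows, attributes, label="Survived"):
    """Pick the attribute whose split leaves the least impurity."""
    return min(attributes, key=lambda a: split_impurity(rows, a, label))

# Hypothetical toy rows, not taken from the real dataset.
toy_rows = [
    {"Sex": "female", "Embarked": "S", "Survived": 1},
    {"Sex": "female", "Embarked": "C", "Survived": 1},
    {"Sex": "male", "Embarked": "S", "Survived": 0},
    {"Sex": "male", "Embarked": "C", "Survived": 0},
    {"Sex": "female", "Embarked": "Q", "Survived": 1},
    {"Sex": "male", "Embarked": "Q", "Survived": 1},
]

chosen = best_attribute(toy_rows, ["Sex", "Embarked"])
print(chosen)
```

A full tree repeats this choice recursively on each resulting group until the groups are pure or a stopping rule is met.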
Cross Validation & Split Validation
Linear: divides the example set into partitions
Shuffled: builds subsets
Stratified: builds random subsets
Automatic: stratified by default
Leave One Out: applies the model line by line to the test set
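The sampling types listed above can be approximated in plain Python. This is a rough sketch of the general ideas, assuming the usual meaning of each term; it is not RapidMiner's implementation, and the function names and example labels are hypothetical.

```python
import random
from collections import defaultdict

def linear_folds(indices, k):
    """Consecutive partitions; the original order is preserved."""
    size, extra = divmod(len(indices), k)
    folds, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < extra else 0)
        folds.append(indices[start:end])
        start = end
    return folds

def shuffled_folds(indices, k, seed=0):
    """Random partitions: shuffle first, then cut into consecutive folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return linear_folds(shuffled, k)

def stratified_folds(labels, k, seed=0):
    """Random partitions that keep class proportions similar in each fold."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        # Deal each class out across the folds like a deck of cards.
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds

def leave_one_out(indices):
    """Each example forms its own single-item test fold."""
    return [[i] for i in indices]

# Hypothetical labels: 1 = survived, 0 = died.
labels = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
indices = list(range(len(labels)))

print(linear_folds(indices, 3))
print(stratified_folds(labels, 4))
```

In each scheme, every fold is used once as the test set while the remaining folds train the model.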
Results
1st Kaggle prediction accuracy of 24.42%
Revised Model for 2nd attempt
2nd Kaggle prediction accuracy of 77.51%
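Prediction accuracy here is the share of passengers whose outcome was predicted correctly. A minimal sketch of that calculation follows; the two lists are hypothetical and unrelated to our submissions.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical labels: 1 = survived, 0 = died.
predicted = [0, 1, 0, 0, 1, 1, 0, 1]
actual = [0, 1, 1, 0, 1, 0, 0, 1]

score = accuracy(predicted, actual)
print(f"{score:.2%}")
```

Six of the eight example predictions match, so this prints 75.00%.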
Results
Survived
Southampton: 197 passengers
Cherbourg: 83 passengers
Queenstown: 41 passengers
Died
Southampton: 719 passengers
Cherbourg: 187 passengers
Queenstown: 82 passengers
Survival Rates
Southampton: 21%
Cherbourg: 31%
Queenstown: 33%
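The survival rates follow directly from the counts: survivors divided by all passengers from that port. The sketch below reproduces the arithmetic using the figures on these slides; the function name `survival_rates` is only illustrative.

```python
# Counts taken from the Survived and Died figures above.
survived = {"Southampton": 197, "Cherbourg": 83, "Queenstown": 41}
died = {"Southampton": 719, "Cherbourg": 187, "Queenstown": 82}

def survival_rates(survived, died):
    """Survivors divided by total passengers, per port of embarkation."""
    return {
        port: survived[port] / (survived[port] + died[port])
        for port in survived
    }

rates = survival_rates(survived, died)
for port, rate in rates.items():
    print(f"{port}: {rate:.1%}")
```

To one decimal place this gives 21.5% for Southampton, 30.7% for Cherbourg and 33.3% for Queenstown, in line with the whole-number figures shown on the slide.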
Conclusion
Results show no significant evidence of a correlation between port of embarkation and survival
Queenstown had the highest survival rate even though its passengers were predominantly 3rd Class
Other models, such as Naïve Bayes or Random Forest, could possibly yield higher prediction accuracy
Scope for future work
Questions?