Intro to Data Mining
Data Mining 101
Okiriza Wibisono - @okiriza
Ali Akbar Septiandri - @aliakbars
Outline
Introduction
Terminology
Potential application
Venn diagram
Process overview
Business understanding
Data understanding (exploration)
Data preparation (preprocessing)
Modeling
Evaluation
Deployment (presentation)
Tools & Resources
Introduction Terminology
Data mining
Knowledge Discovery
in Databases
Big data analytics
Statistics
Data science
The process of collecting,
searching through, and analyzing
a large amount of data in a
database, as to discover patterns
or relationships.
Data Mining - dictionary.reference.com
Introduction Potential Application
Customer segmentation
Recommendation engine
Social media mining
What should we do?
Where to start? Do I have to get a master's degree in statistics?
http://tomfishburne.com.s3.amazonaws.com/site/wp-content/uploads/2014/01/140113.bigdata.jpg
Data Science Venn Diagram
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
And now the business process
CRISP DM Methodology
http://lyle.smu.edu/~mhd/8331f03/crisp.pdf
Business Understanding
CRISP DM Methodology
Objective Statement
Bottom-up
Top-down
Objective Statement
Data Problem
vs
Situation Assessment
Inventory of Resources
Requirements, Assumptions, and Constraints
Risks and Contingencies
Terminology
Costs and Benefits
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
Situation Assessment
Inventory of Resources
Resource
Data, Knowledge,
Tools
Hardware
Personnel
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
Situation Assessment
Requirements, Assumptions, and Constraints
Requirements
Scheduling
Accuracy
Security
Assumptions
Data quality
External factors
Reporting type
Constraints
Legal issues
Budget
Resources
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
Situation Assessment
Risks and Contingencies
Contingency Plan
Financial
Organizational
Business
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
Situation Assessment
Terminology
Write down related terminology
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
http://www.partnersmn.com/wp-content/uploads/2010/08/5b8567b2b4e2d1cfd1a31b2b8a0ecebc1.jpg
Situation Assessment
Costs and Benefits
Money, money, money!
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
http://www.centuryproductsllc.com/wp-content/uploads/holding-money.jpg
How to evaluate the results?
Define your success criteria!
Data Understanding
CRISP DM Methodology
Data Collection
External vs Internal
Watch out!
visible accessible
storable presentable
Victor Lavrenko Text Technologies
http://www.inf.ed.ac.uk/teaching/courses/tts/pdf/crawl-2x2.pdf
Data Exploration
Visualization Heuristics
Visualize fast. Visualize reactively.
Go for high information 2D visualizations.
Select data subsets to visualize.
http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf
Data Exploration
Visualization Heuristics
Never let anomalies pass you by. Dig deeper.
Use your visualizations to inform potential
models. Use your potential model to direct your
visualizations.
Expect problems in your data.
http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf
This is the cheapest and most
informative stage of data
mining.
Nigel Goddard DME Visualization
http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf
Data Exploration
Visualization Tools
Column/bar: Large change
Line, curve: Small change, long periods
Histogram: Frequency distribution
https://nces.ed.gov/nceskids/help/user_guide/graph/whentouse.asp
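A histogram is just a count of values per bin. The following is a minimal sketch using only the Python standard library; the function name `histogram`, the bin width, and the sample ages are made up for illustration and do not come from the slides.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count how many values fall into each bin of the given width."""
    # Map every value to the lower edge of its bin, then count per edge.
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

ages = [21, 23, 25, 31, 34, 35, 38, 42, 47, 58]  # hypothetical sample
width = 10
freq = histogram(ages, width)

# Text rendering of the frequency distribution, one row per bin.
for start, count in freq.items():
    print(f"{start:>3}-{start + width - 1:<3} {'#' * count}")
```

A quick text rendering like this is in the spirit of "Visualize fast": it is often enough to spot a skewed or multi-modal distribution before reaching for a plotting library.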
Data Preparation
CRISP DM Methodology
Which one should I include
(or exclude)?
Data Selection
Data Cleaning
Dirty Data
Missing value
Incomplete
Outdated
Duplication
Outlier
Remember: Expect problems in your data.
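As an illustration of three of the problems above (duplication, outliers, missing values), here is a small cleaning sketch in plain Python. The records, the 0 to 120 age range, and the choice of median imputation are assumptions made for this example; the right rules depend on the data and the business problem.

```python
from statistics import median

# Hypothetical raw records: (customer_id, age); None marks a missing value.
raw = [(1, 34), (2, None), (3, 29), (3, 29), (4, 31), (5, 240), (6, 27)]

def clean(records, low=0, high=120):
    """Drop exact duplicates and out-of-range ages, then impute missing ages."""
    deduped = list(dict.fromkeys(records))             # keeps the first occurrence
    in_range = [(i, a) for i, a in deduped
                if a is None or low <= a <= high]      # drop implausible outliers
    fill = median(a for _, a in in_range if a is not None)
    return [(i, fill if a is None else a) for i, a in in_range]

cleaned = clean(raw)
print(cleaned)
```

Outliers are dropped before the median is computed so that an implausible value such as 240 does not influence the imputed age.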
Data Construction
Feature engineering: derived attributes, e.g.:
year from timestamp
quarter from timestamp
BMI from weight and height
Log(x) for skewed data (e.g. house price)
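The four derived attributes above can be sketched in a few lines of standard-library Python. The record layout and key names (`timestamp`, `weight_kg`, `height_cm`, `house_price`) are hypothetical, and the natural logarithm is assumed for Log(x).

```python
import math
from datetime import datetime

def derive_features(record):
    """Return derived attributes for one record (a dict with hypothetical keys)."""
    ts = datetime.fromisoformat(record["timestamp"])
    height_m = record["height_cm"] / 100
    return {
        "year": ts.year,
        "quarter": (ts.month - 1) // 3 + 1,             # months 1-3 -> Q1, 4-6 -> Q2, ...
        "bmi": record["weight_kg"] / height_m ** 2,     # kg / m^2
        "log_price": math.log(record["house_price"]),   # compresses a long right tail
    }

row = {
    "timestamp": "2015-02-14T19:00:00",
    "weight_kg": 70.0,
    "height_cm": 175.0,
    "house_price": 250000.0,
}
features = derive_features(row)
print(features)
```

The log transform requires strictly positive values; for data that can contain zeros, a variant such as `math.log1p` is a common alternative.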
Data Splitting
Two kinds of data splitting:
Training-Validation-Testing
Cross Validation
Data Splitting
Training-Validation-Testing
Training: Construct classifier
Validation: Pick algorithm, knob settings (tree depth, k in kNN, C in SVM)
Testing: Estimate future error rate
Split randomly to avoid bias
http://www.inf.ed.ac.uk/teaching/courses/iaml/slides/eval-2x2.pdf
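A random three-way split can be sketched as follows. The 60/20/20 proportions, the function name, and the fixed seed are assumptions for illustration; the slides do not prescribe particular fractions.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle a copy of rows, then cut it into three disjoint parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # random order avoids ordering bias
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

Fixing the seed makes the split reproducible, which helps when comparing models across runs. For time-ordered data a random shuffle may leak future information, so a chronological cut is usually preferred there.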
Data Splitting
Cross Validation
Every point is both training and testing, never at the same time
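The idea that every point is used for both training and testing, but never in the same round, can be sketched with a k-fold index generator. The function name `k_fold_indices` and the values n=10, k=3 are illustrative only; in practice the data would normally be shuffled first.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k roughly equal folds over n points."""
    # The first n % k folds get one extra point so that all n points are covered.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n=10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each index appears in exactly one test fold and in the training part of every other round.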
Dimensionality Reduction
Principal Component Analysis
vs
Linear Discriminant Analysis
Modeling
CRISP DM Methodology
Machine Learning
Classification Regression Ranking Clustering
Model Selection
Regression Technique
Generalization bound
Linear regression
Kernel ridge regression
Support vector regression
Lasso
Which one should I choose?
Should I use all of them?
It depends on
Model Selection
Assumptions
The predictors are linearly independent
The error is a random variable with a mean of zero conditional on the explanatory variables
The sample is representative of the population for the inference prediction
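To make the error assumption more concrete, the sketch below fits a one-predictor least-squares line in plain Python and inspects the residuals. The data points are made up, and `fit_line` is a hypothetical helper, not something from the slides.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]   # made-up data, roughly y = 2x + 1
intercept, slope = fit_line(xs, ys)

# Residual = observed value minus fitted value.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(slope, intercept, mean(residuals))
```

With an intercept in the model, the training residuals average to zero by construction, so this check alone cannot confirm the assumption. What matters is whether the residuals show a pattern when examined against the predictor, for example a curve or a widening spread.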
Interpretability
The understandability of why the model is true or how the model is induced
from
https://chenhaot.com/pubs/mldg-interpretability.pdf
Beware of Overfitting!
http://pingax.com/wp-content/uploads/2014/05/underfitting-overfitting.png
Model Assessment
Regression
(R)MSE
Mean Absolute Error
Correlation Coefficient
Classification
Accuracy
Precision
Recall
F-score
Descriptive
Std. Error
p-value
Confidence Interval
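The regression and classification metrics listed above are short enough to write out by hand. The following is a minimal standard-library sketch; the function names and the toy labels and predictions are invented for the example.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy regression targets and predictions.
reg_rmse = rmse([3, 5, 7, 9], [2, 5, 8, 11])
reg_mae = mae([3, 5, 7, 9], [2, 5, 8, 11])

# Toy binary labels and predictions.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
preds = [1, 0, 1, 0, 1, 0, 1, 0]
acc = accuracy(labels, preds)
prec, rec, f1 = precision_recall_f1(labels, preds)
print(reg_rmse, reg_mae, acc, prec, rec, f1)
```

RMSE penalizes large errors more heavily than MAE because the errors are squared before averaging. Accuracy can look good on imbalanced classes even when recall on the rare class is poor, which is why precision, recall, and F-score are reported alongside it.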
Evaluation
CRISP DM Methodology
Does my model solve the
problem?
What is the impact? Is it novel? How useful is the solution?
Deployment
CRISP DM Methodology
The Tasks
Plan deployment
Plan monitoring and maintenance
Produce final report
Review project
Tools & Resources
Text mining: NLTK, spaCy, OpenNLP
Query expansion & clustering: Carrot2, Weka
Data mining & machine learning: Weka, scikit-learn
Language: R, Python, Julia, Java, Matlab, Mathematica, Haskell, Scala
Python lib: Pandas, SciPy, NumPy, scikit-learn
Infrastructure: AWS, Hadoop, Google Cloud, Azure, Apache Spark
Visualization: D3.js
Community: Big Data & Open Data Indonesia
http://www.nltk.org/
http://honnibal.github.io/spaCy/
https://opennlp.apache.org/
http://project.carrot2.org/
http://www.cs.waikato.ac.nz/ml/weka/
http://scikit-learn.org/stable/
http://www.r-project.org/
https://www.python.org/
http://julialang.org/
https://www.oracle.com/java/index.html
http://www.mathworks.com/products/matlab/
http://www.wolfram.com/mathematica/
https://www.haskell.org/
http://www.scala-lang.org/
http://aws.amazon.com/
http://hadoop.apache.org/
https://cloud.google.com/
http://azure.microsoft.com/en-us/
https://spark.apache.org/
http://d3js.org/
Thank you!
Data Mining 101 Python-ID Meetup February 2015
Okiriza Wibisono - @okiriza
Ali Akbar Septiandri - @aliakbars