Data Mining 101 Okiriza Wibisono - @okiriza Ali Akbar Septiandri - @aliakbars

  • Data Mining 101

    Okiriza Wibisono - @okiriza

    Ali Akbar Septiandri - @aliakbars

  • Outline

    Introduction

    Terminology

    Potential application

    Venn diagram

    Process overview

    Business understanding

    Data understanding (exploration)

    Data preparation (preprocessing)

    Modeling

    Evaluation

    Deployment (presentation)

    Tools & Resources

  • Introduction Terminology

    Data mining

    Knowledge Discovery

    in Databases

    Big data analytics

    Statistics

    Data science

  • The process of collecting,

    searching through, and analyzing

    a large amount of data in a

    database, as to discover patterns

    or relationships.

    Data Mining - dictionary.reference.com

  • Introduction Potential Application

    Customer segmentation

    Recommendation engine

    Social media mining

  • What should we do?

    Where to start? Do I have to get a master's degree in statistics?

  • http://tomfishburne.com.s3.amazonaws.com/site/wp-content/uploads/2014/01/140113.bigdata.jpg

  • Data Science Venn Diagram

    http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

  • And now the business process

  • CRISP-DM Methodology

    http://lyle.smu.edu/~mhd/8331f03/crisp.pdf

  • Business Understanding

    CRISP-DM Methodology

  • Objective Statement

    Bottom-up

    Top-down

  • Objective Statement

    Data vs Problem

  • Situation Assessment

    Inventory of Resources

    Requirements, Assumptions, and Constraints

    Risks and Contingencies

    Terminology

    Costs and Benefits

    http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

  • Situation Assessment

    Inventory of Resources

    Resource

    Data, Knowledge,

    Tools

    Hardware

    Personnel

    http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

  • Situation Assessment

    Requirements, Assumptions, and Constraints

    Requirements

    Scheduling

    Accuracy

    Security

    Assumptions

    Data quality

    External factors

    Reporting type

    Constraints

    Legal issues

    Budget

    Resources

    http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

  • Situation Assessment

    Risks and Contingencies

    Contingency Plan

    Financial

    Organizational

    Business

    http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

  • Situation Assessment

    Terminology

    Write down related terminology

    http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

    http://www.partnersmn.com/wp-content/uploads/2010/08/5b8567b2b4e2d1cfd1a31b2b8a0ecebc1.jpg

  • Situation Assessment

    Costs and Benefits

    Money, money, money!

    http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

    http://www.centuryproductsllc.com/wp-content/uploads/holding-money.jpg

  • How to evaluate the results?

    Define your success criteria!

  • Data Understanding

    CRISP-DM Methodology

  • Data Collection

    External vs Internal

  • Watch out!

  • visible accessible

    storable presentable

    Victor Lavrenko Text Technologies

    http://www.inf.ed.ac.uk/teaching/courses/tts/pdf/crawl-2x2.pdf

  • Data Exploration

    Visualization Heuristics

    Visualize fast. Visualize reactively.

    Go for high information 2D visualizations.

    Select data subsets to visualize.

    http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf

  • Data Exploration

    Visualization Heuristics

    Never let anomalies pass you by. Dig deeper.

    Use your visualizations to inform potential

    models. Use your potential model to direct your

    visualizations.

    Expect problems in your data.

    http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf

  • This is the cheapest and most

    informative stage of data

    mining.

    Nigel Goddard DME Visualization

    http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf

  • Data Exploration

    Visualization Tools

    Column/bar: Large change

    Line, curve: Small change, long periods

    Histogram: Frequency distribution

    https://nces.ed.gov/nceskids/help/user_guide/graph/whentouse.asp
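
    As a rough illustration of the histogram entry above, the sketch below bins a list of values and draws the frequency distribution as text. It uses only the Python standard library; the `histogram` and `render` helpers and the sample ages are made up for this example, and a plotting library would normally replace the text rendering.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count how many values fall into each bin of the given width."""
    counts = Counter()
    for v in values:
        # The bin label is the lower edge of the bin that contains v.
        lower_edge = (v // bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


def render(hist, bin_width):
    """Draw the histogram as text, one row per bin."""
    lines = []
    for lower_edge, count in hist.items():
        label = f"{lower_edge:>3}-{lower_edge + bin_width - 1:<3}"
        lines.append(f"{label} {'#' * count}")
    return "\n".join(lines)


# Hypothetical ages, made up for illustration.
ages = [21, 23, 25, 31, 34, 35, 36, 38, 42, 47, 58]
hist = histogram(ages, bin_width=10)
print(render(hist, bin_width=10))
```

    A quick text plot like this fits the "visualize fast" heuristic: it takes seconds to produce and already shows where the bulk of the data sits.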

  • Data Preparation

    CRISP-DM Methodology

  • Which one should I include

    (or exclude)?

    Data Selection

  • Data Cleaning

    Dirty Data

    Missing value

    Incomplete

    Outdated

    Duplication

    Outlier

    Remember: Expect problems in your data.
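
    Three of the problems listed above (duplication, missing values, outliers) can be sketched in a few lines. This is a minimal standard-library example, not a recommended pipeline: the `clean` and `iqr_outliers` helpers, the median imputation, the 1.5 x IQR rule, and the sample records are all assumptions chosen for illustration.

```python
from statistics import median, quantiles


def clean(records):
    """Drop exact duplicates, then fill missing prices with the median."""
    # Remove duplicates while keeping the first occurrence in order.
    seen = set()
    unique = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)

    # Impute missing prices (None) with the median of the known prices.
    known = [price for _, price in unique if price is not None]
    fill = median(known)
    return [(name, fill if price is None else price) for name, price in unique]


def iqr_outliers(values):
    """Return values further than 1.5 * IQR from the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]


# Hypothetical (house, price) records, made up for illustration.
raw = [("a", 100), ("b", 120), ("b", 120), ("c", None), ("d", 110), ("e", 900)]
cleaned = clean(raw)
outliers = iqr_outliers([price for _, price in cleaned])
print(cleaned)
print(outliers)
```

    Whether a flagged value is an error or a genuine extreme is a judgment call that depends on the business context, so the sketch only reports outliers and does not drop them.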

  • Data Construction

    Feature engineering: derived attributes, e.g.:

    year from timestamp

    quarter from timestamp

    BMI from weight and height

    Log(x) for skewed data (e.g. house price)
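
    The derived attributes above can be computed directly from raw fields. The sketch below uses only the standard library; the `derive_features` helper, its field names, and the sample record are hypothetical, and BMI is assumed to use kilograms and metres.

```python
import math
from datetime import datetime


def derive_features(timestamp, weight_kg, height_m, house_price):
    """Build the derived attributes listed above from raw fields."""
    ts = datetime.fromisoformat(timestamp)
    return {
        "year": ts.year,
        # Months 1-3 map to quarter 1, 4-6 to quarter 2, and so on.
        "quarter": (ts.month - 1) // 3 + 1,
        # BMI = weight in kilograms divided by height in metres squared.
        "bmi": weight_kg / height_m ** 2,
        # The natural log compresses the long right tail of skewed values.
        "log_price": math.log(house_price),
    }


# Hypothetical record, made up for illustration.
features = derive_features("2015-02-14T19:00:00", 70.0, 1.75, 250_000)
print(features)
```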

  • Data Splitting

    Two kinds of data splitting:

    Training-Validation-Testing

    Cross Validation

  • Data Splitting

    Training-Validation-Testing

    Training: construct classifier

    Validation: pick algorithm and knob settings (tree depth, k in kNN, C in SVM)

    Testing: estimate future error rate

    Split randomly to avoid bias

    http://www.inf.ed.ac.uk/teaching/courses/iaml/slides/eval-2x2.pdf
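
    A random three-way split can be sketched as follows. The `train_val_test_split` helper, the 60/20/20 proportions, and the fixed seed are assumptions for illustration; libraries such as scikit-learn provide their own splitting utilities.

```python
import random


def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows, then cut them into training, validation, and test sets."""
    shuffled = list(rows)
    # A seeded generator makes the random split reproducible.
    random.Random(seed).shuffle(shuffled)

    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


rows = list(range(10))  # stand-in for 10 labelled examples
train, val, test = train_val_test_split(rows)
print(len(train), len(val), len(test))
```

    The test set is touched only once, after the algorithm and its settings have been chosen on the validation set; otherwise the estimated error rate becomes optimistic.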

  • Data Splitting

    Cross Validation

    Every point is both training and testing, never at the same time
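
    The idea that every point serves as both training and testing data, but never in the same round, can be sketched with a k-fold index generator. The `k_fold_indices` helper is hypothetical; in practice the data would usually be shuffled before folding.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each index is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=7, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```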

  • Dimensionality Reduction

    Principal Component Analysis

    vs

    Linear Discriminant Analysis

  • Modeling

    CRISP-DM Methodology

  • Machine Learning

    Classification Regression Ranking Clustering

  • Model Selection

    Regression Technique

    Generalization bound

    Linear regression

    Kernel ridge regression

    Support vector regression

    Lasso

  • Which one should I choose?

    Should I use all of them?

  • It depends on

  • Model Selection

    Assumptions

    The predictors are linearly independent

    The error is a random variable with a mean of zero conditional on

    the explanatory variables

    The sample is representative of the population for the inference

    prediction

    Interpretability

    The understandability of why the model is true or how the model is induced

    from

    https://chenhaot.com/pubs/mldg-interpretability.pdf
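
    To make the assumptions and the interpretability point a little more concrete, the sketch below fits a one-predictor linear regression by ordinary least squares and inspects the residuals. The `fit_line` helper and the data are made up for illustration. Note that with an intercept the training residuals average to zero by construction, so a pattern in the residuals against the predictor is the more informative check.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data, made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(intercept, slope)
print(residuals)
```

    The fitted slope reads directly as "change in y per unit change in x", which is the kind of explanation that more flexible models make harder to give.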

  • Beware of Overfitting!

    http://pingax.com/wp-content/uploads/2014/05/underfitting-overfitting.png

  • Model Assessment

    Regression

    (R)MSE

    Mean Absolute Error

    Correlation Coefficient

    Classification

    Accuracy

    Precision

    Recall

    F-score

    Descriptive

    Std. Error

    p-value

    Confidence Interval
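
    Several of the measures above are short formulas. The sketch below computes MSE, RMSE, and mean absolute error for regression, and accuracy, precision, recall, and the F1 variant of the F-score for binary classification. The helper names and the sample predictions are hypothetical; scikit-learn ships tested versions of these metrics.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, and mean absolute error."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical predictions, made up for illustration.
reg = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
clf = classification_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
print(reg)
print(clf)
```

    Accuracy alone can mislead on imbalanced classes, which is why precision and recall are listed alongside it.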

  • Evaluation

    CRISP-DM Methodology

  • Does my model solve the

    problem?

    What is the impact? Is it novel? How useful is the solution?

  • Deployment

    CRISP-DM Methodology

  • The Tasks

    Plan deployment

    Plan monitoring and maintenance

    Produce final report

    Review project

  • Tools & Resources

    Text mining: NLTK, spaCy, OpenNLP

    Query expansion & clustering: Carrot2, Weka

    Data mining & machine learning: Weka, scikit-learn

    Language: R, Python, Julia, Java, Matlab, Mathematica, Haskell, Scala

    Python lib: Pandas, SciPy, NumPy, scikit-learn

    Infrastructure: AWS, Hadoop, Google Cloud, Azure, Apache Spark

    Visualization: D3.js

    Community: Big Data & Open Data Indonesia

    http://www.nltk.org/

    http://honnibal.github.io/spaCy/

    https://opennlp.apache.org/

    http://project.carrot2.org/

    http://www.cs.waikato.ac.nz/ml/weka/

    http://scikit-learn.org/stable/

    http://www.r-project.org/

    https://www.python.org/

    http://julialang.org/

    https://www.oracle.com/java/index.html

    http://www.mathworks.com/products/matlab/

    http://www.wolfram.com/mathematica/

    https://www.haskell.org/

    http://www.scala-lang.org/

    http://aws.amazon.com/

    http://hadoop.apache.org/

    https://cloud.google.com/

    http://azure.microsoft.com/en-us/

    https://spark.apache.org/

    http://d3js.org/

  • Thank you!

    Data Mining 101 Python-ID Meetup February 2015

    Okiriza Wibisono - @okiriza

    Ali Akbar Septiandri - @aliakbars