Knowledge Discovery and Data Mining 1 (VO) (706.701)

Denis Helic

ISDS, TU Graz

March 2, 2020

Denis Helic (ISDS, TU Graz) KDDM1 March 2, 2020 1 / 63

Lecturer

Name: Denis Helic
Office: ISDS, Petersgasse 116, Room 026
Office hours: Tuesday 12:00-13:00
Phone: +43-316/873-30610
email: [email protected]

Lecturer

Name: Roman Kern
Office: Know-Center, Inffeldgasse 13, 6th Floor, Room 072
Office hours: By appointment
Phone: +43-316/873-30860
email: [email protected]

Lecturer

Name: Tiago Santos
Office: ISDS, Inffeldgasse 16c, 1st Floor
Office hours: By appointment
Phone: +43-316/873-5607
email: [email protected]

Language

Lectures in English
Communication in German/English
Examination: German/English

Outline

1 Welcome and Introduction

2 Course Organization

3 Motivation

4 Course Overview

5 Course Highlights

6 Practical Part: KDDM1 KU

Welcome and Introduction

Teaching Data Science @ ISDS

Introduction to AI & Data Science
Computational Methods for Statistics
Data Analysis Courses:
    Knowledge Discovery and Data Mining 1 (Basics and theory)
    Knowledge Discovery and Data Mining 2 (Applications)
    Visual Analytics
Analysis of Web Systems & Data:
    Computational Social Systems I (Basics)
    Computational Social Systems II
    Network Science (Theory and applications)
Infrastructure:
    Data Management
    Architecture of Database Systems
    Architecture of Machine Learning Systems

Course context

Knowledge Discovery and Data Mining 1 (VO) (706.701)
Obligatory course in the Master Software Development and Business (1st semester)
Obligatory elective course in subject catalog “Knowledge Technologies” (Computer Science)
New major/minor system: Obligatory for Data Science and Intelligent Systems

Knowledge Discovery and Data Mining 1 (KU) (706.702)
New major/minor system: Obligatory for Data Science and Intelligent Systems
An add-on for the theoretical part

Goals of the course

The overall goal of KDDM and related courses is to learn how to discover patterns and models in data. We aim to discover patterns that are:

i Valid: hold for new data with high probability
ii Useful: we can base further actions on them
iii Unexpected: non-obvious
iv Understandable: humans can interpret them

Goals of the course: patterns example

1854 Broad Street cholera outbreak
Extracting clusters of cholera cases in the city of London in 1854. The cases clustered around certain road intersections in London. These intersections had contaminated water wells.

http://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak

Goals of the course

Specific goals of this course are to learn about two out of three basic elements of that discovery:

i Tools: advanced mathematical tools from probability theory, linear algebra, information theory, and statistical inference
ii Infrastructure: models of computation for large data (handled in other courses)
iii Process: steps that are needed to discover patterns

I assume here that you already know

i How to program and develop software
ii Mathematical basics from probability theory and linear algebra

Student goals: to pass the examination
Bonus goal for all: to have fun!

Course Organization

Course Calendar

02.03.2020: Course organization, Introduction and Motivation (Denis)
09.03.2020: Statistical Data Science (Roman)
16.03.2020: Feature Extraction (Roman)
23.03.2020: Feature Engineering (Roman)

30.03.2020: Data Matrices (Denis)
20.04.2020: Review of Linear Algebra (Denis) / Project presentations (KU)
27.04.2020: Partial Exam 1
04.05.2020: Principal Component Analysis (Denis)
11.05.2020: Singular Value Decomposition (Denis)

18.05.2020: Recommender Systems: Matrix Factorization (Denis)
25.05.2020: Topic Modeling and Non-negative Matrix Factorization (Denis)
08.06.2020: Clustering (Roman)
15.06.2020: Classification (Denis)
22.06.2020: Evaluation (Denis) / Project presentations (KU)
29.06.2020: Partial Exam 2 / Final Examination

Course Logistics

Course website: https://courses.isds.tugraz.at/dhelic/kddm1/index.html
Slides are (will be) available on the course website
Additional readings, references, links, etc. are also on the website
We expect that you have basic knowledge of probability theory and linear algebra
To refresh this knowledge you should solve these problems!
These problems are not graded!
As a side note: we also expect that you know how to program (relevant for the practical part)

Grading

Two partial examinations
Final examination at end of June
Two additional examination dates in summer semester
Three examination dates in winter semester
Examination material: lectures/slides/further readings
In class we will discuss sample examination questions

Partial Examinations

2 written examinations
At the beginning of a lecture: max 45 minutes
Each partial examination: 2 questions
Difficulty adjusted to solve both problems in approx. 30 minutes
Max 20 points for each question
Total points: 80

Examination

Written examination: 90 minutes
4 questions
Difficulty adjusted to solve all four in approx. 60 minutes
Max 20 points for each question
Total points: 80

Important!

If you take the partial examinations, this counts as one examination attempt
In other words: if your result in the partial examinations is negative, the negative grade will be entered in TUGOnline

Grading

0-40 points: 5
41-50 points: 4
51-60 points: 3
61-70 points: 2
71-80 points: 1

Important!

Worst case scenario I: if you get 0 points at the first partial examination you are negative!
In that case you can take the final examination in June
All together this will count as two attempts

Worst case scenario II: if you are negative after the second partial examination you need to take the exam in the winter term
All together this will count as two attempts
Therefore: take the partial examinations only if you follow the lecture and learn in parallel
That is also the advantage: you will learn as you go and not everything at once

KU Organization

KU organization today after this lecture
Two presentations

Questions?

Raise them now (+1 +1)
Ask after the lecture (+1)
Visit me in the office hours (+1)
Send me an e-mail (±1)
As a side note: you should(!) interrupt me immediately (+1 +1 +1) and ask any question you might have during the lecture

Motivation

How much information is being produced?

Figure: Source: https://www.domo.com/learn/data-never-sleeps-7

We are producing more data than we are able to store
We need to extract and describe useful data
Useful data ≪ all data
We can store useful data
We can also try to predict future data: store only the prediction model
It is a challenge but also an opportunity
For example, learn about human behavior, spread of diseases, political behavior, etc.

Course Overview

Examples

Amazon product recommendations
i Valid: holds for new data with high probability
ii Useful: users can find and explore new products
iii Unexpected: non-obvious and non-trivial
iv Understandable: related articles, etc.

Twitter earthquake and typhoon prediction
i Valid: holds for new data with high probability
ii Useful: can save lives
iii Unexpected: non-obvious and non-trivial
iv Understandable: trajectories of typhoons, positions of earthquakes

Knowledge discovery vs. data mining

Knowledge discovery refers to the entire process, of which knowledge is the end-product
It is iterative and interactive
Data mining refers to a specific step in this process
It is the step consisting of applying data analysis and discovery algorithms that produce a particular enumeration of patterns over data
Additional steps are necessary to ensure that the process produces useful knowledge

Steps in the knowledge discovery process

1 Developing an understanding of the application domain and the relevant prior knowledge, and identifying the goal of the KDD process from the customer's viewpoint

2 Creating a target data set: selecting a data set or focusing on a subset of variables or data samples on which discovery is to be performed

3 Data cleaning and preprocessing: basic operations such as the removal of noise; if appropriate, collecting the necessary information to model or account for noise; deciding on strategies for handling missing data fields; accounting for time sequence information and known changes

4 Data reduction and projection: finding useful features to represent the data depending on the goal of the task. Using dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data

5 Matching the goals of the KDD process (step 1) to a particular data mining method, e.g. summarization, classification, regression, clustering, etc.

6 Choosing the data mining algorithms: selecting methods to be used for searching for patterns in the data. This includes deciding which models and parameters may be appropriate (e.g. models for categorical data are different from models on vectors over the reals) and matching a particular data mining method with the overall criteria of the KDD process (e.g. the end user may be more interested in understanding the model than in its predictive capabilities)

7 Data mining: searching for patterns of interest in a particular representational form or a set of such representations: classification rules or trees, regression, clustering, and so forth. The user can significantly aid the data mining method by correctly performing the preceding steps

8 Interpreting mined patterns: possibly returning to any of the previous steps for further iteration. This step can also involve visualization of the extracted patterns and models, or visualization of the data given the extracted models

9 Consolidating discovered knowledge: incorporating this knowledge into another system for further action, or simply documenting it and reporting it to interested parties. This also includes checking for and resolving potential conflicts with previously believed or extracted knowledge

Reading!
Knowledge Discovery and Data Mining: Towards a Unifying Framework (1996), Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth

Quick example of what it all means

1 Collect training data (e.g. crawl, clean, preprocess)
2 Represent examples (e.g. decide which features, how to weight them, etc.)
3 Distance measure (e.g. what is close vs. what is not close)
4 Measure the goodness (e.g. objective function)
5 Select an approach (e.g. optimization method)

Day   Sunny  Normal Humidity  Strong Wind  Temp. (°C)  Play Tennis
Day1  Yes    No               Yes          12          No
Day2  No     No               No           18          No
Day3  Yes    Yes              No           21          Yes
Day4  Yes    Yes              No           28          Yes
Day5  Yes    Yes              No           19          ?

Table: Should I play tennis on Day5?
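
Before any distance can be computed, the rows of the table have to be turned into numeric vectors ("represent examples"). The slides do not state the encoding; the sketch below assumes Yes = 1 and No = 0 and keeps the temperature in °C as it is, which is consistent with the distance tables that follow (for instance √38 ≈ 6.164 between Day1 and Day2). The names `RAW_TABLE`, `encode`, `features` and `labels` are illustrative only.

```python
# Rows of the "Play Tennis" table as shown on the slide.
# Columns: day, sunny, normal humidity, strong wind, temperature (°C), label.
RAW_TABLE = [
    ("Day1", "Yes", "No",  "Yes", 12, "No"),
    ("Day2", "No",  "No",  "No",  18, "No"),
    ("Day3", "Yes", "Yes", "No",  21, "Yes"),
    ("Day4", "Yes", "Yes", "No",  28, "Yes"),
    ("Day5", "Yes", "Yes", "No",  19, None),  # label unknown: the day to predict
]


def encode(row):
    """Assumed encoding: Yes -> 1.0, No -> 0.0, temperature kept as a number."""
    _, sunny, humidity, wind, temp, _ = row
    to_num = {"Yes": 1.0, "No": 0.0}
    return [to_num[sunny], to_num[humidity], to_num[wind], float(temp)]


features = {row[0]: encode(row) for row in RAW_TABLE}
labels = {row[0]: row[5] for row in RAW_TABLE}

for day, vector in features.items():
    print(day, vector, labels[day])
```

With this encoding the three binary features differ by at most 1 between two days, while the temperature can differ by up to 16, so the temperature dominates any unscaled distance.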

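Definition
Minkowski distance of order $p$ between two vectors $\mathbf{x}$ and $\mathbf{y}$ from $\mathbb{R}^n$ is given by:

$$d_p(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} \qquad (1)$$

What is $d_2(\mathbf{x}, \mathbf{y})$? Euclidean distance
What is $d_1(\mathbf{x}, \mathbf{y})$? Manhattan distance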
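
The definition translates almost directly into code. The following is a minimal sketch in plain Python; the function name `minkowski` and the example vectors are chosen here for illustration and do not come from the slides.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p (p >= 1) between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


x = [0.0, 0.0]
y = [3.0, 4.0]

print(minkowski(x, y, 1))  # Manhattan distance: |3| + |4| = 7.0
print(minkowski(x, y, 2))  # Euclidean distance: sqrt(9 + 16) = 5.0
```

Setting `p = 1` sums the absolute coordinate differences, and `p = 2` gives the familiar straight-line distance.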

      Day1         Day2         Day3         Day4         Day5
Day1  0.           6.164414     9.11043358   16.0623784   7.14142843
Day2  6.164414     0.           3.31662479   10.09950494  1.73205081
Day3  9.11043358   3.31662479   0.           7.           2.
Day4  16.0623784   10.09950494  7.           0.           9.
Day5  7.14142843   1.73205081   2.           9.           0.

Table: Euclidean distances between days
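
The table above can be reproduced with a few lines of Python, assuming the Yes = 1 / No = 0 encoding (the slides do not state the encoding explicitly). The last step, looking up the labelled day closest to Day5, is one possible way to use the distances and is an illustration rather than something the slides prescribe.

```python
import math

# Assumed encoding: (sunny, normal humidity, strong wind, temperature in °C).
days = {
    "Day1": [1, 0, 1, 12],
    "Day2": [0, 0, 0, 18],
    "Day3": [1, 1, 0, 21],
    "Day4": [1, 1, 0, 28],
    "Day5": [1, 1, 0, 19],
}


def euclidean(x, y):
    """Euclidean distance, i.e. Minkowski distance of order 2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


names = list(days)
dist = {(a, b): euclidean(days[a], days[b]) for a in names for b in names}

# Print the full pairwise distance matrix, one row per day.
for a in names:
    print(a, " ".join(f"{dist[a, b]:11.8f}" for b in names))

# Labelled day (Day1..Day4) that is closest to the unlabelled Day5.
nearest = min((d for d in names if d != "Day5"), key=lambda d: dist["Day5", d])
print(nearest)  # Day2
```

On the raw features, Day5 is closest to Day2 (distance √3 ≈ 1.732), with Day3 a close second at 2.0.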

      Day1  Day2  Day3  Day4  Day5
Day1  0.    8.    11.   18.   9.
Day2  8.    0.    5.    12.   3.
Day3  11.   5.    0.    7.    2.
Day4  18.   12.   7.    0.    9.
Day5  9.    3.    2.    9.    0.

Table: Manhattan distances between days

      Day1        Day2        Day3        Day4        Day5
Day1  0.          3.72174037  3.66840663  4.47507277  3.5008579
Day2  3.72174037  0.          3.27940612  3.76436189  3.23329618
Day3  3.66840663  3.27940612  0.          1.35622245  0.38749213
Day4  4.47507277  3.76436189  1.35622245  0.          1.74371458
Day5  3.5008579   3.23329618  0.38749213  1.74371458  0.

Table: Euclidean distances between days (standardized features)
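
Standardizing means subtracting each feature's mean and dividing by its standard deviation before measuring distances. The sketch below assumes the Yes = 1 / No = 0 encoding and the population standard deviation computed over all five days; both are assumptions, but together they reproduce the values in the table above.

```python
import math
import statistics

# Assumed encoding: (sunny, normal humidity, strong wind, temperature in °C).
days = {
    "Day1": [1, 0, 1, 12],
    "Day2": [0, 0, 0, 18],
    "Day3": [1, 1, 0, 21],
    "Day4": [1, 1, 0, 28],
    "Day5": [1, 1, 0, 19],
}

# Per-feature mean and population standard deviation over all five days.
columns = list(zip(*days.values()))
means = [statistics.mean(col) for col in columns]
stds = [statistics.pstdev(col) for col in columns]

# z-score every feature: (value - mean) / standard deviation.
z = {
    name: [(v - m) / s for v, m, s in zip(vec, means, stds)]
    for name, vec in days.items()
}


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


for other in ("Day1", "Day2", "Day3", "Day4"):
    print(other, round(euclidean(z["Day5"], z[other]), 8))

nearest = min(("Day1", "Day2", "Day3", "Day4"),
              key=lambda d: euclidean(z["Day5"], z[d]))
print(nearest)  # Day3
```

After standardization the temperature no longer dominates: the labelled day closest to Day5 is now Day3 rather than Day2, so the choice of scaling changes which examples count as "close".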

Day   Sunny  Normal Humidity  Strong Wind  Temp. (°C)  Temp. (°F)  Play Tennis
Day1  Yes    No               Yes          12          53.6        No
Day2  No     No               No           18          64.4        No
Day3  Yes    Yes              No           21          69.8        Yes
Day4  Yes    Yes              No           28          82.4        Yes
Day5  Yes    Yes              No           19          66.2        ?

Table: Should I play tennis on Day5?

      Day1  Day2  Day3  Day4  Day5
Day1  0.    18.8  27.2  46.8  21.6
Day2  18.8  0.    10.4  30.   4.8
Day3  27.2  10.4  0.    19.6  5.6
Day4  46.8  30.   19.6  0.    25.2
Day5  21.6  4.8   5.6   25.2  0.

Table: Manhattan distances between days using Fahrenheit and Celsius
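
Adding the Fahrenheit column gives the temperature a second, redundant vote in the distance. The sketch below compares Manhattan distances with and without that column, again assuming the Yes = 1 / No = 0 encoding; the helper names are illustrative only.

```python
# Assumed encoding: (sunny, normal humidity, strong wind, temperature in °C).
days_c = {
    "Day1": [1, 0, 1, 12],
    "Day2": [0, 0, 0, 18],
    "Day3": [1, 1, 0, 21],
    "Day4": [1, 1, 0, 28],
    "Day5": [1, 1, 0, 19],
}


def manhattan(x, y):
    """Manhattan distance, i.e. Minkowski distance of order 1."""
    return sum(abs(a - b) for a, b in zip(x, y))


def with_fahrenheit(vec):
    """Append the temperature converted to °F as an extra, redundant feature."""
    return vec + [vec[3] * 9 / 5 + 32]


days_cf = {name: with_fahrenheit(vec) for name, vec in days_c.items()}


def nearest_to_day5(data):
    """Labelled day (Day1..Day4) with the smallest Manhattan distance to Day5."""
    candidates = [d for d in data if d != "Day5"]
    return min(candidates, key=lambda d: manhattan(data["Day5"], data[d]))


print(nearest_to_day5(days_c))   # Day3 (distance 2, versus 3 to Day2)
print(nearest_to_day5(days_cf))  # Day2 (about 4.8, versus about 5.6 to Day3)
```

The underlying information is identical in both cases, yet the nearest labelled day flips from Day3 to Day2 once the temperature is counted twice, which illustrates why feature selection and scaling matter before any distance is computed.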

Big picture: KDDM

Figure: the knowledge discovery process rests on mathematical tools (probability theory, linear algebra, information theory, statistical inference) and on infrastructure (Map-Reduce).

Teaching Data Science @ ISDS and KDDM1

KDDM1: Basics, theory and KDD process until step 8 (no visualization, no interpretation)
KDDM2: Implementation and practice of the theory from KDDM1
Visual Analytics: KDD process steps 8 and 9 (interpretation and visualization)
Network Science: graph and network mining
Computational Social Systems I & II: Web systems and social computation

Course Overview

Data mining and other fields

Data mining overlaps with:
  i Databases: Large-scale data, simple queries
  ii Machine learning: Small data, complex models, model parameters
  iii Statistics: Theory, predictive models, no algorithms

Data mining: Algorithms, simple and predictive models, large-scale data

Course Overview

Data mining and other fields

[Figure: diagram relating Statistics, Machine Learning, Databases and Data Mining]

Course Highlights

KDDM1: some thoughts

Big data, data science, …
Many buzzwords
But in the end:
  i You need to know how to program and how to develop software systems
  ii You need to understand the math behind data analysis: linear algebra, probability, statistics
If you have solid knowledge in (i) and (ii) you are in the top 5% of developers in the field ;)
For the years to come you will be earning a lot of money

Course Highlights

KDDM1: some highlights

Recommender systems: decompose the matrix and find out what your users like :)
PCA, SVD: reduce the dimensionality of the data
NMF: analyze relationships within a set of documents with linear algebra
Topic models: analyze relationships within a set of documents with probability theory
Bayesian inference

Practical Part: KDDM1 KU

Lecturer

Name: Tiago Santos
Office: ISDS, Inffeldgasse 16c/I, Room ID01104
Office hours: Tuesday 15:00-16:00
Phone: +43 316 873 5607
Email: [email protected]

Practical Part: KDDM1 KU

KDDM1 Course Practicals

Why should I be interested?
  To consolidate and reinforce your (theoretical) knowledge with practical hands-on experience
  Helps a lot with the partial and final examinations
  Good preparation for KDDM2
  If interested: possibility to develop a topic for a project or MSc thesis

Practical Part: KDDM1 KU

KDDM1 Course Practicals

Your task:
  Form small groups of 3 or 4 students
  Decide on an interesting practical or research question
  Decide which data you need for that question
  Crawl/download the data and work on your project
  Give two presentations (in English) on the progress and your final results
  Engage with the class and discuss the results of other groups

Practical Part: KDDM1 KU

KDDM1 Course Practicals

Topic ideas:
  Movie/Game/Music recommender system construction and analysis
  Sentiment analysis of user posts on social media (Twitter / Reddit / StackOverflow)
  Controversial topics in social media (statistical analysis, clustering, prediction, etc.)
  Hate speech topics in social media (statistical analysis, clustering, prediction, etc.)
  Your own idea! Discuss with Tiago

Very specific examples:
  Reproduce and extend “Quantifying the Advantage of Looking Forward” [1], an analysis which found a positive correlation between GDP per country and the quantity of searches for the future. Does this extend to any year and country? What about other search topics?
  Similar task but with “Parents mention sons more often than daughters on social media” [2]

[1] https://doi.org/10.1038/srep00350
[2] https://doi.org/10.1073/pnas.1804996116

Practical Part: KDDM1 KU

KDDM1 Course Practicals

Organizational details:
  Presentations take place on 20.04.2020 and 22.06.2020, 14:00-16:00, in room “HS i11”
  Presentations need to be sent to Tiago ([email protected]), in PDF format, by 23:59 of the day before
  Name the files with the last and first names of the students in the group like this: santos_tiago_mustermann_max_wurst_hans.pdf
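
A small helper can build such a file name from the group members. This is only a sketch: the function `presentation_filename` is hypothetical, and lowercasing the names and joining them with underscores are assumptions read off the example file name.

```python
def presentation_filename(members):
    """Build 'last_first_last_first.pdf' from (last_name, first_name) pairs.

    Assumption: names are lowercased and joined with underscores, as in the
    example file name; spaces inside a name are also turned into underscores.
    """
    parts = []
    for last, first in members:
        parts.append(last.strip().lower().replace(" ", "_"))
        parts.append(first.strip().lower().replace(" ", "_"))
    return "_".join(parts) + ".pdf"


group = [("Santos", "Tiago"), ("Mustermann", "Max"), ("Wurst", "Hans")]
print(presentation_filename(group))  # santos_tiago_mustermann_max_wurst_hans.pdf
```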

Practical Part: KDDM1 KU

KDDM1 Course Practicals

For the first presentation prepare three slides (5 min strict):
  First slide: Research/Practical Question
  Second slide: Dataset
  Third slide: Experimental Design

Practical Part: KDDM1 KU

KDDM1 Course Practicals

For the second presentation prepare five slides (10 min strict):
  First slide: Motivation
  Second slide: Methodology
  Third slide: Experimental Setup
  Fourth slide: Results
  Fifth slide: Discussion

Practical Part: KDDM1 KU

KDDM1 Course Practicals

Grading:
  Presentation (time restrictions will be taken into account)
  Results

Students who hand in the first presentation will be graded