Mineração de Dados Aplicada - Universidade Federal de...

Mineração de Dados Aplicada

The Pattern Discovery Process

Loïc Cerf

August 7th, 2017

DCC – ICEx – UFMG

Practical matters

hello, world

I am:

Loïc Cerf;

French;

Still learning Portuguese;

Your teacher for this course;

A free software advocate.

The Web page of the course, which hosts these slides, is http://dcc.ufmg.br/~lcerf/pt/mda.html.

Organization of the course

During the sessions, there will be:

“Theory”;

Practice;

Your (inter)active involvement.

Following the ICEx rule, a student who misses more than 25% of the course fails it.

The subject is elective but, to pass it, the work is compulsory!

“Theory”

The “theory” will:

mainly give “big pictures”;

decrease (in volume) over the sessions;

be adapted to your practical needs.

Practice

The practice will:

be done within a data mining platform;

consist of a few practical exercises;

mainly consist of projects in groups of three students, with a different and freely chosen dataset for each group.

Assessment

The final mark will be based on:

an exam on the POSIX text-processing commands (20 points);

random questions on the content of the course (20 points);

a non-trivial step of the project (10 points);

the rest of the project (40 points);

the clarity of a 12-page report (5 points);

the clarity of a 20-minute presentation (5 points).

Individual adjustments will be based on:

questions about the project (in particular, justifications);

the help given to other groups.

Collaboration

Collaboration is good.

Every group will regularly present its current progress (five minutes). The other students are invited to help with remarks and suggestions.

A little bit of psychology

You (and I) have a natural tendency to prefer easy activities with short-term rewards (watching blockbusters vs. watching documentaries, writing tweets vs. writing an article, eating candy vs. eating healthily, playing video games vs. doing homework, etc.).

The solution is not in time management but in “metacognition”. To succeed in a project (in life?), the most important factor may not be intelligence but resistance to immediate desires.

Imposing regular work trains this ability and, for sure, the resulting work will be better.

Outline

1 Many perspectives on data mining

2 The pattern discovery process

Many perspectives on data mining

Epistemological Perspective

Knowledge passes through patterns;

It is acquired from data;

It is assessed by quality measures.

Data mining is a form of empiricism. Notice, however, that it is necessarily biased (choice of a type of model/pattern, of a quality measure to maximize, etc.).

Data Mining Perspective

From data to databases to data warehouses to patterns.

Knowledge arises from the organization of the data.

Typical data-mining task

Local pattern discovery: enumerating subsets of a dataset that stand out from the rest of it.

Inductive Databases

Querying data: {d ∈ D | q(d, D)}

where:

D is a dataset (tuples),

q is a query.
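The set-builder notation above maps directly onto a comprehension. The following is only a minimal sketch: the dataset, its attributes and the query are hypothetical, chosen to show why q may need the whole dataset D and not only the tuple d.

```python
# Hypothetical toy dataset D: each tuple is (name, age, city).
D = [
    ("Ana", 23, "Belo Horizonte"),
    ("Bruno", 31, "Ouro Preto"),
    ("Carla", 27, "Belo Horizonte"),
    ("Davi", 35, "Contagem"),
]


def q(d, D):
    """Example query: keep the tuples whose age is above the average age in D."""
    mean_age = sum(t[1] for t in D) / len(D)
    return d[1] > mean_age


# {d ∈ D | q(d, D)} written as a comprehension.
answer = [d for d in D if q(d, D)]
print(answer)
```

The answer is always a subset of the tuples already stored in D: the query selects data, it does not build anything new.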

Querying patterns: {p ∈ P | Q(p, D)}

where:

P is the pattern space,

D is the dataset,

Q is an inductive query.
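As an illustration of an inductive query, the sketch below assumes that patterns are itemsets and that Q asks for a minimal support. These choices, the toy transactions and the names support and min_support are assumptions made for the example, not definitions from the course.

```python
from itertools import combinations

# Hypothetical dataset D: each transaction is a set of items.
D = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

# Pattern space P: every non-empty itemset over the items occurring in D.
items = sorted(set().union(*D))
P = [frozenset(c) for n in range(1, len(items) + 1)
     for c in combinations(items, n)]


def support(p, D):
    """Number of transactions of D that contain every item of p."""
    return sum(1 for t in D if p <= t)


def Q(p, D, min_support=2):
    """Example inductive query: p is frequent in D."""
    return support(p, D) >= min_support


# {p ∈ P | Q(p, D)} written as a comprehension.
frequent = [p for p in P if Q(p, D)]
for p in frequent:
    print(sorted(p), support(p, D))
```

Unlike a data query, the answer ranges over the pattern space P, which grows exponentially with the number of items. Enumerating P exhaustively, as done here, is only feasible on toy data.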

Machine Learning Perspective

Patterns are an abstraction of the data. They are information an algorithm can learn from.

Machine learning focuses on the different ways to learn from data. It is the artificial intelligence side of data mining. It has strong ties with computational statistics and mathematical optimization.

Typical machine-learning task

Supervised classification: learning, from the descriptions of classified objects, what characterizes every class.
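To make the task concrete, here is a minimal sketch using a nearest-centroid classifier, one of the simplest possible choices: what "characterizes" a class is assumed to be the mean of its members' descriptions. The training data and the function names are hypothetical.

```python
from collections import defaultdict
from math import dist

# Hypothetical training set: (description, class) pairs,
# where a description is a (height_cm, weight_kg) vector.
training = [
    ((150.0, 50.0), "small"),
    ((155.0, 55.0), "small"),
    ((180.0, 80.0), "large"),
    ((185.0, 90.0), "large"),
]


def learn_centroids(examples):
    """Learn what characterizes each class: here, its mean description."""
    grouped = defaultdict(list)
    for description, label in examples:
        grouped[label].append(description)
    return {
        label: tuple(sum(col) / len(col) for col in zip(*descriptions))
        for label, descriptions in grouped.items()
    }


def classify(description, centroids):
    """Predict the class whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(description, centroids[label]))


centroids = learn_centroids(training)
print(centroids)
print(classify((160.0, 58.0), centroids))
```

The learned centroids are the model; classifying a new object only requires comparing its description with them.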

Computational Statistics Perspective

Patterns are statistics that summarize the data.

Computational statistics aims to design efficient algorithms that implement statistical methods.

Typical computational statistics task

Representative-based clustering: summarizing data as a mixture of distributions.
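A minimal sketch of one common representative-based method, k-means (Lloyd's algorithm), restricted to one-dimensional data. It is a simplification of the mixture view: each cluster is summarized only by its mean, not by a full distribution. The data and the initial centers are hypothetical.

```python
def k_means_1d(values, centers, iterations=10):
    """Lloyd's algorithm on 1-D data, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: each value joins its closest representative.
        clusters = [[] for _ in centers]
        for v in values:
            closest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[closest].append(v)
        # Update step: each representative becomes the mean of its cluster
        # (an empty cluster keeps its previous representative).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical measurements forming two visible groups.
values = [1.0, 1.5, 0.5, 5.0, 5.5, 4.5]
centers, clusters = k_means_1d(values, centers=[0.0, 6.0])
print(centers)
print(clusters)
```

The two returned centers are the summary of the six values. The result depends on the initial centers, which is one reason why the choice of method and parameters biases what is found.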

Big Data perspective

Thanks to ever greater computing capabilities, large datasets can be stored. Big data analytics is about understanding these large amounts of data (which usually requires parallel computing).

Bioinformatics

Genomes are now easily sequenced. The current challenge is to understand the expression mechanism (from genomics to proteomics to phenotypes). DNA chips give the expression levels of tens of thousands of genes in different samples.

Data mining methods are designed with time and space complexities in mind.

Data Science Perspective

Data science takes an application perspective. It encompasses everything that can be done on data from a specific field, hence machine learning, computational statistics and data mining. It emphasizes the necessity to understand the application domain.

Business Intelligence Perspective

Patterns allow a better understanding of the “activity” of a company, hence better decisions by its managers.

Besides data mining, business intelligence emphasizes heterogeneous data, reporting, visualization, etc.

Big Brother Perspective

Knowing everything about everyone is a new business...

... and a political threat. The mere collection of large amounts of personal data in centralized repositories is unethical: these data will eventually be misused. Data anonymization techniques can be employed.

Be responsible!
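As one illustration of the anonymization techniques mentioned above, the sketch below generalizes quasi-identifiers and then checks k-anonymity. The choice of k-anonymity, the records and the generalization rules are all assumptions made for the example; this is not presented as the technique the course has in mind.

```python
from collections import Counter

# Hypothetical records: (age, postal_code, diagnosis).
records = [
    (23, "30110", "flu"),
    (27, "30120", "asthma"),
    (34, "31270", "flu"),
    (38, "31275", "diabetes"),
]


def generalize(record):
    """Coarsen the quasi-identifiers: age decade and postal-code prefix."""
    age, postal_code, diagnosis = record
    decade = age // 10 * 10
    return (f"{decade}-{decade + 9}", postal_code[:2] + "***", diagnosis)


def is_k_anonymous(rows, k):
    """True if every (age, postal code) combination occurs at least k times."""
    counts = Counter((age, postal_code) for age, postal_code, _ in rows)
    return all(n >= k for n in counts.values())


anonymized = [generalize(r) for r in records]
print(is_k_anonymous(records, 2))
print(is_k_anonymous(anonymized, 2))
```

Generalization trades precision for privacy: the coarser the released attributes, the harder it is to single out one person, and the less detailed the patterns that can still be mined.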

Buzz words

Data mining is also known as information discovery, knowledge discovery, Knowledge Discovery in Databases (KDD), data analytics, cognitive computing, etc.

Absence of a Unified Theory

Data mining is a collection of tasks. Each of them can be solved by various techniques. These techniques are applicable to particular types of data only.

There is no unified theory of data mining.

The pattern discovery process

The naive view of pattern discovery

Raw data → Extraction → Patterns

© 2005 Tim Morgan (from flickr®)

These icons are licensed under the Creative Commons Attribution 2.0 License.

The pattern discovery process

Raw data → Pre-process → Data → Extraction → Patterns

© 2005 Tim Morgan (from flickr®)

These icons are licensed under the Creative Commons Attribution 2.0 License.
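The process pictured above can be sketched as two functions chained together. The raw lines, the cleaning rules and the notion of pattern (items occurring in at least min_support rows) are hypothetical, chosen only to show why a pre-processing step sits between the raw data and the extraction.

```python
from collections import Counter

# Hypothetical raw data: inconsistent case, stray spaces, an empty line.
raw_data = [
    "Bread, Milk",
    "  milk,BUTTER ",
    "",
    "bread,butter,milk",
]


def pre_process(lines):
    """Raw data -> data: normalize every line into a set of clean items."""
    data = []
    for line in lines:
        items = {item.strip().lower() for item in line.split(",") if item.strip()}
        if items:
            data.append(items)
    return data


def extract(data, min_support=2):
    """Data -> patterns: here, the items occurring in at least min_support rows."""
    counts = Counter(item for row in data for item in row)
    return {item: n for item, n in counts.items() if n >= min_support}


data = pre_process(raw_data)
patterns = extract(data)
print(data)
print(patterns)
```

Counting directly on the raw lines would treat "Milk" and "milk" as different items, which is what the pre-processing step avoids.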

License

© 2011–2017 Loïc Cerf

These slides are licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.
