Machine Learning Introduction
Machine learning
All the techniques that we have seen until now allow us to build intelligent systems
The limitation of these systems is that they can only solve the problems they are programmed for
But we should only consider a system intelligent if it is also able to observe its environment and learn from it
Real intelligence resides in adaptation: being able to integrate new knowledge, to solve new problems, to learn from mistakes
(LSI-FIB-UPC) Artificial Intelligence, Term 2009/2010
Goals
The goal is not to model human learning
The goal is to overcome the limitations of usual AI applications (KBS, planning, NLP, problem solving, ...):
Their limit is in the knowledge that they have
Their capacities can not reach outside those limits
It is not possible to foresee all possible problems from the beginning
We are looking for programs that can adapt without being reprogrammed
Does it work?
Does Machine Learning Really Work? Tom Mitchell. AI Magazine 1997
Where and for what can machine learning be applied?
Tasks very difficult to program (face recognition, voice, ...)
Adaptable applications (intelligent interfaces, spam filters, recommendation systems, ...)
Data mining (intelligent data analysis)
Types of machine learning
Inductive Learning: Models are built from the generalization of examples. We look for patterns that explain the common characteristics of the examples.
Deductive Learning: Deduction is applied to obtain generalizations from a solved example and its explanation.
Genetic learning: Algorithms inspired by the theory of evolution are applied to find general descriptions of groups of examples.
Connectionist learning: Generalization is performed by the adaptation mechanisms of artificial neural networks.
Machine Learning Inductive Learning
Inductive learning
It is the area with the largest number of methods
Goal: To discover general rules or concepts from a limited set of examples (common patterns)
It is based on the search for similar characteristics among examples
All its methods are based on inductive reasoning
Inductive reasoning vs Deductive reasoning
Inductive reasoning
It obtains general knowledge from specific information
The knowledge obtained is new
It is not truth preserving (new information can invalidate the knowledge obtained)
It has no well-founded theory
Deductive reasoning
It obtains general knowledge from general knowledge
The knowledge is not new (it is implicit in the initial knowledge)
New knowledge can not invalidate the knowledge already obtained
Its basis is mathematical logic
Inductive learning
From a formal point of view its results are invalid
We suppose that a limited number of examples represents the characteristics of the concept that we want to learn
Just one counterexample invalidates the results
But, most of the human learning is inductive!
Machine Learning Search and inductive learning
Learning as search (I)
The usual way to view inductive learning is as a search problem
The goal is to discover a function/representation that summarizes the characteristics of a set of examples
The search space is all the possible concepts that can be built
There are different ways to perform the search
Learning as search (II)
Search space: Language used to describe the concepts =⇒ Set of concepts that can be described by the language
Search operators: Heuristic operators that allow us to explore the space of concepts
Heuristic function: Preference function that guides the search (bias)
Types of inductive learning
Supervised inductive learning
Each example is labeled with the concept it belongs to
Learning is performed by contrast among concepts
A set of heuristics allows us to generate different hypotheses
There is a criterion of preference (bias) that allows us to choose the hypothesis most suitable for the examples
Result: The concept or concepts that best describe the examples
Types of inductive learning
Unsupervised inductive learning
Examples are not labeled
We want to discover a suitable way to cluster the objects
Learning is based on the discovery of similarity/dissimilarity among examples
A heuristic preference criterion will guide the search
Result: A partition of the examples and a characterization of the partitions
Machine Learning Decision Trees
Decision Trees
We can learn a concept as the set of questions that allows us to distinguish it from others
Using a tree as representation formalism we can store and organize these questions
Each node of the tree is a question about an attribute
The search space is the set of all possible trees of questions
This representation is equivalent to a DNF formula (there are 2^(2^n) possible concepts over n binary attributes)
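That count can be verified by brute force (an illustrative enumeration, not part of the original slides): with n binary attributes there are 2^n possible attribute combinations, and a concept may label each combination positive or negative independently, giving 2^(2^n) distinct concepts.

```python
from itertools import product

def count_boolean_concepts(n):
    """Enumerate every possible truth table over n binary attributes."""
    rows = list(product([False, True], repeat=n))           # 2^n input rows
    tables = set(product([False, True], repeat=len(rows)))  # one label per row
    return len(tables)

print(count_boolean_concepts(2))  # 16 concepts = 2^(2^2)
print(count_boolean_concepts(3))  # 256 concepts = 2^(2^3)
```

This double exponential is why the bias discussed next is needed: the space is far too large to search exhaustively.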
Decision Trees
To reduce the computational cost of searching this space we must choose a bias (what kind of concepts are preferred)
Decision: The tree that gives the minimal description of the goal concept given a set of examples
Reason: Such a tree will be the best at predicting new instances (the probability that unnecessary conditions appear is reduced)
Occam’s razor: the hypothesis that introduces the fewest assumptions and postulates the fewest entities is to be preferred
Algorithms for Decision Trees
One of the first algorithms for building decision trees is ID3 (Quinlan 1986)
It is in the family of algorithms for Top Down Induction of Decision Trees (TDIDT)
ID3 performs a search using a Hill-Climbing strategy in the space of decision trees
For each level of the tree an attribute is chosen and the set of examples is split using the values of the attribute. This process is repeated recursively for each partition
The selection of the attribute is performed using a heuristic function
Information Theory
Information theory studies, among other things, the coding of messages and the cost of their transmission
If we define a set of messages M = {m1, m2, ..., mn}, each one with probability P(mi), we can define the quantity of information (I) that the set M contains as:
I(M) = \sum_{i=1}^{n} -P(m_i) \log(P(m_i))
This value can be interpreted as the information needed to discriminate the messages from M (number of bits necessary to code the messages)
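The definition can be sketched directly (an illustrative helper, not from the slides; base-2 logs so the result is in bits, with the usual convention that zero-probability terms contribute 0):

```python
import math

def information(probs):
    """I(M) = sum_i -P(m_i) * log2(P(m_i)), in bits."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Four equiprobable messages need 2 bits to discriminate:
print(information([0.25, 0.25, 0.25, 0.25]))  # 2.0
# A certain message carries no information:
print(information([1.0]))  # 0.0
```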
Quantity of information as heuristic
We can use an analogy from message coding, assuming that the classes are messages and the proportion of examples from each class is their probability
A decision tree can be seen as the coding that allows us to discriminate among classes
We are looking for the minimal code that discriminates among classes
Each attribute is evaluated to decide if it is a part of the code
An attribute is better than another if it allows us to discriminate better among classes
Quantity of information as heuristic
At each level of the tree we have to find the attribute that allows us to minimize the code (minimizes the size of the tree)
That is the attribute that leaves the least quantity of information to be covered by the remaining attributes
The selection of an attribute should result in subsets of examples that are biased towards one class
We need a measure of the quantity of information not covered by an attribute (Entropy, E)
Information Gain
Quantity of information (X - examples, C - classification, #S - cardinality of S)

I(X, C) = \sum_{c_i \in C} -\frac{\#c_i}{\#X} \log\left(\frac{\#c_i}{\#X}\right)

Entropy (A - attribute, [A(x) = v_i] - examples with value v_i)

E(X, A, C) = \sum_{v_i \in A} \frac{\#[A(x) = v_i]}{\#X} \, I([A(x) = v_i], C)

Information Gain

G(X, A, C) = I(X, C) - E(X, A, C)
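The three measures translate almost directly to Python (an illustrative sketch, not slide code; examples are represented here as (attribute-dict, class) pairs, an assumed encoding):

```python
import math
from collections import Counter

def info(examples):
    """I(X, C): information of the class distribution of the examples."""
    n = len(examples)
    counts = Counter(cls for _, cls in examples)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def entropy(examples, attr):
    """E(X, A, C): information left after splitting on attribute attr."""
    n = len(examples)
    total = 0.0
    for v in {attrs[attr] for attrs, _ in examples}:
        subset = [(a, c) for a, c in examples if a[attr] == v]
        total += len(subset) / n * info(subset)
    return total

def gain(examples, attr):
    """G(X, A, C) = I(X, C) - E(X, A, C)."""
    return info(examples) - entropy(examples, attr)

# An attribute that separates the classes perfectly gains all the information:
data = [({"a": "x"}, "+"), ({"a": "x"}, "+"),
        ({"a": "y"}, "-"), ({"a": "y"}, "-")]
print(gain(data, "a"))  # 1.0
```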
Information Gain
(Diagram: the example set X, with class counts C1 C2 C3 and information I(X,C), is split by attribute A into subsets for A=v1, A=v2, A=v3; the weighted information remaining in the subsets is E(X,A,C), and G(X,A,C) = I(X,C) − E(X,A,C))
ID3 Algorithm
Algorithm: ID3 (X : Examples, C: Classification, A: Attributes)
if all examples are from the same classthen
return a leave with the class nameelse
Compute the quantity of information of the examples (I)foreach attribute in A do
Compute the entropy (E) and the information gain (G)
Pick the attribute that maximizes G (a)Delete a from the list of attributes (A)Generate a root node for the attribute aforeach partition generated by the values of the attribute a do
Treei=ID3(X (a=vi ), C(a=vi ),A-a)generate a new branch with a=vi and Treei
return the root node for a
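The pseudocode above can be sketched as a small runnable program (an illustration under assumed conventions: examples as (attribute-dict, class) pairs, majority class when no attributes remain, gain recomputed inline):

```python
import math
from collections import Counter

def info(examples):
    n = len(examples)
    return sum(-(c / n) * math.log2(c / n)
               for c in Counter(cls for _, cls in examples).values())

def gain(examples, attr):
    n = len(examples)
    e = 0.0
    for v in {attrs[attr] for attrs, _ in examples}:
        subset = [(a, c) for a, c in examples if a[attr] == v]
        e += len(subset) / n * info(subset)
    return info(examples) - e

def id3(examples, attributes):
    """Return a class label (leaf) or (attribute, {value: subtree})."""
    classes = {cls for _, cls in examples}
    if len(classes) == 1:                 # all examples in the same class
        return classes.pop()
    if not attributes:                    # no attributes left: majority class
        return Counter(cls for _, cls in examples).most_common(1)[0][0]
    best = max(attributes, key=lambda attr: gain(examples, attr))
    rest = [attr for attr in attributes if attr != best]
    branches = {}
    for v in {attrs[best] for attrs, _ in examples}:
        subset = [(a, c) for a, c in examples if a[best] == v]
        branches[v] = id3(subset, rest)
    return (best, branches)

toy = [({"a": "x", "b": "p"}, "+"), ({"a": "x", "b": "q"}, "+"),
       ({"a": "y", "b": "p"}, "-"), ({"a": "y", "b": "q"}, "-")]
tree = id3(toy, ["a", "b"])
print(tree[0])  # a  (attribute "a" has gain 1, "b" has gain 0)
```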
Example (1)
Consider the following set of examples:

Ex.  Eyes   Hair    Height  Class
1    Blue   Blonde  Tall    +
2    Blue   Dark    Medium  +
3    Brown  Dark    Medium  −
4    Green  Dark    Medium  −
5    Green  Dark    Tall    +
6    Brown  Dark    Small   −
7    Green  Blonde  Small   −
8    Blue   Dark    Medium  +
Example (2)
I(X, C) = −1/2·log(1/2) − 1/2·log(1/2) = 1

E(X, eyes) = (blue) 3/8 · (−1·log(1) − 0·log(0))
           + (brown) 2/8 · (−1·log(1) − 0·log(0))
           + (green) 3/8 · (−1/3·log(1/3) − 2/3·log(2/3))
           = 0.344

E(X, hair) = (blonde) 2/8 · (−1/2·log(1/2) − 1/2·log(1/2))
           + (dark) 6/8 · (−1/2·log(1/2) − 1/2·log(1/2))
           = 1

E(X, height) = (tall) 2/8 · (−1·log(1) − 0·log(0))
             + (medium) 4/8 · (−1/2·log(1/2) − 1/2·log(1/2))
             + (small) 2/8 · (−0·log(0) − 1·log(1))
             = 0.5
Example (3)
We can see that the attribute eyes is the one that maximizes the function.
G(X, eyes) = 1 − 0.344 = 0.656 (*)
G(X, hair) = 1 − 1 = 0
G(X, height) = 1 − 0.5 = 0.5
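These values can be checked mechanically; a sketch that recomputes the three gains from the example table using the I/E/G definitions from the earlier slide (the tuple encoding and column indices are illustrative choices):

```python
import math
from collections import Counter

# (eyes, hair, height, class), transcribed from the example table
data = [
    ("blue",  "blonde", "tall",   "+"), ("blue",  "dark", "medium", "+"),
    ("brown", "dark",   "medium", "-"), ("green", "dark", "medium", "-"),
    ("green", "dark",   "tall",   "+"), ("brown", "dark", "small",  "-"),
    ("green", "blonde", "small",  "-"), ("blue",  "dark", "medium", "+"),
]

def info(rows):
    n = len(rows)
    return sum(-(c / n) * math.log2(c / n)
               for c in Counter(r[-1] for r in rows).values())

def gain(rows, col):
    n = len(rows)
    e = 0.0
    for v in {r[col] for r in rows}:
        subset = [r for r in rows if r[col] == v]
        e += len(subset) / n * info(subset)
    return info(rows) - e

for name, col in [("eyes", 0), ("hair", 1), ("height", 2)]:
    print(name, round(gain(data, col), 3))
# eyes 0.656
# hair 0.0
# height 0.5
```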
Example (4)
This attribute generates the first level of the tree:
(Tree diagram: root node EYES; BLUE → examples {1,2,8}, class +; BROWN → examples {3,6}, class −; GREEN → examples {4,5,7}, mixed: 4 and 7 are −, 5 is +)
Example (5)
Now only the node corresponding to the value green contains a mix of classes, so we repeat the process with these examples.

Ex.  Hair    Height  Class
4    Dark    Medium  −
5    Dark    Tall    +
7    Blonde  Small   −
Example (6)
I(X, C) = −1/3·log(1/3) − 2/3·log(2/3) = 0.918

E(X, hair) = (blonde) 1/3 · (−0·log(0) − 1·log(1))
           + (dark) 2/3 · (−1/2·log(1/2) − 1/2·log(1/2))
           = 0.666

E(X, height) = (tall) 1/3 · (−0·log(0) − 1·log(1))
             + (medium) 1/3 · (−1·log(1) − 0·log(0))
             + (small) 1/3 · (−0·log(0) − 1·log(1))
             = 0
Example (7)
Now the attribute with the maximum gain is height.

G(X, hair) = 0.918 − 0.666 = 0.252
G(X, height) = 0.918 − 0 = 0.918 (*)
Example (8)
The resulting tree is fully discriminant.
(Final tree diagram: root EYES; BLUE → {1,2,8}, class +; BROWN → {3,6}, class −; GREEN → node HEIGHT, with branches TALL → {5}, class +; MEDIUM → {4}, class −; SMALL → {7}, class −)