Document Analysis: Fundamentals of Pattern Recognition
Prof. Rolf Ingold, University of Fribourg
Master course, spring semester 2008
Outline
- Introduction
- Feature extraction and decision
- Role of training
- Feature selection
- Example: font recognition
- Bayesian decision theory
- Evaluation
Goals of Pattern Recognition
Pattern recognition aims at discovering and identifying patterns in raw data:
- it consists of assigning symbols to data (patterns)
- it is based on a priori knowledge, often statistical information

Pattern recognition is used for computer perception (image/sound analysis):
- in a preliminary step, a sensor captures raw information
- this information is then interpreted in order to take decisions

Pattern recognition can be thought of as a methodical way of reducing information in order to keep only the relevant meaning.
Pattern Recognition Applications
Pattern recognition is involved in many applications:
- seismological survey
- speech recognition
- scientific imagery (biology, health care, physics, ...)
- satellite-based observation (military and civil applications, ...)
- document analysis, with several components:
  - optical character recognition (OCR)
  - font identification
  - handwriting recognition (off-line)
  - graphics recognition
- computer vision (3D scene analysis)
- biometry: person identification and authentication
- ...

Pattern recognition methodologies rely on other scientific domains: statistics, operations research, graph theory, artificial intelligence, ...
Origin of Difficulties
Pattern recognition is mainly an information overload problem. The difficulty stems from:
- the variability of objects belonging to the same class
- the distortion of captured data (noise, degradations, ...)
Steps Involved in Pattern Recognition
Pattern recognition is basically a two-stage process:
- feature extraction, aiming at removing redundancy while keeping significant information
- classification, consisting in making a decision by associating a class label

[Figure: an observation is reduced to a feature vector, e.g. (123.0, 789.6, 345.12), which is then mapped to a class.]
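This two-stage view translates directly into code. The following minimal sketch uses an invented three-feature extractor and a nearest-prototype classifier; all names and numbers are illustrative, not the ones used later in the course.

    # Sketch: observation -> feature vector -> class label.
    def extract_features(image):
        """Reduce a binary image (list of 0/1 rows) to a small feature vector."""
        height = len(image)
        width = len(image[0]) if image else 0
        weight = sum(sum(row) for row in image)   # number of black pixels
        return (width, height, weight)

    def classify(features, models):
        """Assign the label of the closest class model (squared Euclidean distance)."""
        def dist(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))
        return min(models, key=lambda label: dist(features, models[label]))

    # Toy usage: two hypothetical class prototypes and a tiny image.
    models = {"A": (3, 2, 4), "B": (3, 2, 6)}
    sample = [[0, 1, 1], [1, 1, 0]]
    print(classify(extract_features(sample), models))   # -> "A"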
Role of Training
[Diagram: features extracted from the input are passed to the decision stage; training data is used beforehand to build the class models that drive the decision.]

Classifiers (tools that perform classification tasks) are generally designed to be trained:
- each class is characterized by a model
- models are built with representative training data
Supervised vs. Unsupervised Training
Two different situations may occur regarding training material:

Supervised training is performed when the training samples are labeled with the class they belong to; each class ωi is associated with a set of training samples

T_i = \{x_{i1}, x_{i2}, \dots, x_{iN_i}\}

supposed to be statistically representative of the class.

Unsupervised training is performed when the training samples are statistically representative but mixed over all classes:

T = \{x_1, x_2, \dots, x_n\}
Feature Selection
Features are selected according to the application.

Features should be chosen carefully by considering:
- discrimination power between classes
- robustness to intra-class distortions and noise
- global statistical independence (spread over the entire feature space)
- fast computation
- reasonable dimension (number of features)
Features for Character Recognition
Given a binary image of a character, many features can be used for character recognition:
- size, i.e., width and height of the bounding box
- position of the baseline (if available)
- weight (number of black pixels)
- perimeter (length of the contours)
- center of gravity
- moments (second and third order in both directions)
- distributions of horizontal and vertical runs
- number of intersections with a (possibly random) set of lines
- length and structure (singular points, holes) of the skeleton
- local features computed on sub-images
- ...
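A few of these features can be computed in a handful of lines. The following sketch assumes a numpy array where 1 marks a black pixel (an assumption, since the slides do not fix an image convention):

    import numpy as np

    def char_features(img):
        """Weight, bounding box, and center of gravity of a binary character (1 = black)."""
        ys, xs = np.nonzero(img)                  # coordinates of black pixels
        width = xs.max() - xs.min() + 1           # bounding-box width
        height = ys.max() - ys.min() + 1          # bounding-box height
        weight = len(xs)                          # number of black pixels
        cog = (xs.mean(), ys.mean())              # center of gravity
        return width, height, weight, cog

    img = np.array([[0, 1, 1, 0],
                    [0, 1, 0, 0],
                    [0, 1, 1, 0]])
    print(char_features(img))                     # -> (2, 3, 5, (1.4, 1.0))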
Font Recognition: Goal
Goal: recognize the fonts of synthetically generated isolated words, given as binary (black & white) or grey-level images at 300 dpi.

12 standard font classes are considered:
- 3 families: Arial, Courier New, Times New Roman
- 4 styles: plain, italic, bold, bold italic
- a single size: 12 pt
Font Recognition: Extracted Features
Words are segmented with a surrounding white border of 1 pixel.

Some preprocessing steps are used:
- horizontal projection profile (hp)
- derivative of the horizontal projection profile (hpd)

The following features are calculated:
- hp-mean (or density): mean of hp
- hpd-stdev (or slanting): standard deviation of hpd
- hr-mean: mean of horizontal runs (up to length 12)
- hr-stdev: standard deviation of horizontal runs (up to length 12)
- vr-mean: mean of vertical runs (up to length 12)
- vr-stdev: standard deviation of vertical runs (up to length 12)
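These measurements are easy to reproduce. Below is a sketch for a numpy binary word image (1 = black); capping run lengths at 12 is one plausible reading of "up to length 12", not something the slides state explicitly:

    import numpy as np

    def font_features(img, max_run=12):
        """hp/hpd statistics and run-length statistics of a binary word image."""
        hp = img.sum(axis=1)                      # horizontal projection profile
        hpd = np.diff(hp.astype(float))           # derivative of the profile

        def run_lengths(rows):                    # black-run lengths, capped at max_run
            out = []
            for row in rows:
                n = 0
                for v in row:
                    if v:
                        n += 1
                    elif n:
                        out.append(min(n, max_run)); n = 0
                if n:
                    out.append(min(n, max_run))
            return np.array(out, dtype=float)

        hr, vr = run_lengths(img), run_lengths(img.T)
        return {"hp-mean": hp.mean(), "hpd-stdev": hpd.std(),
                "hr-mean": hr.mean(), "hr-stdev": hr.std(),
                "vr-mean": vr.mean(), "vr-stdev": vr.std()}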
Font Recognition: Illustration of Features
Basic image processing features used are:
- the horizontal projection profile
- the distribution of horizontal runs (from 1 to 11)
- the distribution of vertical runs (from 1 to 11)
Font Recognition: decision boundaries on single feature (1)
Some single features are highly discriminant for some font sets:
- hpd-stdev discriminates roman and italic fonts
- hr-mean discriminates normal and bold fonts

[Figure: histograms of the two features (value range roughly 10 to 60) showing the separation between the font sets.]
Font Recognition: decision boundaries on single feature (2)
Other features may partly discriminate font sets:
- hr-mean can partly discriminate Arial, Courier, and Times

[Figure: histogram of hr-mean values (range roughly 10 to 40) for the three font families.]
Font Recognition: decision boundaries on multiple features (1)
By combining two features, font discrimination is improved:
- the pair (hpd-stdev, vr-stdev) discriminates roman and italic fonts

[Figure: scatter plot of vr-stdev (about 0.5 to 3) versus hpd-stdev (about 2.5 to 15), showing two separable clusters.]
Font Recognition: decision boundaries on multiple features (2)

Font family discrimination (Arial, Courier, and Times) becomes possible by combining several couples of features.

[Figure: scatter plots of several feature pairs, among them hp-mean, vr-mean, hr-mean, and vr-stdev, showing the three family clusters.]
Bayesian Decision Theory
Bayesian decision makes the assumption that all information contributing to the decision can be stated in the form of probabilities:
- P(ωi): the a priori probability (or prior) of each class
- p(x|ωi): the class-conditional density function of the feature vector x, also called the likelihood of class ωi with respect to x

The goal is to determine the class ωi for which the a posteriori probability (or posterior) P(ωi|x) is the highest.
Bayesian Rule
The Bayes rule allows the a posteriori probability of each class to be calculated as a function of priors and likelihoods:

P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{p(\mathbf{x})}

where p(x) is called the evidence and can be considered as a normalization factor, i.e.,

p(\mathbf{x}) = \sum_j p(\mathbf{x} \mid \omega_j)\, P(\omega_j)

which ensures that the posteriors sum to one:

\sum_i P(\omega_i \mid \mathbf{x}) = \frac{\sum_i p(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{\sum_j p(\mathbf{x} \mid \omega_j)\, P(\omega_j)} = 1
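Numerically the rule is a one-liner. The sketch below uses invented priors and likelihood values for two classes; it also previews the next slide by showing how a change of priors shifts the posteriors:

    import numpy as np

    def posteriors(priors, likelihoods):
        """Bayes rule: P(w_i|x) = p(x|w_i) * P(w_i) / p(x)."""
        joint = np.asarray(likelihoods) * np.asarray(priors)
        return joint / joint.sum()                # division by the evidence p(x)

    # Hypothetical likelihoods p(x|w_1), p(x|w_2) at some observed x.
    print(posteriors([0.5, 0.5], [0.9, 0.3]))     # -> [0.75 0.25]
    print(posteriors([0.1, 0.9], [0.9, 0.3]))     # -> [0.25 0.75]: priors matter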
Influence of Posterior Probabilities
Example with a single feature: posterior probabilities in two different cases regarding a priori probabilities.

[Figure: two columns of plots, one for P(ω1) = 0.5, P(ω2) = 0.5 and one for P(ω1) = 0.1, P(ω2) = 0.9, showing the likelihoods p(x|ωi), the scaled likelihoods p(x|ωi)·P(ωi), and the resulting posteriors P(ωi|x) over the feature axis.]
Probability of Error
Given a feature x of a given sample, the probability of error for a decision δ(x) = ωi is equal to

P(error \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)

The overall probability of error is given by

P(error) = \int P(error, x)\, dx = \int P(error \mid x)\, p(x)\, dx
Optimal Decision Boundaries
The minimal error is obtained by the decision δ(x) = ωi with

P(\omega_i \mid x) \geq P(\omega_j \mid x) \quad \forall j
Decision Theory
In the simplest case, a decision consists in assigning to an observation x a class label ωi = δ(x).

A natural extension consists in adding a "rejection class" ωR, so that δ(x) = ωR is also allowed.

In the most general case, the decision results in an action αi = α(x).
Optimal Decision Theory
Let us consider a loss function λij = λ(αi|ωj), defining the loss incurred by taking action αi when the true state of nature is ωj; usually

\lambda_{ii} = 0 \quad \text{and} \quad \lambda_{ij} \geq 0 \;\; \forall i \neq j

The risk of taking an action αi for a particular sample x is

R(\alpha_i \mid \mathbf{x}) = \sum_j \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x})

The optimal decision consists in choosing the αi that minimizes the risk, i.e., such that

R(\alpha_i \mid \mathbf{x}) \leq R(\alpha_j \mid \mathbf{x}) \quad \forall j \neq i
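The risk computation is just a matrix-vector product. In the sketch below the loss matrix and the posteriors are made-up numbers; note how an asymmetric loss can overrule the larger posterior:

    import numpy as np

    def best_action(loss, post):
        """Pick the action minimizing R(a_i|x) = sum_j loss[i, j] * P(w_j|x)."""
        risks = loss @ post                       # conditional risk of each action
        return int(np.argmin(risks)), risks

    loss = np.array([[0.0, 1.0],                  # lambda_ij: rows = actions,
                     [5.0, 0.0]])                 # columns = true states of nature
    post = np.array([0.3, 0.7])                   # P(w_1|x), P(w_2|x)
    print(best_action(loss, post))                # -> (0, [0.7, 1.5])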
Optimal decision
When λii = 0 and λij = 1 for all j ≠ i, the optimal decision consists of minimizing the probability of error.

The minimal error is obtained by the decision δ(x) = ωi with

P(\omega_i \mid \mathbf{x}) \geq P(\omega_j \mid \mathbf{x}) \quad \forall j

or equivalently

p(\mathbf{x} \mid \omega_i)\, P(\omega_i) \geq p(\mathbf{x} \mid \omega_j)\, P(\omega_j) \quad \forall j

In the case where all a priori probabilities are equal, this reduces to

p(\mathbf{x} \mid \omega_i) \geq p(\mathbf{x} \mid \omega_j) \quad \forall j
Minimum Risk for Two Classes
Let λij = λ(αi|ωj) be the loss of action αi when the true state is ωj.

The conditional risks of each decision are expressed as

R(\alpha_1 \mid \mathbf{x}) = \lambda_{11}\, P(\omega_1 \mid \mathbf{x}) + \lambda_{12}\, P(\omega_2 \mid \mathbf{x})
R(\alpha_2 \mid \mathbf{x}) = \lambda_{21}\, P(\omega_1 \mid \mathbf{x}) + \lambda_{22}\, P(\omega_2 \mid \mathbf{x})

Then the optimal decision rule becomes: decide δ1 if

(\lambda_{21} - \lambda_{11})\, P(\omega_1 \mid \mathbf{x}) > (\lambda_{12} - \lambda_{22})\, P(\omega_2 \mid \mathbf{x})

else decide δ2; or equivalently, decide δ1 if

\frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}

else decide δ2.

In the case of λ11 = λ22 = 0, this simplifies to: decide δ1 if

\lambda_{21}\, P(\omega_1 \mid \mathbf{x}) > \lambda_{12}\, P(\omega_2 \mid \mathbf{x})

else decide δ2, or equivalently, decide δ1 if

\frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} > \frac{\lambda_{12}}{\lambda_{21}} \cdot \frac{P(\omega_2)}{P(\omega_1)}

else decide δ2.
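The likelihood-ratio form maps directly to code. All values in the sketch are invented; with λ11 = λ22 = 0 the threshold reduces to (λ12/λ21)·(P(ω2)/P(ω1)):

    def decide_two_class(px_w1, px_w2, p1, p2, l12, l21, l11=0.0, l22=0.0):
        """Decide class 1 iff the likelihood ratio exceeds the loss-weighted threshold."""
        threshold = ((l12 - l22) / (l21 - l11)) * (p2 / p1)
        return 1 if px_w1 / px_w2 > threshold else 2

    # Hypothetical values: class 1 is three times as likely at x, priors equal.
    print(decide_two_class(0.6, 0.2, p1=0.5, p2=0.5, l12=1.0, l21=2.0))   # -> 1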
Discriminant Functions
In the case of multiple classes, a pattern classifier can be specified by a set of discriminant functions gi(x), such that the decision ωi corresponds to

g_i(\mathbf{x}) \geq g_j(\mathbf{x}) \quad \forall j \neq i

Thus, a Bayesian classifier is naturally represented by

g_i(\mathbf{x}) = -R(\alpha_i \mid \mathbf{x})

The choice of discriminant functions is not unique: gi(x) can be replaced by f(gi(x)) for any monotonically increasing function f.

A minimum error-rate classifier can be obtained with

g_i(\mathbf{x}) = p(\mathbf{x} \mid \omega_i)\, P(\omega_i)

or, since the logarithm is monotonically increasing,

g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)
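The log form is the numerically convenient one. Below is a sketch with one-dimensional Gaussian class-conditional densities; all means, deviations, and priors are invented:

    import math

    def g(x, mean, std, prior):
        """Discriminant g_i(x) = ln p(x|w_i) + ln P(w_i) for a Gaussian likelihood."""
        log_lik = -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))
        return log_lik + math.log(prior)

    classes = [(10.0, 2.0, 0.5), (16.0, 3.0, 0.5)]   # (mean, std, prior) per class
    x = 12.5
    scores = [g(x, m, s, p) for m, s, p in classes]
    print(scores.index(max(scores)))                  # -> 0: argmax of g_i(x)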
Bayesian Rule in Higher Dimensions
The Bayesian rule can easily be generalized to the multidimensional case, where features are represented by a vector x.
P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{p(\mathbf{x})}

where

p(\mathbf{x}) = \sum_i p(\mathbf{x} \mid \omega_i)\, P(\omega_i)
Conclusion about Bayesian Decision
Bayesian decision theory provides a theoretical framework for statistical pattern recognition.

This theory supposes the following probabilistic information to be known:
- the number of classes
- the a priori probabilities of each class
- the class-conditional feature distributions for each class

The remaining problem is how to estimate all these quantities:
- feature distributions are hard to estimate
- priors are seldom known
- even the number of classes is not always given
Performance Evaluation
Performance evaluation is a very important issue in pattern recognition:
- it gives an objective measure of the performance
- it allows different methods to be compared

Performance evaluation requires correctly labeled test data:
- test data should be different from training data
- one strategy consists in cyclically using 80% of the data for training and the remaining 20% for evaluation, as sketched below
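The cyclic 80/20 split amounts to 5-fold cross-validation. A sketch of the index bookkeeping (the actual training and testing calls are left as placeholders):

    def five_fold_splits(n_samples, k=5):
        """Yield (train, test) index lists, cycling the held-out 20% over k folds."""
        idx = list(range(n_samples))
        fold = n_samples // k
        for i in range(k):
            test = idx[i * fold:(i + 1) * fold]
            train = idx[:i * fold] + idx[(i + 1) * fold:]
            yield train, test

    for train, test in five_fold_splits(10):
        print(len(train), len(test))              # 8 for training, 2 for evaluation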
Performance Measures: Recognition / Error Rates
Performance evaluation uses several measures:
- the recognition rate corresponds to the ratio: number of correct answers / total number of answers
- the error rate corresponds to the ratio: number of incorrect answers / total number of answers
- the rejection rate corresponds to the ratio: number of rejections / total number of answers

These rates are linked by: recognition rate = 1 − (rejection rate + error rate)
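Computed from raw counts (hypothetical numbers), the three rates and the identity above look as follows:

    def rates(n_correct, n_error, n_reject):
        """Recognition, error, and rejection rates over the total number of answers."""
        total = n_correct + n_error + n_reject
        return n_correct / total, n_error / total, n_reject / total

    rec, err, rej = rates(90, 6, 4)
    print(rec, err, rej)                          # -> 0.9 0.06 0.04
    print(abs(rec - (1 - (rej + err))) < 1e-12)   # the identity holds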
Performance Measures: Recall & Precision
For binary decisions (a sample belongs to the class or not), two other measures are frequently used:
- recall corresponds to the ratio of correctly assigned samples to the size of the class
- precision corresponds to the ratio of correctly assigned samples to the number of assigned samples

Recall and precision change in opposite directions; the equal error rate is sometimes considered the best trade-off.
Additionally, the harmonic mean of precision and recall, called the F-measure, is frequently used:

F\text{-}measure = \frac{2 \cdot precision \cdot recall}{precision + recall}
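A sketch computing the three measures from raw counts (all values hypothetical):

    def precision_recall_f(tp, fp, fn):
        """Precision, recall, and their harmonic mean (the F-measure)."""
        precision = tp / (tp + fp)                # correctly assigned / all assigned
        recall = tp / (tp + fn)                   # correctly assigned / class size
        f = 2 * precision * recall / (precision + recall)
        return precision, recall, f

    print(precision_recall_f(tp=80, fp=20, fn=10))   # -> (0.8, 0.888..., 0.842...)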