Using Word Based Features for Word Clustering

The Thirteenth Conference on Language Engineering, 11-12 December 2013
Department of Electronics and Communications, Faculty of Engineering, Cairo University

Research Team: Farhan M. A. Nashwan, Prof. Dr. Mohsen A. A. Rashwan
Presented By: Farhan M. A. Nashwan


Page 2
Contribution:
- Reduce the vocabulary
- Increase the speed

Page 3
Proposed Approach:
Generated image word -> Preprocessing and word segmentor -> Word grouping -> Clustering -> Groups and clusters for holistic recognition

Page 4
Grouping:
- Extract the subwords (PAWs, pieces of Arabic words)
- Extract the dots and diacritics
- Use them to select the group

Page 5
Grouping:
Generated image word -> Preprocessing and word segmentor -> Secondaries separation using contour analysis -> Secondaries recognition using SVM -> Grouping process -> Groups

Page 6
Grouping Example:
Each example word carries a grouping code of the form (PAW, Down Sec., Upper Sec.):

Grouping code   PAW   Down Sec.   Upper Sec.
(1, 21, 2)      1     2 & 1       2
(3, 0, 2)       3     0           2
(4, 11, 12)     4     1 & 1       1 & 2
(3, 2, 21)      3     2           2 & 1
(2, 0, 2)       2     0           2
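The grouping codes in the example can be reproduced with a few lines of Python. This is only a sketch of how the triple appears to be formed: it assumes each secondary is labelled with a single digit, that the labels of one word are concatenated in order, and that 0 means "no secondaries". The function names `digits_to_field` and `grouping_code` are hypothetical and do not come from the slides.

```python
def digits_to_field(labels):
    """Concatenate single-digit secondary labels into one integer; 0 means none."""
    if not labels:
        return 0
    if any(not 1 <= d <= 9 for d in labels):
        raise ValueError("each label must be a single digit between 1 and 9")
    return int("".join(str(d) for d in labels))


def grouping_code(paw_count, down_secondaries, upper_secondaries):
    """Build the (PAW, Down Sec., Upper Sec.) triple used to select a group."""
    return (paw_count,
            digits_to_field(down_secondaries),
            digits_to_field(upper_secondaries))


# Rows taken from the grouping example.
print(grouping_code(1, [2, 1], [2]))     # (1, 21, 2)
print(grouping_code(4, [1, 1], [1, 2]))  # (4, 11, 12)
print(grouping_code(2, [], [2]))         # (2, 0, 2)
```

Words that share the same triple would fall into the same group, so the triple can serve directly as a dictionary key.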

Page 7
Grouping based on:
- PAWs
- Down secondaries
- Upper secondaries

Challenges:
- Sticking
- Sensitivity to noise

Treatments:
- Overlapping
- SVM

Page 8
Clustering:
- Complementary to grouping
- The LBG algorithm is used
- Applied to groups that contain a large number of words
- Euclidean distance is used

Groups -> Feature extraction -> Clustering using LBG -> Clusters & groups
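For readers unfamiliar with LBG (Linde-Buzo-Gray), the following pure-Python sketch shows the usual splitting scheme with Euclidean distance. It is an illustration, not the authors' implementation: the perturbation `eps`, the stopping tolerance `tol`, the iteration cap, and the restriction to power-of-two codebook sizes are all assumptions made here.

```python
import math


def nearest(vector, codebook):
    """Index of the codeword closest to `vector` in Euclidean distance."""
    return min(range(len(codebook)), key=lambda k: math.dist(vector, codebook[k]))


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def lbg(vectors, codebook_size, eps=0.01, tol=1e-6, max_iter=100):
    """Grow a codebook by repeated splitting, refining each stage with Lloyd steps.

    `codebook_size` is assumed to be a power of two, since every split doubles it.
    """
    codebook = [mean(vectors)]
    while len(codebook) < codebook_size:
        # Split every codeword into two slightly perturbed copies.
        codebook = [[c * (1 + s * eps) + s * eps for c in word]
                    for word in codebook for s in (+1, -1)]
        previous = float("inf")
        for _ in range(max_iter):
            labels = [nearest(v, codebook) for v in vectors]
            distortion = sum(math.dist(v, codebook[k]) ** 2
                             for v, k in zip(vectors, labels)) / len(vectors)
            # Move each codeword to the mean of its members; keep empty cells as-is.
            for k in range(len(codebook)):
                members = [v for v, lab in zip(vectors, labels) if lab == k]
                if members:
                    codebook[k] = mean(members)
            if previous - distortion <= tol * max(distortion, 1e-12):
                break
            previous = distortion
    labels = [nearest(v, codebook) for v in vectors]
    return codebook, labels


# Toy feature vectors forming two well-separated clusters.
features = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
codebook, labels = lbg(features, 2)
print(labels)
```

In the setting of the slides, `vectors` would hold the feature vectors of the words in one large group, and each resulting label would identify a cluster inside that group.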

Page 9
Features:
1- (ICC): Image Centroid and Cells
2- (DCT): Discrete Cosine Transform
3- (BDCT): Block Discrete Cosine Transform
4- (DCT-4B): Discrete Cosine Transform 4-Blocks
5- (BDCT+ICC): Hybrid BDCT with ICC
6- (ICC+DCT): Hybrid DCT with ICC
7- (ICZ): Image Centroid and Zone
8- (DCT+ICZ): Hybrid DCT and ICZ
9- (DTW): Dynamic Time Warping
10- The Moment Invariant Features

Page 10
Results:

TABLE 1: CLUSTERING RATE OF SIMPLIFIED ARABIC FONT USING DIFFERENT FEATURES

Features    Cluster Rate (%)   Total ER (%)   Clustering ER (%)   Group ER (%)   Word/Cluster
ICC         98.7               1.31           0.55                0.75           115
BDCT        96.8               3.22           2.47                0.75           118
DCT         99.2               0.81           0.05                0.75           129
DCT-4B      98.7               1.3            0.55                0.75           113
ICC+BDCT    98.3               1.66           0.91                0.75           117
ICC+DCT     99.0               0.98           0.23                0.75           114
ICZ         96.7               3.28           2.53                0.75           116
ICZ+DCT     98.7               1.34           0.59                0.75           115
DTW         98.1               1.92           1.17                0.75           154
Moments     82.6               17.39          16.64               0.75           176
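The columns of Table 1 appear to be related by simple arithmetic: the total error rate looks like the sum of the clustering and group error rates (up to rounding), and the cluster rate looks like 100 minus the total error rate. The slides do not state this, so treat it as an observation about the numbers. The short check below uses a few rows copied from the table; `decompose` is a hypothetical helper name.

```python
# (feature, total ER %, clustering ER %, group ER %) copied from Table 1.
rows = [
    ("ICC", 1.31, 0.55, 0.75),
    ("BDCT", 3.22, 2.47, 0.75),
    ("DCT", 0.81, 0.05, 0.75),
    ("DTW", 1.92, 1.17, 0.75),
    ("Moments", 17.39, 16.64, 0.75),
]


def decompose(total_er, clustering_er, group_er):
    """Return (sum of the two error parts, cluster rate implied by the total ER)."""
    return round(clustering_er + group_er, 2), round(100.0 - total_er, 1)


for name, total, clus, group in rows:
    summed, rate = decompose(total, clus, group)
    print(f"{name:8s} total={total:5.2f} parts={summed:5.2f} rate={rate:4.1f}")
```

The constant group error of 0.75% in every row is consistent with grouping being performed once, before any feature is chosen for clustering.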

Page 11

TABLE 2: PROCESSING TIME FOR FEATURE EXTRACTION AND CLUSTERING OF SIMPLIFIED ARABIC FONT USING DIFFERENT FEATURES

Features    Cluster Rate (%)   Word/Cluster   Feat_Ext_Time (ms)   Clus_Ave_Time (ms)   To_Ave_Time (ms)
ICC         98.7               115            0.04                 0.25                 0.29
BDCT        96.8               118            0.38                 0.13                 0.51
DCT         99.2               129            11.95                0.03                 11.99
DCT-4B      98.7               113            1.87                 0.02                 1.90
ICC+BDCT    98.3               117            0.41                 0.24                 0.66
ICC+DCT     99.0               114            1.90                 0.26                 2.16
ICZ         96.7               116            0.01                 0.04                 0.05
ICZ+DCT     98.7               115            1.87                 0.06                 1.94
DTW         98.1               145            0.05                 4.04                 4.09
Moments     82.6               176            0.13                 0.15                 0.29

Page 12
Conclusion:
Words are grouped and clustered based on their holistic features:
- Recognition speed is increased.
- Unnecessary entries in the vocabulary are removed.
- The total average time of ICC or Moments (0.29 ms) is better than that of other methods, but their clustering rates are not the best (98.69% for ICC and 82.61% for Moments).
- The clustering rate of DCT (99.19%) is the best, but its time is the worst (~12 ms).
- Considering both parameters (clustering rate and time), ICC may be a good compromise.

Page 13
Thanks for your attention.

Page 14
- Counting the number of black pixels
- Vertical transitions from black to white
- Horizontal transitions from black to white
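The three counts listed on this slide can be sketched as follows. The code assumes a binary cell stored as a list of rows with 1 for black and 0 for white, scanned left to right and top to bottom; the slide does not specify the scan direction or how cells are cut from the word image, so those choices are assumptions.

```python
def black_pixel_count(cell):
    """Number of black pixels (value 1) in a binary cell given as a list of rows."""
    return sum(sum(row) for row in cell)


def horizontal_transitions(cell):
    """Black-to-white transitions met while scanning each row left to right."""
    return sum(1 for row in cell
               for a, b in zip(row, row[1:]) if a == 1 and b == 0)


def vertical_transitions(cell):
    """Black-to-white transitions met while scanning each column top to bottom."""
    columns = list(zip(*cell))
    return horizontal_transitions(columns)


cell = [[0, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 0]]
print(black_pixel_count(cell))       # 5
print(horizontal_transitions(cell))  # 3
print(vertical_transitions(cell))    # 3
```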

Page 15
DCT
- Apply the DCT to the whole word image.
- The features are extracted in vector form by taking the DCT coefficients in zigzag order.
- Usually only the most significant DCT coefficients are kept (160 coefficients).
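The zigzag read-out can be illustrated with a small pure-Python sketch. It uses a direct, unnormalised 2-D DCT-II, which is slow and only suitable for tiny images; a real system would presumably use an optimised library routine. The function names and the choice of normalisation are assumptions; only the 160-coefficient default comes from the slide.

```python
import math


def dct_2d(image):
    """Unnormalised 2-D DCT-II of a grey-level image given as a list of rows."""
    rows, cols = len(image), len(image[0])
    return [[sum(image[y][x]
                 * math.cos(math.pi * (2 * y + 1) * u / (2 * rows))
                 * math.cos(math.pi * (2 * x + 1) * v / (2 * cols))
                 for y in range(rows) for x in range(cols))
             for v in range(cols)]
            for u in range(rows)]


def zigzag_indices(rows, cols):
    """(row, col) pairs ordered along anti-diagonals, alternating direction."""
    order = []
    for s in range(rows + cols - 1):
        diagonal = [(r, s - r) for r in range(rows) if 0 <= s - r < cols]
        order.extend(diagonal if s % 2 else reversed(diagonal))
    return order


def dct_features(image, count=160):
    """First `count` DCT coefficients of the whole image in zigzag order."""
    coeffs = dct_2d(image)
    order = zigzag_indices(len(coeffs), len(coeffs[0]))
    return [coeffs[r][c] for r, c in order[:count]]


print(zigzag_indices(3, 3)[:6])
# [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
```

Because low-frequency coefficients sit near the top-left corner, truncating the zigzag sequence keeps the coefficients that usually carry most of the image energy.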

Page 16
Block Discrete Cosine Transform (BDCT)
- Apply the DCT to each cell.
- Get the average of the differences between all the DCT coefficients.

Page 17
Discrete Cosine Transform 4-Blocks (DCT-4B)
1- Compute the center of gravity of the input image.
2- Divide the word image into 4 parts, taking the center of gravity as the origin point.
3- Apply the DCT to each part.
4- Concatenate the features taken from each part to form the feature set of the given word.
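The four steps can be sketched as below. The per-part transform is passed in as a parameter (`transform`, a hypothetical name) so that the example stays short; in the method described on the slide it would be a DCT feature extractor. Rounding the center of gravity to integer pixel coordinates and treating 1 as black are assumptions.

```python
def center_of_gravity(image):
    """(row, col) centroid of the black pixels (value 1), rounded to integers."""
    points = [(r, c) for r, row in enumerate(image)
              for c, value in enumerate(row) if value]
    if not points:
        raise ValueError("image has no black pixels")
    return (round(sum(r for r, _ in points) / len(points)),
            round(sum(c for _, c in points) / len(points)))


def split_in_four(image):
    """Cut the image into four parts that meet at the center of gravity."""
    r0, c0 = center_of_gravity(image)
    top, bottom = image[:r0], image[r0:]
    return [[row[:c0] for row in top], [row[c0:] for row in top],
            [row[:c0] for row in bottom], [row[c0:] for row in bottom]]


def dct_4b_features(image, transform):
    """Concatenate `transform(part)` over the four parts (steps 3 and 4)."""
    features = []
    for part in split_in_four(image):
        features.extend(transform(part))
    return features


# Stand-in transform for illustration only: the black-pixel count of each part.
image = [[1, 0, 0, 0, 1],
         [0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [1, 0, 0, 0, 1]]
print(dct_4b_features(image, lambda part: [sum(map(sum, part))]))  # [1, 1, 1, 1]
```

Note that a part can be empty when the center of gravity lies on the first row or column, so a real transform would need to handle that case.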

Page 18
Image Centroid and Zone (ICZ)
Compute the average distance between these points (in a given zone) and the centroid of the word image.
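A minimal sketch of the ICZ computation follows. It assumes that "these points" are the black pixels of a binary image, that zones form a regular grid of `zone_rows` by `zone_cols`, and that an empty zone contributes 0.0; none of these details is given on the slide, and the function name is hypothetical.

```python
import math


def icz_features(image, zone_rows, zone_cols):
    """Average distance from each zone's black pixels to the image centroid."""
    points = [(r, c) for r, row in enumerate(image)
              for c, value in enumerate(row) if value]
    if not points:
        return [0.0] * (zone_rows * zone_cols)
    # Centroid of all black pixels in the word image.
    cr = sum(r for r, _ in points) / len(points)
    cc = sum(c for _, c in points) / len(points)
    height, width = len(image), len(image[0])
    totals = [0.0] * (zone_rows * zone_cols)
    counts = [0] * (zone_rows * zone_cols)
    for r, c in points:
        # Map the pixel to its zone index in row-major order.
        zone = (r * zone_rows // height) * zone_cols + (c * zone_cols // width)
        totals[zone] += math.hypot(r - cr, c - cc)
        counts[zone] += 1
    return [t / n if n else 0.0 for t, n in zip(totals, counts)]


image = [[1, 0, 0, 1],
         [0, 0, 0, 0],
         [0, 0, 0, 0],
         [1, 0, 0, 1]]
print(icz_features(image, 2, 2))
```

The result is one value per zone, so the feature vector length equals the number of zones.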

Page 19
DTW (Dynamic Time Warping) Features

Three types of features are extracted from the binarized images and used in our DTW technique:
- X-axis and Y-axis histogram profiles
- Profile features (upper, down, left and right)
- Foreground/background transitions

DTW is an algorithm for measuring the similarity between two sequences. The distance between two time series x1 ... xM and y1 ... yN is D(M,N), which is calculated with a dynamic programming approach.
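The recurrence used by the authors is not legible in this transcript. The sketch below uses the textbook form, D(i,j) = d(x_i, y_j) + min(D(i-1,j), D(i,j-1), D(i-1,j-1)), with the absolute difference as the local cost d. Both the step pattern and the local cost are assumptions and may differ from the variant used in the presented work.

```python
def dtw_distance(x, y):
    """D(M, N) for sequences x (length M) and y (length N) with |a - b| local cost."""
    m, n = len(x), len(y)
    inf = float("inf")
    # D[i][j] holds the cost of the best alignment of x[:i] with y[:j].
    D = [[inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[m][n]


print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
print(dtw_distance([0, 0], [1, 1]))           # 2.0
```

The table has (M+1) x (N+1) entries, so the cost grows with the product of the two sequence lengths, which fits the comparatively high clustering time reported for DTW in Table 2.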

Page 20
DTW (Dynamic Time Warping) Features

Figure 1: The four profile features: (A) left profile, (B) up profile, (C) down profile, (D) right profile.
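One common way to compute the four profiles is to record, for every row or column, the distance from the corresponding image border to the first black pixel. The sketch below follows that definition, which is assumed here rather than taken from the slides; a row or column without black pixels is given the full image width or height.

```python
def first_black(line):
    """Index of the first black pixel (value 1) in `line`, or len(line) if none."""
    for index, value in enumerate(line):
        if value:
            return index
    return len(line)


def profiles(image):
    """Left, up, down and right profiles of a binary image (list of rows)."""
    columns = [list(col) for col in zip(*image)]
    return {
        "left": [first_black(row) for row in image],
        "right": [first_black(row[::-1]) for row in image],
        "up": [first_black(col) for col in columns],
        "down": [first_black(col[::-1]) for col in columns],
    }


image = [[0, 1, 0],
         [1, 1, 0],
         [0, 0, 0]]
print(profiles(image)["left"])  # [1, 0, 3]
print(profiles(image)["up"])    # [1, 0, 3]
```

Each profile is a one-dimensional sequence, which is the kind of input a DTW comparison works on.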

Page 21
The Moment Invariant Features

Hu moments: Hu defined seven values, computed from central moments through order three, that are invariant to object scale, position, and orientation.
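As an illustration, the first two of Hu's seven invariants can be computed from normalised central moments as shown below. The sketch is limited to two invariants to stay short; the function name `hu_first_two` is hypothetical, and the slides do not say which implementation was used.

```python
def hu_first_two(image):
    """First two Hu invariants of a binary or grey-level image (list of rows)."""
    pixels = [(x, y, v) for y, row in enumerate(image)
              for x, v in enumerate(row) if v]
    m00 = sum(v for _, _, v in pixels)
    if m00 == 0:
        raise ValueError("image is empty")
    # Centroid of the intensity distribution.
    xc = sum(x * v for x, _, v in pixels) / m00
    yc = sum(y * v for _, y, v in pixels) / m00

    def eta(p, q):
        """Normalised central moment of order (p, q)."""
        mu = sum((x - xc) ** p * (y - yc) ** q * v for x, y, v in pixels)
        return mu / m00 ** (1 + (p + q) / 2)

    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return phi1, phi2


shape = [[1, 1, 1],
         [1, 0, 0]]
shifted = [[0, 0, 0, 0],
           [0, 1, 1, 1],
           [0, 1, 0, 0]]
print(hu_first_two(shape))
print(hu_first_two(shifted))
```

The two printed pairs agree up to floating-point rounding, which reflects the translation invariance of the central moments.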


Page 23
Moments

The moment invariant descriptors are calculated and fed to the feature vector.