A fuzzy clustering approach to improve the accuracy of Italian students’data An experimental procedure to correct the impact of the outliers on assessment

A fuzzy clustering approach to improve the accuracy of Italian students’data

An experimental procedure to correct the impact of the outliers on assessment test

scores

Claudio Quintano, Rosalia Castellano, Sergio LongobardiClaudio Quintano, Rosalia Castellano, Sergio Longobardi

UNIVERSITY OF NAPLES “PARTHENOPE” UNIVERSITY OF NAPLES “PARTHENOPE”

[email protected] [email protected] [email protected]@uniparthenope.it

[email protected] [email protected]

-2-

A fuzzy clustering approach to improve the accuracy of Italian students’dataAn experimental procedure to correct the impact of the outliers on assessment test scores

C.Quintano, R.Castellano, S.Longobardi - University of Naples “Parthenope”

OUTLINE

This work considers data on students’ performance assessments collected by the

Italian National Evaluation Institute of the Ministry of Education (INVALSI)

• OUTLIER UNITSOUTLIER UNITS, at class level, which brings to biased , at class level, which brings to biased distributions of the average scores by class distributions of the average scores by class

• The AIM is to The AIM is to MITIGATE THE PRESENCEMITIGATE THE PRESENCE of outliers and of outliers and correcting the overestimation of children abilitycorrecting the overestimation of children ability

THE INVALSI SURVEY

3 AREASreading, mathematics and science

5 SCHOOL LEVELS

–2th and 4th year of primary school–1th year of lower secondary –1th and 3th year of upper secondary

-3-



VAR00005100,0080,0060,0040,0020,000,00

Fre

qu

en

cy

500

400

300

200

100

0

Histogram

avergita100806040200

Fre

qu

en

cy

1.250

1.000

750

500

250

0

HistogramMATHEMATICS CLASS MEAN SCORE - S.Y 2004/05

III CLASS UPPER

SECONDARY SCHOOL

I CLASS UPPER SECONDARY

SCHOOL

I CLASS LOWER

SECONDARY SCHOOL

IV CLASS PRIMARY SCHOOL


Fre

qu

ency

1.250

1.000

750

500

250

0

Histogram

II CLASS

PRIMARY SCHOOL

VAR00005100,0080,0060,0040,0020,000,00

Fre

qu

en

cy

500

400

300

200

100

0

Histogram

VAR00002100,0080,0060,0040,0020,000,00

Fre

qu

en

cy

2.000

1.500

1.000

500

0

Histogram

DISTRIBUTIONS OF MEAN SCORESAT CLASS LEVEL (MATHEMATICS ASSESSMENT)

-4-




Fre

qu

ency

3.000

2.000

1.000

0

Histogram

Mean =87,81Std. Dev. =8,14

N =30.031

avergita

100806040200

Fre

qu

ency

1.250

1.000

750

500

250

0

Histogram

Mean =74,71Std. Dev. =14,133

N =30.097

avergita

100806040200

Fre

qu

ency

2.500

2.000

1.500

1.000

500

0

Histogram

Mean =76,66Std. Dev. =14,615

N =30.094

CLASS MEAN SCOREII

CLA

SS

- P

RIM

AR

Y S

CH

OO

L


Freq

uenc

y

2.000

1.500

1.000

500

0

Histogram

Mean =77,72Std. Dev. =13,029

N =29.802

avergita

100806040200

Freq

uenc

y

2.000

1.500

1.000

500

0

Histogram

Mean =81,5Std. Dev. =11,439

N =29.816


Fre

qu

ency

5.000

4.000

3.000

2.000

1.000

0

Histogram

Mean =88,54Std. Dev. =8,832

N =29.831

Reading s.y. 2004/05 Mathematics s.y. 2004/05 Science s.y. 2004/05

Reading s.y. 2005/06 Mathematics s.y. 2005/06 Science s.y. 2005/06

-5-



Deletion of micro units –students- Deletion of micro units –students- considered as considered as

““PSEUDO NON RESPONDENTS”PSEUDO NON RESPONDENTS”

Students who haven’t given the minimum number of answers to compute a performance score

The presence of these units varies The presence of these units varies from 9% to 16%from 9% to 16%

STEP I

-6-



Class mean score Class mean score ::

N j

iji 1

jj

p

pN

COMPUTATION OF CLASS LEVEL INDICATOR

SCORE OF ISCORE OF ITHTH STUDENT OF J STUDENT OF JTHTH CLASSCLASS

NUMBER OF RESPONDENTNUMBER OF RESPONDENTSTUDENTS OF JSTUDENTS OF JTHTH CLASS CLASS

For each student class the For each student class the following indexes are computed:following indexes are computed:

Standard deviation of mean score

N j 2

jiji 1

jj

p p

N

N j

iji 1

jj

M

MCN Q

Class non response rateClass non response rateNUMBER BOTH OF ITEM NON NUMBER BOTH OF ITEM NON REPSONSES AND OF INVALID REPSONSES AND OF INVALID

RESPONSES FOR THE IRESPONSES FOR THE ITHTH STUDENT OF STUDENT OF THE JTHE JTHTH CLASS CLASS

NUMBER OF RESPONDENTNUMBER OF RESPONDENTSTUDENTS OF JSTUDENTS OF JTHTH CLASS CLASS

NUMBER OF ADMINISTERED NUMBER OF ADMINISTERED ITEMS TO JITEMS TO JTHTH CLASS CLASS

Q

sjs 1

J

E

EQ

Index of answers’ homogeneityIndex of answers’ homogeneity

GINI MEASURE OF HETEROGENEITY COMPUTED FOR EACH STH TEST QUESTION ADMINISTERED TO

EACH STUDENT OF JTH CLASS

SUMMARY

At first step the micro units At first step the micro units considered as “pseudo-non considered as “pseudo-non

respondents” have been respondents” have been dropped from dataset then dropped from dataset then the following indexes, at the following indexes, at

class level, are computed:class level, are computed:

Class mean scoreClass mean score

Standard deviation Standard deviation of mean scoreof mean score

Class non response rateClass non response rate

Index of answers’Index of answers’homogeneityhomogeneity

-7-



PRINCIPAL COMPONENT ANALYSIS (PCA)

By the PCA we are able to describe the

answer behaviour of each student class

through two variables

CONTRAPOSITION

FIRST ComponentSECOND

Component

OUTLIERSIDENTIFICATION

AXIS

INDEX OF CLASS COLLABORATION

TO SURVEY

Class non response rate

-8-



It is possible to detect, graphically, the outlier classes of students

Projection on the first two factorial

axes plane of second class

primary students

PRINCIPAL COMPONENT ANALYSIS (PCA)

OUTLIER CLASSES

-9-



Computation of Computation of fuzzy fuzzy partition matrixpartition matrix where for where for

each students’ class (rows of each students’ class (rows of the matrix) the degree of the matrix) the degree of belonging to each cluster belonging to each cluster (columns of the matrix) is (columns of the matrix) is

computed computed

THE FUZZY K-MEANS APPROACHFUZZY K-MEANS APPROACH

On the basis of the two On the basis of the two factorial dimensions the factorial dimensions the

students’classes are classified students’classes are classified in 8 clusters by a in 8 clusters by a FUZZY K-FUZZY K-

MEANSMEANS algorithm algorithm

-10-



DETECTION OF OUTLIERSDETECTION OF OUTLIERS

Projection of Projection of centroids computed centroids computed

by fuzzy k-meansby fuzzy k-means

High negative scores on High negative scores on “outliers identification “outliers identification

axis” (x-axis) that axis” (x-axis) that indicates a high class indicates a high class average scores and average scores and

minimum within minimum within variabilityvariability respect to respect to

scores and test answersscores and test answers

OUTLIER CLUSTER

Factorial scores close to zero Factorial scores close to zero respect to the “index of class respect to the “index of class

collaboration to survey”collaboration to survey”

-11-



Indicating with “a” the outlier cluster, the degree of belonging to this cluster is:

µja

Otherwise it can be interpreted as the “outlier level” of each class

This measure is considered as the “outlier probability” of jth class

DETECTION OF OUTLIERSDETECTION OF OUTLIERS

-12-



WWjj varies from 0 to 1 varies from 0 to 1 The students’ class with high probability to belong The students’ class with high probability to belong to outlier cluster will have a low weight while the to outlier cluster will have a low weight while the class very far from this cluster will have a weight class very far from this cluster will have a weight

close to 1close to 1

CORRECTION PROCEDURE

On the basis of the outlier cluster degree,On the basis of the outlier cluster degree, a weighting factor is developed:a weighting factor is developed:

WWj j =1 -=1 - µ µjaja

Weighting factor Outlier probability

-13-



EFFECTS OF THE CORRECTION PROCEDURE

ORIGINAL DISTRIBUTION ADJUSTED DISTRIBUTION

-14-



THE INSPIRATION PRINCIPLE

OUTLIEROUTLIER

NOT NOT OUTLIEROUTLIER

Go over the dichotomous logic

FUZZY FUZZY APPROACHAPPROACH

Compute an “OUTLIER LEVEL” measure for each unit to

calibrate the correction

-15-



RELATIONSHIP BETWEEN THE SCHOOL LOCALIZATION AND THE PRESENCE OF OUTLIER CLASSES

Box plot of outlier level

µja

Degree to belonging to the outlier

cluster (cluster n.2)

-16-



CLASS AVERAGE CLASS AVERAGE SCORE SCORE

DISTRIBUTIONS ONLY DISTRIBUTIONS ONLY FOR THE NORTHERN FOR THE NORTHERN

AND CENTRAL AND CENTRAL REGIONSREGIONS

RELATIONSHIP BETWEEN THE SCHOOL LOCALIZATION AND THE PRESENCE OF OUTLIER CLASSES

-17-



REGIONAL SCORES

50

52

54

56

58

60

62

64

66

68

70

72

74

76

78

80

82

84

86

Media

II elem MAT 0405

50

52

54

56

58

60

62

64

66

68

70

72

74

76

78

80

82

84

86

Piem

onte

Vall

e D'A

osta

Lom

bard

ia

Tre

ntino

Alto

Adig

e

Ven

eto

Friu

li Ven

ezia

Giulia

Ligu

ria

Em

ilia R

omag

na

Tos

cana

Um

bria

Mar

che

Laz

io

Abr

uzzo

Moli

se

Cam

pania

Pug

lia

Bas

ilicat

a

Cala

bria

Sici

lia

Sar

degn

a

Media ponderata

II elem MAT 0405

NOT WEIGHTED AVERAGE

WEIGHTED AVERAGE

-18-



denotes the ratio of students of jth class that has given the tth answer to sth question

Index of answers’ homogeneity

Index of answers’ homogeneityQ

EE

Q

ssj

j

1

Where Esj is a Gini measure of heterogeneity:

2

1

1

h

t j

tsj N

nE

j

t

N

n

The Gini measure is equal to zero when all students of jth class have given

the same answer to the sth question. It reaches the maximum value: h-1/h (h

is the number of alternative answers to question sth) when there is perfect

heterogeneity of answers to sth question in the jth class

The mean of the Q Gini indexes (Esj) computed for each sth test Question administered to each student of jth class:

-19-



Original distribution Adjusted distribution

MEAN 74,71 71,67

MODE 100,00 68,75

I QUARTILE 64,42 63,12

MEDIAN 73,61 71,09

III QUARTILE 85,94 80,69

KURTOSIS

SKEWNESS

EFFECTS OF THE CORRECTION PROCEDURE

Documents

A fuzzy clustering approach to improve the accuracy of Italian students’data An experimental procedure to correct the impact of the outliers on assessment