
European Conference on Quality in Official Statistics

Different Quality Tests on the Automatic Coding Procedure for the Economic Activities Descriptions

A. Ferrillo, S. Macchia, P. Vicari

ISTAT – Italian Institute of Statistics

Rome, 8 – 11 July 2008

The automated coding system used in Istat: ACTR

ACTR = Automatic Coding by Text Recognition, developed by Statistics Canada.

It is a generalised system, i.e. independent of both the classification and the language.

To be used, it has to be customised: building the coding dictionary (reference file), defining synonyms and adapting it to the language.

The construction of the coding dictionary is the most demanding activity, since its quality deeply affects the performance of automatic coding.

ACTR

The coding activity is preceded by a quite sophisticated text standardisation phase, called “parsing”, which provides 14 different “parsing functions” (character mapping, deletion of trivial words, definition of synonyms, suffix removal, etc.) able to remove grammatical or syntactical differences, so that any two different descriptions with the same semantic content become identical.
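A minimal sketch of what a few of these parsing steps could look like; the word lists, synonym mapping and suffix rules below are hypothetical placeholders, not ACTR's actual configuration, which lives in the customised reference files:

```python
import re

# Hypothetical, simplified stand-ins for a few of ACTR's 14 parsing functions.
TRIVIAL_WORDS = {"di", "e", "la", "the", "of", "and"}     # trivial words to delete
SYNONYMS = {"fabbricazione": "produzione"}                 # synonym definitions
SUFFIXES = ("zione", "mento", "i")                         # suffixes to strip (crude)

def parse(text: str) -> str:
    """Standardise a free-text description so that two descriptions with the
    same semantic content collapse onto the same parsed string."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)               # character mapping / cleaning
    words = [w for w in text.split() if w not in TRIVIAL_WORDS]
    words = [SYNONYMS.get(w, w) for w in words]            # synonym replacement
    stemmed = []
    for w in words:
        for suffix in SUFFIXES:                            # crude suffix removal
            if w.endswith(suffix) and len(w) > len(suffix) + 3:
                w = w[: -len(suffix)]
                break
        stemmed.append(w)
    return " ".join(sorted(stemmed))                       # neutralise word order

# Both calls yield the same parsed string ("mobil produ" here):
print(parse("Fabbricazione di mobili"))
print(parse("Produzione mobili"))
```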

The parsed response to be coded is then compared with the parsed descriptions of the dictionary. If this search finds a perfect match, i.e. a “direct matching” (score = 10), a unique code is assigned; otherwise the software runs an algorithm that looks for the most suitable partial matches (“indirect matching”).
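Building on the parse() sketch above, the following toy lookup illustrates the distinction between direct and indirect matching; the dictionary entries and the word-overlap score are illustrative assumptions, not ACTR's actual indirect-matching algorithm:

```python
# Toy dictionary lookup reusing parse() from the sketch above.
DICTIONARY = {
    "mobil produ": "31.09",     # parsed description -> ATECO code (made-up rows)
    "pane produ": "10.71",
}

def assign_code(description: str) -> list[tuple[str, float]]:
    parsed = parse(description)
    if parsed in DICTIONARY:                       # direct matching: score = 10
        return [(DICTIONARY[parsed], 10.0)]
    words = set(parsed.split())
    candidates = []
    for entry, code in DICTIONARY.items():         # indirect matching: partial matches
        entry_words = set(entry.split())
        overlap = words & entry_words
        if overlap:
            candidates.append((code, 10.0 * len(overlap) / len(words | entry_words)))
    # empty list = failed match; one item = unique; several items = multiple
    return sorted(candidates, key=lambda c: -c[1])
```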

ACTR

As a result the software returns:

unique matches, when a unique code is assigned to a response phrase;

multiple matches, when several possible codes are proposed;

failed matches, when no matches are found.

Its performance is measured through two indicators:

Recall rate (coding rate): percentage of codes automatically assigned

Precision rate: percentage of correct codes among those automatically assigned
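As a minimal illustration of the two indicators, the figures reported further on for the Industry Census test can be reproduced as follows:

```python
# Recall rate and precision rate as defined above, computed from the match
# counts reported later for the Industry Census quality test.
unique, multiple, failed = 2504, 642, 45     # match outcomes on the 3,191 sampled texts
correct_unique = 2382                        # unique codes judged correct by expert coders

recall_rate = unique / (unique + multiple + failed)   # share of responses coded automatically
precision_rate = correct_unique / unique              # share of assigned codes that are correct

print(f"recall = {recall_rate:.2%}, precision = {precision_rate:.2%}")
# recall = 78.47%, precision = 95.13%
```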

Automated coding applications developed in Istat

The most important applications built in Istat, already used in different surveys, refer to the following variables: Occupation, Economic Activity, Education level, Country/Nationality, Municipalities.

The coding rate obtained for Economic Activities varies from 50% for household surveys to 80% for business surveys.

ATECO 2007 – The new economic activities classification

ATECO 2007 is the national version of NACE Rev. 2, the European economic activities classification

The new NACE is profoundly different from the previous one.

NACE Rev. 2 and its impact on official statistics: 45% of the four-digit codes split into two or more new codes; 35% of the five-digit codes split.

ATECO 2007 and ACTR

Updating the ACTR application for ATECO 2007 was a complex process involving different steps and problems:

only part of the old classification at the five-digit level (around 65%) translated directly into the new one; the other part had to be checked description by description,

since the classification was very different, some descriptions had to be completely re-examined; in some cases it was necessary to divide old descriptions (for example: “Repair and installation of pumps”) because one part is now in one code (Repair, group 33.1) and the other part is in a different code (Installation, group 33.2),

completely new activities were introduced,

it was necessary to delete some old descriptions because they were completely obsolete (281 texts).

ATECO and ACTR

Texts in the dictionary

ATECO ‘91 27,306

ATECO 2002 30,745

ATECO 2007 33,587

ACTR: aims of the new application

ACTR for surveys and the Census: ACTR was already used for the 2001 Census and other surveys, so it is already tuned to the descriptions given by this type of respondent.

ACTR for administrative sources: these descriptions differ from those of statistical surveys because they are quite often very long and there are no specifications or rules on how to describe the company's activity. These texts have been treated in a specific way in order to: a) obtain descriptions shorter than the original ones; b) delete redundancies and useless information (a sketch of this kind of pre-treatment follows this list).

ACTR on web: a new tool for all users to find their economic activity code.
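The source does not detail how the administrative descriptions were shortened; the following is a minimal sketch, assuming hypothetical boilerplate phrases and simple redundancy removal:

```python
import re

# BOILERPLATE is a hypothetical list of legal-form phrases that carry no
# information about the activity; the rules actually used by Istat are not
# detailed in the source.
BOILERPLATE = [
    "la società ha per oggetto",
    "l'esercizio delle seguenti attività",
]

def shorten(text: str, max_len: int = 200) -> str:
    """Shorten an administrative description and drop redundant words."""
    text = text.lower()
    for phrase in BOILERPLATE:
        text = text.replace(phrase, " ")
    first_clause = re.split(r"[;.]", text, maxsplit=1)[0]   # keep the first clause only
    seen, kept = set(), []
    for word in re.findall(r"\w+", first_clause):           # drop repeated words
        if word not in seen:
            seen.add(word)
            kept.append(word)
    return " ".join(kept)[:max_len]
```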

Quality tests for ACTR 2007

In order to measure the quality of the procedure when used to code non-homogeneous descriptions, different quality tests have been planned. They differ both in the methodologies they use and in the samples they treat.

Three tests will be described: for two of them, the correctness of the codes assigned by the automatic coding application is assessed by expert coders, while in the third one the assigned codes are compared with codes deriving from some special surveys.

Sampling design for the first quality test (Industry Census descriptions):

Classes of occurrence | Number of original texts | Number of different texts (N_h) | Hypothesised precision of autom. coding (π) | Margin of error (e_h) | Approximate optimal sample size (n_h) | Sampling fraction (f_h = n_h / N_h)
1 | 169,420 | 169,420 | 75.0% | ±0.040 | 234 | 0.14%
2 | 55,790 | 27,895 | 75.0% | ±0.040 | 233 | 0.84%
3-8 | 92,401 | 20,906 | 75.0% | ±0.035 | 305 | 1.63%
9-25 | 91,828 | 6,475 | 75.0% | ±0.035 | 305 | 5.19%
26-50 | 65,636 | 1,854 | 75.0% | ±0.030 | 414 | 23.20%
51-90 | 60,638 | 902 | 75.0% | ±0.030 | 414 | 47.35%
91-130 | 38,941 | 360 | 75.0% | ±0.030 | 360 | 100.00%
131-180 | 38,105 | 247 | 75.0% | ±0.030 | 247 | 100.00%
181-300 | 56,913 | 246 | 75.0% | ±0.020 | 246 | 100.00%
301-430 | 46,462 | 128 | 75.0% | ±0.020 | 128 | 100.00%
431-730 | 78,602 | 139 | 75.0% | ±0.010 | 139 | 100.00%
731-1410 | 83,200 | 83 | 75.0% | ±0.010 | 83 | 100.00%
1,411 and over | 252,726 | 83 | 75.0% | ±0.010 | 83 | 100.00%
Tot. | 1,130,662 | 228,738 | | | 3,191 | 1.4%

1) Quality test on descriptions of the Industry Census

A stratified random sample has been extracted from the 1,130,662 descriptions. The methodology adopted in drawing this sample optimises the analysis of results, so that very similar texts are examined only once (D'Orazio, Macchia 2002). Texts were stratified according to their frequency of occurrence; then, within each stratum, a simple random sample of texts was selected.
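The source does not spell out the allocation formula (it refers to D'Orazio, Macchia 2002); a standard formulation for the stratum sample sizes in the table above, assuming a hypothesised proportion π, a stratum margin of error e_h and a finite population correction, would be:

\[
n_h^{0} = \frac{z^{2}\,\pi(1-\pi)}{e_h^{2}},
\qquad
n_h = \min\!\left(N_h,\; \frac{n_h^{0}}{1 + (n_h^{0}-1)/N_h}\right),
\qquad
f_h = \frac{n_h}{N_h},
\]

where N_h is the number of different texts in stratum h and z is the normal quantile for the chosen confidence level (not stated in the source); strata where n_h reaches N_h are enumerated completely (sampling fraction 100%).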

1) Quality test on descriptions of the Industry Census

The recall rate (78.47%) is absolutely satisfactory, also when analysed in detail. In fact, while the unique matches are distributed among all the classes, there are no failed matches in classes over 180 occurrences.

In addition, 71.25% of the unique matches have a score equal to 10, which means that they correspond to direct matches, and more than 53% of them belong to classes of occurrence greater than 91, which means that the dictionary enrichment was consistent with the way respondents usually express themselves.

Results in terms of recall rate:

Match type | N | % | Direct matches (score = 10): N | %
Unique | 2,504 | 78.47 | 1,784 | 71.25
Multiple | 642 | 20.12 | |
Failed | 45 | 1.41 | |
Total | 3,191 | 100.00 | |

1) Quality test on descriptions of the Industry Census

Results in terms of precision rate:

Precision rate | N | %
C = Correct codes | 2,382 | 95.13
W = Wrong codes | 36 | 1.44
D = Codes assigned to descriptions impossible to be coded | 86 | 3.43
Total | 2,504 | 100.00

As can be seen, the precision rate is higher than 95% and, if analysed per score, 98.09% of the direct matches are correct, which is surely a satisfactory result. In addition, it has been verified that the percentage of correct and incorrect codes is uniformly distributed among all the classes of occurrence.

To assess the precision, the coded descriptions of this sample were submitted to expert coders.

In order to update the Business Register, Istat used different methodologies and sources:

ACTR was involved in the analysis of the descriptions of the Chambers of Commerce. A recall rate of 61%, corresponding to 84,117 coded descriptions, was obtained.

Sector Studies are an administrative source that covers more than 70% of the Business Register; the quality of this source is particularly high. Sector Studies assign a five-digit code through a specific methodology not based on text analysis.

2) Quality test on descriptions of the Chambers of Commerce

For this test:

The descriptions corresponding to codes at maximum level of detail were extracted from the ACTR coded dataset.

They were compared with those assigned through the Sector Studies (it was assumed that coinciding codes could be considered correct, since two different methodologies came to the same conclusion).

The results showed that 67% of the extracted descriptions had equal codes, which can be considered a good indicator of quality.

The quality analysis regarded the remaining descriptions, corresponding to different codes, but, due to their huge quantity (17,746 descriptions), a sample has been extracted.

Due to the characteristics of these texts, it was not considered suitable to adopt the same sampling strategy used for the first quality test. So, frequency classes of descriptions with non-coinciding codes have been defined and then a sample has been extracted proportionally within each class, with a margin of error of ±0.014% (see the sketch below).
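A sketch of how the comparison classes and the proportional allocation of the 4,000 quality-control descriptions can be reproduced; the class totals are those of the comparison table below, while the prefix-counting function and the rounding rule are illustrative assumptions:

```python
def coinciding_digits(actr_code: str, sector_studies_code: str) -> int:
    """Number of leading digits on which two ATECO codes agree."""
    a = actr_code.replace(".", "")
    b = sector_studies_code.replace(".", "")
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Proportional allocation of the 4,000 quality-control descriptions
class_sizes = {"first 4 digits": 2306, "first 3 digits": 3042,
               "first 2 digits": 4185, "first digit": 5040, "none": 3173}
total = sum(class_sizes.values())                                   # 17,746
sample = {k: round(4000 * n / total) for k, n in class_sizes.items()}
print(sample)
# {'first 4 digits': 520, 'first 3 digits': 686, 'first 2 digits': 943,
#  'first digit': 1136, 'none': 715}
```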

2) Quality test on descriptions of the Chambers of Commerce

Comparison between the code assigned through ACTR and through the Sector Studies:

Comparison class | N | % | Quality control sample
ACTR codes – coinciding first 4 digits | 2,306 | 13.0 | 520
ACTR codes – coinciding first 3 digits | 3,042 | 17.1 | 686
ACTR codes – coinciding first 2 digits | 4,185 | 23.6 | 943
ACTR codes – coinciding first digit | 5,040 | 28.4 | 1,136
ACTR codes – not coinciding | 3,173 | 17.9 | 715
Total | 17,746 | 100.0 | 4,000

2) Quality test on descriptions of the Chambers of Commerce

Precision rate

To assess the precision, the coded descriptions of this sample were submitted to expert coders, who classified them as:

(A) correct codes according to ACTR
(C) correct codes according to the Chambers of Commerce
(E) wrong codes according to both methodologies
(D) doubtful codes according to both methodologies

Comparison class | Quality control sample | A: n | A: % | C: n | C: % | E: n | E: % | D: n | D: %
Coinciding first 4 digits | 520 | 489 | 94 | 23 | 4 | 8 | 2 | 0 | 0
Coinciding first 3 digits | 686 | 592 | 86 | 87 | 13 | 7 | 1 | 0 | 0
Coinciding first 2 digits | 943 | 920 | 98 | 8 | 1 | 1 | 0 | 13 | 1
Coinciding first digit | 1,136 | 351 | 31 | 119 | 10 | 666 | 59 | 0 | 0
Not coinciding codes | 715 | 569 | 80 | 99 | 14 | 10 | 1 | 37 | 5
Total | 4,000 | | | | | | | |

As can be seen, the precision rate is high, from 80% to 94% in all classes, apart from the one coinciding only in the first digit (this class is widely populated with very generic descriptions belonging to the Construction sector, which was strongly revised in the new classification).

3) Quality test on special surveys

When ATECO 2007 was almost finalised, it became evident that it was necessary to carry out specific surveys in those sectors where information was not available or the activities included were completely new. In particular, it was decided to send a questionnaire to the enterprises in the fields of:

Information and Communication (section J);
Architectural and engineering activities; technical testing and analysis (division 71);
Research and experimental development on natural sciences and engineering (group 72.1);
Specialised design activities (group 74.1);
Services to buildings and landscape activities (division 81);
Other professional, scientific and technical activities n.e.c.; Office administrative and support activities; Business support services activities n.e.c. (74.9; 82.1; 82.9).

3) Quality test on special surveys

The surveys were sent to around 45 thousand enterprises: all the enterprises with 10 or more employees and a sample of the smallest ones (1 – 9 employees).

The questionnaires were very simple. At the beginning of every questionnaire a description of the economic activity, no longer than 200 bytes, was required.

3) Quality test on special surveys

The response rate was around 30%.

In order to carry out a quality test on these activities, only the questionnaires for which it was possible to attribute an ATECO code by analysing the answers to the survey were considered (around 52%).

The coding rate was not so good (44.5%), but it was not considered a failure, as the survey regarded specific sectors for which it was already known that the dictionary had to be enriched. On the other hand, the precision rate was quite high (88.2%).

The main purpose of the survey was to enrich the ACTR dictionary in order to improve the quality in terms of performance.

Conclusions and results

The performance of the application is satisfactory both in terms of recall rate and precision rate.

The ACTR on WEB tool is proving very successful (an average of 9,000 queries in the first weeks).

The dictionary continues to be enriched using both the descriptions submitted to ACTR on WEB and those written in the special survey questionnaires.