31
Bilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

  • Upload
    dokhue

  • View
    226

  • Download
    4

Embed Size (px)

Citation preview

Page 1: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

Bilingual Terminology Extraction from TMX A State-of-the-Art Overview

Chelo Vargas-Sierra, PhD

University of Alicante,

Spain

Page 2: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

2

Key words Overview of terms

involved in the

process

1st point 2nd point 3rd point 4th point

Evaluation BATE under evaluation

Measures for accuracy

Quality in use model and tasks

Terminology and extractors Terminology management

Its timeline

BATE (approaches, state of the art)

Results Precision & Recall

Parameters & Questionnaire

INDEX Main points of this presentation

Page 3: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

Parallel corpus TMX

Alignment levels Paragraph, sentence and word

level

ATE & BATE

Precision/Recall Getting only terms and all terms

Gold standard Exhaustive, manually created

bilingual glossary

Validation * Term validation facility

* Which TCs are real terms?

Usability Software used to achieve

user’s objectives with

effectiveness, efficiency,

and satisfaction

Quality in use model ISO standard

KEY WORDS Terms involved in the process

Page 4: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

2. Terminology &

Extractors

Page 5: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

5

IDENTIFY

FIND RETRIEVE

the terminology in the source text adequately

Identify and interpret

terminological data

Retrieve and store

proper documentation and

information resources

Find and use

IMPORTANCE OF TERMINOLOGY Translators were the first professionals to be aware of term-related issues

Page 6: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

6 6

Time spent to solve terminological problems (Arntz 1993,

Walker 1993). +40%

In specialized translation

TERMINOLOGY MANAGEMENT

Page 7: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

7 7

Managing terminology (extracting, validating, importing, adding, editing, deleting,

revising, updating, exporting, publishing) is a time-comsuming process.

Time spent to solve terminological problems (Arntz 1993, Walker 1993). +40%

In specialized translation

TERMINOLOGY MANAGEMENT

Page 8: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

8 8

Managing terminology (extracting, validating, importing, adding, editing, deleting, revising,

updating, exporting, publishing) is a time-comsuming process.

Time spent to solve terminological problems (Arntz 1993, Walker 1993). +40% In specialized translation

TERMINOLOGY MANAGEMENT

Terminology work is “on backstage”, and customer or

employers may not be fully aware of their befefits for QA.

Page 9: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

9 9

Managing terminology (extracting, validating, importing, adding, editing, deleting, revising,

updating, exporting, publishing) is a time-comsuming process.

Time spent to solve terminological problems (Arntz 1993, Walker 1993). +40% In specialized translation

TERMINOLOGY MANAGEMENT

Return on Investment (ROI) on terminology management

reported by some corporate studies (Childress, 2007;

Popiolek, 2015)

90%

Terminology work is “on backstage”, and customer or employers may not be fully aware of

their befefits for QA.

Page 10: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

10 10

TERMINOLOGY MANAGEMENT

Extraction

• List of terms extracted from ST

• List of terms to validate (accept or reject)

Translation

• List is added to a termbase

• List is translated and additional data added

Approval

• List approved by a person in charge of terminology

• When the client has requested there is an addtional step for client approval

General model por project terminology

creation (Popiolek, 2015: 351) Monolingual

extraction &

validation

Importing &

looking for

equivalents

Page 11: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

11

Preparing the files and import

them into the BATE

Preparation: TMX import

List of candidate term pairs

extracted from TMX

Bilingual extraction

TIMELINE in Terminology Management with bilingual extraction

Page 12: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

12

- List of pair of terms to validate (accept

or reject terms and suggested

equivalents)

- Term by term and additional data are

added to a term base (Synchroterm)

Validation (& data entry)

- Export bilingual terms and additional

data in an available file format (.xls,

.txt, .TBX, …)

- Import output file to a TDB system

(to be integrated into a MT System)

Export/Import

Page 13: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

13

Person in charge of terminology

or client

Approval

Ready to use

Finish

Page 14: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

14

Bilingual Automatic Term Extractors Two approaches (Foo, 2012)

EXTRACT-ALIGN

1ST step: monolingual terminology extraction

in both languages.

2nd step: cross-linguistic matching using

word-alignment or co-occurrence statistics to

find equivalents.

Commercial systems in this approach

Page 15: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

15

ALIGN-FILTER

1ST step: word-alignment on the

parallel texts.

2nd step: rank the aligned units to

finally select the most likely pair of

candidates (statistics)

TExSIS (Macken et al, 2013)

Bilingual Automatic Term Extractors Two approaches (Foo, 2012)

Page 16: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

16

Bilingual Automatic Term Extractors Academic / In-house

- English-French TERMIGHT (Dagan & Church, 1994)

- English-French (Kupiek, 1993)

- English-Dutch (Eijk, 1993)

- English-French (Gaussier, 1995)

- English and Swedish (Ahrenberg et al., 1998)

- French-Japanese (Morin et al 2010, from

ACABIT, Daille, 2003): not bilingual, but

multilingual

- Slovene and English, Luiz (Vintar, 2010);

- English and Swedish ITools suite (Foo &

Merkel, 2010)

- English and German (Gojun et al., 2012).

- English, French, German, Spanish, TTC

TermSuite (Daille, 2012)

- English-Spanish TBXTools (Oliver &

Vázquez, 2015) (under development)

- Chinese, Czech, Dutch, English, French,

German, Italian, Japanese, Korean, Polish,

Portuguese, Russian, Spanish: Sketch

Engine (Baisa et al 2015, Koval et al 2016)

- French-German (Blank, 2000)

- Japanese-English, MNH (Nakagawa & Mori, 2003)

- Spanish-Basque, Elexbi (Hernaiz et al., 2006),

from a TMX;

- Spanish-German, Autoterm (Haller, 2008);

- English-Spanish, Mutual Bilingual Term

Extractor (Ha et al, 2008)

- French-English, French-Italian and French-Dutch

(Lefever et al., 2009)

90s

2000-2009

2010 -2016

Page 17: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

17

Bilingual Automatic Term Extractors Other BATE (free / comercial)

- TermExtractor (Shimohata et al 2001)

- MemoQ's built-in term extractor

- Déjà Vu - Lexicon

- TermoStat Web: http://termostat.ling.umontreal.ca/

- Yate (IULA)

- Okapi

- TerMine:

http://www.nactem.ac.uk/software/termine/

- TerminologyExtractor: https://goo.gl/yA2Cuf

- PRoMT

- FiveFilters (web-based): http://fivefilters.org/term-

extraction/

- Concordace programs: WordSmith Tools,

AntConc (free), …

90s 2010 -2016 MONOLINGUAL ATE

- Xerox Terminology Suite (2001)

- SDL Multiterm Extract

- Synchroterm

- CrossMining (Across)

- MultiTrans Term Extractor

- Similis™ (by Lingua et Machina™)

- Anchovy (by Swordfish)

- Araya Term Extractor

- Analysis software: Sketch Engine

(terminology extraction from TMX)

BILINGUAL

Page 18: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

3. Evaluation

Page 19: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

19 BATE UNDER EVALUATION Sketch Engine

SIMILIS

Multiterm Extract

Synchroterm

Page 20: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

20

Multiterm Extract

SynchroTerm Similis SkE Araya

Import TMX

Extraction config.

Extraction scores

Validation facility

Term base indexation

Export to TBX (xls, txt…)

Trados TMX

MAIN FEATURES

Others Others

Page 21: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

21

TERMS

NO TERMS

EXTRACTED NON-EXTRACTED

A B

C D

RECALL =𝐴

𝐴+𝐵

PRECISION =𝐴

𝐴+𝐶

MEASURES FOR ACCURACY

Page 22: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

Context coverage degree to which the

product understands the

complete context of its

usage. Flexibility

Effectiveness

accuracy and completeness

with which user achieves

objectives

Satisfaction

Efficiency resources expended in

relation to the accuracy and

completeness

Freedom from risk no risk for the security of

users, software, context or the

environment

degree to which user needs are

satisfied when a software is

used in a specified context of use

QUALITY IN USE MODEL Characteristics (ISO-IEC 25010: 2011)

Page 23: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

23

Setting up the

extraction project

CONFIGURATION

Importing the source file TMX IMPORT

Performing the

extraction to get a

bilingual list

EXTRACTION

Selecting the real terms. VALIDATION

Creating and managing

term entries

RECORD CREATION

Exporting the final result for

later use in CAT Systems

EXPORTATION

6 TASKS TO EVALUATE when performing bilingual extraction

Page 24: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

4. Results

Page 25: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

25

28.30

43.33

10.66

14.85

21.29

62.33

45.42

51.61

0.00

10.00

20.00

30.00

40.00

50.00

60.00

70.00

PRECISION RECALL

PRECISION & RECALL IN %

Sketch MTE Synchr Similis

EXTRACTED NON-EXTRACTED

TERMS NO TERMS TERMS NO TERMS GOLD

STANDARD TCs PRECISION RECALL

A C B D

Sketch 283 717 370 653 1,000 28,30 43,34

MTE 97 813 556 910 10,66 14,85

SynchroT. 407 1505 246 1,912 21,29 62,33

Similis 337 405 316 742 45,42 51,61

Page 26: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

26

Characteristics and sub-characteristics to be measured METRICS

EFFECTIVENESS Value between 0 (minimum) and 5 (maximum) (EFE1+EFE2+EFE3)/3

EFE1.- Degree of accuracy – precision of tasks & results (P1+P7+P13+P19+P25+P31)/6

EFE2.- Degree of completeness (tasks are accomplished and

results are not missing) (P2+P8+P14+P20+P26+P32)/6

EFE3.- Frequency of errors (P3+P9+P15+P21+P27+P33)/6

EFICIENCY Value between 0 (minimum) and 5 (maximum) (EFI2+EFI3+EFI4)/3

EFI1.- Time spent in the accomplishment of the task. (TM1+TM2+TM3+TM4+TM5+TM6)

EFI2.- Need to use additional sources (material, software, etc.)

for the task (P4+P10+P16+P22+P28+P34)/6

EFI3.- Productivity – effort exerted by the user to carry out the

task (P5+P11+P17+P23+P29+P35)/6

EFI4.- Need to consult the software Help to perform the task (P6+P12+P18+P24+P30+P36)/6

SATISFACTION Value between 0 (minimum) and 5 (maximum)

(P37+P38+P39)/3

SAT1.- Usefulness

SAT2.- Trust

SAT3.- Pleasure

CONTEXT COVERAGE Value between 0 (minimum) and 5 (maximum)

(P40+P41+P42)/3 COB1.- Context of use

COB2.- Flexibility

PARAMETERS

Page 27: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

27

QUESTIONNAIRE 42 questions grouped by tasks

Page 28: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

28

16

13 14

25

20

26

21

24

0

5

10

15

20

25

30

EXTRACTION VALIDATION

RESULTS FOR EXTRACTION & VALIDATION

Sketch MTE Synchr Similis

3.33 3.00

4.00 3.50

13.83

4.06 4.44

3.00

1.50

13.00

4.11 4.22 4.33

3.00

15.67

3.72 3.11 3.00 3.00

12.83

0.00

2.00

4.00

6.00

8.00

10.00

12.00

14.00

16.00

18.00

EFFECTIVENESS EFFICIENCY SATISFACTION CONTEXT COVERAGE TOTAL QIU

FINAL RESULTS FOR QUALITY IN USE

Sketch MTE Synchr Similis

Page 29: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

29

CONCLUSIONS

• Managing terminology still takes a lot of time and effort, even in

this increasingly computerized profession.

• Research on automatic terminology extraction has been

around for more than 20 years and significant enhancements

concerning bilingual extraction and bilingual corpora

exploitation have been introduced.

• I briefly described the BATE under evaluation and illustrated

some results obtained for accuracy and with the QIU model.

• Results make it clear that much more work has to be done for

BATE to be considered of real help to translators and

terminologists, mainly due to poor accuracy results.

Page 30: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

References

• Baisa, Vit, Barbora Ulipová, and Michal Cukr. 2015. “Bilingual Terminology Extraction in Sketch Engine.” In 9th

Workshop on Recent Advances in Slavonic Natural Language Processing (RASLAN 2015), 61–67.

• Childress, Mark D. 2007. “Terminology Work Saves More Time than It Cost.” Multilingual, no. April/May: 43–46.

• Foo, Jody. 2012. Computational Terminology : Exploring Bilingual and Monolingual Term Extraction.

• Foo, Jody; Merkel, Magnus. 2010. “Computer Aided Term Bank Creation and Standardization. Building Stardardize

Term Banks through Automated Term Extraction and Advanced Editing Tools.” In Terminology in Everyday Life,

edited by Marcel Thelen and Fireda Steurs, 163–80. John Benjamins Publishing Company. doi:

10.1075/tlrp.13.12foo.

• Kovář, Vojtěch, Vít Baisa, and Miloš Jakubíček. 2016. “Sketch Engine for Bilingual Lexicography.” International

Journal of Lexicography 29 (3): 339–52. doi:10.1093/ijl/ecw029.

• Macken, Lieve, Els Lefever, and Veronique Hoste. 2013. “TExSIS: Bilingual Terminology Extraction from Parallel

Corpora Using Chunk-Based Alignment.” Terminology 19 (2013): 1–30. doi:10.1075/term.19.1.01mac.

• Oliver, Antoni, and M. Vazquez. 2015. “TBXTools: A Free, Fast and Flexible Tool for Automatic Terminology

Extraction.” In Proceedings of Recent Advances in Natural Language Processing, 473–79.

• Popiolek, Monika. 2015. “Terminology Management within a Translation Quality Assurance Process.” In Handbook

of Terminology (Volume 1), edited by Hendrik J Kockaert and Frieda Steurs, 341–59. John Benjamins Publishing

Company. doi:10.1075/hot.1.ter6.

• Sauron, Véronique. 2002. “Tearing out the Terms : Evaluating Terms Extractors.” In Translating and the Computer

24: Proceedings from the Aslib Conference, 21-22 November 2002.

• Vintar, Špela. 2010. “Bilingual Term Recognition revisited<BR> The Bag-of-Equivalents Term Alignment Approach

and Its Evaluation.” Terminology 16 (2010): 141–58. doi:10.1075/term.16.2.01vin.

Page 31: Bilingual Terminology Extraction from TMX · PDF fileBilingual Terminology Extraction from TMX A State-of-the-Art Overview Chelo Vargas-Sierra, PhD University of Alicante, Spain

University of Alicante

IULMA

Campus de San Vicente

Apdo. 99 03080 Alicante

Phone & Fax

Direct Line: +34 965903438

Fax: +34 965903800

[email protected]

Social Media

@chelovargas

Many thanks for your attention Chelo Vargas-Sierra