
Corpora Linguistics 23.08.04 1

The Corpógrafo

Belinda Maia & Luís Sarmento

Polo FLUP

LINGUATECA


A bit of history

• PALC ’97 – ‘Do-it-yourself corpora ... with a little bit of help from your friends!’

• CULT 1998 - ‘Making corpora – a learning process’

Contrastive linguistics

Corpora linguistics

Translation teaching

General > specific language


A bit of history

• 2000 – 1st Master’s in Terminology and Translation at FLUP

• PALC 2001 - ‘Training Translators in Terminology and Information Retrieval using Comparable and Parallel Corpora’

Specialized translation and terminology

Contact with domain experts

Importance of IT

Need for technical help for more ambitious students!


A bit of history

• LREC 2002 - ‘Corpora for terminology extraction – the differing perspectives and objectives of researchers, teachers and language services providers’

• 2002 – 2nd Master’s in Terminology and Translation at FLUP

Plea for help to Diana Santos

October 2002

LINGUATECA - Polo FLUP


LINGUATECA

• See http://www.linguateca.pt

• Leader > Diana Santos (SINTEF – Oslo)

• Objective - to create resources and tools for the computational processing of Portuguese

• Nodes at Oslo, Lisbon, Braga and Porto

• Porto - Polo CLUP/FLUP


Polo CLUP/FLUP: General focus

• See http://www.linguateca.pt/poloclup/

• On constructing resources specific to the needs of FLUP/CLUP

– For researchers, teachers and students

– For teaching methodology at FLUP

BNC & Reuters corpora on intranet

A small ‘chat’ corpus

Comparable corpora


More history

• 2003 – Poster of the GC – at CL2003

• 2003 – ‘What are comparable corpora?’ CL2003

• 2003 – Experimentation with evaluation of Machine Translation

• 2003 – Experimentation with GC

• 2003 – 3rd Master’s in Terminology and Translation at FLUP


Polo CLUP/FLUP: Research focus

• See http://www.linguateca.pt/poloclup/

• On-line suite of corpora tools to work with comparable corpora with emphasis on bilingual research

– Focus on special domains

– Construction of terminology databases, ontologies and domain models

Corpógrafo


And ...

• Evaluation of Machine Translation

– Experimentation with evaluation

– Teaching + research focus

– Tools for collecting empirical data

• Results:

– TrAva – MT evaluation tool

– CorTA – Corpus of 1 EN input + 4 MT output sentences


The Corpógrafo results from:

• Terminology, translation and language study and research (Belinda)

• Computational linguistics research and production of resources (Diana)

• Information retrieval and artificial intelligence (Luís)

• Terminology data (Domain experts)

= Discussions on priorities!


GC – Integrated Web Environment for Corpora Linguistics

Motivation

• Lack of Comprehensive, wide-scope Corpora Tools

• Commercial Packages are usually difficult to Integrate/Customize

• Tools are not prepared to support cooperative work.

• Linguistic knowledge is not usually integrated in tools.

What is GC?

GC is a Web tool being developed at Linguateca/CLUP that aims to provide a comprehensive work environment for Corpora-Based Linguistic Research. GC allows users to:

• access several Corpora tools from a single entry point using a regular web browser

• access and query generic Corpora (BNC, Reuters, COMPARA, CETEMPúblico)

• build personal simple, parallel and comparable Corpora from text files (PDF, PS, Word, HTML, TXT)

• use several (on-line/off-line) tools with their personal Corpora (statistics, POS-taggers, Filters, etc.)

• communicate and exchange results with other users

Internet Integration

GC provides seamless integration with the World Wide Web allowing users to:

• search specific Corpora resources on the Internet

• query the web for concordances

• use available translation-engines in parallel.

Diagram (labels only): input formats DOC, HTML, TXT, PS, PDF, RTF; corpora BNC, CETEMPúblico, COMPARA, Others, Personal Corpora; roles USER, ADM, DEV; Custom Interface; Inter-user Communication.

Administrator’s Tasks:

• Users, Groups and Disk Quotas

• Corpora Taxonomy (see box)

• Documentation Organization

• Access Service Statistics

Virtual Desktop

Custom Interface

Tool Pool

• Concordance Engine

• Taggers

• Aligner (Semi-Auto)

• Corpora Bot

• Statistics

• Custom Tools

Internet

Terminology DB

Corpora Taxonomy

• Medium: written, spoken, multimedia

• Domain: Engineering, medicine, etc.

• Genre: scientific, technical, informative, etc.

Terminology Extraction Tool (Auto/Semi-Auto)

Developer’s Tasks:

• Integrate Existing Tools/Resources

• Develop Additional Generic Tools

• Interact with Users/Administrator

• Develop Custom Tools for particular research needs

Inter-User Communication

• Tagging and Aligning Cooperatively

• Messaging Service

• Exchange of Corpora Resources

Teacher’s Tasks:

• Provide on-line tutorials

• Provide links to:

– on-line teaching material

– bibliography and other resources


Working with the Corpógrafo

• Corpógrafo is a suite of integrated tools for INDIVIDUAL or GROUP research

• All research done ONLINE

• Each username/password = separate space on our server

• At present > anyone can work with it using 10 MB space for FREE

• BUT – you get an empty space + tools + tutorial!


Corpora and Terminology

• Special Domain Corpora

• Terminology extraction

• Terminology databases

• Structuring of domain knowledge

• Further corpora and information retrieval


Diagram (labels only): Corpora, Corpora Analysis, Terminology Database, Internet, Text details.


Terminology Prescription or Description?

• Prescriptive > descriptive

• Paper > digital form

• Static > dynamic resources

• ‘Democratization’ of terminology

• ISO standards > socioterminology

• Knowledge structures increasingly recognized as structured but dynamic


Perspectives of terminology users

• Domain experts and vested interests

• Translators

• Information retrieval

• Knowledge engineering

Standardized terminology

The ‘right word’

Finding information

Perfecting Google

Structuring knowledge

Finding it fast


Bridging the Gap

• General linguists

• Translation teachers

• Translation students

• Corpus linguists

• Computational linguists

• Computer engineers

Computer-phobia

Computer-worship


Focus of Corpógrafo

• Design priorities are to:

– See the Big Picture

– Create the Overall Framework

– Get feedback from users

– Develop according to real research needs

– Fill in details and improve techniques as needed


File Manager

Area where each individual or group can:

– Upload texts to space on server

– Convert various text formats to .txt

– ‘Clean’ them of unnecessary material

– Check tokenization and sentence divisions

– Register full information on source, domain and text type

– Group – and re-group – texts into corpora


General corpus analysis

• Concordancing tools allowing for

– Concordancing at sentence level

– KWIC concordancing

– Collocations

• N-gram tool

– Case-sensitive

– Alphabetical or frequency ordering
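The analyses listed on this slide can be sketched in a few lines of Python. This is only an illustration of the general techniques, not the Corpógrafo code: the function names, the naive sentence splitter and the `width` parameter are all assumptions made for the example.

```python
import re
from collections import Counter

def sentences(text):
    """Naive sentence division: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def sentence_concordance(text, keyword):
    """Return every sentence that contains the keyword as a whole word."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [s for s in sentences(text) if pattern.search(s)]

def kwic(text, keyword, width=30):
    """Keyword-in-context rows with `width` characters on each side."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    rows = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        rows.append(f"{left:>{width}}[{m.group(0)}]{right}")
    return rows

def ngram_table(text, n=2, case_sensitive=True, order="frequency"):
    """Count word n-grams, ordered by frequency or alphabetically."""
    tokens = re.findall(r"\w+", text if case_sensitive else text.lower())
    counts = Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if order == "alphabetical":
        return sorted(counts.items())
    return counts.most_common()

sample = "A corpus is a collection of texts. Each corpus has a domain."
for row in kwic(sample, "corpus", width=10):
    print(row)
```

Collocations are left out here; they would need an association measure on top of the n-gram counts.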


Corpora + TDB

• Choose corpus

• Choose related TDB

= All terms, examples, definitions extracted (semi) automatically from corpus and transferred to TDB

= All metadata on texts providing data can be automatically transferred to TDB


Term extraction

• N-grams

– Unfiltered

– Filtered with restrictions on term in PT EN FR IT ES DE

– Filtered with restrictions on term and context in PT EN FR IT ES DE

– Singular + plural terms can be combined

– Existing terms in TDB need not appear
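As a rough illustration of filtered n-gram extraction, the sketch below applies one possible restriction on the term (no stop word at either edge), merges a plural into its singular when the singular also occurs in the text, and skips terms already recorded. The stop-word list, the plural rule and the `known_terms` parameter are assumptions for the example; the slides do not describe the actual filters, and restrictions on context are not shown.

```python
import re
from collections import Counter

# Hypothetical English stop-word list; each language would need its own.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "is", "are", "to", "for"}

def term_candidates(text, n=2, known_terms=frozenset()):
    """Count n-grams that pass a simple restriction on the term itself."""
    # Letters only, so digits and punctuation never enter a candidate.
    # Simplification: n-grams may span sentence boundaries.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    vocabulary = set(tokens)

    def singular(token):
        # Merge a plural into its singular only when the singular is attested.
        if token.endswith("s") and token[:-1] in vocabulary:
            return token[:-1]
        return token

    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        # Restriction on the term: no stop word at either edge.
        if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
            continue
        candidate = " ".join(gram[:-1] + [singular(gram[-1])])
        # Terms already in the terminology database need not appear again.
        if candidate in known_terms:
            continue
        counts[candidate] += 1
    return counts.most_common()

text = ("Kidney machines filter blood. A kidney machine is a device. "
        "The kidney machine works.")
print(term_candidates(text))
```

The ranked list is only a set of candidates; a person still decides which n-grams are terms.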


Term selection from n-grams

• Consultation of list of n-grams

• Check term status of each n-gram via underlying concordances

• Check sources

• Send to TDB


Search for Candidates for Definitions and/or Semantic Relations

• Already possible via TDB

• Under development

• Research areas for Master’s dissertations and research assistants

– Expressions that find definitions

– Expressions that find semantic relations
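One common way to look for definition candidates is to match a term against a handful of lexical patterns. The sketch below assumes that approach; the four English patterns are invented for illustration and are not the expressions under study in the project.

```python
import re

# Hypothetical English patterns; {term} is filled in with the escaped term.
DEFINITION_PATTERNS = [
    r"\b{term} (?:is|are) (?:a|an|the)\b",
    r"\b{term} (?:is|are) defined as\b",
    r"\b{term} (?:refers|refer) to\b",
    r"\bis called (?:a |an |the )?{term}\b",
]

def definition_candidates(sentences, term):
    """Return the sentences in which the term matches a definition pattern."""
    compiled = [
        re.compile(p.format(term=re.escape(term)), re.IGNORECASE)
        for p in DEFINITION_PATTERNS
    ]
    return [s for s in sentences if any(c.search(s) for c in compiled)]

corpus_sentences = [
    "Dialysis is a procedure that removes waste from the blood.",
    "The patient started dialysis in May.",
    "This procedure is called dialysis.",
]
for sentence in definition_candidates(corpus_sentences, "dialysis"):
    print(sentence)
```

Semantic relations could be searched for in the same way with patterns such as "is a kind of" or "is part of", again as an assumption rather than a description of the tool.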


TDB - Terminology database

Databases are designed to be multilingual

– Terms listed alphabetically + language tag

– General data

– Morphological data

– Source metadata: Authors, texts etc

– Definitions + search for candidates

– Translation equivalents

– Semantic relations
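The fields listed above suggest a record along the following lines. This is a hypothetical data model written for illustration; the class and field names are assumptions and say nothing about how the actual TDB is stored.

```python
from dataclasses import dataclass, field

@dataclass
class Source:
    """Source metadata for a text that supplied a term or definition."""
    author: str
    text_title: str

@dataclass
class TermEntry:
    term: str
    language: str                                     # language tag, e.g. "EN" or "PT"
    domain: str = ""                                  # general data
    part_of_speech: str = ""                          # morphological data
    sources: list = field(default_factory=list)       # Source objects
    definitions: list = field(default_factory=list)
    equivalents: dict = field(default_factory=dict)   # language tag -> list of terms
    relations: list = field(default_factory=list)     # (relation, other term) pairs

def sort_key(entry):
    """Order entries alphabetically, then by language tag."""
    return (entry.term.casefold(), entry.language)

entries = [
    TermEntry("kidney", "EN"),
    TermEntry("Dialysis", "EN", equivalents={"PT": ["diálise"]}),
    TermEntry("diálise", "PT", equivalents={"EN": ["dialysis"]}),
]
for entry in sorted(entries, key=sort_key):
    print(entry.term, entry.language)
```

Note that `casefold` compares code points, so accented letters sort after unaccented ones; a real term list would want locale-aware collation.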


Future developments

• General testing and improvement

• Development of new ideas or functions

• Isomorphic relationship between:

– Research possibilities

– Researchers’ needs

– Our skills

• Coordination of individual corpus projects into bigger projects, when possible or necessary


Theoretical questions / problems

• How large is a good domain corpus?

• Comparable corpora v. Parallel corpora?

• How much information does a database need – for information retrieval and knowledge engineering?

• How much does the user of a database need – for translation, teaching etc.?


Corpógrafo and special domains

• Master’s in Terminology and Translation

• Terminology projects with the support of domain specialists in:

– Engineering – Electronics, Mechanical Engineering

– Geography – Population Geography, Natural Hazards – Fire, Floods, Earthquakes, Coastal Erosion

– Medicine - Kidney support machines, Neurology

– Science – Genetics

– Technology – GPS – Global Positioning System


Corpógrafo and terminology/translation research

• Ongoing dissertations on aspects of:

– Terminology – neologisms, definition searches, semantic relations, conceptual analysis

– Corpora – text analysis, corpora construction

– Technical writing > Electrical Appliances

– Localization

– Terminology in documentaries

– Translation of Multimedia


Linguateca

• Linguateca’s policy - all resources and tools freely available online

• Primary users - Portuguese and Brazilian

• Other users also welcome


Polo CLUP/FLUP

• Bi- or multi-lingual in interest

• Corpógrafo available for experiments on a small scale to the general public

• Possibilities of future work on projects with users from other universities and other countries


Corpógrafo team

• Belinda Maia – FLUP – Associate Professor

• Luís Sarmento – Linguateca, FCCN – Computer Engineer – Researcher-in-charge

• Luís Miguel Cabral – Linguateca, FCCN – Computer Engineer, Research assistant

• Débora Oliveira – Linguateca, FCCN – Research assistant

• Ana Sofia Pinto – FLUP – technical assistant
