Corpora Linguistics 23.08.04 1
The Corpógrafo
Belinda Maia & Luís Sarmento
Polo FLUP
LINGUATECA
A bit of history
• PALC ’97 – 'Do-it-yourself corpora ... with a little bit of help from your friends!'
• CULT 1998 - ‘Making corpora – a learning process’
Contrastive linguistics
Corpora linguistics
Translation teaching
General > specific language
A bit of history
• 2000 – 1st Master’s in Terminology and Translation at FLUP
• PALC 2001 - ‘Training Translators in Terminology and Information Retrieval using Comparable and Parallel Corpora’
Specialized translation and terminology
Contact with domain experts
Importance of IT
Need for technical help for more ambitious students!
A bit of history
• LREC 2002 - ‘Corpora for terminology extraction – the differing perspectives and objectives of researchers, teachers and language services providers’
• 2002 – 2nd Master’s in Terminology and Translation at FLUP
Plea for help to Diana Santos
October 2002
LINGUATECA - Polo FLUP
LINGUATECA
• See http://www.linguateca.pt
• Leader > Diana Santos (SINTEF – Oslo)
• Objective - to create resources and tools for the computational processing of Portuguese
• Nodes at Oslo, Lisbon, Braga and Porto
• Porto - Polo CLUP/FLUP
Polo CLUP/FLUP
General focus
• See http://www.linguateca.pt/poloclup/
• On constructing resources specific to the needs of FLUP/CLUP
– For researchers, teachers and students
– For teaching methodology at FLUP
BNC & Reuters corpora on intranet
A small ‘chat’ corpus
Comparable corpora
More history
• 2003 – Poster of the GC at CL2003
• 2003 – ‘What are comparable corpora?’ at CL2003
• 2003 – Experimentation with evaluation of Machine Translation
• 2003 – Experimentation with GC
• 2003 – 3rd Master’s in Terminology and Translation at FLUP
Polo CLUP/FLUP
Research focus
• See http://www.linguateca.pt/poloclup/
• On-line suite of corpora tools to work with comparable corpora, with emphasis on bilingual research
– Focus on special domains
– Construction of terminology databases, ontologies and domain models
Corpógrafo
And ...
• Evaluation of Machine Translation
– Experimentation with evaluation
– Teaching + research focus
– Tools for collecting empirical data
• Results:
– TrAva – MT evaluation tool
– CorTA – Corpus of 1 EN input + 4 MT output sentences
The Corpógrafo results from:
• Terminology, translation and language study and research (Belinda)
• Computational linguistics research and production of resources (Diana)
• Information retrieval and artificial intelligence (Luís)
• Terminology data (Domain experts)
= Discussions on priorities!
GC – Integrated Web Environment for Corpora Linguistics
Motivation
• Lack of comprehensive, wide-scope Corpora tools
• Commercial packages are usually difficult to integrate or customize
• Tools are not prepared to support cooperative work
• Linguistic knowledge is not usually integrated in tools
What is GC?
GC is a Web tool being developed at Linguateca/CLUP that aims to provide a comprehensive work environment for Corpora-Based Linguistic Research. GC allows users to:
• access several Corpora tools from a single entry point using a regular web browser
• access and query generic Corpora (BNC, Reuters, COMPARA, CETEMPúblico)
• build personal simple, parallel and comparable Corpora from text files (PDF, PS, Word, HTML, TXT)
• use several (on-line/off-line) tools with their personal Corpora (statistics, POS-taggers, filters, etc.)
• communicate and exchange results with other users
Internet Integration
GC provides seamless integration with the World Wide Web, allowing users to:
• search specific Corpora resources on the Internet
• query the web for concordances
• use available translation-engines in parallel.
[Architecture diagram. Labels: input formats DOC, HTML, TXT, PS, PDF, RTF; generic corpora BNC, CETEMPúblico, COMPARA, Others; Personal Corpora; Virtual Desktop; Internet; Terminology DB; Terminology Extraction Tool (Auto/Semi-Auto); Inter-user Communication; roles USER, ADM and DEV, each with a Custom Interface]
Tool Pool
• Concordance Engine
• Taggers
• Aligner (Semi-Auto)
• Corpora Bot
• Statistics
• Custom Tools
Corpora Taxonomy
• Medium: written, spoken, multimedia
• Domain: engineering, medicine, etc.
• Genre: scientific, technical, informative, etc.
Administrator’s Tasks:
• Users, Groups and Disk Quotas
• Corpora Taxonomy (see box)
• Documentation Organization
• Access Service Statistics
Developer’s Tasks:
• Integrate Existing Tools/Resources
• Develop Additional Generic Tools
• Interact with Users/Administrator
• Develop Custom Tools for particular research needs
Teacher’s Tasks:
• Provide on-line tutorials
• Provide links to:
– on-line teaching material
– bibliography and other resources
Inter-User Communication
• Tagging and Aligning Cooperatively
• Messaging Service
• Exchange of Corpora Resources
Working with the Corpógrafo
• Corpógrafo is a suite of integrated tools for INDIVIDUAL or GROUP research
• All research done ONLINE
• Each username/password = separate space on our server
• At present > anyone can work with it using 10 MB space for FREE
• BUT - you get an empty space + tools + tutorial!
Corpora and Terminology
• Special Domain Corpora
• Terminology extraction
• Terminology databases
• Structuring of domain knowledge
• Further corpora and information retrieval
[Diagram. Labels: Corpora, Corpora Analysis, Terminology Database, Internet, Text details]
Terminology Prescription or Description?
• Prescriptive > descriptive
• Paper > digital form
• Static > dynamic resources
• ‘Democratization’ of terminology
• ISO standards > socioterminology
• Knowledge structures increasingly recognized as structured but dynamic
Perspectives of terminology users
• Domain experts and vested interests
• Translators
• Information retrieval
• Knowledge engineering
Standardized terminology
The ‘right word’
Finding information
Perfecting Google
Structuring knowledge
Finding it fast
Bridging the Gap
• General linguists
• Translation teachers
• Translation students
• Corpus linguists
• Computational linguists
• Computer engineers
Computer-phobia
Computer-worship
Focus of Corpógrafo
• Design priorities are to:
– See the Big Picture
– Create the Overall Framework
– Get feedback from users
– Develop according to real research needs
– Fill in details and improve techniques as needed
File Manager
Area where each individual or group can:
– Upload texts to space on server
– Convert various text formats to .txt
– ‘Clean’ them of unnecessary material
– Check tokenization and sentence divisions
– Register full information on source, domain and text type
– Group, and re-group, texts into corpora
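The cleaning, sentence-division and tokenization steps listed for the File Manager can be pictured with a short sketch. This is not the Corpógrafo's own code: the function names, the page-number heuristic and the regular expressions are all assumptions made for illustration, and real texts need more careful rules (abbreviations, numbers, quotations).

```python
import re


def clean_text(raw: str) -> str:
    """Drop empty lines and lines that are only digits (assumed to be
    page numbers), then join the rest into one running text."""
    lines = [ln.strip() for ln in raw.splitlines()]
    kept = [ln for ln in lines if ln and not re.fullmatch(r"\d+", ln)]
    return " ".join(kept)


def split_sentences(text: str) -> list[str]:
    """Naive sentence division: split after . ! or ? when followed by
    whitespace and an uppercase letter (hypothetical rule)."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-ZÀ-Ý])", text)
    return [p for p in parts if p]


def tokenize(sentence: str) -> list[str]:
    """Words (keeping internal hyphens and apostrophes) or single
    punctuation marks."""
    return re.findall(r"\w+(?:[-'’]\w+)*|[^\w\s]", sentence)
```

A user would still check the output by eye, which is why the slide lists checking tokenization and sentence divisions as a separate step.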
General corpus analysis
• Concordancing tools allowing for
– Concordancing at sentence level
– KWIC concordancing
– Collocations
• N-gram tool
– Case-sensitive
– Alphabetical or frequency ordering
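As a rough illustration of what KWIC concordancing and an n-gram tool with case-sensitivity and ordering options involve, here is a minimal sketch. It assumes the text is already tokenized into a list of strings; the function names and parameters (`kwic`, `ngrams`, `ordered`, `width`, `by`) are hypothetical and do not describe the Corpógrafo's actual interface.

```python
from collections import Counter


def kwic(tokens, keyword, width=3):
    """Key Word In Context: return (left context, keyword, right context)
    for every occurrence of keyword, matched case-insensitively."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


def ngrams(tokens, n, case_sensitive=True):
    """Count all sequences of n adjacent tokens."""
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ordered(counts, by="frequency"):
    """Order n-grams alphabetically or by descending frequency."""
    if by == "alphabetical":
        return sorted(counts.items())
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

With case sensitivity on, "the corpus" and "the Corpus" are counted as different bigrams; turning it off merges them.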
Corpora + TDB
• Choose corpus
• Choose related TDB
= All terms, examples, definitions extracted (semi) automatically from corpus and transferred to TDB
= All metadata on texts providing data can be automatically transferred to TDB
Term extraction
• N-grams
– Unfiltered
– Filtered with restrictions on term in PT, EN, FR, IT, ES, DE
– Filtered with restrictions on term and context in PT, EN, FR, IT, ES, DE
– Singular + plural terms can be combined
– Existing terms in TDB need not appear
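The filtering options above can be sketched as follows. The slides do not say which restrictions the Corpógrafo applies, so everything here is an assumption: the restriction on the term is taken to be "no stopword at either edge, alphabetic tokens only", the stopword list is a tiny English sample, plural merging is a crude strip of a final "s", and the names `candidate_terms` and `known_terms` are hypothetical. Restrictions on context are not modelled.

```python
from collections import Counter

# Tiny illustrative stopword list (assumption); a real filter would use
# one list per language (PT, EN, FR, IT, ES, DE).
STOPWORDS_EN = {"the", "of", "a", "an", "and", "in", "to", "is", "for", "on"}


def singular(word):
    """Crude English singularization, for illustration only."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def candidate_terms(tokens, n, stopwords=STOPWORDS_EN, known_terms=frozenset()):
    """Count n-grams that pass the term restrictions, merging singular
    and plural forms and skipping terms already in the database."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = [t.lower() for t in tokens[i:i + n]]
        if gram[0] in stopwords or gram[-1] in stopwords:
            continue  # a term should not start or end with a stopword
        if not all(t.isalpha() for t in gram):
            continue  # drop n-grams containing numbers or punctuation
        gram[-1] = singular(gram[-1])  # combine singular + plural
        term = " ".join(gram)
        if term in known_terms:
            continue  # existing TDB terms need not appear
        counts[term] += 1
    return counts
```

The result is only a list of candidates; whether each one really is a term is still decided by a person.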
Term selection from n-grams
• Consultation of list of n-grams
• Check term status of each n-gram via underlying concordances
• Check sources
• Send to TDB
Search for Candidates for Definitions and/or Semantic Relations
• Already possible via TDB
• Under development
• Research areas for Mestrado dissertations and research assistants
– Expressions that find definitions
– Expressions that find semantic relations
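One common way to read "expressions that find definitions" and "expressions that find semantic relations" is as lexical patterns matched against corpus sentences. The sketch below assumes that reading; the English patterns, the relation labels and the function name `find_candidates` are illustrative guesses, not the expressions studied in the dissertations.

```python
import re

# Hypothetical English patterns; "{t}" is replaced by the escaped term.
PATTERNS = {
    "definition": [
        r"{t} is defined as",
        r"{t} (?:is|are) (?:a|an|the)\b",
        r"{t} refers to",
    ],
    "hypernym": [
        r"{t} (?:is|are) a (?:kind|type) of",
        r"such as {t}",
    ],
    "meronym": [
        r"{t} (?:is|are) part of",
        r"{t} consists? of",
    ],
}


def find_candidates(sentences, term):
    """Return (label, sentence) pairs for sentences in which the term
    occurs inside one of the patterns above."""
    t = re.escape(term)
    hits = []
    for sent in sentences:
        for label, patterns in PATTERNS.items():
            if any(re.search(p.format(t=t), sent, flags=re.IGNORECASE)
                   for p in patterns):
                hits.append((label, sent))
    return hits
```

Matches are only candidates: "X is the reason" fits the definition pattern without defining anything, so a human check remains necessary.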
TDB - Terminology database
Databases are designed to be multilingual
– Terms listed alphabetically + language tag
– General data
– Morphological data
– Source metadata: authors, texts, etc.
– Definitions + search for candidates
– Translation equivalents
– Semantic relations
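To make the list of fields concrete, a single multilingual entry could be modelled roughly as below. The class and field names are assumptions chosen to mirror the bullet points; the actual TDB schema is not given in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """Source metadata for a term, definition or example."""
    author: str
    text_title: str


@dataclass
class TermEntry:
    term: str
    language: str                                     # language tag, e.g. "EN", "PT"
    general: dict = field(default_factory=dict)       # general data
    morphology: dict = field(default_factory=dict)    # morphological data
    sources: list = field(default_factory=list)       # list of Source
    definitions: list = field(default_factory=list)   # definition texts
    equivalents: dict = field(default_factory=dict)   # language tag -> list of terms
    relations: list = field(default_factory=list)     # (relation, target term) pairs


def listing(entries):
    """Terms listed alphabetically, each with its language tag."""
    ordered = sorted(entries, key=lambda e: (e.term.lower(), e.language))
    return [f"{e.term} [{e.language}]" for e in ordered]
```

For example, an English entry for "kidney" could hold `equivalents={"PT": ["rim"]}`, and `listing` would show it as `kidney [EN]`.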
Future developments
• General testing and improvement
• Development of new ideas or functions
• Isomorphic relationship between:
– Research possibilities
– Researchers’ needs
– Our skills
• Coordination of individual corpus projects into bigger projects, when possible or necessary
Theoretical questions / problems
• How large is a good domain corpus?
• Comparable corpora v. Parallel corpora?
• How much information does a database need – for information retrieval and knowledge engineering?
• How much does the user of a database need – for translation, teaching etc.?
Corpógrafo and special domains
• Master’s in Terminology and Translation
• Terminology projects with the support of domain specialists in:
– Engineering – Electronics, Mechanical Engineering
– Geography – Population Geography, Natural Hazards (Fire, Floods, Earthquakes, Coastal Erosion)
– Medicine – Kidney support machines, Neurology
– Science – Genetics
– Technology – GPS (Global Positioning System)
Corpógrafo and terminology/translation research
• Ongoing dissertations on aspects of:
– Terminology – neologisms, definition searches, semantic relations, conceptual analysis
– Corpora – text analysis, corpora construction
– Technical writing > Electrical Appliances
– Localization
– Terminology in documentaries
– Translation of Multimedia
Linguateca
• Linguateca’s policy - all resources and tools freely available online
• Primary users - Portuguese and Brazilian
• Other users also welcome
Polo CLUP/FLUP
• Bi- or multi-lingual in interest
• Corpógrafo available for experiments on a small scale to the general public
• Possibilities of future work on projects with users from other universities and other countries
Corpógrafo team
• Belinda Maia – FLUP – Associate Professor
• Luís Sarmento – Linguateca, FCCN – Computer Engineer, Researcher-in-charge
• Luís Miguel Cabral – Linguateca, FCCN – Computer Engineer, Research assistant
• Débora Oliveira – Linguateca, FCCN – Research assistant
• Ana Sofia Pinto – FLUP – Technical assistant
Contacts
If you are interested in finding out more, please contact:
Belinda Maia at [email protected]
or
Luís Sarmento at [email protected]
The Corpógrafo can be used (with a username and password) at:
http://www.linguateca.pt/corpografo
and
http://poloclup.linguateca.pt/ferramentas/gc