+ Lexical Resources for Portuguese Valeria de Paiva (joint work with Alexandre Rademaker, Gerard de Melo and Livy Real)

Lexical Resources for Portuguese

Slides describing the work that led to Alexandre Rademaker's presentation in GWC 2014

Page 1: Lexical Resources  for Portuguese

+ Lexical Resources for Portuguese

Valeria de Paiva (joint work with Alexandre Rademaker, Gerard de Melo and Livy Real)

Page 2: Lexical Resources  for Portuguese

+WordNet?

http://wordnetweb.princeton.edu/

Page 3: Lexical Resources  for Portuguese

+Why this talk?...

Page 4: Lexical Resources  for Portuguese

+WordNet…

• WordNet has been developed at Princeton University under George A. Miller since 1985. It is a lexical database for English: it groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synsets.

• This produces a combination of dictionary and thesaurus that is intuitive and usable, and that supports automatic text analysis and artificial intelligence applications. It is released under a BSD-style license and can be downloaded and used freely.

• WordNet is the most commonly used computational lexicon of English.

• Some complain that WordNet encodes sense distinctions that are too fine-grained even for humans. The granularity issue has been tackled with clustering methods that automatically group together similar senses of the same word.

Page 5: Lexical Resources  for Portuguese

+Global WordNet Association

• Christiane Fellbaum and Piek Vossen (EuroWordNet 1996-1999, GWA since)

• The Global WordNet Association (GWA) is a free, public and non-commercial organization that provides a platform for discussing, sharing and connecting wordnets for all languages in the world.

• Global WordNet Grid since 2006. Open Multilingual Wordnet (Francis Bond): http://casta-net.jp/~kuribayashi/multi/

• A simple user interface: Welcome to the Open Multi-lingual Wordnet (1.0) http://casta-net.jp/~kuribayashi/cgi-bin/wn-multi.cgi

Page 6: Lexical Resources  for Portuguese

+Multilingual Wordnet 1.0

Page 7: Lexical Resources  for Portuguese

+ OpenWordnet-PT? (aren’t all wordnets open?)

We need a Portuguese Wordnet for our work, but none of the previous projects is openly available.

Previous work: WordNet.PT and WordNet.PT global (Lisboa), MultiWordNet.PT, and the Brazilian WordNet (WN.Br) by Bento Dias da Silva.

Page 8: Lexical Resources  for Portuguese

+Previous Portuguese WordNets…

• WordNet.PT (P. Marrafa): since 1999, part of EuroWordNet, 19K expressions, manually curated, online consultation only, some domains.

• MWN.PT, the MultiWordnet of Portuguese (A. Branco): since 2008, part of MWN, over 17,200 manually validated concepts/synsets, not free.

• WN.Br (B. Dias da Silva): since 2000, not open, not available online; REBECA (2010) covers only ‘wheeled vehicles’…

Page 9: Lexical Resources  for Portuguese

+ OpenWN-PT: What?

• Leverage EuroWordNet, MultiWordNet, Global WordNet experience

• Recruited Gerard de Melo for the project

• Leverage YAGO, UWN/MENTA experience…

• UWN/MENTA (de Melo/Weikum): a large-scale multilingual lexical knowledge base built using statistical methods, transforming WordNet into a massively multilingual resource (over 1 million words and several million named entities in a single large multilingual taxonomy)

• The Portuguese ‘projection’ of UWN/MENTA is the basis of the automatically built version of OpenWordNet-PT, which is publicly available.

https://github.com/arademaker/wordnet-br

Page 10: Lexical Resources  for Portuguese

+ OpenWN-PT: the basis…

https://github.com/arademaker/wordnet-br

We combined the following data:

• Princeton WordNet 3.0, used to obtain English glosses and English terms for synset IDs.

• The unreleased 2010-12 version of UWN and MENTA, which provided candidate terms in Portuguese, candidate glosses in Portuguese (from Wikipedia), and candidate terms in Spanish.

• The EuroWordNet base concept list (5000_bc.xml), which provides the base concept numbers. The original file was mapped from WordNet 2.0 to 3.0 using the mappings from WN-Map. When multiple mappings for a WordNet 2.0 synset existed, all possible WordNet 3.0 synsets were kept.

http://nlp.lsi.upc.edu/web/index.php?option=com_content&task=view&id=21&Itemid=57
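The "keep all possible WordNet 3.0 synsets" rule can be illustrated with a minimal Python sketch. The mapping table, the synset IDs, and the function name `map_base_concepts` are all hypothetical and chosen only for illustration; the real WN-Map files have their own format, which is not reproduced here.

```python
# Hypothetical WordNet 2.0 -> 3.0 synset mapping (IDs invented for illustration).
WN20_TO_WN30 = {
    "00001740-n": ["00001740-n"],                 # one-to-one mapping
    "00002086-n": ["00002137-n", "00002452-n"],   # one-to-many mapping
}


def map_base_concepts(bc_ids_20, mapping):
    """Map WordNet 2.0 base-concept IDs to WordNet 3.0 IDs.

    Every 3.0 target of a 2.0 synset is kept (no disambiguation);
    2.0 IDs without any mapping are reported separately.
    """
    mapped, unmapped = [], []
    for old_id in bc_ids_20:
        targets = mapping.get(old_id, [])
        if not targets:
            unmapped.append(old_id)
        for new_id in targets:
            if new_id not in mapped:  # avoid duplicates, keep first-seen order
                mapped.append(new_id)
    return mapped, unmapped


mapped, unmapped = map_base_concepts(
    ["00001740-n", "00002086-n", "99999999-n"], WN20_TO_WN30
)
print(mapped)    # all 3.0 targets, including both targets of the ambiguous synset
print(unmapped)  # 2.0 IDs for which the table has no entry
```

Keeping every target trades some precision for coverage: an ambiguous 2.0 synset may mark more than one 3.0 synset as a base concept.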

Page 11: Lexical Resources  for Portuguese

+OpenWN-PT: the method

• A two-tiered methodology: high precision for the more frequent words of the language, but also high recall to cover a wide range of words in the long tail.

• Translation dictionaries map the English members of a synset to possible Portuguese translation candidates. To disambiguate and choose the correct translations, feature vectors for possible translations are created by computing graph-based statistics in the graph of words, translations, and synsets. Monolingual wordnets and parallel corpora are used to enrich this graph. Statistical learning techniques are used to iteratively refine this information and build an output graph connecting Portuguese words to synsets.

• Wikipedia pages are then linked to relevant WordNet synsets by learning from similar graph-based features as well as gloss similarity scores.
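As a rough illustration of the candidate-generation step, the sketch below scores each Portuguese candidate by a single hand-made graph statistic: the fraction of English synset members whose dictionary translations include it. This is a deliberately simplified stand-in, not the learned model used in UWN/MENTA; the toy dictionary, the function name `candidate_scores`, and the 0.5 threshold are assumptions made for the example.

```python
def candidate_scores(synset_members, translations):
    """Score Portuguese candidates for one synset.

    The score of a candidate is the fraction of English synset members
    that translate to it: one simple graph-based feature.
    """
    counts = {}
    for en_word in synset_members:
        # set() so a dictionary listing a translation twice counts once
        for pt_word in set(translations.get(en_word, [])):
            counts[pt_word] = counts.get(pt_word, 0) + 1
    n = len(synset_members)
    return {pt_word: c / n for pt_word, c in counts.items()}


# Toy English -> Portuguese translation dictionary (hypothetical entries).
TRANSLATIONS = {
    "car": ["carro", "automóvel", "vagão"],
    "automobile": ["automóvel", "carro"],
    "auto": ["carro", "auto"],
}

scores = candidate_scores(["car", "automobile", "auto"], TRANSLATIONS)
# Keep candidates supported by at least half of the synset members.
accepted = sorted(pt for pt, s in scores.items() if s >= 0.5)
print(accepted)
```

A candidate such as "vagão", supported by only one member of the synset, falls below the threshold, which is the kind of disambiguation the feature vectors are meant to support.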

Page 12: Lexical Resources  for Portuguese

+More method…

• To have high precision for the most important concepts of a language, rely on human annotators.

• A set of 4,689 “Common Base Concepts” from the GWA.

• 2,498 manually entered sense-word pairs as well as an additional 1,299 manually written Portuguese synset glosses.

• Does it work?

Page 13: Lexical Resources  for Portuguese

+OpenWN-PT: some numbers…

Column (3): synsets with Portuguese words

Page 14: Lexical Resources  for Portuguese

+OpenWN-PT: some numbers…

But how good are these entries? How to measure? How to improve?

Page 15: Lexical Resources  for Portuguese

+ OpenWN-PT: what does it look like?

• Typical good entry with minor manual improvements.

• The automatic step produces candidate Portuguese words for some of the WN 3.0 synsets.

• We check the suggested words and add a Portuguese gloss and examples.

Page 16: Lexical Resources  for Portuguese

+ OpenWN-PT: what does it look like?

Good automatic suggestion

Not very useful

Page 17: Lexical Resources  for Portuguese

+OpenWN-PT: some issues…

Capitalized items, plurals, duplicates, a few gender issues, missing items…

Page 18: Lexical Resources  for Portuguese

+ OpenWN-PT: true lexical gaps?...

Page 19: Lexical Resources  for Portuguese

+ OpenWN-PT: manual revisions

Native speakers, but not linguists… Plenty of errors…

Page 20: Lexical Resources  for Portuguese

+OpenWN-PT: RDF Representation

• Why? To address the issue of interoperability between wordnets. To be able to rely on Linked Data and Semantic Web standards such as RDF and OWL.

• The emergence of Linked Data projects for lexical and reasoning resources motivated encoding and distributing OpenWN-PT in RDF/OWL.

• These standards allow both the data model and the actual data to be expressed in the same format. They also bring a range of existing data processing tools, including databases (“triple stores”) with SQL-like query interfaces (SPARQL).

• Standard W3C encoding of WordNet in RDF since 2006. OpenWN-PT is modelled after and fully interoperable with Princeton WordNet.

• This means that one can easily find Portuguese equivalents for specific English word senses and vice versa. It also means OpenWN-PT is part of a large ecosystem of compatible resources, including domain identifiers and mappings to Wikipedia.
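To make the interoperability point concrete, here is a minimal Python sketch that stores a few triples in a list and looks up the Portuguese words for an English synset. The prefixes (`wn30en:`, `wn30pt:`), the predicates (`ex:containsWord`, `ex:sameSynsetAs`), and the sample data are invented for illustration; they are not the actual OpenWN-PT vocabulary, and a real setup would run an equivalent SPARQL query against a triple store.

```python
# Toy triple set: (subject, predicate, object). All identifiers are hypothetical.
TRIPLES = [
    ("wn30en:synset-dog-n-1", "ex:containsWord", "dog"),
    ("wn30pt:synset-dog-n-1", "ex:containsWord", "cão"),
    ("wn30pt:synset-dog-n-1", "ex:containsWord", "cachorro"),
    ("wn30pt:synset-dog-n-1", "ex:sameSynsetAs", "wn30en:synset-dog-n-1"),
]


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(triples, predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the graph."""
    return [s for s, p, o in triples if p == predicate and o == obj]


def portuguese_words(triples, english_synset):
    """Follow the cross-lingual link, then collect the Portuguese words."""
    words = []
    for pt_synset in subjects(triples, "ex:sameSynsetAs", english_synset):
        words.extend(objects(triples, pt_synset, "ex:containsWord"))
    return words


print(portuguese_words(TRIPLES, "wn30en:synset-dog-n-1"))
```

The lookup is a two-step graph pattern (link, then words), which is the shape a SPARQL query over the RDF encoding would also take.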

Page 21: Lexical Resources  for Portuguese

+Progress Report

• Checking is much easier than starting from scratch.

• But it is long and tedious work to check even the initial 5k synsets suggested by GWA (not done, yet!), let alone all synsets in OpenWN-PT.

• Necessary? YES! Lexical gaps of all sorts.

• But the resource is being used, warts and all…

• Improving the resource: new data from Bond/Foster and some manual additions.

Page 22: Lexical Resources  for Portuguese

+Use Cases: FreeLing

• Word Sense Disambiguation via FreeLing 3.0, an open source suite of language analyzers

• OpenWN-PT has been incorporated into FreeLing (Padró and Stanilovsky, 2012)

• Using FreeLing’s word sense disambiguation framework, a given Portuguese text can automatically be annotated with word senses.

• UPC, Barcelona

Page 23: Lexical Resources  for Portuguese

+Use Cases: Sentiment Analysis

• Sentiment Analysis, using tweets about soccer games

• OpenWN-PT and SentiWordNet used as a point of comparison for the machine-learning-based sentiment analysis integrated into the IBM InfoSphere Streams (ISS) platform.

• 1 million tweets, 4 friendly matches of the Brazilian team in 2013, 7 classes of positivity

• IBM Research, BR

Page 24: Lexical Resources  for Portuguese

+Use Cases: NomLex-Br (Livy Real)

• An extension of OpenWN-PT that aims at incorporating links to connect deverbal nouns with their corresponding verbs.

• For English, NOMLEX (Macleod et al., 1998) has provided extensive descriptions of nominalizations via extensions of an initial core.

• NOMLEX was constructed starting out with nominalizations with the suffixes -ion, -ment and -er, taking samples of the most frequent words first in a list of nouns from a combination of the Brown Corpus and the Wall Street Journal (about 1 million words of each).

• NOMLEX-BR: translation of the initial core, plus French Nomage

• Overall, we have created over 2,000 entries. These have been integrated into OpenWN-PT, which will facilitate their use for linguistic research as well as information extraction.

The destruction of the city by Alexander in 330BC…

Page 25: Lexical Resources  for Portuguese

+Use Cases: NomLex-Br (Livy Real)

• Incorporating NOMLEX-BR data into OpenWN-PT has proved useful in pinpointing some issues with the coherence and richness of OpenWN-PT.

• The word abasement corresponds in NOMLEX to the verb abase, and thus we would like a similar correspondence between the Portuguese noun aviltamento and the verb aviltar (our suggested translations). OpenWN-PT simply has two synsets, humilhar, abaixar and humilhar, rebaixar. The more common verb humilhar is repeated, while the uncommon aviltar was left out.

• Other useful kinds of relationships between parts of speech (say, the connections between adjectives and adverbs) are likely to also help to improve the accuracy and richness.

Page 26: Lexical Resources  for Portuguese

+Miscellaneous Experiments

• Coverage: using DHBB to complete NOMLEX-BR.

• Other paper…

• Accuracy: we chose six relations: hypernymOf, memberHolonymOf, instanceOf, substanceHolonymOf, entails and causes.

• For each of these relations, we randomly chose 30 pairs of synsets and then random words from each synset. We ended up with 180 random sentences for manual verification.

• The linguist marked each sentence as “correct”, “wrong” or “dubious”: 150 sentences were marked correct (83% of the sentences), 17 wrong, and 13 dubious.

• A more systematic effort is needed, but the results were encouraging.
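The figures in the accuracy experiment can be checked with a few lines of Python. The counts are the ones reported on the slide; the variable names are only illustrative.

```python
RELATIONS = [
    "hypernymOf", "memberHolonymOf", "instanceOf",
    "substanceHolonymOf", "entails", "causes",
]
PAIRS_PER_RELATION = 30

# Judgements reported for the manually verified sentences.
judgements = {"correct": 150, "wrong": 17, "dubious": 13}

# 6 relations x 30 synset pairs each = 180 sentences.
total = len(RELATIONS) * PAIRS_PER_RELATION
assert sum(judgements.values()) == total

# Share of each label, in percent, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in judgements.items()}
print(total, shares)
```

The three labels add up to the 180 sampled sentences, and 150 out of 180 rounds to the 83% quoted above.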

Page 27: Lexical Resources  for Portuguese

+ Conclusions

• We discussed the implementation and some applications of OpenWordNet-PT, an open WordNet for Brazilian Portuguese.

• Recent improvements include better coverage and nominalization links connecting nouns and verbs.

• The resource has been used in developing a high-throughput commercial system as well as in a cultural heritage project, and we anticipate that numerous further applications will follow.

• The data is freely available from http://github.com/arademaker/wordnet-br/ and via a SPARQL endpoint at logics.emap.fgv.br:10035.

• Browsing via the Open Multilingual Wordnet, http://www.casta-net.jp/~kuribayashi/cgi-bin/wn-multi.cgi, is fun.

Page 28: Lexical Resources  for Portuguese

+ OpenWN-PT: next steps?..

• First finish translating the “core” synsets in the Princeton WordNet to Portuguese.

• Increase the number of relations in OpenWN-PT as a way of improving adequacy and coherence.

• Adding the Portuguese terms that satisfy different relations?

• OpenVerbNet-PT?

• Since we have a first target corpus, the Brazilian Historical Biographic Dictionary, we can also calculate word frequency to prioritize expansion of the OpenWN-PT and go back to the ontology building...
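The frequency-based prioritization mentioned in the last item could look roughly like the Python sketch below. It counts corpus tokens that are not yet covered by the wordnet and ranks them by frequency. The sample sentence, the `covered` set, and the function name are made up for illustration, and the sketch skips lemmatization, which a real Portuguese pipeline would need.

```python
import re
from collections import Counter


def missing_words_by_frequency(corpus_text, covered_words):
    """Rank corpus words that the wordnet does not cover, most frequent first."""
    tokens = re.findall(r"\w+", corpus_text.lower())
    counts = Counter(t for t in tokens if t not in covered_words)
    return counts.most_common()


# Hypothetical corpus snippet and hypothetical set of already covered words.
text = "O deputado foi eleito deputado federal e depois senador"
covered = {"o", "foi", "e", "depois", "senador"}

ranking = missing_words_by_frequency(text, covered)
print(ranking)
```

Words at the top of the ranking are the ones whose addition would improve coverage of the target corpus the most.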

Page 29: Lexical Resources  for Portuguese

+

Thanks!

Page 30: Lexical Resources  for Portuguese

+References

Revisiting a Brazilian Wordnet. Valeria de Paiva, Alexandre Rademaker (2012). Proceedings of Global Wordnet Conference, Global Wordnet Association, Matsue.

OpenWordNet-PT: An Open Brazilian WordNet For Reasoning. Valeria de Paiva, Alexandre Rademaker, and Gerard de Melo. In Proceedings of the 24th International Conference On Computational Linguistics. http://hdl.handle.net/10438/10274.

OpenWordNet-PT: A Project Report. Alexandre Rademaker, Valeria de Paiva, Gerard de Melo, Livy Real and Maira Gatti (2014). Proceedings of the 7th Global Wordnet Conference, Tartu, Estonia. Global Wordnet Association.

Embedding NomLex-BR Nominalizations Into OpenWordnet-PT. Livy Maria Real Coelho, Alexandre Rademaker, Valeria de Paiva, and Gerard de Melo (2014). In Proceedings of the 7th Global WordNet Conference, Tartu, Estonia.

Page 31: Lexical Resources  for Portuguese

+Other stuff to add in?…

• Onto.PT, ES wordnet?

• Editing interfaces?

• BabelNet?

• NER issues?

• Temporal issues?

• Work with Claudia Freitas?…Leonel?

• Work on implicatives/factives in Portuguese?

• FOIS workshop

Page 32: Lexical Resources  for Portuguese

+References

Towards a Universal Wordnet by Learning from Combined Evidence. Gerard de Melo, Gerhard Weikum (2009). 18th ACM Conference on Information and Knowledge Management (CIKM 2009), Hong Kong, China.

Bridges from Language to Logic: Concepts, Contexts and Ontologies. Valeria de Paiva (2010). Logical and Semantic Frameworks with Applications, LSFA'10, Natal, Brazil.

“A Basic Logic for Textual Inference”, AAAI Workshop on Inference for Textual Question Answering, 2005.

“Textual Inference Logic: Take Two”, CONTEXT 2007.

“Precision-focused Textual Inference”, Workshop on Textual Entailment and Paraphrasing, 2007.

PARC's Bridge and Question Answering System. Proceedings of Grammar Engineering Across Frameworks, 2007.

Page 33: Lexical Resources  for Portuguese

+ Simplifying PARC’s Bridge Architecture

Idea: Simplify and reproduce components in PORTUGUESE

[Architecture diagram. Labels: Text, Sources, Question, Assertions, Query; Parsing, KR Mapping, Inference Engines; F-structure semantics, KR; Grammar, Stanford Parser; Term rewriting, OpenWN-PT, SUMO-PT, KR mapping rules; Textual Inference logics.]