86
Machine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Machine Translation WiSe 2015/2016 - Hasso Plattner · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

  • Upload
    dangtu

  • View
    216

  • Download
    2

Embed Size (px)

Citation preview

Page 1: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Machine TranslationWiSe 2015/2016

Words, Sentences, Corpora

Dr. Mariana Neves October 24th, 2016

Page 2: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Overview

● Words

● Sentences

● Corpora

2

Page 3: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Overview

● Words

● Sentences

● Corpora

3

Page 4: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Words

● It is the basic unit of meaning

4

house

cat

happy

run

Page 5: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Syllables

● They are parts of a word

● They produce no mental image

5

house

hous-es

hap-py

run

Page 6: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Compound word

● It is the junction of two or more words

● One or two words?

– house's = house + 's

– rooftop = roof + top

– doesn't = does not

6

the house's rooftop

the house doesn't have a rooftop

Page 7: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Idiomatic expressions

● It s a expression which has a meaning different from the words it contains

● „house of cards“

– an organization or a plan that is very weak and can easily be destroyed

7

(http://idioms.thefreedictionary.com/a+house+of+cards http://www.imdb.com/title/tt1856010/ http://www.fool.com/investing/general/2014/01/29/is-netflix-a-house-of-cards.aspx)

Page 8: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Tokenization

● It is the task of breaking a text (sentence) into words (tokens)

● Pre-processing steps for many NLP tasks

8(analysis from http://nlp.stanford.edu:8080/parser/index.jsp)

The house doesn't has a rooftop

The | house | does | n't | has | a | rooftop

Page 9: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Tokenization = Word segmentation

● For writing systems without space separators between words

9(http://www.basistech.com/better-multilingual-search/)

Page 10: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Word segmentation

● Contractions also need to be segmented

10(https://en.wikipedia.org/wiki/Wikipedia:List_of_English_contractions)

ain’t am not

aren’t are not

can’t can not

could’ve could have

couldn’t could not

couldn’t’ve could not have

didn’t did not

doesn’t does not

don’t do not

hadn’t had not

Page 11: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Word segmentation

● Should hyphenated words be separated?

– They lose their original meaning when separated

11(http://www.oxforddictionaries.com/words/hyphen)

accident-prone to ice-skate

good-looking to booby-trap

sugar-free to spot-check

power-driven to court-martial

quick-thinking

Page 12: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Word segmentation

● Should compound words be separated?

– They lose their original meaning when separated

12

(http://www.theguardian.com/world/2013/jun/03/indfleischetikettierungsberwachungsaufgabenbertragungsgesetz-word-germany)

Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz

(“has now been confined to the linguistic history books by authorities in Mecklenburg-Vorpommern.”)

Page 13: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

More very long words...

13

(http://www.theguardian.com/world/2013/jun/03/indfleischetikettierungsberwachungsaufgabenbertragungsgesetz-word-germany)

Pneumonoultramicroscopicsilicovolcanoconiosis A lung disease caused by inhalation of silica dust. The longest word listed in an English language

Floccinaucinihilipilification Act or habit of estimating as worthless. Longest in Hansard, by Jacob Rees-Mogg

Taumatawhakatangihangakoauauotamateaturipukakapikimaunga-horonukupokaiwhenuakitanatahu Longest-named place in the world, given to a New Zealand hill by the Maoris

Hippopotomonstrosesquipedaliophobia The fear of long words

Page 14: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Lowercasing and truecasing

● Lowercasing: converting the word to lowercase

– House = house = HOUSE

● Truecasing: keeping uppercase letters in names

– Mr. Fischer ≠ fisher

14

Page 15: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Translation processing pipeline

● Inverse tasks

– detokenization

– decasing

15

The boy's house is small

the boy 's house is small

das haus des jungens ist klein

Das Haus des Jungens ist klein

das haus von dem jungen ist klein

Das Haus vom Jungen ist klein

(tokenization, lowercasing)

(translation)

(decasing)

(detokenization, decasing)

Page 16: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Vocabulary (V)

● It is the set of words which constitute a language

● A text is a sequence of the words from the vocabulary

16

Page 17: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Size of the vocabulary (language)

● Fluid concept

– Words are coined every day!

17(http://www.languagemonitor.com/number-of-words/number-of-words-in-the-english-language-1008879/)

The number of words in the English language is: 1,025,109.8.

This is the estimate by the Global Language Monitor on January 1, 2014.

The English Language passed the Million Word threshold on June 10, 2009 at 10:22 a.m. (GMT).

The Millionth Word was the controversial ‘Web 2.0′.

Currently there is a new word created every 98 minutes or about 14.7 words per day.

Page 18: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Size of the vocabulary (corpus)

● It is the number of words in a corpus

18

(http://googlebooks.byu.edu/)

Page 19: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Word tokens vs. word types

● Word tokens: each word in the corpus

● Word types: each unique word in the corpus, no repetition

● Google N-Gram corpus

– 1,024,908,267,229 word tokens

– 13,588,391 word types

● Why so many word types?

– Large English dictionaries have around 500k word types

19

Page 20: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Functional vs. non-functional words

● Functional

– Words that have none or little semantic meaning

– Stopwords, close-class

– Articles, pronouns, conjuctions

● Non-functional words

– Words that have semantic meaning

– Content words, open-class

– Verbs, nouns, adjectives

20

The boy's house is small

Page 21: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Distribution of functional and non-functional words

● Functional vs. non-functional words

21(http://norvig.com/ngrams/count_1w.txt)

Page 22: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Lemma vs. stem

● Lemma

– Canonical or dictionary form of a word

– are, is, am, was, were, being be

● Stem or root (Porter stemmer)

– A part of the word to which suffixes and prefixes can be attached

– wants, wanted, wanting, unwanted want

– am am

– computer comput

22

(http://text-processing.com/demo/stem/)

Page 23: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Zipf's law

● The frequency of any word is inversely proportional to its rank in the frequency table

23(https://finnaarupnielsen.wordpress.com/2013/10/22/zipf-plot-for-word-counts-in-brown-corpus/)

Page 24: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Zipf's law

● The most frequent word will occur approximately

– twice as often as the second most frequent word,

– three times as often as the third most frequent word,

– …

● Rank of a word times its frequency is approximately a constant

– Rank · Freq ≈ c

– c ≈ 0.1 for English

24

Page 25: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Zipf's law

25

Page 26: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Zipf's law

● Not very accurate for very frequent and very infrequent words

26

Page 27: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Zipf's law

● … and very infrequent words

27

Page 28: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Challenges for MT

● Functional (closed-class)

– limited number of words, do not grow usually

28

Auxiliary can, should, have

Article/Determiner the, some

Conjunction and, or

Pronoun he, my

Preposition to, in

Particle off, up

Interjection Ow, Eh

Page 29: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Challenges for MT

● Functional (closed-class)

– Languages differ greatly on them

● „the“ vs. „der“, „die“, „das“, „dem“, „den“, „des“

29

The house of the boy is small

Das Haus des Jungen ist klein

Das Haus vom (von dem) Jungen ist klein

Page 30: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Challenges for MT

● Functional (closed-class)

– Languages differ greatly on them

● No equivalent to „the“ in Chinese and Japanese

– Choose among „the“, „a“ or nothing

30

The house of the boy is smallThe house of a boy is smallA house of the boy is small

少年の家は小さいです

Boy of the house is small

少年の家は小さいです

(http://translate.google.de)

Page 31: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Challenges for MT

● Functional (closed-class)

– Languages differ greatly on them

● „wa“ particle in Japanese

– Next to which noun phrase to place it?

31

Boy of the house is small

少年の家は小さいです

(http://translate.google.de)

The boy's house is small

少年の家は小さいです

Page 32: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Challenges for MT

● Content (open-class) words

– Unlimited number of words

– New ones are coined every day

32

Noun book/books, nature, Germany, Sony

Verb eat, wrote

Adjective new, newer, newest

Adverb well, urgently

Page 33: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Part-of-speech tags

● They are classes of words which have similar grammatical properties

– Verbs, adjectives, pronouns, etc.

33

(http://nlp.stanford.edu:8080/corenlp/process)

Page 34: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Penn TreeBank Tagset

34 (http://www.americannationalcorpus.org/OANC/penn.html)

Page 35: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Part-of-speech in MT

35

Page 36: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Part-of-speech in MT

„I like ice-cream“

„Ich mag Eis“

„Ich wie Eis“

36

Page 37: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Morphology

● It is the analysis and identification of the morphemes

● Morpheme is the smallest grammatical unit

– Free morphemes: „dog“ in „doghouse“

– Bound morphemes (affixes):

● Suffixes: „exactly“● Prefixes: „unhappiness“

37

Page 38: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Morphology

● Derivational morpheme

– Changes the meaning or the functional role (POS tag)

● „kind“ vs. „unkind“ (opposite meanings)● „happy“ vs. „happiness“ (adjective to noun)

● Inflectional morphemes

– Do not affect meaning nor the part-of-speech tag

● „house“ vs. „houses“● „want“ vs. „wanted“

38

Page 39: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Morphology – aglutinative

39(http://allthingslinguistic.com/post/50939757945/morphological-typology-illustrations-from)

Page 40: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Morphology – polysynthetic

40(http://allthingslinguistic.com/post/50939757945/morphological-typology-illustrations-from)

Page 41: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Ubykh – extinct language

41(https://en.wikipedia.org/wiki/Ubykh_language)

“The language's last native speaker, Tevfik Esenç, died in 1992.”

Page 42: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Challenges for MT

● Cases

– Give more flexibility for the sentence structure

Der Löwe frißt das Zebra

Den Löwe frißt das Zebra

Das Zebra frißt den Löwe

Das Zebra frißt der Löwe

42

Finnish has 15 cases!

Page 43: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Challenges for MT

● Prepositions

– Change the meaning of the sentence

I go to the house

I go from the house

I go in the house

I go on the house

I go by the house

I go through the house

43

Page 44: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Challenges in MT

● Gender

– „the“ vs. „der“, „die“, „das“, „dem“, „den“, „des“

44 (https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Flexi%C3%B3nGato.png/220px-Flexi%C3%B3nGato.png)

Page 45: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Lexical semantics

● The meaning of a word

● band (music group)

● band (material)

● band (wavelength)

● fall (season)

● fall (verb)

45

Page 46: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

WordNet

46(https://wordnet.princeton.edu/)

Page 47: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

WordNet

47(https://wordnet.princeton.edu/)

Page 48: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Semantics in MT

48(http://www.leo.org/)

Page 49: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Word sense disambiguation

● It is the task of determining the right sense of a word in a text

● Usually make use of dictionaries (definitions) and/or the context

49

Page 50: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Overview

● Words

● Sentences

● Corpora

50

Page 51: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Sentence structure

● Verb is the central element of a sentence

Jane swims.

Jane bought the house.

Jane gave Joe the book.

51

Page 52: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Sentence structure

● Verbs may have many valencies, arguments, adjunts, ...

Jane swims.

Jane bought the house.

Jane bought the house from Jim.

Jane gave Joe the book.

Jane bought the beautiful house.

Jane bought the beautiful house in the city center.

52

Page 53: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Sentence structure

● Sentences can include Recursion

– Nested construction of constituents

Jane, who recently won the lottery, bought the house that was just put in the market.

– Jane bought the house.

– Jane recently won the lottery.

– The house was just put in the market.

53

Page 54: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Sentence structure

● Clause

– Part of sentence which is composed of a verbs and its arguments or adjuncts

Jane bought the house that was just put in the market.

54

Page 55: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Sentence structure

● Structural ambiguity

Joe eats steak with a knife.

Jim eats steak with ketchup.

Jane watches the man with the telescope.

Jim washes the dishes and watches television with Jane.

55

Page 56: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Grammar

● Chunking or shallow parsing

– Noun phrases, verbal phrases, etc.

56(http://nlp.stanford.edu:8080/parser/index.jsp)

(NP My dog) (ADVP also) (VP likes) (VP eating) (NP sausage) (. .)

Page 57: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Grammar

● Parse tree (full parsing)

– It represents the syntactic structure of a sentence

57 (http://nlp.stanford.edu:8080/parser/index.jsp)

S

NP ADVP VP

PRP$ NN RB VBZ

alsoMy dog likes

S

VP

VBG

eating

NP

NN

sausage

.

.

Page 58: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Grammar

● Context-free grammars

– Non-terminal symbols (POS tags and phrases)

– Terminal symbols (words)

58

S → NP VPS → VPNP → NNNP → PRPNP → DT NNNP → NP NPNP → NP PP...

PRP$ → MyNN → dogRB → alsoVBZ → likesVBG → eatingNN → sausage

Page 59: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Grammar

● Probabilistic context-free grammars

– Probabilities are associated to rules and POS tags

59

0.9 S → NP VP

0.1 S → VP

0.3 NP → NN

0.4 NP → PRP

0.1 NP → DT NN

0.2 NP → NP NP

0.1 NP → NP PP

...

1.0 PRP$ → My

0.9 NN → dog

1.0 RB → also

0.6 VBZ → likes

0.5 VBG → eating

0.8 NN → sausage

Page 60: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Grammar

● Dependency tree

– Alternative syntactic structure

– It shows relationships between words

60

(http://nlp.stanford.edu:8080/parser/index.jsp)

ROOT-0

likes-4

eating-5 also-3 dog-2

My-1sausage-6

root

xcompadvmod

nsubj

dobj nmod:poss

Page 61: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Grammar

● Head words

– They are words on which others depend

61(http://nlp.stanford.edu:8080/corenlp/process)

ROOT-0

likes-4

eating-5 also-3 dog-2

My-1sausage-6

root

xcompadvmod

nsubj

dobj nmod:poss

Page 62: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Translation of sentence structure

● Reordering of words

● Insertion and deletion of function words

62

The house of the boy is small

少年の家は小さいです

Page 63: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

63

S

NP VP

NNP NNP MD VP

Pierre Vinken willVB NP

DT NNjoin

the board

S

NP VP

NNP NNP MD VP

Pierre Vinken wird NP

NP DT

dem Vorstand

VB

beitreten

Page 64: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Translation of sentence structure

64

Vinken

will

join

theboard

Vinken

wird

demVorstand

beitreten

Page 65: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Discourse

● Co-reference

– Anaphora: denotes the act of referring backwards in a dialog or text

– Cataphora: sees the act of referring forward in a dialog or text

Susan dropped the plate. It shattered loudly.

The music stopped, and that upset everyone.

Fred was angry, and so was I.

If Sam buys a new bike, I will do it as well.

Because he was very cold, David put on his coat.

Although Sam might do so, I will not buy a new bike.

65 (https://en.wikipedia.org/wiki/Anaphora_(linguistics))

Cataphora

Anaphora

Page 66: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Overview

● Words

● Sentences

● Corpora

66

Page 67: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Domains

● Politics (European Parlament)

● News

● Conversations

● Biomedicine (neuroscience)

● ...

67

Page 68: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Modality

● Spoken language is different of written language.

68

Page 69: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Acquiring parallel corpora

● Parallel corpora

69

SAN FRANCISCO – It has never been easy to have a rational conversation about the value of gold.Lately, with gold prices up more than 300% over the last decade, it is harder than ever.

SAN FRANCISCO – Es war noch nie leicht, ein rationales Gespräch über den Wert von Gold zu führen.In letzter Zeit allerdings ist dies schwieriger denn je, ist doch der Goldpreis im letzten Jahrzehnt um über 300 Prozent angestiegen.

Page 70: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Acquiring parallel corpora

● Parallel corpora

– Web pages

– Books in many translations (Bible)

– Manuals of products

– ...

70

Page 71: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Crawling for parallel corpora

● Web technical issues, HTML parsing

71(http://www.scielo.br/scielo.php?script=sci_abstract&pid=S0103-64402015000100003&lng=en&nrm=iso&tlng=en)

<a href="http://www.scielo.br/cgi-bin/wxis.exe/iah/?IsisScript=iah/iah.xis&amp;base=article%5Edlibrary&amp;format=iso.pft&amp;lang=i&amp;nextAction=lnk&amp;indexSearch=AU&amp;exprSearch=GIANNINI,+MARCELO">GIANNINI, Marcelo</a> et al.<span class="article-title"> Self-Etch Adhesive Systems: A Literature Review.</span><i> Braz. Dent. J.</i> [online]. 2015, vol.26, n.1, pp. 3-10. ISSN 0103-6440. http://dx.doi.org/10.1590/0103-6440201302442.</p><p xmlns=""><p>This paper presents the state of the art of self-etch adhesive systems. Four topics are shown in this review and included: the historic of this category of bonding agents, bonding mechanism, characteristics/properties and the formation of acid-base resistant zone at enamel/dentin-adhesive interfaces. Also, advantages regarding etch-and-rinse systems and classifications of self-etch adhesive systems according to the number of steps and acidity are addressed. Finally, issues like the potential durability and clinical importance are discussed. Self-etch adhesive systems are promising materials because they are easy to use, bond chemically to tooth structure and maintain the dentin hydroxyapatite, which is important for the durability of the bonding.</p></p><p xmlns=""><strong>Keywords

:</strong>Self-etch adhesive systems; adhesion; dentin..

</p><p xmlns=""><div xmlns:xlink="http://www.w3.org/1999/xlink" align="left">

Page 72: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Document alignment

● Some resoures provide documents in more than one language

72

(http://www.scielo.br/scielo.php?script=sci_issuetoc&pid=0103-644020150001&lng=en&nrm=iso)

Page 73: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Document alignment

● Are documents really aligned???

– Mix of languages in the publication

73

(http://www.scielo.br/scielo.php?script=sci_abstract&pid=S0103-64402015000100003&lng=en&nrm=iso&tlng=pt)

Portuguese

English

Page 74: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Document alignment

● Web pages in more than one language

● But sometimes they are not perfect translation of each other

74

Page 75: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

75

(http://www.bbc.com/news/world-asia-34463608http://www.bbc.com/afrique/monde/2015/10/151007_msf_international_investigations )

Kunduz bombing: MSF demands Afghan war crimes probe

Aid agency Medecins Sans Frontieres is seeking to invoke a never-used body to investigate the US bombing of its hospital in the Afghan city of Kunduz.

MSF said it did not trust internal military inquiries into the bombing that killed at least 22 people.

The International Humanitarian Fact-Finding Commission (IHFFC) was set up in 1991 under the Geneva Conventions.

,,,

MSF appelle à une enquête indépendante

Médecins Sans Frontières (MSF) appelle à une enquête internationale sur le bombardement américain la semaine dernière contre un hôpital appartenant à l’ONG dans la ville afghane de Kunduz.

MSF exhorte les Etats membres à saisir la Commission internationale humanitaire d'établissement des faits (CIHEF) afin que des investigations indépendantes soient menées.

Selon l’ONG, le bombardement, qui a tué 22 personnes, contrevenait à la Convention de Genève.

...

Page 76: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Comparable corpora

● Documents on the same topics

– Same events, same date, etc.

● But no direct translations!

76

Page 77: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Comparable corpora

77

(http://www.sueddeutsche.de/wissen/-chemie-nobelpreis-fuer-forschung-zu-dna-reparatur-1.2680858 http://www.bbc.com/news/uk-england-34464580)

Chemie-Nobelpreis für Forschung zu DNA-Reparatur

Tomas Lindahl, Paul Modrich und Aziz Sancar haben herausgefunden, wie der Körper beschädigtes Erbgut wieder herstellt - ein wichtiger Grundstein für die Entwicklung von Krebsmedikamenten.

Der Nobelpreis für Chemie geht in diesem Jahr an die DNA-Forscher Paul Modrich aus den USA, Tomas Lindahl aus Schweden und an den türkisch-amerikanischen Wissenschaftler Aziz Sancar.

…..

Chemistry Nobel: Lindahl, Modrich and Sancar win for DNA repair

The 2015 Nobel Prize in Chemistry has been awarded for discoveries in DNA repair.Tomas Lindahl and Paul Modrich and Aziz Sancar were named as the winners on Wednesday morning at a news conference in Stockholm, Sweden.Their work uncovered the mechanisms used by cells to repair damaged DNA - a fundamental process in living cells and important in cancer.".…..

Page 78: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Sentence alignment

● It is an essential step in corpus preparation

78

1 <=> 12,3 <=> 24 <=> 35 <=> 46,7 <=> 58,9 <=> 610 <=> 711 <=> omitted12 <=> 813,14 <=> 915 <=> 1016 <=> 11

(output of the GMA tool: http://nlp.cs.nyu.edu/GMA/)

one sentence aligned to another

two sentences aligned to one sentence

one sentence not aligned to any other

Page 79: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Sentence alignment

● All sentences need to be accounted

● Each sentence may occur in only one aligment

79

1 <=> 12,3 <=> 24 <=> 35 <=> 46,7 <=> 58,9 <=> 610 <=> 711 <=> omitted12 <=> 813,14 <=> 915 <=> 1016 <=> 11

(output of the GMA tool: http://nlp.cs.nyu.edu/GMA/)

Page 80: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Summary – Natural Language Processing steps

80

incr

ease

d co

mpl

exity

of

proc

essi

ng

Morphology

POS Tagging

Chunking

Parsing

Semantics

The dog jumped from a chair.

Page 81: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Summary – Natural Language Processing steps

81

incr

ease

d co

mpl

exity

of

proc

essi

ng

Morphology

POS Tagging

Chunking

Parsing

Semantics

The dog jumped from a chair.

The dog jump-[Past-Sing-3rd] from a chair.

Page 82: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Summary – Natural Language Processing steps

82

incr

ease

d co

mpl

exity

of

proc

essi

ng

Morphology

POS Tagging

Chunking

Parsing

Semantics

The dog jumped from a chair.(http://nlp.stanford.edu:8080/parser/index.jsp)

The/DT dog/NN jumped/VBD from/IN a/DT chair/NN ./.

Page 83: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Summary – Natural Language Processing steps

83

incr

ease

d co

mpl

exity

of

proc

essi

ng

Morphology

POS Tagging

Chunking

Parsing

Semantics

The dog jumped from a chair.

(NP The dog) (VP jumped) (PP from a chair)

(http://nlp.stanford.edu:8080/parser/index.jsp)

Page 84: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Summary – Natural Language Processing steps

84

incr

ease

d co

mpl

exity

of

proc

essi

ng

Morphology

POS Tagging

Chunking

Parsing

Semantics

The dog jumped from a chair.(http://nlp.stanford.edu:8080/parser/index.jsp)

(ROOT (S (NP (DT The) (NN dog)) (VP (VBD jumped) (PP (IN from) (NP (DT a) (NN chair)))) (. .)))

Page 85: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Summary – Natural Language Processing steps

85

incr

ease

d co

mpl

exity

of

proc

essi

ng

Morphology

POS Tagging

Chunking

Parsing

Semantics

The dog jumped from a chair.(http://wordnetweb.princeton.edu/perl/webwn)

Page 86: Machine Translation WiSe 2015/2016 - Hasso Plattner  · PDF fileMachine Translation WiSe 2015/2016 Words, Sentences, Corpora Dr. Mariana Neves October 24th, 2016

Suggested reading

● Chapter 2

86