Unsupervised Morphological Segmentation Based on Word Segments Predictability and Alignment

Delphine Bernhard

Unsupervised Segmentation of Words into Morphemes
Pascal Challenge Workshop, April 12, 2006

Delphine Bernhard (TIMC-IMAG), MorphoChallenge Workshop


Part I

Motivation


Why?

Context
Work on the morphology of domain-specific vocabulary, especially medical language (many neoclassical compounds)

Examples
dermatofibrosarcoma
glomeroporphyritic
slammograms (refers to mammograms)
pneumonoultramicroscopicsilicovolcanoconiosis: "a lung disease caused by the inhalation of very fine silica dust, mostly found in volcanoes" = pneumoconiosis (but this is a hoax!)


Objectives

Automatic acquisition of semantic relationships thanks to morphological relatedness


Part II

Method


Constraints

Take into account all of the following word formation processes:
    inflection
    derivation
    compounding

Method not limited to French or English.

Distinguish between different types of word segments:
    prefix
    suffix
    stem
    linking element


Overview of the method

Input
List of words

Stages
1 Acquisition of prefixes and suffixes
2 Acquisition of stems
3 Alignment of word segments
4 Selection of the best segmentation for each word


Acquisition of prefixes and suffixes [1]

Input: longest words

Locate positions with low segment predictability

Output: segments
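
A minimal sketch of how positions with low predictability might be located. The slides do not state which predictability measure is used, so this example assumes a simple forward transition probability: among the words that start with a given prefix, the share that continue with the next letter of the word. The function names and the 0.5 threshold are hypothetical.

```python
from collections import Counter


def prefix_counts(words):
    """Count how many words start with each possible prefix."""
    counts = Counter()
    for w in words:
        for i in range(len(w) + 1):
            counts[w[:i]] += 1
    return counts


def low_predictability_positions(word, counts, threshold=0.5):
    """Return positions i where the letter word[i] is poorly predicted by word[:i].

    Assumption: predictability is count(word[:i+1]) / count(word[:i]).
    """
    positions = []
    for i in range(1, len(word)):
        seen = counts[word[:i]]
        if seen == 0:
            break  # prefix never observed, nothing to estimate
        if counts[word[:i + 1]] / seen < threshold:
            positions.append(i)
    return positions


words = ["hormonal", "hormone", "hormones", "hormonotherapy"]
counts = prefix_counts(words)
cuts = low_predictability_positions("hormonal", counts)
print(cuts)  # [6], i.e. hormon | al
```

With this toy word list, the only poorly predicted transition in "hormonal" is the one after "hormon", which is followed by three different letters across the four words.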


Acquisition of prefixes and suffixes [2]

Identification of a stem among the segments

Prefixes and suffixes


Extraction of stems

Subtract prefixes and suffixes from all words
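
The subtraction step can be sketched as follows. This is only an illustration under stated assumptions: the affix sets, the function name `strip_affixes` and the minimum stem length are hypothetical, and the unstripped word is kept as a candidate because a word may itself be a stem.

```python
def strip_affixes(word, prefixes, suffixes, min_stem_len=3):
    """Return candidate stems left after removing a known prefix and/or suffix."""
    candidates = set()
    for p in [""] + sorted(prefixes):
        if not word.startswith(p):
            continue
        for s in [""] + sorted(suffixes):
            if not word.endswith(s):
                continue
            # Slice out what remains between the prefix and the suffix.
            stem = word[len(p):len(word) - len(s)]
            if len(stem) >= min_stem_len:
                candidates.add(stem)
    return candidates


stems = strip_affixes("retransplantation", {"re"}, {"ation"})
print(sorted(stems))
```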


Alignment of word segments [1]


Alignment of word segments [2]

Validation of new prefixes and suffixes

Columns: Words, Known suffixes (A1), Potential stems (A2), New suffixes (A3)

Words             Segments
hormonal          -al
hormonotherapy    -otherapy
hormone           -e
hormones          -es

\[
\frac{|A_1| + |A_2|}{|A_1| + |A_2| + |A_3|} \geq a
\quad \text{and} \quad
\frac{|A_1|}{|A_1| + |A_2|} \geq b
\]
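
The two validation conditions can be checked as in the sketch below. The threshold values for a and b are not given on the slide, so the ones used here are purely illustrative, as are the function name and the example sets.

```python
def validate_new_affixes(known, stems, new, a, b):
    """Check both ratio conditions on A1 (known), A2 (stems) and A3 (new)."""
    n1, n2, n3 = len(known), len(stems), len(new)
    if n1 + n2 == 0:
        return False  # both ratios are undefined or trivially fail
    ratio_known_or_stem = (n1 + n2) / (n1 + n2 + n3)
    ratio_known = n1 / (n1 + n2)
    return ratio_known_or_stem >= a and ratio_known >= b


# Hypothetical sets and thresholds, for illustration only.
A1 = {"al", "s"}
A2 = {"therapy"}
A3 = {"e"}
print(validate_new_affixes(A1, A2, A3, a=0.7, b=0.5))  # True: 3/4 >= 0.7 and 2/3 >= 0.5
print(validate_new_affixes(A1, A2, A3, a=0.8, b=0.5))  # False: 3/4 < 0.8
```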


Selection of the best segmentation


Segmentation of new words

For each word, select segments so that the total cost is minimal.

Cost functions used:

\[
\mathrm{cost}_1(s_i) = -\log \frac{f(s_i)}{\sum_i f(s_i)}
\]

\[
\mathrm{cost}_2(s_i) = -\log \frac{f(s_i)}{\max_i [f(s_i)]}
\]
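
A minimal sketch of minimum-cost selection using the first cost function. It assumes that f(s_i) is the frequency of segment s_i, which the slide does not state explicitly, and it searches all splits of the word into known segments with dynamic programming; this is a simplification and not necessarily how candidate segmentations are produced in the method itself. The frequency table is made up.

```python
import math


def segment(word, freq):
    """Split word into known segments so that the summed cost1 is minimal.

    Returns None when the word cannot be covered by known segments.
    """
    total = sum(freq.values())

    def cost(seg):
        return -math.log(freq[seg] / total)

    # best[i] holds (lowest cost, segments) for the prefix word[:i].
    best = [(0.0, [])] + [(math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            seg = word[start:end]
            if seg in freq and best[start][1] is not None:
                c = best[start][0] + cost(seg)
                if c < best[end][0]:
                    best[end] = (c, best[start][1] + [seg])
    return best[len(word)][1]


# Hypothetical segment frequencies.
freq = {"trans": 5, "plant": 8, "transplant": 20, "ation": 30, "s": 50}
print(segment("transplantation", freq))  # ['transplant', 'ation']
```

Because every segment cost is positive under this cost function, a split into fewer, more frequent segments wins over a finer split, which is why "transplant + ation" beats "trans + plant + ation" with these counts.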


Part III

Results and conclusion


Evaluation

Position of boundaries
MorphoChallenge evaluation

Conflation sets
Check if word forms containing the same stem are related
Test on an English corpus on breast cancer (about 86,000 word types).
Manually built morphological families for the top 5,000 key words
Results: F-measure ∼ 50% (Recall = 40% ± 7, Precision = 66% ± 7)
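
As a quick arithmetic check of the reported figures, the sketch below computes the F-measure from the stated precision and recall, assuming the balanced F-measure (harmonic mean of precision and recall), which the slide does not specify.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


print(round(f_measure(0.66, 0.40), 3))  # 0.498, consistent with the reported ~50%
```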


Examples [1]

Words                   Segmentations
chondrosarcomas         chondro + sarcoma + s
cystosarcoma            cyst + o + sarcoma
dermatofibrosarcomas    derm + at + o + fibro + sarcoma + s
fibroxanthosarcoma      fibroxanthosarcoma
leiomyosarcoma          leiomyo + s + arc + oma
leiomyosarcomas         leiomyo + sarcoma + s
liposarcoma             lipo + sarcoma
lymphangiosarcomas      lymph + angiosarcoma + s
myxofibrosarcoma        myxo + fibro + sarcoma
myxosarcomas            myxo + sarcoma + s
neurofibrosarcoma       neur + o + fibro + sarcoma
osteosarcoma            osteo + sarcoma
osteosarcomatous        osteosarcoma + tous
sarcoma                 sarcoma
sarcomatoid             sarcoma + t + oid


Examples [2]

Words                   Segmentations
auto-transplant         auto + - + transplant
auto-transplantation    auto + - + transplant + ation
autotransplantation     auto + transplant + ation
post-transplantation    post + - + transplant + ation
posttransplantation     post + transplant + ation
retransplantation       re + transplant + ation
transplantability       transplantability
transplant              trans + plant
transplanted            trans + plant + e + d
transplanting           trans + plant + ing
transplants             trans + plant + s
xenotransplantation     xenotransplant + ation
xenotransplanted        xenotransplant + ed
xenotransplants         xeno + transplants


Main issues

Over-segmentation
    leiomyo + s + arc + oma
    g + lobul + e
⇒ Low precision

Under-segmentation
    transplantability
    xenotransplant + ation
    xenotransplant + ed
⇒ Low recall


Conclusion

Summary
Method usable for languages other than French and English
Performs segmentation and distinguishes between different kinds of segments

Future work
Use other data structures to deal with very, very large corpora
Deal with variations within stems (accents, alternations)
Evaluate how well word segments predict semantic relationships between terms


Thank you

Further information: [email protected]

http://www-timc.imag.fr/Delphine.Bernhard/
