Upload
truongbao
View
248
Download
11
Embed Size (px)
Citation preview
Unsupervised Morphological SegmentationBased on
Word Segments Predictability and Alignment
Delphine Bernhard
Unsupervised Segmentation of Words into MorphemesPascal Challenge Workshop, April 12, 2006
Delphine Bernhard (TIMC-IMAG) MorphoChallenge Workshop 1 / 21
Why ?
ContextWork on the morphology of domain-specific vocabulary, esp. medicallanguage (many neoclassical compounds)
Examplesdermatofibrosarcomaglomeroporphyritic
slammograms
(refers to mammograms)
pneumonoultramicroscopicsilicovolcanoconiosis
"a lung disease caused by the inhalation of very fine silica dust,mostly found in volcanoes" = pneumoconiosis(But this is a hoax !)
Delphine Bernhard (TIMC-IMAG) MorphoChallenge Workshop 3 / 21
Why ?
ContextWork on the morphology of domain-specific vocabulary, esp. medicallanguage (many neoclassical compounds)
Examplesdermatofibrosarcomaglomeroporphyriticslammograms
(refers to mammograms)pneumonoultramicroscopicsilicovolcanoconiosis
"a lung disease caused by the inhalation of very fine silica dust,mostly found in volcanoes" = pneumoconiosis(But this is a hoax !)
Delphine Bernhard (TIMC-IMAG) MorphoChallenge Workshop 3 / 21
Why ?
ContextWork on the morphology of domain-specific vocabulary, esp. medicallanguage (many neoclassical compounds)
Examplesdermatofibrosarcomaglomeroporphyriticslammograms (refers to mammograms)
pneumonoultramicroscopicsilicovolcanoconiosis
"a lung disease caused by the inhalation of very fine silica dust,mostly found in volcanoes" = pneumoconiosis(But this is a hoax !)
Delphine Bernhard (TIMC-IMAG) MorphoChallenge Workshop 3 / 21
Why ?
ContextWork on the morphology of domain-specific vocabulary, esp. medicallanguage (many neoclassical compounds)
Examplesdermatofibrosarcomaglomeroporphyriticslammograms (refers to mammograms)pneumonoultramicroscopicsilicovolcanoconiosis
"a lung disease caused by the inhalation of very fine silica dust,mostly found in volcanoes" = pneumoconiosis(But this is a hoax !)
Delphine Bernhard (TIMC-IMAG) MorphoChallenge Workshop 3 / 21
Why ?
ContextWork on the morphology of domain-specific vocabulary, esp. medicallanguage (many neoclassical compounds)
Examplesdermatofibrosarcomaglomeroporphyriticslammograms (refers to mammograms)pneumonoultramicroscopicsilicovolcanoconiosis"a lung disease caused by the inhalation of very fine silica dust,mostly found in volcanoes" = pneumoconiosis
(But this is a hoax !)
Delphine Bernhard (TIMC-IMAG) MorphoChallenge Workshop 3 / 21
Why ?
ContextWork on the morphology of domain-specific vocabulary, esp. medicallanguage (many neoclassical compounds)
Examplesdermatofibrosarcomaglomeroporphyriticslammograms (refers to mammograms)pneumonoultramicroscopicsilicovolcanoconiosis"a lung disease caused by the inhalation of very fine silica dust,mostly found in volcanoes" = pneumoconiosis(But this is a hoax !)
Delphine Bernhard (TIMC-IMAG) MorphoChallenge Workshop 3 / 21
Objectives
Automatic acquisition of semantic relationships thanks tomorphological relatedness
Delphine Bernhard (TIMC-IMAG) MorphoChallenge Workshop 4 / 21
Constraints
Take into account all of the following word formation processes:inflectionderivationcompounding
Method not limited to French or English.Distinguish between different types of word segments:
prefixsuffixstemlinking element
Delphine Bernhard (TIMC-IMAG) MorphoChallenge Workshop 6 / 21
Overview of the method
InputList of words
Stages1 Acquisition of prefixes and suffixes2 Acquisition of stems3 Alignment of word segments4 Selection of the best segmentation for each word
Delphine Bernhard (TIMC-IMAG) MorphoChallenge Workshop 7 / 21
Acquisition of prefixes and suffixes [1]
InputLongestwords
Locate positions withlow segment predictability
OutputSegments
Delphine Bernhard (TIMC-IMAG) MorphoChallenge Workshop 8 / 21
Acquisition of prefixes and suffixes [1]
InputLongestwords
Locate positions withlow segment predictability
OutputSegments
Delphine Bernhard (TIMC-IMAG) MorphoChallenge Workshop 8 / 21
Acquisition of prefixes and suffixes [1]
InputLongestwords
Locate positions withlow segment predictability
OutputSegments
Delphine Bernhard (TIMC-IMAG) MorphoChallenge Workshop 8 / 21
Acquisition of prefixes and suffixes [2]
Identification of a stem among the segments
Prefixes and suffixes
Delphine Bernhard (TIMC-IMAG) MorphoChallenge Workshop 9 / 21
Extraction of stems
Subtract prefixes and suffixes from all words
Delphine Bernhard (TIMC-IMAG) MorphoChallenge Workshop 10 / 21
Alignment of word segments [2]
Validation of new prefixes and suffixes
Words Known suffixes Potential stems New suffixesA1 A2 A3
hormonal -alhormonotherapy -otherapyhormone -ehormones -es
|A1|+ |A2||A1|+ |A2|+ |A3|
≥ a and|A1|
|A1|+ |A2|≥ b
Delphine Bernhard (TIMC-IMAG) MorphoChallenge Workshop 12 / 21
Segmentation of new words
For each word, select segments so that the total cost is minimalCost functions used:
cost1(si) = −logf (si)∑i f (si)
cost2(si) = −logf (si)
maxi[f (si)]
Delphine Bernhard (TIMC-IMAG) MorphoChallenge Workshop 14 / 21
Evaluation
Position of boundariesMorphoChallenge evaluation
Conflation setsCheck if word forms containing the same stem are related
Test on an English corpus on breast cancer(about 86,000 word types).Manually built morphological families for the top 5,000 key wordsResults: F-measure ∼ 50%(Recall = 40% ± 7, Precision = 66% ± 7)
Delphine Bernhard (TIMC-IMAG) MorphoChallenge Workshop 16 / 21
Examples [1]
Words Segmentationschondrosarcomas chondro + sarcoma + scystosarcoma cyst + o + sarcomadermatofibrosarcomas derm + at + o + fibro + sarcoma + sfibroxanthosarcoma fibroxanthosarcomaleiomyosarcoma leiomyo + s + arc + omaleiomyosarcomas leiomyo + sarcoma + sliposarcoma lipo + sarcomalymphangiosarcomas lymph + angiosarcoma + smyxofibrosarcoma myxo + fibro + sarcomamyxosarcomas myxo + sarcoma + sneurofibrosarcoma neur + o + fibro + sarcomaosteosarcoma osteo + sarcomaosteosarcomatous osteosarcoma + toussarcoma sarcomasarcomatoid sarcoma + t + oid
Delphine Bernhard (TIMC-IMAG) MorphoChallenge Workshop 17 / 21
Examples [2]
Words Segmentationsauto-transplant auto + - + transplantauto-transplantation auto + - + transplant + ationautotransplantation auto + transplant + ationpost-transplantation post + - + transplant + ationposttransplantation post + transplant + ationretransplantation re + transplant + ationtransplantability transplantabilitytransplant trans + planttransplanted trans + plant + e + dtransplanting trans + plant + ingtransplants trans + plant + sxenotransplantation xenotransplant + ationxenotransplanted xenotransplant + edxenotransplants xeno + transplants
Delphine Bernhard (TIMC-IMAG) MorphoChallenge Workshop 18 / 21
Main issues
Over-segmentationleiomyo + s + arc + omag + lobul + e
⇒ Low precision
Under-segmentationtransplantabilityxenotransplant + ationxenotransplant + ed
⇒ Low recall
Delphine Bernhard (TIMC-IMAG) MorphoChallenge Workshop 19 / 21
Conclusion
SummaryMethod usable for languages other than French and EnglishPerforms segmentation + distinguishes between different kinds ofsegments
Future workUse other data structures to deal with very, very large corporaDeal with variations within stems (accents, alternations)Evaluate how well word segments predict semantic relationshipsbetween terms
Delphine Bernhard (TIMC-IMAG) MorphoChallenge Workshop 20 / 21
Thank you
Further information:[email protected]
http://www-timc.imag.fr/Delphine.Bernhard/
Delphine Bernhard (TIMC-IMAG) MorphoChallenge Workshop 21 / 21