A Program for Aligning Sentences in Bilingual Corpora


  • 8/14/2019 A Program for Aligning Sentences in Bilingual Corpora


    William A. Gale
    Kenneth W. Church

    AT&T Bell Laboratories
    600 Mountain Avenue
    Murray Hill, NJ 07974

    ABSTRACT

    Researchers in both machine translation (e.g., Brown et al., 1990) and bilingual lexicography (e.g., Klavans and Tzoukermann, 1990) have recently become interested in studying parallel texts, texts such as the Canadian Hansards (parliamentary proceedings) which are available in multiple languages (French and English). This paper describes a method for aligning sentences in these parallel texts, based on a simple statistical model of character lengths. The method was developed and tested on a small trilingual sample of Swiss economic reports. A much larger sample of 90 million words of Canadian Hansards has been aligned and donated to the ACL/DCI.

    1. Introduction

    Researchers in both machine translation (e.g., Brown et al., 1990) and bilingual lexicography (e.g., Klavans and Tzoukermann, 1990) have recently become interested in studying bilingual corpora, bodies of text such as the Canadian Hansards (parliamentary debates) which are available in multiple languages (such as French and English). The sentence alignment task is to identify correspondences between sentences in one language and sentences in the other language. This task is a first step toward the more ambitious task of finding correspondences among words.[1] The input is a pair of texts such as Table 1.

    1. In statistics, string matching problems are divided into two classes: alignment problems and correspondence problems. Crossing dependencies are possible in the latter, but not in the former.

    Table 1: Input to Alignment Program

    English:
    According to our survey, 1988 sales of mineral water and soft drinks were much higher than in 1987, reflecting the growing popularity of these products. Cola drink manufacturers in particular achieved above-average growth rates. The higher turnover was largely due to an increase in the sales volume. Employment and investment levels also climbed. Following a two-year transitional period, the new Foodstuffs Ordinance for Mineral Water came into effect on April 1, 1988. Specifically, it contains more stringent requirements regarding quality consistency and purity guarantees.

    French:
    Quant aux eaux minérales et aux limonades, elles rencontrent toujours plus d'adeptes. En effet, notre sondage fait ressortir des ventes nettement supérieures à celles de 1987, pour les boissons à base de cola notamment. La progression des chiffres d'affaires résulte en grande partie de l'accroissement du volume des ventes. L'emploi et les investissements ont également augmenté. La nouvelle ordonnance fédérale sur les denrées alimentaires concernant entre autres les eaux minérales, entrée en vigueur le 1er avril 1988 après une période transitoire de deux ans, exige surtout une plus grande constance dans la qualité et une garantie de la pureté.

    The output identifies the alignment between sentences. Most English sentences match exactly one French sentence, but it is possible for an English sentence to match two or more French sentences. The first two English sentences (below) illustrate a particularly hard case where two English sentences align to two French sentences. No smaller alignments are possible because the clause "... sales ... were higher ..." in the first English sentence corresponds to (part of) the second French sentence. The next two alignments below illustrate the more typical case where one English sentence aligns with exactly one French sentence. The final alignment matches two English sentences to a single French sentence. These alignments agreed with the results produced by a human judge.

    Table 2: Output from Alignment Program

    English: According to our survey, 1988 sales of mineral water and soft drinks were much higher than in 1987, reflecting the growing popularity of these products. Cola drink manufacturers in particular achieved above-average growth rates.
    French: Quant aux eaux minérales et aux limonades, elles rencontrent toujours plus d'adeptes. En effet, notre sondage fait ressortir des ventes nettement supérieures à celles de 1987, pour les boissons à base de cola notamment.

    English: The higher turnover was largely due to an increase in the sales volume.
    French: La progression des chiffres d'affaires résulte en grande partie de l'accroissement du volume des ventes.

    English: Employment and investment levels also climbed.
    French: L'emploi et les investissements ont également augmenté.

    English: Following a two-year transitional period, the new Foodstuffs Ordinance for Mineral Water came into effect on April 1, 1988. Specifically, it contains more stringent requirements regarding quality consistency and purity guarantees.
    French: La nouvelle ordonnance fédérale sur les denrées alimentaires concernant entre autres les eaux minérales, entrée en vigueur le 1er avril 1988 après une période transitoire de deux ans, exige surtout une plus grande constance dans la qualité et une garantie de la pureté.

    Aligning sentences is just a first step toward constructing a probabilistic dictionary (Table 3) for use in aligning words in machine translation (Brown et al., 1990), or for constructing a bilingual concordance (Table 4) for use in lexicography (Klavans and Tzoukermann, 1990).

    Table 3: An Entry in a Probabilistic Dictionary (from Brown et al., 1990)

    English   French   Prob(French | English)
    the       le       0.610
    the       la       0.178
    the       l'       0.083
    the       les      0.023
    the       ce       0.013
    the       il       0.012
    the       de       0.009
    the       à        0.007
    the       que      0.007

    Table 4: A Bilingual Concordance

    bank/banque ("money" sense)
        and the governor of the              bank of canada have frequently
        et le gouverneur de la               banque du canada ont fréquemment
        800 per cent in one week through     bank action. SENT there
        % en une semaine à cause d'une       banque. SENT voilà

    bank/banc ("place" sense)
        such was the case in the georges     bank issue which was settled betw
        ats-unis et le canada à propos du    banc de george.
        he said the nose and tail of the     bank were surrendered by
        [illegible] du                       banc. SENT [illegible]

    Although there has been some previous work on sentence alignment, e.g., (Brown, Lai, and Mercer, 1991), (Kay and Röscheisen, 1988), (Catizone et al., to appear), the alignment task remains a significant obstacle preventing many potential users from reaping many of the benefits of bilingual corpora, because the proposed solutions are often unavailable, unreliable, and/or computationally prohibitive.

    The align program is based on a very simple statistical model of character lengths. The model makes use of the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and that shorter sentences tend to be translated into shorter sentences. A probabilistic score is assigned to each proposed pair of sentences, based on the ratio of lengths of the two sentences (in characters) and the variance of this ratio. This probabilistic score is used in a dynamic programming framework in order to find the maximum likelihood alignment of sentences.


    It is remarkable that such a simple approach can work as well as it does. An evaluation was performed based on a trilingual corpus of 15 economic reports issued by the Union Bank of Switzerland (UBS) in English, French and German (N = 14,680 words, 725 sentences, and 188 paragraphs in English and corresponding numbers in the other two languages). The method correctly aligned all but 4% of the sentences. Moreover, it is possible to extract a large subcorpus which has a much smaller error rate. By selecting the best scoring 80% of the alignments, the error rate is reduced from 4% to 0.7%. There were roughly the same number of errors in each of the English-French and English-German alignments, suggesting that the method may be fairly language independent. We believe that the error rate is considerably lower in the Canadian Hansards because the translations are more literal.

    2. A Dynamic Programming Framework

    Now, let us consider how sentences can be aligned within a paragraph. The program makes use of the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and that shorter sentences tend to be translated into shorter sentences.[2] A probabilistic score is assigned to each proposed pair of sentences, based on the ratio of lengths of the two sentences (in characters) and the variance of this ratio. This probabilistic score is used in a dynamic programming framework in order to find the maximum likelihood alignment of sentences. We were led to this approach after noting that the lengths (in characters) of English and German paragraphs are highly correlated (.991), as illustrated in the following figure.

    2. We will have little to say about how sentence boundaries are identified. Identifying sentence boundaries is not always as easy as it might appear, for reasons described in Liberman and Church (to appear). It would be much easier if periods were always used to mark sentence boundaries, but unfortunately, many periods have other purposes. In the Brown Corpus, for example, only 90% of the periods are used to mark sentence boundaries; the remaining 10% appear in numerical expressions, abbreviations and so forth. In the Wall Street Journal, there is even more discussion of dollar amounts and percentages, as well as more use of abbreviated titles such as Mr.; consequently, only 53% of the periods in the Wall Street Journal are used to identify sentence boundaries. For the UBS data, a simple set of heuristics was used to identify sentence boundaries. The dataset was sufficiently small that it was possible to correct the remaining mistakes by hand. For a larger dataset, such as the Canadian Hansards, it was not possible to check the results by hand. We used the same procedure which is used in (Church, 1988). This procedure was developed by Kathryn Baker (private communication).

    [Figure 1: Paragraph Lengths are Highly Correlated (scatter plot not reproduced)]

    Figure 1. The horizontal axis shows the length of English paragraphs, while the vertical scale shows the lengths of the corresponding German paragraphs. Note that the correlation is quite large (.991).

    Dynamic programming is often used to align two sequences of symbols in a variety of settings, such as genetic code sequences from different species, speech sequences from different speakers, gas chromatograph sequences from different compounds, and geologic sequences from different locations (Sankoff and Kruskal, 1983). We could expect these matching techniques to be useful, as long as the order of the sentences does not differ too radically between the two languages. Details of the alignment techniques differ considerably from one application to another, but all use a distance measure to compare two individual elements within the sequences, and a dynamic programming algorithm to minimize the total distances between aligned elements within two sequences. We have found that the sentence alignment problem fits fairly well into this framework.


    3. The Distance Measure

    It is convenient for the distance measure to be based on a probabilistic model so that information can be combined in a consistent way. Our distance measure is an estimate of $-\log \mathrm{Prob}(match \mid \delta)$, where $\delta$ depends on $l_1$ and $l_2$, the lengths of the two portions of text under consideration. The log is introduced here so that adding distances will produce desirable results.

    This distance measure is based on the assumption that each character in one language, $L_1$, gives rise to a random number of characters in the other language, $L_2$. We assume these random variables are independent and identically distributed with a normal distribution. The model is then specified by the mean, $c$, and variance, $s^2$, of this distribution. $c$ is the expected number of characters in $L_2$ per character in $L_1$, and $s^2$ is the variance of the number of characters in $L_2$ per character in $L_1$. We define $\delta$ to be $(l_2 - l_1 c)/\sqrt{l_1 s^2}$ so that it has a normal distribution with mean zero and variance one (at least when the two portions of text under consideration actually do happen to be translations of one another).

    The parameters $c$ and $s^2$ are determined empirically from the UBS data. We could estimate $c$ by counting the number of characters in German paragraphs and then dividing by the number of characters in the corresponding English paragraphs. We obtain $81105/73481 \approx 1.1$. The same calculation on French and English paragraphs yields $c \approx 72302/68450 \approx 1.06$ as the expected number of French characters per English character. As will be explained later, performance does not seem to be very sensitive to these precise language-dependent quantities, and therefore we simply assume $c = 1$, which simplifies the program considerably.

    The model assumes that $s^2$ is proportional to length. The constant of proportionality is determined by the slope of a robust regression. The result for English-German is $s^2 = 7.3$, and for English-French is $s^2 = 5.6$. Again, we have found that the difference in the two slopes is not too important. Therefore, we can combine the data across languages, and adopt the simpler language-independent estimate $s^2 = 6.8$, which is what is actually used in the program.
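As a minimal sketch of the standardized length difference defined above, the following assumes the language-independent settings c = 1 and s² = 6.8 reported in the text. The function name `delta` and its argument names are illustrative and are not taken from the align program.

```python
import math

def delta(l1, l2, c=1.0, s2=6.8):
    """Standardized length difference (l2 - l1*c) / sqrt(l1 * s2).

    l1, l2: character lengths of the two portions of text.
    c, s2: mean and variance parameters (defaults are the values in the text).
    """
    if l1 <= 0:
        # The formula divides by sqrt(l1 * s2), so it needs a positive l1.
        raise ValueError("l1 must be positive")
    return (l2 - l1 * c) / math.sqrt(l1 * s2)

# Character-count ratios quoted in the text for estimating c.
c_german = 81105 / 73481   # German characters per English character
c_french = 72302 / 68450   # French characters per English character
```

Two portions of equal length give a delta of zero, and delta grows in magnitude as the lengths diverge, scaled by the square root of the length.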

    We now appeal to Bayes Theorem to estimate $\mathrm{Prob}(match \mid \delta)$ as a constant times $\mathrm{Prob}(\delta \mid match)\,\mathrm{Prob}(match)$. The constant can be ignored since it will be the same for all proposed matches. The conditional probability $\mathrm{Prob}(\delta \mid match)$ can be estimated by

    $$\mathrm{Prob}(\delta \mid match) = 2\,(1 - \mathrm{Prob}(|\delta|))$$

    where $\mathrm{Prob}(|\delta|)$ is the probability that a random variable, $z$, with a standardized (mean zero, variance one) normal distribution, has magnitude at least as large as $|\delta|$.

    The program computes $\delta$ directly from the lengths of the two portions of text, $l_1$ and $l_2$, and the two parameters, $c$ and $s^2$. That is, $\delta = (l_2 - l_1 c)/\sqrt{l_1 s^2}$. Then, $\mathrm{Prob}(|\delta|)$ is computed by integrating a standard normal distribution (with mean zero and variance 1). Many statistics textbooks include a table for computing this.

    The prior probability of a match, $\mathrm{Prob}(match)$, is fit with the values in Table 5 (below), which were determined from the UBS data. We have found that a sentence in one language normally matches exactly one sentence in the other language (1-1); three additional possibilities are also considered: 1-0 (including 0-1), 2-1 (including 1-2), and 2-2. Table 5 shows all four possibilities.

    Table 5: Prob(match)

    Category      Frequency   Prob(match)
    1-1           1167        0.89
    1-0 or 0-1    13          0.0099
    2-1 or 1-2    117         0.089
    2-2           15          0.011
    Total         1312        1.00
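The distance for one proposed match can be sketched as follows. This is an illustration under stated assumptions, not the align program's code. It reads Prob(|δ|) as the standard normal integral up to |δ|, so that 2(1 − Prob(|δ|)) is a two-sided tail probability, which equals erfc(|δ|/√2). The names `PRIOR` and `match_cost` and the small `floor` that guards against log(0) are hypothetical additions.

```python
import math

# Prior probabilities of each match category, from Table 5.
# "1-0" also stands for 0-1, and "2-1" also stands for 1-2.
PRIOR = {"1-1": 0.89, "1-0": 0.0099, "2-1": 0.089, "2-2": 0.011}

def match_cost(delta, category, floor=1e-300):
    """Return -log(Prob(delta | match) * Prob(match)), dropping the constant."""
    # Two-sided tail of a standard normal: 2 * (1 - Phi(|delta|)).
    p_delta = math.erfc(abs(delta) / math.sqrt(2.0))
    # The floor (an assumption) keeps the log finite for very large |delta|.
    return -math.log(max(p_delta * PRIOR[category], floor))
```

Because the cost is a negative log probability, costs of successive matches can be added, and a smaller total corresponds to a more likely alignment.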

    This completes the discussion of the distance measure. $\mathrm{Prob}(match \mid \delta)$ is computed as an (irrelevant) constant times $\mathrm{Prob}(\delta \mid match)\,\mathrm{Prob}(match)$. $\mathrm{Prob}(match)$ is computed using the values in Table 5. $\mathrm{Prob}(\delta \mid match)$ is computed by assuming that $\mathrm{Prob}(\delta \mid match) = 2\,(1 - \mathrm{Prob}(|\delta|))$, where $\mathrm{Prob}(|\delta|)$ has a standard normal distribution. We first calculate $\delta$ as $(l_2 - l_1 c)/\sqrt{l_1 s^2}$ and then $\mathrm{Prob}(|\delta|)$ is computed by integrating a standard normal distribution.

    The distance function two_side_distance is defined in a general way to allow for insertions, deletion, substitution, etc. The function takes four arguments: x1, y1, x2, y2.

    1. Let two_side_distance(x1, y1; 0, 0) be the cost of substituting x1 with y1,
    2. two_side_distance(x1, 0; 0, 0) be the cost of deleting x1,
    3. two_side_distance(0, y1; 0, 0) be the cost of insertion of y1,
    4. two_side_distance(x1, y1; x2, 0) be the cost of contracting x1 and x2 to y1,
    5. two_side_distance(x1, y1; 0, y2) be the cost of expanding x1 to y1 and y2, and
    6. two_side_distance(x1, y1; x2, y2) be the cost of merging x1 and x2 and matching with y1 and y2.

    4. The Dynamic Programming Algorithm

    The algorithm is summarized in the following recursion equation. Let $s_i$, $i = 1 \ldots I$, be the sentences of one language, and $t_j$, $j = 1 \ldots J$, be the translations of those sentences in the other language. Let $d$ be the distance function (two_side_distance) described in the previous section, and let $D(i,j)$ be the minimum distance between sentences $s_1, \ldots, s_i$ and their translations $t_1, \ldots, t_j$, under the maximum likelihood alignment. $D(i,j)$ is computed recursively, where the recurrence minimizes over six cases (substitution, deletion, insertion, contraction, expansion and merger) which, in effect, impose a set of slope constraints. That is, $D(i,j)$ is calculated by the following recurrence with the initial condition $D(0,0) = 0$.

    $$
    D(i,j) = \min \begin{cases}
    D(i, j-1) + d(0, t_j; 0, 0) \\
    D(i-1, j) + d(s_i, 0; 0, 0) \\
    D(i-1, j-1) + d(s_i, t_j; 0, 0) \\
    D(i-1, j-2) + d(s_i, t_j; 0, t_{j-1}) \\
    D(i-2, j-1) + d(s_i, t_j; s_{i-1}, 0) \\
    D(i-2, j-2) + d(s_i, t_j; s_{i-1}, t_{j-1})
    \end{cases}
    $$
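The recurrence can be sketched in Python as below. This is a hedged illustration rather than the align program itself: the names `align`, `cost`, and `PRIOR` are hypothetical, merged sentences are compared by their summed character lengths, and when the first length is zero (the insertion case) the variance is scaled by the second length instead, since the formula for δ is undefined there. That fallback and the probability floor are assumptions, not something the text specifies.

```python
import math

C, S2 = 1.0, 6.8  # language-independent parameters from Section 3
# (sentences consumed on side 1, on side 2) -> prior from Table 5
PRIOR = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
         (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}

def cost(l1, l2, prior):
    """-log(Prob(delta | match) * Prob(match)) for total lengths l1 and l2."""
    scale = l1 if l1 > 0 else l2  # assumption: fall back to l2 when l1 == 0
    if scale == 0:
        return 0.0
    delta = (l2 - l1 * C) / math.sqrt(scale * S2)
    p = math.erfc(abs(delta) / math.sqrt(2.0))  # two-sided normal tail
    return -math.log(max(p * prior, 1e-300))    # floor avoids log(0)

def align(src_lens, tgt_lens):
    """Align two lists of sentence lengths; return a list of index groups."""
    I, J = len(src_lens), len(tgt_lens)
    inf = float("inf")
    D = [[inf] * (J + 1) for _ in range(I + 1)]
    back = [[None] * (J + 1) for _ in range(I + 1)]
    D[0][0] = 0.0  # initial condition
    for i in range(I + 1):
        for j in range(J + 1):
            for (di, dj), prior in PRIOR.items():
                if i < di or j < dj or D[i - di][j - dj] == inf:
                    continue
                l1 = sum(src_lens[i - di:i])
                l2 = sum(tgt_lens[j - dj:j])
                cand = D[i - di][j - dj] + cost(l1, l2, prior)
                if cand < D[i][j]:
                    D[i][j], back[i][j] = cand, (di, dj)
    # Trace back from the final cell to recover the chosen matches.
    matches, i, j = [], I, J
    while i > 0 or j > 0:
        di, dj = back[i][j]
        matches.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    matches.reverse()
    return matches
```

For example, `align([100, 50], [100, 50])` pairs the sentences one to one, while `align([100, 100], [200])` groups both source sentences with the single target sentence.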

    5. Evaluation

    To evaluate align, its results were compared with a human alignment. All of the UBS sentences were aligned by a primary judge, a native speaker of English with a reading knowledge of French and German. Two additional judges, a native speaker of French and a native speaker of German, respectively, were used to check the primary judge on 43 of the more difficult paragraphs having 230 sentences (out of 188 total paragraphs with 725 sentences). Both of the additional judges were also fluent in English, having spent the last few years living and working in the United States, though they were both more comfortable with their native language than with English.

    The materials were prepared in order to make the task somewhat less tedious for the judges. Each paragraph was printed in three columns, one for each of the three languages: English, French and German. Blank lines were inserted between sentences. The judges were asked to draw lines between matching sentences. The judges were also permitted to draw a line between a sentence and "null" if they thought that the sentence was not translated. For the purposes of this evaluation, two sentences were defined to "match" if they shared a common clause. (In a few cases, a pair of sentences shared only a phrase or a word, rather than a clause; these sentences did not count as a "match" for the purposes of this experiment.)

    After checking the primary judge with the other two judges, it was decided that the primary judge's results were sufficiently reliable that they could be used as a standard for evaluating the program. The primary judge made only two mistakes on the 43 hard paragraphs (one French mistake and one German mistake), whereas the program made 44 errors on the same materials. Since the primary judge's error rate is so much lower than that of the program, it was decided that we needn't be concerned with the primary judge's error rate. If the program and the judge disagree, we can assume that the program is probably wrong.

    The 43 "hard" paragraphs were selected by looking for sentences that mapped to something other than themselves after going through both German and French. Specifically, for each English sentence, we attempted to find the corresponding German sentences, and then for each of them, we attempted to find the corresponding French sentences, and then we attempted to find the corresponding English sentences, which should hopefully get us back to where we started. The 43 paragraphs included all sentences in which this process could not be completed around the loop. This relatively small group of paragraphs (23 percent of all paragraphs) contained a relatively large fraction of the program's errors (82 percent). Thus, there does seem to be some verification that this trilingual criterion does in fact succeed in distinguishing more difficult paragraphs from less difficult ones.

    There are three pairs of languages: English-German, English-French and French-German. We will report just the first two. (The third pair is probably dependent on the first two.) Errors are reported with respect to the judge's responses. That is, for each of the "matches" that the primary judge found, we report the program as correct if it found the "match" and incorrect if it didn't. This convention allows us to compare performance across different algorithms in a straightforward fashion.

    The program made 36 errors out of 621 total alignments (5.8%) for English-French, and 19 errors out of 695 (2.7%) alignments for English-German. Overall, there were 55 errors out of a total of 1316 alignments (4.2%).
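The reported percentages follow directly from the error and alignment counts. A short check of the arithmetic, with an illustrative helper name `error_rate`:

```python
def error_rate(errors, total):
    """Error rate as a percentage of the judged alignments."""
    return 100.0 * errors / total

rates = {
    "English-French": error_rate(36, 621),
    "English-German": error_rate(19, 695),
    # The overall figure pools both language pairs.
    "overall": error_rate(36 + 19, 621 + 695),
}
```

Rounded to one decimal place, these come to 5.8, 2.7, and 4.2 percent, matching the figures in the text.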

    Table 6: Complex Matches are More Difficult

                  English-French       English-German       total
    category      N     err   %        N     err   %        N      err   %
    1-0 or 0-1    8     8     100      5     5     100      13     13    100
    1-1           542   14    2.6      625   9     1.4      1167   23    2.0
    2-1 or 1-2    59    8     14       58    2     3.4      117    10    9
    2-2           9     3     33       6     2     33       15     5     33
    3-1 or 1-3    1     1     100      1     1     100      2      2     100
    3-2 or 2-3    1     1     100      0     0     0        1      1     100

    Table 6 breaks down the errors by category, illustrating that complex matches are more difficult. 1-1 alignments are by far the easiest. The 2-1 alignments, which come next, have four times the error rate for 1-1. The 2-2 alignments are harder still, but a majority of the alignments are found. The 3-1 and 3-2 alignments are not even considered by the algorithm, so naturally all three are counted as errors. The most embarrassing category is 1-0, which was never handled correctly. In addition, when the algorithm assigns a sentence to the 1-0 category, it is also always wrong. Clearly, more work is needed to deal with the 1-0 category. It may be necessary to consider language-specific methods in order to deal adequately with this case.

    We observe that the score is a good predictor of performance, and therefore the score can be used to extract a large subcorpus which has a much smaller error rate. By selecting the best scoring 80% of the alignments, the error rate can be reduced from 4% to 0.7%. In general, we can trade off the size of the subcorpus and the accuracy by setting a threshold, and rejecting alignments with a score above this threshold. Figure 2 examines this trade-off in more detail.
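Selecting a subcorpus by score can be sketched as below, assuming each alignment carries the distance it was assigned, so that a lower score means a more confident match. The function name `keep_best` and the (alignment, score) pair layout are hypothetical.

```python
def keep_best(scored_alignments, fraction=0.8):
    """Keep the best-scoring fraction of (alignment, score) pairs.

    Scores are distances, so the pairs with the lowest scores are retained.
    """
    ranked = sorted(scored_alignments, key=lambda pair: pair[1])
    keep = int(len(ranked) * fraction)
    return ranked[:keep]

# Toy input: five alignments with made-up scores, for illustration only.
example = [("a", 0.1), ("b", 5.0), ("c", 0.3), ("d", 0.2), ("e", 9.0)]
retained = keep_best(example, 0.8)
```

Choosing the fraction is equivalent to choosing a score threshold: here the four lowest-scoring alignments are kept and the one with the highest distance is rejected.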


    [Figure 2: Extracting a Subcorpus with Lower Error Rate (plot not reproduced; horizontal axis: percent of retained alignments, 20 to 100; vertical axis: error rate)]

    Figure 2. The fact that the score is such a good predictor of performance can be used to extract a large subcorpus which has a much smaller error rate. In general, we can trade off the size of the subcorpus and the accuracy by setting a threshold, and rejecting alignments with a score above this threshold. The horizontal axis shows the size of the subcorpus, and the vertical axis shows the corresponding error rate. An error rate of about 2/3% can be obtained by selecting a threshold that would retain approximately 80% of the corpus.

    Less formal tests of the error rate in the Hansards suggest that the overall error rate is about 2%, while the error rate for the easy 80% of the sentences is about 0.4%. Apparently the Hansard translations are more literal than the UBS reports. It took 20 hours of real time on a Sun 4 to align 367 days of Hansards, or 3.3 minutes per Hansard-day. The 367 days of Hansards contain about 890,000 sentences or about 37 million "words" (tokens). About half of the computer time is spent identifying tokens, sentences, and paragraphs, while the other half of the time is spent in the align program itself.

    6. Measuring Length In Terms Of Words Rather than Characters

    It is interesting to consider what happens if we change our definition of length to count words rather than characters. It might seem that words are a more natural linguistic unit than characters (Brown, Lai and Mercer, 1991). However, we have found that words do not perform nearly as well as characters. In fact, the "words" variation increases the number of errors dramatically (from 36 to 50 for English-French and from 19 to 35 for English-German). The total errors were thereby increased from 55 to 85, or from 4.2% to 6.5%.

    We believe that characters are better because there are more of them, and therefore there is less uncertainty. On the average, there are 117 characters per sentence (including white space) and only 17 words per sentence. Recall that we have modeled variance as proportional to sentence length, $V = s^2 l$. Using the character data, we found previously that $s^2 = 6.5$. The same argument applied to words yields $s^2 = 1.9$. For comparison's sake, it is useful to consider the ratio $\sqrt{V(m)}/m$ (or equivalently, $s/\sqrt{m}$, since $V(m) = s^2 m$), where $m$ is the mean sentence length. We obtain $\sqrt{V(m)}/m$ ratios of 0.22 for characters and 0.33 for words, indicating that characters are less noisy than words, and are therefore more suitable for use in align.

    7. Conclusions

    This paper has proposed a method for aligning sentences in a bilingual corpus, based on a simple probabilistic model, described in Section 3. The model was motivated by the observation that longer regions of text tend to have longer translations, and that shorter regions of text tend to have shorter translations. In particular, we found that the correlation between the length of a paragraph in characters and the length of its translation was extremely high (0.991). This high correlation suggests that length might be a strong clue for sentence alignment.

    Although this method is extremely simple, it is also quite accurate. Overall, there was a 4.2% error rate on 1316 alignments, averaged over both English-French and English-German data. In addition, we find that the probability score is a good predictor of accuracy, and consequently, it is possible to select a subset of 80% of the alignments with a much smaller error rate of only 0.7%.

    The method is also fairly language-independent. Both English-French and English-German data were processed using the same parameters. If necessary, it is possible to fit the six parameters in the model with language-specific values, though, thus far, we have not found it necessary (or even helpful) to do so.

    We have examined a number of variations. In particular, we found that it is better to use characters rather than words in counting sentence length. Apparently, the performance is better with characters because there is less variability in the ratios of sentence lengths so measured. Using words as units increases the error rate by half, from 4.2% to 6.5%.

    In the future, we would hope to extend the method to make use of lexical constraints. However, it is remarkable just how well we can do without such constraints. We might advocate the simple character length alignment procedure as a useful first pass, even to those who advocate the use of lexical constraints. The character length procedure might complement a lexical constraint approach quite well, since it is quick but has some errors while a lexical approach is probably slower, though possibly more accurate. One might go with the character length procedure when the distance scores are small, and back off to a lexical approach as necessary.
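The back-off idea in the last paragraph, together with the threshold trade-off of Figure 2, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the align program itself: it assumes each candidate alignment already carries a distance score where lower means more confident, and the names `threshold_for_fraction` and `split_by_threshold`, as well as the toy data, are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical record: (alignment identifier, distance score).
# Lower scores mean the length-based model is more confident.
ScoredAlignment = Tuple[str, float]


def threshold_for_fraction(scores: Sequence[float], keep_fraction: float) -> float:
    """Return a score threshold that retains roughly keep_fraction of the alignments.

    Ties at the threshold are all retained, so slightly more may be kept.
    """
    if not scores:
        raise ValueError("no scores given")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    ordered = sorted(scores)
    keep_count = max(1, round(keep_fraction * len(ordered)))
    return ordered[keep_count - 1]


def split_by_threshold(
    alignments: Sequence[ScoredAlignment], threshold: float
) -> Tuple[List[ScoredAlignment], List[ScoredAlignment]]:
    """Accept alignments scoring at or below the threshold; defer the rest.

    The deferred list is what one might hand to a slower lexical method.
    """
    accepted = [a for a in alignments if a[1] <= threshold]
    deferred = [a for a in alignments if a[1] > threshold]
    return accepted, deferred


if __name__ == "__main__":
    # Invented scores, for illustration only.
    toy = [("a1", 0.5), ("a2", 9.0), ("a3", 1.2), ("a4", 3.3), ("a5", 0.1)]
    t = threshold_for_fraction([score for _, score in toy], 0.8)
    kept, rest = split_by_threshold(toy, t)
    print(f"threshold={t}, kept={len(kept)}, deferred={len(rest)}")
```

Setting `keep_fraction` to 0.8 mirrors the 80% subcorpus discussed in the paper; the right fraction for other data would have to be chosen by inspecting the error rate at several thresholds.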

    ACKNOWLEDGEMENTS

    We thank Susanne Wolff and Evelyne Tzoukermann for their pains in aligning sentences. Susan Warwick provided us with the UBS trilingual corpus and posed the problem addressed here.

    REFERENCES

    Brown, P., J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, J. Lafferty, R. Mercer, and P. Roossin, (1990) "A Statistical Approach to Machine Translation," Computational Linguistics, v 16, pp 79-85.

    Brown, P., J. Lai, and R. Mercer, (1991) "Aligning Sentences in Parallel Corpora," ACL Conference, Berkeley.

    Catizone, R., G. Russell, and S. Warwick, (to appear) "Deriving Translation Data from Bilingual Texts," in Zernik (ed), Lexical Acquisition: Using on-line Resources to Build a Lexicon, Lawrence Erlbaum.

    Church, K., (1988) "A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text," Second Conference on Applied Natural Language Processing, Austin, Texas.

    Klavans, J., and E. Tzoukermann, (1990) "The BICORD System," COLING-90, pp 174-179.

    Kay, M. and M. Röscheisen, (1988) "Text-Translation Alignment," unpublished ms., Xerox Palo Alto Research Center.

    Liberman, M., and K. Church, (to appear) "Text Analysis and Word Pronunciation in Text-to-Speech Synthesis," in Furui, S., and Sondhi, M. (eds.), Advances in Speech Signal Processing.