Learning Paraphrases in Hebrew: Article Overview and Initial Work

Gabi Stanovsky



Definitions

• Paraphrase – “phrases, sentences, or longer natural language expressions that convey almost the same information”

• Textual Entailment – “pairs of natural language expressions, such that a human who reads (and trusts) the first element of a pair would most likely infer that the other element is also true”

(Androutsopoulos and Malakasiotis, 2010)

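אזרח ומאבטח השתלטו על אדם שרצה לשדוד סניף דואר בב"ש (YNET, 14.11.11)
[A civilian and a security guard overpowered a man who tried to rob a post office branch in Be'er Sheva]

אזרח ומאבטח השתלטו על שודד בסניף בנק הדואר המרכזי בב"ש (NRG, 14.11.11)
[A civilian and a security guard overpowered a robber at the central Postal Bank branch in Be'er Sheva]

בשנת 1999 זכתה קבוצת פנאתינייקוס בגביע אירופה
[In 1999 the Panathinaikos team won the European Cup]

בשנת 1999 עבר קטש לקבוצת פנאתינייקוס היוונית, ואף זכה להניף את גביע אירופה איתה (ויקיפדיה)
[In 1999 Katash moved to the Greek team Panathinaikos, and even got to lift the European Cup with it (Wikipedia)]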


Tasks

• Paraphrase Extraction – Extract paraphrases occurring within text.

• Paraphrase Identification – Determine whether two given sentences are paraphrases.

• Paraphrase Generation – Generate paraphrases of a given input sentence.


Common Stages in Learning Paraphrases

• Obtain monolingual corpus.

• Align paragraphs and sentences.

• Learn Paraphrases.

• Apply Learned rules to solve NLP tasks.


Research Questions

• Are there specific properties of the Hebrew language that allow paraphrasing?

• Which datasets can be used to collect and identify a database of paraphrases in Hebrew?

• Could approaches taken for other languages (especially English) be applied to Hebrew?

• How could paraphrases in Hebrew be learned (encoded) in order to help in NLP tasks?


Applications

• Article summarization

• Textual entailment

• Thesaurus

• Enrich automatic generation of text

• Machine translation

Previous Work in Other Languages

• Alignment - Gale and Church:

They hypothesized that, in a pair of aligned sentences, each character in the source sentence gives rise to a certain (language-dependent) number of characters in the target sentence.

Previous Work in Other Languages

• Alignment - Gale and Church:

This model, combined with empirical results from their test corpus, yields a fairly simple alignment algorithm that looks only at the lengths of the input sentences.

Previous Work in Other Languages

• Alignment - Gale and Church:

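• Only alignments of the following types were allowed: 1-1 (substitution), 1-0 (deletion), 0-1 (insertion), 2-1 (contraction), 1-2 (expansion), and 2-2 (merger).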
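
The length-based idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Gale and Church's implementation: the constants `C` and `S2`, the prior for each alignment type, and the exact form of the length penalty are values recalled from common descriptions of the method, and the function names are hypothetical.

```python
import math

# Assumed priors for each alignment type: (source sentences, target sentences).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}
C = 1.0    # assumed expected number of target characters per source character
S2 = 6.8   # assumed variance of that ratio


def length_cost(len_src, len_tgt):
    """Negative log probability that two segments of these lengths match."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / C) / 2.0
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-sided tail probability of a standard normal variable.
    tail = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(tail, 1e-300))


def align(src, tgt):
    """Return the cheapest list of (source indices, target indices) groups."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                # Only character lengths are used, never the words themselves.
                len_src = sum(len(s) for s in src[i:ni])
                len_tgt = sum(len(t) for t in tgt[j:nj])
                new_cost = cost[i][j] - math.log(prior) + length_cost(len_src, len_tgt)
                if new_cost < cost[ni][nj]:
                    cost[ni][nj] = new_cost
                    back[ni][nj] = (i, j)
    groups = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return groups[::-1]
```

Because the cost depends only on sentence lengths, two short sentences on one side are grouped against one long sentence on the other whenever that explains the lengths better than a 1-1 match followed by a deletion.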

Previous Work in Other Languages

• Paraphrase Identification (Barzilay, McKeown, 2001):

– Their dataset consists of multiple English translations of foreign books.

– Assumption: different translators will introduce paraphrases when translating the same source text.

Previous Work in Other Languages

• Paraphrase Identification (Barzilay, McKeown, 2001):

– They then applied an iterative model for extracting paraphrase rules from aligned sentences.

– They created rules of two types: contextual rules and morpho-syntactic rules. The two are co-trained on the aligned corpus, and lexical paraphrases are extracted.

Previous Work in Other Languages

• Contextual Rules:

left1 = (VB0 TO1), right1 = (PRP$2 ,) "Tried to console her"

left2 = (VB0 TO1), right2 = (PRP$2 ,) "Tried to comfort her"

• Morpho-Syntactic Rules:

VB0 TO1 VB1 PRP1 "used to love her"

VB0 TO1 VB2 NN1 IN PRP1 "used to feel affection for her"

• Lexical Paraphrases:

(love, feel affection for)
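
To make the relation between contexts and lexical paraphrases concrete, here is a minimal Python sketch. It is a deliberate simplification, not Barzilay and McKeown's co-training procedure: it takes two already aligned, POS-tagged sentences, strips their shared prefix and suffix, and reports the differing middle spans together with the POS tags that surround them. The function name and the `context` parameter are hypothetical.

```python
def extract_candidate(sent_a, sent_b, context=1):
    """sent_a and sent_b are lists of (word, pos_tag) pairs from aligned sentences.

    Returns the surrounding POS context and the differing spans, or None when
    the sentences do not share at least `context` words on both sides.
    """
    words_a = [word.lower() for word, _ in sent_a]
    words_b = [word.lower() for word, _ in sent_b]
    shortest = min(len(words_a), len(words_b))

    # Length of the shared prefix.
    prefix = 0
    while prefix < shortest and words_a[prefix] == words_b[prefix]:
        prefix += 1

    # Length of the shared suffix, not overlapping the prefix.
    suffix = 0
    while (suffix < shortest - prefix
           and words_a[-1 - suffix] == words_b[-1 - suffix]):
        suffix += 1

    if prefix < context or suffix < context:
        return None
    end_a = len(words_a) - suffix
    end_b = len(words_b) - suffix
    span_a, span_b = words_a[prefix:end_a], words_b[prefix:end_b]
    if not span_a or not span_b:
        return None

    # The shared words are identical, so the tags are read from sent_a only.
    left = tuple(tag for _, tag in sent_a[prefix - context:prefix])
    right = tuple(tag for _, tag in sent_a[end_a:end_a + context])
    return {"left": left, "right": right,
            "pair": (" ".join(span_a), " ".join(span_b))}
```

Applied to the tagged versions of "used to love her" and "used to feel affection for her", the sketch returns the pair (love, feel affection for) with left context (TO) and right context (PRP).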

Previous Work in Other Languages

• Generation – Microsoft:

– The Microsoft NLP team created a system that produces paraphrases of an input English sentence.

– Their system automatically gathered a large training set from news sites, on which they performed sentence alignment.

– They applied statistical learning tools to this dataset to learn generation lattices.

Previous Work in Other Languages

• Generation – Malakasiotis and Androutsopoulos (generate and rank):

– They created a method for ranking paraphrase candidates that gives weight to grammaticality, meaning preservation, and diversity of the paraphrases.

– They used this ranking component to build a new paraphrase generator. The generator creates many paraphrase candidates using other available paraphrasing techniques.

– It then uses the ranking component to rank these candidates and returns the most likely ones.

– They published their dataset of paraphrase pairs with hand-tagged judgment ranks.
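
The generate-and-rank pipeline can be outlined as follows. This is only a sketch of the overall control flow: the candidate generators and the scoring functions are hypothetical callables supplied by the caller, and combining the scores as a weighted sum is an assumption made here for illustration, not necessarily how the original ranker combines them.

```python
def generate_and_rank(sentence, generators, scorers, weights, top_k=3):
    """Pool candidates from several generators, then rank them by a weighted score.

    generators: callables mapping a sentence to an iterable of candidate strings.
    scorers:    dict of name -> callable(source, candidate) returning a number,
                e.g. grammaticality, meaning preservation, diversity.
    weights:    dict of name -> weight for the matching scorer.
    """
    candidates = set()
    for generate in generators:
        candidates.update(generate(sentence))
    candidates.discard(sentence)  # the input itself is not a useful paraphrase

    def total_score(candidate):
        return sum(weights[name] * score(sentence, candidate)
                   for name, score in scorers.items())

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(candidates, key=lambda c: (-total_score(c), c))
    return ranked[:top_k]
```

Keeping generation and ranking separate means new paraphrasing techniques can be plugged in as extra generators without touching the ranking component.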

Previous Work in Other Languages

• Extraction - Hashimoto et al. (2011):

– Their work concentrates on extracting Japanese paraphrases from the web.

– They scan the web for what they call a "definition sentence": a sentence that describes a term.

– To identify such sentences, they parse them and match them against a sentential template: a certain order of part-of-speech tags that, by their hypothesis, a definition sentence should adhere to.

– They then paired mined sentences that share the same subject, on the assumption that such a pair is likely to contain paraphrases.

• Using this method, they report obtaining a large collection of 300K paraphrases with an estimated precision of ~94%.

Previous Work in Hebrew

• (Ordan and Wintner, 2011):

– They developed a medium-scale WordNet for Hebrew, consisting of ~5300 groups of synonymous lexical items (synsets).

– Their approach was to form the WordNet by aligning English and Hebrew expressions and projecting relations from the available English WordNet onto the Hebrew WordNet they created.

– They state that this method (called MultiWordNet) is preferable to building the WordNet from scratch, since Hebrew is poor in computational linguistic resources. The lack of monolingual dictionaries in Hebrew is given as an example of such a missing resource.

Initial Work: Data Mining

• Leading news sites will, with high probability, report on the same event within a day.

• Collect hourly news headlines – our assumption is that finding paraphrases within a single day's collection is a simple task.

• Full story – richer examples?

Initial Work: Data Mining – Examples

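Synonym:

סתיו שפיר נפגעה קל מאוד בהפגנה ליד בית שר האוצר
[Stav Shaffir was very lightly hurt at a demonstration near the Finance Minister's home]

סתיו שפיר נפצעה קל מאוד בהפגנה מול בית שר האוצר
[Stav Shaffir was very lightly injured at a demonstration in front of the Finance Minister's home]

The bad:

הולנד: להטיל סנקציות על הבנק המרכזי באיראן
[The Netherlands: impose sanctions on Iran's central bank]

צרפת: להטיל סנקציות בהיקף חסר תקדים על איראן
[France: impose sanctions of unprecedented scope on Iran]

The good:

השר שלום: מפגן האחדות הפלסטיני מחסל שיחות ישירות עם הרשות
[Minister Shalom: the Palestinian show of unity kills direct talks with the Authority]

השר שלום: מפגן האחדות הפלסטיני סותם הגולל על מו"מ ישיר
[Minister Shalom: the Palestinian show of unity seals the fate of direct negotiations]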

Initial Work: Headlines Alignment

• A baseline alignment method was created:

– For every pair of headlines from the same day, compute the probability of alignment as (2 * #common words) / (#total words), where #total words counts the words of both headlines.

– For each headline in a news source, align it with a headline in another source for which the probability is over a certain threshold.

• Produces fairly good results
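
The baseline above can be sketched in Python as follows. Several details are assumptions made for illustration: headlines are split on whitespace, each word is counted once per headline, every headline is matched only with its best-scoring counterpart, and the threshold value of 0.5 is arbitrary. The function names are hypothetical.

```python
def alignment_score(headline_a, headline_b):
    """(2 * #common words) / (#total words in both headlines)."""
    words_a = set(headline_a.split())
    words_b = set(headline_b.split())
    total = len(words_a) + len(words_b)
    if total == 0:
        return 0.0
    return 2 * len(words_a & words_b) / total


def align_headlines(source_a, source_b, threshold=0.5):
    """Pair each headline of source_a with its best match in source_b.

    A pair is kept only when its score is over the threshold.
    """
    pairs = []
    for headline in source_a:
        best = max(source_b,
                   key=lambda other: alignment_score(headline, other),
                   default=None)
        if best is None:
            continue
        score = alignment_score(headline, best)
        if score > threshold:
            pairs.append((headline, best, score))
    return pairs
```

The score is 1.0 for headlines with identical word sets and 0.0 for headlines with no words in common, so the threshold directly controls how much lexical overlap is required before two headlines are treated as reports of the same event.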

Initial Work: Full Stories Alignment

• Tests with a dynamic programming approach (which gives weight to identical words) for aligning full stories seem to yield some interesting results.
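
One possible shape for such a dynamic programming alignment is sketched below. It is an assumption-laden illustration rather than the method actually tested: each story is taken to be already split into segments, the weight of matching two segments is simply the number of words they share, leaving a segment unmatched costs nothing, and the order of segments is preserved. The function names are hypothetical.

```python
def shared_words(segment_a, segment_b):
    """Number of distinct words that appear in both segments."""
    return len(set(segment_a.split()) & set(segment_b.split()))


def align_stories(segments_a, segments_b):
    """Order-preserving alignment that maximizes the total shared-word weight.

    Returns a list of (index_a, index_b) pairs; None marks an unmatched segment.
    """
    n, m = len(segments_a), len(segments_b)
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            weight = shared_words(segments_a[i - 1], segments_b[j - 1])
            best[i][j] = max(best[i - 1][j - 1] + weight,  # match the two segments
                             best[i - 1][j],               # skip a segment of A
                             best[i][j - 1])               # skip a segment of B

    # Trace back from the end to recover the alignment.
    alignment = []
    i, j = n, m
    while i > 0 or j > 0:
        weight = (shared_words(segments_a[i - 1], segments_b[j - 1])
                  if i > 0 and j > 0 else 0)
        if weight > 0 and best[i][j] == best[i - 1][j - 1] + weight:
            alignment.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or best[i][j] == best[i - 1][j]):
            alignment.append((i - 1, None))
            i -= 1
        else:
            alignment.append((None, j - 1))
            j -= 1
    return alignment[::-1]
```

Segments that share no words with any counterpart come out paired with `None`, which corresponds to a "-" entry in a numbered side-by-side alignment.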

Initial Work: Full Stories Alignment

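ח"כ זהבה גלאון ממרצ תקפה את ראש הממשלה, בנימין נתניהו. במהלך דיון בכנסת בעקבות חתימות של ח"כים: אם תנסה להרוס את הדמוקרטיה, תקבל התקוממות עממית. הפרת את שבועת האמונים שלך לאזרחי המדינה ולחוקיה כאשר התחלת בקמפיין לחיסול הדמוקרטיה במדינת ישראל. דמוקרטיה לא נבחנת רק בשלטון הרוב. אלא גם בכיבוד זכויות האדם של המיעוט ואתה הפרת את שבועת האמונים שלך.

[MK Zehava Gal-On of Meretz attacked Prime Minister Benjamin Netanyahu. During a Knesset debate held following MKs' signatures: If you try to destroy democracy, you will get a popular uprising. You violated your oath of allegiance to the citizens of the state and to its laws when you began a campaign to eliminate democracy in the State of Israel. Democracy is not measured only by majority rule. But also by respect for the human rights of the minority, and you violated your oath of allegiance.]

חברת הכנסת זהבה גלאון ממרצ טענה כי ראש הממשלה, בנימין נתניהו, הפר את שבועת האמונים שלו לאזרחי המדינה בכך שהחל בקמפיין לחיסול הדמוקרטיה במדינת ישראל: דמוקרטיה לא נבחנת רק בשלטון הרוב, אלא גם בכיבוד זכויות האדם של המיעוט. אתה הפרת את שבועת האמונים שלך, כשהחלטת לחסל את המיעוט ולפגוע בזכויות היסוד שלו. אם תנסה להרוס את הדמוקרטיה, תקבל התקוממות עממית, הכריזה.

[Knesset member Zehava Gal-On of Meretz claimed that Prime Minister Benjamin Netanyahu violated his oath of allegiance to the citizens of the state by beginning a campaign to eliminate democracy in the State of Israel: Democracy is not measured only by majority rule, but also by respect for the human rights of the minority. You violated your oath of allegiance when you decided to eliminate the minority and harm its basic rights. If you try to destroy democracy, you will get a popular uprising, she declared.]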

Initial Work: Full Stories Alignment

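1. ח"כ זהבה גלאון ממרצ תקפה את ראש הממשלה,
   [MK Zehava Gal-On of Meretz attacked the Prime Minister,]
2. בנימין נתניהו.
   [Benjamin Netanyahu.]
3. במהלך דיון בכנסת בעקבות חתימות של ח"כים:
   [During a Knesset debate held following MKs' signatures:]
4. אם תנסה להרוס את הדמוקרטיה,
   [If you try to destroy democracy,]
5. תקבל התקוממות עממית.
   [you will get a popular uprising.]
6. הפרת את שבועת האמונים שלך לאזרחי המדינה ולחוקיה כאשר התחלת בקמפיין לחיסול הדמוקרטיה במדינת ישראל.
   [You violated your oath of allegiance to the citizens of the state and to its laws when you began a campaign to eliminate democracy in the State of Israel.]
7. דמוקרטיה לא נבחנת רק בשלטון הרוב.
   [Democracy is not measured only by majority rule.]
8. אלא גם בכיבוד זכויות האדם של המיעוט ואתה הפרת את שבועת האמונים שלך.
   [But also by respect for the human rights of the minority, and you violated your oath of allegiance.]
9. -
10. -
11. -
12. -

1. חברת הכנסת זהבה גלאון ממרצ טענה כי ראש הממשלה,
   [Knesset member Zehava Gal-On of Meretz claimed that the Prime Minister,]
2. בנימין נתניהו,
   [Benjamin Netanyahu,]
3. -
4. -
5. -
6. הפר את שבועת האמונים שלו לאזרחי המדינה בכך שהחל בקמפיין לחיסול הדמוקרטיה במדינת ישראל:
   [violated his oath of allegiance to the citizens of the state by beginning a campaign to eliminate democracy in the State of Israel:]
7. דמוקרטיה לא נבחנת רק בשלטון הרוב,
   [Democracy is not measured only by majority rule,]
8. אלא גם בכיבוד זכויות האדם של המיעוט. אתה הפרת את שבועת האמונים שלך,
   [but also by respect for the human rights of the minority. You violated your oath of allegiance,]
9. כשהחלטת לחסל את המיעוט ולפגוע בזכויות היסוד שלו.
   [when you decided to eliminate the minority and harm its basic rights.]
10. אם תנסה להרוס את הדמוקרטיה,
   [If you try to destroy democracy,]
11. תקבל התקוממות עממית,
   [you will get a popular uprising,]
12. הכריזה.
   [she declared.]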


Future Work Plan

• Align full stories using a baseline method (7.12)

• Provide a better alignment method:

– Using a tagger in order to exploit POS knowledge. (14.12)

– Giving weight to proper nouns (e.g. names) and named entities (21.12):

"The Cassini spacecraft, which is en route to Saturn, is about to make a close pass of the ringed planet's mysterious moon Phoebe"

vs.:

"On its way to an extended mission at Saturn, the Cassini probe on Friday makes its closest rendezvous with Saturn's dark moon Phoebe."

(C. Quirk, C. Brockett and W. Dolan (Microsoft Research), 2004)


Future Work Plan:

• Publish the alignments dataset and estimate its precision rate. (28.12)

• Try to incorporate LDA into the system to get better results. (7.1)

• Try to formulate a method for extracting synonyms from this dataset. (14.1)

• Explore ways of learning and encoding paraphrase rules from the aligned dataset. (21.1)