Data Analytics
M. Vlachos, IBM Research – Zurich, Switzerland
How Difficult is a Foreign-Language Document?

Our Goal
• Provide a semantic ‘sorting’ operator:
– for foreign documents (with respect to the reader's native language)
– based on their perceived comprehensibility

[Figure: documents/books on a topic, ordered from easy to difficult]

Why is it useful? (1/2)
E-Bookstores: Recommendations based on the user’s language level

Why is it useful? (2/2)
Web search/personalization: There is a lot of content overlap on the internet. Provide only a subset to the user, based on both:
– Relevance
– Document difficulty/comprehensibility

“Which documents should I read that better correspond to my understanding of the German language?”

Background - Readability

• Manuals / Army Documents

Background - Readability
• Zipf’s Law

“Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.”
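
As a rough illustration of the rank-frequency relationship in the quote, the sketch below ranks the words of a toy sentence by count and prints rank × frequency, which Zipf's law predicts to be roughly constant on a large corpus. The sample text and the helper name `rank_frequency` are illustrative assumptions; a sentence this short is far too small to exhibit the law itself.

```python
from collections import Counter
import re

def rank_frequency(text):
    """Return (rank, word, count) triples, most frequent word first."""
    words = re.findall(r"[a-zäöüß]+", text.lower())
    counts = Counter(words)
    # Sort by descending count; break ties alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]

sample = "the cat and the dog and the bird saw the cat"
table = rank_frequency(sample)
for rank, word, count in table:
    # Under Zipf's law, rank * count stays roughly constant on large corpora.
    print(rank, word, count, rank * count)
```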

Background - Readability
• Flesch Reading Ease (scale from 100 = easiest to 0 = hardest)

Score     Typical reader
90-100    11 year old
60-70     13-15 year old
0-30      University student

Microsoft Word
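
For reference, the Flesch Reading Ease score can be sketched in a few lines. The formula used below (206.835 − 1.015 × words per sentence − 84.6 × syllables per word) is the standard published one and is not shown on the slide; the vowel-group syllable counter is a crude assumption for English, and the function names are illustrative only.

```python
import re

def count_syllables(word):
    """Crude English estimate: number of vowel groups, at least 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

easy = "The cat sat on the mat. It was a warm day."
hard = ("Comprehensibility estimation necessitates sophisticated "
        "multilingual lexical resources.")
print(round(flesch_reading_ease(easy), 1))
print(round(flesch_reading_ease(hard), 1))
```

Note that the score depends only on sentence length and syllable counts, so it knows nothing about the reader's native language.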

[Figure: two example texts, with readability scores of 65 and 52]

What makes the new problem challenging/interesting?

Cognates
• Many words in different languages exhibit visual and semantic affinity
– Derived words
– ‘Loan’ words

“Ein Experte kam um die Maschine zu reparieren”
“An expert came to repair the machine.”

Compound Words

• German, Dutch, Swedish, etc. are compounding languages.
• Complex words can be built from simpler ones.
• Intuition: Even if a word cannot be found in a dictionary (or has low frequency), if it consists of easy building blocks then it is also easy to understand.
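
The intuition above can be sketched as a small dictionary-based splitter: try to cover the whole word with known entries and prefer the split with the fewest parts. This is a minimal sketch, not the method used in the talk; the tiny `lexicon`, the `min_len` cutoff, and the function name `decompound` are assumptions, and linking elements such as the German "s" in "Arbeitsplatz" are not handled.

```python
def decompound(word, lexicon, min_len=3):
    """Split `word` into lexicon entries, preferring the fewest parts.

    Returns a list of parts, or None if the word cannot be fully covered.
    """
    word = word.lower()
    n = len(word)
    best = [None] * (n + 1)  # best[i] holds the best split of word[:i]
    best[0] = []
    for end in range(1, n + 1):
        for start in range(0, end - min_len + 1):
            part = word[start:end]
            if best[start] is not None and part in lexicon:
                candidate = best[start] + [part]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]

# Hypothetical lexicon of "easy" building blocks.
lexicon = {"finanz", "minister", "krise", "haus", "boot"}

print(decompound("Finanzminister", lexicon))
print(decompound("Hausboot", lexicon))
print(decompound("Xylophon", lexicon))
```

A word whose split succeeds against a lexicon of easy words could then be treated as easy, even if the compound itself is rare.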

How to find word frequency?

Popularity of a word:
• Very large text corpora (e.g. Project Gutenberg)
• Better: Use web search engines!

Putting it all together

• An easy text contains:
– Simple syntactical structure (e.g. no deeply connected sentences)
– Easy words:
  • frequently encountered (e.g. web frequency)
  • similar to my native language: cognates (Finanzkrise = finance crisis)
• Combine these measures to deduce overall difficulty
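
One possible way to combine the measures listed above is sketched below. The slides do not give the actual combination rule, so everything here is an assumption: each word carries a frequency score and a cognate score in [0, 1] (higher means easier), a word counts as easy if either score is high, and the lexical part is mixed with a syntax-complexity score using a hypothetical weight `word_weight`.

```python
def text_difficulty(word_scores, syntax_complexity, word_weight=0.7):
    """Return a difficulty in [0, 1]; higher means harder.

    word_scores: list of (frequency_score, cognate_score) pairs, each in
        [0, 1], where higher means easier (assumed inputs).
    syntax_complexity: value in [0, 1], where higher means harder.
    word_weight: hypothetical weight of the lexical component.
    """
    if not word_scores:
        raise ValueError("word_scores must not be empty")
    # Assumption: a word is easy if it is frequent OR close to a native word.
    hardness = [1.0 - max(freq, cog) for freq, cog in word_scores]
    lexical = sum(hardness) / len(hardness)
    return word_weight * lexical + (1.0 - word_weight) * syntax_complexity

easy_text = text_difficulty([(0.9, 0.1), (0.2, 0.95)], syntax_complexity=0.1)
hard_text = text_difficulty([(0.1, 0.2), (0.3, 0.1)], syntax_complexity=0.8)
print(easy_text, hard_text)
```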

Estimating Cognativity

Compute how easy it is to transform one word into another…

Common Letter Transformations:
• j -> y (ja -> yes)
• k -> c (Architekt -> architect)
• z -> c (sozial -> social)
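
One way to make "how easy it is to transform one word into another" concrete is a weighted edit distance in which the common letter transformations are cheap. The sketch below is an illustration under assumptions, not the estimator from the talk: the cost of 0.2 for a cheap substitution, the unit cost for all other edits, and the names `transform_cost` and `cognate_similarity` are all hypothetical.

```python
# Substitutions (German letter -> English letter) taken from the slide.
CHEAP_SUBSTITUTIONS = {("j", "y"), ("k", "c"), ("z", "c")}

def transform_cost(source, target, cheap_cost=0.2):
    """Weighted Levenshtein distance from `source` to `target`."""
    source, target = source.lower(), target.lower()
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = float(i)  # delete all of source[:i]
    for j in range(cols):
        dist[0][j] = float(j)  # insert all of target[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = source[i - 1], target[j - 1]
            if a == b:
                sub = 0.0
            elif (a, b) in CHEAP_SUBSTITUTIONS:
                sub = cheap_cost
            else:
                sub = 1.0
            dist[i][j] = min(dist[i - 1][j] + 1.0,      # deletion
                             dist[i][j - 1] + 1.0,      # insertion
                             dist[i - 1][j - 1] + sub)  # substitution
    return dist[-1][-1]

def cognate_similarity(source, target):
    """Map the cost to [0, 1]; 1.0 means identical words."""
    longest = max(len(source), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - transform_cost(source, target) / longest

print(transform_cost("sozial", "social"))
print(cognate_similarity("Architekt", "architect"))
```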

Assembling everything

some experiments

Results – User Study

Easy:
“Ich habe mit dreissig Jahren angefangen, Deutsch zu lernen. Das war ziemlich spät; ich glaube, wenn man jünger ist, ist es viel leichter, eine Fremdsprache zu lernen. Aber ich wollte es trotzdem versuchen. Mich interessierte die deutsche Kultur, und einige Mitarbeiter der Firma hatten die Aussicht, einmal in Deutschland zu arbeiten. Also lernte ich Deutsch.”

Medium:
“Über mangelnde Beschäftigung während der Weihnachtsfeiertage konnte sich die städtische Berufsfeuerwehr dieses Jahr wahrhaftig nicht beklagen. Mehr als dreihundert Einsätze im gesamten Münchner Stadtgebiet hielten Oberbranddirektor Wanninger und seine Mitarbeiter rund um die Uhr in Atem. In den meisten Fällen konnten sie das Feuer schnell unter Kontrolle bringen. Zwei Einfamilienhäuser und mehrere Etagenwohnungen brannten jedoch vollständig aus.”

Difficult:
“Das sogenannte Vorgesicht ist ein bis zum Schauen oder mindestens deutlichem Hören gesteigertes Ahnungsvermögen und hier in Westfalen so gewöhnlich, dass man überall doch tatsächlich damit Behaftete trifft und im Grunde fast kein Eingeborener sich gänzlich davon freimachen dürfte. Seine Gabe überkommt ihn zu jeder Tageszeit, am häufigsten jedoch in Mondnächten, wo er plötzlich erwacht und von fieberhafter Unruhe ins Freie oder ans Fenster getrieben wird. Er hört das Geschrei der Verunglückten und an Tür oder Fensterläden das Anklopfen desjenigen, der ihn oder seinen Nachfolger zur Hilfe rufen wird.”

Comparing Readability vs Our Method

Comprehensibility consistently outperforms readability measures

• 300 essays from CourseInfo.com:
– GCSE (high-school)
– A-level (pre-college preparation)
– University level

LingoRANK
• A web tool for keyword-based news retrieval in the German language
• Semantic ranking of documents based on comprehensibility

In summary
• Dynamic Corpus for Term Frequency
– Use search engines
• Difficulty Depends on the User’s Native Language
– Cognate Identification
• Word Decompounding
– Building blocks simple to understand? -> Compound word is simple
– Finanzminister (= Finance Minister)
• We can mesh relevance and comprehensibility using a skyline ordering approach

T. Lappas, M. Vlachos: “Customizing Search Results for Non-Native Speakers”, International Conference on Information and Knowledge Management (CIKM), 2012
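
To give a feel for what a skyline ordering over relevance and comprehensibility might look like, the sketch below peels documents into successive Pareto layers: the first layer holds documents that no other document beats on both criteria, the second layer holds the best of the rest, and so on. This is the generic skyline idea under assumptions, not necessarily the exact procedure of the cited paper; the scores, document names, and function names are made up for illustration.

```python
def dominates(a, b):
    """True if score pair `a` is at least as good as `b` on both criteria
    and strictly better on at least one (higher is better)."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

def skyline_layers(docs):
    """docs: dict mapping name -> (relevance, comprehensibility).

    Returns a list of layers; each layer is a sorted list of names that are
    not dominated by any document still remaining.
    """
    remaining = dict(docs)
    layers = []
    while remaining:
        layer = sorted(
            name for name, score in remaining.items()
            if not any(dominates(other, score) for other in remaining.values())
        )
        layers.append(layer)
        for name in layer:
            del remaining[name]
    return layers

# Hypothetical (relevance, comprehensibility) scores.
docs = {"A": (0.9, 0.2), "B": (0.6, 0.8), "C": (0.5, 0.5), "D": (0.3, 0.1)}
layers = skyline_layers(docs)
print(layers)
```

Documents within one layer are incomparable (one is more relevant, the other easier to read), so no single weighting between the two criteria has to be chosen.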