Text Readability


What is Readability?

A characteristic of text documents:

“the sum total of all those elements within a given piece of printed material that affect the success a group of readers have with it. The success is the extent to which they understand it, read it at an optimal speed, and find it interesting.” (Dale & Chall, 1949)

“ease of understanding or comprehension due to the style of writing” (Klare, 1963)

Text Readability/Difficulty

Readability encompasses a number of areas:

Syntactic complexity of the text
▪ Grammatical arrangement of words within a sentence (e.g., active versus passive sentences have been shown to affect readability)
▪ Simple, compound, and complex sentences

Organization of the text
▪ Discourse structure
▪ Textual cohesion

Semantic complexity of the text


Why Measure Text Difficulty?

Improve literacy rate

Improving instruction delivery

Judging technical manuals

Matching text to appropriate grade level

And many more…

How to Measure?

Assign a score to the text based on some textual cues (e.g., average sentence length)

Readability formulas
▪ Over 200 formulas by the 1980s (DuBay, 2004)

Textual cues
▪ Sentence length, percentage of familiar words, word length, syllables per word, etc.

Testing validity: correlate the predicted score with a reading comprehension score

Traditional Measures

Flesch Reading Ease score
Score = 206.835 - (1.015 × ASL) - (84.6 × ASW)
▪ Score normally falls between 0 and 100 (higher means easier to read)
▪ ASL = average sentence length (words per sentence)
▪ ASW = average number of syllables per word
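The Flesch formula above can be sketched in a few lines. This is only an illustration: the function name and the example counts are hypothetical, and counting syllables (the hard part in practice) is assumed to have been done elsewhere.

```python
def flesch_reading_ease(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Flesch Reading Ease from raw counts (higher scores mean easier text)."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    asl = total_words / total_sentences   # ASL: average sentence length
    asw = total_syllables / total_words   # ASW: average syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw


# Hypothetical counts: 100 words, 5 sentences, 150 syllables.
score = flesch_reading_ease(100, 5, 150)
```

With these counts ASL is 20 and ASW is 1.5, so the score is 206.835 - 20.3 - 126.9.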

Dale-Chall formula
▪ Maintains a list of “easy words”
▪ Score = 0.1579 × PDW + 0.0496 × ASL + 3.6365
▪ PDW = percentage of difficult words

FOG index

Lexile scale

Commonalities among formulae
▪ Linear regression over some predictor variables
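As with Flesch, the Dale-Chall score is a linear function of two predictors. The sketch below follows the formula exactly as written on the slide; the function name and the tiny easy-word list are hypothetical, and treating any word missing from that list as "difficult" is an assumption of this example.

```python
def dale_chall_score(words, sentence_count, easy_words):
    """Dale-Chall score in the form given above; `easy_words` is a set of lowercase words."""
    if not words or sentence_count <= 0:
        raise ValueError("need at least one word and one sentence")
    difficult = sum(1 for w in words if w.lower() not in easy_words)
    pdw = 100.0 * difficult / len(words)   # PDW: percentage of difficult words
    asl = len(words) / sentence_count      # ASL: average sentence length
    return 0.1579 * pdw + 0.0496 * asl + 3.6365


# Hypothetical easy-word list; a real list contains thousands of words.
EASY = {"the", "cat", "sat", "on", "mat"}
words = "The cat sat on the perimeter".split()
score = dale_chall_score(words, 1, EASY)
```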

Readability and Web Documents

Traditional readability measures are robust for large text samples (textbooks and essays), but less so for short, concise web documents.

Web documents are generally noisy

Resource: Predicting Reading Difficulty With Statistical Language Models, Kevyn Collins-Thompson and Jamie Callan

Statistical Language Model for Readability

A language model (LM) can encode more complex relationships than the simple linear regression models used in traditional readability measures

Yields a probability distribution over all grade levels

Relative difficulty of words can be obtained statistically, rather than through the hard-coded approach of traditional measures


Word Usage Across Grades

Earlier grade readers tend to use more concrete words (e.g. red); later grade readers use more abstract words (e.g., determine)

Same observations in web documents


Word Usage Statistics: Example


Unigram Model of Readability

Syntactic features are ignored

Word (semantic) feature based model

Formulated in a classification framework
▪ For a given text passage $T$, predict the semantic difficulty of $T$ relative to a specific grade level $G_i$
▪ Likelihood that the words of $T$ were generated from a representative language model of $G_i$

[Figure: the words of a text are scored against the grade language models $LM_{G_1}, LM_{G_2}, \ldots, LM_{G_n}$, each of which produces a difficulty score]

How is a Text Generated?

$LM(G_i) = \{P(w_1 \mid G_i), P(w_2 \mid G_i), \ldots, P(w_k \mid G_i)\}$

[Figure: the tokens of the text $T$ are drawn from word types 1, 2, …, $k$ of the language model]


Multi-Nomial Distribution Example

“In a recent three-way election for a large country, candidate A received 20% of the votes, candidate B received 30% of the votes, and candidate C received 50% of the votes. If six voters are selected randomly, what is the probability that there will be exactly one supporter for candidate A, two supporters for candidate B and three supporters for candidate C in the sample?”
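The election question can be worked through with a small multinomial calculation. The function name below is hypothetical; the numbers are the ones stated in the question.

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """Probability of observing exactly `counts` outcomes per category."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    coefficient = factorial(sum(counts))
    for c in counts:
        coefficient //= factorial(c)   # n! / (x_1! x_2! ... x_k!)
    return coefficient * prod(p ** c for p, c in zip(probs, counts))


# One supporter of A, two of B, three of C among six voters.
answer = multinomial_pmf([1, 2, 3], [0.2, 0.3, 0.5])
```

The coefficient is 6!/(1! 2! 3!) = 60, and 60 × 0.2 × 0.09 × 0.125 = 0.135.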

A Generative Model: Multi-Nomial Naïve Bayes (MNB)

Multi-nomial distribution
▪ $n$ independent trials
▪ Each trial leads to a success for exactly one of $k$ categories
▪ Each category has a given fixed success probability
▪ Probability mass function
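For reference, the standard multinomial probability mass function for $n$ trials over $k$ categories can be written as below. The symbols $x_j$ (number of successes for category $j$) and $p_j$ (its fixed success probability) are notation assumed here.

```latex
P(X_1 = x_1, \ldots, X_k = x_k) = \frac{n!}{x_1! \, x_2! \cdots x_k!} \prod_{j=1}^{k} p_j^{x_j},
\qquad \sum_{j=1}^{k} x_j = n, \quad \sum_{j=1}^{k} p_j = 1
```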

Generative Model Assumptions

Unigram language model

A hypothetical author generates the tokens of $T$ as follows:
▪ Choose a grade language model $G_i$ according to a prior probability distribution $P(G_i)$: “I will write for grade level 4” [Explicit]
▪ Choose a passage length $|T|$ according to a probability distribution $P(|T|)$: “I will write no more than 100 words” [Explicit/Implicit]
▪ Sample $|T|$ tokens from $G_i$’s multi-nomial word distribution: “I will pick words with a certain distribution” [Implicit]

MNB Model for Readability

We need to compute $P(G_i \mid T)$: the probability that $T$ is generated from the language model of grade $G_i$
▪ Bayes’ theorem: $P(G_i \mid T) = P(T \mid G_i) \, P(G_i) / P(T)$
▪ Compute $P(T \mid G_i)$


Classification model
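Under the generative assumptions listed earlier (a grade prior, a passage-length distribution, and multinomial sampling of tokens), one plausible form of the classification rule is sketched below. It is a reconstruction from those assumptions rather than a quotation; $C(w)$ (count of word type $w$ in $T$), $|T|$ (passage length in tokens), and $V$ (the vocabulary) are notation assumed here.

```latex
P(T \mid G_i) = P(|T|) \, |T|! \prod_{w \in V} \frac{P(w \mid G_i)^{C(w)}}{C(w)!}

\hat{G} = \arg\max_{G_i} \Big[ \log P(G_i) + \log P(|T|) + \log |T|!
          + \sum_{w \in V} \big( C(w) \log P(w \mid G_i) - \log C(w)! \big) \Big]
```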


Simplified assumptions
▪ All grades are equally likely a priori
▪ All passage lengths are equally likely

Simplified classification model
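With equal grade priors and equal passage-length probabilities, every term that does not depend on the grade drops out of the comparison. A sketch of the resulting rule, where $C(w)$ (count of word type $w$ in passage $T$) and $V$ (the vocabulary) are assumed notation:

```latex
\hat{G} = \arg\max_{G_i} \sum_{w \in V} C(w) \log P(w \mid G_i)
```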


MNB for Readability: Example


Example 1: Passage ”


Example 2: Passage $T$ = “the red perimeter”

Example 3: Passage $T$ = “the perimeter was optimal”
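The two passages above can be run through a toy version of the simplified classifier, which scores each grade by the sum of count × log-probability over the words of the passage. Everything numeric here is hypothetical: the three grade models and their word probabilities are invented for illustration, and the function names are not from the slides.

```python
import math
from collections import Counter

# Hypothetical unigram models P(w | grade); each row sums to 1.
GRADE_MODELS = {
    1:  {"the": 0.50, "red": 0.30, "was": 0.18, "perimeter": 0.01, "optimal": 0.01},
    5:  {"the": 0.50, "red": 0.10, "was": 0.15, "perimeter": 0.20, "optimal": 0.05},
    12: {"the": 0.45, "red": 0.03, "was": 0.12, "perimeter": 0.15, "optimal": 0.25},
}


def log_score(tokens, model):
    """Sum of C(w) * log P(w | grade) over the word types in the passage."""
    counts = Counter(tokens)
    # A word missing from `model` raises KeyError; smoothing addresses that case.
    return sum(c * math.log(model[w]) for w, c in counts.items())


def predict_grade(passage):
    """Return the grade whose model gives the passage the highest score."""
    tokens = passage.lower().split()
    return max(GRADE_MODELS, key=lambda g: log_score(tokens, GRADE_MODELS[g]))


grade_a = predict_grade("the red perimeter")
grade_b = predict_grade("the perimeter was optimal")
```

Under these made-up probabilities, “the red perimeter” is assigned to grade 5 and “the perimeter was optimal” to grade 12.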

Smoothing

What if a word does not belong to the language model for a grade level?
▪ Without smoothing, it would be assigned zero probability
▪ Redistribute part of the probability mass of known words to rare and unseen words

Smoothing Model

Smooth each individual grade-based language model using Good-Turing smoothing
▪ We have an estimate of the total probability mass of all unseen words
▪ We need to find each unseen word’s share of this total probability mass
▪ Uniform probability distribution?
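The Good-Turing estimate of the total unseen mass is the number of word types seen exactly once divided by the number of tokens. The sketch below computes that estimate and the uniform share floated in the last bullet; the function names and the assumed number of unseen types are hypothetical.

```python
from collections import Counter


def unseen_mass(tokens):
    """Good-Turing estimate of the total probability of unseen types: N1 / N."""
    if not tokens:
        raise ValueError("need at least one token")
    counts = Counter(tokens)
    n1 = sum(1 for c in counts.values() if c == 1)   # types seen exactly once
    return n1 / len(tokens)


def uniform_unseen_probability(tokens, unseen_types):
    """Split the unseen mass evenly over an assumed number of unseen types."""
    return unseen_mass(tokens) / unseen_types


tokens = "the cat sat on the mat".split()
mass = unseen_mass(tokens)
share = uniform_unseen_probability(tokens, unseen_types=1000)
```

A uniform split ignores how words cluster by grade, which is the motivation for borrowing mass from neighboring grades instead.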

Usage of discriminative words is clustered around particular grade levels
▪ Borrow probability mass from neighboring grade classes


The type $w$ occurs in one or more grade models (which may or may not include $G_i$)
▪ The weighting is a kernel distance function between grade levels $i$ and $k$
▪ Gaussian kernel
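One way to borrow probability mass from neighboring grades is a kernel-weighted average of a word's probability across grade models. The sketch below is an assumption about how such smoothing could look, not the exact scheme from the slides: the function names, the normalization over all grades, and the `sigma` width are all hypothetical.

```python
import math


def gaussian_kernel(i, k, sigma=1.0):
    """Weight that decays with the distance between grade levels i and k."""
    return math.exp(-((i - k) ** 2) / (2.0 * sigma ** 2))


def kernel_smoothed_probability(word, grade, grade_models, sigma=1.0):
    """Kernel-weighted average of P(word | G_k) over all grades k.

    Grades whose model lacks the word contribute zero probability.
    """
    weighted = total = 0.0
    for k, model in grade_models.items():
        weight = gaussian_kernel(grade, k, sigma)
        weighted += weight * model.get(word, 0.0)
        total += weight
    return weighted / total if total > 0 else 0.0


# Hypothetical models: "perimeter" was only observed in grade 4 text.
models = {4: {"perimeter": 0.2}, 5: {}, 6: {}}
p4 = kernel_smoothed_probability("perimeter", 4, models)
p5 = kernel_smoothed_probability("perimeter", 5, models)
p6 = kernel_smoothed_probability("perimeter", 6, models)
```

Grades closer to where the word was observed receive more of its mass, so here the estimate shrinks from grade 4 to grade 6 but never reaches zero.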

Indicators of Readability

Regression model: trained on documents with assigned readability scores, using predictor variables $p_1, p_2, \ldots, p_n$; it then predicts a readability score for a new document.

Resource: Revisiting Readability: A Unified Framework for Predicting Text Quality, Emily Pitler and Ani Nenkova

Different predictor variables indicate the readability score
▪ What is the contribution of each individual predictor variable to the readability score?

Testing methodology
▪ Collect a readability corpus
▪ Extract predictor variables
▪ Measure the correlation between <readability score, predictor variable> pairs

Measure of Correlation

Pearson product-moment correlation coefficient ($r$)
▪ Captures the relationship between two variables that are linearly related
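A minimal sketch of the Pearson coefficient, computed directly from its definition as the covariance divided by the product of the standard deviations. The function name and sample data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: a predictor that rises and one that falls with the score.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3], [3, 2, 1])
```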

Correlation Graphs

[Figure: regions of positive (+ve) and negative (-ve) correlation]

Measure of Correlation

How statistically significant is the value?

t-test for statistical significance
▪ Expressed through the p-value
▪ Computed with respect to a null hypothesis, e.g., “the use of drug X to treat disease Y is no better than not using any drug”
▪ A p-value of 0.01 signifies that there is a 1 in 100 chance that we would have seen these observations if the variables were unrelated
▪ If the p-value computed for a dataset is less than a predefined limit, the null hypothesis is rejected and the correlation is statistically significant
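For a Pearson correlation computed from paired observations, the usual t statistic for testing the null hypothesis of no correlation is shown below. The symbols $r$ (the correlation) and $n$ (the number of observation pairs) are notation assumed here; the p-value is then read from a Student's t distribution with $n - 2$ degrees of freedom.

```latex
t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^{2}}}
```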

A Study on Readability Predictor Variables

Methodology
▪ Create a readability dataset: “On a scale of 1 to 5, how well written is this text?”
▪ Identify a group of predictor variables
▪ Measure the correlation between readability scores and the value of each predictor variable
▪ Decide on the effectiveness of predictor variables based on the correlation score and p-value

Baseline Measures

Average Characters/Word
▪ Average number of characters per word

Average Words/Sentence
▪ Average number of words per sentence

Max Words/Sentence
▪ Maximum number of words per sentence

Text length

A limit is set on the p-value

Vocabulary or Language Model

Unigram model: probability of an article
▪ Word probabilities are estimated from a background corpus (Wall Street Journal and AP News corpora)
▪ Log-likelihood

This model will be biased towards shorter articles. Why?

Compensation
▪ Linear regression with the log-likelihood and the number of words in the article as predictor variables
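A sketch of the unigram article probability and its log-likelihood, assuming $M$ names the background corpus, $C(w)$ the count of word type $w$ in the article, and $P(w \mid M)$ the probability of $w$ estimated from $M$ (all notation assumed here):

```latex
P(\text{article}) = \prod_{w} P(w \mid M)^{C(w)}
\qquad
\log P(\text{article}) = \sum_{w} C(w) \log P(w \mid M)
```

Each additional token contributes a term $\log P(w \mid M) \le 0$, so longer articles accumulate lower log-likelihoods. This is one way to see the bias towards shorter articles and the reason for adding article length as a second predictor.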

Log likelihood, WSJ
▪ Article likelihood estimated from a unigram language model built from WSJ

Log likelihood, NEWS
▪ Article likelihood according to a unigram language model built from NEWS

LL with length, WSJ
▪ Linear regression of WSJ unigram log-likelihood and article length

LL with length, NEWS
▪ Linear regression of NEWS unigram log-likelihood and article length

Syntactic Features

Average parse tree height
Average number of noun phrases per sentence
Average number of verb phrases per sentence
Average number of subordinate clauses per sentence
▪ Counted as SBAR nodes in the parse tree
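Given a bracketed parse of a sentence, the features above reduce to simple counts. The sketch below assumes Penn-Treebank-style bracket strings produced by some parser (not shown); the function names are hypothetical, height is taken to be the maximum bracket nesting depth (which includes part-of-speech nodes), and labels must match exactly, so a function-tagged label such as NP-SBJ would not count as NP.

```python
def tree_height(parse: str) -> int:
    """Maximum bracket nesting depth of a bracketed parse string."""
    depth = height = 0
    for ch in parse:
        if ch == "(":
            depth += 1
            height = max(height, depth)
        elif ch == ")":
            depth -= 1
    return height


def count_label(parse: str, label: str) -> int:
    """Count constituents whose label is exactly `label` (e.g. SBAR, NP, VP)."""
    tokens = parse.replace("(", " ( ").replace(")", " ) ").split()
    return sum(1 for prev, tok in zip(tokens, tokens[1:])
               if prev == "(" and tok == label)


# Hypothetical parse of "I know that it works".
PARSE = "(S (NP (PRP I)) (VP (VBP know) (SBAR (IN that) (S (NP (PRP it)) (VP (VBZ works))))))"
height = tree_height(PARSE)
sbar_count = count_label(PARSE, "SBAR")
vp_count = count_label(PARSE, "VP")
```

Averaging these values over the sentences of a document gives the per-sentence features.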

Curious case of average verb phrases
▪ A larger number of verb phrases per sentence may increase text complexity
▪ One would therefore expect the average number of verb phrases to have a negative correlation with readability
▪ Let’s look at the following examples

It was late at night, but it was clear. The stars were out and the moon was bright. (1)

It was late at night. It was clear. The stars were out. The moon was bright. (2)

Lexical Cohesion Features

Aspects of well-written discourse
▪ Cohesive devices such as pronouns, definite descriptions, and topic continuity

Number of pronouns per sentence
Number of definite articles per sentence
Average cosine similarity
Word overlap
Word overlap over nouns and pronouns
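The average cosine similarity feature can be sketched as the mean similarity between adjacent sentences, each represented as a bag of words. Whitespace tokenization, lowercasing, raw counts as weights, and the function names are all assumptions of this example.

```python
import math
from collections import Counter


def cosine_similarity(sentence_a: str, sentence_b: str) -> float:
    """Cosine of the angle between two bag-of-words count vectors."""
    a = Counter(sentence_a.lower().split())
    b = Counter(sentence_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def average_adjacent_similarity(sentences) -> float:
    """Mean cosine similarity over consecutive sentence pairs."""
    pairs = list(zip(sentences, sentences[1:]))
    if not pairs:
        return 0.0
    return sum(cosine_similarity(x, y) for x, y in pairs) / len(pairs)


same = cosine_similarity("the moon was bright", "the moon was bright")
disjoint = cosine_similarity("stars were out", "it was late")
```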

Entity Coherence Features

Entity-based approach to local coherence
▪ Discourse coherence is achieved through the way discourse entities are introduced and discussed
▪ Some entities are more salient than others
▪ Salient entities are more likely to appear in prominent syntactic positions (such as subject or object) and to be introduced in a main clause
▪ Centering theory models the continuity of discourse

Entity-Grid discourse representation
▪ Each text is represented by an entity grid
▪ A two-dimensional array that captures the distribution of entities across the sentences of the text

Optional Resource: Modeling Local Coherence: An Entity-Based Approach, Regina Barzilay and Mirella Lapata


Entity-Grid Representation


If a noun phrase appears more than once in a sentence, we resort to ranking by grammatical role [S > O > X]
▪ Sentence 1: ‘Microsoft’ appears in both the subject (S) and other (X) categories
▪ Mark the entry for Microsoft as S

S => entity appears in a subject phrase

O => entity appears in an object phrase

X => entity appears in any other phrase

- => entity does not appear

Entity-Grid as Feature Vector

A local entity transition is a sequence $\{S, O, X, -\}^n$ that represents entity occurrences and their syntactic roles in $n$ adjacent sentences
▪ Each transition has a certain probability in a given grid

Text -> distribution defined over transition types

Feature vector
▪ Probabilities for a fixed set of transition types
▪ Each grid rendering $j$ of document $d_i$ corresponds to a feature vector $\Phi(x_{ij}) = (p_1(x_{ij}), p_2(x_{ij}), \ldots, p_m(x_{ij}))$
▪ $m$ is the number of predefined transitions
▪ $p_t(x_{ij})$ is the probability of transition $t$ in grid $x_{ij}$
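Turning a grid into a feature vector amounts to counting role transitions down each entity's column and normalizing. In the sketch below the grid contents are hypothetical, each entity maps to a string with one role per sentence, and ASCII "-" stands for "does not appear"; the function name is not from the slides.

```python
from collections import Counter
from itertools import product

ROLES = "SOX-"   # subject, object, other, absent


def transition_probabilities(grid, length=2):
    """Probability of every role transition of the given length in a grid."""
    counts = Counter()
    for roles in grid.values():
        for start in range(len(roles) - length + 1):
            counts[roles[start:start + length]] += 1
    total = sum(counts.values())
    # One entry per predefined transition type, including unseen ones.
    return {"".join(t): (counts["".join(t)] / total if total else 0.0)
            for t in product(ROLES, repeat=length)}


# Hypothetical three-sentence grid.
grid = {"Microsoft": "SO-", "market": "-X-", "suit": "--S"}
features = transition_probabilities(grid)
```

With transitions of length 2 there are 16 transition types, so every document maps to a vector of 16 probabilities regardless of its length.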

What is the Entity-Grid Good for?

Sentence ordering task
▪ Determining an optimal sequence in which to present a pre-selected set of information-bearing items
▪ Concept-to-text generation
▪ Multi-document summarization

Simpler task
▪ Rank alternative sentence orderings
▪ Which of a pair of orderings is better in terms of coherence?

Modelling the Order Ranking Task

Training set
▪ Ordered pairs of alternative renderings $(x_{ij}, x_{ik})$ of the same document $d_i$
▪ The degree of coherence of $x_{ij}$ is greater than that of $x_{ik}$

Training objective
▪ Find a parameter vector $\vec{w}$
▪ that yields a ranking score function minimizing the number of violations of the pairwise rankings provided in the training set

Modelling
▪ Support Vector Machine constraint optimization problem
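The quantity the training objective tries to minimize can be sketched as a count of mis-ordered pairs under a linear scoring function. This example only evaluates a given weight vector; it does not train an SVM. The function names, weights, and feature values are hypothetical.

```python
def ranking_score(weights, features):
    """Linear ranking score: dot product of weights and a feature vector."""
    return sum(w * f for w, f in zip(weights, features))


def count_violations(weights, ordered_pairs):
    """Count pairs where the more coherent rendering does not score higher.

    Each pair is (features_of_more_coherent, features_of_less_coherent).
    """
    return sum(1 for better, worse in ordered_pairs
               if ranking_score(weights, better) <= ranking_score(weights, worse))


# Hypothetical two-dimensional transition-probability features.
weights = [1.0, -1.0]
pairs = [
    ([0.6, 0.1], [0.2, 0.5]),   # scored 0.5 vs -0.3: ranking respected
    ([0.3, 0.4], [0.5, 0.1]),   # scored -0.1 vs 0.4: ranking violated
]
violations = count_violations(weights, pairs)
```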


Discourse Relation Features

Consider a document as a bag of discourse relations
▪ Language model defined over relations instead of words
▪ Probability of a document generated with $n$ relation tokens and $k$ relation types
▪ Log-likelihood of a document based on its discourse relations
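One plausible reading of this slide is a multinomial model over relation types. In the sketch below, $x_j$ (number of relations of type $j$ in the document) and $p_j$ (probability of type $j$, estimated from a corpus) are notation assumed here, with $n = \sum_j x_j$ relation tokens and $k$ relation types.

```latex
P(\text{document}) = \frac{n!}{x_1! \cdots x_k!} \prod_{j=1}^{k} p_j^{x_j}
\qquad
\log P(\text{document}) = \log n! - \sum_{j=1}^{k} \log x_j! + \sum_{j=1}^{k} x_j \log p_j
```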

An increase in the number of discourse relations in a document will lower the log-likelihood
▪ Use the number of relations in a document as a feature


Summary: Readability Predictor Study


Summary

200+ readability measures and still counting

Are they really looking at deeper aspects of language comprehension?

Are they tuned towards individual reading abilities?

Is the reader in the loop?

How do we comprehend sentences?
How do we store and access words?
How do we resolve ambiguities?