Text Readability
What is Readability?
A characteristic of text documents.
“the sum total of all those elements within a given piece of printed material that affect the success of a group of readers have with it. The success is the extent to which they understand it, read it at an optimal speed, and find it interesting.” (Dale & Chall, 1949)
“ease of understanding or comprehension due to the style of writing” (Klare, 1963)
Text Readability/Difficulty
Readability encompasses a number of areas…
Syntactic complexity of the text
▪ grammatical arrangement of words within a sentence (e.g., active/passive constructions have been shown to affect readability)
▪ simple / compound / complex sentences
Organization of the text
▪ discourse structure
▪ textual cohesion
Semantic complexity of the text
Why Measure Text Difficulty?
Improving literacy rates
Improving instruction delivery
Judging technical manuals
Matching text to appropriate grade level
And many more…
How to Measure?
Readability formula: assign a score to the text based on textual cues (e.g., average sentence length)
Over 200 formulas by the 1980s (DuBay, 2004)
Textual cues
▪ sentence length, percentage of familiar words, word length, syllables per word, etc.
Testing validity: correlating predicted scores with reading comprehension scores
Traditional Measures
Flesch Reading Ease score
Score = 206.835 − (1.015 × ASL) − (84.6 × ASW)
Score in [0, 100]
ASL = average sentence length (words per sentence)
ASW = average number of syllables per word
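The formula can be computed directly once ASL and ASW are estimated. A minimal sketch in Python, using a crude vowel-group heuristic for syllable counting (production tools use pronunciation dictionaries such as CMUdict):

```python
import re

def count_syllables(word: str) -> int:
    """Crude syllable estimate: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    """Score = 206.835 - 1.015*ASL - 84.6*ASW."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    asl = len(words) / len(sentences)                           # avg sentence length
    asw = sum(count_syllables(w) for w in words) / len(words)   # avg syllables/word
    return 206.835 - 1.015 * asl - 84.6 * asw
```

Note that very simple text can score above 100; the [0, 100] range is a nominal interpretation scale, not a hard bound of the formula.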
Traditional Measures
Dale-Chall Formula
Maintains a list of “easy words”
Score = 0.1579 × PDW + 0.0496 × ASL + 3.6365
▪ PDW = percentage of difficult words (words not on the easy-word list)
▪ ASL = average sentence length
FOG index Lexile scale
Commonalities among formulae
Linear regression over some predictor variables
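That shared shape — a linear combination of predictor variables plus an intercept — can be recovered by ordinary least squares. A sketch on made-up (ASL, ASW, difficulty) training data, purely for illustration:

```python
import numpy as np

# Hypothetical training data: (ASL, ASW) predictor values with
# human-assigned difficulty scores (e.g. grade levels) per passage.
X = np.array([[8.0, 1.2], [12.0, 1.4], [18.0, 1.7], [25.0, 2.1]])
y = np.array([2.0, 4.0, 7.0, 10.0])

# Add an intercept column and solve the least-squares problem --
# the same shape as the Flesch and Dale-Chall formulas.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(asl: float, asw: float) -> float:
    return coef[0] * asl + coef[1] * asw + coef[2]
```

Readability formulas differ mainly in which predictors they pick and what data the coefficients were fit on.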
Readability and Web Document
Traditional readability measures are robust for large samples (textbooks and essays) but less so for short, concise web documents.
Web documents are generally noisy
Resource: Predicting Reading Difficulty With Statistical Language Models, Kevyn Collins-Thompson and Jamie Callan
Statistical Language Model for Readability
An LM can encode more complex relationships than the simple linear regression models used in traditional readability measures
A probability distribution over all grade levels
Relative difficulty of words can be obtained statistically, rather than hard-coded as in traditional measures
Word Usage Across Grades
Earlier-grade readers tend to use more concrete words (e.g., red); later-grade readers use more abstract words (e.g., determine)
The same pattern is observed in web documents
Word Usage Statistics: Example
Unigram Model of Readability
Syntactic features are ignored
Word (semantic) feature based model
Formulated in a classification framework
For a given text passage T, predict the semantic difficulty of T relative to a specific grade level
▪ likelihood that the words of T were generated from a representative language model of that grade
Unigram Model of Readability
[Diagram: the words of a text are scored against grade language models LM_G1, LM_G2, …, LM_Gn, each yielding a difficulty score]
How is a Text Generated?
[Diagram: tokens of a text T are drawn from a distribution over word types 1…k]
LM(G_i) = { P(w_1 | G_i), P(w_2 | G_i), …, P(w_k | G_i) }
Multi-Nomial Distribution Example
“In a recent three-way election for a large country, candidate A received 20% of the votes, candidate B received 30% of the votes, and candidate C received 50% of the votes. If six voters are selected randomly, what is the probability that there will be exactly one supporter for candidate A, two supporters for candidate B and three supporters for candidate C in the sample?”
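The answer follows from the multinomial probability mass function with n = 6 trials and category probabilities (0.2, 0.3, 0.5):

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """P(X1=c1,...,Xk=ck) = n!/(c1!*...*ck!) * p1^c1 * ... * pk^ck"""
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)
    p = 1.0
    for c, q in zip(counts, probs):
        p *= q ** c
    return coef * p

# One supporter of A, two of B, three of C out of six voters:
answer = multinomial_pmf([1, 2, 3], [0.2, 0.3, 0.5])
# 6!/(1!*2!*3!) * 0.2 * 0.3^2 * 0.5^3 = 60 * 0.00225 = 0.135
```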
A Generative Model: Multi-Nomial Naïve Bayes (MNB)
Multi-nomial distribution
n independent trials
▪ each of which leads to a success for exactly one of k categories
▪ each category has a given fixed success probability
▪ probability mass function: P(x_1, …, x_k) = n! / (x_1! ⋯ x_k!) · p_1^{x_1} ⋯ p_k^{x_k}
Generative Model Assumptions
Unigram language model
A hypothetical author generates the tokens of T as follows:
Choosing a grade language model G according to a prior probability distribution
▪ “I will write for grade level 4” [explicit]
Choosing a passage length according to a probability distribution
▪ “I will write no more than 100 words” [explicit/implicit]
Sampling tokens from G’s multi-nomial word distribution
▪ “I will pick words with a certain distribution” [implicit]
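The three choices above can be sketched as a toy generator; the grade models, prior, and length range below are made up for illustration:

```python
import random

# Toy grade language models (hypothetical probabilities, not real data).
GRADE_LMS = {
    1:  {"the": 0.4, "red": 0.3, "cat": 0.3},
    12: {"the": 0.3, "perimeter": 0.35, "optimal": 0.35},
}
GRADE_PRIOR = {1: 0.5, 12: 0.5}

def generate_passage(seed=0):
    rng = random.Random(seed)
    # 1. choose a grade language model according to the prior
    grade = rng.choices(list(GRADE_PRIOR), weights=list(GRADE_PRIOR.values()))[0]
    # 2. choose a passage length
    length = rng.randint(5, 10)
    # 3. sample tokens from that grade's multinomial word distribution
    lm = GRADE_LMS[grade]
    tokens = rng.choices(list(lm), weights=list(lm.values()), k=length)
    return grade, tokens
```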
MNB Model for Readability
We need to compute P(G_i | T): the probability that T was generated from language model G_i
Bayes’ Theorem: P(G_i | T) = P(T | G_i) · P(G_i) / P(T)
MNB Model for Readability
Classification model:
G* = argmax_{G_i} [ log P(G_i) + log P(L | G_i) + Σ_t log P(w_t | G_i) ]
MNB Model for Readability
Simplified assumptions
All grades are equally likely a priori
All passage lengths are equally likely
Simplified classification model:
G* = argmax_{G_i} Σ_w C(w, T) · log P(w | G_i)
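The simplified rule — pick the grade whose language model assigns the passage the highest log-likelihood — can be sketched as follows; the word probabilities are hand-picked toy values:

```python
import math
from collections import Counter

# Toy smoothed grade language models (hypothetical probabilities);
# every word has nonzero probability in every grade, so logs are safe.
GRADE_LMS = {
    1:  {"the": 0.4, "red": 0.4, "perimeter": 0.1, "optimal": 0.1},
    12: {"the": 0.3, "red": 0.1, "perimeter": 0.3, "optimal": 0.3},
}

def predict_grade(passage: str) -> int:
    """argmax over grades of sum_w C(w, T) * log P(w | grade)."""
    counts = Counter(passage.lower().split())

    def loglik(grade):
        lm = GRADE_LMS[grade]
        return sum(c * math.log(lm[w]) for w, c in counts.items())

    return max(GRADE_LMS, key=loglik)
```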
MNB for Readability: Example
Example 1: Passage ”
MNB for Readability: Example
Example 2: Passage T “the red perimeter”
Example 2: Passage T “the perimeter was optimal”
Smoothing
What if a word does not belong to the language model for a grade level?
A zero probability would be assigned
Redistribute part of the probability mass of known words to rare and unseen words
Smoothing Model
Smooth each grade-based language model using Good-Turing smoothing
We have an estimate of the total probability mass of all unseen words
We need to find each unseen word’s share of this total probability mass
Uniform probability distribution?
Smoothing Model
Usage of discriminative words is clustered around nearby grade levels
Borrow probability mass from neighboring grade classes
Smoothing Model
The type w occurs in one or more grade models (which may or may not include grade i)
P_smoothed(w | G_i) ∝ Σ_k K(i, k) · P(w | G_k)
▪ K(i, k) is a kernel distance function between grades i and k
▪ e.g., a Gaussian kernel: K(i, k) = exp(−(i − k)² / 2σ²)
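A sketch of kernel-based smoothing across neighboring grades, assuming a Gaussian kernel over grade distance; the per-grade probabilities below are illustrative, not estimated from data:

```python
import math

def gaussian_kernel(i: int, k: int, sigma: float = 1.0) -> float:
    return math.exp(-((i - k) ** 2) / (2 * sigma ** 2))

def kernel_smooth(grade_probs: dict, sigma: float = 1.0) -> dict:
    """Blend each grade's estimate for a word with its neighbours',
    weighted by a Gaussian kernel over grade distance."""
    grades = sorted(grade_probs)
    smoothed = {}
    for i in grades:
        weights = [gaussian_kernel(i, k, sigma) for k in grades]
        total = sum(weights)
        smoothed[i] = sum(w * grade_probs[k]
                          for w, k in zip(weights, grades)) / total
    return smoothed

# A word seen only in grades 5-6 lends probability mass to grades 3-4.
p = kernel_smooth({3: 0.0, 4: 0.0, 5: 0.02, 6: 0.03})
```

Borrowing mass from neighboring grades exploits the observation above that word usage clusters around adjacent grade levels, rather than spreading unseen-word mass uniformly.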
Indicators of Readability
Regression model:
[Diagram: predictor variables p_1, p_2, …, p_n extracted from readability-score-annotated documents train a regression model, which then assigns a readability score to a new document]
Resource: Revisiting Readability: A Unified Framework for Predicting Text Quality, Emily Pitler and Ani Nenkova
Indicators of Readability
There are different predictor variables indicating the readability score
What is the contribution of each individual predictor variable to the readability score?
Testing methodology:
Collect a readability corpus → extract predictor variables → measure the correlation between ⟨readability score, predictor variable⟩ pairs
Measure of Correlation
Pearson product-moment correlation coefficient (r)
Captures the relationship between two variables that are linearly related
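Pearson’s r can be computed directly from its definition: the covariance of the two variables divided by the product of their standard deviations.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

r ranges from −1 (perfect negative linear relationship) through 0 (no linear relationship) to +1 (perfect positive linear relationship).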
Correlation Graphs
[Scatter plots illustrating positive (+ve) and negative (−ve) correlations]
Measure of Correlation
How statistically significant is the r value?
t-test for statistical significance
▪ expressed through a p-value
▪ computed against a null hypothesis
“the use of drug X to treat disease Y is no better than not using any drug”
▪ a p-value of 0.001 signifies
▪ there is a 1 in 1000 chance that we would have seen these observations if the variables were unrelated
▪ if the p-value computed for a dataset is less than a predefined limit (say 0.05), the null hypothesis is rejected
▪ the correlation is statistically significant
A Study on Readability Predictor Variables
Methodology
Create a readability dataset
▪ “On a scale of 1 to 5, how well written is this text?”
Identify a group of predictor variables
Measure the correlation between readability scores and the values of each predictor variable
Decide on the effectiveness of predictor variables based on the correlation score and p-value
Baseline Measures
Average characters/word: the average number of characters per word
Average words/sentence: the average number of words per sentence
Max words/sentence: the maximum number of words per sentence
Text length
Vocabulary or Language Model
Unigram model: probability of an article T given a background corpus C
▪ Wall Street Journal and AP News corpora
Log-likelihood: log P(T | C) = Σ_w C(w, T) · log P(w | C)
This model will be biased towards shorter articles. Why? Every additional token contributes a log-probability less than zero, so longer articles always receive lower log-likelihoods.
Compensation: linear regression with the log-likelihood and the number of words in the article as predictor variables
Vocabulary or Language Model
Log likelihood, WSJ: article likelihood estimated from a unigram language model built from WSJ
Log likelihood, NEWS: article likelihood according to a unigram language model built from NEWS
LL with length, WSJ: linear regression of the WSJ unigram likelihood and article length
LL with length, NEWS: linear regression of the NEWS unigram likelihood and article length
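The length bias is easy to demonstrate: since each token contributes a negative log-probability, adding tokens can only lower the total. A toy background model (hypothetical probabilities):

```python
import math

# Toy background unigram model (made-up probabilities).
BACKGROUND = {"the": 0.3, "market": 0.2, "fell": 0.2,
              "sharply": 0.15, "today": 0.15}

def loglik(tokens):
    """Sum of log-probabilities under the background model."""
    return sum(math.log(BACKGROUND[t]) for t in tokens)

short = ["the", "market", "fell"]
long_ = short + ["sharply", "today", "the", "market"]
# Each extra token adds log p < 0, so the longer article scores lower,
# regardless of how well-written it is -- hence the length regression fix.
```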
Syntactic Features
Average parse tree height
Average number of noun phrases per sentence
Average number of verb phrases per sentence
Average number of subordinate clauses per sentence
▪ counting SBAR nodes in the parse tree
Syntactic Features
The curious case of average verb phrases
The number of verb phrases per sentence might be expected to increase text complexity
▪ average verb phrases should then correlate negatively with readability
Let’s look at the following examples
It was late at night, but it was clear. The stars were out and the moon was bright. (1)
It was late at night. It was clear. The stars were out. The moon was bright. (2)
Lexical Cohesion Features
Aspects of well-written discourse
Cohesive devices such as pronouns, definite descriptions, and topic continuity
Number of pronouns per sentence
Number of definite articles per sentence
Average cosine similarity
Word overlap
Word overlap over nouns and pronouns
Entity Coherence Features
Entity-based approach towards local coherence
Discourse coherence is achieved in view of the way discourse entities are introduced and discussed
Some entities are more salient than others
▪ salient entities are more likely to appear in prominent syntactic positions (such as subject or object), and to be introduced in a main clause
▪ Centering Theory models the continuity of discourse
Entity Coherence Features
Entity-Grid discourse representation
Each text is represented by an entity grid
▪ a two-dimensional array that captures the distribution of entities across the sentences of the text
Optional Resource: Modeling Local Coherence: An Entity-Based Approach, Regina Barzilay and Mirella Lapata
Entity-Grid Representation
Entity-Grid Representation
If a noun phrase appears more than once in a sentence, we resort to a grammatical-role-based ranking [S > O > X]
-- Sentence 1: ‘Microsoft’ appears in the subject (S) and “rest” (X) categories
-- Mark the entry for Microsoft as S
S => entity appears in a subject phrase
O => entity appears in an object phrase
X => entity appears in any other phrase
– => entity does not appear
Entity-Grid as Feature Vector
A local entity transition is a sequence of syntactic roles {S, O, X, –} representing an entity’s occurrences in adjacent sentences
Each transition has a certain probability given a grid
Text → a distribution defined over transition types
Entity-Grid as Feature Vector
Feature vector: probability counts for a fixed set of transition types in each grid rendering of a document
▪ m is the number of predefined transitions
▪ p_t(d) is the probability of transition t in the grid of document d
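A sketch of extracting length-2 transition probabilities from a toy grid; the sentences and entity roles below are hand-assigned for illustration rather than produced by a parser:

```python
from collections import Counter
from itertools import product

ROLES = ["S", "O", "X", "-"]

def transition_features(grid):
    """grid: list of sentences, each a dict entity -> role (S/O/X).
    Returns the probability of each length-2 role transition."""
    entities = {e for sent in grid for e in sent}
    counts = Counter()
    for ent in entities:
        roles = [sent.get(ent, "-") for sent in grid]   # column for this entity
        for a, b in zip(roles, roles[1:]):              # adjacent-sentence pairs
            counts[(a, b)] += 1
    total = sum(counts.values())
    return {t: counts[t] / total for t in product(ROLES, repeat=2)}

# Sentence 1: 'Microsoft' as subject, 'trial' as object.
# Sentence 2: 'Microsoft' again as subject, 'suit' as object.
grid = [{"Microsoft": "S", "trial": "O"},
        {"Microsoft": "S", "suit": "O"}]
feats = transition_features(grid)
```

Here ‘Microsoft’ contributes an S→S transition (a salient, continued entity), while ‘trial’ and ‘suit’ contribute O→– and –→O transitions.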
What Entity-Grid is Good for?
Sentence ordering task: determining an optimal sequence in which to present a pre-selected set of information-bearing items
▪ concept-to-text generation
▪ multi-document summarization
A simpler task
▪ rank alternative sentence orderings
▪ which of a pair of orderings is better in terms of coherence?
Modelling the Order Ranking Task
Training set
▪ ordered pairs of alternative renderings of the same document
▪ where the degree of coherence of the first is greater than that of the second
Training objective
▪ find a parameter vector
▪ that yields a ranking score function minimizing the number of violations of the pairwise rankings provided in the training set
Modelling
▪ Support Vector Machine: constraint optimization problem
Entity Coherence Features
Discourse Relation Features
Consider a document as a bag of discourse relations
A language model defined over relations instead of words
Probability of a document generated with a given number of relation tokens and relation types
Log-likelihood of a document based on its discourse relations
Discourse Relation Features
An increase in the number of discourse relations in a document will lower its log-likelihood
Number of relations in a document as a feature
Summary: Readability Predictor Study
Summary
200+ readability measures and still counting
Are they really looking at deeper aspects of language comprehension?
Are they tuned towards individual reading abilities?
Is reader in the loop?
How do we comprehend sentences? How do we store and access words? How do we resolve ambiguities?