Classification Errors in a Domain-Independent Assessment System

Rodney D. Nielsen1,2, Wayne Ward1,2 and James H. Martin1

1 Center for Computational Language and Education Research, CU, Boulder

2 Boulder Language Technologies

Question: A harp has strings of different lengths. Describe how the sound of a longer string differs from the sound of a shorter string.

Reference Answer: A long string produces a low pitch. (Lawrence Hall of Science 2006, Assessing Science Knowledge)

Learner Answer: When the string gets longer it makes the pitch lower.

ACL-BEA, Jun 19, 2008, Rodney D. Nielsen

Tailoring the Tutor’s Response

Question: A harp has strings of different lengths. Describe how the sound of a longer string differs from the sound of a shorter string.

Reference answer: A long string produces a low pitch.

Learner answers:
  When the string gets longer it makes the pitch lower.
  A long string produces a pitch.
  It makes a loud pitch.
  It makes a high pitch.
  If the string is tighter, the pitch is higher.


Necessity of Finer-Grained Analysis

Imagine a tutor knowing only that there is some unspecified part of the reference answer that we are not sure the student understands.

Reference Answer: A long string produces a low pitch.

Break the reference answer down into low-level facets derived from a dependency parse and thematic roles:
  NMod(string, long): The string is long.
  Agent(produces, string): A string is producing something.
  Product(produces, pitch): A pitch is being produced.
  NMod(pitch, low): The pitch is low.

Assess whether an understanding of each facet is implied by the student’s response.

[Figure: dependency parse of "A long string produces a low pitch." with arcs labeled det, nmod, subject, and object]

Follow-up Question: Does a long string produce a higher or lower pitch?
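The facet notation on this slide, such as NMod(string, long), can be read as a small data structure. The sketch below is only an illustration: the `Facet` class, the `parse_facet` helper, and the convention that the first argument is the governor and the second the modifier are assumptions made here, not the authors' code.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Facet:
    """One reference answer facet (hypothetical representation)."""
    relation: str  # dependency or thematic-role label, e.g. "NMod"
    governor: str  # first argument (assumed to be the governor)
    modifier: str  # second argument (assumed to be the modifier)


# Matches strings of the form Relation(arg1, arg2).
_FACET_RE = re.compile(r"^\s*(\w+)\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)\s*$")


def parse_facet(text: str) -> Facet:
    """Parse the slide's notation into a Facet, or raise ValueError."""
    match = _FACET_RE.match(text)
    if match is None:
        raise ValueError(f"not a facet: {text!r}")
    return Facet(*match.groups())


# The four facets of "A long string produces a low pitch."
REFERENCE_FACETS = [
    parse_facet(s)
    for s in (
        "NMod(string, long)",
        "Agent(produces, string)",
        "Product(produces, pitch)",
        "NMod(pitch, low)",
    )
]
```

Keeping each facet as a separate record is what lets a tutor ask about one specific facet, such as NMod(pitch, low), rather than the whole reference answer.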


Representing Fine-Grained Semantics

Assess the relationship between the student’s answer and the reference answer facets at a finer grain.

Reference Ans: A long string produces a low pitch.

Learner answer                    NMod(string, long)   Agent(produces, string)   Product(produces, pitch)   NMod(pitch, low)
A long string produces a pitch.   Expressed            Expressed                 Expressed                  Unaddressed
  (understood: yes/no)            Yes                  Yes                       Yes                        No
It produces a loud pitch.         Assumed              Expressed                 Expressed                  Different Argument
It produces a high pitch.         Assumed              Expressed                 Expressed                  Contradiction Expressed
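The per-facet labels on this slide can drive a tailored tutor response: all three learner answers miss the same facet, but for different reasons. A minimal sketch follows, assuming the short label names Diff-Arg and Contra-Expr stand for "Different Argument" and "Contradiction Expressed"; the names `ANNOTATIONS` and `facets_needing_attention` are hypothetical.

```python
# Reference answer facets, in the column order used on the slide.
FACETS = [
    "NMod(string, long)",
    "Agent(produces, string)",
    "Product(produces, pitch)",
    "NMod(pitch, low)",
]

# Labels that count as the learner understanding a facet.
UNDERSTOOD = {"Assumed", "Expressed", "Inferred"}

# One label per facet for each learner answer on the slide.
ANNOTATIONS = {
    "A long string produces a pitch.":
        ["Expressed", "Expressed", "Expressed", "Unaddressed"],
    "It produces a loud pitch.":
        ["Assumed", "Expressed", "Expressed", "Diff-Arg"],
    "It produces a high pitch.":
        ["Assumed", "Expressed", "Expressed", "Contra-Expr"],
}


def facets_needing_attention(labels):
    """Return (facet, label) pairs for facets that are not understood."""
    return [(f, l) for f, l in zip(FACETS, labels) if l not in UNDERSTOOD]


for answer, labels in ANNOTATIONS.items():
    print(answer, "->", facets_needing_attention(labels))
```

A tutor could then prompt for a missing property in the Unaddressed case and address a misconception directly in the Contra-Expr case.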


Answer Annotation Labels

Understood: Facets that are understood by the student
  Assumed: Assumed to be understood a priori based on the question
  Expressed: Directly expressed or inferred by simple reasoning
  Inferred: Inferred by pragmatics or nontrivial logical reasoning
Contradicted: Facets contradicted by the learner answer
  Contra-Expr: Directly contradicted by negation, antonymous expressions and their paraphrases
  Contra-Infr: Contradicted by pragmatics or complex reasoning
Self-Contra: Facets that are both contradicted and implied (self contradictions)
Diff-Arg: The core relation is expressed, but it has a different modifier or argument
Unaddressed: Facets that are not addressed at all by the student’s answer
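The label hierarchy above can be collapsed into coarser categories with a simple lookup. The sketch below assumes that the five Tutor-Labels mentioned later in the deck are the five top-level categories on this slide; the names `TUTOR_LABEL` and `to_tutor_label` are made up for illustration.

```python
# Fine-grained annotation label -> assumed coarse Tutor-Label.
TUTOR_LABEL = {
    "Assumed": "Understood",
    "Expressed": "Understood",
    "Inferred": "Understood",
    "Contra-Expr": "Contradicted",
    "Contra-Infr": "Contradicted",
    "Self-Contra": "Self-Contra",
    "Diff-Arg": "Diff-Arg",
    "Unaddressed": "Unaddressed",
}


def to_tutor_label(fine_label: str) -> str:
    """Map a fine-grained label to its top-level category."""
    try:
        return TUTOR_LABEL[fine_label]
    except KeyError:
        raise ValueError(f"unknown annotation label: {fine_label!r}") from None


print(sorted(set(TUTOR_LABEL.values())))
```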


Assessment Technology Overview

Start with hand-generated reference answer facets
Automatically parse reference & learner answer and automatically extract representation
Extract a feature vector for each reference answer (RA) facet indicative of the student’s understanding of that facet
  From answers, their automatic parses, the relations between these, and external corpus co-occurrence statistics
Train a machine learning classifier on the training set feature vectors
Use classifier to assess the test set answers, assigning one of five Tutor-Labels for each RA facet


Machine Learning Features

Lexical
  The lexical entailment probabilities for the reference answer facet’s governor and modifier, following Glickman, Dagan and Koppel (2005); see also Turney (2001)
  Indicators of whether the reference answer governor’s (modifier’s) stem has an exact match in the learner answer
  The lexical entailment probabilities for the primary constituent facets’ governors and modifiers when the facet in question represents a relation between propositions

Syntactic
  The part of speech (POS) tags for the facet’s governor and modifier
  The dependency or role type labels of the facet and the aligned learner answer dependency
  The edit distance between the dependency path connecting the facet’s governor and modifier and the path connecting the aligned terms in the learner answer

Other
  True if the facet has a negation and the aligned learner dependency path has a single negation, or if neither has a negation
  The number of content words in the reference answer, motivated by the fact that longer answers were more likely to result in spurious alignments
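Three of the features above are simple enough to sketch. The code below is an illustration under stated assumptions, not the system's implementation: the entailment estimate is a document co-occurrence ratio in the spirit of Glickman, Dagan and Koppel (2005), the edit distance is plain Levenshtein distance with unit costs over path labels, and all function names are hypothetical.

```python
from typing import Iterable, Sequence, Set


def entailment_prob(ref_word: str, learner_words: Iterable[str],
                    docs: Sequence[Set[str]]) -> float:
    """Max over learner words v of n(ref_word, v) / n(v), counted over docs.

    Each doc is the set of words in one corpus document (assumed input).
    """
    best = 0.0
    for v in learner_words:
        n_v = sum(1 for d in docs if v in d)
        if n_v == 0:
            continue  # v never occurs, so it gives no evidence
        n_uv = sum(1 for d in docs if v in d and ref_word in d)
        best = max(best, n_uv / n_v)
    return best


def path_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences of dependency labels."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute
        prev = cur
    return prev[-1]


def consistent_negation(facet_negated: bool, learner_negations: int) -> bool:
    """True if the facet is negated and the learner path has exactly one
    negation, or if neither has a negation."""
    if facet_negated:
        return learner_negations == 1
    return learner_negations == 0
```

The document sets passed to `entailment_prob` stand in for the external corpus co-occurrence statistics; any real corpus and tokenization would be a separate design choice.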


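Results (C4.5 decision tree)

Results on Tutor-Labels for Unseen Answers and Unseen Modules, respectively (differences between the All Features column and each baseline column):
  24.4 and 15.4 percentage points over the majority class baseline
  19.4 and 5.9 percentage points over the lexical baseline

Data set             # non-Assumed facets   Majority Class   Lexical Baseline   All Features
Training Set 10xCV   54,967                 54.6             59.7               77.1
Unseen Answers       3,159                  51.1             56.1               75.5
Unseen Modules       30,514                 53.4             62.9               68.8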
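The headline improvements are the differences between the All Features column and each baseline column. A short check using only the numbers on this slide (the names `RESULTS` and `gains` are hypothetical):

```python
# (majority class, lexical baseline, all features) per data set.
RESULTS = {
    "Training Set 10xCV": (54.6, 59.7, 77.1),
    "Unseen Answers": (51.1, 56.1, 75.5),
    "Unseen Modules": (53.4, 62.9, 68.8),
}


def gains(row):
    """Return (gain over majority class, gain over lexical baseline)."""
    majority, lexical, full = row
    # Round to one decimal place to hide floating-point noise.
    return round(full - majority, 1), round(full - lexical, 1)


for name, row in RESULTS.items():
    print(name, gains(row))
```

For Unseen Answers this gives 24.4 and 19.4, and for Unseen Modules 15.4 and 5.9, matching the figures quoted above.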


Error Analysis of Domain-Independent Assessment

Leave-one-module-out cross-validation on the 13 training set science modules
  Train on 12 modules, test on the held-out module; do this for each of the 13 modules
  Simulates the Unseen Modules (domain-independent) test set
  Trained and tested on all non-Assumed facets
Analyzed a random subset of the errors
  100 Expressed and 100 Unaddressed
  Consistently annotated by all annotators
Consider the factors involved in the decision by humans
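The leave-one-module-out protocol can be sketched as a generator of train/test splits. This is a minimal illustration: the `modules` mapping from module name to a list of examples is an assumed data layout, and the classifier itself is left out.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_module_out(
    modules: Dict[str, List],
) -> Iterator[Tuple[str, List, List]]:
    """Yield (held_out_name, train_examples, test_examples) per module."""
    for held_out in modules:
        # Train on every example from all the other modules.
        train = [
            example
            for name, examples in modules.items()
            if name != held_out
            for example in examples
        ]
        yield held_out, train, list(modules[held_out])


# Toy data: 13 modules with 3 placeholder examples each.
toy = {f"module_{i}": [(i, j) for j in range(3)] for i in range(13)}
for name, train, test in leave_one_module_out(toy):
    print(name, len(train), len(test))
```

Because no example from the held-out module is seen in training, each fold approximates assessing answers from a science domain the classifier was never trained on.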


Errors in Expressed Facets

Four main error factors by frequency:
  72% Paraphrases
    43% Phrase-based paraphrasing
    35% Lexical substitution
    26% Coreference
    1% Syntactic alternation (Vanderwende et al. 2005)
  22% Logical inference
  22% Pragmatics
  6% Preprocessing errors


Errors in Expressed Facets

43% Phrase-based paraphrasing
  32 typical paraphrase occurrences
    "in the middle" versus "halfway between"
    "one mineral will leave a scratch" versus "one will scratch the other"
  14 uses of concept definitions
    "circuit" versus "electrical pathway"
  6 negations of antonyms
    "not a lot" for "a little"
    "no one has the same fingerprint" for "everyone has a different print"


Errors in Expressed Facets

35% Lexical substitution
  Synonymy, hypernymy, hyponymy, meronymy, derivational changes, and other lexical paraphrases
  Half detectable by a broad-coverage resource
    "tiny" for "small", "CO2" for "gas", "put" for "place", "pen" for "ink", and "push" for "carry"
  Many not easily detectable in lexical resources
    "put the pennies" for "distribute the pennies", and "have" for "contain"


Errors in Expressed Facets

26% Coreference resolution
  15 pronouns (11 it, 3 she, 1 one)
  6 NP term substitutions
    Ref Ans: clay particles are light
    Learner Ans: clay is the lightest
  6 other common noun coreference issues


Errors in Expressed Facets

22% Logical inference
  no, cup 1 would be a plastic cup 25 ml water and cup 2 paper cup 25 ml and 10 g sugar
    => the two cups have the same amount of water
  … it is easy to discriminate…
    => the two sounds are very different
22% Pragmatics
  Because the vibrations
    => the rubberband is vibrating
  … the fulcrum is too close to the earth
    => the earth is the load in the system


Errors in Expressed Facets

6% Preprocessing errors
  Normalization issues
  Parser errors


Errors in Expressed Facets

Over half of the errors involved more than one of the fine-grained factors
  There is a shadow there because the sun is behind it and light cannot go through solid objects. Note, I think that question was kind of dumb.
    => the tree blocks the light


Errors in Unaddressed Facets

Many are questionable annotations
  You could take a couple of cardboard houses and … 1 with thick glazed insulation. …
    =/> installing the insulation in the houses
  Because the darker the color the faster it will heat up
    =/> darkest color


Errors in Unaddressed Facets

Biggest source of error: lexical similarity
  Ignorance of context
    [the electromagnet] has to be iron…
      => steel is made from iron
  Antonyms
    closer versus greater distance, and absorbs energy versus reflects energy
  Misguided trust
    I learned it in class


Conclusion

New assessment paradigm
  Fine-grained facets and labels
  Corpus of 146K fine-grained inference annotations
Answer assessment system
  24.4 and 15.4 percentage points over the majority class baseline for in-domain and out-of-domain, respectively
  First successful assessment of Grade 3-6 constructed responses
Error analysis provides insight into where future work is most appropriate


Thanks!

We are grateful to the anonymous reviewers, whose comments improved the paper, and the Lawrence Hall of Science for the data.

This work was partially funded by Award Numbers: NSF 0551723, IES R305B070434, and NSF DRL-0733323.
