Classification Errors in a Domain-Independent Assessment System
Rodney D. Nielsen (1,2), Wayne Ward (1,2) and James H. Martin (1)
(1) Center for Computational Language and Education Research, CU, Boulder
(2) Boulder Language Technologies
Question: A harp has strings of different lengths. Describe how the sound of a longer string differs from the sound of a shorter string.
Reference Answer: A long string produces a low pitch. (Lawrence Hall of Science 2006, Assessing Science Knowledge)
Learner Answer: When the string gets longer it makes the pitch lower.
ACL-BEA, Jun 19, 2008, Rodney D. Nielsen 2
Tailoring the Tutor’s Response
Question: A harp has strings of different lengths. Describe how the sound of a longer string differs from the sound of a shorter string.
Reference answer: A long string produces a low pitch.
Learner answers:
  When the string gets longer it makes the pitch lower.
  A long string produces a pitch.
  It makes a loud pitch.
  It makes a high pitch.
  If the string is tighter, the pitch is higher.
Necessity of Finer-Grained Analysis
Imagine a tutor knowing only that there is some unspecified part of the reference answer that we are not sure the student understands.
Reference Answer: A long string produces a low pitch.
Break the reference answer down into low-level facets derived from a dependency parse and thematic roles:
  NMod(string, long): The string is long.
  Agent(produces, string): A string is producing something.
  Product(produces, pitch): A pitch is being produced.
  NMod(pitch, low): The pitch is low.
[Figure: dependency parse of "A long string produces a low pitch." with arc labels det, nmod, subject and object]
Assess whether an understanding of each facet is implied by the student’s response.
Follow-up Question: Does a long string produce a higher or lower pitch?
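The facet decomposition above can be held in a small data structure. The following is a minimal sketch, not the authors' implementation: the `Facet` class and its field names (`relation`, `governor`, `modifier`, `gloss`) are hypothetical, and the four facets are entered by hand, as the slide lists them, rather than extracted from a parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Facet:
    """One reference-answer facet, written relation(governor, modifier).

    Hypothetical structure for illustration only.
    """
    relation: str
    governor: str
    modifier: str
    gloss: str

    def __str__(self) -> str:
        return f"{self.relation}({self.governor}, {self.modifier})"


REFERENCE_ANSWER = "A long string produces a low pitch."

# Facets copied from the slide; they are hand-listed here, not parsed.
FACETS = [
    Facet("NMod", "string", "long", "The string is long."),
    Facet("Agent", "produces", "string", "A string is producing something."),
    Facet("Product", "produces", "pitch", "A pitch is being produced."),
    Facet("NMod", "pitch", "low", "The pitch is low."),
]

if __name__ == "__main__":
    for facet in FACETS:
        print(f"{facet}: {facet.gloss}")
```

Keeping each facet as its own record is what lets a tutor ask about one specific piece, such as NMod(pitch, low), instead of the whole answer.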
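Representing Fine-Grained Semantics
Assess the relationship between the student’s answer and the reference answer facets at a finer grain.
Reference Ans: A long string produces a low pitch.
Facets: NMod(string, long), Agent(produces, string), Product(produces, pitch), NMod(pitch, low)
Learner answer: A long string produces a pitch.
  NMod(string, long): Expressed
  Agent(produces, string): Expressed
  Product(produces, pitch): Expressed
  NMod(pitch, low): Unaddressed
  Yes/No view of the same facets: Yes, Yes, Yes, No
Learner answer: It produces a loud pitch.
  NMod(string, long): Assumed
  Agent(produces, string): Expressed
  Product(produces, pitch): Expressed
  NMod(pitch, low): Different Argument
Learner answer: It produces a high pitch.
  NMod(string, long): Assumed
  Agent(produces, string): Expressed
  Product(produces, pitch): Expressed
  NMod(pitch, low): Contradiction Expressed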
Answer Annotation Labels
Understood: Facets that are understood by the student
  Assumed: Assumed to be understood a priori based on the question
  Expressed: Directly expressed or inferred by simple reasoning
  Inferred: Inferred by pragmatics or nontrivial logical reasoning
Contradicted: Facets contradicted by the learner answer
  Contra-Expr: Directly contradicted by negation, antonymous expressions and their paraphrases
  Contra-Infr: Contradicted by pragmatics or complex reasoning
Self-Contra: Facets that are both contradicted and implied (self contradictions)
Diff-Arg: The core relation is expressed, but it has a different modifier or argument
Unaddressed: Facets that are not addressed at all by the student’s answer
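The label hierarchy above can be written as a lookup from fine-grained annotation labels to their top-level categories. This is a sketch under two assumptions: that the sub-labels group under Understood and Contradicted as the slide layout suggests, and that the five top-level categories are the coarse labels a tutor would act on. The names `TutorLabel`, `FINE_TO_TUTOR` and `to_tutor_label` are hypothetical.

```python
from enum import Enum


class TutorLabel(Enum):
    """Five top-level categories (assumed to be the coarse labels)."""
    UNDERSTOOD = "Understood"
    CONTRADICTED = "Contradicted"
    SELF_CONTRA = "Self-Contra"
    DIFF_ARG = "Diff-Arg"
    UNADDRESSED = "Unaddressed"


# Fine-grained annotation label -> top-level category.
# The grouping is read off the slide's layout and is an assumption.
FINE_TO_TUTOR = {
    "Assumed": TutorLabel.UNDERSTOOD,
    "Expressed": TutorLabel.UNDERSTOOD,
    "Inferred": TutorLabel.UNDERSTOOD,
    "Contra-Expr": TutorLabel.CONTRADICTED,
    "Contra-Infr": TutorLabel.CONTRADICTED,
    "Self-Contra": TutorLabel.SELF_CONTRA,
    "Diff-Arg": TutorLabel.DIFF_ARG,
    "Unaddressed": TutorLabel.UNADDRESSED,
}


def to_tutor_label(fine_label: str) -> TutorLabel:
    """Collapse a fine-grained label; raises KeyError on unknown labels."""
    return FINE_TO_TUTOR[fine_label]


if __name__ == "__main__":
    for fine in FINE_TO_TUTOR:
        print(f"{fine} -> {to_tutor_label(fine).value}")
```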
Assessment Technology Overview
Start with hand-generated reference answer facets
Automatically parse reference & learner answer and automatically extract representation
Extract a feature vector for each reference answer (RA) facet indicative of the student’s understanding of that facet
  From answers, their automatic parses, the relations between these, and external corpus co-occurrence statistics
Train a machine learning classifier on the training set feature vectors
Use classifier to assess the test set answers, assigning one of five Tutor-Labels for each RA facet
Machine Learning Features
Lexical
  The lexical entailment probabilities for the reference answer facet’s governor and modifier, following Glickman, Dagan and Koppel (2005); see also Turney (2001)
  Indicators of whether the reference answer governor’s (modifier’s) stem has an exact match in the learner answer
  The lexical entailment probabilities for the primary constituent facets’ governors and modifiers when the facet in question represents a relation between propositions
Syntactic
  The part of speech (POS) tags for the facet’s governor and modifier
  The dependency or role type labels of the facet and the aligned learner answer dependency
  The edit distance between the dependency path connecting the facet’s governor and modifier and the path connecting the aligned terms in the learner answer
Other
  True if the facet has a negation and the aligned learner dependency path has a single negation, or if neither has a negation
  The number of content words in the reference answer, motivated by the fact that longer answers were more likely to result in spurious alignments
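Three of the features listed above are simple enough to sketch directly. The code below is an illustration under stated assumptions, not the system's implementation. The entailment estimate is a document co-occurrence ratio in the spirit of Glickman, Dagan and Koppel (2005), maximized over the learner answer's words; the exact estimator the system used is not given on the slide. The function names, the toy corpus and the path labels are all hypothetical.

```python
from typing import List, Sequence, Set


def lexical_entailment_prob(term: str, answer_words: Sequence[str],
                            corpus_docs: Sequence[Set[str]]) -> float:
    """Max over answer words v of n(term, v) / n(v), counted in documents."""
    best = 0.0
    for v in answer_words:
        n_v = sum(1 for doc in corpus_docs if v in doc)
        if n_v == 0:
            continue  # v never occurs, so it contributes no evidence
        n_uv = sum(1 for doc in corpus_docs if v in doc and term in doc)
        best = max(best, n_uv / n_v)
    return best


def path_edit_distance(path_a: Sequence[str], path_b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences of dependency labels."""
    prev: List[int] = list(range(len(path_b) + 1))
    for i, a in enumerate(path_a, start=1):
        curr = [i]
        for j, b in enumerate(path_b, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a
                            curr[j - 1] + 1,      # insert b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def negation_agreement(facet_negated: bool, learner_negations: int) -> bool:
    """True if the facet is negated and the learner path has exactly one
    negation, or if neither has a negation."""
    if facet_negated:
        return learner_negations == 1
    return learner_negations == 0


# Toy corpus: each document is a set of words (made up for illustration).
TOY_CORPUS = [
    {"low", "pitch", "deep"},
    {"lower", "pitch"},
    {"low", "lower", "pitch"},
    {"loud", "volume"},
]

if __name__ == "__main__":
    print(lexical_entailment_prob("low", ["lower", "pitch"], TOY_CORPUS))
    print(path_edit_distance(["subject", "object"],
                             ["subject", "nmod", "object"]))
    print(negation_agreement(True, 1))
```

In the toy corpus, "pitch" occurs in three documents and co-occurs with "low" in two of them, so the estimate for "low" is 2/3.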
Results (C4.5 decision tree)
Results on Tutor-Labels (accuracy gains on Unseen Answers and Unseen Modules, respectively):
  24.4 and 15.4% over majority class baseline
  19.4 and 5.9% over lexical baseline

                      # non-Assumed facets   Majority Class   Lexical Baseline   All Features
Training Set 10xCV    54,967                 54.6             59.7               77.1
Unseen Answers        3,159                  51.1             56.1               75.5
Unseen Modules        30,514                 53.4             62.9               68.8
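The headline improvements can be checked against the table by subtracting each baseline from the All Features column. The short sketch below does that arithmetic; the dictionary layout and the `improvement` helper are hypothetical names introduced only for this check, and the numbers are copied from the table.

```python
# Accuracy (%) per test set, copied from the results table.
RESULTS = {
    "Training Set 10xCV": {"majority": 54.6, "lexical": 59.7, "all": 77.1},
    "Unseen Answers": {"majority": 51.1, "lexical": 56.1, "all": 75.5},
    "Unseen Modules": {"majority": 53.4, "lexical": 62.9, "all": 68.8},
}


def improvement(test_set: str, baseline: str) -> float:
    """Absolute gain of the full feature set over a baseline, in points."""
    row = RESULTS[test_set]
    return round(row["all"] - row[baseline], 1)


if __name__ == "__main__":
    for test_set in ("Unseen Answers", "Unseen Modules"):
        for baseline in ("majority", "lexical"):
            print(test_set, baseline, improvement(test_set, baseline))
```

The differences are absolute percentage points: 75.5 - 51.1 = 24.4 and 68.8 - 53.4 = 15.4 over the majority class baseline, and 75.5 - 56.1 = 19.4 and 68.8 - 62.9 = 5.9 over the lexical baseline.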
Error Analysis of Domain-Independent Assessment
Leave-one-module-out cross-validation on the 13 training set science modules
  Train on 12 modules and test on the held-out module; do this for each of the 13 modules
  Simulates the Unseen Modules (domain-independent) test set
  Trained and tested on all non-Assumed facets
Analyzed a random subset of the errors
  100 Expressed and 100 Unaddressed
  Consistently annotated by all annotators
Consider the factors involved in the decision by humans
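The leave-one-module-out procedure described above can be sketched as a generator that holds out one module at a time and drops Assumed facets from both sides. This is a minimal illustration, assuming each example is a dictionary with a `label` key; the function name, the data layout and the three toy modules are hypothetical (the actual setup used 13 science modules).

```python
from typing import Dict, Iterator, List, Tuple

Example = Dict[str, str]

# Toy stand-in for the per-module facet annotations (made up).
TOY_DATA: Dict[str, List[Example]] = {
    "module_a": [{"facet": "f1", "label": "Expressed"},
                 {"facet": "f2", "label": "Assumed"}],
    "module_b": [{"facet": "f3", "label": "Unaddressed"}],
    "module_c": [{"facet": "f4", "label": "Expressed"},
                 {"facet": "f5", "label": "Unaddressed"}],
}


def leave_one_module_out(
    data: Dict[str, List[Example]],
) -> Iterator[Tuple[str, List[Example], List[Example]]]:
    """Yield (held_out_module, train, test), skipping Assumed facets."""
    for held_out in sorted(data):
        train = [ex
                 for module, examples in data.items()
                 if module != held_out
                 for ex in examples
                 if ex["label"] != "Assumed"]
        test = [ex for ex in data[held_out] if ex["label"] != "Assumed"]
        yield held_out, train, test


if __name__ == "__main__":
    for held_out, train, test in leave_one_module_out(TOY_DATA):
        print(held_out, len(train), len(test))
```

Because no example from the held-out module ever reaches the training side, each fold approximates assessing answers from a science domain the classifier has not seen.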
Errors in Expressed Facets
Four main error factors by frequency:
72% Paraphrases
  43% Phrase-based paraphrasing
  35% Lexical substitution
  26% Coreference
  1% Syntactic alternation (Vanderwende et al. 2005)
22% Logical Inference
22% Pragmatics
6% Preprocessing Errors
Errors in Expressed Facets
43% Phrase-based paraphrasing
32 typical paraphrase occurrences
  in the middle versus halfway between
  one mineral will leave a scratch versus one will scratch the other
14 uses of concept definitions
  circuit versus electrical pathway
6 negations of antonyms
  not a lot for a little
  no one has the same fingerprint for everyone has a different print
Errors in Expressed Facets
35% Lexical substitution
Synonymy, hypernymy, hyponymy, meronymy, derivational changes, and other lexical paraphrases
Half detectable by a broad-coverage resource
  tiny for small, CO2 for gas, put for place, pen for ink and push for carry
Many not easily detectable in lexical resources
  put the pennies for distribute the pennies, and have for contain
Errors in Expressed Facets
26% Coreference Resolution
15 pronouns (11 it, 3 she, 1 one)
6 NP term substitutions
  Ref Ans: clay particles are light
  Learner Ans: clay is the lightest
6 other common noun coreference issues
Errors in Expressed Facets
22% Logical inference
  no, cup 1 would be a plastic cup 25 ml water and cup 2 paper cup 25 ml and 10 g sugar
    => the two cups have the same amount of water
  … it is easy to discriminate…
    => the two sounds are very different
22% Pragmatics
  Because the vibrations
    => the rubberband is vibrating
  … the fulcrum is too close to the earth
    => the earth is the load in the system
Errors in Expressed Facets
6% Preprocessing errors
  Normalization issues
  Parser errors
Errors in Expressed Facets
Over half of the errors involved more than one of the fine-grained factors
  There is a shadow there because the sun is behind it and light cannot go through solid objects. Note, I think that question was kind of dumb.
    => the tree blocks the light
Errors in Unaddressed Facets
Many are questionable annotations
  You could take a couple of cardboard houses and … 1 with thick glazed insulation. …
    =/> installing the insulation in the houses
  Because the darker the color the faster it will heat up
    =/> darkest color
Errors in Unaddressed Facets
Biggest source of error: lexical similarity
Ignorance of context
  [the electromagnet] has to be iron…
    => steel is made from iron
Antonyms
  closer versus greater distance and absorbs energy versus reflects energy
Misguided trust
  I learned it in class
Conclusion
New assessment paradigm
  Fine-grained facets and labels
Corpus of 146K fine-grained inference annotations
Answer assessment system
  24.4 and 15.4% over baseline results for in-domain and out-of-domain, respectively
  First successful assessment of Grade 3-6 constructed responses
Error analysis provides insight into where future work is most appropriate
Thanks!
We are grateful to the anonymous reviewers, whose comments improved the paper, and the Lawrence Hall of Science for the data.
This work was partially funded by Award Numbers: NSF 0551723, IES R305B070434, and NSF DRL-0733323.