Are there Group Differences in Essay-Writing Processes?
Mo Zhang, Randy Bennett, Paul Deane
Educational Testing Service, Princeton, NJ
Peter W. van Rijn
ETS Global BV
Presented as part of the symposium, “Test Performance Can Tell Us About Problem-Solving Processes: Implications for Building Equitable Assessment Systems” (P. Carr, Chair), at the National Conference on Student Assessment, Philadelphia, PA, June 2016
Copyright © ETS 2016. All rights reserved.
Overview
• Introduction to writing processes
• Current study’s research questions, method, and results
• Summary and limitations
Research on Writing Processes
• Stronger writers tend to:
  • Produce text efficiently in relatively long bursts
  • Pause at natural locations for planning, such as at clause and sentence boundaries
  • Engage in more editing and revision
• Weaker writers tend to:
  • Produce text less efficiently
  • Pause in locations that suggest difficulties in typing, spelling, word-finding, and other transcription processes
Keystroke Logging of Writing Processes
• A keystroke log (KL) records the mechanical and temporal characteristics of text production
• Type of action: Insertion, deletion, paste, jump, pause
• How long an action lasts (e.g., in terms of characters or time)
• Where an action occurs (e.g., within a word)
• When an action occurs in the writing session (e.g., at the beginning of writing, right before submission)
• These characteristics can be used for inferring cognitive processes implicated in writing
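To make the log characteristics above concrete, here is a minimal sketch of a keystroke-log record and one temporal feature (the inter-key interval). The record format is hypothetical; the ETS KL engine's actual schema is not shown in these slides.

```python
from dataclasses import dataclass

@dataclass
class KeyEvent:
    char: str      # character produced ("" for a deletion)
    action: str    # "insert", "delete", "paste", or "jump"
    time_ms: int   # timestamp within the writing session, in milliseconds

def inter_key_intervals(events):
    """Pauses between consecutive actions, in milliseconds."""
    times = [e.time_ms for e in events]
    return [later - earlier for earlier, later in zip(times, times[1:])]

# Three insertions; the long gap before "t" would register as a word-internal pause
log = [KeyEvent("c", "insert", 0),
       KeyEvent("a", "insert", 180),
       KeyEvent("t", "insert", 2400)]
print(inter_key_intervals(log))  # [180, 2220]
```

Features such as word-internal or word-final pause rates would then aggregate these intervals by where in a word each action occurred.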
ETS Keystroke Logging (KL) Engine
• Records, computes, and displays a large variety of process features
• Allows research on the use of KL data for:
  • Understanding basic writing processes
  • Validating assessments
  • Giving feedback to improve teaching and learning
Relevant Research
• Zhang and Deane (2015)
• Computed 29 KL features from essays written by 6th-8th grade students to a policy recommendation prompt
• Employed exploratory factor analysis to identify how features were organized
• Extracted four factors interpretable as:
• Fluency (e.g., inter-key interval)
• Macro editing (e.g., multi-word deletion)
• Local editing (e.g., corrected typo)
• Planning (e.g., time between phrasal bursts)
• Fluency and Planning were significantly related to essay scores
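As a sketch of the factor-analytic step, the fragment below fits a four-factor solution to a 29-feature matrix using scikit-learn's FactorAnalysis with varimax rotation. The data are synthetic, and the estimation and rotation methods Zhang and Deane (2015) actually used are not specified in these slides.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 29))  # stand-in for 29 KL features from 500 essays

fa = FactorAnalysis(n_components=4, rotation="varimax")
fa.fit(X)

loadings = fa.components_.T  # one row per feature, one column per factor
print(loadings.shape)        # (29, 4)
```

In the study, features that loaded uniquely and consistently on one factor were used to define and interpret the four indicators.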
Research Questions
• Are there differences in writing processes across gender groups?
• If so, are those differences group characteristics or simply a function of disparities in skill level?
Instruments
• Two scenario-based ELA assessments, one focusing on argumentation and one on policy recommendation
• Three parallel forms of each assessment
• For each assessment:
• Scenario poses a controversial issue
• Lead-in tasks ask the student to:
• Read and summarize arguments in several given sources
• Critique those arguments
• Analyze the arguments
• Culminating essay asks the student to present his or her position and reasoning, using evidence from the sources
Participants
• Data collected in 2013 from a volunteer sample of schools in Idaho and several additional states
• Each student took two of three forms from one of two persuasive subgenres
• Argumentation forms
  • Administered to 1840 students in 7th-9th grade
• Policy recommendation forms
  • Administered to 1523 students in 6th-8th grade
• Each form contained one essay task
Data Preparation
• Discarded responses with zero essay scores (e.g., blank, off-topic), fewer than 25 words, or response times under 180 seconds
• Extracted the 29 process features used by Zhang and Deane (2015) from student logs for each of 6 essay prompts
• Factor analyzed each prompt separately, retaining 18 features that uniquely and consistently loaded on a factor for at least 4 of 6 prompts
• For each subgenre, created a dataset in which each student responded to only one essay
• For each of the four process factors found in Zhang and Deane (2015), standardized each feature score, summed the feature scores for each student, and rescaled the sum to a 0-1 range to create a “process indicator”
• Merged records across essays and subgenres to increase sample size
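The standardize-sum-rescale recipe for building a process indicator can be sketched as follows; this is a generic illustration with synthetic data, not the authors' code.

```python
import numpy as np

def process_indicator(features):
    """Z-score each feature column, sum per student, rescale the sums to [0, 1]."""
    z = (features - features.mean(axis=0)) / features.std(axis=0)
    total = z.sum(axis=1)
    return (total - total.min()) / (total.max() - total.min())

rng = np.random.default_rng(1)
scores = rng.normal(size=(8, 5))  # 8 students x 5 features loading on one factor
ind = process_indicator(scores)
print(ind.min(), ind.max())       # 0.0 1.0
```

The rescaling puts every indicator on a common 0-1 range, which makes group means directly comparable across the four indicators.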
Essay Performance on Writing Fundamentals Averaged Across Six Prompts

Group                      Male     Female
N                          1022     1153
Essay Score (1-5)
  Mean                      2.6      2.8*
  SD                        1.0      0.9
Essay Length (in words)
  Mean                      182      218*
  SD                         94      103

*Difference between male and female means significant at p < .0001.
Note. Schools did not report gender designations for 445 students.
Process-Indicator Correlations (N = 2619)

                 Dysfluency   Macro Editing   Local Editing   Planning
Dysfluency         (0.91)
Macro Editing      -0.21**       (0.83)
Local Editing      -0.15**        0.22**         (0.63)
Planning            0.13**        0.12**        -0.12**        (0.59)
Essay Score        -0.34**        0.11**         0.05*          0.10**

*p < .01. **p < .001.
Note. Coefficient alpha values are in parentheses on the main diagonal. All correlations and alpha values include students with and without gender designations.
Process-Indicator Performance Averaged Across Six Prompts

Group      N      Dysfluency   Macro Editing   Local Editing   Planning

Observed Process-Indicator Means
Males     1022       0.52          0.33            0.21          0.52
Females   1152       0.36***       0.48***         0.38***       0.44*

Adjusted Process-Indicator Means (Controlling for Essay Score)
Males     1022       0.50          0.34            0.21          0.53
Females   1152       0.38***       0.47***         0.38***       0.43**

*Difference between male and female means significant at p < .05. **Significant at p < .01. ***Significant at p < .0001.
Standardization Conditional-Difference Measure (Controlling for Essay Score)

                      Process Indicator
Group     N      Dysfluency   Macro Editing   Local Editing   Planning
Gender   2174       0.17          -0.23           -0.30          0.14

Note. Values ≥ |.10| are considered large.
[Figure] Dysfluency: Observed Conditional Differences for Gender Groups (N = 2174)
[Figure] Macro Editing: Observed Conditional Differences for Gender Groups (N = 2174)
[Figure] Local Editing: Observed Conditional Differences for Gender Groups (N = 2174)
[Figure] Planning: Observed Conditional Differences for Gender Groups (N = 2174)
Summary
• Compared gender groups on scenario-based essay performance and writing processes using features extracted from keystroke logs
• In comparison to males, females on average:
  • Obtained higher essay scores
  • Differed significantly in writing processes:
    • Entered text more fluently
    • Engaged in more macro and local editing
    • Did less real-time planning
• Writing-process differences persisted after controlling for essay score
Limitations
• Relatively small, unrepresentative examinee samples
• Significant number of students with missing gender information
• Test-based writing in only two subgenres
• Results collapsed across essays and subgenres
• Non-optimal matching variable
• Process-indicator interpretations require further validation
Appendix
References
• Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and Standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35-67). Hillsdale, NJ: Lawrence Erlbaum Associates.
• Zhang, M., & Deane, P. (2015). Process features in writing: Internal structure and incremental value over product features (RR-15-27). Princeton, NJ: Educational Testing Service.
• Zhang, M., Dorans, N. J., Li, C., & Rupp, A. A. (in press). Differential feature functioning in automated essay scoring. In H. Jiao & R. Lissitz (Eds.), Test fairness in the new generation of large scale assessment.
Features Defining Each Indicator
• Dysfluency (lower scores suggest greater fluency)
• Inter-key pause: Extent to which pauses occur between keystrokes
• Word internal pause time: Extent to which pauses occur within words during text production
• Word final pause: Extent to which pauses occur just before typing the last character in a word
• Word initial pause: Extent to which pauses occur just after typing the first character in a word
• Word space pause: Extent to which pauses occur in between words
• Macro Editing (higher scores suggest more editing)
• Deleted characters: Extent to which deletion of characters occurs
• Proportion of characters in multiword deletion: Extent to which deletion of characters occurs in the process of deleting multiple words
• Multiword edit time: Extent of time spent in deleting multiple words
• Proportion of new content chunks: Extent to which deleted text is replaced with edited text with new content
Features Defining Each Indicator
• Local Editing (higher scores suggest more editing)
• Proportion of corrected typos: Extent to which correction of mistyped words occurs
• Proportion of major edits: Extent to which words are edited beyond minor correction
• Proportion of minor edits: Extent to which words are edited to make only minor corrections
• Proportion of word-choice events: Extent to which words are edited to produce completely different words
• Proportion of uncorrected spelling errors: Extent to which a spelling error occurs that is not corrected before another unrelated action is taken (positively loads on factor)
Features Defining Each Indicator
• Planning (higher scores suggest more planning)
• End sentence punctuation pause: Extent to which pauses occur at the juncture of sentences
• Start time: Extent to which a pause occurs prior to beginning writing (negatively loads on factor)
• Time spent between phrasal bursts: Extent to which pauses occur between strings of fluent text production
• Proportion of time spent at phrasal bursts: Extent to which pauses occur at the beginning of a string of fluent text production
Data Analysis
• Compared mean scores for each process indicator between gender groups before and after controlling for human essay score
• Tested significance of indicator-score differences between gender groups using MANCOVA, controlling for essay score
• Computed a conditional-difference measure for each indicator based on Standardization approach to DIF (Dorans & Holland, 1993; Zhang et al., in press)
• Weighted average of the differences in expected group-indicator-scores taken across essay-score levels
• Values between |.05| and |.10| considered noticeable
• Values greater than |.10| considered large
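A minimal sketch of the conditional-difference computation, assuming the weights come from the focal group's essay-score distribution (a common choice in the Standardization approach; the exact weighting the authors used is not detailed in these slides):

```python
import numpy as np

def std_conditional_difference(score, indicator, group, focal="F", ref="M"):
    """Weighted average over score levels of (focal mean - reference mean),
    weighted by the focal group's score distribution."""
    score, indicator, group = map(np.asarray, (score, indicator, group))
    n_focal = (group == focal).sum()
    diff = 0.0
    for s in np.unique(score):
        f = indicator[(score == s) & (group == focal)]
        r = indicator[(score == s) & (group == ref)]
        if len(f) and len(r):  # skip score levels missing either group
            diff += (len(f) / n_focal) * (f.mean() - r.mean())
    return diff

# Hypothetical toy data: essay scores, a process indicator, and gender codes
d = std_conditional_difference([1, 1, 2, 2, 2, 3],
                               [0.2, 0.4, 0.6, 0.8, 0.7, 0.5],
                               ["F", "M", "F", "M", "M", "F"])
print(round(d, 3))  # -0.117
```

Conditioning on essay score in this way separates group differences in process from differences attributable to overall writing skill.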
Essay Scores
• Current study
  • First human rater
  • Writing fundamentals rubric
  • Scale = 1-5
  • Median QWK (across 6 prompts) = .61
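Quadratic weighted kappa (QWK) between two raters can be computed with scikit-learn; the ratings below are hypothetical, not the study's data.

```python
from sklearn.metrics import cohen_kappa_score

# Two raters scoring eight essays on the 1-5 writing fundamentals scale
rater1 = [3, 2, 4, 5, 1, 3, 2, 4]
rater2 = [3, 3, 4, 4, 1, 2, 2, 5]

qwk = cohen_kappa_score(rater1, rater2, weights="quadratic")
print(round(qwk, 2))  # 0.83
```

The quadratic weighting penalizes large rater disagreements more heavily than adjacent-score disagreements, which suits ordinal rubric scales.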