Are there Group Differences in Essay-Writing Processes?
Mo Zhang, Randy Bennett, Paul Deane
Educational Testing Service, Princeton, NJ
Peter W. van Rijn
ETS Global BV
Presented as part of the symposium, “Test Performance Can Tell Us About Problem-Solving Processes: Implications for Building Equitable Assessment Systems” (P. Carr, Chair), at the National Conference on Student Assessment, Philadelphia, PA, June 2016
Copyright © ETS 2016. All rights reserved.
Overview
• Introduction to writing processes
• Current study’s research questions, method, and results
• Summary and limitations
Research on Writing Processes
• Stronger writers tend to:
  • Produce text efficiently in relatively long bursts
  • Pause at natural locations for planning, such as at clause and sentence boundaries
  • Engage in more editing and revision
• Weaker writers tend to:
  • Produce text less efficiently
  • Pause in locations that suggest difficulties in typing, spelling, word-finding, and other transcription processes
Keystroke Logging of Writing Processes
• A keystroke log (KL) records the mechanical and temporal characteristics of text production
• Type of action: Insertion, deletion, paste, jump, pause
• How long an action lasts (e.g., in terms of characters or time)
• Where an action occurs (e.g., within a word)
• When an action occurs in the writing session (e.g., at the beginning of writing, right before submission)
• These characteristics can be used for inferring cognitive processes implicated in writing
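To make the log characteristics above concrete, here is a minimal sketch of a keystroke-log record and one temporal feature (the inter-key interval). The record format is hypothetical; the ETS KL engine's actual schema is not shown in these slides.

```python
from dataclasses import dataclass

@dataclass
class KeyEvent:
    char: str      # character produced ("" for a deletion)
    action: str    # "insert", "delete", "paste", or "jump"
    time_ms: int   # timestamp within the writing session, in milliseconds

def inter_key_intervals(events):
    """Pauses between consecutive actions, in milliseconds."""
    times = [e.time_ms for e in events]
    return [later - earlier for earlier, later in zip(times, times[1:])]

# Three insertions; the long gap before "t" would register as a word-internal pause
log = [KeyEvent("c", "insert", 0),
       KeyEvent("a", "insert", 180),
       KeyEvent("t", "insert", 2400)]
print(inter_key_intervals(log))  # [180, 2220]
```

Features such as word-internal or word-final pause rates would then aggregate these intervals by where in a word each action occurred.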
ETS Keystroke Logging (KL) Engine
• Records, computes, and displays a large variety of process features
• Allows research on the use of KL data for:
  • Understanding basic writing processes
  • Validating assessments
  • Giving feedback to improve teaching and learning
Relevant Research
• Zhang and Deane (2015)
• Computed 29 KL features from essays written by 6th-8th grade students to a policy recommendation prompt
• Employed exploratory factor analysis to identify how features were organized
• Extracted four factors interpretable as:
• Fluency (e.g., inter-key interval)
• Macro editing (e.g., multi-word deletion)
• Local editing (e.g., corrected typo)
• Planning (e.g., time between phrasal bursts)
• Fluency and Planning were significantly related to essay scores
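As a sketch of the factor-analytic step, the fragment below fits a four-factor solution to a 29-feature matrix using scikit-learn's FactorAnalysis with varimax rotation. The data are synthetic, and the estimation and rotation methods Zhang and Deane (2015) actually used are not specified in these slides.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 29))  # stand-in for 29 KL features from 500 essays

fa = FactorAnalysis(n_components=4, rotation="varimax")
fa.fit(X)

loadings = fa.components_.T  # one row per feature, one column per factor
print(loadings.shape)        # (29, 4)
```

In the study, features that loaded uniquely and consistently on one factor were used to define and interpret the four indicators.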
Research Questions
• Are there differences in writing processes across gender groups?
• If so, are those differences group characteristics or simply a function of disparities in skill level?
Instruments
• Two scenario-based ELA assessments, one focusing on argumentation and one on policy recommendation
• Three parallel forms of each assessment
• For each assessment:
• Scenario poses a controversial issue
• Lead-in tasks ask the student to:
• Read and summarize arguments in several given sources
• Critique those arguments
• Analyze the arguments
• Culminating essay asks the student to present his or her position and reasoning, using evidence from the sources
Participants
• Data collected in 2013 from a volunteer sample of schools in Idaho and several additional states
• Each student took two of three forms from one of two persuasive subgenres
• Argumentation forms
  • Administered to 1840 students in 7th-9th grade
• Policy recommendation forms
  • Administered to 1523 students in 6th-8th grade
• Each form contained one essay task
Data Preparation
• Discarded responses with zero essay scores (e.g., blank, off-topic), fewer than 25 words, or response times under 180 seconds
• Extracted the 29 process features used by Zhang and Deane (2015) from student logs for each of 6 essay prompts
• Factor analyzed each prompt separately, retaining 18 features that uniquely and consistently loaded on a factor for at least 4 of 6 prompts
• For each subgenre, created a dataset in which each student responded to only one essay
• For each of the four process factors found in Zhang and Deane (2015), standardized each feature score, summed the feature scores for each student, and rescaled the sum to a 0-1 range to create a “process indicator”
• Merged records across essays and subgenres to increase sample size
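The standardize-sum-rescale recipe for building a process indicator can be sketched as follows; this is a generic illustration with synthetic data, not the authors' code.

```python
import numpy as np

def process_indicator(features):
    """Z-score each feature column, sum per student, rescale the sums to [0, 1]."""
    z = (features - features.mean(axis=0)) / features.std(axis=0)
    total = z.sum(axis=1)
    return (total - total.min()) / (total.max() - total.min())

rng = np.random.default_rng(1)
scores = rng.normal(size=(8, 5))  # 8 students x 5 features loading on one factor
ind = process_indicator(scores)
print(ind.min(), ind.max())       # 0.0 1.0
```

The rescaling puts every indicator on a common 0-1 range, which makes group means directly comparable across the four indicators.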
Essay Performance on Writing Fundamentals Averaged Across Six Prompts

Group                      Male     Female
N                          1022     1153
Essay Score (1-5)
  Mean                      2.6      2.8*
  SD                        1.0      0.9
Essay Length (in words)
  Mean                      182      218*
  SD                         94      103

*Difference between male and female means significant at p < .0001.
Note. Schools did not report gender designations for 445 students.
Process-Indicator Correlations (N = 2619)

                 Dysfluency   Macro Editing   Local Editing   Planning
Dysfluency         (0.91)
Macro Editing      -0.21**       (0.83)
Local Editing      -0.15**        0.22**         (0.63)
Planning            0.13**        0.12**        -0.12**        (0.59)
Essay Score        -0.34**        0.11**         0.05*          0.10**

*p < .01. **p < .001.
Note. Coefficient alpha values are in parentheses on the main diagonal. All correlations and alpha values include students with and without gender designations.
Process-Indicator Performance Averaged Across Six Prompts

Group      N      Dysfluency   Macro Editing   Local Editing   Planning

Observed Process-Indicator Means
Males     1022       0.52          0.33            0.21          0.52
Females   1152       0.36***       0.48***         0.38***       0.44*

Adjusted Process-Indicator Means (Controlling for Essay Score)
Males     1022       0.50          0.34            0.21          0.53
Females   1152       0.38***       0.47***         0.38***       0.43**

*Difference between male and female means significant at p < .05. **Significant at p < .01. ***Significant at p < .0001.
Standardization Conditional-Difference Measure (Controlling for Essay Score)

                      Process Indicator
Group     N      Dysfluency   Macro Editing   Local Editing   Planning
Gender   2174       0.17          -0.23           -0.30          0.14

Note. Values ≥ |.10| are considered large.
[Figure] Dysfluency: Observed Conditional Differences for Gender Groups (N = 2174)
[Figure] Macro Editing: Observed Conditional Differences for Gender Groups (N = 2174)
[Figure] Local Editing: Observed Conditional Differences for Gender Groups (N = 2174)
[Figure] Planning: Observed Conditional Differences for Gender Groups (N = 2174)
Summary
• Compared gender groups on scenario-based essay performance and writing processes using features extracted from keystroke logs
• In comparison to males, females on average:
  • Obtained higher essay scores
  • Differed significantly in writing processes:
    • Entered text more fluently
    • Engaged in more macro and local editing
    • Did less real-time planning
• Writing-process differences persisted after controlling for essay score
Limitations
• Relatively small, unrepresentative examinee samples
• Significant number of students with missing gender information
• Test-based writing in only two subgenres
• Results collapsed across essays and subgenres
• Non-optimal matching variable
• Process-indicator interpretations require further validation
Appendix
References
• Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and Standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35-67). Hillsdale, NJ: Lawrence Erlbaum Associates.
• Zhang, M., & Deane, P. (2015). Process features in writing: Internal structure and incremental value over product features (RR-15-27). Princeton, NJ: Educational Testing Service.
• Zhang, M., Dorans, N. J., Li, C., & Rupp, A. A. (in press). Differential feature functioning in automated essay scoring. In H. Jiao & R. Lissitz (Eds.), Test fairness in the new generation of large scale assessment.
Features Defining Each Indicator
• Dysfluency (lower scores suggest greater fluency)
• Inter-key pause: Extent to which pauses occur between keystrokes
• Word internal pause time: Extent to which pauses occur within words during text production
• Word final pause: Extent to which pauses occur just before typing the last character in a word
• Word initial pause: Extent to which pauses occur just after typing the first character in a word
• Word space pause: Extent to which pauses occur in between words
• Macro Editing (higher scores suggest more editing)
• Deleted characters: Extent to which deletion of characters occurs
• Proportion of characters in multiword deletion: Extent to which deletion of characters occurs in the process of deleting multiple words
• Multiword edit time: Extent of time spent in deleting multiple words
• Proportion of new content chunks: Extent to which deleted text is replaced with edited text with new content
Features Defining Each Indicator
• Local Editing (higher scores suggest more editing)
• Proportion of corrected typos: Extent to which correction of mistyped words occurs
• Proportion of major edits: Extent to which words are edited beyond minor correction
• Proportion of minor edits: Extent to which words are edited to make only minor corrections
• Proportion of word-choice events: Extent to which words are edited to produce completely different words
• Proportion of uncorrected spelling errors: Extent to which a spelling error occurs that is not corrected before another unrelated action is taken (positively loads on factor)
Features Defining Each Indicator
• Planning (higher scores suggest more planning)
• End sentence punctuation pause: Extent to which pauses occur at the juncture of sentences
• Start time: Extent to which a pause occurs prior to beginning writing (negatively loads on factor)
• Time spent between phrasal bursts: Extent to which pauses occur between strings of fluent text production
• Proportion of time spent at phrasal bursts: Extent to which pauses occur at the beginning of a string of fluent text production
Data Analysis
• Compared mean scores for each process indicator between gender groups before and after controlling for human essay score
• Tested significance of indicator-score differences between gender groups using MANCOVA, controlling for essay score
• Computed a conditional-difference measure for each indicator based on Standardization approach to DIF (Dorans & Holland, 1993; Zhang et al., in press)
• Weighted average of the differences in expected group-indicator-scores taken across essay-score levels
• Values between |.05| and |.10| considered noticeable
• Values greater than |.10| considered large
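A minimal sketch of the conditional-difference computation, assuming the weights come from the focal group's essay-score distribution (a common choice in the Standardization approach; the exact weighting the authors used is not detailed in these slides):

```python
import numpy as np

def std_conditional_difference(score, indicator, group, focal="F", ref="M"):
    """Weighted average over score levels of (focal mean - reference mean),
    weighted by the focal group's score distribution."""
    score, indicator, group = map(np.asarray, (score, indicator, group))
    n_focal = (group == focal).sum()
    diff = 0.0
    for s in np.unique(score):
        f = indicator[(score == s) & (group == focal)]
        r = indicator[(score == s) & (group == ref)]
        if len(f) and len(r):  # skip score levels missing either group
            diff += (len(f) / n_focal) * (f.mean() - r.mean())
    return diff

# Hypothetical toy data: essay scores, a process indicator, and gender codes
d = std_conditional_difference([1, 1, 2, 2, 2, 3],
                               [0.2, 0.4, 0.6, 0.8, 0.7, 0.5],
                               ["F", "M", "F", "M", "M", "F"])
print(round(d, 3))  # -0.117
```

Conditioning on essay score in this way separates group differences in process from differences attributable to overall writing skill.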
Essay Scores
• Current study
  • First human rater
  • Writing fundamentals rubric
  • Scale = 1-5
  • Median QWK (across 6 prompts) = .61
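Quadratic weighted kappa (QWK) between two raters can be computed with scikit-learn; the ratings below are hypothetical, not the study's data.

```python
from sklearn.metrics import cohen_kappa_score

# Two raters scoring eight essays on the 1-5 writing fundamentals scale
rater1 = [3, 2, 4, 5, 1, 3, 2, 4]
rater2 = [3, 3, 4, 4, 1, 2, 2, 5]

qwk = cohen_kappa_score(rater1, rater2, weights="quadratic")
print(round(qwk, 2))  # 0.83
```

The quadratic weighting penalizes large rater disagreements more heavily than adjacent-score disagreements, which suits ordinal rubric scales.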