Designing Intelligent Tutors That Adapt to When Students Game the System

Ryan Shaun Baker
December, 2005

Doctoral Dissertation
Human-Computer Interaction Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA USA

Carnegie Mellon University, School of Computer Science Technical Report CMU-HCII-05-104

Thesis Committee:
Albert T. Corbett, co-chair
Kenneth R. Koedinger, co-chair
Shelley Evenson
Tom Mitchell

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Copyright © 2005 by Ryan Baker. All rights reserved.

This research was sponsored in part by an NDSEG (National Defense Science and Engineering Graduate) Fellowship, and by National Science Foundation grant REC-043779. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies or endorsement, either express or implied, of the NSF, the ASEE, or the U.S. Government.

Keywords: intelligent tutoring systems, educational data mining, human-computer interaction, gaming the system, quantitative field observations, Latent Response Models, intelligent agents

Abstract

Students use intelligent tutors and other types of interactive learning environments in a considerable variety of ways. In this thesis, I detail my work to understand, automatically detect, and re-design an intelligent tutoring system to adapt to a behavior I term gaming the system. Students who game the system attempt to succeed in the learning environment by exploiting properties of the system rather than by learning the material and trying to use that knowledge to answer correctly.

Within this thesis, I present a set of studies aimed towards understanding what effects gaming has on learning, and why students game, using a combination of quantitative classroom observations and machine learning. In the course of these studies, I determine that gaming the system is replicably associated with low learning. I use data from these studies to develop a profile of students who game, showing that gaming students have a consistent pattern of negative affect towards many aspects of their classroom experience and studies.

Another part of this thesis is the development and training of a detector that reliably detects gaming, in order to drive adaptive support. In this thesis, I validate that this detector transfers effectively between 4 different lessons within the middle school mathematics tutor curriculum without re-training, suggesting that it may be immediately deployable to that entire curriculum. Developing this detector required developing new machine learning methods that effectively combine unlabeled data and labeled data at different grain sizes in order to train a model to accurately indicate both which students were gaming, and when they were gaming. To this end, I adapted a modeling framework from the Psychometrics literature, Latent Response Models (Maris, 1995), and used a variant of Fast Correlation-Based Filtering (Yu and Liu 2003) to efficiently search the space of potential models.

The final part of this thesis is the re-design of an existing intelligent tutoring lesson to adapt to gaming. The re-designed lesson incorporates an animated agent (Scooter the Tutor) who indicates to the student and their teacher whether the student has been gaming recently. Scooter also gives students supplemental exercises, in order to offer the student a second chance to learn the material he/she had gamed through.
Scooter reduces the frequency of gaming by over half, and Scooter's supplementary exercises are associated with substantially better learning; Scooter appears to have had virtually no effect on the other students.

Acknowledgements

The list of people that I should thank for their help and support in completing this dissertation would fill an entire book. Here, instead, is an incomplete list of some of the people I would like to thank for their help, support, and suggestions. Angela Wagner, Ido Roll, Mike Schneider, Steve Ritter, Tom McGinnis, and Jane Kamneva assisted in essential ways with the implementation and administration of the studies presented in this dissertation. None of the studies presented here could have occurred without the support of Jay Raspat, Meghan Naim, Dina Crimone, Russ Hall, Sue Cameron, Frances Battaglia, and Katy Getman, in welcoming me into their classrooms. The ideas presented in this dissertation were refined through conversations with Ido Roll, Santosh Mathan, Neil Heffernan, Aatish Salvi, Dan Baker, Cristen Torrey, Darren Gergle, Irina Shklovski, Peter Scupelli, Aaron Bauer, Brian Junker, Joseph Beck, Jack Mostow, Carl diSalvo, and Vincent Aleven. My committee members, Shelley Evenson and Tom Mitchell, helped to shape this dissertation into its present form, teaching me a great deal about design and machine learning in the process. My advisors, Albert Corbett and Kenneth Koedinger, were exceptional mentors, and have guided me for the last five years in learning how to conduct research effectively, usefully, and ethically; I owe an immeasurable debt to them. Finally, I would like to thank my parents, Sam and Carol, and my wife, Adriana. Their support guided me when the light at the end of the dissertation seemed far.

Table of Contents

I. Introduction
II. Gaming the System and Learning
III. Detecting Gaming
IV. Understanding Why Students Game
V. Adapting to Gaming
VI. Conclusions and Future Work
References

Appendices
A. Cognitive Tutor Lessons
B. Learning Assessments
C. Gaming Detectors

Chapter One
Introduction

In the last twenty years, interactive learning environments and computerized educational supports have become a ubiquitous part of students' classroom experiences, in the United States and throughout the world. Many such systems have become very effective at assessing and responding to differences in student knowledge and cognition (Corbett and Anderson 1995; Martin and vanLehn 1995; Arroyo, Murray, Woolf, and Beal 2003; Biswas et al 2005). Systems which can effectively assess and respond to cognitive differences have been shown to produce substantial and statistically significant learning gains, as compared to students in traditional classes (cf. Koedinger, Anderson, Hadley, and Mark 1997; vanLehn et al 2005). However, even within classes using interactive learning environments which have been shown to be effective, there is still considerable variation in student learning outcomes, even when each student's prior knowledge is taken into account. The thesis of this dissertation is that a considerable amount of this variation comes from differences in how students choose to use educational software, that we can determine which behaviors are associated with poorer learning, and that we can develop systems that can automatically detect and respond to those behaviors, in a fashion that improves student learning.
In this dissertation, I present results showing that one way that students use educational software, gaming the system, is associated with substantially poorer learning, much more so, in fact, than if the student spent a substantial portion of each class ignoring the software and talking off-task with other students (Chapter 2). I then develop a model which can reliably detect when a student is gaming the system, across several different lessons from a single Cognitive Tutor curriculum (Chapter 3). Using a combination of the gaming detector and attitudinal questionnaires, I compile a profile of the prototypical gaming student, showing that gaming students differ from other students in several respects (Chapter 4). I next combine the gaming detector and profile of gaming students, in order to re-design existing Cognitive Tutor lessons to address gaming. My re-design introduces an interactive agent, Scooter the Tutor, who signals to students (and their teachers) that he knows that the student is gaming, and gives supplemental exercises targeted towards the material students are missing by gaming (Chapter 5). Scooter substantially decreases the incidence of gaming, and his exercises are associated with substantially better learning. In Chapter 6, I discuss the larger implications of this dissertation, advancing the idea of interactive learning environments that effectively adapt not just to differences in student cognition, but differences in student choices.

Gaming the System

I define Gaming the System as attempting to succeed in an educational environment by exploiting properties of the system rather than by learning the material and trying to use that knowledge to answer correctly. Gaming strategies are seen by teachers and outsiders as misuse of the software the student is using or the system that the student is participating in, but are distinguished from cheating in that gaming does not violate explicit rules of the educational setting, as cheating does. In fact, in some situations students are encouraged to game the system; for instance, several test preparation companies teach students to use the structure of how SAT questions are designed in order to have a higher probability of guessing the correct answer. Cheating on the SAT, by contrast, is not recommended by test preparation companies.

Gaming the System occurs in a wide variety of different educational settings, both computerized and offline. To cite just a few examples: Arbreton (1998) found that students ask teachers or teachers' aides to give them answers to math problems before attempting the problems themselves. Magnussen and Misfeldt (2004) have found that students take turns intentionally making errors in collaborative educational games in order to help their teammates obtain higher scores; gaming the system has also been documented in other types of educational games (Klawe 1998; Miller, Lehman, and Koedinger 1999). Cheng and Vassileva (2005) have found that students post irrelevant information in large quantities to newsgroups in online courses which are graded based on participation. Within intelligent tutoring systems, gaming the system has been particularly well-documented.
Schofield (1995) found that some students quickly learned to ask for the answer within a prototype intelligent tutoring system which did not penalize help requests, instead of attempting to solve the problem on their own, a behavior quite similar to that observed by Arbreton (1998). Wood and Wood (1999) found that students quickly and repeatedly ask for help until the tutor gives the student the correct answer, a finding replicated by Aleven and Koedinger (2000). Mostow and his colleagues (2002) found in a reading tutor that students often avoid difficulty by re-reading the same story over and over. Aleven and his colleagues (1998) found, in a geometry tutor, that students learn what answers are most likely to be correct (such as numbers in the givens, or 90 or 180 minus one of those numbers), and try those numbers before thinking through a problem. Murray and vanLehn (2005) found that students using systems with delayed hints (a design adopted by both Carnegie Learning (Aleven 2001) and by the AnimalWatch project (Beck 2005) as a response to gaming) intentionally make errors at high speed in order to activate the software's proactive help.

Within the intelligent tutoring systems we studied, we primarily observed two types of gaming the system:

1. quickly and repeatedly asking for help until the tutor gives the student the correct answer (as in Wood and Wood 1999; Aleven and Koedinger 2000)
2. inputting answers quickly and systematically, for instance, entering 1, 2, 3, 4, or clicking every checkbox within a set of multiple-choice answers, until the tutor identifies a correct answer and allows the student to advance

In both of these cases, features designed to help a student learn curricular material via problem-solving were instead used by some students to solve the current problem and move forward within the curriculum.

The Cognitive Tutor Classroom

All of the studies that I will present in this dissertation took place in classes using Cognitive Tutor software (Koedinger, Anderson, Hadley, and Mark 1995). In these classes, students complete mathematics problems within the Cognitive Tutor environment. The problems are designed so as to reify student knowledge, making student thinking (and misconceptions) visible. A running cognitive model assesses whether the student's answers map to correct understanding or to a known misconception. If the student's answer is incorrect, the answer turns red; if the student's answers are indicative of a known misconception, the student is given a buggy message indicating how their current knowledge differs from correct understanding (see Figure 1-1). Cognitive Tutors also have multi-step hint features; a student who is struggling can ask for a hint. He or she first receives a conceptual hint, and can request further hints, which become more and more specific until the student is given the answer (see Figure 1-2). Students in the classes studied used the Cognitive Tutor 2 out of every 5 or 6 class days, devoting the remaining days to traditional classroom lectures and group work. In Cognitive Tutor classes, conceptual instruction is generally given through traditional classroom lectures; however, in order to guarantee that all students had the same conceptual instruction in our studies, we used PowerPoint presentations with voiceover and simple animations to deliver conceptual instruction (see Figure 1-3).
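To make the feedback and multi-step hint behavior just described concrete, here is a minimal sketch in Python. The class name, method names, and problem content are hypothetical and are not drawn from the Cognitive Tutor's actual implementation; the sketch simply mirrors the interaction loop described above, in which answers are checked against a cognitive model, incorrect answers are flagged in red, answers matching a known misconception produce a buggy message, and repeated hint requests step through increasingly specific hints until the answer is given.

```python
# Minimal illustrative sketch (hypothetical names and content); not the actual
# Cognitive Tutor implementation.

class TutorStep:
    """One problem step: a correct answer, known misconceptions, and a hint chain."""

    def __init__(self, correct_answer, misconceptions, hints):
        self.correct_answer = correct_answer
        self.misconceptions = misconceptions  # maps a wrong answer to a buggy message
        self.hints = hints                    # ordered from conceptual to specific
        self.hint_level = 0

    def check_answer(self, answer):
        if answer == self.correct_answer:
            return ("correct", None)
        if answer in self.misconceptions:
            # Known misconception: give a buggy message explaining the error.
            return ("buggy", self.misconceptions[answer])
        return ("incorrect", None)            # answer would be shown in red

    def request_hint(self):
        # Each request returns a more specific hint; the last hint gives the answer.
        hint = self.hints[min(self.hint_level, len(self.hints) - 1)]
        self.hint_level += 1
        return hint


step = TutorStep(
    correct_answer="12",
    misconceptions={"3": "It looks like you counted the gridlines instead of reading the value."},
    hints=["Think about which axis shows the variable you were asked about.",
           "Find the point for this case, then read straight down to the x-axis.",
           "The value on the x-axis is 12."],
)
print(step.check_answer("3"))   # ('buggy', 'It looks like you counted ...')
print(step.request_hint())      # conceptual hint first
print(step.request_hint())      # more specific
print(step.request_hint())      # bottom-out hint containing the answer
```

Both gaming behaviors enumerated earlier exploit this loop: a gaming student can repeatedly invoke the equivalent of request_hint until the bottom-out hint reveals the answer, or submit answers rapidly and systematically until check_answer returns correct.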
The research presented in this dissertation was conducted in classrooms using a new Cognitive Tutor curriculum for middle school mathematics (Koedinger 2002), in two suburban school districts near Pittsburgh. The students participating in these studies were in the 7th-9th grades (predominantly 12-14 years old). In order to guarantee that students were familiar with the Cognitive Tutor curriculum, and how to use the tutors (and presumably how to game the system if they wanted to), all studies were conducted in the Spring semester, after students had already been using the tutors for several months.

Figure 1-1: The student has made an error associated with a misconception, so they receive a buggy message (top window). The student's answer is labeled in red, because it is incorrect (bottom window).

Figure 1-2: The last stage of a multi-stage hint: The student labels the graph's axes and plots points in the left window; the tutor's estimates of the student's skills are shown in the right window; the hint window (superimposed on the left window) allows the tutor to give the student feedback. Other windows (such as the problem scenario and interpretation questions window) are not shown.

Figure 1-3: Conceptual instruction was given via PowerPoint with voice-over, in the studies presented within this dissertation.

Effectiveness of Existing Cognitive Tutors

It is important, before discussing how some students succeed less well in Cognitive Tutors than others, to remember that Cognitive Tutors are an exceptionally educationally effective type of learning environment overall. Cognitive Tutors have been validated to be highly effective across a wide variety of educational domains and studies. To give a few examples, a Cognitive Tutor for the LISP programming language achieved a learning gain almost two standard deviations better than an unintelligent interactive learning environment (Corbett 2001); a Cognitive Tutor for Geometry proofs resulted in test scores a letter grade higher than students learning about Geometry proofs in a traditional classroom (Anderson, Corbett, Koedinger, and Pelletier 1995); and an Algebra Cognitive Tutor has been shown in a number of studies conducted nationwide not only to lead to better scores on the Math SAT standardized test than traditional curricula (Koedinger, Anderson, Hadley, and Mark 1997), but also to result in a higher percentage of students choosing to take upper-level mathematics courses (Carnegie Learning 2005). In recent years, the Cognitive Tutor curricula have come into use in an increasing percentage of U.S. high schools: about 6% of U.S. high schools as of the 2004-2005 school year.

Hence, the goal of the research presented here is not to downgrade in any way the effectiveness of Cognitive Tutors. Cognitive Tutors are one of the most effective types of curricula in existence today, across several types of subject matter. Instead, within this dissertation I will attempt to identify a direction that may make Cognitive Tutors even better. A majority of students use Cognitive Tutors thoughtfully, and have excellent learning gains; a minority, however, use tutors less effectively, and learn less well. The goal of the research presented here is to improve the tutors for the students who are less well-served by existing tutoring systems, while minimally affecting the learning experience of students who already use tutors appropriately.
It is worth remembering that students game the system in a variety of different types of learning environments, not just in Cognitive Tutors. Though I do not directly address how gaming affects student learning in these systems, or how these systems should adapt to gaming, it will be a valuable area of future research to determine how this thesis's findings transfer from Cognitive Tutors to other types of interactive learning environments.

Studies

The work reported in this thesis is composed of three classroom studies, multiple iterations of the development of a system to automatically detect gaming, analytic work, and the design and implementation of a system to adapt to when students game.

The first study (Study One) took place in the Spring of 2003. In Study One, I combined data from human observations and pre-test/post-test scores, to determine what student behaviors are most associated with poorer learning, finding that gaming the system is particularly associated with poorer learning (Chapter 2). Data from this study was used to create the first gaming detector (Chapter 3); in developing the gaming detector, I determined that gaming split into two automatically distinguishable categories of behavior, associated with different learning outcomes (Chapter 3). Data from Study One was also useful for developing first hypotheses as to what characteristics and attitudes were associated with gaming (Chapter 4).

The second study (Study Two) took place in the Spring of 2004. In Study Two, I analyzed what student characteristics and attitudes are associated with gaming (Chapter 4). I also replicated our earlier result that gaming is associated with poorer learning (Chapter 2), and demonstrated that our human observations of gaming had good inter-rater reliability (Chapter 2). Data from Study Two was also used to refine our detector of gaming (Chapter 3).

The third study (Study Three) took place in the Spring of 2005. In Study Three, I deployed a re-designed tutor lesson that incorporated an interactive agent designed to both reduce gaming and mitigate its effects (Chapter 5). I also gathered further data on which student characteristics and attitudes are associated with gaming (Chapter 4), using this data in combination with data from Study Two to develop a profile of gaming students (Chapter 4). Finally, data from Study Three was used in a final iteration of gaming detector improvement (Chapter 3).

Chapter Two
Gaming the System and Learning

In this chapter, I will present two studies which provide evidence on the relationship between gaming the system and learning. Along the way, I will present a method for collecting quantitative observations of student behavior as students use intelligent learning environments in class, adapted from methods used in the off-task behavior and behavior modification literatures, and consider how this method's effectiveness can be amplified with machine learning.

Study One

By 2003 (when the first study reported in this dissertation was conducted), gaming had been repeatedly documented, and had inspired the re-design of intelligent tutoring systems both at Carnegie Mellon University/Carnegie Learning (documented later in Aleven 2001, and Murray and vanLehn 2005) and at the University of Massachusetts (documented later in Beck 2005). Despite this, there was not yet any published evidence that gaming was associated with poorer learning.
In Study One, I investigate what learning outcomes are associated with gaming, comparing these outcomes to the learning outcomes associated with other behaviors. In particular, I compare the hypothesis that gaming will be specifically associated with poorer learning to Carroll's Time-On-Task hypothesis (Carroll 1963; Bloom 1976). Under Carroll's Time-On-Task hypothesis, the longer a student spends engaging with the learning materials, the more opportunities the student has to learn. Therefore, if a student spends a greater fraction of their time off-task (engaged in behaviors where learning from the material is not the primary goal) [1], they will spend less time on-task, and learn less. If the Time-On-Task hypothesis were the main reason why off-task behavior reduces learning, then any type of off-task behavior, including talking to a neighbor or surfing the web, should have the same (negative) effect on learning as gaming does.

[1] It is possible to define on-task as looking at the screen, in which case gaming the system is viewed as an on-task behavior. Of course, the definition of on-task depends on what one considers the student's task to be; I do not consider just looking at the screen to be that task.

Methods

I studied the relationship between gaming and learning in a set of 5 middle-school classrooms at 2 schools in the Pittsburgh suburbs. Student ages ranged from approximately 12 to 14. As discussed in Chapter 1, the classrooms studied were taking part in the development of a new 3-year Cognitive Tutor curriculum for middle school mathematics. Seventy students were present for all phases of the study (other students, absent during one or more days of the study, were excluded from analysis).

I studied these classrooms during the course of a short (2 class period) Cognitive Tutor lesson on scatterplot generation and interpretation; this lesson is discussed in detail in Appendix A. The day before students used the tutoring software, they viewed a PowerPoint presentation giving conceptual instruction (shown in Chapter 1).

I collected the following sources of data to investigate gaming's relationship to learning: a pre-test and post-test to assess student learning; quantitative field observations to assess each student's frequency of different behaviors; and students' end-of-course test scores (which incorporated both multiple-choice and problem-solving exercises) as a measure of general academic achievement [2]. We also noted each student's gender, and collected detailed log files of the students' usage of the Cognitive Tutor software. The pre-test was given after the student had finished viewing the PowerPoint presentation, in order to study the effect of the Cognitive Tutor rather than studying the combined effect of the declarative instruction and Cognitive Tutor. The post-test was given at the completion of the tutor lesson. The pre-test and post-test were drawn from prior research into tutor design in the tutor's domain area (scatterplots), and are discussed in detail in Appendix B.

The quantitative field observations were conducted as follows: Each student's behavior was observed a number of times during the course of each class period, by one of two observers. I chose to use outside observations of behavior rather than self-report in order to interfere minimally with the experience of using the tutor; I was concerned that repeatedly halting the student during tutor usage to answer a questionnaire (which was done to assess motivation by de Vicente and Pain (2002)) might affect both learning and on/off-task behavior.
In order to investigate the relative impact of gaming the system as compared to other types of off-task behavior, the two observers coded not just the frequency of off-task behavior, but its nature as well. This method differs from most past observational studies of on- and off-task behavior, where the observer coded only whether a given student was on-task or off-task (Lahaderne 1968; Karweit and Slavin 1982; Lloyd and Loper 1986; Lee, Kelly, and Nyre 1999). The coding scheme consisted of six categories:

1. on-task -- working on the tutor
2. on-task conversation -- talking to the teacher or another student about the subject material
3. off-task conversation -- talking about anything other than the subject material
4. off-task solitary behavior -- any behavior that did not involve the tutoring software or another individual (such as reading a magazine or surfing the web)
5. inactivity -- for instance, the student staring into space or putting his/her head down on the desk for the entire 20-second observation period
6. gaming the system -- inputting answers quickly and systematically, and/or quickly and repeatedly asking for help until the tutor gives the student the correct answer

[2] We were not able to obtain end-of-course test data for one class, due to that class's teacher accidentally discarding the sheet linking students to code numbers.

In order to avoid bias towards more interesting or dramatic events, the coder observed the set of students in a specific order determined before the class began, as in Lloyd and Loper (1986). Any behavior by a student other than the student currently being observed was not coded. A total of 563 observations were taken (an average of 70.4 per class session), with an average of 8.0 observations per student; some variation was due to different class sizes and to students arriving to class early or leaving late. Each observation lasted for 20 seconds; if a student was inactive for the entire 20 seconds, the student was coded as being inactive. If two distinct behaviors were seen during an observation, only the first behavior observed was coded. In order to avoid affecting the current student's behavior if they became aware they were being observed, the observer viewed the student out of peripheral vision while appearing to look at another student. In practice, students became comfortable with the presence of the observers very quickly, as evinced by the fact that we saw students engaging in the entire range of studied behaviors.

The two observers observed one practice class period together before the study began. In order to avoid alerting a student that he or she was currently being observed, the observers did not observe any student at the same time. Hence, for this study, we cannot compare the two observers' assessments of the exact same time-slice of a student's behavior, and thus cannot directly compute a traditional measure of inter-rater reliability. The two observers did conduct simultaneous observations in Study Two, and I will present an inter-rater reliability measure for that study.

Results

Overall Results

The tutor was, in general, successful. Students went from 40% on the pre-test to 71% on the post-test, which was a significant improvement, F(1,68)=7.59, p
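To make the use of these observation codes concrete, the sketch below (hypothetical field names and toy data; not the dissertation's actual analysis code) shows how each student's observations could be aggregated into per-student behavior frequencies, the quantities that the analyses in this chapter relate to pre-test/post-test learning. Under the Time-On-Task hypothesis, all off-task categories should be associated with poorer learning to a similar degree; the gaming hypothesis predicts a specifically stronger negative association for gaming.

```python
# Illustrative sketch only: hypothetical field names and toy data, not the
# dissertation's actual analysis code.
from collections import Counter

# One record per 20-second observation: (student_id, behavior_code), where the
# behavior code is one of the six categories listed above.
observations = [
    ("s01", "on-task"), ("s01", "gaming the system"), ("s01", "on-task"),
    ("s02", "off-task conversation"), ("s02", "on-task"), ("s02", "on-task"),
]

def behavior_frequencies(observations):
    """Fraction of each student's observations falling into each behavior category."""
    counts = {}
    for student, behavior in observations:
        counts.setdefault(student, Counter())[behavior] += 1
    return {
        student: {behavior: n / sum(c.values()) for behavior, n in c.items()}
        for student, c in counts.items()
    }

frequencies = behavior_frequencies(observations)
print(frequencies["s01"].get("gaming the system", 0.0))  # ~0.33 of s01's observations

# These per-student frequencies can then serve as predictors of post-test score
# (controlling for pre-test), to ask whether gaming predicts poorer learning
# beyond what time spent off-task alone would explain.
```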