Towards Improving Programming Habits to Create Better Computer Science Course Outcomes

Jaime Spacco
Computer Science Department
Knox College
Galesburg, IL 61401
[email protected]

Davide Fossati
Computer Science Department
Carnegie Mellon University in Qatar
P.O. Box 24866
Doha, Qatar
[email protected]

John Stamper, Kelly Rivers
Human-Computer Interaction Institute
Carnegie Mellon University
Pittsburgh, PA 15213
[email protected], [email protected]

ABSTRACT
We examine a large dataset collected by the Marmoset system in a CS2 course. The dataset gives us a richly detailed portrait of student behavior because it combines automatically collected program snapshots with unit tests that can evaluate the correctness of all snapshots. We find that students who start earlier tend to earn better scores, which is consistent with the findings of other researchers. We also detail the overall work habits exhibited by students. Finally, we evaluate how students use release tokens, a novel mechanism that provides feedback to students without giving away the code for the test cases used for grading, and gives students an incentive to start coding earlier. We find that students seem to use their tokens quite effectively to acquire feedback and improve their project score, though we do not find much evidence suggesting that students start coding particularly early.

Categories and Subject Descriptors
K.3.2 [Computers and Education]: Computer and Information Sciences Education—Computer Science Education

General Terms
Human Factors, Measurement

Keywords
computer science education, CS2, novice programmers

1. INTRODUCTION
In the United States, recent graduates with Computer Science degrees can expect high starting salaries [15] and excellent job prospects [14]. However, students earned about the same number of Computer Science Bachelors’ in 2008
as in the mid 1980s [4], despite large increases in both overall college enrollment and the overall number of Bachelors’ degrees awarded. The job market seems to be demanding students with the knowledge and skills taught in undergraduate Computer Science departments, yet students have not flooded into the CS major.

In fact, supplying the growing technological job market with knowledgeable young computer scientists has proven quite difficult. Attrition rates in undergraduate CS programs are high [2], and there’s evidence of a gap between what faculty think students should be able to do and what students actually can do [12]. High school students do not seem to know what computer science is [5], let alone show much interest in majoring in the discipline. Men earn about 80% of undergraduate computer science degrees in the United States [4], and this large gender gap is possibly keeping talented women out of a dynamic, exciting major.

Given that many introductory CS students struggle with learning to program, and that most attrition happens by the second year of study [2], helping students learn to program seems like a reasonable way to keep students in the major. There have been excellent recent efforts at collecting data to help us better understand novice compilation behavior [10, 17, 18] and novice programming habits [6, 13, 9]. This empirical work has allowed us to begin addressing key questions such as “How much time do students spend on programming assignments?” and “How much benefit do students get from starting early?”. However, our understanding of novice programmers is still quite incomplete. We need a deep and nuanced understanding of how students learn to program before we can know where to make improvements.

This paper makes several contributions:

• We combine passively collected snapshots with unit tests to build a richly detailed portrait of student work habits.

• We find a statistically significant correlation between starting an assignment early and earning a higher score. This supports the conclusions of prior studies [8, 1, 7].

• We explore students’ work habits in detail, examining what hours of the day students work, how much work is done the day before and the day of the deadline, and the total amount of time spent coding on each project.


course                    CS-2
semester                  Spring 2006
language                  Java
# projects                6
# students                96
# snapshots               55,070
# compilable snapshots    37,401

Table 1: Overview of the Marmoset dataset.

• We examine students’ use of release tokens, first introduced by the Marmoset data collection and automated testing system [16], showing evidence that suggests the tokens help students achieve higher scores without giving away the code to all of the test cases.

2. THE MARMOSET DATASET
In this work, we analyze data collected by the Marmoset system [16] in Spring 2006 during a CS2 course at the University of Maryland. Marmoset is an automated testing and data collection framework that combines three key components: a client-side data collection plugin for students’ IDEs, a central server for testing student submissions against unit tests, and a novel token-based incentive system for revealing select feedback about the test cases (we give more details about release tokens in Section 2.2). The combination of fine-grained snapshots and detailed unit tests is particularly important, as we can run every snapshot against the unit tests for that project and pinpoint the precise set of changes that allowed a student to pass challenging test cases.

2.1 Snapshot Collection with the Eclipse Plugin
Marmoset uses an Eclipse plugin to capture snapshots of student files at each save event. Because Eclipse performs incremental compilation, the snapshots are not necessarily part of a traditional compile-edit cycle, producing a dataset qualitatively different than the compilation-based snapshot datasets collected by other researchers [10, 17, 13]. However, like these other datasets, our dataset is file-grained. Table 1 breaks down the overall statistics of the Marmoset snapshots.

2.2 Release Testing
Release tests were designed to achieve two goals: 1) Provide feedback about the correctness of student code without giving away the code to the test cases, and 2) Give students an incentive to start coding early. The first goal is to prevent students from simply “coding to the tests”, while the second goal is a reaction to the widely held belief among faculty that students simply do not start coding early enough to finish the project.

Students using Marmoset can upload their code to a central server, which executes their code against a series of unit tests. These test cases are partitioned into three categories:

• Public tests: The full source code of public tests is given to students with the project. Public tests are executed against all submissions to the server, and the results are always visible to students. All projects in this dataset had a couple of public tests to help students get started testing their code.

• Release tests: The source code of release tests is never revealed to students. Release tests are run against every submission, although students must spend a release token to receive feedback about the results of release testing. Students have three release tokens, and each token takes 24 hours to regenerate after use. In order to use a token, students must upload their code to the Marmoset server, authenticate themselves through a web interface, and then choose to spend a token. Using a release token reveals the number of release tests passed and failed, as well as additional feedback about the first two failed tests, such as the names of the test cases and the stack trace of any exceptions generated by the test.

• Secret tests: Neither the source code for the tests nor the outcome of running the test cases is revealed to students. Secret tests are equivalent to the de facto method of grading programming assignments for instructors who do not provide students with any pre-deadline feedback.

Theoretically, the scarcity of the release tokens should incentivize students to start working early, and to test their code thoroughly before spending a token. In our experiments, we examine students’ use of release tokens over time.
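To make the token policy concrete, the following Python sketch models the three-token budget with 24-hour regeneration described above. It is our own illustration, not Marmoset's actual implementation; the class name, timestamps, and usage are hypothetical.

from datetime import datetime, timedelta

class ReleaseTokenBudget:
    """Illustrative model (not Marmoset's code) of the release-token policy:
    a student holds up to 3 tokens, and each spent token regenerates 24 hours
    after it was used."""

    def __init__(self, capacity=3, regen=timedelta(hours=24)):
        self.capacity = capacity
        self.regen = regen
        self.spent_at = []  # timestamps of tokens still on cooldown

    def available(self, now):
        # A spent token becomes usable again once its 24-hour cooldown expires.
        self.spent_at = [t for t in self.spent_at if now - t < self.regen]
        return self.capacity - len(self.spent_at)

    def spend(self, now):
        if self.available(now) == 0:
            return False  # no token: the submission is still tested, but results stay hidden
        self.spent_at.append(now)
        return True

# Hypothetical usage: three quick uses exhaust the budget; a fourth attempt an
# hour later is refused, but a token is available again the next day.
budget = ReleaseTokenBudget()
start = datetime(2006, 3, 1, 16, 0)
for hours in (0, 1, 2, 3, 25):
    ok = budget.spend(start + timedelta(hours=hours))
    print(hours, "feedback revealed" if ok else "no tokens available")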

2.3 Other Marmoset Details
The Marmoset snapshots used in this work are a rich, powerful resource for helping us understand novice programmers. However, the data have a number of characteristics, as well as key limitations, which should be kept in mind when assessing this work:

• There have not been any controlled studies of the release token mechanism, so it is very difficult to draw any strong conclusions about the effectiveness of this system. The lack of controlled studies is due to logistical and policy hurdles rather than any fundamental limitation of the system.

• The automated grading makes extensive use of unit tests, which means that students are programming to an API rather than writing code from scratch. We cannot use this data to evaluate students’ design abilities, or to understand how they would program when beginning with a blank screen.

• The dataset is from a CS2 course, so the students are not true novices. Attrition rates are typically lower in CS2 than in CS1, so this data may be more useful for determining what helps students perform better. It is worth noting that Marmoset has been used before in CS1 courses, though we do not have access to the data collected.

• The dataset only includes scores for the functional correctness of a project as computed by the unit tests. Each assignment was also assessed manually for style, but we do not have access to those scores, nor do we have the students’ exam scores. Thus, we are only able to consider the students’ ability to write functionally correct code.

• The data collected only captures the amount of time students actively spend programming. We are unable
to measure the time spent reading the textbook, outlining the program on paper or the whiteboard, reading online resources, or any of the other work that contributes to a successful programming assignment. This limitation is not unique to Marmoset; all automatic snapshot capture systems are limited to data they can easily collect.

3. RELATED WORK
Some work has already been done in analyzing large corpuses of student data in order to find useful trends and correlations. For example, several researchers have examined compilation behavior, in order to determine how students react to errors and which errors are most common [10, 9]. Correlations have also been found between students’ relationships with errors and their midterm scores, which can be used to identify at-risk students [17]. Some researchers look more at how student behaviors change over time, using Bayesian learning models [11] and learning analytics through visualizations [3].

Closer to our area of study, analysis has also been done on student submission behavior and how it affects learning. Some researchers have conducted work very similar to ours, using submission data from Web-CAT to find that students who submitted more and earlier were less likely to fail the assignment [1, 7]. In addition, it has also been shown that students who submit late on the first assignment are more likely to continue submitting late and do more poorly in the class [8]. Fenwick et al. have also found trends suggesting that students who program in more ‘sessions’ (doing more incremental work) are more likely to achieve higher grades [9]. We aim to replicate and expand on this work, while also more closely examining the effects of hidden tests and release tokens.

4. EXPERIMENTS
The Marmoset dataset combines snapshots with unit tests to give researchers a detailed view into student work patterns (i.e. when they are working, what they are changing) and the correctness of their code (i.e. which test cases they are passing). Such fine-grained data allows us to see the precise snapshot where students have a breakthrough in the correctness of their program, as well as what they were doing before and what they do after. By grouping the snapshots into estimated “work sessions”, we can look at factors within a work session that increase or decrease the score achieved at the end of the session.

4.1 Overview of the Data
The dataset has over 37,000 compilable snapshots from 96 students over six CS2 projects. Table 2 shows the mean and median scores for each project, as well as the number of students with at least one submission for each project (submitting to the server is required for credit on a project; the snapshots are purely for research and automated backups). Students receive zero points if they fail to make a submission for a project; the true mean and median values would be slightly lower if we included those default scores of zero.

Project    median    mean     # students w/ a submission
1          100%      90.8%    90
2          82.9%     60.2%    89
3          95.0%     82.2%    90
4          92.2%     82.2%    85
5          100%      85.3%    84
6          90.0%     80.1%    81

Table 2: For each project, we report the mean and median scores, as well as the number of students (out of 96) who made at least one submission.

Figure 1 shows the distribution of snapshots over the hours of the day. We see a lot of activity between 4pm and 6pm (16:00 to 18:00). We initially thought this was because deadlines were always at 6pm and we were seeing a rush to complete the project; however, we looked at only the snapshots that occurred more than 24 hours before the deadline, and found essentially the same shape as in Figure 1. Thus, if students prefer working 4pm to 6pm, then setting the deadline a couple of hours later might allow students to work at their preferred time without the added pressure of an impending deadline.

Figure 1: Number of snapshots distributed over hours of the day. Note the spike in activity between 4pm and 6pm.

Figure 2 shows the number of snapshots per day in the 10 days leading up to each project deadline (students typically have 10 days to complete a project). The bar for each day is broken up into positive and neutral or negative snapshots. Positive snapshots increase the number of passing test cases, and hence the score, from the previous snapshot, while neutral/negative snapshots either make no improvements, or decrease the score. We see a lot of improvements made the day before, and the day of, the project deadline, which suggests that many students are doing substantial work in the 48 hours before the deadline. This may mean that students are not, in fact, starting work early enough. We also see more work and more improvements 5 days before the deadline than we do 3 or 4 days before. We believe this occurs because four out of six deadlines were on a Thursday, meaning students worked more on Saturdays than on Sundays.
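As an illustration of this classification, the following Python sketch labels a sequence of snapshot scores as positive, neutral, or negative deltas. It is a minimal sketch with hypothetical data, not the code used for the analysis in this paper.

def label_snapshot_deltas(scores):
    """Label each snapshot after the first by the change in its test score
    relative to the previous snapshot: positive, neutral, or negative."""
    labels = []
    for prev, curr in zip(scores, scores[1:]):
        if curr > prev:
            labels.append("positive")
        elif curr < prev:
            labels.append("negative")
        else:
            labels.append("neutral")
    return labels

# Hypothetical sequence of per-snapshot scores (fraction of unit tests passed).
print(label_snapshot_deltas([0.0, 0.2, 0.2, 0.5, 0.4, 0.9]))
# -> ['positive', 'neutral', 'positive', 'negative', 'positive']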

Figure 2: Number of snapshots per day leading up to the deadline. Positive snapshots are snapshots that increase the score from the previous snapshot (i.e. a positive delta), while negative or neutral snapshots have an identical or lower score than the previous snapshot.

Figure 3 compares each student’s start time (i.e. the time before the deadline of their first snapshot, to the nearest hour) to the final score they ultimately achieved. The relationship is statistically significant, and gives a clear indication that starting an assignment earlier correlates with better scores. Note that the test cases may not be exhaustive, and that there may be differences in quality or correctness between the programs that earned 100% that these test cases are unable to identify.

4.2 Work Sessions
Marmoset collects a snapshot whenever students save their files, so snapshots are not independent events. Two students may save files at very different intervals due to personal idiosyncrasies that have nothing to do with their programming ability. Thus we have to take care when aggregating information about snapshots.

In order to get a more accurate picture of student work habits, we have clustered each student’s snapshots into estimated “work sessions” using the very rough heuristic that snapshots less than 20 minutes apart are part of the same work session. We have not observed the work habits of any of the students in this data set, so we lack the understanding of their behavior needed to validate this heuristic. However, we expect that even an approximate clustering of the snapshots will give us a more accurate picture of student work patterns than treating each snapshot independently would. Figure 4 shows the relationship between the estimated time spent programming in each session and the score differential (i.e. the difference in the score between the first and last snapshot of a work session). There is a statistically significant relationship, though the correlation is fairly weak.
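The clustering heuristic itself is simple; the following Python sketch shows one possible implementation of the 20-minute rule. It is our own illustration of the heuristic, not the exact code used in this study, and the example timestamps are hypothetical.

from datetime import datetime, timedelta

def cluster_sessions(snapshot_times, gap=timedelta(minutes=20)):
    """Group one student's snapshot timestamps into estimated work sessions:
    consecutive snapshots less than `gap` apart belong to the same session."""
    sessions = []
    for t in sorted(snapshot_times):
        if sessions and t - sessions[-1][-1] < gap:
            sessions[-1].append(t)   # close enough: continue the current session
        else:
            sessions.append([t])     # a 20-minute (or longer) gap starts a new session
    return sessions

# Hypothetical timestamps: two saves 10 minutes apart, then one several hours later.
times = [datetime(2006, 3, 1, 15, 0), datetime(2006, 3, 1, 15, 10),
         datetime(2006, 3, 1, 20, 0)]
print(len(cluster_sessions(times)))  # -> 2 estimated work sessions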

Figure 5 shows the relationship between the total estimated time spent programming on each project and the final score on the project. The relationship, while statistically significant, is relatively weak, and could simply indicate that students who spent only a few minutes on an assignment end up with extremely low scores.

Figure 3: The X-axis shows the time of the first snapshot (i.e. when the student began writing the program), while the Y-axis shows the final score the student eventually achieved.

4.3 Release Tokens
Students using Marmoset are given a set of public unit tests, while other unit tests, called release tests, are stored on the server. Students can upload their code to the server, spend a release token, and see the number of release tests passed and failed, as well as additional feedback about the first two release tests which failed. Furthermore, students only have 3 release tokens, and each token takes 24 hours to regenerate.

The goals of release tests are to give helpful feedback without giving away the test cases, and to create an incentive for students to start programming assignments early. We find, however, that students only used an average of 2.4 release tokens per assignment, and that about 55% of the total release tokens used were deployed on the day of the deadline, or the preceding day. These data suggest that students are using release tests to get feedback, but are not necessarily starting early enough to use more than three release tokens.
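As an illustration only, the following sketch computes these two aggregates over a hypothetical token-use log; the records and field names are ours, not the Marmoset schema, and the "last two days" figure is approximated here with a 48-hour window.

from datetime import datetime, timedelta

# Hypothetical log of release-token uses: (student, project, used_at, deadline).
token_uses = [
    ("s1", "p1", datetime(2006, 2, 13, 20, 0), datetime(2006, 2, 14, 18, 0)),
    ("s1", "p1", datetime(2006, 2, 14, 15, 0), datetime(2006, 2, 14, 18, 0)),
    ("s2", "p1", datetime(2006, 2, 10, 11, 0), datetime(2006, 2, 14, 18, 0)),
]

# Average number of tokens used per (student, project) pair with any token use.
pairs = {(student, project) for student, project, _, _ in token_uses}
avg_tokens_per_assignment = len(token_uses) / len(pairs)

# Fraction of tokens spent close to the deadline (within a 48-hour window).
late = sum(1 for _, _, used, due in token_uses if due - used <= timedelta(hours=48))
late_fraction = late / len(token_uses)

print(avg_tokens_per_assignment, late_fraction)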

Figure 6 shows the relationship between the number of release tokens used per work session (maximum of 3) and the difference between the score at the beginning and end of the session. The relationship is statistically significant, though the model does not explain much of the variance.

Table 3 splits up the work sessions by whether they use at least one release token, then by whether they increase (positive), decrease (negative), or have no effect (neutral) on the score. In other words, a positive work session means that the score for the snapshot at the end of the session is higher than the score for the snapshot at the beginning. The differences are statistically significant (p < 0.001) according to a chi-squared analysis.


Figure 4: Relationship between the length of estimated work sessions and the change in project score from the start to the end of the session.

                   positive   neutral   negative   total
used >= 1 token         515       470         36    1021
no tokens used          959      2129        175    3263
total                  1474      2599        211    4284

Table 3: Work sessions divided up, first by whether they increase (positive), decrease (negative), or have no effect (neutral) on the score, then by whether at least one release token was used in the session.
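The contingency-table test can be reproduced directly from the counts in Table 3. The following Python sketch (assuming SciPy is available) is our own illustration, not the original analysis script.

from scipy.stats import chi2_contingency

# Counts from Table 3. Rows: sessions using >= 1 token vs. no tokens;
# columns: positive, neutral, negative score change.
observed = [
    [515, 470, 36],
    [959, 2129, 175],
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.3g}")  # p is far below 0.001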

5. CONCLUSIONS AND FUTURE WORK
Combining automatically collected program snapshots with unit tests has given us a detailed picture of student work patterns, revealing some obvious facts (e.g. students do very little work between 4am and 8am), as well as some less obvious facts (e.g. students work more on Saturdays than on Sundays and more between 4pm and 6pm than during any other 2-hour block of time).

Unsurprisingly, we find that students do the majority of their work within 48 hours of the deadline, and that students who start earlier tend to earn better grades, which supports the findings of previous researchers [8, 1, 7]. We found a weak relationship between the total amount of time coding and the final score, though this may simply reflect the fact that even the best students must spend a minimum amount of time coding the solution.

Figure 5: Relationship between the total estimated time spent writing code and the final score on the project.

We also examined the use of release tests in more detail. We found that students used an average of 2.4 release tokens per project, that work sessions that used one or more release tokens were statistically more likely to improve the student’s score than work sessions that did not use a release token, and that using additional release tokens in a work session tends to improve the session. This evidence suggests that students are effectively using their release tokens to obtain feedback and improve their scores. In turn, this suggests that the release tokens are indeed providing valuable feedback to students without revealing the code to the test cases. This achieves the first design goal of release tests: to provide limited but useful feedback to students. Without a controlled study of release tokens, it will be difficult to show that the release tokens are achieving the second design goal: influencing students to start coding earlier. The data in this paper, namely that students use an average of 2.4 tokens per project and that 55% of release tokens are used in the last two days before a deadline, suggest that students are not starting their projects particularly early. In fact, it’s possible that the tokens have the unanticipated side effect of discouraging students from writing their own test cases, since even procrastinators who start coding the same day that the project is due know that they have at least 3 tokens to use. We know that the instructors who regularly use the system eventually decreased the number of release tokens available from 3 to 2 in order to further discourage procrastination.

In the future, we hope to conduct a controlled study of release tokens, where an experimental group has release tests and release tokens, while the control group has secret tests. This study would be a tremendous help in understanding how students respond to the incentives offered by release tokens.

6. ACKNOWLEDGMENTS
The authors would like to thank Bill Pugh and Nelson Padua-Perez for collecting the data used for this study in their course, and the Pittsburgh Science and Learning Center’s LearnLab Summer School for bringing the authors together.

Figure 6: Relationship between release tokens used in a work session and the score differential of the work session.

7. REFERENCES
[1] A. Allevato, M. Thornton, S. H. Edwards, and M. A. Pérez-Quiñones. Mining data from an automated grading and testing system by adding rich reporting capabilities. In R. S. J. de Baker, T. Barnes, and J. E. Beck, editors, EDM, pages 167–176. www.educationaldatamining.org, 2008.

[2] T. Beaubouef and J. Mason. Why the high attrition rate for computer science students: some thoughts and observations. SIGCSE Bull., 37(2):103–106, June 2005.

[3] P. Blikstein. Using learning analytics to assess students’ behavior in open-ended programming tasks. In Proceedings of the 1st International Conference on Learning Analytics and Knowledge, LAK ’11, pages 110–116, New York, NY, USA, 2011. ACM.

[4] N. S. Board. Science and engineering indicators digest 2012. 2012.

[5] L. Carter. Why students with an apparent aptitude for computer science don’t choose to major in computer science. SIGCSE Bull., 38(1):27–31, Mar. 2006.

[6] S. H. Edwards. Using software testing to move students from trial-and-error to reflection-in-action. SIGCSE Bull., 36(1):26–30, Mar. 2004.

[7] S. H. Edwards, J. Snyder, M. A. Perez-Quinones, A. Allevato, D. Kim, and B. Tretola. Comparing effective and ineffective behaviors of student programmers. In Proceedings of the fifth international workshop on Computing education research workshop, ICER ’09, pages 3–14, New York, NY, USA, 2009. ACM.

[8] N. Falkner and K. Falkner. A fast measure for identifying at-risk students in computer science. In Proceedings of the ninth annual international conference on International computing education research, pages 55–62. ACM, 2012.

[9] J. B. Fenwick, Jr., C. Norris, F. E. Barry, J. Rountree, C. J. Spicer, and S. D. Cheek. Another look at the behaviors of novice programmers. In Proceedings of the 40th ACM technical symposium on Computer science education, SIGCSE ’09, pages 296–300, New York, NY, USA, 2009. ACM.

[10] M. C. Jadud. A first look at novice compilation behaviour using BlueJ. Computer Science Education, 15:1–25, 2005.

[11] J. Kasurinen and U. Nikula. Estimating programming knowledge with Bayesian knowledge tracing. In Proceedings of the 14th annual ACM SIGCSE conference on Innovation and technology in computer science education, ITiCSE ’09, pages 313–317, New York, NY, USA, 2009. ACM.

[12] M. McCracken, V. Almstrum, D. Diaz, M. Guzdial, D. Hagan, Y. B.-D. Kolikant, C. Laxer, L. Thomas, I. Utting, and T. Wilusz. A multi-national, multi-institutional study of assessment of programming skills of first-year CS students. SIGCSE Bull., 33(4):125–180, Dec. 2001.

[13] C. Murphy, G. Kaiser, K. Loveland, and S. Hasan. Retina: helping students and instructors based on observed programming activities. In Proceedings of the 40th ACM technical symposium on Computer science education, SIGCSE ’09, pages 178–182, New York, NY, USA, 2009. ACM.

[14] P. C. of Advisors on Science and Technology. Designing a digital future: Federally funded research and development in networking and information technology. 2013.

[15] N. A. of Colleges and Employers. NACE January 2013 salary survey. 2013.

[16] J. Spacco, J. Strecker, D. Hovemeyer, and W. Pugh. Software repository mining with Marmoset: an automated programming project snapshot and testing system. In Proceedings of the 2005 international workshop on Mining software repositories, MSR ’05, pages 1–5, New York, NY, USA, 2005. ACM.

[17] E. S. Tabanao, M. M. T. Rodrigo, and M. C. Jadud. Predicting at-risk novice Java programmers through the analysis of online protocols. In Proceedings of the seventh international workshop on Computing education research, ICER ’11, pages 85–92, New York, NY, USA, 2011. ACM.

[18] I. Utting, N. Brown, M. Kolling, D. McCall, and P. Stevens. Web-scale data gathering with BlueJ. In Proceedings of the ninth annual international conference on International computing education research, ICER ’12, pages 1–4, New York, NY, USA, 2012. ACM.
