
Data driven student feedback for Programming Intensive MOOCs · Coursera’s ML course [Moocshop, 2013] # Students. Intro CS . 1K . 10K . 20M . Efficient index for “code phrases”

Page 1

Jonathan Huang Stanford University

Data driven student feedback for Programming Intensive MOOCs

Towards global scale CS education

Leonidas Guibas Chris Piech Andy Nguyen

Page 2

Steve Jobs Stanford, 2005

“all of my working-class parents' savings were being spent on my college tuition… …the minute I dropped out [of college] I could stop taking the required classes that didn't interest me, and begin dropping in on the ones that looked interesting.”

Page 3

Course selection is better online

MOOC = Massive Open Online Courses

Page 4

Untapped potential

CS does not count towards high school math/science requirements in 36 of 50 states.

[Figure: two charts with values in thousands. One has a vertical axis from $0 to $40 over school years 1973-74 through 2013-14; the other has a vertical axis from 0 to 2000 for 2011, 2016, and 2021. Source: www.code.org, www.collegeboard.org]

Page 5

400 students 100,000 students

Stanford ML-class

Page 6

10 TAs 2,500 TAs (???)

Stanford ML-class

Page 7

Ease of global scale feedback on a spectrum

Short Response

Long Response

Multiple choice

Essay questions

Proofs

Today: Programming Assignments

Page 8

Feedback for Coding Assignments: Easy?

Linear Regression submission (Homework 1) for Coursera’s ML class:

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    theta = theta-alpha*1/m*(X'*(X*theta-y));
    J_history(iter) = computeCost(X, y, theta);
end

Test Inputs

Test Outputs

Correct / Incorrect?

Page 9

The “but it works!!” solution

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
m = length(y);
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    hypo = X*theta;
    newMat = hypo - y;
    trans1 = (X(:,1));
    trans1 = trans1';
    newMat1 = trans1 * newMat;
    temp1 = sum(newMat1);
    temp1 = (temp1 *alpha)/m;
    A = [temp1];
    theta(1) = theta(1) - A;
    trans2 = (X(:,2))' ;
    newMat2 = trans2*newMat;
    temp2 = sum(newMat2);
    temp2 = (temp2 *alpha)/m;
    B = [temp2];
    theta(2)= theta(2) - B;
    J_history(iter) = computeCost(X, y, theta);
end
theta(1) = theta(1);
theta(2)= theta(2);

Why??

Correctness: Good
Efficiency: Good
Style: Poor
Elegance: Poor

Better: theta = theta-(alpha/m) *X'*(X*theta-y)
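
The claim that the one-line update does the same work as the long solution can be checked on a toy problem. The following is a minimal pure-Python sketch, not the course's Octave code: the helper names (`residual`, `step_vectorized`, `step_per_coordinate`) and the tiny data set are assumptions made for illustration.

```python
# Pure-Python sketch of one gradient-descent step for linear regression.
# X is a list of rows, y a list of targets, theta a list of parameters.

def residual(X, y, theta):
    """Return X*theta - y, one entry per training example."""
    return [sum(x_j * t_j for x_j, t_j in zip(row, theta)) - y_i
            for row, y_i in zip(X, y)]

def step_vectorized(X, y, theta, alpha):
    """theta - (alpha/m) * X' * (X*theta - y), for any number of parameters."""
    m = len(y)
    r = residual(X, y, theta)
    grad = [sum(row[j] * r_i for row, r_i in zip(X, r))
            for j in range(len(theta))]
    return [t - alpha / m * g for t, g in zip(theta, grad)]

def step_per_coordinate(X, y, theta, alpha):
    """The same update written out by hand for exactly two parameters."""
    m = len(y)
    r = residual(X, y, theta)  # computed once, before either update
    temp1 = sum(row[0] * r_i for row, r_i in zip(X, r)) * alpha / m
    temp2 = sum(row[1] * r_i for row, r_i in zip(X, r)) * alpha / m
    return [theta[0] - temp1, theta[1] - temp2]

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 3.0]
theta0 = [0.0, 0.0]

a = step_vectorized(X, y, theta0, 0.1)
b = step_per_coordinate(X, y, theta0, 0.1)
print(a, b)
```

Both functions return the same parameters up to floating-point rounding, but only the first one keeps working when a third feature is added.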

Page 10

Let’s do this: for a class of 100,000, and in real time.

But… can’t require too much instructor effort to make this work for:

New programming problems

New courses

New programming languages

Page 11

We now have massive datasets…

Visualization of 40,000 implementations of linear regression submitted to Coursera’s ML course [Moocshop, 2013]

[Figure: chart with a “# Students” axis; labels: Intro CS, 1K, 10K, 20M]

Page 12

Codewebs Engine

Efficient index for “code phrases” of a MOOC dataset

Shared structure discovery amongst many student submissions

Applications such as bug finding (w/o execution) and MOOC-scale feedback

Results on real MOOC with > 1 million submissions

Page 14

First, an example application

Correct:

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    theta = theta-alpha*1/m*(X'*(X*theta-y));
    J_history(iter) = computeCost(X, y, theta);
end

Incorrect:

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    theta = theta-alpha*1/m*sum(X'*(X*theta-y));
    J_history(iter) = computeCost(X, y, theta);
end

Dear Lisa Simpson, consider the dimension of the expression X'*(X*theta-y) and what happens after you call sum on it…

Syntax based approach:

Attach this message to everyone containing that exact expression (covers 99 submissions)
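
The hint about dimensions can be made concrete. Below is a small pure-Python sketch (the `gradient` helper and the toy data are assumptions, not part of the course materials): the gradient has one entry per parameter, while summing it leaves a single number that is then subtracted from every parameter alike.

```python
def gradient(X, y, theta):
    """Return X' * (X*theta - y) as a list with one entry per parameter."""
    r = [sum(x * t for x, t in zip(row, theta)) - y_i
         for row, y_i in zip(X, y)]
    return [sum(row[j] * r_i for row, r_i in zip(X, r))
            for j in range(len(theta))]

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 3.0]
theta = [0.0, 0.0]
alpha, m = 0.1, len(y)

g = gradient(X, y, theta)   # two entries, one per parameter
s = sum(g)                  # a single scalar: the entries added together

# Correct update: each parameter moves along its own gradient entry.
correct = [t - alpha / m * g_j for t, g_j in zip(theta, g)]
# Extraneous-sum update: the same scalar is applied to every parameter,
# which is how a scalar combines with a vector in Octave.
buggy = [t - alpha / m * s for t in theta]

print(g, s)
print(correct, buggy)
```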

Page 15

The extraneous sum bug takes many forms…

theta = theta-alpha*1/m*sum(X'*(X*theta-y));

theta = theta-alpha*1/m*sum(((theta'*X')'-y)'*X);

theta = theta-alpha*1/m*sum(transpose(X*theta-y)*X);

(Easier) Output based approach:

Attach message to everyone who matched extraneous sum bug in unit test output (covers 1091 submissions)

Page 17

Codewebs approach to feedback

theta = theta-alpha*1/m*sum(X'*(X*theta-y));

Step 1: Find equivalent ways of writing buggy expression using Codewebs engine
Step 2: Write a thoughtful/meaningful hint or explanation
Step 3: Propagate feedback message to any submission containing equivalent expression

# submissions covered by single message:

Output based    1091
Codewebs        1208
Combined        1604

~47% improvement over just using an output based feedback system!!

Page 18

Abstract syntax tree representations

ASTs ignore:

• Whitespace
• Comments
• …

function A = warmUpExercise()
A = [];
A = eye(5);
endfunction

AST of A = eye(5):

ASSIGN
    IDENT (A)
    INDEX_EXP
        IDENT (eye)
        ARGUMENT_LIST
            CONST (5)
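
To see that a parser really discards whitespace and comments, here is a minimal sketch using Python's standard `ast` module as a stand-in for the Octave parser behind the talk (that substitution is an assumption; the node names differ from ASSIGN / IDENT / INDEX_EXP above).

```python
import ast

# Two spellings of the same assignment: extra spaces and a trailing comment.
src_a = "A = eye(5)"
src_b = "A   =   eye( 5 )   # identity matrix\n"

# ast.dump omits line and column attributes by default,
# so only the tree structure is compared.
tree_a = ast.dump(ast.parse(src_a))
tree_b = ast.dump(ast.parse(src_b))

same = tree_a == tree_b
print(tree_a)
print(same)
```

Because the two dumps are equal, an index keyed on the tree treats both submissions as the same code.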

Page 20

Indexing documents by phrases

term/phrase    document list
best           {1,3}
blue           {2,4,6}
bright         {7,8,10,11,12}
heat           {1,5,13}
kernel         {2,5,6,9,56}
sky            {1,2}
submarine      {2,3,4}
woes           {10,19,38}
yellow         {2,4}

The bright and blue butterfly hangs on the breeze…

We all something something yellow submarine…

“blue sky” “yellow submarine”

What basic queries should an AST search engine support?
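
As a reference point for the AST case, a text inverted index answers a multi-word query by intersecting posting lists. The sketch below reuses the table from this slide; the function name `phrase_candidates` is an assumption, and it returns only candidate documents because word order and adjacency are not checked.

```python
# Posting lists copied from the slide's table.
index = {
    "best": {1, 3},
    "blue": {2, 4, 6},
    "bright": {7, 8, 10, 11, 12},
    "heat": {1, 5, 13},
    "kernel": {2, 5, 6, 9, 56},
    "sky": {1, 2},
    "submarine": {2, 3, 4},
    "woes": {10, 19, 38},
    "yellow": {2, 4},
}

def phrase_candidates(index, phrase):
    """Return ids of documents that contain every term of the phrase."""
    postings = [index.get(term, set()) for term in phrase.split()]
    return set.intersection(*postings) if postings else set()

blue_sky = phrase_candidates(index, "blue sky")
yellow_submarine = phrase_candidates(index, "yellow submarine")
print(blue_sky, yellow_submarine)
```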

Page 21

Code Phrases

Subtrees and subforests of an AST

BINARY_EXP (*)
    POSTFIX (')
        IDENT (X)
    BINARY_EXP (-)
        IDENT (y)
        BINARY_EXP (*)
            IDENT (X)
            IDENT (theta)

Page 22

Code Phrases

Context within a larger subtree

BINARY_EXP (*)
    POSTFIX (')
        IDENT (X)
    BINARY_EXP (-)
        IDENT (y)
        BINARY_EXP (*)
            IDENT (X)
            IDENT (theta)

Page 23

Code Phrases

Context within a larger subtree

BINARY_EXP (*)
    POSTFIX (')
        IDENT (X)
    BINARY_EXP (-)
        IDENT (y)
        [replacement site]
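
The two kinds of code phrase shown so far can be enumerated mechanically. The following is a minimal sketch on a nested-tuple AST; the tuple encoding, the `HOLE` marker, and the function names are assumptions for illustration, and subforests are left out for brevity.

```python
# AST as nested tuples: (label, child, child, ...), mirroring the slide's tree.
TREE = ("BINARY_EXP(*)",
        ("POSTFIX(')", ("IDENT(X)",)),
        ("BINARY_EXP(-)",
         ("IDENT(y)",),
         ("BINARY_EXP(*)", ("IDENT(X)",), ("IDENT(theta)",))))

HOLE = ("<replacement site>",)

def subtrees(node):
    """Yield every complete subtree, one per node of the tree."""
    yield node
    for child in node[1:]:
        yield from subtrees(child)

def contexts(node):
    """Yield (context, removed) pairs: the tree with one subtree cut out."""
    for i, child in enumerate(node[1:], start=1):
        # Cut out this child entirely.
        yield node[:i] + (HOLE,) + node[i + 1:], child
        # Or cut out something deeper inside this child.
        for ctx, removed in contexts(child):
            yield node[:i] + (ctx,) + node[i + 1:], removed

all_subtrees = list(subtrees(TREE))
all_contexts = list(contexts(TREE))
print(len(all_subtrees), len(all_contexts))
```

The context drawn on this slide is the pair whose removed subtree is the inner `X * theta` multiplication.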

Page 25

The Codewebs index

Code phrase hash                            AST list
2ccf02adb1cbabfb347d3b5d0a05b249855a7583    {1,3}
b3bc37a318c2b895b3e644a12cfc6ebcfa5a06bd    {2,4,6}
b3353c96e2cee8ee6c3ba260e037a93ca0ba3a5e    {7,8,10,11,12}
2c01571626bf01338c8cdb15cf9d844d65f04645    {1,5,13}
313f48d3f5888afc5d5aa28ab1393d94661edd31    {2,5,6,9,56}
61d4bfccaa97cca2004102a297cc5acd281ea3a9    {1,2}
467b4d400aab42d3bf96a119c4620e74d6fe57b3    {2,3,4}
1ae95f6fa24bc25871cdc55cb472abdd68db93de    {10,19,38}

10 print “hello”
20 goto 10

10 for i=1:10
20 x = x+1
30 end

def buildIndex():
    for A in ASTs:
        for every code phrase x contained in A:
            Compute hashcode h[x]
            Insert A at h[x]

Very expensive, esp. for large ASTs!

Code phrases: subtrees, subforests, contexts
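
A runnable version of the naive loop, under stated assumptions: ASTs are nested tuples, only complete subtrees are indexed (no subforests or contexts), and SHA-1 over the tuple's `repr` stands in for whatever hash Codewebs uses. The slide's keys are 40 hex digits, which matches the length of a SHA-1 digest but does not confirm the function.

```python
import hashlib
from collections import defaultdict

def subtrees(node):
    """Yield every complete subtree of a nested-tuple AST."""
    yield node
    for child in node[1:]:
        yield from subtrees(child)

def phrase_hash(node):
    """Hash a subtree by serializing it; cost grows with subtree size."""
    return hashlib.sha1(repr(node).encode("utf-8")).hexdigest()

def build_index(asts):
    """Map each code-phrase hash to the set of AST ids containing it."""
    index = defaultdict(set)
    for ast_id, tree in asts.items():
        for phrase in subtrees(tree):
            index[phrase_hash(phrase)].add(ast_id)
    return index

X_THETA = ("BINARY_EXP(*)", ("IDENT(X)",), ("IDENT(theta)",))
RESIDUAL = ("BINARY_EXP(-)", X_THETA, ("IDENT(y)",))

index = build_index({1: RESIDUAL, 2: X_THETA})
print(index[phrase_hash(X_THETA)])
```

Every subtree is re-serialized from scratch here, which is the expense the slide warns about.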

Page 26

Hashing Code Phrases

BINARY_EXP (-)
    IDENT (y)
    BINARY_EXP (*)
        IDENT (X)
        IDENT (theta)

Node list: BINARY_EXP (-), IDENT (y), BINARY_EXP (*), IDENT (X), IDENT (theta)

Step 1. Create postorder listing of nodes.

Step 2. Hash postorder list via:

Page 28

Recycling hash computations

BINARY_EXP (-)
    IDENT (y)
    BINARY_EXP (*)
        IDENT (X)
        IDENT (theta)

Node list: BINARY_EXP (-), IDENT (y), BINARY_EXP (*), IDENT (X), IDENT (theta)

Observation: Can hash sublist of postorder to get hash of code phrases!

Idea of DP: Store prefix hashes and prime powers for all

O(n) in time and space

Page 29

Recycling hash computations

After precomputation, can get any other hash in constant time!
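
The hash formula itself did not survive in the slide text, so the following sketch assumes a standard polynomial rolling hash: base `P`, modulus `M`, and CRC32 as the per-label token are all illustrative choices, not the talk's. It shows the mechanism the slides describe: O(n) precomputation of prefix hashes and powers, then any contiguous sublist of the postorder in constant time.

```python
import zlib

P = 1_000_003        # assumed prime base
M = (1 << 61) - 1    # assumed modulus

def postorder(node):
    """Labels of a nested-tuple tree (label, child, ...) in postorder."""
    labels = []
    for child in node[1:]:
        labels.extend(postorder(child))
    labels.append(node[0])
    return labels

def precompute(tokens):
    """Return prefix hashes and powers of P; O(n) time and space."""
    prefix, power = [0], [1]
    for t in tokens:
        prefix.append((prefix[-1] * P + t) % M)
        power.append((power[-1] * P) % M)
    return prefix, power

def sublist_hash(prefix, power, i, j):
    """Hash of tokens[i:j] in O(1), using the precomputed tables."""
    return (prefix[j] - prefix[i] * power[j - i]) % M

def direct_hash(tokens):
    """Same hash computed from scratch, for comparison."""
    h = 0
    for t in tokens:
        h = (h * P + t) % M
    return h

TREE = ("BINARY_EXP(-)",
        ("IDENT(y)",),
        ("BINARY_EXP(*)", ("IDENT(X)",), ("IDENT(theta)",)))

labels = postorder(TREE)
tokens = [zlib.crc32(label.encode("utf-8")) + 1 for label in labels]
prefix, power = precompute(tokens)

# The subtree X * theta occupies the contiguous slice labels[1:4].
fast = sublist_hash(prefix, power, 1, 4)
slow = direct_hash(tokens[1:4])
print(labels[1:4], fast == slow)
```

A full subtree always occupies a contiguous run of the postorder listing, which is why one pair of tables serves every subtree of the AST. A production version would also have to encode node arity so that different tree shapes with the same label sequence do not collide.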

Page 30

Indexing is fast in practice

[Figure: Time for indexing 1000 ASTs. Horizontal axis: Average AST size (# nodes), 0 to 500. Vertical axis: Run time (seconds), 0 to 3.]

Page 31

Application: Statistical Bug Finding

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    theta = theta-alpha*1/m*sum(X'*(X*theta-y));
    J_history(iter) = computeCost(X, y, theta);
end

83% of ASTs containing this code phrase were buggy!

Solution: X'*(X*theta-y); vs. sum(X'*(X*theta-y));

Page 34

Is sum(X'*(X*theta-y)) likely to be a bug?

Query Index

Fail Fail Fail Pass Fail Fail

83% of ASTs containing this code phrase were buggy!

Compute bug probability for all subforests, return smallest bugs found

Many ways to formulate probabilistic bug localization (we compute probability on local contexts)
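
The 83% figure is the failing fraction among the ASTs returned by the query. A minimal sketch of that estimate, with a hypothetical one-entry index and made-up AST ids; the six outcomes are the ones listed on the slide.

```python
def bug_probability(index, passed, phrase):
    """Fraction of ASTs containing `phrase` that fail the unit tests."""
    ast_ids = index.get(phrase, set())
    if not ast_ids:
        return None  # phrase never seen: no estimate
    failures = sum(1 for ast_id in ast_ids if not passed[ast_id])
    return failures / len(ast_ids)

# Hypothetical index entry: six ASTs contain the phrase.
index = {"sum(X'*(X*theta-y))": {1, 2, 3, 4, 5, 6}}
# Outcomes from the slide: Fail Fail Fail Pass Fail Fail.
passed = {1: False, 2: False, 3: False, 4: True, 5: False, 6: False}

p = bug_probability(index, passed, "sum(X'*(X*theta-y))")
print(round(p * 100))
```

With only six samples the estimate is rough; a real system would weigh it by how many ASTs back it.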

Page 35

Bug Detection Accuracy

[Figure: scatter plot. Horizontal axis: Bug detection F-score (Codewebs), 0 to 1. Vertical axis: Bug detection F-score (Baseline), 0 to 1. Each point represents a single coding problem. Bubble size = Average # nodes per submitted AST. Labeled points: neural net training with backpropagation; logistic regression objective; linear regression with gradient descent. Each axis is marked with the “better” direction.]

Page 36

More than one way to skin a cat…

Canonicalization: apply semantics-preserving transformation rules to ASTs to increase matching probability
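
One hand-written rule of this kind, as a minimal sketch: sort the operands of commutative operators so that `Y+Z` and `Z+Y` produce the same tree. The tuple encoding and the choice to treat only `+` as commutative are assumptions; matrix `*` is deliberately excluded because operand order matters there.

```python
COMMUTATIVE = {"+"}  # assumption: addition only; matrix "*" is not commutative

def canonicalize(node):
    """Return the tree with operands of commutative operators sorted."""
    if isinstance(node, str):  # leaf: an identifier
        return node
    op, *args = node
    args = [canonicalize(arg) for arg in args]
    if op in COMMUTATIVE:
        args.sort(key=repr)  # any fixed total order works
    return (op, *args)

a = canonicalize(("*", "X", ("+", "Y", "Z")))   # X*(Y+Z)
b = canonicalize(("*", "X", ("+", "Z", "Y")))   # X*(Z+Y)
c = canonicalize(("*", ("+", "Y", "Z"), "X"))   # (Y+Z)*X
print(a == b, a == c)
```

The rule merges the first two spellings but not the third, a small instance of why a fixed rule set never covers every way of writing the same thing.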

Page 37

Difficulties with Canonicalization

1. Impossible to predict all ways of writing the same thing

2. Canonicalization rules not typically generalizable across languages

X*(Y+Z)
X*(Z+Y)
(Y+Z)*X
X*Z+X*Y
Z*X+X*Y
Z*X+Y*X
X*1*(Y+Z)
X*[1;1]'*[Y;Z]
X*ones(1,2)*[Y;Z]
X*repmat(1,2,1)'*[Y;Z]
X*ones(1,length([Y;Z]))*[Y;Z]
X*transpose([1;1])*[Y;Z]
X*repmat(1,size([Y;Z],1),1)'*[Y;Z]

Page 38

Codewebs Approach

Use data to determine canonicalization rules

Customize rules to each assignment

Don’t need to be perfect, we’re not building a compiler!

Page 39

Here’s the idea.

def residual(X, theta, y):
    hypothesis = X * theta
    solution = hypothesis - y
    return solution

def residual(X, theta, y):
    hypothesis = (theta' * X')'
    solution = hypothesis - y
    return solution

Page 40

Counter Example

Agreement can be context dependent

def foo():
    solution = solveProblem()
    print(solution)
    return solution

def foo():
    solution = solveProblem()
    print(!solution)
    return solution

Page 42

Join on context

[Figure: two code phrases are each used to query the index, and the results are joined on context to ask whether the phrases are equivalent (= ?)]

Page 43

Fail Fail Pass Pass Fail Pass

Fail Fail Pass Pass Fail Pass

100% probability of equivalence!

** fine print: need to account for sample size in general
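
The join can be sketched as follows, under assumptions: each phrase maps the contexts it appears in to a pass/fail outcome, and the context names `c1` to `c6` are invented. The estimate is the fraction of shared contexts where the two outcomes agree, returned together with the sample size that the fine print asks for.

```python
def equivalence_estimate(results_a, results_b):
    """Join two phrases on shared contexts; return (agreement, sample size)."""
    shared = results_a.keys() & results_b.keys()
    if not shared:
        return None, 0
    agree = sum(1 for ctx in shared if results_a[ctx] == results_b[ctx])
    return agree / len(shared), len(shared)

# Outcomes from the slide, keyed by hypothetical context ids.
outcomes = [False, False, True, True, False, True]  # Fail Fail Pass Pass Fail Pass
results_a = {f"c{i}": ok for i, ok in enumerate(outcomes, start=1)}
results_b = dict(results_a)   # second phrase: same outcome in every context

estimate, n = equivalence_estimate(results_a, results_b)
print(estimate, n)
```

Full agreement over six contexts is weaker evidence than full agreement over six hundred, so the second return value matters.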

Page 44

Workflow

Human provides:

theta = theta-alpha*1/m*(X'*(X*theta-y));

“alphaOverM” “prediction” “residual”

Page 45

{m}:

length (y)    size (X, 1)    size (y, 1)
rows (X)    m    rows (y)
length (X)    length (x (:, 1))    size (X) (1)

{alphaOverM}:

alpha * (1 ./ {m})    alpha * (1 / {m})    alpha * 1 ./ {m}
.01 / {m}    alpha * {m} ^ -1    alpha .* (1 ./ {m})
1 / {m} * alpha    alpha ./ {m}    alpha .* (1 / {m})
alpha * inv ({m})    1 .* alpha ./ {m}    alpha * pinv ({m})
alpha / {m}    alpha .* 1 / {m}    1 * alpha / {m}

{hypothesis}:

(theta' * X')'
(X * theta)
theta(1) + theta (2) * X (:, 2)
(X * theta (:))
[X] * theta
sum(X.*repmat(theta',{m},1), 2)
…

{residual}:

(theta' * X' - y')'
(X * theta - y)
({hypothesis} - y)
({hypothesis}' - y')'
[{hypothesis} - y]
sum({hypothesis} - y, 2)

Page 46

Canonicalization improves bug detection accuracy

[Figure: F-score (axis from 0.65 to 0.95) against # unique ASTs considered, for two series: without canonicalization and with canonicalization. Higher is better.]

Page 47

How many submissions can we give feedback to with fixed effort?

[Figure: # submissions covered (out of 40,000), axis from 0 to 25,000, against # equivalence classes (0, 1, 10, 19). Series: with 25 ASTs marked; with 200 ASTs marked. Annotations: Canonicalization, 25 marked ASTs; No Canonicalization, 200 marked ASTs.]

Page 48

If we can find shared structure, we can facilitate feedback

In education: ASTs, proofs, essays, architecture, poems…

“gradient” “residual”

“learning rate”

“base case”

“inductive hypothesis”

“apartheid”

“Afrikaner Calvinism”

“elections in South Africa”

Page 49

Data is revolutionizing many fields

And it can revolutionize education too!

Page 50

Thank you!! [email protected] http://www.stanford.edu/~jhuang11 @jonathanhuang11