Learning to Teach: Improving Instruction with Machine Learning Techniques

Beverly Park Woolf School of Computer Science, University of Massachusetts


Learning to Teach: Machine Learning to Improve Instruction

NIPS 2014 Workshop on Human Propelled Machine Learning, Dec 13, 2014

Long, long Term Goal

Millions of schoolchildren will have access to what Alexander the Great enjoyed as a royal prerogative:

“the personal services of a tutor as well informed as Aristotle”

Pat Suppes, Stanford University, 1966 (died Nov 2014)

“Students will have instant access to vast stores of knowledge through their computerized tutors”

Alexander the Great valued learning so highly that he said he was more indebted to Aristotle for giving him knowledge than to his father for giving him life.

We are on track.

Key components:

Artificial Intelligence, Machine Learning, Learning Sciences

We are able to achieve personal services of a tutor for every student and instant access to vast stores of knowledge

Then: ~400 BC. Now: 2014.

Model the Student

Model the Domain

Personalize Tutoring

Assess Learning

Intelligent Tutoring Systems

Learning @ Scale

Research Questions

How to retrieve substance from educational data?

What do teachers and students need to know?

What do researchers in Learning Sciences want to know?

• Explore large educational data sets and how they are analyzed: creating models and finding patterns.

• How are researchers in educational technology applying a variety of techniques to data to improve teaching and learning?

What Kind of ML Techniques?

– Visualization and modeling
– Decision trees
– Bayesian networks
– Logistic regression
– Temporal models
– Markov models
– Classification: Naïve Bayes, neural networks, decision trees

Reasoning about the Learner with Machine Learning Techniques

Open Learner Models
Models for Teachers, Parents
Models of the Domain
Models of Student Knowledge, Learning
Models of Student Affect/Motivation
Engagement/Use and Misuse/On-off task
Pedagogical Moves and Tutorial Actions

Pre-processing: Discretizing Variables, Normalizing and Transforming Variables

Arroyo, EDM 2010 Baker Arroyo, EDM 2010

Visualizations: Single Variables and Relationships. Bull & Mitrovic; Ritter; EDM 2011 best paper

Models: Correlations/Crosstabulations Arroyo, Log Files Arroyo, Log Files Arroyo, Log Files

Models: Causal Modeling Beck & Rai

Models: Linear Regression Heffernan Koedinger Arroyo, Shanabrook; Baker

Arroyo -- AnimalWatch

Models: Feature Selection. Splitting Models vs. Accounting for. Martin & Koedinger Arroyo

Classification: Logistic Regression Pavlik (PFA); Gong & Beck, v Cooper, David Beck

Classification: Clustering Desmarais -- non-negative matrix factorization

Yue Gong UMAP 2012, clustering without features :-)

Classification: Naive Bayes Stern, MANIC

Classification: Neural Networks Burns, Handwriting D'Mello: Predicting affective states Baker

Classification: Decision Trees (the random forest approach was widely used in the KDD 2011 Cup)

de Vicente & Pain

Models: Association Rule Learning Romero Merceron

Temporal Models: Temporal Patterns and Trails over observable variables, and Markov Chains

Romero (Educational Trails) Shanabrook Shanabrook Shanabrook

Models: Bayesian Networks Zapata-Rivera Heffernan. Lots of classic ITS work (HYDRIVE; William Murray)

Conati; Arroyo; Rai Chaz Murray RTDT

Temporal Models: Hidden Markov Models (latent variables) Mayo & Mitrovic Beck; Pardos Johns & Woolf Ivon Arroyo, Worcester Polytechnic Institute

Data Sets Used

– Data sets come from log files
– Educational tutoring and assessment software

Large Data Sets

EventLog table of a math tutoring system: 571,776 rows in just one year.

Agenda

Introduction
Model the Student
Model the Domain
Personalize Tutoring
Assess Learning
Learning @ Scale

Student Model


A data-driven approach toward automatic prediction of students' emotional states without sensors and while students are still actively engaged in their learning.

Models are built from students' ongoing behavior. Cross-validation revealed small gains in accuracy for the more sophisticated state-based models, and better predictions of the remaining unpredicted cases, compared to the baseline models.

By modifying the context of the tutoring system, including students' perceived emotion around mathematics, a tutor can now optimize and improve a student's attitudes toward mathematics.

David H. Shanabrook, David G. Cooper, Beverly Park Woolf, and Ivon Arroyo

Student States Describing student/tutor interaction

Problem state patterns

Patterns were found with IBM's Many Eyes Word Tree algorithm. There were 1280 ATT (attempted and solved) events in total. Most frequently, ATT was followed by a SOF event (see top tree). The second level of the tree shows that after the sequence ATT ATT, the most frequent next event changes to ATT, i.e., the shift in behavior occurs after two ATT states (see second tree and top branch). This indicates that a single ATT state is more often a solitary event, whereas an ATT ATT pattern tends to continue in the ATT state. In this way, the analysis determines the most frequent three-state problem patterns (e.g., NOTR-NOTR-NOTR) (see third tree and second branch).
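The pattern counting described above can be approximated with a plain n-gram frequency count over a student's sequence of problem states. This is only a minimal sketch, not the Word Tree algorithm itself: the helper name `count_state_patterns` and the sample sequence are hypothetical, and only the state labels ATT, SOF, and NOTR come from the slide.

```python
from collections import Counter

def count_state_patterns(states, n=3):
    """Count every run of n consecutive problem states in one sequence."""
    windows = zip(*(states[i:] for i in range(n)))
    return Counter("-".join(window) for window in windows)

# Hypothetical log of problem states for one student (labels from the slide).
sequence = ["ATT", "SOF", "NOTR", "NOTR", "NOTR", "NOTR", "ATT", "ATT", "ATT"]
patterns = count_state_patterns(sequence, n=3)

# The most frequent three-state pattern in this toy sequence.
print(patterns.most_common(1))
```

Summing such counters over all students would give the population-level pattern frequencies that the trees visualize.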

Jeff Johns, Autonomous Learning Laboratory

Beverly Woolf, Center for Knowledge Communication

AAAI 7/20/2006

A Dynamic Mixture Model to Detect Student Motivation and Proficiency

Problem Statement

• Background
– Develop a machine learning component for a math tutoring system used by high school students (SAT, MCAS)
– Focus on estimating the “state” of a student, which is then used for selecting an appropriate pedagogical action

• Problem
– Using a model to estimate student ability, but…
– Students appear unmotivated in ~30% of problems

• Solution
– Explicitly model motivation (as a dynamic variable) and student proficiency in a single model

Detection of Motivation

Unmotivated students do not reap the full rewards of using a computer-based intelligent tutoring system. Detection of improper behavior is thus an important component of an online student model.

Dynamic mixture model based on Item Response Theory. This model simultaneously estimates a student’s proficiency and changing motivation level.

By accounting for student motivation, the dynamic mixture model can more accurately estimate proficiency and the probability of a correct response.

• Created Item Response Theory (IRT) models for modeling the student's knowledge
• Data consists of responses (correct/incorrect) for 400 students across 70 problems, where a student performs ~33 problems on average
• Implemented an EM algorithm to learn the parameters of the IRT model
• Cross-validated results indicate the model can predict with 72% accuracy how the student will perform on each problem
• Algorithms can be used online to estimate a student's ability while interacting with the tutor
• Currently working on an extension of the IRT model to include information relevant to a student's motivation (time spent on problem, number of hints requested)
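To make the IRT idea concrete, here is a minimal sketch of how a response probability and an online ability estimate could be computed. The slides do not say which IRT variant was used, so the two-parameter logistic form, the grid-search estimator (a simple stand-in, not the EM algorithm mentioned above), and all item parameters are assumptions for illustration.

```python
import math

def p_correct(theta, discrimination, difficulty):
    """Two-parameter logistic IRT (assumed form): P(correct response)."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

def estimate_ability(responses, items, grid=None):
    """Grid-search maximum-likelihood estimate of ability theta.

    responses: list of 0/1 outcomes; items: list of (discrimination, difficulty).
    """
    grid = grid or [x / 10.0 for x in range(-40, 41)]

    def log_likelihood(theta):
        total = 0.0
        for outcome, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            total += math.log(p if outcome else 1.0 - p)
        return total

    return max(grid, key=log_likelihood)

# Hypothetical item parameters: (discrimination, difficulty).
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
print(p_correct(0.0, 1.0, 0.0))  # 0.5: ability equals difficulty
print(estimate_ability([1, 1, 0], items))
```

Because the estimate only needs the responses seen so far, it can be recomputed after every problem, which is what makes online use plausible.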

Low Student Motivation

• Example: Actual data from a student performing 12 problems (green = correct, red = incorrect)
– Problems are of roughly equal difficulty
• Student appears to perform well in beginning and worse toward the end
• Conclusion: The student’s proficiency is average

[Figure: correct/incorrect responses for problems 1 to 12]

• Conclusion: Poor performance on the last five problems is due to low motivation (not proficiency)

[Figure: time (s) to first response for problems 1 to 12, with a region labeled "Student is unmotivated"]

Use observed data to infer motivation!


• Opportunity for intelligent tutoring systems to improve student learning by addressing motivation
• This issue is being dealt with on a larger scale by the educational assessment community
– Wise & DeMars 2005. Low Examinee Effort in Low-Stakes Assessment: Potential Problems and Solutions. Educational Assessment.

Hidden Markov Model (HMM)

• An HMM is used to capture a student’s changing behavior (level of motivation)

[Diagram: nodes M1, M2, …, Mn and H1, H2, …, Hn; Mi is hidden, Hi is observed]

Unmotivated – Hint: Time to first response < tmin AND Number of hints before correct response > hmax
Unmotivated – Guess: Time to first response < tmin AND Number of hints before correct response < hmin
Motivated: If other two cases don’t apply
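The three rules above translate directly into a small classifier. This is a sketch only: the function name `classify_behavior` is hypothetical, and the threshold values used in the example calls are made up, since the slides do not give values for tmin, hmin, or hmax.

```python
def classify_behavior(time_to_first_response, hints_before_correct,
                      t_min, h_min, h_max):
    """Apply the three rules from the slide to one problem attempt."""
    fast = time_to_first_response < t_min
    if fast and hints_before_correct > h_max:
        return "unmotivated-hint"
    if fast and hints_before_correct < h_min:
        return "unmotivated-guess"
    return "motivated"

# Hypothetical thresholds: 5 seconds, fewer than 1 hint, more than 3 hints.
thresholds = dict(t_min=5.0, h_min=1, h_max=3)
print(classify_behavior(2.0, 5, **thresholds))   # unmotivated-hint
print(classify_behavior(2.0, 0, **thresholds))   # unmotivated-guess
print(classify_behavior(20.0, 0, **thresholds))  # motivated
```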

• New edges (in red) change the conditional probability of a student’s response: P(Ui | θ, Mi)

[Diagram: the HMM (M1, M2, …, Mn with H1, H2, …, Hn) extended with response nodes U1, U2, …, Un. Motivation (Mi) affects student response (Ui).]

Parameter Estimation

• Uses an Expectation-Maximization algorithm to estimate parameters
– M-Step is iterative, similar to the Iterative Reweighted Least Squares (IRLS) algorithm

• Model consists of discrete and continuous variables
– Integral for the continuous variable is approximated using a quadrature technique

• Only parameters not estimated
– P(Ui | θ, Mi=unmotivated-guess) = 0.2
– P(Ui | θ, Mi=unmotivated-hint) = 0.02
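One way to read the fixed values above is as the probability of a correct response in each unmotivated state, with the motivated state falling back to the IRT prediction. The sketch below follows that reading; both the interpretation and the two-parameter logistic form of the IRT term are assumptions, and the function names are hypothetical.

```python
import math

# Fixed (not estimated) values from the slide, read here as P(correct).
FIXED_P_CORRECT = {"unmotivated-guess": 0.2, "unmotivated-hint": 0.02}

def p_correct_given_state(theta, a, b, state):
    """P(U_i = correct | theta, M_i = state)."""
    if state in FIXED_P_CORRECT:
        return FIXED_P_CORRECT[state]
    # Motivated: assumed two-parameter logistic IRT prediction.
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def p_correct_marginal(theta, a, b, state_probs):
    """Average over the current belief P(M_i) about the motivation state."""
    return sum(prob * p_correct_given_state(theta, a, b, state)
               for state, prob in state_probs.items())

belief = {"motivated": 0.5, "unmotivated-guess": 0.5}
print(p_correct_marginal(0.0, 1.0, 0.0, belief))
```

Under this reading, an able student who is guessing is still predicted to answer correctly only about one time in five, which is how the mixture keeps low motivation from being mistaken for low ability.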

Modeling Ability and Motivation

• Combined model does not decrease the ability estimate when the student is unmotivated

Combined model separates ability from motivation (IRT model lumps them together)

Experiments

• Data: 400 high school students, 70 problems, a student finished 32 problems on average

• Train the Model
– Estimate parameters

• Test the Model
– For each student, for each problem:
  • Estimate θ and P(Mi) via maximum likelihood
  • Predict P(Mi+1) given HMM dynamics
  • Predict Ui+1. Does it match actual Ui+1?

• Compare combined model vs. just an IRT model
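The "predict P(Mi+1) given HMM dynamics" step in the test loop amounts to pushing the current belief over motivation states through the transition model. A minimal sketch follows; the transition probabilities are invented for illustration, since the learned values are not given in the slides.

```python
STATES = ("motivated", "unmotivated-hint", "unmotivated-guess")

# Hypothetical transition probabilities P(M_{i+1} | M_i); each row sums to 1.
TRANSITIONS = {
    "motivated":         {"motivated": 0.90, "unmotivated-hint": 0.05, "unmotivated-guess": 0.05},
    "unmotivated-hint":  {"motivated": 0.30, "unmotivated-hint": 0.60, "unmotivated-guess": 0.10},
    "unmotivated-guess": {"motivated": 0.30, "unmotivated-hint": 0.10, "unmotivated-guess": 0.60},
}

def predict_next(belief):
    """Propagate P(M_i) one step through the transitions to get P(M_{i+1})."""
    return {
        nxt: sum(belief[cur] * TRANSITIONS[cur][nxt] for cur in STATES)
        for nxt in STATES
    }

belief = {"motivated": 0.5, "unmotivated-hint": 0.25, "unmotivated-guess": 0.25}
next_belief = predict_next(belief)
print(max(next_belief, key=next_belief.get))
```

The predicted belief can then be combined with the response model to predict Ui+1 and compared against the student's actual response.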

Results

• Combined model achieved 72.5% cross-validation accuracy versus 72.0% for the IRT model
– Gap is not statistically significant

• Opportunities for improving the accuracy of the combined model
– Longer sequences (per student)
– Better model of the dynamics, P(Mi+1 | Mi)

Conclusions

• Proposed a new, flexible model to jointly estimate student motivation and ability
– Not separating ability from motivation conflates the two concepts
– Easily adjusted for other tutoring systems

• Combined model achieved similar accuracy to IRT model

• Online inference in real-time
– Implemented in Java; ran it in one high school in May ’06

Model Student Emotion

Sensors used in the classroom

Bayesian networks and Linear regression models

Linear Models to Predict Emotions

Variables that help predict self-reports of emotions. The results suggest that emotion depends on the context in which the emotion occurs (the math problem just solved) and can also be predicted from physiological activity captured by the sensors (bottom row).


Domain Model

The Andes Bayesian network before (left) and after (right) the observation A-is-a-body (Kurt VanLehn).

Domain Model

Student actions (left) and the self-explanation model (right). The physics problem asks the student to find the tension force exerted on a person hanging by a rope tied to his waist. Assume the midshipman was named Jake.

Stephens, 2006



Predicting Student Time To Complete

Two agents were built to predict student time to solve problems (Beck et al., 2000).

1) Population student model (PSM): responsible for modeling how students interacted with the tutor, based on data from the entire population of users. It took as input characteristics of the student and information about the problem to be solved, and it output the expected time (in seconds) the student would need to solve that problem.

2) Pedagogical agent (PA): responsible for constructing a teaching policy. It was a reinforcement learning agent that reasoned about a student’s knowledge and provided customized examples and hints tailored for each student (Beck and Woolf, 2001; Beck et al., 1999a, 2000).


Overview of the ADVISOR machine learning component in AnimalWatch.

The tutor predicted a current student’s reaction to a variety of teaching actions, such as presentation of a specific problem type. The predicted time accounted for roughly 50% of the variance in the actual time students spent solving a problem.

(Beck et al., 2000)

ADVISOR predicted student response time using its population student model
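A statement such as "accounted for roughly 50% of the variance" is commonly quantified with the coefficient of determination, R². The sketch below shows that computation on made-up times; it assumes R² is the measure meant, which the slides do not state, and the data are hypothetical rather than taken from AnimalWatch.

```python
def r_squared(actual, predicted):
    """Fraction of variance in actual values explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_residual / ss_total

# Hypothetical times in seconds: what students took vs. what a model predicted.
actual_times = [30.0, 45.0, 60.0, 90.0]
predicted_times = [40.0, 40.0, 70.0, 80.0]
print(round(r_squared(actual_times, predicted_times), 3))
```

A value of 1.0 means perfect prediction, and 0.0 means the model does no better than always predicting the mean time.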

Cycle Network

Cycle network in DT tutor. The network is rolled out to three time periods representing current, possible, and projected student actions. (From Murray et al., 2004.)

Models being Evaluated (Sarah Schultz, WPI)

Which model, learned over data, helps predict future performance best?

Few issues to solve


Problem Selection Within a Topic

Arroyo et al.

EDM Journal effort.

Pedagogical Moves: Dynamically adjusted, empirically based estimates of effort lead to adjusted problem difficulty and other affective and meta-cognitive feedback

What is “normal” behavior?

In EACH problem pi, i = 1, …, N (N = total problems in system)

Looking across the whole population of students who used a problem

[Figure: three histograms per problem: Incorrect Attempts, Hints, and Time (each bar = 5 seconds). Each is marked with its expected value (E(Ii), E(Hi), E(Ti)) and low/high thresholds (IL, IH; HL, HH; TL, TH). Highlighted region: within expected behavior.]

A new student encounters this problem… Is their behavior within expectation, or atypical?

What is odd behavior?

In any problem pi, i = 1, …, N (N = total problems in system)

[Figure: the same three histograms (Incorrect Attempts, Hints, Time with each bar = 5 seconds), marked with E(Ii), E(Hi), E(Ti) and thresholds IL, IH, HL, HH, TL, TH. Highlighted region: odd behavior.]

Odd behavior: few incorrect attempts, lots of hints, little time

Attempts < E(Ii) - IL
Hints > E(Hi) + HH
Time < E(Ti) - TL
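The odd-behavior test above can be sketched as a simple check against per-problem expectations. This assumes all three conditions must hold together, which the slide suggests but does not state; the function name, dictionary keys, and numeric values are hypothetical.

```python
def is_odd_behavior(attempts, hints, time_s, expected, margins):
    """Flag few incorrect attempts, lots of hints, and little time.

    expected: per-problem population means, keys "I", "H", "T".
    margins:  per-problem threshold offsets, keys "IL", "HH", "TL".
    """
    few_attempts = attempts < expected["I"] - margins["IL"]
    many_hints = hints > expected["H"] + margins["HH"]
    little_time = time_s < expected["T"] - margins["TL"]
    return few_attempts and many_hints and little_time

# Hypothetical expectations for one problem, estimated from all its users.
expected = {"I": 2.0, "H": 2.0, "T": 30.0}
margins = {"IL": 1.0, "HH": 2.0, "TL": 15.0}

print(is_odd_behavior(0, 5, 10.0, expected, margins))  # True
print(is_odd_behavior(2, 1, 40.0, expected, margins))  # False
```

Since the expectations are computed per problem, the same student behavior can be typical on a hard problem and odd on an easy one.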

Increasing Problem Difficulty

At the next time step. Assume we know problem difficulty of items.

[Diagram: sorted list of harder math problems, starting from LastProbSeen, with ends labeled Easiest and Hardest of all. Labels shown: m, X, and parameter H = 3 --> challenge rate.]

Decreasing Problem Difficulty

At the next time step. Assume we know problem difficulty of items.

[Diagram: sorted list of easier math problems, ending at LastProbSeen, with ends labeled Easiest of all and Hardest. Labels shown: n, X, and parameter E = 3.]

Learning @ Scale

Stanford’s Computer Science Course

Machine learning techniques were used to autonomously create a graphical model of how students in an introductory programming course progress through the homework assignment.

Machine learning algorithms found patterns in how students solved the Checkerboard Karel problem. These patterns were more informative at predicting how well students would perform on the class midterm than the grades students received on the assignment. The algorithm captured a meaningful general trend about how students were solving programming problems.

Piech, C., Sahami, M., Koller, D., Cooper, S., & Blikstein, P. (2012, February). Modeling how students learn to program. In Proceedings of the 43rd ACM technical symposium on Computer Science Education (pp. 153-160). ACM.

Student Modeling in Computer Programming

Bag of Words Difference: Researchers first built histograms of the different key words used in a computer program and used the Euclidean distance between two histograms as a naïve measure of the dissimilarity. This is akin to distance measures of text commonly used in information retrieval systems.
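As a rough illustration of the bag-of-words measure, the sketch below builds keyword histograms for two programs and takes the Euclidean distance between them. The keyword set and the two program strings are hypothetical Karel-style examples, not data from the paper.

```python
import math
import re
from collections import Counter

def keyword_histogram(program, keywords):
    """Count how often each keyword of interest appears in the program text."""
    tokens = re.findall(r"[A-Za-z_]\w*", program)
    return Counter(token for token in tokens if token in keywords)

def bag_of_words_distance(prog_a, prog_b, keywords):
    """Euclidean distance between the two keyword histograms."""
    hist_a = keyword_histogram(prog_a, keywords)
    hist_b = keyword_histogram(prog_b, keywords)
    return math.sqrt(sum((hist_a[k] - hist_b[k]) ** 2 for k in keywords))

# Hypothetical keyword set and programs.
KEYWORDS = {"move", "turnLeft", "putBeeper", "while", "if"}
prog_a = "while (frontIsClear()) { move(); putBeeper(); }"
prog_b = "move(); move(); turnLeft();"
print(bag_of_words_distance(prog_a, prog_b, KEYWORDS))  # 2.0
```

The measure ignores the order of statements entirely, which is why the slide calls it naïve.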

Application Program Interface (API) Call Dissimilarity: They ran each program with standard inputs and recorded the resulting sequence of API calls. They used Needleman-Wunsch global DNA alignment to measure the difference between the lists of API calls generated by the two programs.
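The alignment step can be sketched with the standard Needleman-Wunsch dynamic program applied to lists of API call names. The scoring values (match, mismatch, gap) and the example call sequences are assumptions for illustration; the paper's actual scoring scheme is not given in the slides.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Global alignment score between two sequences of API call names."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per element.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq_a[i - 1] == seq_b[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            score[i][j] = max(diagonal,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    return score[-1][-1]

# Hypothetical API call traces from running two student programs.
calls_a = ["move", "putBeeper", "move", "turnLeft"]
calls_b = ["move", "move", "turnLeft"]
print(needleman_wunsch(calls_a, calls_b))  # 2: three matches and one gap
```

A higher score means the two traces are more alike, so a dissimilarity would be some decreasing function of this score.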


Hidden Markov Model

The first step in their student modeling process was to learn a high-level representation of how each student progressed through the Checkerboard Karel assignment. To learn this representation, they modeled a student’s progress as a Hidden Markov Model (HMM).

Learning an HMM. Each state from the HMM becomes a node in a finite state machine (FSM), and the weight of a directed edge from one node to another gives the probability of transitioning from one state to the next.

Figure: the program's Hidden Markov Model of state transitions for a given student. The node "code_t" denotes the code snapshot of the student at time t, and the node "state_t" denotes the high-level milestone that the student is in at time t. N is the number of snapshots for the student.


Dissimilarity Matrix

Clustering on a sample of 2000 random snapshots from the training set returned a group of well-defined snapshot clusters (see Figure 2). The value of K that maximized silhouette score (a measure of how natural the clustering was) was 26 clusters. A visual inspection of these clusters confirmed that snapshots which clustered together were functionally similar pieces of code.

Dissimilarity matrix for clustering of 2000 snapshots. Each row and column in the matrix represents a snapshot and the entry at row i, column j represents how similar snapshot i and j are (dark means more similar)


The states: the finite set of high-level states, or milestones, that a student could be in. A state is defined by a set of snapshots where all the snapshots in the set came from the same milestone.

The transition probability: the probability of being in a state given the state the student was in during the previous unit of time.

The emission probability: the probability of seeing a specific snapshot given that you are in a particular state. To calculate the emission probability, we interpreted each of the states as emitting snapshots with normally distributed dissimilarities. In other words, given the dissimilarity between a particular snapshot of student code and a state’s "representative" snapshot, we can calculate the probability that the student snapshot came from a given state using a Normal distribution based on the dissimilarity.
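The emission model described above can be sketched as a Normal density evaluated at the dissimilarity between a snapshot and a state's representative snapshot. The zero mean, the standard deviation value, and the state names and dissimilarities in the example are assumptions; the slides do not give the fitted parameters.

```python
import math

def emission_density(dissimilarity, sigma=1.0):
    """Normal density (assumed zero mean) of a snapshot's dissimilarity
    to a state's representative snapshot."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coefficient * math.exp(-0.5 * (dissimilarity / sigma) ** 2)

# Hypothetical dissimilarities from one snapshot to three milestone states.
dissimilarities = {"milestone_a": 0.2, "milestone_b": 1.5, "milestone_c": 3.0}
densities = {state: emission_density(d) for state, d in dissimilarities.items()}
print(max(densities, key=densities.get))  # milestone_a
```

Snapshots that closely resemble a state's representative get a high emission density for that state, so the HMM tends to assign them to that milestone.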


The landscape of solutions for “gradient descent for linear regression”, representing over 40,000 student code submissions, with edges drawn between syntactically similar submissions and colors corresponding to performance on a battery of unit tests (red submissions passed all unit tests).

Huang, J., Piech, C., Nguyen, A., & Guibas, L. (2013, June). Syntactic and functional variability of a million code submissions in a machine learning mooc. In AIED 2013 Workshops Proceedings Volume (p. 25).

Stanford’s MOOC: Teaching Machine Learning topics

Hour of Code Challenge: Modeling How Young Students Learn to Program

Code.org problem solving graph of learned policy for how to solve a single open ended programming assignment from over 1M users. Each node is a unique partial-solution. The node 0 is the correct answer.

Chris Piech, Stanford Ph.D. student

Correct Answer

Arc: Next solution an expert would recommend.

Node: unique partial solution.

Improved Retention

Code.org gathered over 137 million partial solutions. Not all students made it through the entire Hour of Code, but retention was quite high relative to other contemporary open access courses.

63K Peer Grading for 7K students

Blue Blob: Student A
Red Circles: Students who were graded by Student A
Red Squares: Students who graded Student A

A Coursera course to teach HCI. Peer grading network of 63K peer grades for 7K students. A single student is highlighted; red squares graded the student, and red circles were graded by the student.

Chris Piech, Stanford Ph.D. student

Lan, A. S., Studer, C., Waters, A. E., & Baraniuk, R. G. (2013). Joint topic modeling and factor analysis of textual information and graded response data. arXiv preprint arXiv:1305.1956.

Circles: Concepts

Squares: Questions

Edges: Strong Question-Concept Relationship


Long term goal

Millions of schoolchildren will have access to what Alexander the Great enjoyed as a royal prerogative: “the personal services of a tutor as well informed as Aristotle”






Thank You! Any Questions?
