Beverly Park Woolf School of Computer Science, University of Massachusetts
Learning to Teach: Machine Learning to Improve Instruction
NIPS 2014 Workshop on Human Propelled Machine Learning, Dec 13, 2014
Long, long Term Goal
Millions of schoolchildren will have access to what Alexander the Great enjoyed as a royal prerogative:
“the personal services of a tutor as well informed as Aristotle”
Pat Suppes, Stanford University, 1966 (died Nov 2014)
“Students will have instant access to vast stores of knowledge through their computerized tutors”
Alexander the Great valued learning so highly that he said he was more indebted to Aristotle for giving him knowledge than to his father for giving him life.
We are on track.
Key components:
Artificial Intelligence
Machine Learning
Learning Sciences
We are able to achieve personal services of a tutor for every student and instant access to vast stores of knowledge
Then: ~400 BC
Now: 2014
Model the Student
Model the Domain
Personalize Tutoring
Assess Learning
Intelligent Tutoring Systems
Learning @ Scale
Research Questions
How to retrieve substance from educational data?
What do teachers and students need to know?
What do researchers in Learning Sciences want to know?
• Explore large educational data sets and how they are analyzed: creating models and finding patterns.
• How are researchers in educational technology using a variety of techniques to turn data into better teaching and learning?
What Kind of ML Techniques?
– Visualization and modeling
– Decision trees
– Bayesian networks
– Logistic regression
– Temporal models
– Markov models
– Classification: Naïve Bayes, neural networks, decision trees
Reasoning about the Learner with Machine Learning Techniques
[Figure: kinds of models learned from data]
– Open learner models
– Models for teachers, parents
– Models of the domain
– Models of student knowledge, learning
– Models of student affect/motivation
– Engagement/use and misuse/on-off task
– Pedagogical moves and tutorial actions
Pre-processing: discretizing, normalizing, and transforming variables. Examples: Arroyo (EDM 2010); Baker
Visualizations: single variables and relationships. Examples: Bull & Mitrovic; Ritter (EDM 2011 best paper)
Models: correlations/cross-tabulations. Examples: Arroyo (log files)
Models: causal modeling. Examples: Beck & Rai
Models: linear regression. Examples: Heffernan; Koedinger; Arroyo; Shanabrook; Baker; Arroyo (AnimalWatch)
Models: feature selection; splitting models vs. accounting for. Examples: Martin & Koedinger; Arroyo
Classification: logistic regression. Examples: Pavlik (PFA); Gong & Beck; David Cooper; Beck
Classification: clustering. Examples: Desmarais (non-negative matrix factorization); Yue Gong (UMAP 2012, clustering without features :-)
Classification: Naive Bayes. Examples: Stern (MANIC)
Classification: neural networks. Examples: Burns (handwriting); D'Mello (predicting affective states); Baker
Classification: decision trees. Examples: the random forest approach was widely used in the KDD2011 cup; de Vicente & Pain
Models: association rule learning. Examples: Romero; Merceron
Temporal models: temporal patterns and trails over observable variables, and Markov chains. Examples: Romero (Educational Trails); Shanabrook
Models: Bayesian networks. Examples: Zapata-Rivera; Heffernan; lots of classic ITS work (HYDRIVE; William Murray); Conati; Arroyo; Rai; Chaz Murray (RTDT)
Temporal models: hidden Markov models (latent variables). Examples: Mayo & Mitrovic; Beck; Pardos; Johns & Woolf
Ivon Arroyo, Worcester Polytechnic Institute
Data Sets Used
– Data sets come from log files of educational tutoring and assessment software.
Large Data Sets
EventLog table of a math tutoring system: 571,776 rows in just one year.
Student Model
A data-driven approach to automatically predicting students' emotional states without sensors, while students are still actively engaged in their learning.
Models are built from students' ongoing behavior. Cross-validation revealed small gains in accuracy for the more sophisticated state-based models, and better predictions of the remaining unpredicted cases, compared to the baseline models.
By modifying the context of the tutoring system, including students' perceived emotion around mathematics, a tutor can now optimize and improve a student's attitudes toward mathematics.
David H. Shanabrook, David G. Cooper, Beverly Park Woolf, and Ivon Arroyo
Student States Describing student/tutor interaction
Problem state patterns
IBM's Many Eyes Word Tree algorithm. There are 1,280 ATT (attempted and solved) events in total. Most frequently, ATT was followed by a SOF event (see top tree). The second level of the tree shows that after the sequence ATT ATT the most frequent next event changes to ATT, i.e. the shift in behavior occurs after two ATT states (see second tree and top branch). This indicates that the ATT state is more often a solitary event, whereas the ATT ATT pattern tends to continue in the ATT state. From this analysis, the most frequent three-state problem patterns (e.g., NOTR-NOTR-NOTR) are determined (see third tree and second branch).
Jeff Johns, Autonomous Learning Laboratory
Beverly Woolf, Center for Knowledge Communication
AAAI 7/20/2006
A Dynamic Mixture Model to Detect Student Motivation and Proficiency
Problem Statement
• Background
– Develop a machine learning component for a math tutoring system used by high school students (SAT, MCAS)
– Focus on estimating the “state” of a student, which is then used for selecting an appropriate pedagogical action
• Problem
– Using a model to estimate student ability, but…
– Students appear unmotivated in ~30% of problems
• Solution
– Explicitly model motivation (as a dynamic variable) and student proficiency in a single model
Detection of Motivation
Unmotivated students do not reap the full rewards of using a computer-based intelligent tutoring system. Detection of improper behavior is thus an important component of an online student model.
A dynamic mixture model based on Item Response Theory simultaneously estimates a student’s proficiency and changing motivation level.
By accounting for student motivation, the dynamic mixture model can more accurately estimate proficiency and the probability of a correct response.
• Created Item Response Theory (IRT) models of the student's knowledge
• Data consists of responses (correct/incorrect) for 400 students across 70 problems, where a student performs ~33 problems on average
• Implemented an EM algorithm to learn the parameters of the IRT model
• Cross-validated results indicate the model can predict with 72% accuracy how the student will perform on each problem
• Algorithms can be used online to estimate a student's ability while interacting with the tutor
• Currently working on an extension of the IRT model to include information relevant to a student's motivation (time spent on problem, number of hints requested)
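The IRT idea in the bullets above can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not say which IRT variant was used, so a two-parameter logistic (2PL) item model is assumed here, and a simple grid search over ability stands in for the EM procedure the authors implemented. The function names and item parameters are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """Assumed 2PL IRT form: probability of a correct response for a student
    with ability theta on an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(responses, items, grid=None):
    """Maximum-likelihood ability estimate by grid search (a stand-in for EM).
    responses: list of 0/1 outcomes; items: list of (a, b) pairs."""
    if grid is None:
        grid = [x / 10.0 for x in range(-40, 41)]  # candidate abilities in [-4, 4]

    def log_likelihood(theta):
        total = 0.0
        for u, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            total += math.log(p) if u == 1 else math.log(1.0 - p)
        return total

    return max(grid, key=log_likelihood)

# Hypothetical item parameters, for illustration only.
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
theta_hat = estimate_ability([1, 1, 0, 0], items)
```

A student who solves the two easier items and misses the two harder ones receives an ability estimate between the difficulties of the items they passed and failed.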
Low Student Motivation
• Example: Actual data from a student performing 12 problems (green = correct, red = incorrect)
– Problems are of roughly equal difficulty
• Student appears to perform well in beginning and worse toward the end
• Conclusion: The student’s proficiency is average
[Figure: responses to problems 1 to 12, colored by correctness]
Low Student Motivation
• Conclusion: Poor performance on the last five problems is due to low motivation (not proficiency)
[Figure: time (s) to first response (axis 0 to 50) for problems 1 to 12, with a region labeled “Student is unmotivated”]
Use observed data to infer motivation!
Low Student Motivation
• Opportunity for intelligent tutoring systems to improve student learning by addressing motivation
• This issue is being dealt with on a larger scale by the educational assessment community
– Wise & DeMars 2005. Low Examinee Effort in Low-Stakes Assessment: Potential Problems and Solutions. Educational Assessment.
Hidden Markov Model (HMM)
• An HMM is used to capture a student’s changing behavior (level of motivation)
[Figure: chain of nodes M1, M2, …, Mn, each linked to a node H1, H2, …, Hn]
M_i (hidden) | H_i (observed)
Unmotivated – Hint | Time to first response < t_min AND number of hints before correct response > h_max
Unmotivated – Guess | Time to first response < t_min AND number of hints before correct response < h_min
Motivated | If the other two cases don’t apply
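The table above reads directly as a small decision rule. The sketch below is one way to write it; the function name and the threshold values in the usage lines are hypothetical, since the slides do not give values for t_min, h_min, or h_max.

```python
def classify_behavior(time_to_first_response, hints_before_correct,
                      t_min, h_min, h_max):
    """Map the observed data for one problem to the observed variable H_i,
    following the three rules in the table."""
    fast = time_to_first_response < t_min
    if fast and hints_before_correct > h_max:
        return "unmotivated-hint"
    if fast and hints_before_correct < h_min:
        return "unmotivated-guess"
    return "motivated"

# Hypothetical thresholds: 5 seconds, fewer than 1 hint, more than 3 hints.
label_a = classify_behavior(2.0, 5, t_min=5.0, h_min=1, h_max=3)
label_b = classify_behavior(2.0, 0, t_min=5.0, h_min=1, h_max=3)
label_c = classify_behavior(20.0, 2, t_min=5.0, h_min=1, h_max=3)
```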
• New edges (in red) change the conditional probability of a student’s response: P(U_i | θ, M_i), where θ is the student’s ability
[Figure: the HMM extended with response nodes U1, U2, …, Un]
Motivation (M_i) affects student response (U_i)
Parameter Estimation
• Uses an Expectation-Maximization algorithm to estimate parameters
– M-step is iterative, similar to the Iteratively Reweighted Least Squares (IRLS) algorithm
• Model consists of discrete and continuous variables
– Integral for the continuous variable is approximated using a quadrature technique
• Only parameters not estimated
– P(U_i | θ, M_i = unmotivated-guess) = 0.2
– P(U_i | θ, M_i = unmotivated-hint) = 0.02
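A minimal sketch of how the fixed values above could enter the response model. Two assumptions are made here and are not stated in the slides: the fixed values are read as probabilities of a correct response, and the motivated case uses a 2PL logistic IRT curve. The function names and default item parameters are hypothetical.

```python
import math

P_CORRECT_GUESS = 0.2   # fixed value from the slide, read as P(correct)
P_CORRECT_HINT = 0.02   # fixed value from the slide, read as P(correct)

def p_correct_given_motivation(theta, motivation, a=1.0, b=0.0):
    """P(U_i = correct | theta, M_i). Ability only matters when motivated."""
    if motivation == "unmotivated-guess":
        return P_CORRECT_GUESS
    if motivation == "unmotivated-hint":
        return P_CORRECT_HINT
    # Assumed 2PL IRT curve for the motivated case.
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def p_correct_marginal(theta, belief, a=1.0, b=0.0):
    """Average over a belief {motivation state: probability} that sums to 1."""
    return sum(prob * p_correct_given_motivation(theta, state, a, b)
               for state, prob in belief.items())

p = p_correct_marginal(0.0, {"motivated": 0.5, "unmotivated-guess": 0.5})
```

Because the unmotivated cases ignore θ, poor responses while unmotivated carry no evidence about ability, which is the separation the combined model is after.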
Modeling Ability and Motivation
• Combined model does not decrease the ability estimate when the student is unmotivated
Combined model separates ability from motivation (IRT model lumps them together)
Experiments
• Data: 400 high school students, 70 problems; a student finished 32 problems on average
• Train the model
– Estimate parameters
• Test the model
– For each student, for each problem:
• Estimate θ and P(M_i) via maximum likelihood
• Predict P(M_{i+1}) given HMM dynamics
• Predict U_{i+1}. Does it match actual U_{i+1}?
• Compare combined model vs. just an IRT model
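The "predict P(M_{i+1}) given HMM dynamics" step in the test loop is a single matrix-vector product. The sketch below illustrates it with a made-up transition matrix; the learned dynamics are not given in the slides, so every number here is hypothetical.

```python
STATES = ["motivated", "unmotivated-guess", "unmotivated-hint"]

# Hypothetical dynamics: TRANSITION[r][s] = P(M_{i+1} = s | M_i = r).
TRANSITION = [
    [0.8, 0.1, 0.1],
    [0.3, 0.6, 0.1],
    [0.3, 0.1, 0.6],
]

def predict_next(belief, transition=TRANSITION):
    """Propagate the current belief P(M_i) one step to get P(M_{i+1})."""
    n = len(belief)
    return [sum(belief[r] * transition[r][s] for r in range(n))
            for s in range(n)]

def most_likely_state(belief):
    """Name of the state with the highest probability."""
    return STATES[max(range(len(belief)), key=lambda s: belief[s])]

belief_now = [0.2, 0.7, 0.1]           # hypothetical estimate of P(M_i)
belief_next = predict_next(belief_now)  # predicted P(M_{i+1})
```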
Results
• Combined model achieved 72.5% cross-validation accuracy versus 72.0% for the IRT model
– Gap is not statistically significant
• Opportunities for improving the accuracy of the combined model
– Longer sequences (per student)
– Better model of the dynamics, P(M_{i+1} | M_i)
Conclusions
• Proposed a new, flexible model to jointly estimate student motivation and ability
– Not separating ability from motivation conflates the two concepts
– Easily adjusted for other tutoring systems
• Combined model achieved similar accuracy to IRT model
• Online inference in real time
– Implemented in Java; ran it in one high school in May ’06
Model Student Emotion
Sensors used in the classroom
Bayesian networks and Linear regression models
Linear Models to Predict Emotions
Variables that help predict self-reports of emotions. The results suggest that emotion depends on the context in which it occurs (the math problem just solved) and can also be predicted from physiological activity captured by the sensors (bottom row).
Domain Model
The Andes Bayesian network before (left) and after (right) the observation A-is-a body. Kurt VanLehn.
Student actions (left) and the self-explanation model (right). The physics problem asks the student to find the tension force exerted on a person hanging by a rope tied to his waist. Assume the midshipman was named Jake.
Stephens, 2006
Predicting Student Time To Complete
Two agents were built to predict student time to solve problems (Beck et al., 2000).
1) Population student model (PSM): modeled how students interacted with the tutor, based on data from the entire population of users. Its inputs were characteristics of the student and information about the problem to be solved; its output was the expected time (in seconds) the student would need to solve that problem.
2) Pedagogical agent (PA): constructed a teaching policy. It was a reinforcement learning agent that reasoned about a student’s knowledge and provided customized examples and hints tailored for each student (Beck and Woolf, 2001; Beck et al., 1999a, 2000).
Overview of the ADVISOR machine learning component in AnimalWatch.
The tutor predicted a current student’s reaction to a variety of teaching actions, such as presentation of a specific problem type. The predicted time accounted for roughly 50% of the variance in the actual time students spent solving a problem.
(Beck et al., 2000)
ADVISOR predicted student response time using its population student model
Cycle Network
Cycle network in DT tutor. The network is rolled out to three time periods representing current, possible, and projected student actions. (From Murray et al., 2004.)
Models Being Evaluated
Sarah Schultz, WPI
Which model, learned over data, helps predict future performance best?
Few issues to solve
Problem Selection Within a Topic
Arroyo et al.
EDM Journal effort.
Pedagogical Moves: dynamically adjusted, empirically based estimates of effort lead to adjusted problem difficulty and other affective and meta-cognitive feedback
What is “normal” behavior?
In EACH problem p_i, i = 1, …, N (N = total problems in system)
Looking across the whole population of students who used a problem
[Figure: three histograms for problem p_i. Incorrect attempts, with expected value E(I_i) and thresholds I_L, I_H; hints, with E(H_i) and thresholds H_L, H_H; time (each bar = 5 seconds), with E(T_i) and thresholds T_L, T_H. The region between the low and high thresholds is labeled “within expected behavior”.]
A new student encounters this problem… Is their behavior within expectation, or atypical?
What is odd behavior?
In any problem p_i, i = 1, …, N (N = total problems in system)
[Figure: the same three histograms (incorrect attempts, hints, time), with E(I_i), E(H_i), E(T_i) and their low/high thresholds marked]
Odd behavior: few incorrect attempts, lots of hints, little time
Attempts < E(I_i) - I_L
Hints > E(H_i) + H_H
Time < E(T_i) - T_L
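The three inequalities above can be checked against per-problem expectations computed from the whole student population. The sketch below assumes the expected values are plain means and that the three conditions are combined with AND to flag this particular pattern; neither detail is stated on the slide. The data, thresholds, and function names are hypothetical.

```python
from statistics import mean

def expected_behavior(records):
    """records: (incorrect_attempts, hints, time_seconds) tuples from all
    students who saw one problem. Returns (E(I_i), E(H_i), E(T_i))."""
    return (mean([r[0] for r in records]),
            mean([r[1] for r in records]),
            mean([r[2] for r in records]))

def is_odd(attempts, hints, time_s, expected, i_low, h_high, t_low):
    """True for the pattern 'few incorrect attempts, lots of hints, little
    time' relative to the population expectation for this problem."""
    e_i, e_h, e_t = expected
    return (attempts < e_i - i_low
            and hints > e_h + h_high
            and time_s < e_t - t_low)

# Hypothetical population data for one problem.
population = [(2, 1, 30), (1, 2, 40), (3, 0, 20)]
expected = expected_behavior(population)  # means: 2 attempts, 1 hint, 30 s
flag_rusher = is_odd(0, 4, 5, expected, i_low=1, h_high=2, t_low=10)
flag_typical = is_odd(2, 1, 30, expected, i_low=1, h_high=2, t_low=10)
```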
Increasing Problem Difficulty
At the next time step. Assume we know the problem difficulty of items.
[Figure: sorted list of the m harder math problems, from easiest to hardest of all, relative to LastProbSeen; X marks the next problem; parameter H = 3 sets the challenge rate]
Decreasing Problem Difficulty
At the next time step. Assume we know the problem difficulty of items.
[Figure: sorted list of the n easier math problems, from hardest to easiest of all, relative to LastProbSeen; X marks the next problem; parameter E = 3]
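The selection rule on these two slides did not survive extraction, so the following is only one plausible reading: the next problem sits a fraction 1/rate of the way into the sorted list of harder (or easier) candidates, with rate = 3 as the slide's parameter. The function name, the integer-division rule, and the data are assumptions.

```python
def next_problem(sorted_by_difficulty, last_index, harder, rate=3):
    """Pick the next problem from a list sorted from easiest to hardest.
    Assumed rule: step len(candidates) // rate positions into the candidates,
    counting away from the last problem seen."""
    if harder:
        # Harder candidates, nearest in difficulty first.
        candidates = sorted_by_difficulty[last_index + 1:]
    else:
        # Easier candidates, nearest in difficulty first.
        candidates = sorted_by_difficulty[:last_index][::-1]
    if not candidates:
        return sorted_by_difficulty[last_index]  # nothing harder/easier left
    return candidates[len(candidates) // rate]

problems = list(range(10))  # hypothetical problems, named by difficulty rank
harder_pick = next_problem(problems, last_index=3, harder=True)
easier_pick = next_problem(problems, last_index=3, harder=False)
```

Under this reading, a larger rate keeps the next problem closer in difficulty to the last one seen, and a smaller rate makes bigger jumps.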
Stanford’s Computer Science Course
Machine learning techniques were used to autonomously create a graphical model of how students in an introductory programming course progress through the homework assignment.
Machine learning algorithms found patterns in how students solved the Checkerboard Karel problem. These patterns were more informative at predicting how well students would perform on the class midterm than the grades students received on the assignment. The algorithm captured a meaningful general trend about how students were solving programming problems.
Piech, C., Sahami, M., Koller, D., Cooper, S., & Blikstein, P. (2012, February). Modeling how students learn to program. In Proceedings of the 43rd ACM technical symposium on Computer Science Education (pp. 153-160). ACM.
Student Modeling in Computer Programming
Bag of Words Difference: Researchers first built histograms of the different key words used in a computer program and used the Euclidean distance between two histograms as a naïve measure of the dissimilarity. This is akin to distance measures of text commonly used in information retrieval systems.
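A minimal sketch of the bag-of-words difference described above: count keyword occurrences in each program and take the Euclidean distance between the two histograms. The keyword list, the tokenizer, and the two program strings are hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter

# Hypothetical keyword vocabulary for a Karel-style assignment.
KEYWORDS = ["while", "if", "else", "move", "turnLeft", "putBeeper"]

def keyword_histogram(source, keywords=KEYWORDS):
    """Count how often each keyword appears as an identifier in the source."""
    tokens = re.findall(r"[A-Za-z_]\w*", source)
    counts = Counter(t for t in tokens if t in keywords)
    return [counts[k] for k in keywords]

def euclidean(h1, h2):
    """Euclidean distance between two equal-length histograms."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(h1, h2)))

prog_a = "while (frontIsClear()) { move(); putBeeper(); }"
prog_b = "if (frontIsClear()) { move(); } else { turnLeft(); }"
distance = euclidean(keyword_histogram(prog_a), keyword_histogram(prog_b))
```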
Application Program Interface (API) Call Dissimilarity: They ran each program with standard inputs and recorded the resulting sequence of API calls. They used Needleman-Wunsch global DNA alignment to measure the difference between the lists of API calls generated by the two programs.
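The API-call comparison above relies on Needleman-Wunsch global alignment, which can be sketched with a small dynamic program over two call sequences. The scoring values (match +1, mismatch -1, gap -1) and the conversion from alignment score to a dissimilarity are assumptions for illustration; the paper's actual settings are not given in the slides.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Global alignment score between two sequences of API call names."""
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # align a prefix of seq_a against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # align a prefix of seq_b against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = seq_a[i - 1] == seq_b[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            score[i][j] = max(diagonal,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    return score[n][m]

def api_dissimilarity(seq_a, seq_b):
    """Hypothetical conversion: best possible score minus achieved score."""
    return max(len(seq_a), len(seq_b)) - needleman_wunsch(seq_a, seq_b)

calls_a = ["move", "turnLeft", "move"]
calls_b = ["move", "move"]
d_same = api_dissimilarity(calls_a, calls_a)
d_diff = api_dissimilarity(calls_a, calls_b)
```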
Piech, C., Sahami, M., Koller, D., Cooper, S., & Blikstein, P. (2012, February). Modeling how students learn to program. In Proceedings of
the 43rd ACM technical symposium on Computer Science Education (pp. 153-160). ACM.
Hidden Markov Model
The first step in their student modeling process was to learn a high level representation of how each student progressed through the checkerboard Karel assignment. To learn this representation they modeled a student’s progress as a Hidden Markov Model (HMM) [17].
Learning an HMM. Each state from the HMM becomes a node in the FSM, and the weight of a directed edge from one node to another gives the probability of transitioning from one state to the next. [Figure: the Hidden Markov Model of state transitions for a given student.] The node "code_t" denotes the code snapshot of the student at time t, and the node "state_t" denotes the high-level milestone that the student is in at time t. N is the number of snapshots for the student.
Piech, C., Sahami, M., Koller, D., Cooper, S., & Blikstein, P. (2012, February). Modeling how students learn to program. In Proceedings of the 43rd ACM technical symposium on Computer Science Education (pp. 153-160). ACM.
Dissimilarity Matrix
Clustering on a sample of 2000 random snapshots from the training set returned a group of well-defined snapshot clusters (see Figure 2). The value of K that maximized silhouette score (a measure of how natural the clustering was) was 26 clusters. A visual inspection of these clusters confirmed that snapshots which clustered together were functionally similar pieces of code.
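The silhouette score mentioned above can be computed directly from a precomputed dissimilarity matrix and a cluster assignment. The sketch below uses the standard definition (b - a) / max(a, b) averaged over points, with singleton clusters scoring 0; the four-point data set is hypothetical. Choosing K then amounts to clustering for several values of K and keeping the one with the highest score.

```python
def silhouette_score(dist, labels):
    """Mean silhouette over all points. dist is a square dissimilarity
    matrix; labels assigns a cluster to each point (at least two clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean dissimilarity to the other members of the same cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to any other cluster
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)

# Hypothetical 1-D points; dissimilarity is absolute difference.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]
good = silhouette_score(dist, [0, 0, 1, 1])
bad = silhouette_score(dist, [0, 1, 0, 1])
```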
Dissimilarity matrix for clustering of 2000 snapshots. Each row and column in the matrix represents a snapshot and the entry at row i, column j represents how similar snapshot i and j are (dark means more similar)
Piech, C., Sahami, M., Koller, D., Cooper, S., & Blikstein, P. (2012, February). Modeling how students learn to program. In Proceedings of the 43rd ACM technical symposium on Computer Science Education (pp. 153-160). ACM.
The finite set of high-level states, or milestones, that a student could be in. A state is defined by a set of snapshots where all the snapshots in the set came from the same milestone. The transition probability is the probability of being in a state given the state the student was in during the previous unit of time.
The emission probability, of seeing a specific snapshot given that you are in a particular state. To calculate the emission probability we interpreted each of the states as emitting snapshots with normally distributed dissimilarities. In other words, given the dissimilarity between a particular snapshot of student code and a state’s "representative" snapshot, we can calculate the probability that the student snapshot came from a given state using a Normal distribution based on the dissimilarity.
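The emission model above can be sketched as a Normal density evaluated at the dissimilarity between a snapshot and each state's representative snapshot. The zero mean, the shared standard deviation, and the uniform prior over states used for normalization are assumptions made for illustration; the state names and dissimilarity values are hypothetical.

```python
import math

def emission_density(dissimilarity, sigma=1.0):
    """Zero-mean Normal density at the dissimilarity between a snapshot and
    a state's representative snapshot (sigma is an assumed parameter)."""
    return (math.exp(-0.5 * (dissimilarity / sigma) ** 2)
            / (sigma * math.sqrt(2.0 * math.pi)))

def state_probabilities(dissimilarities, sigma=1.0):
    """Normalize emission densities across states, assuming a uniform prior.
    dissimilarities: {state name: dissimilarity to its representative}."""
    weights = {state: emission_density(d, sigma)
               for state, d in dissimilarities.items()}
    total = sum(weights.values())
    return {state: w / total for state, w in weights.items()}

probs = state_probabilities(
    {"milestone_A": 0.5, "milestone_B": 2.0, "milestone_C": 4.0})
best_state = max(probs, key=probs.get)
```

The closer a snapshot is to a state's representative, the more probability mass that state receives.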
Piech, C., Sahami, M., Koller, D., Cooper, S., & Blikstein, P. (2012, February). Modeling how students learn to program. In Proceedings of
the 43rd ACM technical symposium on Computer Science Education (pp. 153-160). ACM.
The landscape of solutions for “gradient descent for linear regression”, representing over 40,000 student code submissions, with edges drawn between syntactically similar submissions and colors corresponding to performance on a battery of unit tests (red submissions passed all unit tests).
Huang, J., Piech, C., Nguyen, A., & Guibas, L. (2013, June). Syntactic and functional variability of a million code submissions in a machine learning mooc. In AIED 2013 Workshops Proceedings Volume (p. 25).
Stanford’s MOOC: Teaching Machine Learning Topics
Hour of Code Challenge Modeling How Young Students Learn to Program
Code.org problem solving graph of learned policy for how to solve a single open ended programming assignment from over 1M users. Each node is a unique partial-solution. The node 0 is the correct answer.
Chris Piech, Stanford Ph.D. student
Correct Answer
Arc: next solution an expert would recommend.
Node: unique partial solution.
Improved Retention
Code.org gathered over 137 million partial solutions. Not all students made it through the entire Hour of Code, but retention was quite high relative to other contemporary open-access courses.
63K Peer Grading for 7K Students
Blue blob: Student A
Red circles: students who were graded by Student A
Red squares: students who graded Student A
A Coursera course to teach HCI. Peer grading network of 63K peer grades for 7K students. A single student is highlighted; red squares graded the student, red circles were graded by the student.
Chris Piech, Stanford Ph.D. student
Lan, A. S., Studer, C., Waters, A. E., & Baraniuk, R. G. (2013). Joint topic modeling and factor analysis of textual information and graded response data. arXiv preprint arXiv:1305.1956.
Circles: concepts
Squares: questions
Edges: strong question-concept relationship
Long term goal
Millions of schoolchildren will have access to what Alexander the Great enjoyed as a royal prerogative: “the personal services of a tutor as well informed as Aristotle”
Pat Suppes, Stanford University, 1966 (died Nov 2014)
“Students will have instant access to vast stores of knowledge through their computerized tutors”
Thank You! Any Questions?