9
Open Research Online The Open University’s repository of research publications and other research outputs Open University Learning Analytics dataset Journal Item How to cite: Kuzilek, Jakub; Hlosta, Martin and Zdrahal, Zdenek (2017). Open University Learning Analytics dataset. Scientific Data, 4, article no. 170171. For guidance on citations see FAQs . c 2017 The Authors Version: Version of Record Link(s) to article on publisher’s website: http://dx.doi.org/doi:10.1038/sdata.2017.171 Copyright and Moral Rights for the articles on this site are retained by the individual authors and/or other copyright owners. For more information on Open Research Online’s data policy on reuse of materials please consult the policies page. oro.open.ac.uk

Open Research Onlineoro.open.ac.uk/53873/1/Untitled.pdf · Students in a module-presentation are organised into study groups of approximately 20 people. Each group has an assigned

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Open Research Onlineoro.open.ac.uk/53873/1/Untitled.pdf · Students in a module-presentation are organised into study groups of approximately 20 people. Each group has an assigned

Open Research OnlineThe Open University’s repository of research publicationsand other research outputs

Open University Learning Analytics datasetJournal ItemHow to cite:

Kuzilek, Jakub; Hlosta, Martin and Zdrahal, Zdenek (2017). Open University Learning Analytics dataset. ScientificData, 4, article no. 170171.

For guidance on citations see FAQs.

c© 2017 The Authors

Version: Version of Record

Link(s) to article on publisher’s website:http://dx.doi.org/doi:10.1038/sdata.2017.171

Copyright and Moral Rights for the articles on this site are retained by the individual authors and/or other copyrightowners. For more information on Open Research Online’s data policy on reuse of materials please consult the policiespage.

oro.open.ac.uk

Page 2: Open Research Onlineoro.open.ac.uk/53873/1/Untitled.pdf · Students in a module-presentation are organised into study groups of approximately 20 people. Each group has an assigned

Data Descriptor: Open UniversityLearning Analytics datasetJakub Kuzilek1,2, Martin Hlosta1 & Zdenek Zdrahal1,2

Learning Analytics focuses on the collection and analysis of learners’ data to improve their learningexperience by providing informed guidance and to optimise learning materials. To support the research inthis area we have developed a dataset, containing data from courses presented at the Open University(OU). What makes the dataset unique is the fact that it contains demographic data together withaggregated clickstream data of students’ interactions in the Virtual Learning Environment (VLE). Thisenables the analysis of student behaviour, represented by their actions. The dataset contains theinformation about 22 courses, 32,593 students, their assessment results, and logs of their interactions withthe VLE represented by daily summaries of student clicks (10,655,280 entries). The dataset is freelyavailable at https://analyse.kmi.open.ac.uk/open_dataset under a CC-BY 4.0 license.

Design Type(s) time series design • data integration objective • observation design

Measurement Type(s) learning behavior

Technology Type(s) digital curation

Factor Type(s) temporal_interval

Sample Characteristic(s) Homo sapiens

1Knowledge Media Institute, The Open University, Walton Hall, Milton Keynes MK7 6AA, UK. 2CIIRC, CzechTechnical University, Zikova street 1903/4, Prague 166 36, Czech Republic. Correspondence and requests formaterials should be addressed to J.K. (email: [email protected]).

OPEN

Received: 20 July 2017

Accepted: 13 October 2017

Published: 28 November 2017

www.nature.com/scientificdata

SCIENTIFIC DATA | 4:170171 | DOI: 10.1038/sdata.2017.171 1

Page 3: Open Research Onlineoro.open.ac.uk/53873/1/Untitled.pdf · Students in a module-presentation are organised into study groups of approximately 20 people. Each group has an assigned

Background & SummaryWith the rapid advancement of information technologies, the higher education sector experienced amassive increase in the amount of student data collected. In addition, Virtual Learning Environmentsemerged and moved courses to the Internet. This transfer was further supported by the boom of MassiveOpen Online Courses (MOOCs). In the past decade over 200 scientific studies investigated the impact ofstudent data analysis1. This shows the importance of open datasets, which provide a standardised way topresent and compare results.

To the best of our knowledge, two other open datasets for Learning Analytics exist. The first is theKDD Cup 2010 dataset2, which provides data in the form of interaction records between students and acomputer-aided-tutoring system. The second dataset is the KDD Cup 2015 dataset3 extracted fromXuetangX MOOC platform. This dataset contains data about the structure of 40 courses and interactionsbetween students and VLE. It does not include any demographic and historical data from past courses.

In comparison to the above-mentioned datasets, the Open University Learning Analytics dataset(OULAD) contains a subset of the OU student data from 2013 and 2014. It includes both studentdemographic data and interaction data with the university’s VLE.

To better understand the data the description of the OU and its learning and teaching system follows.The Open University is one of the largest distance learning universities worldwide. At present, around170,000 students are registered in different programs. Teaching materials and other content are deliveredto students via the VLE. Students’ interactions with the educational materials are recorded and stored inuniversity data warehouse.

At the OU, courses are called modules. Modules can be presented multiple times during the year. Todistinguish between different presentations of a module, each presentation is named by the year andmonth it starts. For example, presentations starting in January ends with A, in February with B and so on;so that ‘2013J’ means that the presentation started in October 2013.

The university offers several hundred modules. Each of them can be studied as a stand-alone course oras part of a university programme. No previous qualifications are required.

During the admission process, students are informed about the Data Protection Policy and Policy onEthical use of Student Data for Learning Analytics. They cover the essential information regarding of useof their personal data. The students are informed that they data are used for academic research purposesand they might be shared with other researchers. In addition, the OU promises not to release data, whichmight be used to identify individual students, this is not the case of OULAD since this dataset isanonymised and cannot be used to link data with the individual students. At present, the OU does notprovide the possibility to opt-out of data usage for Learning Analytics research.

Students in a module-presentation are organised into study groups of approximately 20 people. Eachgroup has an assigned tutor, who guides and supports students throughout the module presentation.

Figure 1 depicts the typical structure of one module-presentation. The typical length of presentationsis 9 months. Module resources are available from the VLE system a few weeks before the start of thepresentation. Students can sign-up for the module from a few months before the start of the presentationuntil two weeks after the official module start date. Each module includes several assessments. At the endof the module, there is usually a final exam.

As stated before the OULAD is a collection of tabular student data from years 2013 and 2014. Eachtable contains different information, which can be linked to data from other tables using identifiercolumns. The data contained in the dataset are structured as shown in Fig. 2. The dataset is studentoriented, thus the student is the central point. Students’ data includes information about theirdemographics and registrations for the modules. For each student-module-presentation triplet, thedataset contains the results of the students’ assessments. Student interactions with the VLE are logged as asummary of their daily activities. The dataset contains 22 module-presentations with 32,593 registeredstudents and it is freely available at https://analyse.kmi.open.ac.uk/open_dataset. OULAD has beencertified by the Open Data Institute (http://theodi.org/).

The dataset may be used in various scenarios. It enables evaluation of predictive models for predictingstudent assessment results and final course results and comparison of models with other models developby other researchers. The VLE data enable to study course structure from the learning perspective and thedata itself can be used to evaluate the influence of VLE on the learning outcomes.

Figure 1. Typical module structure. Module-presentation content is usually available in VLE couple of weeksbefore official module start. During the presentation the students’ their knowledge is evaluated in series ofassessments, which defines the milestones in the module. At the end, there is usually the final exam.

www.nature.com/sdata/

SCIENTIFIC DATA | 4:170171 | DOI: 10.1038/sdata.2017.171 2

Page 4: Open Research Onlineoro.open.ac.uk/53873/1/Untitled.pdf · Students in a module-presentation are organised into study groups of approximately 20 people. Each group has an assigned

MethodsThis section describes the stages of preparing the dataset: collection, selection, anonymisation and release.The whole process is depicted in Fig. 3.

The open university data collection processAt the OU, various information systems for student and module support exist. Due to variability ininformation collected within each system, the OU implemented a data warehouse, which aggregatesinformation from all available systems. The warehouse is built using SAS technology (https://www.sas.com).

In general, we distinguish three different data types:

● Demographic—represents the basic information about the students including their age, gender, region,previous education, etc.

● Performance—reflects students’ results and achievements during their studies at the OU.● Learning behaviour—is the log of student activities in the VLE.

Figure 3. Dataset preparation process. Data were extracted from the OU data warehouse. Then the selectedsubset has been anonymised, certified by the Open Data Institute and released.

Figure 2. Overall dataset structure. The student is linked with the information about his/her demographicsand registrations for the modules. For each student-module-presentation triplet, the dataset contains the resultsof the students’ assessments and logs of student interactions with VLE.

www.nature.com/sdata/

SCIENTIFIC DATA | 4:170171 | DOI: 10.1038/sdata.2017.171 3

Page 5: Open Research Onlineoro.open.ac.uk/53873/1/Untitled.pdf · Students in a module-presentation are organised into study groups of approximately 20 people. Each group has an assigned

Data selectionThe data warehouse contains information about students, namely: their demographics, modules and VLEactivities, since 2012. We selected several representative modules taught at the OU during 2013 and 2014.The selection process followed these rules:

● The number of students in the selected module-presentation is larger than 500.● At least two presentations of the module exist.● VLE data are available for the module-presentation (since not all the modules are studied via VLE).● The module has a significant number of failing students.

Out of the all modules that satisfy these criteria we selected 7 modules: 4 Science, Technology,Engineering, and Mathematics (STEM) modules and 3 Social Sciences modules. The total number ofstudents in the selected modules is 38,239.

Data anonymisationThe dataset anonymisation process was designed according to the ethical and privacy requirementsapplied at the OU. The whole process of dataset creation and release was supervised by the OUmanagement, and approved by the Vice-Chancellor Executive committee. Anonymisation itself wasperformed in a series of steps. The first step removed private information about students and modules.This includes the social security number, dates of birth and unique identifiers used at the OU forstudents. Module names have been replaced by semantic-free symbols and all temporal information hasbeen expressed in relative terms with respect to the presentation start. In addition, all numeric identifiers(i.e., student_id, code_module, etc.) have been reassigned and completely randomised.

Next, we identified quasi-identifying attributes4. These are:

● Gender,● Index of multiple deprivation band (IMD band)5,● Highest education level,● Age,● Region, in which student lives, and● Disability.

These attributes could be used to identify a person using other publicly available sources, thus thenecessity to preserve anonymity requires the application of additional anonymisation methods. For thatpurpose, we used the ARX anonymisation tool6, which is widely used in medical domains. Using theinformation and expert knowledge in the domain we constructed an anonymisation hierarchy for eachquasi-identifying attribute, and then applied several anonymisation rules on the dataset using ARX.

The main measure of anonymity we selected was the k-anonymity measure with k set to 5. We set thesuppression limit to 0.7, this means that the ARX tool prefers to remove ‘outliers’more often than anonymisationof the quasi-identifier. We also set a maximum number of removed ‘outliers’ to 20 % of all record entries. Thelast parameter to be set is average re-identification risk criterion, which is set to 0.05. The anonymisation processreduced the number of students to 32,593 and generalised the Age and IMD band attributes.

Code availabilityFor the anonymisation we used the ARX anonymisation tool (http://arx.deidentifier.org/) version 3.2.1.

Data RecordsDataset (Data Citation 1) is available as a set of separate CSV files (comma separated values, each value iswithin quotation marks and the first line represents column names). Each file contains one ‘database’table. Tables can be connected using unique identifiers (columns).

Figure 4 shows the detailed structure of the dataset.Table studentInfo can be linked tostudentAssessment, studentVle and studentRegistration tables using column id_student. Table courseslinks to the assessments, studentRegistration, vle and studentInfo using identifier columns

Figure 4. Detailed dataset structure. Table studentInfo is linked to vle courses and assessments table viacoresponding "student" table.

www.nature.com/sdata/

SCIENTIFIC DATA | 4:170171 | DOI: 10.1038/sdata.2017.171 4

Page 6: Open Research Onlineoro.open.ac.uk/53873/1/Untitled.pdf · Students in a module-presentation are organised into study groups of approximately 20 people. Each group has an assigned

code_module and code_presentation. Finally assessments table links to studentAssessment usingid_assessment and vle to studentVle using id_site.

The following subsections describe each table in detail. The column identifiers are highlighted using italics.

Table studentInfoThis table contains student demographic information and also their results in each module they studied.It consists of 32,593 rows with the following columns:

● code_module—module identification code on which the student is registered.● code_presentation—presentation identification code during which the student is registered on the

module.● id_student—the unique student identification number.● gender—student’s gender.● region—the geographic region, where the student lived while taking the module-presentation.● highest_education—the highest student education level on entry to the module presentation.● imd_band—the IMD band of the place where the student lived during the module-presentation.● age_band—a band of student’s age.● num_of_prev_attempts—the number of how many times the student has attempted this module.● studied_credits—the total number of credits for the modules the student is currently studying.● disability—indicates whether the student has declared a disability.● final_result—student’s final result in the module-presentation.

Table coursesThe table contains the list of all available modules and their presentations. It consists of 22 rows with thefollowing columns:

● code_module—code name of the module, which serves as the identifier.● code_presentation—code name of the presentation.● length—the length of the module-presentation in days from module start date to module end date.

The structure of B and J presentations may differ and therefore it is recommended to analyse the Band J presentations separately. The following table (Table 1) contains information about the studydomain, the number of students and the number of presentations of each module included in the dataset.

Table studentRegistrationContains information about the time when the student registered for the module presentation. Forstudents who unregistered, the date of un-registration is also recorded. It consists of 32,593 rows with thefollowing columns:

● code_module—the module identification code.● code_presentation—the presentation identification code.● id_student—the unique student identification number.● date_registration—the day of student’s registration for the module presentation.● date_unregistration—the day of student unregistration from the module presentation. Students, who

completed the course have this field empty. Students who unregistered have Withdrawal as the value ofthe final_result in the studentInfo table.

Table assessmentsThis table contains information about assessments in module-presentations. Usually, every presentation has anumber of assessments followed by the final exam. The table consists of 206 rows with the following columns:

Module Domain Presentations Students

AAA Social Sciences 2 748

BBB Social Sciences 4 7,909

CCC STEM 2 4,434

DDD STEM 4 6,272

EEE STEM 3 2,934

FFF STEM 4 7,762

GGG Social Sciences 3 2,534

Table 1. Module summary and domain information.

www.nature.com/sdata/

SCIENTIFIC DATA | 4:170171 | DOI: 10.1038/sdata.2017.171 5

Page 7: Open Research Onlineoro.open.ac.uk/53873/1/Untitled.pdf · Students in a module-presentation are organised into study groups of approximately 20 people. Each group has an assigned

● code_module—module identification code, to which the assessment belongs.● code_presentation—presentation identification code, to which the assessment belongs.● id_assessment—assessment identification number.● assessment_type—a type of assessment. Three types of assessments exist—Tutor Marked Assessment

(TMA), Computer Marked Assessment (CMA) and Final Exam (Exam).● date—information about the cut-off day of the assessment.● weight—the weight of the assessment. Typically, Exams are treated separately and have the weight

equal to 100%; the sum of all other assessments is also 100%.

If the information about the final exam cut-off day is missing, it takes place during the last week of themodule-presentation.

Table studentAssessmentThe table contains the results of students’ assessments. If the student does not submit the assessment, noresult is recorded. Results of the final exam are usually missing (since they are scored and used for the finalmarking immediately at the end of the module). It consists of 173,912 rows with the following columns:

● id_assessment—the assessment identification number.● id_student—the unique student identification number.● date_submitted—the day of assessment submission.● is_banked—the status flag indicating that the assessment result has been transferred from a previous

presentation.● score—the student’s score in this assessment. The range is from 0 to 100. The score lower than 40 is

interpreted as Fail. The marks are in the range from 0 to 100.

Table studentVleThe studentVle table contains information about student’s interactions with the VLE. It consists of10,655,280 rows with the following columns:

● code_module—the module identification code.● code_presentation—the presentation identification code.● id_student—the unique student identification number.● id_site—the VLE material identification number.● date—the day of student’s interaction with the material.● sum_click—the number of times the student interacted with the material.

Table vleThe vle table contains information about the materials available in the VLE. Typically these are HTMLpages, pdf files, etc. It consists of 6,364 rows with the following columns:

● id_site—the identification number of the material.● code_module—the identification code for the module.● code_presentation—the identification code of the presentation.● activity_type—the role associated with the module material.● week_from—the week from which the material is planned to be used.● week_to—the week until which the material is planned to be used.

Technical ValidationOULAD contains data from years 2013 and 2014. We intended to evaluate that the data in the dataset stillreflects the current student population. For that purpose, we compared the OULAD data with thecorresponding data from 2015. We selected one module, namely CCC, and compared it with the data from2015. For the comparison, we employed Chi-squared test of goodwill7 or Wilcoxon rank sum test8 (depends

Attribute Test Degrees of freedom Test value P-value

Age Wilcox — 17,606,000 0.3558

Disability χ2 1 1.0739 0.3001

Education χ2 4 0.89201 0.9257

Gender χ2 1 2.0537 0.1518

IMD χ2 19 15.912 0.6631

Region χ2 12 16.325 0.1768

Table 2. Evaluation of similarity between OULAD and 2015 data for CCC module.

www.nature.com/sdata/

SCIENTIFIC DATA | 4:170171 | DOI: 10.1038/sdata.2017.171 6

Page 8: Open Research Onlineoro.open.ac.uk/53873/1/Untitled.pdf · Students in a module-presentation are organised into study groups of approximately 20 people. Each group has an assigned

Figure 5. Comparison of attributes of OULAD (red) and testing data (blue) from year 2015. We canobserve slight differences between OULAD CCC module and corresponding data from the year 2015. Most ofthem are caused by a different number of students registered for module each year. The largest difference canbe observed in IMD, which is expected since IMD is correlated with the postcode of student home, whichchanges rapidly.

www.nature.com/sdata/

SCIENTIFIC DATA | 4:170171 | DOI: 10.1038/sdata.2017.171 7

Page 9: Open Research Onlineoro.open.ac.uk/53873/1/Untitled.pdf · Students in a module-presentation are organised into study groups of approximately 20 people. Each group has an assigned

on type on attribute) and applied the selected test on attributes from the studentInfo table. The nullhypothesis of the test is whether the distributions of 2013/4 and 2015 data are the same against thealternative hypothesis that distributions are different. To support the results of statistical testing we alsoproduced histograms for visual comparison.

Summary results are depicted in Table 2. We can observe that P-value ranges from 0.15 to 0.93concluding that there is no statistically significant difference between distributions of OULAD and testingdata. This leads to the conclusion that dataset is still actual and reflect the current student population well.

Figure 5 presents the comparison of attributes of module CCC from the OULAD dataset and the datafrom the year 2015. We can observe slight differences in data. Most of them are caused by a differentnumber of students studying each year. What is clear from the picture is that the distributions ofparameters are similar. The largest difference (in the histogram) can be observed in IMD. This variable iscorrelated with the postcode of student home, thus it changes rapidly. Nevertheless, the statistics show,that there is no significant difference between OULAD and the testing data, leading to conclusion that thedataset is still up-to-date regarding the student population.

Usage NotesWe recommend the user to visit the section Examples on OULAD webpage, which contains an exampleof the dataset usage. The OULAD dataset is also available from the UCI machine learning repository(http://archive.ics.uci.edu/ml/) as a CSV file and from GitHub(http://github.com/) as a R package, whichcan be installed directly into R using the command: devtools::install_github(‘jakubkuzilek/oulad’).

References1. Papamitsiou, Z. & Economides, A. A. Learning Analytics and Educational Data Mining in Practice: A Systematic Literature Reviewof Empirical Evidence. Educational Technology & Society 17, 49–64 (2014).

2. Stamper, J., Niculescu-Mizil, A., Ritter, S., Gordon, G. J. & Koedinger, K. R. Algebra I 2008-2009. Challenge data set from KDDCup 2010 Educational Data Mining Challenge (2010).

3. Cao, L. & Zhang, C. KDD Cup 2015—Predicting dropouts in MOOC (2015).4. Iyengar, V. S. Transforming data to satisfy privacy constraints. Proceedings of the eighth ACM SIGKDD international conference onKnowledge discovery and data mining. ACM 279-288 (2002).

5. Smith, T. et al. The English Indices of Deprivation 2015: Technical Report (Department for Communities and Local Government, 2015).6. Prasser, F., Kohlmayer, F., Lautenschläger, R. & Kuhn, K. A. Arx-a comprehensive tool for anonymizing biomedical data. AMIAAnnual Symposium Proceedings 984, 984–993 (2014).

7. Greenwood, P. E. & Nikulin, M. S. A guide to chi-squared testing 280. (John Wiley & Sons, 1996).8. Higgins, J. J. Introduction to modern nonparametric statistics (Cengage Learning, 2003).

Data Citation1. Kuzilek, J., Hlosta, M. & Zdrahal, Z. figshare https://doi.org/10.6084/m9.figshare.5081998.v1 (2017).

AcknowledgementsOur thanks to The Open University for giving us the opportunity to release this dataset. Especially toProfessor John Domingue, Director of KMi.

Author ContributionsJ.K. and M.H. selected the data and prepared the dataset for publication. J.K. performed the datasetanonymisation. Z.Z. evaluated the dataset and tested it for errors. All authors wrote the manuscript.

Additional InformationCompeting interests: The authors declare no competing financial interests.

How to cite this article: Kuzilek, J. et al. Open University Learning Analytics dataset. Sci. Data 4:170171doi: 10.1038/sdata.2017.171 (2017).

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published mapsand institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 Interna-tional License, which permits use, sharing, adaptation, distribution and reproduction in any

medium or format, as long as you give appropriate credit to the original author(s) and the source, provide alink to the Creative Commons license, and indicate if changes were made. The images or other third partymaterial in this article are included in the article’s Creative Commons license, unless indicated otherwise ina credit line to the material. If material is not included in the article’s Creative Commons license and yourintended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtainpermission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

The Creative Commons Public Domain Dedication waiver http://creativecommons.org/publicdomain/zero/1.0/ applies to the metadata files made available in this article.

© The Author(s) 2017

www.nature.com/sdata/

SCIENTIFIC DATA | 4:170171 | DOI: 10.1038/sdata.2017.171 8