

Reproducible Statistics course for the future, from Stata to R

Kennedy Mwai1, Amos Thairu1, Tabitha Mwangi2, Greg Fegan1,3

1. KEMRI-Wellcome Trust Research Programme, Kilifi

2. Pwani University, Kenya

3. Nuffield Department of Clinical Medicine, Centre for Tropical Medicine & Global Health, University of Oxford

@kenniajin

URL: www.keniajin.com

Keywords: Training, Reproducibility, RStudio, Git, RMarkdown

Introduction

Reproducibility is laudable and frequently called for, so we should be instilling this practice in students before they set out to do research [1]. The maturity and extensive reproducibility capabilities of Git, R and RStudio based materials make them an excellent choice for professional statistical skills training [2,3].

Using other major statistical software for training can be challenging when it comes to creating reproducible courses for universities and training institutions. The popularity of R and RStudio as tools for statistical programming is rising.

We used these capabilities to convert a course on statistical methodology for the design and analysis of epidemiological studies (SMDAES) from a STATA® based course to an R based reproducible course.

Approach

STATA materials: *.do, *.dta, *.doc, *.ppt

Conversion in RStudio and R:

- *.do to *.R files
- *.dta to csv with read.dta()
- *.ppt to *.RMD or *.RNW with knitr/rmarkdown
- *.doc to *.RMD or *.RNW with knitr/rmarkdown

Local Git repository: git add ., git commit and git push

Online GitHub course repository: git pull
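The knitr step of the conversion can be illustrated with a small sketch. It is not taken from the course materials: the file name intro_slides.Rmd and the slide content are hypothetical, and it only assumes that the knitr package is installed. The chunk fence is built with strrep() purely so that the example can be shown inside a fenced block.

```r
# Minimal sketch: one slide written as R Markdown and knitted with knitr.
# The file name and the slide content are hypothetical examples.
library(knitr)

work_dir <- tempdir()
rmd_file <- file.path(work_dir, "intro_slides.Rmd")
md_file  <- file.path(work_dir, "intro_slides.md")

# Three backticks, built programmatically so they can sit inside this example.
fence <- strrep("`", 3)

# Text and R code live in one source file, so the slide can be regenerated
# whenever the code or the data change.
writeLines(c(
  "---",
  "title: \"Summarising data\"",
  "---",
  "",
  "## Mean of a small sample",
  "",
  paste0(fence, "{r}"),
  "mean(c(2, 4, 6))",
  fence
), rmd_file)

# knit() runs each chunk and writes the code together with its output
# to a Markdown file.
knit(rmd_file, output = md_file, quiet = TRUE)
cat(readLines(md_file), sep = "\n")
```

The knitted Markdown contains both the R call and its result. Turning the same source into HTML slides would typically go through rmarkdown::render(), which additionally needs pandoc; that step is left out here to keep the sketch small.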

Pros and Cons

Cons

- Initial effort is time-consuming
- Server setup requires resources
- Stata had a lot of background formatting

The course

The course mainly covered material beneficial to students pursuing public health related courses. The course materials were adapted from the STATA® workshops that KEMRI-WTRP has been running for over five years. All the data sets were converted from .dta to .csv format using the foreign package. The code used in the presentations was converted from the initial .do files (STATA®) to .R files. The course was hosted on GitHub [4], where the facilitators could update the materials from their local or RStudio Server repositories.
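A data set conversion of this kind could look roughly like the sketch below. It is an illustration, not the script used for the course: the function name convert_dta_to_csv and the folder names in the usage comment are hypothetical. It relies on read.dta() from the foreign package, which reads files saved by Stata versions 5 to 12, so data saved by a newer Stata would first need to be saved in an older format or read with a different package.

```r
# Sketch: batch-convert Stata .dta files to .csv with the foreign package.
# convert_dta_to_csv() is a hypothetical helper, not part of foreign.
library(foreign)

convert_dta_to_csv <- function(in_dir, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

  # Every .dta file found in the input folder is converted.
  dta_files <- list.files(in_dir, pattern = "\\.dta$", full.names = TRUE)

  csv_files <- vapply(dta_files, function(dta_file) {
    dat <- read.dta(dta_file)  # Stata data set -> data frame
    base_name <- tools::file_path_sans_ext(basename(dta_file))
    csv_file <- file.path(out_dir, paste0(base_name, ".csv"))
    write.csv(dat, csv_file, row.names = FALSE)
    csv_file
  }, character(1), USE.NAMES = FALSE)

  # Return the paths of the files that were written.
  invisible(csv_files)
}

# Example call with hypothetical folder names:
# convert_dta_to_csv("data_dta", "data_csv")
```

Keeping the conversion in a script rather than exporting each file by hand means it can be rerun whenever a source data set changes, which fits the reproducibility aim of the course.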

Our RStudio Server platform ran on Ubuntu 12.04 with 10 GB of RAM. Students and facilitators were instructed to have a username/email before the start of the course so that they could access the RStudio Server platform, which was linked to the university's Active Directory. We had approximately 30 active users during the course period.

The course ran over a period of 10 days: 9 days of coursework and a final wrap-up day covering different R help forums. The facilitators were encouraged to give 15-minute presentations and then engage the participants in hands-on R practice.

References

1. Stodden, V. & Miguez, S. Best Practices for Computational Science: Software Infrastructure and Environments for Reproducible and Extensible Research. (2014). at http://dx.doi.org/10.6084/m9.figshare.1027488

2. Muenchen, R. A. R for stats: the popularity of data analysis software. (2014; accessed March 27, 2015). at http://r4stats.com/articles/popularity

3. Wieczorek, J. Reproducible research, training wheels, and knitr. (2014; accessed March 30, 2015). at http://civilstat.com/2014/02/reproducible-research-training-wheels-and-knitr

4. Introduction to Statistical Methods using R and R Studio 2015 -- Pwani University. at https://github.com/Keniajin/PwaniR_Training_2015