Cluster Computing Data Preparation EEGtoolkit (R) Karim Malki

Agenda

Cluster computing for batch jobs (where)

• Introduction
• What do I need?
• How do I use it?
• What can it do for me?

EEG tool kit in R

• R for EEG?
• R for parallelising tasks
• R for graphics

What is cluster Computing?

A high availability computer cluster consists of a set of connected computers that work together so that they can be viewed as a single system.

Nodes:

Why Cluster Computing:

As psychology and biology become increasingly data-driven, we have an unprecedented need to analyze increasingly large-scale, often high-dimensional, data.

What can cluster computing help with?

• Running a single task many times (powerful PC)
• Batch processing - redundancy
• Parallelising - load balancing

Cluster Computing

Is cluster computing hard to learn?

No!

Wow… you convinced me. I want a cluster! Where do I get one?

Adopting a Cluster

You already have a nice one!

Cluster Computing

Introduction to High-Performance Scientific Computing

In order to use the cluster properly you will need to know the following:

• Unix / Linux command line interface
• SGE submission methods
• Use of scp (secure copy)
• bash scripting

Requirements:

If your workstation is running Linux or Unix, or you use a Mac running macOS, you already have everything you need.

If you are running Windows (on PC or Mac), you will need an X emulator, an ssh terminal, and a secure copy client.

MobaXterm http://mobaxterm.mobatek.net

WinSCP http://winscp.net

Cygwin http://www.cygwin.com

PuTTY http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

ssssh!

There are a couple of ways that you can access a shell (command line) remotely on most Linux/Unix systems.

• SSH stands for Secure Shell
• It provides the best security when accessing another computer remotely
• It encrypts the session
• It provides better authentication facilities, plus secure file transfer, X session forwarding and port forwarding

Command line navigation

Use standard Linux commands! You can practice these using:
- Terminal (Mac)
- MobaXterm (Windows)

http://www.ee.surrey.ac.uk/Teaching/Unix/

Example:

Let’s try to connect!

ssh -X -l malkik -p51515 -t sgdpgw.iop.kcl.ac.uk ssh -X mumak

Once connected, navigate using cd and ls.
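As a minimal sketch of these navigation commands, the following can be tried in any local terminal before connecting to the cluster. The directory and file names are made up for illustration, and everything happens inside a throwaway temporary directory.

```shell
#!/bin/bash
# Practice area: a throwaway directory (all names here are hypothetical)
practice_dir=$(mktemp -d)
cd "$practice_dir" || exit 1

mkdir -p project/data          # create nested directories
touch project/data/sdata.dat   # create an empty file

cd project                     # move into a directory
pwd                            # print the current location
ls                             # list the contents of the current directory
listing=$(ls data)             # capture the listing of a subdirectory
echo "data contains: $listing"

cd ..                          # go back up one level
```

The same cd, ls and pwd commands behave identically once you are logged into a node.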

Interactivity

You can use the cluster as a high-powered PC for analysis

For instance you can run single instances of Matlab or R

To work interactively, you have to be logged into a node!

Check which nodes are available using the command: qhost

Log into a node using: qrsh

How to choose the right node?

How many open sessions can I have?

Screen

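When you are running an interactive session, you want to protect it with the screen command, so that it survives a dropped connection.

screen (start a new session)

screen -ls (list existing sessions)

screen -r (reattach to a session)

To detach from a session: "Ctrl-a" then "d"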

Distributed Computing

Not all problems are suitable for running on clusters. Computing problems can be broken down into blocks of code that are either serial or parallel in nature. Whilst the serial segments run the same way on both desktops and on a cluster, when you reach the parallel segments, the desktop will continue to run them sequentially whereas a cluster will break them up and run them in parallel. As an example, this image represents a typical program with both serial and parallel segments and how they are executed on both types of system:

You can see that where both systems hit a series of parallel segments, the cluster completes them in a fraction of the time.
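The difference between serial and parallel execution can be sketched on a single machine with bash background jobs. This is only an illustration under simple assumptions: the work function and its squaring task are hypothetical stand-ins for a real analysis step, and a cluster would spread the tasks over nodes rather than over local processes.

```shell
#!/bin/bash
# Hypothetical unit of work: square a number and write the result to a file
work() {
    local n=$1 outdir=$2
    echo $(( n * n )) > "$outdir/$n.txt"
}

outdir=$(mktemp -d)

# Serial: each task starts only after the previous one has finished
for n in 1 2 3 4; do
    work "$n" "$outdir"
done
serial_result=$(cat "$outdir"/1.txt "$outdir"/2.txt "$outdir"/3.txt "$outdir"/4.txt | xargs)

rm -f "$outdir"/*.txt

# Parallel: every task is started in the background, then we wait for all of them
for n in 1 2 3 4; do
    work "$n" "$outdir" &
done
wait
parallel_result=$(cat "$outdir"/1.txt "$outdir"/2.txt "$outdir"/3.txt "$outdir"/4.txt | xargs)

echo "serial:   $serial_result"
echo "parallel: $parallel_result"
rm -rf "$outdir"
```

Both runs give the same results because the tasks are independent of each other; that independence is what makes a segment suitable for parallel execution.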

Distributed Computing

In order to use parallel computing you need to be able to write (or have someone write) either two bash scripts or code in a way that takes advantage of cluster computing.

Execution Script

Control Script

Execution code

The code has to be written in a way where certain parts of it can run simultaneously on different nodes and then be reassembled.

Matlab does not parallelize well and has to be installed on every node. You will hit the limits of Matlab quickly. Also, Matlab tends to load as much of the data as possible.

Windows-based programs will not run on the cluster.

Command Script

Contains information for distributed computing:

Includes standard commands such as the memory needed.

Sets the nodes on which the execution script should be distributed.

May contain information such as creating directories, and writing and reading data from different locations.
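A control script along these lines might look like the sketch below. It is an assumption-laden illustration rather than the cluster's recommended setup: run_analysis.sh, the subject labels and the job names are hypothetical, and the memory request uses the h_vmem resource, whose availability and sensible values depend on how the site has configured Grid Engine. When qsub or the execution script is not available, the script only prints what it would submit.

```shell
#!/bin/bash
# Control script sketch: submit one job per subject (all names are hypothetical)
exec_script="run_analysis.sh"   # the execution script to distribute
submitted=0

for subject in s01 s02 s03; do
    # -cwd: run in the current directory; -l h_vmem: memory request (site-dependent);
    # -N: job name; trailing arguments are passed on to the execution script
    cmd="qsub -cwd -l h_vmem=2G -N eeg_${subject} ${exec_script} ${subject}"
    if command -v qsub >/dev/null 2>&1 && [ -f "$exec_script" ]; then
        $cmd
    else
        echo "[dry run] $cmd"
    fi
    submitted=$((submitted + 1))
done

echo "Prepared $submitted submissions"
```

Running it on a machine without Grid Engine is a safe way to check the generated commands before using the cluster.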

Grid Engine

$JOB_ID          The job ID assigned by SGE when a job is submitted
$JOB_NAME        The job name. If none is specified when the job is submitted, JOB_NAME is set to either the JOB_ID or the name of the script/program being run
$SGE_O_WORKDIR   The working directory of the job submission
$HOME            Refers to your home directory
$LOGNAME         The username of the user that submits the job
$HOSTNAME        Hostname of the node running the job
$SGE_TASK_ID     The task identifier in an array submission, represented by an array instance
$SGE_TASK_FIRST  Refers to the first job of an array submission
$SGE_TASK_LAST   Refers to the last job of an array submission

http://sgdpcluster.iop.kcl.ac.uk/SG_Submit.php

Grid Engine

#!/bin/bash
# File: runmerlin.sh
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -q all.q
#$ -N MerlinRegression$JOB_ID
#$ -M [email protected]
#$ -m be
#$ -o $HOME/op$1.txt
merlin-regress -d judge.dat -p dredd.ped -m megacity1.map --simulate -r$1
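The array variables listed earlier can be used to give each task its own input and output. The following is a minimal sketch of an array-job script, assuming four tasks and hypothetical file names (subject_N.dat, result_N.txt). Outside an array job SGE_TASK_ID is either unset or holds the placeholder value "undefined", so the script falls back to task 1 and can be dry-run on a desktop.

```shell
#!/bin/bash
# File: arraydemo.sh (hypothetical name)
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -N ArrayDemo
#$ -t 1-4

# SGE sets SGE_TASK_ID for each task of an array submission;
# fall back to 1 when it is unset or "undefined" (e.g. a local dry run)
task_id=${SGE_TASK_ID:-1}
if [ "$task_id" = "undefined" ]; then
    task_id=1
fi

# Hypothetical naming scheme: one input and one output file per task
input_file="subject_${task_id}.dat"
output_file="result_${task_id}.txt"

echo "Job ${JOB_ID:-local}, task ${task_id}: ${input_file} -> ${output_file}"
```

Submitted with qsub arraydemo.sh, the "-t 1-4" line asks Grid Engine to run the script four times, once per task identifier.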

Transferring data

You can use a programme like WinSCP on Windows.

Better to use the command line directly. To copy a file to the cluster:

scp sdata.dat [email protected]:/home/usersname/

To copy a result back to the current directory:

scp [email protected]:/home/usersname/outdata.dta ./

EEG in R

The most popular open source software applications, like EEGLAB, have been written in the MATLAB programming language (www.mathworks.com).

Several MATLAB toolboxes have been developed for specific EEG and ERP analysis methods; several of them can be used as EEGLAB plug-ins.

However, since MATLAB is commercial software, those toolboxes are either not free or, if used as free stand-alone applications (created via MATLAB Compiler), they lack most of the scripting possibilities provided by the MATLAB environment.

Other disadvantages of these toolboxes are their suboptimal performance regarding CPU time and memory management, and the relatively low user-friendliness compared to commercial EEG software.

EEG in R

• is freely available and cross-platform (Windows, Linux and OS X)
• covers all basic steps in the processing of EEG/ERPs (raw data import, filtering, artifact rejection, segmentation, frequency decomposition, statistical analyses of single-trial and averaged data, etc.) but can be used with utility functions which additional packages can rely on
• can handle out-of-memory data and allows easy parallelization
• provides user-friendly workflows but does not restrict power-users in developing custom scripts
• provides functions for interactive and/or animated plotting
• builds on existing R packages
• is entirely open source

EEG in R

It is possible to do most of the pre-processing in EEGLAB or other software and export the files to R (or another language such as Python or C) for further analysis.

Several tools for the analysis of EEG data already exist in R, including:

eegtoolkit

Analysis and visualization tools for electroencephalography (EEG) data

In particular, it has some reasonable functions to generate simulated EEG data. This may be useful when evaluating different methods.

Requirements:

A computer running R (or you can use the cluster)

Install R on your machine: https://cran.r-project.org/

Install the following libraries:

install.packages('psych')

install.packages('ggplot2')

install.packages('eegkit', repos='http://cran.us.r-project.org')

EEG R

R packages for EEG are essentially a series of functions.

They are not as well documented or as complete as EEGLAB.

More than one package may be needed.

It is not difficult to write a function in R, BUT you need to understand what you are doing.

EEG R

R

library(eegkit)

Simple examples using plots (visualise your data any way you want):

eegcap(electrodes="10-10",type=c("3d","2d"),plotlabels=TRUE,
       plotaxes=FALSE,main="",xyzlab=NULL,cex.point=NULL,
       col.point=NULL,cex.label=NULL,col.label=NULL,nose=TRUE,
       ears=TRUE,head=TRUE,col.head="AntiqueWhite",index=FALSE,
       plt=c(0.03,0.97,0.03,0.97) )

EEG R

Independent Component Analysis of EEG Data in R

PRO
• Can be highly customised
• You have access to the algorithms (not a black box)
• It is fast
• Easy to parallelize

CON
• Requires broader knowledge of R to extract meaningful information
• User accountability (it is up to you to check what you are doing)

EEG R

Example:

One line of code (but in isolation not sufficient).

Usage:

eegica(X,nc,center=TRUE,maxit=100,tol=1e-6,Rmat=diag(nc),
       type=c("time","space"),method=c("imax","fast","jade"),...)

Check how the algorithms have been coded (and you can change them if you don't like them). You don't agree ICA is good enough? Make it so yourself!

eegica

icajade

EEG R

Arguments

X        Data matrix with n rows (channels) and p columns (time points).
nc       Number of components to extract.
center   If TRUE, columns of X are mean-centered before ICA decomposition.
maxit    Maximum number of algorithm iterations to allow.
tol      Convergence tolerance.
Rmat     Initial estimate of the nc-by-nc orthogonal rotation matrix.
type     Type of ICA decomposition: type="time" extracts temporally independent components, and type="space" extracts spatially independent components.
method   Method for ICA decomposition: method="imax" uses Infomax, method="fast" uses FastICA, and method="jade" uses JADE.
...      Additional inputs to icaimax or icafast function.

http://rpackages.ianhowson.com/cran/eegkit/man/eegica.html

EEG R

# get "c" subjects of "eegdata" data
data(eegdata)
idx <- which(eegdata$group=="c")
eegdata <- eegdata[idx,]

# get average data (across subjects)
eegmean <- tapply(eegdata$voltage,list(eegdata$channel,eegdata$time),mean)

# remove ears and nose
acnames <- rownames(eegmean)
idx <- c(which(acnames=="X"),which(acnames=="Y"),which(acnames=="nd"))
eegmean <- eegmean[-idx,]

# get spatial coordinates (for plotting)
data(eegcoord)
cidx <- match(rownames(eegmean),rownames(eegcoord))

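EEG R

# temporal ICA with 4 components
icatime <- eegica(eegmean,4)
icatime$vafs   # variance accounted for by each component

# open a new graphics device (dev.new() also works outside macOS, unlike quartz())
dev.new()
par(mfrow=c(4,2))
tseq <- (0:255)*1000/255
for(j in 1:4){
  par(mar=c(5.1,4.6,4.1,2.1))
  sptitle <- bquote("VAF: "*.(round(icatime$vafs[j],4)))
  # left panel: component time course; right panel: its scalp map
  eegtime(tseq,icatime$S[,j],main=bquote("Component "*.(j)),cex.main=1.5)
  eegspace(eegcoord[cidx,4:5],icatime$M[,j],main=sptitle)
}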

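EEG R

# spatial ICA with 4 components
icaspace <- eegica(eegmean,4,type="space")
icaspace$vafs   # variance accounted for by each component

# open a new graphics device (dev.new() also works outside macOS, unlike quartz())
dev.new()
par(mfrow=c(4,2))
tseq <- (0:255)*1000/255
for(j in 1:4){
  par(mar=c(5.1,4.6,4.1,2.1))
  sptitle <- bquote("VAF: "*.(round(icaspace$vafs[j],4)))
  # for spatial ICA the roles of M and S are swapped relative to temporal ICA
  eegtime(tseq,icaspace$M[,j],main=bquote("Component "*.(j)),cex.main=1.5)
  eegspace(eegcoord[cidx,4:5],icaspace$S[,j],main=sptitle)
}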

Useful links

Getting started:

Level Zero R guide (mandatory even if you don't do EEG):
http://hnm.stat.cmu.edu/2014-06%20UVA%20WORKSHOP/00%20PreReads/RTutorial-Level0.pdf

Basic Linux commands (any tutorial online will be fine):
https://www-uxsup.csx.cam.ac.uk/pub/doc/suse/suse9.0/userguide-9.0/ch24s04.html

Bash scripting:
http://www.thegeekstuff.com/2010/03/introduction-to-bash-scripting/

Contact for help or further information:

Karim Malki

MRC Social, Genetic and Developmental Psychiatry Centre
Institute of Psychiatry, King's College London
PO80 De Crespigny Park
London SE5 8AF
UK

T: +44 (0)20 7848 0969
F: +44 (0)20 7848 [email protected]