How to use R-stat? – a simple guideline.
Prepared by: Krishna Dhakal
Academic level: M.Sc.Ag
Department: Genetics and Plant Breeding
Date of final work: March 2, 2016
Agriculture and Forestry University, Chitwan, Nepal
INTRODUCTION TO R
R is an elegant, object-oriented programming language
R is an integrated suite of software facilities for data manipulation, simulation, calculation and graphical display
It handles and analyzes data very effectively and it contains a suite of operators for calculations on arrays and matrices
R is available in Windows and Macintosh versions, as well as in various flavors of Unix and Linux
It is currently maintained by the R Core development team, a hard-working, international group of volunteer developers
The R project web page is http://www.r-project.org
To download the software directly, go to http://cran.us.r-project.org/
The R project was started in the early 1990s by Robert Gentleman and Ross Ihaka (that is where the name "R" comes from) at the Statistics Department of the University of Auckland, and R was released as free software in 1995
Lacking in R
It has a limited graphical interface (S-Plus has a good one), which means it can be harder to learn at the outset
The command language is a programming language, so students must learn to handle syntax issues etc.
Starting and quitting with R
First of all, download the latest version of R (the installer file)
Install it on your PC
The R icon will then appear on your desktop
Double-click on it
The view when R opens
When R is started, the program's GUI (graphical user interface) window appears
Under the opening message in the R Console is the > ("greater than") prompt
At the > prompt, you tell R what you want it to do
You give R a command, and R does the work and gives the answer
If your command is too long to fit on a line, or if you submit an incomplete command, a "+" is shown as the continuation prompt
To quit R, type q() or use the Exit option in the File menu
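As a small illustration of the prompt and the continuation prompt, the lines below can be typed in the R Console. The object name total is made up for this sketch.

```r
# A complete command: R evaluates it and prints the answer
2 + 3 * 4

# An incomplete command: the bracket is still open at the end of the
# first line, so the console shows "+" and waits for the rest
total <- (10 + 20 +
          30)
total
```

Typing the name of an object and pressing Enter, as in the last line, prints its value.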
The useful tips
While typing instructions in R, you can save yourself a lot of typing by learning to use the arrow keys effectively
Each command you submit is stored in the History; the up arrow (↑) navigates backwards along this history and the down arrow (↓) forwards
The left (←) and right (→) arrow keys move backwards and forwards along the command line
These keys, combined with the mouse for copying and cutting/pasting, make it very easy to edit and execute previous commands
The workspace
All variables or "objects" created in R are stored in what is called the workspace
To see which variables are in the workspace, use the function ls() to list them (this function does not need any argument between the parentheses)
To remove objects from the workspace (you will want to do this occasionally when your workspace gets too cluttered), use the rm() function
In Windows, you can clear the entire workspace via the "Remove all objects" option under the "Misc" menu
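The workspace functions above can be tried with a sketch like the following. The object names plant_height and note are invented for the example.

```r
# Create two example objects (names are made up for this sketch)
plant_height <- c(85, 92, 78)
note <- "first trial"

ls()               # lists the objects currently in the workspace
rm(plant_height)   # removes one object
ls()               # plant_height is no longer listed

# rm(list = ls())  # would remove every object, like "Remove all objects"
```

The last line is left as a comment because it clears everything in the workspace at once.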
When exiting R, the software asks if you would like to save your workspace image
If you click yes, all objects (both new ones created in the current session and others from earlier sessions) will be available during your next session
If you click no, all new objects will be lost and the next session will start from the workspace as it was when the image was last saved
Get in the habit of saving your work; it will probably help you in the future
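Saving can also be done by command instead of through the exit dialog. The sketch below saves one object to a temporary file and loads it back; the object name grain_yield and the file location are assumptions made for the example.

```r
# Example object and a temporary file (both hypothetical)
grain_yield <- c(210, 250, 198)
image_file <- tempfile(fileext = ".RData")

# Save the chosen object; save.image(image_file) would save the whole workspace
save(grain_yield, file = image_file)

# Remove the object, then bring it back from the saved file
rm(grain_yield)
load(image_file)
grain_yield
```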
Installing packages
R comes with a large number of packages; always use reliable and proven packages, since R gives no guarantee against misuse
Choose packages according to your field of study
For agriculturists, packages such as lme4, agricolae, lmerTest, MASS and car are useful
If you have downloaded the packages separately, you can install them by the following procedure
Go to "packages" (at the top of the R screen), click on "install packages from local zip files", choose the zip file and click open
If you have not downloaded the zip files, you can install the packages online
For online installation, go to "packages", click on "install packages", choose the packages and download them
R is a sea of programs: if you know how to swim you will find everything you need, so explore it yourself
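The same installation can be done from the prompt. The sketch below only checks which of the packages named above are missing; the install step needs an internet connection, so it is left as a comment. The names wanted, have and missing_pkgs are made up for the example.

```r
# Packages mentioned in this guide
wanted <- c("lme4", "lmerTest", "agricolae", "MASS", "car")

# Which of them are already installed on this computer?
have <- wanted %in% rownames(installed.packages())
missing_pkgs <- wanted[!have]
missing_pkgs

# Needs an internet connection, so it is left commented out here
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

# An installed package must also be loaded in every new session, e.g.
# library(car)
```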
Analysis of data with R (procedure)
During data sheet preparation in Excel, always use abbreviated column names and always note their full forms
For example: dm - days to maturity, ht - plant height, bms - biomass, gps - grains per spike, gy - grain yield, tw - test weight
Now convert the Excel file into a csv file
Go to the menu in Excel, click "Save As", choose "CSV (comma delimited)", give the file a short name and remember it
Make a new folder and place the csv file in it (on either the C or the D drive, whichever you prefer)
Now open R and start your job
First, check the current working directory by typing getwd() and pressing Enter
Set the working directory: type setwd("D:/assignment"), giving the directory inside the brackets in double quotation marks; in this example "D" is the drive that contains a folder named assignment
Loading data into R: e.g. mod=read.csv("heat.csv",header=T); here "mod" is a given name (you can use any), heat.csv is the csv file name, which you should replace with your own (exactly the same name, paying attention to upper-case and lower-case letters), and read.csv is the command for reading the file into R
When you type mod and press Enter, all your data appear on your screen
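The steps above can be tried without any file of your own. The sketch below writes a tiny, made-up heat.csv into a temporary folder and reads it back; the values and the use of tempdir() are assumptions for illustration only.

```r
# Made-up data written to a temporary folder so that the sketch is self-contained
old_wd <- setwd(tempdir())          # setwd() returns the previous directory
writeLines(c("rep,block,entry,ht,gy",
             "1,1,1,85,210",
             "1,1,2,92,250",
             "2,1,1,88,225",
             "2,1,2,90,240"),
           "heat.csv")

getwd()                             # shows the current working directory
mod <- read.csv("heat.csv", header = TRUE)
mod                                 # prints the data
str(mod)                            # shows the column names and types

setwd(old_wd)                       # go back to the original directory
```

With your own data, replace tempdir() by your folder (for example "D:/assignment") and skip the writeLines() step.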
If there is missing data, just leave the respective cell empty; if you enter it as "0" in the data sheet, R treats it as a real observation and it will cause errors later in the calculations
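A short sketch of why an empty cell is safer than a zero. The three yield values are invented; the empty cell is read as NA (missing), which can be skipped in calculations, while a zero pulls the mean down.

```r
# Made-up data: the grain yield of the second plot is missing (empty cell)
csv_text <- "rep,entry,gy
1,1,200
1,2,
1,3,300"
with_blank <- read.csv(text = csv_text, header = TRUE)

with_blank$gy                        # the empty cell is shown as NA
is.na(with_blank$gy)                 # TRUE marks the missing value
mean(with_blank$gy, na.rm = TRUE)    # mean of the two real values: 250

# The same data with 0 typed in place of the missing value
with_zero <- with_blank
with_zero$gy[is.na(with_zero$gy)] <- 0
mean(with_zero$gy)                   # lower, because 0 counts as a real yield
```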
Setting of factors: replication, block and entry (genotypes) should be taken as factors and the others as variables
Example: rep=as.factor(mod$rep); here the "rep" on the left is the name given, "mod" comes from mod=read.csv("heat.csv",header=T), and the "rep" after the $ sign is the column name used in the Excel sheet for replication
block=as.factor(mod$block), entry=as.factor(mod$entry)
You can use other names, but the commands later in this guide use these lower-case names
Missing value of Data
Yield, height, biomass, grains per spike, test weight etc. are variables; for these we simply write: ht=mod$ht, dm=mod$dm, gps=mod$gps, bms=mod$bms, gy=mod$gy, tw=mod$tw
Making of data frame: example, data=data.frame(rep,block,entry,ht,dm,gps,bms,gy,tw)
Always use the exact names that you have given to the respective factors and variables; R distinguishes upper-case from lower-case letters
To get a summary of your data, type summary(data) and press Enter
The data summary gives you the minimum, maximum, mean, median and quartile values
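Setting factors, building the data frame and summarising it can be sketched as follows. The six rows of values are made up and stand in for the data read from the csv file; the lower-case names follow the abbreviations of the data sheet.

```r
# Made-up values standing in for the data read from the csv file
mod <- data.frame(rep   = c(1, 1, 2, 2, 3, 3),
                  entry = c(1, 2, 1, 2, 1, 2),
                  ht    = c(85, 92, 88, 90, 86, 95),
                  gy    = c(210, 250, 225, 240, 215, 260))

# Design columns become factors, measurements stay numeric
rep   <- as.factor(mod$rep)
entry <- as.factor(mod$entry)
ht    <- mod$ht
gy    <- mod$gy

data <- data.frame(rep, entry, ht, gy)
str(data)       # rep and entry are listed as Factor, ht and gy as num
summary(data)   # counts per level for factors, quartiles and mean for variables
```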
QQPLOT
A Q-Q plot is used to check the distribution of the data of a particular variable
It helps us to find the extreme outliers
For qqPlot, the package "car" must be installed
After installing "car", give the command require(car) and press Enter
The model fitted below also requires the packages "lme4" and "lmerTest", loaded in the same way
Now fit the model, e.g. mod.gy=lmer(gy~entry+(1|rep),data) (use a new name such as mod.gy so that the data stored in mod are not overwritten)
Then type qqPlot(resid(mod.gy)) and press Enter
You will see the picture on your screen
[Figure: normal Q-Q plots of resid(mod.ht) and resid(mod.gps) against norm quantiles]
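For readers who have not yet installed car and lme4, a similar picture can be produced with base R alone. This sketch is a stand-in, not the procedure above: it uses lm() with rep as a fixed effect instead of lmer(), and qqnorm()/qqline() instead of qqPlot(). The data are simulated.

```r
set.seed(1)
# Simulated RCBD-style data: 10 entries in 3 replications (made-up values)
trial <- expand.grid(entry = factor(1:10), rep = factor(1:3))
trial$gy <- 200 + 5 * as.numeric(trial$entry) + rnorm(nrow(trial), sd = 10)

fit <- lm(gy ~ entry + rep, data = trial)   # fixed-effects stand-in for lmer()
res <- resid(fit)

qqnorm(res)   # residual quantiles against normal quantiles
qqline(res)   # reference line; points far from it are suspect
```

Points lying close to the line suggest that the residuals are roughly normal.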
To find extreme outliers in your data
The process is the same as for qqPlot
The only difference is the command: for example, plot(resid(mod.ht)) shows the extreme outliers in height
Similarly, you can get the outliers in other variables just by changing the model name in the command
On the screen you will see the extreme outliers with their respective entry numbers
[Figure: Residuals vs Leverage plot for lm(dm ~ entry + rep), standardized residuals against leverage with Cook's distance contours; observations 78, 77 and 42 are labelled]
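The idea can be checked on simulated data where the outlier is known in advance. In this sketch one value of days to maturity (row 17) is deliberately inflated; the data, the model with lm() and the name suspect are all assumptions made for illustration.

```r
set.seed(2)
# Simulated data: 10 entries in 3 replications (made-up values)
trial <- expand.grid(entry = factor(1:10), rep = factor(1:3))
trial$dm <- 100 + rnorm(nrow(trial), sd = 1)
trial$dm[17] <- trial$dm[17] + 15        # deliberately planted outlier

fit <- lm(dm ~ entry + rep, data = trial)

plot(resid(fit))                         # residuals in row order
suspect <- which.max(abs(resid(fit)))    # row with the largest residual
trial[suspect, ]                         # shows the entry and rep of that row

plot(fit, which = 4)                     # Cook's distance, extreme rows labelled
```

After finding a suspect row, go back to the field book to check whether the value is a recording mistake before removing anything.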
Plotting histogram
To plot a histogram, simply give the command hist(tw); this plots the histogram of test weight
Similarly, you can plot the histogram of any other variable you want
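A runnable sketch of the histogram command with simulated test weights (the values are made up). Storing the result in h is optional; it only makes the class limits and counts available as numbers.

```r
set.seed(3)
tw <- rnorm(60, mean = 40, sd = 3)   # simulated test weights (made-up)

h <- hist(tw, main = "Histogram of test weight", xlab = "tw")
h$breaks   # class boundaries chosen by R
h$counts   # number of observations in each class
```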
Box plot
To produce a box plot, simply type: boxplot(gy~rep)
Here the box plot shows the grain yield for each replication
A box plot generally has 5 components: the two whisker ends give the extreme values, the middle line inside the box gives the median (Q2), the top of the box shows Q3 and the bottom of the box shows Q1
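The five components can be compared with their numerical values. The sketch below uses simulated grain yields in three replications; all values are made up.

```r
set.seed(4)
rep <- gl(3, 20)                          # factor with levels 1, 2, 3, 20 plots each
gy  <- rnorm(60, mean = 230, sd = 25)     # simulated grain yields (made-up)

boxplot(gy ~ rep, xlab = "rep", ylab = "gy")

five <- fivenum(gy)       # minimum, lower hinge, median, upper hinge, maximum
five
tapply(gy, rep, median)   # the middle line of each box
```

Points drawn separately beyond the whiskers are values that R flags as possible outliers.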
Correlation
To find the correlation between yield and another variable: cor.test(gy,tw), or any other pair you want
The correlation test gives you the correlation coefficient, which is either positive or negative, together with its p-value
Numerical value and graph
[Figure: scatter plot of ht against gy]
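Both the numerical value and the graph can be obtained as in the sketch below. The data are simulated so that test weight rises with grain yield; the name result is made up.

```r
set.seed(5)
gy <- rnorm(40, mean = 230, sd = 30)          # simulated grain yield
tw <- 30 + 0.04 * gy + rnorm(40, sd = 1)      # test weight built to rise with gy

result <- cor.test(gy, tw)
result$estimate   # correlation coefficient (positive here by construction)
result$p.value    # small values indicate a significant correlation

plot(gy, tw)      # the graph of the same relationship
```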
Shapiro test
It is used to check the normality of the variables
shapiro.test(tw), shapiro.test(gy) etc. (the function name starts with a lower-case "s")
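A minimal sketch of the test on simulated test weights (made-up values); sw is just a name chosen for the result.

```r
set.seed(6)
tw <- rnorm(50, mean = 40, sd = 3)   # simulated test weights

sw <- shapiro.test(tw)
sw$statistic   # W, close to 1 for normally distributed data
sw$p.value     # a p-value above 0.05 gives no evidence against normality
```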
ANOVA from linear model
In this model: e.g. analysis=lm(gy~entry+rep,data)
Here "analysis" is the name given to the fitted model (you can use a name of your own), gy is grain yield, modelled in relation to entry (genotype) and rep (replication)
Then give the command anova(analysis) and press Enter, and you will get your ANOVA
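The two commands can be tried on simulated data. In this sketch 8 entries are grown in 3 replications; the yields are made up and tab is just a name for the ANOVA table.

```r
set.seed(7)
# Simulated RCBD data: 8 entries x 3 replications (made-up values)
data <- expand.grid(entry = factor(1:8), rep = factor(1:3))
data$gy <- 200 + 6 * as.numeric(data$entry) + 4 * as.numeric(data$rep) +
  rnorm(nrow(data), sd = 8)

analysis <- lm(gy ~ entry + rep, data = data)
tab <- anova(analysis)
tab   # rows for entry, rep and Residuals with Df, Sum Sq, Mean Sq, F and p
```

With 24 plots the degrees of freedom are 7 for entry, 2 for rep and 14 for the residuals.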
ANOVA from linear mixed model
A linear mixed model is more reliable for ANOVA than a linear model, as it accounts for the random variation due to replication
To produce the ANOVA: mod.ht=lmer(ht~entry+(1|rep),data)
Here mod.ht is a name given, lmer is the function, ht is height, entry is genotype, rep is replication and data is the data frame
Similarly, you can get the ANOVA of other variables just by replacing "ht"; if you want the ANOVA of grain yield (gy), then: mod.gy=lmer(gy~entry+(1|rep),data)
Then type anova(mod.gy) and press Enter
The linear mixed model is used when the data are obtained from an "RCBD" design
When the design is different, other methods should be used
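A sketch of the mixed-model ANOVA on simulated RCBD data. It assumes that the package lmerTest (which also brings in lme4) is installed and simply prints a message if it is not. The heights and replication effects are made up.

```r
set.seed(8)
# Simulated RCBD data: 8 entries x 4 replications (made-up values)
data <- expand.grid(entry = factor(1:8), rep = factor(1:4))
rep_effect <- c(-6, -2, 3, 5)             # made-up replication effects
data$ht <- 90 + 2 * as.numeric(data$entry) +
  rep_effect[as.integer(data$rep)] + rnorm(nrow(data), sd = 3)

if (requireNamespace("lmerTest", quietly = TRUE)) {
  # entry is a fixed effect, rep is a random effect
  mod.ht <- lmerTest::lmer(ht ~ entry + (1 | rep), data = data)
  print(anova(mod.ht))                    # F test for the entry effect
} else {
  message("Package lmerTest is not installed, so the model was not fitted")
}
```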
PBIB test for alpha lattice design
If the field is laid out according to an alpha lattice design, the analysis is to be done using the PBIB test
It comes under the package "agricolae"
It has the following command: modelPBIB=PBIB.test(block,entry,rep,gy,k=12,method="REML",test="lsd",alpha=0.05,console=T,group=T)
Here "modelPBIB" is a name given and k is the number of plots or treatments in a block; method takes only one value, either "VC" or "REML", and test may be either "lsd" or "tukey"
This command is for grain yield; similarly you can find the values for other variables
Other functions in R
The package "agricolae" offers many functions, such as AUDPC analysis and AMMI analysis for finding G×E interactions
As I mentioned earlier, R is a sea; what you need is to explore it all
To look at correlations directly from the data frame, remove the factors and retain only the variables
E.g. from data=data.frame(rep,entry,block,gy,ht,dm,tw,bms), remove rep, entry and block: data=data.frame(ht,dm,bms,gy,tw)
Now give the command plot(data) and press Enter; you will see a scatter plot for every pair of variables, which shows the correlations
[Figure: scatterplot matrix of ht, dm, gps, bms, gy and tw]
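The scatterplot matrix can be complemented by the correlation coefficients themselves, using cor() on the same data frame. The sketch below uses simulated variables (made-up values); cor_matrix is a name chosen for the example.

```r
set.seed(9)
n <- 50
# Simulated variables (made-up values); bms depends on ht, gy depends on bms
ht  <- rnorm(n, mean = 90, sd = 8)
dm  <- rnorm(n, mean = 106, sd = 2)
bms <- 0.5 + 0.01 * ht + rnorm(n, sd = 0.2)
gy  <- 100 * bms + rnorm(n, sd = 30)
tw  <- rnorm(n, mean = 35, sd = 4)

data <- data.frame(ht, dm, bms, gy, tw)   # variables only, no factors

plot(data)                  # scatterplot matrix, one panel per pair
cor_matrix <- cor(data)     # correlation coefficient for every pair
round(cor_matrix, 2)
```

Each value in the table belongs to the panel in the same row and column of the plot.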
After climbing a great hill, one only finds that there are many more hills to climb.
Nelson Mandela
Thank you for your patience