
Descriptive Methods

707.031: Evaluation Methodology Winter 2015/16

Eduardo Veas

What we do with the data depends on the measurement scales.

Measurement Scales


The complexity of measurements

• Nominal

• Ordinal

• Interval

• Ratio

(scales ordered from crude to sophisticated)

Nominal data

• arbitrarily assigning a code to a category or attribute: postal codes, job classifications, military ranks, gender

• mathematical manipulations are meaningless

• mutually exclusive categories

• each category is a level

• use: frequencies and counts (see the sketch below)
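A minimal R sketch (with hypothetical job codes, not from the slides) of treating nominal data as a factor, where counts are the meaningful summary:

# hypothetical nominal codes: the numbers are arbitrary labels
job <- factor(c(1, 2, 2, 3, 1, 2),
              levels = c(1, 2, 3),
              labels = c("admin", "tech", "sales"))
table(job)   # frequency counts per level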

Ordinal data

• ranking of an attribute

• the intervals between points on the scale are not intrinsically equal

• comparisons < or > are possible (see the sketch below)
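A small sketch with a hypothetical ordered factor: < and > comparisons are defined, but the spacing between levels carries no meaning:

# hypothetical ratings as an ordered factor
rating <- ordered(c("low", "high", "medium", "low"),
                  levels = c("low", "medium", "high"))
rating[2] > rating[1]   # TRUE: ranking comparisons work
table(rating)           # counts remain valid, as for nominal data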


Interval data

• equal distances between adjacent values, but no absolute zero

• temperature in C or F

• mean can be computed

• Likert scale data?

Ratio

• absolute zero

• supports the full range of mathematical operations

• time to complete a task, distance or velocity of the cursor

• count, normalized count (e.g., count per unit of time)


Frequencies



Frequency tables

• tab.courses <- as.data.frame(freq(ordered(courses), plot = FALSE))   # freq() is from the descr package

• CumFreq <- cumsum(tab.courses[-dim(tab.courses)[1], ]$Frequency)   # cumulative counts, excluding the Total row

• tab.courses$CumFreq <- c(CumFreq, NA)
• tab.courses


Interpreting frequency tables

        Frequency  Percent  CumPercent  CumFreq
1               2       20          20        2
2               3       30          50        5
3               4       40          90        9
4               1       10         100       10
Total          10      100          NA       NA


Contingency Tables


          Right-handed  Left-handed  Total
Males               43            9     52
Females             44            4     48
Total               87           13    100
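One way to rebuild this table in R (values copied from the slide); addmargins() appends the row and column totals:

handedness <- matrix(c(43, 9,
                       44, 4),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(Sex = c("Males", "Females"),
                                     Hand = c("Right-handed", "Left-handed")))
addmargins(handedness)   # adds the Sum row and column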


Modelling


Statistical models

• A model has to accurately represent the real-world phenomenon.

• A model can be used to predict things about the real world.

• The degree to which a statistical model represents the data collected is called the fit of the model.


Frequency distributions

• plot observations on the x-axis and a bar showing the count per observation

• ideally observations fall symmetrically around the center

• skew and kurtosis describe departures from this ideal shape

Histogram / Frequency distributions


Center of a distribution

• Mode: the score that occurs most frequently in the dataset
  • it may take several values
  • it may change dramatically with a single added score

• Median: the middle score (after ranking all scores)
  • for an even number of scores, average the two middle values
  • good for ordinal, interval and ratio data

• Mean: the average score
  • can be influenced by extreme scores

Dispersion of a distribution

• range: difference between lowest and highest score

• interquartile range: the difference between the upper and lower quartiles

Example (facebook friends scores used later in these slides): range = 252 - 22 = 230; dropping the outlier, 121 - 22 = 99

Fit of the mean

• deviance: the difference between each score and the mean (x - mean)

• sum of squared errors (SS): the sum of the squared deviances

• variance = SS / (N - 1)

• stddev = sqrt(variance)
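A by-hand sketch of these quantities, using the facebook friends scores that appear later in these slides; the results agree with var() and sd():

facebook <- c(22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252)
dev <- facebook - mean(facebook)    # deviances
SS  <- sum(dev^2)                   # sum of squared errors
SS / (length(facebook) - 1)         # variance, same as var(facebook)
sqrt(SS / (length(facebook) - 1))   # stddev, same as sd(facebook)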

Assumptions


Assumptions of parametric data

• normally distributed: the sample, or the error in the model

• homogeneity of variance:
  • correlational designs: the variance of one variable should be stable at all levels of the other variable
  • group comparisons: each sample comes from a population with the same variance

• interval data: the data should be at least at interval level

• independence: the behaviour of one participant does not influence that of another

Distributions for DLF


[Figure: density histograms of the hygiene scores on days 1, 2 and 3, each paired with a Q-Q plot (sample vs. theoretical quantiles)]

Quantify normality

Different groups


Exam histogram

[Figure: density histogram of the exam scores]

Exam histogram

[Figure: density histograms of the exam scores: overall, and separately for each university group]

Shapiro-Wilk test

• # Shapiro-Wilk test
• shapiro.test(rexam$exam)

• # if we are comparing groups, what matters is the normality within each group
• by(rexam$exam, rexam$uni, shapiro.test)


Reporting Shapiro-Wilk

• A Shapiro-Wilk test on the R exam scores indicated a significant deviation from normality, W = 0.96, p < .05.

Homogeneity of variance

• Levene's test (leveneTest() is from the car package):
  leveneTest(rexam$exam, rexam$uni, center = mean)

• Reporting: for the percentage on the R exam, the variances were similar for KFU and TUG students, F(1, 98) = 2.09

Homogeneity of variance

• in large datasets, Levene's test may come out significant even for small differences in variance

• double-check with the variance ratio (Hartley's Fmax), as sketched below
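A sketch of the variance-ratio check, assuming the rexam data and uni grouping used in the slides below:

# Hartley's Fmax: largest group variance over smallest group variance
group.vars <- by(rexam$exam, rexam$uni, var)
max(unlist(group.vars)) / min(unlist(group.vars))   # values near 1 suggest homogeneity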


Correlations



Everything is hard to begin with, but the more you practise the easier it gets


Relationships

• Everything is hard to begin with, but the more you practise the easier it gets

• increase in practice, increase in skill

• increase in practice, but skill remains unchanged

• increase in practice, decrease in skill

Correlations

• Bivariate: correlation between two variables

• Partial: correlation between two variables while controlling for the effect of one or more additional variables


Covariance

• are changes in one variable met with similar changes in the other variable?

• cross-product deviations (CPD): multiply the deviations of the two variables from their means and sum them

• covariance = CPD / (N - 1), as in the sketch below
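A by-hand sketch with hypothetical x and y; the result matches cov():

x <- c(1, 3, 4, 6, 8)
y <- c(2, 4, 5, 9, 10)
cpd <- sum((x - mean(x)) * (y - mean(y)))   # cross-product deviations
cpd / (length(x) - 1)                       # covariance, same as cov(x, y)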


Covariance II

• Positive: both variables vary in the same direction

• Negative: variables vary in opposite directions

• covariance is scale-dependent, so its size cannot be compared across variables measured on different scales


Pearson correlation coefficient

• r = cov(x, y) / (s_x * s_y)

• Data must be at least interval

• Value between -1 and 1

• 1 -> variables positively correlated
• 0 -> no linear relationship
• -1 -> variables negatively correlated
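Continuing the hypothetical x and y from the covariance sketch: dividing by the standard deviations makes the measure scale-free.

x <- c(1, 3, 4, 6, 8)
y <- c(2, 4, 5, 9, 10)
cov(x, y) / (sd(x) * sd(y))   # Pearson's r, same as cor(x, y)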


Dataset Exams and Anxiety

• effects of exam stress and revision on exam performance

• questionnaire to assess anxiety relating to exams (EAQ)


Enter data

• examData <- read.delim("ExamAnxiety.dat", header = TRUE)

• examData2 <- examData[, c("Exam", "Anxiety", "Revise")]

• cor(examData2)


Pearson correlation

             Exam     Anxiety     Revise
Exam     1.0000000  -0.4409934  0.3967207
Anxiety -0.4409934   1.0000000 -0.7092493
Revise   0.3967207  -0.7092493  1.0000000


Confidence values

• rcorr(as.matrix(examData[, c("Exam", "Anxiety", "Revise")]))   # rcorr() is from the Hmisc package

• p-values (diagonal left blank):
             Exam  Anxiety  Revise
  Exam               0        0
  Anxiety      0             0
  Revise       0     0


Reporting Pearson’s CC

A Pearson correlation coefficient indicated a significant correlation between exam performance and anxiety, r = -.44, p < .01


Spearman’s correlation coefficient

• non-parametric test

• first rank the data, then apply the Pearson cc (see the sketch below)
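A sketch of that equivalence with hypothetical data:

x <- c(10, 20, 35, 50, 90)
y <- c(3, 1, 4, 2, 5)
cor(rank(x), rank(y))            # Pearson on the ranks
cor(x, y, method = "spearman")   # same value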


Liar Dataset

• a contest for telling the biggest lie

• 68 participants; their final ranking and a creativity questionnaire score


Spearman test

• liarData <- read.delim("biggestLiar.dat", header = TRUE)

• rcorr(as.matrix(liarData[, c("Position", "Creativity")]), type = "spearman")

•             Position  Creativity
  Position        1.00       -0.31
  Creativity     -0.31        1.00


Reporting Spearman

A Spearman non-parametric correlation test indicated a significant correlation between creativity and ranking in the World's Biggest Liar contest, rs = -.37, p < .001


Kendall's tau (non-parametric)

• preferred for small datasets with a large number of tied ranks

• cor.test(liarData$Position, liarData$Creativity, alternative="less", method="kendall")

• z = -3.2252, p-value = 0.0006294
• alternative hypothesis: true tau is less than 0
• sample estimates:
        tau
  -0.3002413

Reporting Kendall’s test

A Kendall's tau correlation coefficient indicated a correlation between creativity and performance in the World's Biggest Liar contest, tau = -.30, p < .001


Biserial and point-biserial correlations

• one variable is dichotomous (categorical with 2 categories)

• point-biserial: for a discrete dichotomy (e.g., dead vs. alive); see the sketch below

• biserial: for a continuous dichotomy (e.g., passing an exam, where the underlying ability is continuous)
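Since the point-biserial coefficient is simply Pearson's r with the dichotomy coded 0/1, a sketch with hypothetical data:

group <- c(0, 0, 1, 1, 1, 0, 1)         # discrete dichotomy coded 0/1
score <- c(12, 10, 15, 18, 14, 9, 17)   # continuous variable
cor(group, score)                       # point-biserial correlation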


Readings

• Discovering Statistics Using R (Andy Field, Jeremy Miles, Zoë Field)


R



set work directory

• setwd("/new/work/directory")   # set the working directory

• getwd()   # print the current working directory

• ls()   # list the objects in the current workspace


packages

• install.packages("package.name")   # installing packages

• library(package.name) # loading a package

• package::function() # disambiguating functions


Nominal and Ordinal data

• mydata$v1 <- factor(mydata$v1, levels = c(1, 2, 3), labels = c("red", "blue", "green"))

• mydata$v1 <- ordered(mydata$v1, levels = c(1, 3, 5), labels = c("Low", "Medium", "High"))


Missing data

• is.na(var)   # tests for missing values / also works on rows

• mydata$v1[mydata$v1 == 99] <- NA   # select rows where v1 is 99 and recode column v1

• x <- c(1, 2, NA, 3)
  mean(x)                 # returns NA
  mean(x, na.rm = TRUE)   # ignores the NA

• newdata <- na.omit(mydata)   # spawn a dataset without missing data

Install and load packages

• install.packages("car"); install.packages("ggplot2"); install.packages("pastecs"); install.packages("psych"); install.packages("Rcmdr"); install.packages("descr")

• library(car); library(ggplot2); library(pastecs); library(psych); library(Rcmdr); library(descr)


Enter data

• id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
• sex <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
• courses <- c(2.0, 1.0, 1.0, 2.0, 3.0, 3.0, 3.0, 2.0, 4.0, 3.0)
• sex <- factor(sex, levels = c(1:2), labels = c("M", "F"))
• example <- data.frame(ID = id, Gender = sex, Courses = courses)


Frequency Distributions

• facebook <- c(22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252)

• library(modeest)
• mfv(facebook)   # most frequent value (mode); mfv() is from the modeest package

• mean(facebook)

• median(facebook)

Dispersion

• quantile(facebook)   # quartiles

• IQR(facebook)   # interquartile range

• var(facebook)

• sd(facebook)


describing your data

• # load meaningful data
• lecturerData <- read.csv("lecturerData.csv", header = TRUE)

• # get statistics; stat.desc() is from the pastecs package
• stat.desc(lecturerData[, c("friends", "income")], basic = FALSE, norm = TRUE)


describing your data II

• # print frequency table

• tab.friends <- as.data.frame(freq(ordered(lecturerData$friends), plot = FALSE))

• tab.friends.cumsum <- cumsum(tab.friends[-dim(tab.friends)[1], ]$Frequency)

• tab.friends$CumFreq <- c(tab.friends.cumsum, NA)
• tab.friends


Testing normality

• # load the DLF data
• dlf <- read.delim("DownloadFestival.dat", header = TRUE)

• data about hygiene collected during a festival (3 days)


Plot the distribution

• hist.day1 <- ggplot(dlf, aes(day1)) + theme(legend.position = "none") + geom_histogram(aes(y = ..density..), colour = "black", fill = "white") + labs(x = "Hygiene score on day1", y = "Density")

• hist.day1 + stat_function(fun = dnorm, args = list(mean = mean(dlf$day1, na.rm = TRUE), sd = sd(dlf$day1, na.rm = TRUE)), colour = "blue", size = 1)

• qqplot.day1 <- qplot(sample = dlf$day1, stat = "qq")

Plot day 1

[Figure: density histogram of the day-1 hygiene scores; one outlying score stretches the x-axis out to 20]

Offending score

• # print the offending score
• dlf$day1[dlf$day1 > 5]

• # correct the bad score
• dlf$day1[dlf$day1 > 5] <- 2.02

[Figure: Q-Q plot of the day-1 hygiene scores (sample vs. theoretical quantiles)]

Quantify normality

• describe(cbind(dlf$day1, dlf$day2, dlf$day3))   # describe() is from the psych package

• stat.desc(dlf[, c("day1", "day2", "day3")], basic = FALSE, norm = TRUE)


Groups

• rexam<-read.delim("rexam.dat", header=TRUE)

• # obtain statistics for exam, computer, lectures and numeracy
• round(stat.desc(rexam[, c("exam", "computer", "lectures", "numeracy")], basic = FALSE, norm = TRUE), digits = 3)

• hist.exam <- ggplot(rexam, aes(exam)) + theme(legend.position = "none") + geom_histogram(aes(y = ..density..), colour = "black", fill = "white") + labs(x = "exam", y = "density") + stat_function(fun = dnorm, args = list(mean = mean(rexam$exam, na.rm = TRUE), sd = sd(rexam$exam, na.rm = TRUE)), colour = "blue", size = 1)

Add factors

• # set uni to be a factor
• rexam$uni <- factor(rexam$uni, levels = c(0:1), labels = c("KFU", "TUG"))

• by(rexam[, c("exam", "computer", "lectures", "numeracy")], rexam$uni, stat.desc, basic = FALSE, norm = TRUE)


Get subsets and individual histograms

• # create subsets of the dataset for each factor level
• kfu <- subset(rexam, rexam$uni == "KFU")
• tug <- subset(rexam, rexam$uni == "TUG")

• # now we can create histograms for each subset
• hist.exam.kfu <- ggplot(kfu, aes(exam)) + theme(legend.position = "none") + geom_histogram(aes(y = ..density..), colour = "black", fill = "white") + labs(x = "exam", y = "density") + stat_function(fun = dnorm, args = list(mean = mean(kfu$exam, na.rm = TRUE), sd = sd(kfu$exam, na.rm = TRUE)), colour = "blue", size = 1)
