Making a Psychometric (Dr Benjamin Cowan, Lecture 9)
What this lecture will cover
What is a questionnaire?
Development of questionnaires
Item development
Scale options
Scale reliability & validity
Factor Analysis
What is a questionnaire?
Some concepts, such as attitudes, emotions, and opinions, are difficult to measure directly using measurements like time or accuracy
We need to design psychometrics for these if we are to research them
Why would we want to make a psychometric?
If we are looking at a new concept that hasn't been measured before
This happens a lot in HCI with the development of new technologies
Because a metric needs to measure something specific for it to have value, we need to design new measures, or tweak existing ones, for new technologies
This means adding items and re-testing
Example: Anxiety towards Facebook posting
Let's say we wanted to make a measure of how anxious people are about posting to Facebook
This measure (our questionnaire) is made of attitude phrases (or items)
Stages of item development
Literature review: what are the key concepts in studying anxiety?
Measure review: what is available? How is anxiety currently measured?
Focus groups/interviews: what is important in Facebook anxiety?
Questions about Facebook and negative emotions give an indication of how people describe the concepts, thus improving item wording
Generating items: interviews
A conversation with a purpose
4 main types:
Unstructured
Semi-structured
Structured
Group
Unstructured interviews
Exploratory
Talk around an area
Planning the areas for discussion rather than specific questions
Can explore topics as they come up
Structured Interviews
Predetermined questions
Standardised for all interviewees
Semi-structured interviews
Basic script used with all participants
A mix of structured and unstructured interviews: some questions are covered with everyone, and the rest is a free-flowing conversation
What interview type to use?
Depends on:
How specific you need to get
Purpose of the interview
Stages of item development
This will allow you to get an idea of:
Potential items
Potential categories that need to be covered (factors)
Pilot study
Large number of items
Participants rate:
Clarity of wording
Clarity of concept in the item
Experts in the area to review items
The good, the bad, the ugly
Good item
Clear, well worded, one concept, to the point
"I feel stressed when using Facebook"
Bad item
Can be clearly worded but does not cover one concept
"I feel stressed because of so many people on Facebook and it is hard to use"
Ugly item
Poorly worded and doesn't cover one concept
"Stress is something I feel all of the time when using Facebook because people on it are plentiful and it's difficult"
This can happen when questionnaires are mis-translated
Common scales used
Likert scales (Likert, 1932): 3-point, 5-point, 7-point, 9-point
The more points, the larger the variance of responses on an item
There are arguments over which is best, but 5-point is most common
The use of a "neutral point" is also debated
Semantic differential
Uses two polar-opposite adjectives at the ends of a scale
Which to use?
Strong - Not Strong (bad)
Strong - Weak (good)
Important concepts in item response
Response acquiescence set
A propensity for participants to answer positively to items
Counter it by balancing the psychometric as much as possible (positively and negatively worded items) and by item randomisation
Social desirability
Responding with what you feel is socially appropriate
So...
We have our items
We have piloted them with participants
We now need to assess how good our questionnaire (or psychometric) is
Good psychometrics have:
High reliability
High validity
Possess a set of norms (baselines/guides)
Reliability
Stability of the test score over time
Test-Retest Reliability
Internal consistency of the test
Internal consistency reliability
The extent to which the items are measuring the same underlying concept
Test-retest reliability
Testing the same participants on the measure on two occasions
Scores are then correlated to see the strength of the relationship
Over 0.7 is good test-retest reliability
[Diagram: test at Time 1, 6-month gap, test at Time 2]
Why would the correlation not be perfect?
Between times there may be changes on the variables
Some people may have become less anxious over time
Test error
e.g. feeling ill, bored, tired
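The test-retest check above is just a correlation between the two administrations. A minimal sketch in Python, using made-up scores (not data from the lecture):

```python
import numpy as np

# Hypothetical total scores for the same 8 participants at Time 1 and,
# after a 6-month gap, at Time 2 (illustrative data only).
time1 = np.array([12, 18, 25, 9, 30, 22, 15, 27])
time2 = np.array([14, 17, 24, 11, 28, 20, 16, 29])

# Pearson correlation between the two administrations.
r = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest reliability: r = {r:.2f}")  # over 0.7 counts as good
```

Real changes in the participants between administrations (and test error) pull this correlation below 1.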
Internal consistency reliability
The extent to which each item measures the same underlying concept
In our Facebook posting anxiety scale we would expect all the items to be measuring elements of anxiety, not measuring the usability of Facebook
Internal consistency measures
Split-half method
Divide the measure in two randomly and correlate the scores on the two halves
Cronbach's alpha (most commonly used)
The average of all possible split-half reliabilities
0.7 is seen as a good alpha
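Cronbach's alpha can be computed directly from the number of items, the item variances, and the variance of the total score. A minimal sketch with invented Likert responses:

```python
import numpy as np

def cronbach_alpha(items):
    """items: participants x items matrix of scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses from 6 participants to 4 Likert items.
scores = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 5, 4],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
    [1, 2, 1, 2],
])
print(f"alpha = {cronbach_alpha(scores):.2f}")  # 0.7 or above is good
```

Because these invented items all rise and fall together, alpha comes out well above 0.7.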
What can impact this reliability?
The number of items
More items mean more of the concept can be covered
Weigh up the number of items against participant boredom
10 items is considered the minimum for a reliable test
Can a measure be too internally consistent? (Cattell, 1957)
Using items which effectively measure the same thing
E.g. "I like Facebook" and "Facebook is something I like"
They are the same item, just with different wording
This leads to a "bloated specific"
Cronbach's alpha analysis
The analysis looks at the correlations of each item's score with the total questionnaire score (item-total correlations)
Items with item-total correlations lower than 0.3 should be removed, as they do not correlate well
The test output also gives us an idea of what alpha would be without each item - great for item removal
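The item-total screen described above can be sketched in a few lines. Here each item is correlated with the total of the remaining items (the corrected item-total correlation), and alpha is recomputed with the item removed; the data are invented, with item 4 deliberately inconsistent:

```python
import numpy as np

def cronbach_alpha(items):
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                            / items.sum(axis=1).var(ddof=1))

# Hypothetical responses: 6 participants, 4 items; item 4 is noisy.
scores = np.array([
    [4, 5, 4, 1],
    [2, 2, 3, 5],
    [5, 5, 5, 2],
    [3, 3, 2, 4],
    [4, 4, 4, 1],
    [1, 2, 1, 5],
], dtype=float)

for i in range(scores.shape[1]):
    rest = np.delete(scores, i, axis=1)          # all other items
    # Corrected item-total correlation: item vs. total of the remaining items.
    r = np.corrcoef(scores[:, i], rest.sum(axis=1))[0, 1]
    a = cronbach_alpha(rest)                     # alpha if this item is deleted
    flag = "  <- candidate for removal" if r < 0.3 else ""
    print(f"item {i + 1}: item-total r = {r:+.2f}, "
          f"alpha without it = {a:.2f}{flag}")
```

Item 4 falls below the 0.3 cut-off and is flagged, and alpha rises once it is dropped.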
Validity of a test
A test can be reliable but not valid
It could be high in reliability but not measuring what it proclaims to measure
It is not as simple as looking at the item wordings to deduce this
We need to identify whether our measure behaves as predicted
Validity assessment
Face validity
The items seem to be worded right for the concept being measured
This is a poor test of validity
E.g. "I am quite easily distracted" looks fine but can be interpreted differently by participants
Concurrent validity
Correlation of the test with another benchmark test given at the same time
Dubious when there is no clear benchmark
Validity assessment
Predictive validity
The measure is able to predict some criterion
E.g. Facebook anxiety relates to posting behaviour
Need to be aware that modest relationships are likely
Many other factors matter to posting behaviour: closeness of Facebook friends, drunken messaging?
Sometimes clear criteria are not available
Beware of the difference between statistical significance and psychological significance
Construct validity (Cronbach & Meehl, 1955)
Allows a collection of results, rather than just one, to lead us to validity conclusions
It is usually the case that not all hypotheses are confirmed
Validity is therefore not as unequivocal as reliability: it is interpretive and subjective
Construct validity: a bank of hypotheses based on our knowledge of the concept
Our hypotheses for Facebook anxiety:
Should correlate positively and highly with other measures of anxiety (concurrent validity)
Should correlate positively with someone's fear of negative evaluation (concurrent validity)
Should not correlate with personality tests that don't measure anxiety
High scorers, compared with low scorers, should show less activity on Facebook and more leaving of Facebook (predictive validity)
Norms
We need to test our measures on:
A significant, representative proportion of the population (1000s of respondents)
A sample of people we'd expect to be high or low on the measure (for discriminatory markers)
This is built up over years of use
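Once a large norming sample exists, an individual score can be interpreted as a percentile rank against it. A minimal sketch, using a simulated norming sample rather than real norms:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical norming sample: total anxiety scores from 2000 respondents
# (simulated here; real norms come from years of data collection).
norm_sample = rng.normal(loc=30, scale=8, size=2000)

def percentile_rank(score, norms):
    """Percent of the norming sample scoring at or below this score."""
    return 100.0 * np.mean(norms <= score)

print(f"Score 45 -> {percentile_rank(45, norm_sample):.0f}th percentile")
print(f"Score 30 -> {percentile_rank(30, norm_sample):.0f}th percentile")
```

A score near the sample mean lands around the 50th percentile; a score well above it lands near the top of the distribution.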
Now we have
Gathered our items
Assessed their reliability
Assessed their validity
We are assuming at present that Facebook anxiety is uni-dimensional.
This might not be true; there may be many factors to it, which we have picked up in our measure...
What are factors?
Each questionnaire item gives a score
There will be items that correlate heavily together
Factor analysis is fundamentally used to reduce the data into the smallest number of explanatory concepts
A factor is a combination of variables whose grouping indicates a relationship
What are factors?
Each item has a factor loading: the correlation of that item with the factor
Some items will have high loadings; some will have a low or no loading at all on a specific factor
Loadings of 0.4 or above are seen as helpful in defining a factor
Items should only load heavily on one factor
If they don't, they are candidates for rewording
Shared variance
The correlation coefficient represents the amount of agreement (or shared variance) between two sets of scores
Square the correlation coefficient to get the % agreement
[Diagram: overlapping circles for Variable x variance and Variable y variance; the overlap is the shared (common) variance]
Shared variance & communality
By squaring a factor loading we can identify how much shared variance there is between the item and the factor
Squared loadings can be thought of as the contribution that the item makes to the factor
If we do this for each factor loading an item has, we get the item's communality: the amount of variance shared between the item and all the factors
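The loading-squaring step above is simple arithmetic; a sketch with invented loadings for one item on three factors:

```python
import numpy as np

# Hypothetical factor loadings for one item on three extracted factors.
loadings = np.array([0.70, 0.30, 0.10])

shared = loadings ** 2          # shared variance with each factor
communality = shared.sum()      # variance the item shares with all factors

print("shared variance per factor:", np.round(shared, 2))  # [0.49 0.09 0.01]
print(f"communality = {communality:.2f}")                  # 0.59
```

So this item shares 49% of its variance with factor 1, and 59% with the factor solution as a whole.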
Factor extraction
Eigenvalues indicate the importance of an extracted factor in explaining the variance in the data
There will be a few factors with high eigenvalues and lots with low ones
It makes sense to keep the most important factors
A rule of thumb is to keep factors with eigenvalues > 1 (a factor with an eigenvalue above 1 explains more variance than a single standardised item)
The number to extract can also be identified using a scree plot (Cattell, 1966): the Y axis is the eigenvalues, the X axis is the number of factors
[Scree plot: eigenvalues (Y axis) against number of factors (X axis); the point of inflexion marks how many factors to retain]
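The eigenvalues come from the correlation matrix of the items, and the eigenvalue > 1 rule is then a one-line count. A sketch with a made-up correlation matrix in which items 1-3 form one cluster and items 4-5 another:

```python
import numpy as np

# Hypothetical correlation matrix for five questionnaire items.
R = np.array([
    [1.0, 0.6, 0.5, 0.1, 0.1],
    [0.6, 1.0, 0.6, 0.1, 0.2],
    [0.5, 0.6, 1.0, 0.2, 0.1],
    [0.1, 0.1, 0.2, 1.0, 0.5],
    [0.1, 0.2, 0.1, 0.5, 1.0],
])

eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]  # largest first
keep = int(np.sum(eigenvalues > 1))                 # eigenvalue > 1 rule

print("eigenvalues:", np.round(eigenvalues, 2))
print(f"factors with eigenvalue > 1: {keep}")
```

The eigenvalues sum to the number of items (here 5), and the two item clusters produce two eigenvalues above 1, matching the rule of thumb.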
Factor rotation
Looking for the "best fit": the factor structure with the clearest interpretation
Sometimes this involves rotation to get the clearest, simplest factor structure
A simple factor structure is one that has a few high-loading items with the rest being near 0 (Cattell, 1978)
Methods of rotation
The method you choose depends on how correlated you feel the factor scores should be, based on theoretical reasoning
We would expect our questionnaire to have factors: 1) anxiety about social posting, 2) anxiety about interface interaction, 3) social confidence
We would expect the scores from these to be correlated
We would therefore use a method that takes this correlation into consideration: Direct Oblimin
This is an oblique method of rotation (it allows the factors to correlate)
Methods of rotation
If we felt the factors should not correlate, we could instead use the Varimax method
This is an example of orthogonal rotation: it ensures the extracted factors are not correlated.
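To make the idea of orthogonal rotation concrete, here is a sketch of the standard iterative varimax algorithm in numpy (statistics packages implement this for you; the loadings below are invented). Rotation only re-orients the factors, so each item's communality is unchanged:

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonal (varimax) rotation of an items x factors loadings matrix."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)          # current orthogonal rotation matrix
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        # SVD step that nudges the rotation towards maximal loading variance.
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - (1.0 / p) * Lr @ np.diag((Lr ** 2).sum(axis=0))))
        R = u @ vt
        d_new = s.sum()
        if d_new < d * (1 + tol):  # stop when the criterion stops improving
            break
        d = d_new
    return L @ R

# Hypothetical unrotated loadings for six items on two factors.
unrotated = np.array([
    [0.7,  0.4], [0.6,  0.5], [0.7,  0.3],
    [0.5, -0.5], [0.6, -0.4], [0.4, -0.6],
])
rotated = varimax(unrotated)
print(np.round(rotated, 2))  # items now load mainly on one factor each
```

After rotation, each row has one dominant loading and one near 0: the simple structure the lecture describes.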
Considerations
Sample size
The number of people needed in the sample is debated
100 participants for stable factors (Kline, 1999)
Using factor analysis in questionnaire construction
Give participants the questionnaire
Conduct a factor analysis
For any items that load highly on more than one factor, check for concept clarity
Check that the items with loadings > 0.3 cover most of what we need in the scale; if not, write more items
Replicate this on each new sample
Validate the scale factors and calculate their reliability
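The item-screening step in that workflow can be sketched as a simple scan of the loadings matrix (invented loadings; the 0.4 threshold is the one the lecture gives for defining a factor):

```python
import numpy as np

# Hypothetical rotated loadings: five items on two factors.
loadings = np.array([
    [0.72, 0.10],
    [0.65, 0.05],
    [0.45, 0.52],   # loads highly on both factors
    [0.08, 0.68],
    [0.20, 0.15],   # loads highly on neither factor
])

high = np.abs(loadings) > 0.4
for i, row in enumerate(high):
    if row.sum() > 1:
        print(f"item {i + 1}: loads on several factors -> check concept clarity")
    elif row.sum() == 0:
        print(f"item {i + 1}: no loading above 0.4 -> rewrite or remove")
```

Item 3 is flagged for rewording (cross-loading) and item 5 for removal or rewriting, mirroring the checks in the list above.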
Making a psychometric
Takes a lot of time:
To develop the items
To test on a wide range of samples
To test a large bank of hypotheses on relationships to ensure its validity
Sometimes it cannot be avoided
Readings
Kline, P. (2000). A Psychometrics Primer, Chapter 3. Free Association Books (£14.95 from Amazon)
Kline, P. (1994). An Easy Guide to Factor Analysis (available in the library)
Field, A. (2007). Chapter 15: Exploratory Factor Analysis