Making a Psychometric (Dr Benjamin Cowan, Lecture 9)
What this lecture will cover
What is a questionnaire?
Development of questionnaires
Item development
Scale options
Scale reliability & validity
Factor Analysis
What is a questionnaire?
Some concepts, such as attitudes, emotions, and opinions, are difficult to measure directly using measurements like time or accuracy
We need to design psychometrics for these if we are to research them
Why would we want to make a psychometric?
If we are looking at a new concept that hasn't been measured before
This happens a lot in HCI with the development of new technologies
Because a metric needs to measure something specific for it to have value, we need to design new measures, or tweak existing ones, for new technologies
This means adding items and re-testing
Example: Anxiety towards Facebook posting
Let's say we wanted to make a measure of how anxious people are about posting to Facebook
This measure (our questionnaire) is made of attitude phrases (or items)
Stages of item development
Literature review: what are the key concepts in studying anxiety?
Measure review: what is available? How is anxiety currently measured?
Focus groups/interviews: what is important in Facebook anxiety?
Questions about Facebook and negative emotions give an indication of how people describe the concepts, thus improving item wording
Generating items: interviews
A conversation with a purpose
4 main types:
Unstructured
Semi-structured
Structured
Group
Unstructured interviews
Exploratory
Talk around an area
Planning the areas for discussion rather than specific questions
Can explore topics as they come up
Structured Interviews
Predetermined questions
Standardised for all interviewees
Semi-structured interviews
Basic script used with all participants
A mix of structured and unstructured interviews: some questions are covered with everyone, and the rest is a free-flowing conversation
What interview type to use?
Depends on:
How specific you need to get
Purpose of the interview
Stages of item development
This will allow you to get an idea of:
Potential items
Potential categories that need to be covered (factors)
Pilot study
Large number of items
Participants rate:
Clarity of wording
Clarity of concept in the item
Experts in the area to review items
The good, the bad, the ugly
Good item
Clear, well worded, one concept, to the point
"I feel stressed when using Facebook"
Bad item
Can be clearly worded but does not cover one concept
"I feel stressed because of so many people on Facebook and it is hard to use"
Ugly item
Poorly worded and doesn't cover one concept
"Stress is something I feel all of the time when using Facebook because people on it are plentiful and it's difficult"
This can happen when questionnaires are mis-translated
Common scales used
Likert scales (Likert, 1932): 3-point, 5-point, 7-point, 9-point
The more points, the larger the variance of responses on an item
There are arguments over which is best, but 5-point is most common
The use of a "neutral point" is also debated
Semantic differential
Uses two polar-opposite adjectives at the ends of a scale
Which to use?
Strong - Not Strong (bad)
Strong - Weak (good)
Important concepts in item response
Response acquiescence set
A propensity for participants to answer positively to items
Counter it by balancing the psychometric as much as possible (positively and negatively worded items) and by item randomisation
Social desirability
Responding with what you feel is socially appropriate
So...
We have our items
We have piloted them with participants
We now need to assess how good our questionnaire (or psychometric) is
Good psychometrics have:
High reliability
High validity
Possess a set of norms (baselines/guides)
Reliability
Stability of the test score over time
Test-Retest Reliability
Internal consistency of the test
Internal consistency reliability
The extent to which the items are measuring the same underlying concept
Test-retest reliability
Testing the same participants on the measure on two occasions
Scores are then correlated to see the strength of the relationship
Over 0.7 is good test-retest reliability
[Diagram: test at Time 1, 6-month gap, test at Time 2]
Why would the correlation not be perfect?
Between times there may be changes on the variables
Some people may have become less anxious over time
Test error
e.g. feeling ill, bored, tired
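The test-retest check above is just a correlation between the two administrations. A minimal sketch in Python, using made-up scores (not data from the lecture):

```python
import numpy as np

# Hypothetical total scores for the same 8 participants at Time 1 and,
# after a 6-month gap, at Time 2 (illustrative data only).
time1 = np.array([12, 18, 25, 9, 30, 22, 15, 27])
time2 = np.array([14, 17, 24, 11, 28, 20, 16, 29])

# Pearson correlation between the two administrations.
r = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest reliability: r = {r:.2f}")  # over 0.7 counts as good
```

Real changes in the participants between administrations (and test error) pull this correlation below 1.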
Internal consistency reliability
The extent to which each item measures the same underlying concept
In our Facebook posting anxiety scale we would expect all the items to be measuring elements of anxiety, not measuring the usability of Facebook
Internal consistency measures
Split-half method
Divide the measure in two randomly and correlate the scores on the two halves
Cronbach's alpha (most commonly used)
The average of all possible split-half reliabilities
0.7 is seen as a good alpha
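Cronbach's alpha can be computed directly from the number of items, the item variances, and the variance of the total score. A minimal sketch with invented Likert responses:

```python
import numpy as np

def cronbach_alpha(items):
    """items: participants x items matrix of scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses from 6 participants to 4 Likert items.
scores = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 5, 4],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
    [1, 2, 1, 2],
])
print(f"alpha = {cronbach_alpha(scores):.2f}")  # 0.7 or above is good
```

Because these invented items all rise and fall together, alpha comes out well above 0.7.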
What can impact this reliability?
The number of items
More items mean more of the concept can be covered
Weigh up the number of items against participant boredom
10 items is considered the minimum for a reliable test
Can a measure be too internally consistent? (Cattell, 1957)
Using items which effectively measure the same thing
E.g. "I like Facebook" and "Facebook is something I like"
They are the same item, just with different wording
This leads to a "bloated specific"
Cronbach's alpha analysis
The analysis looks at the correlations of each item's score with the total questionnaire score (item-total correlations)
Items with item-total correlations lower than 0.3 should be removed, as they do not correlate well
The test output also gives us an idea of what alpha would be without each item - great for item removal
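The item-total screen described above can be sketched in a few lines. Here each item is correlated with the total of the remaining items (the corrected item-total correlation), and alpha is recomputed with the item removed; the data are invented, with item 4 deliberately inconsistent:

```python
import numpy as np

def cronbach_alpha(items):
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                            / items.sum(axis=1).var(ddof=1))

# Hypothetical responses: 6 participants, 4 items; item 4 is noisy.
scores = np.array([
    [4, 5, 4, 1],
    [2, 2, 3, 5],
    [5, 5, 5, 2],
    [3, 3, 2, 4],
    [4, 4, 4, 1],
    [1, 2, 1, 5],
], dtype=float)

for i in range(scores.shape[1]):
    rest = np.delete(scores, i, axis=1)          # all other items
    # Corrected item-total correlation: item vs. total of the remaining items.
    r = np.corrcoef(scores[:, i], rest.sum(axis=1))[0, 1]
    a = cronbach_alpha(rest)                     # alpha if this item is deleted
    flag = "  <- candidate for removal" if r < 0.3 else ""
    print(f"item {i + 1}: item-total r = {r:+.2f}, "
          f"alpha without it = {a:.2f}{flag}")
```

Item 4 falls below the 0.3 cut-off and is flagged, and alpha rises once it is dropped.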
Validity of a test
A test can be reliable but not valid
It could be high in reliability but not measuring what it proclaims to measure
It is not as simple as looking at the item wordings to deduce this
We need to identify whether our measure behaves as predicted
Validity assessment
Face validity
The items seem to be worded right for the concept being measured
This is a poor test of validity
E.g. "I am quite easily distracted" looks fine but can be interpreted differently by participants
Concurrent validity
Correlation of the test with another benchmark test given at the same time
Dubious when there is no clear benchmark
Validity assessment
Predictive validity
The measure is able to predict some criterion
E.g. Facebook anxiety relates to posting behaviour
Need to be aware that modest relationships are likely
Many other factors matter to posting behaviour: closeness of Facebook friends, drunken messaging?
Sometimes clear criteria are not available
Beware of the difference between statistical significance and psychological significance
Construct validity (Cronbach & Meehl, 1955)
Allows a collection of results, rather than just one, to lead us to validity conclusions
It is usually the case that not all hypotheses are confirmed
Validity is therefore not as unequivocal as reliability: it is interpretive and subjective
Construct validity: a bank of hypotheses based on our knowledge of the concept
Our hypotheses for Facebook anxiety:
Should correlate positively and highly with other measures of anxiety (concurrent validity)
Should correlate positively with someone's fear of negative evaluation (concurrent validity)
Should not correlate with personality tests that don't measure anxiety
High scorers, compared with low scorers, should show less activity on Facebook and more leaving of Facebook (predictive validity)
Norms
We need to test our measures on:
A significant, representative proportion of the population (1000s of respondents)
A sample of people we'd expect to be high or low on the measure (for discriminatory markers)
This is built up over years of use
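Once a large norming sample exists, an individual score can be interpreted as a percentile rank against it. A minimal sketch, using a simulated norming sample rather than real norms:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical norming sample: total anxiety scores from 2000 respondents
# (simulated here; real norms come from years of data collection).
norm_sample = rng.normal(loc=30, scale=8, size=2000)

def percentile_rank(score, norms):
    """Percent of the norming sample scoring at or below this score."""
    return 100.0 * np.mean(norms <= score)

print(f"Score 45 -> {percentile_rank(45, norm_sample):.0f}th percentile")
print(f"Score 30 -> {percentile_rank(30, norm_sample):.0f}th percentile")
```

A score near the sample mean lands around the 50th percentile; a score well above it lands near the top of the distribution.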
Now we have
Gathered our items
Assessed their reliability
Assessed their validity
We are assuming at present that Facebook anxiety is uni-dimensional.
This might not be true; there may be many factors to it, which we have picked up in our measure...
What are factors?
Each questionnaire item gives a score
There will be items that correlate heavily together
Factor analysis is fundamentally used to reduce the data into the smallest number of explanatory concepts
A factor is a combination of variables whose grouping indicates a relationship
What are factors?
Each item has a factor loading: the correlation of that item with the factor
Some items will have high loadings; some will have a low or no loading at all on a specific factor
Loadings of 0.4 or above are seen as helpful in defining a factor
Items should only load heavily on one factor
If they don't, they are candidates for rewording
Shared variance
The correlation coefficient represents the amount of agreement (or shared variance) between two sets of scores
Square the correlation coefficient to get the % agreement
[Diagram: overlapping circles for Variable x variance and Variable y variance; the overlap is the shared (common) variance]
Shared variance & communality
By squaring a factor loading we can identify how much shared variance there is between the item and the factor
Squared loadings can be thought of as the contribution that the item makes to the factor
If we do this for each factor loading an item has, we get the item's communality: the amount of variance shared between the item and all the factors
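The loading-squaring step above is simple arithmetic; a sketch with invented loadings for one item on three factors:

```python
import numpy as np

# Hypothetical factor loadings for one item on three extracted factors.
loadings = np.array([0.70, 0.30, 0.10])

shared = loadings ** 2          # shared variance with each factor
communality = shared.sum()      # variance the item shares with all factors

print("shared variance per factor:", np.round(shared, 2))  # [0.49 0.09 0.01]
print(f"communality = {communality:.2f}")                  # 0.59
```

So this item shares 49% of its variance with factor 1, and 59% with the factor solution as a whole.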
Factor extraction
Eigenvalues indicate the importance of an extracted factor in explaining the variance in the data
There will be a few factors with high eigenvalues and lots with low ones
It makes sense to keep the most important factors
A rule of thumb is to keep factors with eigenvalues > 1 (a factor with an eigenvalue above 1 explains more variance than a single standardised item)
The number to extract can also be identified using a scree plot (Cattell, 1966): the Y axis is the eigenvalues, the X axis is the number of factors
[Scree plot: eigenvalues (Y axis) against number of factors (X axis); the point of inflexion marks how many factors to retain]
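The eigenvalues come from the correlation matrix of the items, and the eigenvalue > 1 rule is then a one-line count. A sketch with a made-up correlation matrix in which items 1-3 form one cluster and items 4-5 another:

```python
import numpy as np

# Hypothetical correlation matrix for five questionnaire items.
R = np.array([
    [1.0, 0.6, 0.5, 0.1, 0.1],
    [0.6, 1.0, 0.6, 0.1, 0.2],
    [0.5, 0.6, 1.0, 0.2, 0.1],
    [0.1, 0.1, 0.2, 1.0, 0.5],
    [0.1, 0.2, 0.1, 0.5, 1.0],
])

eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]  # largest first
keep = int(np.sum(eigenvalues > 1))                 # eigenvalue > 1 rule

print("eigenvalues:", np.round(eigenvalues, 2))
print(f"factors with eigenvalue > 1: {keep}")
```

The eigenvalues sum to the number of items (here 5), and the two item clusters produce two eigenvalues above 1, matching the rule of thumb.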
Factor rotation
Looking for the "best fit": the factor structure with the clearest interpretation
Sometimes this involves rotation to get the clearest, simplest factor structure
A simple factor structure is one that has a few high-loading items with the rest being near 0 (Cattell, 1978)
Methods of rotation
The method you choose depends on how correlated you feel the factor scores should be, based on theoretical reasoning
We would expect our questionnaire to have factors: 1) anxiety about social posting, 2) anxiety about interface interaction, 3) social confidence
We would expect the scores from these to be correlated
We would therefore use a method that takes this correlation into consideration: Direct Oblimin
This is an oblique method of rotation (it allows the factors to correlate)
Methods of rotation
If we felt the factors should not correlate, we could instead use the Varimax method
This is an example of orthogonal rotation: it ensures the extracted factors are not correlated.
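To make the idea of orthogonal rotation concrete, here is a sketch of the standard iterative varimax algorithm in numpy (statistics packages implement this for you; the loadings below are invented). Rotation only re-orients the factors, so each item's communality is unchanged:

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonal (varimax) rotation of an items x factors loadings matrix."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)          # current orthogonal rotation matrix
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        # SVD step that nudges the rotation towards maximal loading variance.
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - (1.0 / p) * Lr @ np.diag((Lr ** 2).sum(axis=0))))
        R = u @ vt
        d_new = s.sum()
        if d_new < d * (1 + tol):  # stop when the criterion stops improving
            break
        d = d_new
    return L @ R

# Hypothetical unrotated loadings for six items on two factors.
unrotated = np.array([
    [0.7,  0.4], [0.6,  0.5], [0.7,  0.3],
    [0.5, -0.5], [0.6, -0.4], [0.4, -0.6],
])
rotated = varimax(unrotated)
print(np.round(rotated, 2))  # items now load mainly on one factor each
```

After rotation, each row has one dominant loading and one near 0: the simple structure the lecture describes.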
Considerations
Sample size
The number of people needed in the sample is debated
100 participants for stable factors (Kline, 1999)
Using factor analysis in questionnaire construction
Give participants the questionnaire
Conduct a factor analysis
For any items that load highly on more than one factor, check for concept clarity
Check that the items with loadings > 0.3 cover most of what we need in the scale; if not, write more items
Replicate this on each new sample
Validate the scale factors and calculate their reliability
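The item-screening step in that workflow can be sketched as a simple scan of the loadings matrix (invented loadings; the 0.4 threshold is the one the lecture gives for defining a factor):

```python
import numpy as np

# Hypothetical rotated loadings: five items on two factors.
loadings = np.array([
    [0.72, 0.10],
    [0.65, 0.05],
    [0.45, 0.52],   # loads highly on both factors
    [0.08, 0.68],
    [0.20, 0.15],   # loads highly on neither factor
])

high = np.abs(loadings) > 0.4
for i, row in enumerate(high):
    if row.sum() > 1:
        print(f"item {i + 1}: loads on several factors -> check concept clarity")
    elif row.sum() == 0:
        print(f"item {i + 1}: no loading above 0.4 -> rewrite or remove")
```

Item 3 is flagged for rewording (cross-loading) and item 5 for removal or rewriting, mirroring the checks in the list above.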
Making a psychometric
Takes a lot of time:
To develop the items
To test on a wide range of samples
To test a large bank of hypotheses on relationships to ensure its validity
Sometimes it cannot be avoided
Readings
Kline, P. (2000). A Psychometrics Primer, Chapter 3. Free Association Books (£14.95 from Amazon)
Kline, P. (1994). An Easy Guide to Factor Analysis (available in the library)
Field, A. (2007). Chapter 15: Exploratory Factor Analysis