Leeds, 22 March 2007
A SECONDARY ANALYSIS OF DATA MID CAREER FELLOWSHIP
Gopalakrishnan Netuveli
Imperial College London
1 Jan 2007 – 31 March 2008
Treating longitudinal data longitudinally
The objective of this fellowship is to gain experience and proficiency in the secondary analysis of longitudinal data sets.
Motivation
Large amount of resources spent on collecting longitudinal data
Inadequate utilisation of the “longitudinality” of the data
The substantive research question
The objectives of the fellowship will be pursued by investigating the inter-relationship of trajectories of employment status and health in the British Household Panel Survey (BHPS).
The English Longitudinal Study of Ageing will also be used.
This presentation reports results from the first two months of work.
Complexity of longitudinal data
Structure of longitudinal data - like an arrow in flight, which ‘is paradoxically at one stage while also pursuing its path to the target’.
Challenge - resolve that paradox by using all the information contained in both the stage and the track of the longitudinal variable.
Importance of trajectories
In the data, trajectories are represented as an array of time-indexed variables.
Every point in the trajectory contains information about:
the current value at the point of measurement
the direction
how that point was reached
where it might go
Longitudinal data are underused if only the magnitudes of the time-indexed variables are used, without taking into account the possible interrelationship between them.
Examples of research meeting the challenge
Mainland European demographers used linked information from Norwegian national decennial censuses to construct social trajectories through the life course.
They used state-order-time models to predict mortality (Wunsch et al. 1996):
State model: logit P(Y = 1) = d + Σ(a_i A_i + b_i B_i + c_i C_i)
Order model: logit P(Y = 1) = a + Σ b_i O_i
Combining states and orders: sequences
The order of states can be expressed as a sequence, which can represent a longitudinal trajectory.
Sequences are clustered according to their similarity, either in all-to-all comparisons or in comparisons against ideal types.
Cluster membership is used as a dependent or independent variable in analyses.
A method becoming popular for this purpose is optimal matching.
Optimal matching: a short primer
A measure of dissimilarity between two sequences is
d = (L1 + L2) - 2*LCS
where L1 and L2 are the total lengths of the first and second sequences, and LCS is the length of the longest common subsequence they share.
e.g. LONDON, LEEDS: L1 = 6; L2 = 5; LCS (LD) = 2; d = 11 - 4 = 7
The matching is optimal when d has the smallest possible value, which depends on identifying the longest possible LCS.
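The dissimilarity above can be sketched in a few lines of Python. This is only an illustration, assuming LCS means the longest common subsequence (characters in the same order, not necessarily adjacent); the function names `lcs_length` and `lcs_distance` are made up for this example and are not part of any package used in the fellowship.

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of s and t (dynamic programming)."""
    prev = [0] * (len(t) + 1)          # LCS lengths for the previous row
    for ch_s in s:
        curr = [0]                     # LCS with an empty prefix of t is 0
        for j, ch_t in enumerate(t, start=1):
            if ch_s == ch_t:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_distance(s, t):
    """Dissimilarity d = (L1 + L2) - 2 * LCS."""
    return len(s) + len(t) - 2 * lcs_length(s, t)


print(lcs_length("LONDON", "LEEDS"))    # 2  (the subsequence "LD")
print(lcs_distance("LONDON", "LEEDS"))  # 7
```

Read this way, d counts the insertions and deletions needed to turn one sequence into the other when no substitutions are allowed.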
A special case
If sequences are made of two states and are of equal length (L):
1. AAAAAA = 111111, Σ1 = 6 = L
2. ABAABA = 101101, Σ2 = 4
3. BBAABB = 001100, Σ3 = 2
Σ, the number of 1s, is the LCS with sequence 1 (the reference sequence)
d1→2 = (6+6) - 2*4 = 4 = 2*(L - Σ2)
d1→3 = (6+6) - 2*2 = 8 = 2*(L - Σ3)
d2→3 cannot be derived from these relations
d is often standardised by dividing by L (or by the longer length when the lengths are unequal)
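For this special case, the distance to an all-ones reference needs no matching at all: it is just twice the number of zeros. A minimal sketch, assuming sequences are coded as lists of 0/1 with 1 the reference state; `distance_from_reference` and `standardised_distance` are illustrative names only.

```python
def distance_from_reference(seq):
    """d = 2 * (L - sum) for a 0/1 sequence against an all-ones reference of equal length."""
    return 2 * (len(seq) - sum(seq))


def standardised_distance(seq):
    """Distance divided by the sequence length L."""
    return distance_from_reference(seq) / len(seq)


print(distance_from_reference([1, 0, 1, 1, 0, 1]))  # ABAABA -> 4
print(distance_from_reference([0, 0, 1, 1, 0, 0]))  # BBAABB -> 8
print(round(standardised_distance([1] * 13 + [0]), 3))  # 14 waves, one wave off reference -> 0.143
```

With 14 waves the standardised distance moves in steps of 2/14, from 0 (identical to the reference) to 2 (opposite state in every wave).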
Developing methods to compare trajectories of two different variables: progress to date
I used BHPS waves 1 to 14.
Two trajectories were selected:
People in labour force (1 = in LF)
People with limiting illness in the previous 12 months (0 = no limiting illness)
Reference sequences were: being in the labour force for all waves, and no limiting illness for all waves
The hypothesis was that people who are ill will not be working.
The research question is: how do trajectories of labour force participation vary within a given pattern of illness?
Methods
Data were restricted to respondents who had information on both variables in all waves.
Stata commands were used to match sequences and produce the standardised distances against the reference sequences.
As there are only 14 waves, the standardised distances take only 15 discrete values (0 to 2 in steps of 2/14): few enough to look at each value separately, but enough to treat as continuous.
Method to describe a pattern graphically
Traditionally, area charts are used to describe patterns.
Disadvantage: uncertainty at each time point in the pattern is not reflected.
Information content of a distribution of states at a time point can be calculated using Shannon’s information measure.
Information at a position, R = log2(2) + Σ_{a=0}^{1} f_a log2(f_a)
where a is the state (0, 1) and f_a is the frequency of state a.
This method is based on the ‘sequence logos’ used to describe genetic sequences (Schneider, 1999).
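To make the information measure concrete, here is a small sketch of the per-wave calculation. It assumes the two-state sequence-logo form, R = log2(2) minus the Shannon entropy of the state frequencies at that wave, so R runs from 0 (states split 50/50) to 1 (everyone in the same state). It omits any small-sample correction, and the function names are illustrative rather than taken from the original analysis.

```python
import math


def information_at_position(states):
    """Information (in bits) of a list of 0/1 states observed at one wave."""
    n = len(states)
    info = math.log2(2)                # maximum possible information for two states
    for a in (0, 1):
        f = states.count(a) / n        # frequency of state a at this wave
        if f > 0:                      # treat 0 * log2(0) as 0
            info += f * math.log2(f)
    return info


def information_profile(sequences):
    """Information at each wave for a list of equal-length 0/1 sequences."""
    n_waves = len(sequences[0])
    return [information_at_position([seq[t] for seq in sequences])
            for t in range(n_waves)]


# Four people, three waves: wave 1 unanimous, wave 2 evenly split, wave 3 mixed 3:1.
sample = [[1, 1, 1], [1, 1, 1], [1, 0, 1], [1, 0, 0]]
print(information_profile(sample))
```

In a logo-style plot, each wave's bar would have height R, divided between the two states in proportion to their frequencies.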
Method to analyse variations in patterns
To study variation in the distribution of patterns I used the Gini coefficient.
The Gini coefficient can be decomposed into between-group, within-group and overlap components. It makes no distributional assumptions, except that the variable should be monotonically increasing.
The similarity of this procedure to ANOVA has led to it being called ANoGi (Frick et al. 2004).
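As a rough illustration of splitting a Gini coefficient into three parts, the sketch below uses a commonly used additive decomposition: the between-group term is the Gini obtained when every value is replaced by its group mean, the within-group term weights each group's own Gini by its population share times its share of the total, and the overlap is whatever remains. This is an assumption made for illustration and is not necessarily the exact ANoGi formulation of Frick et al.; the function names are hypothetical.

```python
def gini(values):
    """Gini coefficient from the mean absolute difference between all pairs."""
    n, total = len(values), sum(values)
    if n == 0 or total == 0:
        return 0.0
    abs_diff = sum(abs(x - y) for x in values for y in values)
    return abs_diff / (2 * n * total)


def decompose_gini(groups):
    """groups: dict mapping a group label to a list of non-negative values."""
    pooled = [x for vals in groups.values() for x in vals]
    n, total = len(pooled), sum(pooled)
    overall = gini(pooled)
    # Between groups: every value replaced by the mean of its own group.
    smoothed = [sum(vals) / len(vals) for vals in groups.values() for _ in vals]
    between = gini(smoothed)
    # Within groups: each group's Gini weighted by population share * total share.
    within = 0.0
    if total > 0:
        within = sum((len(v) / n) * (sum(v) / total) * gini(v)
                     for v in groups.values())
    overlap = overall - between - within   # residual due to overlapping ranges
    return {"between": between, "within": within,
            "overlap": overlap, "overall": overall}


print(decompose_gini({"a": [1, 2], "b": [3, 4]}))  # ranges do not overlap
print(decompose_gini({"a": [1, 4], "b": [2, 3]}))  # ranges overlap
```

When the groups' value ranges do not overlap, the residual term is zero (up to rounding); it grows as the groups' distributions interleave.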
Results
Sample size: 4796
Sequences with only one state:
Limiting illness: 2924 (60.9%)
In labour force: 2477 (51.7%)
Number of episodes:
Episodes   Limiting illness     In labour force
           N        %           N        %
1          2,924    61.0        2,477    51.7
2          354      7.4         1,019    21.3
3          747      15.6        640      13.3
4          263      5.5         339      7.1
5          259      5.4         183      3.8
6          119      2.5         81       1.7
>6         130      2.7         138      2.9
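The episode counts can be reproduced with a short sketch, assuming an episode is a run of consecutive waves spent in the same state, so that a sequence with only one state has exactly one episode. The function names are illustrative.

```python
from collections import Counter
from itertools import groupby


def count_episodes(seq):
    """Number of runs of identical consecutive states in a sequence."""
    return sum(1 for _ in groupby(seq))


def episode_distribution(sequences):
    """Tabulate how many sequences have 1, 2, 3, ... episodes."""
    return Counter(count_episodes(seq) for seq in sequences)


print(count_episodes([1] * 14))                    # 1: same state in all waves
print(count_episodes([1, 1, 0, 0, 1, 1, 1]))       # 3: in, out, in again
print(episode_distribution([[1, 1, 1], [1, 0, 0], [0, 1, 0], [0, 0, 0]]))
```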
Limiting illness: distribution of patterns according to distance from reference pattern (No illness in all waves)
Distance Number %
0.000 2,821 58.82
0.143 524 10.93
0.286 319 6.65
0.429 191 3.98
0.571 167 3.48
0.714 126 2.63
0.857 80 1.67
1.000 84 1.75
1.143 58 1.21
1.286 73 1.52
1.429 60 1.25
1.571 67 1.4
1.714 56 1.17
1.857 67 1.4
2.000 103 2.15
Limiting illness patterns at distance 1: Traditional graphic representation
[Figure: chart with axes 0 to 80 and 0 to 15; legend: No limit, Limit]
Limiting illness patterns at distance 1: Information theoretic (sequence logo) representation
[Figure: sequence logo; x-axis: Waves (1 to 14); y-axis: Certainty (0% to 100%); legend: L = limiting illness, N = no limitations]
Pattern for labour force participation : whole sample
[Figure: sequence logo; x-axis: Waves (1 to 14); y-axis: Certainty (0% to 100%); legend: E = in labour force, N = not in labour force]
Pattern for labour force participation : in those with no limiting illness
[Figure: sequence logo; x-axis: Waves (1 to 14); y-axis: Certainty (0% to 100%); legend: E = in labour force, N = not in labour force]
Pattern for labour force participation : in those with limiting illness in half the waves
[Figure: sequence logo; x-axis: Waves (1 to 14); y-axis: Certainty (0% to 100%); legend: E = in labour force, N = not in labour force]
Pattern for labour force participation : in those with limiting illness in all waves
[Figure: sequence logo; x-axis: Waves (1 to 14); y-axis: Certainty (0% to 100%); legend: E = in labour force, N = not in labour force]
Analysis of Gini: Patterns of employment grouped by patterns of limiting ill health
                 Gini coefficient   %
Between group    0.22               60.0
Within groups    0.11               30.5
Overlap          0.03               9.5
Overall          0.36               100.0
Relationship of patterns of labour force participation and patterns of limiting illness
r=0.43
[Figure: x-axis: Patterns of health sequence (0L to 14L); y-axis: Gini coefficient of patterns of employment (0 to 0.45)]
In conclusion…
Work in progress.
Need to explore using more complex patterns and full optimal matching
Other methods
Fellowship mentored by:
Professor David Blane, Imperial College
Professor Mel Bartley, UCL
Professor Richard Wiggins, City University
Professor Nicky Best, Imperial College