Leeds, 22 March 2007

A SECONDARY ANALYSIS OF DATA MID CAREER FELLOWSHIP

Gopalakrishnan Netuveli

Imperial College London

1 Jan 2007 – 31 March 2008


Treating longitudinal data longitudinally

The objective of this fellowship is to gain experience and proficiency in the secondary analysis of longitudinal data sets.

Motivation

Large amount of resources spent on collecting longitudinal data

Inadequate utilisation of the “longitudinality” of the data


The substantive research question

The objectives of the fellowship will be pursued by investigating the inter-relationship of trajectories of employment status and health in the British Household Panel Survey.

The English Longitudinal Study of Ageing will also be used.

In this presentation I present results from the first two months of work done.


Complexity of longitudinal data

Structure of longitudinal data - like an arrow in flight, which ‘is paradoxically at one stage while also pursuing its path to the target’.

Challenge - resolve that paradox by utilising the whole information included in the stage and the track of the longitudinal variable.


Importance of trajectories

In the data, trajectories are represented as an array of time indexed variables.

Every point in the trajectory contains information about:

the current value at the point of measurement

the direction

how that point was reached

where it might go.

Longitudinal data are underused if only the magnitudes of the time-indexed variables are used without taking into account the possible interrelationships between them.


Examples of research meeting the challenge

Mainland European demographers used linked information from Norwegian national decennial censuses to construct social trajectories through the life course.

They used state-order-time models to predict mortality:

State model: $\operatorname{logit} P(Y = 1) = d + \sum_i (a_i A_i + b_i B_i + c_i C_i)$

Order model: $\operatorname{logit} P(Y = 1) = a + \sum_i b_i O_i$

(Wunsch et al. 1996)


Combining states and orders: sequences

Order of states can be expressed as sequences, which can represent longitudinal trajectories.

Sequences are clustered according to their similarity in all to all comparisons or in comparisons against ideal types.

Cluster membership is used as dependent or independent variable in analyses.

A method becoming popular for this purpose is optimal matching.


Optimal matching: a short primer

A measure of dissimilarity between two sequences is

d = (L1 + L2) – 2*LCS

where L1 and L2 are the total lengths of the first and second sequences, and LCS is the length of the longest common subsequence they share.

e.g. LONDON, LEEDS: L1 = 6; L2 = 5; LCS (LD) = 2; d = 11 - 4 = 7

The matching is optimal when d has the smallest possible value, which depends on identifying the highest possible LCS.
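The dissimilarity measure above can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the fellowship; the function names `lcs_length` and `dissimilarity` are made up for this example, and the LCS is found with the standard dynamic-programming recurrence.

```python
def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(s2) + 1)  # LCS lengths for the previous prefix of s1
    for ch1 in s1:
        curr = [0]
        for j, ch2 in enumerate(s2, start=1):
            if ch1 == ch2:
                curr.append(prev[j - 1] + 1)  # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip one character
        prev = curr
    return prev[-1]


def dissimilarity(s1: str, s2: str) -> int:
    """d = (L1 + L2) - 2 * LCS, as defined in the primer."""
    return len(s1) + len(s2) - 2 * lcs_length(s1, s2)


print(lcs_length("LONDON", "LEEDS"))     # 2 (the subsequence "LD")
print(dissimilarity("LONDON", "LEEDS"))  # 7
```

Because the recurrence always finds the longest possible common subsequence, the resulting d is the smallest possible value for the pair.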


A special case

If sequences are made of two states and of equal length (L):

1. AAAAAA = 111111, Σ1 = 6 = L

2. ABAABA = 101101, Σ2 = 4

3. BBAABB = 001100, Σ3 = 2

Σ is the LCS with the reference sequence (sequence 1)

d1→2 = (6+6) - 2*4 = 4 = 2*(L - Σ2)

d1→3 = (6+6) - 2*2 = 8 = 2*(L - Σ3)

d2→3 cannot be extrapolated from these relations

d is often standardised by dividing by L (or by the longer length in case of unequal lengths)
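For this special case the distance to the reference needs no matching at all, only a count. The sketch below assumes the reference is the sequence made entirely of state "1" and that sequences are stored as strings of "0" and "1"; both function names are illustrative only.

```python
def distance_to_reference(seq: str, ref_state: str = "1") -> int:
    """d = 2 * (L - sum), valid only against a reference that is all ref_state."""
    return 2 * (len(seq) - seq.count(ref_state))


def standardised(d: int, length: int) -> float:
    """Standardise a distance by dividing by the sequence length L."""
    return d / length


for seq in ("111111", "101101", "001100"):
    d = distance_to_reference(seq)
    print(seq, d, round(standardised(d, len(seq)), 3))
# 111111 0 0.0
# 101101 4 0.667
# 001100 8 1.333
```

The shortcut applies only to comparisons against the all-one-state reference; a distance such as d2→3 still needs a full LCS computation.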


Developing methods to compare trajectories of two different variables: progress to-date

I used BHPS waves 1 to 14.

Two trajectories were selected:

People in labour force (1 = in LF)

People with limiting illness in the previous 12 months (0 = no limiting illness)

Reference sequences were being in labour force for all waves, and no limiting illness for all waves

The hypothesis was that people who are ill will not be working

The research question is: how do trajectories of labour force participation vary within a given pattern of illness?


Methods

Data restricted to all those who had information on both variables in all waves.

Stata commands were used to match and produce the standardised distances against reference sequences.

As there are only 14 waves, the distances took only 15 discrete values (0 to 2 in steps of 2/14): few enough to look at each value separately but enough to treat as continuous.


Method to describe a pattern graphically

Traditionally, area charts are used to describe patterns.

Disadvantage: uncertainty at each time point in the pattern is not reflected.

Information content of a distribution of states at a time point can be calculated using Shannon’s information measure.

Information at a position, $R = \log_2 2 + \sum_{a=0}^{1} f_a \log_2 f_a = 1 + \sum_{a=0}^{1} f_a \log_2 f_a$

where a is the state (0, 1) and f_a is the relative frequency of state a.

This method is based on ‘sequence logos’ used to describe genetic sequences (Schneider, 1999).
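As a rough sketch, the information at one wave can be computed as the maximum possible uncertainty for two states, log2(2) = 1 bit, minus the Shannon entropy of the observed state frequencies. The function name is hypothetical, and the small-sample correction used in sequence logos for genetic data is left out, which is an assumption of this example.

```python
import math


def information_at_position(states):
    """R = log2(2) - H, where H is the Shannon entropy (in bits) of the
    observed distribution of the two states (0, 1) at one wave."""
    n = len(states)
    entropy = 0.0
    for a in (0, 1):
        f_a = states.count(a) / n  # relative frequency of state a
        if f_a > 0:  # the term 0 * log2(0) is taken as 0
            entropy -= f_a * math.log2(f_a)
    return math.log2(2) - entropy


print(information_at_position([0, 0, 0, 0]))  # 1.0: everyone in the same state
print(information_at_position([0, 1, 0, 1]))  # 0.0: both states equally frequent
```

A value near 1 means the state at that wave is almost certain; a value near 0 means the two states are about equally common.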


Method to analyse variations in patterns

To study variations in distribution of patterns I used the Gini coefficients.

The Gini coefficient can be decomposed into between-group, within-group and overlap components. It makes no distributional assumptions except that the variable should be monotonically increasing.

The similarity of this procedure to ANOVA has led to it being called ANoGi (Frick et al. 2004)
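For orientation, an overall Gini coefficient can be computed from sorted values as sketched below. This illustrates only the overall coefficient, using the usual sorted-rank formula; it does not reproduce the between-group, within-group and overlap decomposition, and the function name is chosen for this example.

```python
def gini(values):
    """Gini coefficient of non-negative values via the sorted-rank formula:
    G = 2 * sum(rank_i * x_i) / (n * sum(x)) - (n + 1) / n."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # no dispersion can be measured
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


print(gini([1, 1, 1, 1]))  # 0.0: all values equal
print(gini([0, 0, 0, 1]))  # 0.75: one unit holds everything
```

Applied to standardised distances, a coefficient of 0 would mean every sequence lies at the same distance from the reference.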


Results

Sample size: 4796

Sequences with only one state:

Limiting illness: 2924 (60.9%)

In labour force: 2477 (51.7%)

Number of episodes:

            Limiting illness    In labour force
Episodes    N        %          N        %
1           2,924    61.0       2,477    51.7
2             354     7.4       1,019    21.3
3             747    15.6         640    13.3
4             263     5.5         339     7.1
5             259     5.4         183     3.8
6             119     2.5          81     1.7
>6            130     2.7         138     2.9
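The slides do not define an episode. The sketch below assumes an episode is a maximal run of consecutive waves spent in the same state, which is consistent with the one-episode counts matching the single-state sequences. The function name is illustrative.

```python
from itertools import groupby


def count_episodes(seq: str) -> int:
    """Number of maximal runs of identical consecutive states in a sequence."""
    return sum(1 for _ in groupby(seq))


print(count_episodes("00000000000000"))  # 1: the same state in all 14 waves
print(count_episodes("00011100000011"))  # 4: runs 000, 111, 000000, 11
```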


Limiting illness: distribution of patterns according to distance from reference pattern (No illness in all waves)

Distance Number %

0.000 2,821 58.82

0.143 524 10.93

0.286 319 6.65

0.429 191 3.98

0.571 167 3.48

0.714 126 2.63

0.857 80 1.67

1.000 84 1.75

1.143 58 1.21

1.286 73 1.52

1.429 60 1.25

1.571 67 1.4

1.714 56 1.17

1.857 67 1.4

2.000 103 2.15


Limiting illness patterns at distance 1: Traditional graphic representation

[Figure: area chart of limiting illness states; legend: No limit, Limit]


Limiting illness patterns at distance 1: Information theoretic (sequence logo) representation

[Figure: sequence logo; x-axis: Waves (1 to 14); y-axis: Certainty (0% to 100%); letters: L = limiting illness, N = no limitations]


Pattern for labour force participation: whole sample

[Figure: sequence logo; x-axis: Waves (1 to 14); y-axis: Certainty (0% to 100%); letters: E = in labour force, N = not in labour force]


Pattern for labour force participation: in those with no limiting illness

[Figure: sequence logo; x-axis: Waves (1 to 14); y-axis: Certainty (0% to 100%)]



Pattern for labour force participation: in those with limiting illness in half the waves

[Figure: two sequence logo panels; x-axis: Waves (1 to 14); y-axis: Certainty (0% to 100%)]



Pattern for labour force participation: in those with limiting illness in all waves

[Figure: sequence logo; x-axis: Waves (1 to 14); y-axis: Certainty (0% to 100%)]



Analysis of Gini: Patterns of employment grouped by patterns of limiting ill health

                  Gini coefficient    %
Between groups    0.22                60.0
Within groups     0.11                30.5
Overlap           0.03                 9.5
Overall           0.36               100.0


Relationship of patterns of labour force participation and patterns of limiting illness

r = 0.43

[Figure: Gini coefficient of patterns of employment (y-axis, 0 to 0.45) by patterns of health sequence (x-axis, 0L to 14L)]


In conclusion…

Work in progress.

Need to explore using more complex patterns and full optimal matching

Other methods

Fellowship mentored by:

Professor David Blane, Imperial College

Professor Mel Bartley, UCL

Professor Richard Wiggins, City University

Professor Nicky Best, Imperial College