Using SAS® to Calculate Incidence and Prevalence Rates in ... · 1 Using SAS® to Calculate Incidence and Prevalence Rates in a Dynamic Population Li-Hao Chu, Kaiser Permanente,

1

Using SAS® to Calculate Incidence and Prevalence Rates in a Dynamic Population

Li-Hao Chu, Kaiser Permanente, Pasadena, CA

Fagen Xie, Kaiser Permanente, Pasadena, CA

ABSTRACT

A typical epidemiological study on chronic diseases often starts with the measurement of incidence rate (IR) and prevalence rate (PR). As the use of electronic medical records and administrative databases become popular in health care management, the need to translate data into valuable clinical knowledge to better monitor the spread of diseases and to ensure patient safety becomes very important.

This report summarizes the definition and the key elements needed for measuring IR and PR in a managed care setting. Besides crude rates, as a result of the variation in member characteristics in different managed care organizations, adjusted rate and adjusted rate ratio are often required to allow comparisons across different populations. A demonstration on how to derive adjusted rate and adjusted rate ratio using the ESTIMATE and LSMEANS statements is provided in this paper.

INTRODUCTION

Incidence rate (IR) and prevalence rate (PR) are two measures widely applied to assess disease risk and occurrence in epidemiological studies [1-4]. Incidence rate takes the number of newly initiated disease cases divided by people at risk in a defined period of time while the definition of prevalence is the proportion of individuals in a population who have a specific characteristic or disease at a specific point in time or in a specified time interval. These two measures serve different purposes. The IR allows us to assess the etiology of a disease and explore the relationship between risk factors and the disease. The PR shows how widespread and stable a condition is over time, and provides information on the burden of disease.

As electronic medical records and administrative databases are gaining popularity in health care management, the need to translate data into valuable clinical knowledge to better monitor the spread of diseases and to ensure patient safety becomes very important. This paper aims to demonstrate the process of constructing IR and PR in a managed care setting which has a dynamic population with members in and out of the health plan intermittently. Moreover, as a result of the variation in member characteristics in different managed care organizations, adjusted measures taking into account member demographic information like age, gender and race/ethnicity are usually reported along with the crude rates. The programming technique and background for calculating crude rate, adjusted rate and adjusted rate ratio are also introduced in this report.

INCIDENCE RATE AND PREVALENCE RATE

Table 1 shows definitions and formulas for different types of IR and PR. Incidence rate can be measured in the format of a fraction like cumulative incidence (CI) or in the format of a rate like incidence density (ID). In both situations, the numerator only includes cases with newly initiated disease. For the denominator, besides including subjects at risk of developing the specified condition, the denominator in ID takes into account the total exposure time [1].

Prevalence rate refers to a fraction of prevalent cases over the total population at a specified point in time. It is like taking a snapshot during the study period, and it can be calculated at the beginning, in the middle or at the end of the study. It is unitless and there is no need to consider the exposure time [1].

Numerators for Incidence and Prevalence

To demonstrate the calculation of IR and PR, rheumatoid arthritis (RA), a chronic inflammatory disease of the joints that requires long-term treatment to prevent the progression of the disease, is used as an example here. The prevalence of RA in the United States was estimated to be around 0.5% to 1% in the general population [2]. A national standard was developed and validated to identify RA patients in medical claims by HEDIS [3]. A RA patient is defined by having at least two face to face encounters with different dates of service in an outpatient or non-acute inpatient setting with any diagnosis of rheumatoid arthritis. Any adult patient between age 18 and 100 years old that meets the criteria is counted as a RA prevalent case. However, to qualify as an incident RA case, besides meeting the criteria, a washout period which is one year continuous eligibility without any RA diagnosis prior to the first encounter of RA is required.

2

Incidence Rate (IR) Prevalence Rate (PR)

A measure of new cases of disease in a defined population over a specified period of time. It is the number of new cases of disease per the number of people at risk in a defined period of time of observation.

The proportion of individuals in a population who have a specific characteristic or disease at a specific point in time or in a specified time interval.

This measure provides a direct estimate of the probability or risk of illness.

This measure tells you how widespread and stable a condition is over time.

Cumulative Incidence (CI):

Point prevalence:

Incidence density (ID):

Period Prevalence:

Table 1. Summary of Incidence Rate and Prevalence Rate

Denominator for Incidence and Prevalence

Since a member can be in and out of a health plan at different points in time, membership enrollment history is structured as multiple records per member with different start and stop dates. It is also common to have a short gap like 30 to 60 days or an overlapping period between two records as a result of contract renewal or extension. Therefore, the first step is to bridge these gaps and remove the overlapping period. A SAS MACRO named DCM_ENROLL_COMBINE developed by the department of research and evaluation in Kaiser Permanente Southern California is introduced here to carry out this task (appendix 1). The process of creating denominators is illustrated in Figure 1, and it also helps users to visualize the structure of member enrollment history in database.

Figure 1. Illustration of the Process to Create Denominator for IR and PR

SAS MACRO: IRPR_DENO.SAS

Exposure Time

Prevalent Case Count

SAS MACRO: DCM_ENROLL_COMBINE.SAS

3

Adjusted Rate and Adjusted Rate Ratio

Crude IR and PR provide an overview of the risk of developing the disease and the disease distribution in a health

plan care setting respectively. However, in order to make the result comparable to other organizations or

generalizable to the public, it is necessary to adjust the rate on the available demographic information. For outcomes

like incident cases or prevalent cases can be counted, regression models for multivariate count data like Poisson

regression could be an appropriate analytic choice.

When considering IR and PR, both of them can be seen as a rate like two new RA patients per 1000 member years

or 100 RA patients per 100,000 members. To estimate an event rate or rate ratio like IR or PR, a Poisson or negative

binomial model can be applied [4]. Prior to modeling, running a Poisson regression to examine the dispersion of data

through deviance and Pearson chi-square is recommended. When the deviance is greater than one, which means

the variance is greater than mean, over-dispersion exists. When the deviance is smaller than one, which means the

variance is smaller than mean, under-dispersion exists. In either case, a Poisson model is not an appropriate choice.

A negative binomial model which allows modeling heterogeneity in a population is a good candidate for over-

dispersed or under-dispersed data.

PRESENTATION OF CODE

Step 1: Bridge Gap and Eliminate Overlapping Periods

To use the SAS MACRO DCM_ENROLL_COMBINE (appendix 1), one has to specify a few parameters. INDATA is the name of source data, and it needs to include a study subject ID (MBRID in this case), membership starting date (STRT) and membership stopping date (STOP). GAP is the length of a gap in membership that will be allowed to be considered as continuous membership. OUTDATA is the name of output dataset.

%include 'location of macro';

%dcm_enroll_combine(mbrid=mrn, strt=strt, stop=stop, indata=memb_enru_hist, outdata=tx1.mbr_enrl, gap=45);

Step 2: Bring in Demographic Variables

Next, we add the demographic characteristics race, age and gender to the data.

/**vw_members is the source table of member demographic info**/

/**strt is the start date of enrollment**/

/**stop is the end date of enrollment**/

proc sql;

create table mbr_dob

as select a.mrn, datepart(a.strt) as start_dt format=mmddyy10.,

datepart(a.stop) as end_dt format=mmddyy10.

,datepart(b.dob) as dob format=mmddyy10.

,b.gender, b.race

from tx1.mbr_enrl as a

inner join vw_members as b

on a.mrn = b.mrn

where gender in ('f', 'm');

;

quit;

Step 3: Calculate Person Year and Eligible Member Count at Member-Year Level

The SAS MACRO IRPR_DENO allows users to calculate person-year in a specified time period and at the same time takes a snapshot in the middle of each year to define an eligible member, which will be used as the denominator for PR. In our case the study period is from 2004 to 2010. Five parameters are required in this MACRO. DOB is the date of birth of each member. FIRSTYR is the starting year of the study. LASTYR is the end year of the study. DATAIN is the source file that has member demographic information, and DATAOUT is the output file name.

%macro irpr_deno(dob=dob, firstyr=, lastyr=, datain=, dataout=);

data deno1;

set &datain;

%do i=&firstyr %to &lastyr;

4

year=&i;

x0=mdy(7,1,&i); /**snap-shot point for prevelence**/

x1=mdy(1,1,&i);

x2=mdy(12,31,&i);

age=int((x0 - &dob)/365.25);

/**calculate person year**/

p_year=yrdif(max(start_dt,x1), min(end_dt,x2),'actual');

/***prevalence count****/

if (start_dt <= x0 <=end_dt) then count=1; else count=0;

if p_year>0 then output;

%end;

drop x:;

run;

/**to group multiple records per member to one record per member**/

proc sql;

create table &dataout

as select distinct mrn, dob, age, gender, year, sum(count) as cnt,

sum(p_year) as pyear, race, hisp

from deno1

group by mrn, year;

quit;

%mend;

%irpr_deno(dob=dob, firstyr=2004, lastyr=2010, datain=mbr_dob, dataout=mbr_pyr);

Step 4: Preparing Numerator Files

Since the logic to define a specific disease population in the numerator greatly depends on the study, the following program is only for the demonstration of how to prepare the numerator file to join with a denominator file. Two numerator files, INC_NUM and PRE_NUM, are created for RA incidence and prevalence rate respectively. PRE_NUM must include member demographic information like age, gender and race/ethnicity.

/**create new id as the combination of member id and diagnosis year**/

data inc_num_yrid;

set inc_num;

length yearid $20;

yearid=compress(member_id||trim(put(year, $4.)));

run;

/**create format based off the new id**/

proc sql;

create table year_id

as select distinct yearid as start, "*" as label, "$yrid" as fmtname

from inc_num_yrid;

quit;

proc format library=work cntlin=year_id; run;

/**create member year level file for pre_num file**/

data pre_case;

set pre_num;

do i=2004 to 2010;

cntyear=i;

x0=mdy(7,1,i);

/**dx_dt: the date of ra diagnosis**/

if dx_dt<= x0 <= stop_dt then pre=1;

else pre=0;

output;

end;

drop x1;

run;

5

/**using the year id format to flag the incidence case**/

data inc_pre_case;

set pre_case;

length yearid $20;

yearid=compress(member_id||trim(put(cntyear, $4.)));

if put(yearid, $yrid.)="*" then inc=1; else inc=0;

run;

Step 4: Join Numerator and Denominator Files and Create Summary file

Remember when the denominator files was constructed, the total exposure time (person-year) was completely based on the start and end date of member enrollment history. For those who have developed RA, they were no longer at risk of developing RA. In other words, the person-year can only be counted until RA was diagnosed. After that, person-year should be defaulted to zero. To fix this, person-year in the numerator file of incidence was also calculated and was used to replace the person-year in denominator after two files are joined. The following program demonstrates how to achieve this task. At the end of this step, the final data set will be summarized by age group, gender and race. Here I have created the formats for age group and race before submitting the program.

/*join denominator file and numerator file*/

proc sql;

create table deno_num

as select a.*, b.pyear as pyear_num, b.pre, b.inc

from mbr_pyr as a

left join inc_pre_case as b

on a.mrn= b.mrn

and a.year= b.year;

quit;

data deno_num_2 (where=(18 <= age <=100));

set deno_num;

length agegp sex$8 race_gp $25;

/*replace denominator in incident case*/

/*inc is from numerator file and has three values. 1: incidence; 0: post-incidence; .: prevalent case*/

if inc=1 then do;

inc_pyear=pyear_num; /*for incidence density*/

incnt=0; /*for incidence rate*/

end;

else if inc=0 then do;

inc_pyear=0;

incnt=0;

end;

else do;

inc_pyear=pyear;

incnt=1; /*member at risk of disease*/

inc=0;

end;

if pre=. then pre=0; /*numerator for prevalence*/

/**create format first**/

agegp = put(age, agegp.);

race_gp=put(race, $newrace.);

run;

/***summarize the denominator for prevalence and incidence***/

proc summary data=deno_num_2 nway missing;

6

class year sex agegp race_gp;

var pyear inc_pyear count incnt pre inc;

output out=ra.ra_rate (drop=_type_ _freq_)

sum=sum_pyear sum_incpyear sum_cnt sum_icnt num_pre num_inc;

run;

Step 5: Estimate Adjusted Rate and Rate Ratio

To complete this step, we must first examine the dispersion of data, and then decide on the type of distribution to use in model statement. If over-dispersion or under-dispersion is found, a negative binomial model is preferred. In the GENMOD procedure, the LSMEANS statement can linearly combine the fixed effects for parameters used in the model and allow us to estimate the response mean for values of the categorical variables [5]. It is used to derive adjusted rate in our example. The ESTIMATE statement is used to estimate the ratio of the two rates.

/*test to determine the distribution type*/

/*note: if deviance or pearson chi-square divided by the degrees of freedom is >1, set dist=nb*/

data rate;

set ra.ra_rate;

logpyr=log(sum_pyear);

incnt = num_inc*1000;

run;

proc genmod data=rate;

class age_gp race year;

model incnt = age_gp race year

/offset=logpyr dist=poisson /*nb*/

link=log type3;

run;

proc genmod data=rate;

title 'adjusted rate with negative binomial distribution';

class age_gp race year;

model incnt = age_gp race year/offset=logpyr dist=nb link=log type3;

/*estimate the adjusted rate*/

lsmeans age_gp race year/ilink cl means;

/*estimate the adjusted rate ratio in age group*/

estimate '35-44 vs. 18-34' age_gp 1;

estimate '45-54 vs. 18-34' age_gp 0 1;

estimate '55-64 vs. 18-34' age_gp 0 0 1 ;

estimate '65+ vs. 18-34' age_gp 0 0 0 1 ;

run;

CONCLUSION

Our paper demonstrated the logic and the SAS programming technique to calculate crude rates, adjusted rates and adjusted rate ratio in a dynamic population of a managed care organization. By sharing this experience, in addition to establishing a standard method, the ultimate goal is to provide an accurate measurement of IR and PR, which can benefit our public health in providing guidance for which diseases are most popular and what health problems deserve priority.

REFERENCES

1. Rothman, K.J., Epidemiology: An introduction. 2002. 2. Alamanos, Y., P.V. Voulgari, and A.A. Drosos, Incidence and prevalence of rheumatoid arthritis, based on

the 1987 American College of Rheumatology criteria: a systematic review. Seminars in arthritis and rheumatism, 2006. 36(3): p. 182-8.

3. HEDIS 2011 Measures, 2011, National Committee for Quality Assurance.

7

4. Peter McCullagh, J.A.N., Generalized Linear Models. Second ed1989: Longon: Chapman and Hall. 5. Kathleen Kiernan, R.T., Phil Gibbs, and Jill Tao, CONTRAST and ESTIMATE Statements Made Easy: The

LSMESTIMATE Statement. SAS Global Forum 2011, 2011.

ACKNOWLEDGMENTS

The author would like to thank Aniket A. Kawatakar for his support, Wansu Chen and Jeff M Slezak for their helpful

discussions, Brenda Beaty for her review and Fagen Xie for his contribution of the DCM_ENROLL_COMBINE SAS

MACRO.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Name: Li-Hao Chu Enterprise: Kaiser Permanente Southern California Address: 100 S Los Robles, Pasadena, CA 91101 Work Phone: (626)564-3980 Fax: (626)564-3403 E-mail: [email protected]

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.

8

APPENDIX 1

%macro dcm_enroll_combine (mbrid=, strt=, stop=, indata=,outdata=,gap=);

proc sort data=&indata(keep=&mbrid &strt &stop) out=tmp nodupkey;

by &mbrid &strt &stop;

where &strt ne . or &stop ne .;

run;

data &outdata (keep=&mbrid strt_f stop_f

rename=(strt_f=&strt stop_f=&stop));

retain prev_stop strt_f stop_f;

set tmp;

by &mbrid;

if first.&mbrid. then do;

strt_f=&strt;

stop_f=&stop;

prev_stop=&stop;

end;

if (&strt>(prev_stop+&gap+1)) then do;

stop_f=prev_stop;

output;

strt_f=&strt;

end;

if (&stop > prev_stop) then prev_stop=&stop;

if last.&mbrid. then do;

stop_f=prev_stop;

output;

end;

format strt_f stop_f yymmdd10.;

run;

%mend dcm_enroll_combine;

Documents

Using SAS® to Calculate Incidence and Prevalence Rates in ... · 1 Using SAS® to Calculate Incidence and Prevalence Rates in a Dynamic Population Li-Hao Chu, Kaiser Permanente,