
Page 1: Analysis of Large “Population-based” Databases for Clinical Research John Kwagyan, PhD jkwagyan@howard.edu Design, Biostatistics & Population Studies

Analysis of Large “Population-based” Databases for Clinical Research

John Kwagyan, PhD

jkwagyan@howard.edu

Design, Biostatistics & Population Studies

Georgetown-Howard Center Clinical Translational Science

...

That we are in the midst of crisis is now well understood. Our nation is at war, ... Our economy is badly weakened, ... Homes have been lost; jobs shed; businesses shuttered. Our health care is too costly; our schools fail too many; ...

These are the indicators of crisis, subject to data and statistics.

...

Pres. Barack Obama (Inaugural speech)

Sequence of Steps in a Research Project

• Conceptualization

• Planning/Design

• Execution

- Data Collection & Processing

- Data Analysis

• Interpretation

• Reporting

- Abstracts, Presentation, Publication

Outline

• Types, Uses & Opportunities

• National & Institutional Databases

• Access

• Analysis & Statistical Issues

Types, Uses & Opportunities

Types of Large Databases

• (Health) Survey Databases

NHANES

• (Health) Administrative Databases

HCUP

Discharge & Mortality Databases

Specialty Databases, e.g. stroke

• Clinical trials

AASK, ALLHAT

Uses of Large Databases

• Secondary Analysis

~ publications

• Pilot Data for grant proposals

• Power Exploration

• Hypothesis Generation & Testing

• Estimate of Summary Statistics

-prevalence, incidence, mortality, etc

Advantages of Using Large Databases

• Large sample

• Fast & easily accessible (some)

• Provide population estimates

• Can test trends over time

• Observational, cross-sectional, longitudinal

Limitations & Challenges

• Non-experimental (survey & administrative)

• Most are cross-sectional

• May require special skills

- special statistical techniques & software usage

• Statistical issues to address

• May involve long bureaucracy

- written request or proposal

- IRB approval

• May involve a fee & travel

Funding Opportunities: Secondary Analysis

R03, R21 mechanisms

• Obtain data collected by the parent study or by ancillary studies to prepare a scientific manuscript for publication on a topic (aims) that has not yet been addressed.

• Receive limited preliminary study data summaries to prepare a proposal for funding of secondary analyses of data.

• Obtain specimens (e.g. blood, urine, imaging scans) for new assays or analyses to be conducted using an outside funding source.

Nces.ed.org/nationsreportcard/researchcenter/funding.asp

Funding Opportunities

National Databases

National Health & Nutrition Examination Survey (NHANES): www.cdc.org/nchs/nhanes.html

• Population: Adults & Children

• Method: Face-to-Face Interview, Physical Exams

• Content: Anthropometry, respiratory disease, chronic & infectious disease, mental health & cognitive functioning, reproductive history & sexual behavior

• Data: N ~5,000/yr since 1999; initiated in 1960

• Notes: Supplemental food survey, online tutorial

National Health Interview Survey (NHIS): www.cdc.org/nchs/nhis.html

• Population: Households (Families), Adults & Children

• Method: Face-to-Face Interview, Physical Exams

• Content: Health conditions & behaviors; access to & use of health services; genetic testing

• Data: N ~35,000 households (~87,500 persons); initiated in 1957

• Notes: Data used widely by the DHHS to monitor trends in illness and disability and to track progress toward achieving national health objectives.

Surveillance, Epidemiology, and End Results (SEER): http://seer.cancer.org

• Population: Children to Adults

• Method: Data collected from cancer registries that cover ~28% of the US population; follow-up with individual cases until death

• Content: Cancer incidence, prevalence, and survival data; limited demographics (age, race/ethnicity, region)

• Data: Cancer cases in registries, >6 million cases

• Notes: Specialized software (SEER*Stat or SEER*Prep), downloaded from the website, is needed for analysis; a user agreement must be signed to obtain the data.

Healthcare Cost & Utilization Project (HCUP): http://www.ahrq.org/data/hcup

• Population: All ages

• Method: A family of healthcare databases and tools

• Content: Databases enable research on a broad range of health policy issues, including cost and quality of health services, medical practice patterns, access to health care programs, and outcomes of treatments.

• Data: Cancer cases in registries

• Notes: Databases are available for purchase through a central distributor

African American Study of Kidney Disease & Hypertension (AASK): www.niddkrepository.org/

• Population: Adult African Americans, 18-70 years

• Method: Participants followed for 2 years to measure the long-term effects of blood pressure control in patients with kidney disease attributed to high blood pressure.

• Content: BP, markers of kidney function

• Data: N = 1,094

• Notes: Largest and longest study of chronic kidney disease in African Americans

CDC Wonder wonder.cdc.org

• Wide-ranging Online Health related Datasets for Epidemiologic Research

• Each data set can be queried using a series of menus

• Provides an online tool for retrieving and analyzing data

CDC Wonder

Institutional (GHUCCTS) Databases

• Obesity Project - HU

• Family Genetics Study of Prostate Cancer - HU

• HIV in DC - HU

• Memory Disorder Study - HU

• Spinal Cord Disease Database - MRI

• Stroke Database - MRI/NRH

• Brain Injury Database - MRI/NRH

• National Capital Spinal Cord Injury Model System - MRI/NRH

• Strong Heart Study - MRI

• The VA Decision Support System Database (DSS) - VA

• ……..

• ………

Access/Retrieval

Data Access/Retrieval

• May require special request or proposals

- aims, etc.

- preparation of detailed analysis plans

• Understand the database structure

• Extraction of requisite data for specific objectives

• Application of appropriate linkage techniques for multiple data sources

• Processing & storage
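The linkage step listed on this slide can be sketched as a key-based join. The following is a minimal illustration in Python, assuming two hypothetical sources (a demographics file and a lab file) that share a patient identifier `pid`. The field names and values are invented for illustration and do not come from any database named in these slides. Real linkage often needs probabilistic matching when no clean shared key exists.

```python
# Hypothetical records from two sources that share the key "pid".
demographics = [
    {"pid": 100, "age": 45, "gender": "male"},
    {"pid": 101, "age": 56, "gender": "female"},
    {"pid": 102, "age": 67, "gender": "female"},
]
labs = [
    {"pid": 100, "ldl": 130},
    {"pid": 102, "ldl": 155},
    {"pid": 999, "ldl": 110},  # no matching demographic record
]


def link_records(left, right, key):
    """Exact (deterministic) inner join; assumes key is unique in `right`."""
    right_by_key = {row[key]: row for row in right}
    linked = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            linked.append({**row, **match})
    return linked


linked = link_records(demographics, labs, "pid")
# Lab records with no demographic match should be reviewed, not silently dropped.
unmatched = sorted({r["pid"] for r in labs} - {r["pid"] for r in demographics})
print(linked)
print("unmatched lab records:", unmatched)
```

Reporting the unmatched keys alongside the linked file makes it easier to judge whether linkage failures could bias the analysis.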

Database Structure

• Relational Structure: (1-to-1)

represented by a table of rows & columns

~ attributes are listed in columns: ID, AGE, GENDER, ...

~ unique identifiers

• Hierarchical (Nested) Structure: (1-to-many)

allows for multiplicity of attributes while preserving relationships

Data Structure

RELATIONAL

PID   Age   gender   disease_status
100   45    Male     0
101   56    female   1
102   67    female   0

HIERARCHICAL/NESTED

WARD   PID
1      100
1      101
1      102
2      100
2      101
2      102

WARD   FAMID   PID
1      1       100
1      1       101
1      2       100
1      2       101
2      1       100
2      1       101
2      2       100
2      2       101
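The two layouts on this slide can be mirrored in code. The following is a minimal Python sketch using the slide's own example values. The variable names (`persons`, `nested_rows`, `tree`) are illustrative assumptions, not part of any database's schema.

```python
from collections import defaultdict

# Relational (1-to-1): one row per person, keyed by a unique PID.
persons = {
    100: {"age": 45, "gender": "male", "disease_status": 0},
    101: {"age": 56, "gender": "female", "disease_status": 1},
    102: {"age": 67, "gender": "female", "disease_status": 0},
}

# Hierarchical (1-to-many): PIDs repeat across wards and families,
# so a person is identified only by the full (ward, famid, pid) path.
nested_rows = [
    (1, 1, 100), (1, 1, 101), (1, 2, 100), (1, 2, 101),
    (2, 1, 100), (2, 1, 101), (2, 2, 100), (2, 2, 101),
]

# Build ward -> family -> list of persons, preserving the nesting.
tree = defaultdict(lambda: defaultdict(list))
for ward, famid, pid in nested_rows:
    tree[ward][famid].append(pid)

for ward, families in tree.items():
    for famid, pids in families.items():
        print(f"ward {ward} / family {famid}: {pids}")
```

Keeping the nesting explicit matters later, because persons in the same family or ward are likely to be correlated.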

Data Analysis Methods

Types of Data Endpoints

• Continuous Data - BP, BMI, TC, LDL, HDL, Blood Sugar

• Categorical Data - Hypertension, Obesity, Dyslipidemia, Diabetes

• Count Data - 0, 1, 2, 3

• Survival (Time-to-Event) Data - time-to-cardiac event, time-to-death

Partition Data Into Subsets

Core partitioning ~ arises naturally

• Race

• Gender

• Age Group

• Geographic Region

Time partitioning

• 2000-2010

• 1995-2000; 2000-05

Descriptive Analysis By Partition

• Measures of Central Tendency: Mean, Median, Mode, etc.

• Rates: Prevalence, Incidence, Survival, Mortality

• Variability: SD, range, IQR
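Descriptive summaries by partition can be sketched with the Python standard library. The example below assumes a small hypothetical data set of (gender, systolic BP, hypertension flag) records; the values are invented. It is also unweighted, so for survey data it would only be a first look before applying statistical weights.

```python
import statistics
from collections import defaultdict

# Hypothetical records: (gender, systolic BP in mmHg, hypertensive 0/1).
records = [
    ("female", 118, 0), ("female", 142, 1), ("female", 126, 0),
    ("male", 135, 0), ("male", 150, 1), ("male", 128, 0), ("male", 160, 1),
]

# Core partition: split the records by gender.
by_group = defaultdict(list)
for gender, sbp, htn in records:
    by_group[gender].append((sbp, htn))

summary = {}
for gender, rows in by_group.items():
    sbp = [r[0] for r in rows]
    summary[gender] = {
        "n": len(rows),
        "mean_sbp": statistics.mean(sbp),        # central tendency
        "median_sbp": statistics.median(sbp),
        "sd_sbp": statistics.stdev(sbp),         # sample SD
        "range_sbp": max(sbp) - min(sbp),
        "prevalence_htn": sum(r[1] for r in rows) / len(rows),  # rate
    }

for gender, stats in summary.items():
    print(gender, stats)
```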

Visualization Methods: Exploratory Analysis

Apply visualization methods by subsets

• Charts

• Scatter plot matrix ~ continuous measures

• Trellis plot ~ all measures

Trellis Plot

Inference: Statistical Tests

The method used depends on

1. Outcome measure

- Univariate

- Multivariate

2. Study design

Continuous Data

Parametric Tests

• Paired t-test ~ non-comparative open-label studies (pre-post studies)

• Two-sample t-test ~ comparative studies (e.g. parallel-group designs)

• ANOVA (F-test) ~ comparing multiple groups (e.g. parallel-group designs, factorial designs)

Non-Parametric Equivalents

• Wilcoxon Signed Rank Test

• Wilcoxon Rank Sum Test

• Kruskal-Wallis Test
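As an illustration of the two-sample comparison, the following Python sketch computes a t statistic by hand. It uses Welch's unequal-variance form, which is a swapped-in variant rather than necessarily the pooled-variance test the slide has in mind, and the cholesterol values are hypothetical. A p-value would require the t distribution, which a statistics package provides.

```python
import math
import statistics


def welch_t(x, y):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)  # sample variances
    se2 = vx / nx + vy / ny  # squared standard error of the mean difference
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df


# Hypothetical total cholesterol (mg/dL) in two parallel groups.
treated = [180, 195, 170, 188, 176]
control = [210, 198, 225, 205, 215]
t, df = welch_t(treated, control)
print(f"t = {t:.2f}, df = {df:.1f}")
```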

Categorical Data

What is the question?

Compare rates:

prevalence, incidence, mortality!

• Chi-square Test

• McNemar Test (pre-post designs)

• Mantel-Haenszel Test - heterogeneity
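For a 2x2 table, the chi-square statistic has a closed form that is easy to check by hand. The sketch below is a minimal Python illustration with hypothetical counts. It omits the continuity correction and compares the result with 3.841, the 5% critical value of the chi-square distribution with 1 degree of freedom.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the
    table [[a, b], [c, d]]; it has 1 degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Hypothetical counts: rows = group, columns = hypertensive yes / no.
stat = chi_square_2x2(30, 70, 45, 55)
print(f"chi-square = {stat:.3f}")
print("significant at 5%:", stat > 3.841)
```

Here the prevalences being compared are 30/100 and 45/100.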

Survival Data

Question? Compare survival rates!

Survival curves, hazard ratios

• Kaplan-Meier Estimator

• Log-Rank Test

• Likelihood Ratio Test
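The Kaplan-Meier estimator multiplies, at each event time, the fraction of at-risk subjects who survive that time. A minimal Python sketch follows, assuming hypothetical follow-up times in months and the usual convention that subjects censored at time t are still at risk at t. The function name and data are illustrative only.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  - follow-up time for each subject
    events - 1 if the event was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set after t.
        at_risk -= removed
    return curve


# Hypothetical follow-up in months; 0 marks a censored subject.
times = [2, 3, 3, 5, 8, 8, 12]
events = [1, 1, 0, 1, 1, 0, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t:>2}  S(t) = {s:.3f}")
```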

Regression Methods

• Used when it is necessary to adjust for different covariate/confounding effects

Cholesterol level ~ gender, age, diet

Regression Methods

• Continuous Data

~ Linear Regression Models

• Categorical Data

~ Logistic Regression Models

~ Conditional Regression Models

• Survival Data

~ Proportional Hazards Regression

Multi-Level Models: Hierarchical (Nested) Models

• Multilevel Regression

• Mixed Effect Models

• Nested Models

- GEE

- Proc Nested

• Bayesian Approaches

Multivariable Methods

Used to analyze multiple outcomes jointly

Univariate: TC ~ gender, age, diet

Multivariable: [HDL, LDL, TG] ~ gender, age, diet

(gender, age, diet: risk factors)

Multivariable Methods

• MANOVA

• Discriminant Analysis

• Factor Analysis

• Cluster Analysis

• Principal Component Analysis

Statistical Issues

• Sampling error

• Missing data

• High likelihood of finding a significant difference due to chance alone

• Substantial potential for biased results
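The "chance alone" point can be made concrete. If k independent tests are each run at level alpha, the probability of at least one false positive is 1 - (1 - alpha)^k. The Python sketch below assumes independent tests, which is a simplification, since tests on the same database are usually correlated.

```python
def familywise_error(alpha, k):
    """P(at least one false positive) across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k


# How quickly the chance of a spurious finding grows with the number of tests.
for k in (1, 5, 20, 100):
    print(f"{k:>3} tests: {familywise_error(0.05, k):.3f}")
```

With 20 subgroup comparisons at the 5% level, a spurious "significant" result is more likely than not.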

Recommendations for Health Survey Data

Use ~

• Statistical weights

• Stratification

• Clustering

• Variance estimation

Use of Statistical Weights

• The statistical weight of a sampled person is the number of people in the population that the person represents.

• If the sampling rate is 1/1000:

- Each sampled person represents 1000 people

- Each sampled person would have a sample weight of 1000

• Weights are derived from:

- selection probabilities

- response rates

- post-stratification adjustments (e.g. gender, education, etc.)
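The weight definition above translates directly into a weighted estimate. The following Python sketch assumes a hypothetical sample with two selection probabilities (1/1000 and 1/500) and uses base weights only. The response-rate and post-stratification adjustments mentioned on the slide are not shown.

```python
# Hypothetical sample: (selection probability, has_condition 0/1).
sample = [
    (1 / 1000, 1),
    (1 / 1000, 0),
    (1 / 500, 1),
    (1 / 500, 0),
    (1 / 500, 0),
]

# Base weight = 1 / selection probability: the number of people represented.
weights = [1 / p for p, _ in sample]

population_represented = sum(weights)
weighted_cases = sum(w * y for w, (_, y) in zip(weights, sample))

unweighted_prev = sum(y for _, y in sample) / len(sample)
weighted_prev = weighted_cases / population_represented

print(f"population represented: {population_represented:.0f}")
print(f"unweighted prevalence:  {unweighted_prev:.3f}")
print(f"weighted prevalence:    {weighted_prev:.3f}")
```

The two prevalences differ because the people sampled at the higher rate are over-represented in the raw sample.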

Stratification

• Population divided before sampling into disjoint, exhaustive groups (strata)

- Members termed primary sampling units (PSUs)

- Independent samples are taken in each stratum

• Strata formed by similar demographic areas
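A stratified estimate combines the per-stratum results in proportion to each stratum's share of the population. The Python sketch below assumes two hypothetical strata with known population sizes and a simple random sample of systolic BP within each; all numbers are invented.

```python
# Hypothetical strata: population size N and a simple random sample from each.
strata = {
    "urban": {"N": 8000, "sample": [120, 130, 125, 135]},
    "rural": {"N": 2000, "sample": [140, 150, 145]},
}

N_total = sum(s["N"] for s in strata.values())

# Stratified mean: each stratum mean weighted by its population share N_h / N.
stratified_mean = sum(
    s["N"] / N_total * (sum(s["sample"]) / len(s["sample"]))
    for s in strata.values()
)

# The naive pooled mean ignores that rural is over-represented in the sample
# (3 of 7 observations versus 20% of the population).
pooled = [y for s in strata.values() for y in s["sample"]]
pooled_mean = sum(pooled) / len(pooled)

print(f"stratified mean: {stratified_mean:.1f}")
print(f"pooled mean:     {pooled_mean:.1f}")
```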

Clustering: Hierarchical (Nested) Data

• Persons residing in a small area (cluster) may have similar characteristics

• Responses of subjects in clusters may be correlated

• Dependence between subjects leads to inflated variance

• Correlation must be accounted for in the analysis
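The variance inflation from clustering is often summarized by the design effect. For equal-sized clusters of m subjects with intraclass correlation rho, a standard approximation is deff = 1 + (m - 1) * rho. The Python sketch below uses hypothetical values for n, m and rho to show how much the effective sample size shrinks.

```python
def design_effect(cluster_size, icc):
    """Approximate variance inflation for equal-sized clusters."""
    return 1 + (cluster_size - 1) * icc


n = 2000     # hypothetical total sample size
m = 20       # hypothetical respondents per cluster
icc = 0.05   # hypothetical intraclass correlation

deff = design_effect(m, icc)
effective_n = n / deff  # size of a simple random sample with equal precision

print(f"design effect: {deff:.2f}")
print(f"effective sample size: {effective_n:.0f} of {n}")
```

Even a small within-cluster correlation roughly halves the effective sample size here, which is why analyses that ignore clustering report standard errors that are too small.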

Variance Estimation

Use appropriate variance estimation methods:

• Linearization: uses a Taylor series expansion to estimate the variance of non-linear estimators. Default method for most stats programs.

• Replication methods: calculate different parameter estimates for each replicate and combine these to estimate variance. Jackknife, etc.
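The replication idea can be sketched with a delete-one jackknife. The Python example below is a simplified illustration on hypothetical, independent BMI values. Survey software instead drops whole PSUs within strata and reweights, which this sketch does not attempt.

```python
import math
import statistics


def jackknife_se(values, estimator):
    """Delete-one jackknife standard error of estimator(values)."""
    n = len(values)
    # One replicate estimate per left-out observation.
    replicates = [estimator(values[:i] + values[i + 1:]) for i in range(n)]
    mean_rep = sum(replicates) / n
    variance = (n - 1) / n * sum((r - mean_rep) ** 2 for r in replicates)
    return math.sqrt(variance)


bmi = [22.1, 27.4, 31.0, 24.8, 29.3, 26.5]  # hypothetical values
se_mean = jackknife_se(bmi, statistics.mean)
classic = statistics.stdev(bmi) / math.sqrt(len(bmi))

print(f"jackknife SE of the mean: {se_mean:.4f}")
print(f"classic SE (s / sqrt(n)): {classic:.4f}")
```

For the mean, the jackknife reproduces the textbook standard error. Its value lies in applying the same recipe to estimators, such as ratios or medians, that have no simple variance formula.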

Summary

• Fast and easily accessible

• Provides several uses and opportunities

• Large databases will continue to provide important findings for clinical research

• Be mindful of statistical issues

• Use weighting, clustering or stratification when appropriate

Thank you